[Code Notes] Web Scraping with Python (Ch03. Starting to Crawl)
This chapter covers crawling by moving across multiple pages and sites.
These notes collect the code from Web Scraping with Python.
Ch03. Starting to Crawl
3.1 Traversing a Single Domain
- Crawl with a scraper that moves from page to page across a site
- The key ideas are recursion and loops that can run forever
- Always pay close attention to bandwidth and look for ways to reduce the load on the target server (a throttling sketch follows this list)
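A minimal sketch of one way to keep the load down: pause between requests. The helper name and the one-second delay are assumptions for illustration, not from the book.

import time
from urllib.request import urlopen

def polite_urlopen(url, delay=1.0):
    # Sleep before every request so consecutive fetches are spaced out
    # and the target server is never hammered (delay value is arbitrary)
    time.sleep(delay)
    return urlopen(url)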
Retrieving a list of links
- Pull the href attribute out of every a tag
- Problem: all sorts of unwanted links (sidebar, footer, header, etc.) come along with the article links
- Solution: the article links share three properties (see the filter sketch after the output below):
  - They reside within the div whose id is bodyContent
  - The URLs do not contain colons
  - The URLs begin with /wiki/
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])
/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
#cite_note-1
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
http://baconbros.com/
#cite_note-2
#cite_note-actor-3
/wiki/Footloose_(1984_film)
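The three criteria above can be expressed as a single BeautifulSoup call, reusing the bs object from the cell above. This is the same filter that getLinks below is built on, with the regex spelled out in comments.

import re

# ^(/wiki/)  : the href must start with /wiki/
# ((?!:).)*$ : no remaining character may be a colon, which drops
#              namespace pages such as /wiki/Wikipedia:Protection_policy
for link in bs.find('div', {'id': 'bodyContent'}).find_all(
        'a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])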
Random Walk
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed the random number generator
random.seed(datetime.datetime.now())

# Takes a URL of the form /wiki/<article_name> and returns a list of all linked article URLs
def getLinks(articleUrl):
    html = urlopen('http://en.wikipedia.org{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))

# Pick a random link from each page and keep walking
links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)
/wiki/David_Boreanaz
/wiki/Hal_Jordan
/wiki/Legion_of_Super-Villains
/wiki/Lucy_Lane
/wiki/Saturn_Girl
/wiki/The_Lightning_Saga
/wiki/Manhunters_(DC_Comics)
/wiki/Hector_Hammond
/wiki/Anti-Monitor
/wiki/Superman
/wiki/Eradicator_(comics)
/wiki/Morgan_Edge
/wiki/Teen_Titans
/wiki/Gnarrk
/wiki/Garth_(comics)
/wiki/Silas_Stone
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-2-ee18b17d20d4> in <module>
15 newArticle = links[random.randint(0, len(links)-1)].attrs['href']
16 print(newArticle)
---> 17 links = getLinks(newArticle)
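One caveat about the seeding line: passing a datetime object to random.seed() is deprecated since Python 3.9 and may raise a TypeError on newer interpreters, since seed() only accepts None, int, float, str, bytes, or bytearray. A minimal workaround sketch:

import random

# Calling seed() with no argument seeds from the system time or OS entropy,
# which is what the datetime-based seed was aiming for anyway
random.seed()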
3.2 Crawling an Entire Site
- The previous section wandered through a website at random, from link to link.
- Here we instead visit every page on the site in a systematic way.
- This is useful for:
  - Generating a site map: when you have no internal access to a site, a crawler can collect every internal link and build one for you
  - Gathering data: e.g. collecting things like news articles
Recursively crawling an entire site
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')
/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-3-94457f9052fa> in <module>
16 pages.add(newPage)
17 getLinks(newPage)
---> 18 getLinks('')
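Note that this recursive version will eventually hit Python's default recursion limit (roughly 1,000 frames) long before Wikipedia runs out of links. A minimal, assumed iterative rewrite that uses a queue instead of recursion (not from the book):

from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import deque
import re

pages = set()
queue = deque([''])          # start from the site root, like getLinks('')

while queue:
    pageUrl = queue.popleft()
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        href = link.attrs.get('href')
        if href and href not in pages:
            print(href)
            pages.add(href)
            queue.append(href)   # breadth-first instead of depth-first recursion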
Collecting data across an entire site
- Let's build a scraper that collects the page title, the first paragraph, and the link pointing to the edit page
- The first thing to do is look at a few pages on the site and find a pattern
- For now, the scraper just prints the data (a sketch that writes it to CSV instead follows the output below)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        print('This page is missing something! Continuing.')

    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')
Main Page
<p><i><b><a href="/wiki/The_Cabinet_of_Dr._Caligari" title="The Cabinet of Dr. Caligari">The Cabinet of Dr. Caligari</a></b></i> is a German <a href="/wiki/Silent_film" title="Silent film">silent</a> <a href="/wiki/Horror_film" title="Horror film">horror film</a>, first released on 26 February 1920. It was directed by <a href="/wiki/Robert_Wiene" title="Robert Wiene">Robert Wiene</a> and written by <a href="/wiki/Hans_Janowitz" title="Hans Janowitz">Hans Janowitz</a> and <a href="/wiki/Carl_Mayer" title="Carl Mayer">Carl Mayer</a>. Considered the quintessential work of <a href="/wiki/German_Expressionism" title="German Expressionism">German Expressionist</a> cinema, it tells the story of an insane hypnotist (<a href="/wiki/Werner_Krauss" title="Werner Krauss">Werner Krauss</a>) who uses a <a href="/wiki/Sleepwalking" title="Sleepwalking">somnambulist</a> (<a href="/wiki/Conrad_Veidt" title="Conrad Veidt">Conrad Veidt</a>) to commit murders. The film features a dark and twisted visual style. The sets have sharp-pointed forms, oblique and curving lines, and structures that lean and twist in unusual angles. The film's design team, <a href="/wiki/Hermann_Warm" title="Hermann Warm">Hermann Warm</a>, <a href="/wiki/Walter_Reimann" title="Walter Reimann">Walter Reimann</a> and <a href="/wiki/Walter_R%C3%B6hrig" title="Walter Röhrig">Walter Röhrig</a>, recommended a fantastic, graphic style over a naturalistic one. With a violent and insane authority figure as its antagonist, the film expresses the theme of brutal and irrational authority. Considered a classic, it helped draw worldwide attention to the artistic merit of German cinema and had a major influence on American films, particularly in the genres of horror and <a href="/wiki/Film_noir" title="Film noir">film noir</a>. (<b><a href="/wiki/The_Cabinet_of_Dr._Caligari" title="The Cabinet of Dr. Caligari">Full article...</a></b>)
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia
Wikipedia
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Protection_policy#semi
Wikipedia:Protection policy
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Requests_for_page_protection
Wikipedia:Requests for page protection
<p>This page is for requesting that a page, file or template be <strong>fully protected</strong>, <strong>create protected</strong> (<a href="/wiki/Wikipedia:Protection_policy#Creation_protection" title="Wikipedia:Protection policy">salted</a>), <strong>extended confirmed protected</strong>, <strong>semi-protected</strong>, added to <strong>pending changes</strong>, <strong>move-protected</strong>, <strong>template protected</strong>, <strong>upload protected</strong> (file-specific), or <strong>unprotected</strong>. Please read up on the <a href="/wiki/Wikipedia:Protection_policy" title="Wikipedia:Protection policy">protection policy</a>. Full protection is used to stop edit warring between multiple users or to prevent vandalism to <a href="/wiki/Wikipedia:High-risk_templates" title="Wikipedia:High-risk templates">high-risk templates</a>; semi-protection and pending changes are usually used only to prevent IP and new user vandalism (see the <a href="/wiki/Wikipedia:Rough_guide_to_semi-protection" title="Wikipedia:Rough guide to semi-protection">rough guide to semi-protection</a>); and move protection is used to stop <a href="/wiki/Wikipedia:Page-move_war" title="Wikipedia:Page-move war">page-move wars</a>. Extended confirmed protection is used where semi-protection has proved insufficient (see the <a href="/wiki/Wikipedia:Rough_guide_to_extended_confirmed_protection" title="Wikipedia:Rough guide to extended confirmed protection">rough guide to extended confirmed protection</a>)
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Protection_policy#move
Wikipedia:Protection policy
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Lists_of_protected_pages
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-4-ac53a9af3c4b> in <module>
24 pages.add(newPage)
25 getLinks(newPage)
---> 26 getLinks('')
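Printing is fine while you are discovering the pattern, but in practice you would store the fields. A minimal sketch that writes one article's fields to a CSV file; the file name, column layout, and helper name are assumptions for illustration:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_article(pageUrl, writer):
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    title = bs.h1.get_text()
    first_para = bs.find(id='mw-content-text').find_all('p')[0].get_text()
    # Write one row per article instead of printing to the console
    writer.writerow([pageUrl, title, first_para])

with open('articles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title', 'first_paragraph'])
    scrape_article('/wiki/Kevin_Bacon', writer)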
3.3 Crawling Across the Internet
- This time, follow external links as well
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

pages = set()
random.seed(datetime.datetime.now())

# Retrieves a list of all internal links found on a page
def getInternalLinks(bs, includeUrl):
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme, urlparse(includeUrl).netloc)
    internalLinks = []
    # Finds all links that begin with a "/"
    for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                if link.attrs['href'].startswith('/'):
                    internalLinks.append(includeUrl+link.attrs['href'])
                else:
                    internalLinks.append(link.attrs['href'])
    return internalLinks

# Retrieves a list of all external links found on a page
def getExternalLinks(bs, excludeUrl):
    externalLinks = []
    # Finds all links that start with "http" that do
    # not contain the current URL
    for link in bs.find_all('a', href=re.compile('^(http|www)((?!'+excludeUrl+').)*$')):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks

def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bs = BeautifulSoup(html, 'html.parser')
    externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print('No external links, looking around the site for one')
        domain = '{}://{}'.format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks)-1)])
    else:
        return externalLinks[random.randint(0, len(externalLinks)-1)]

def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print('Random external link is: {}'.format(externalLink))
    followExternalOnly(externalLink)

followExternalOnly('http://oreilly.com')
Random external link is: https://oreilly.formulated.by/scme-miami-2020/
Random external link is: https://www.linkedin.com/in/keith-carswell-71aa5299/
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-11-a23b136d3234> in <module>
52 followExternalOnly(externalLink)
53
---> 54 followExternalOnly('http://oreilly.com')
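As the HTTPError traceback shows, a randomly chosen external link will sooner or later point at a page that returns an error or refuses the connection. A minimal sketch of one way to end the walk gracefully instead of crashing; the try/except wrapper is an assumption, not part of the book's code:

from urllib.error import HTTPError, URLError

def followExternalOnly(startingSite):
    try:
        externalLink = getRandomExternalLink(startingSite)
    except (HTTPError, URLError) as e:
        # Dead or unreachable link: report it and end the walk here
        print('Could not reach {}: {}'.format(startingSite, e))
        return
    print('Random external link is: {}'.format(externalLink))
    followExternalOnly(externalLink)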
Collects a list of all external URLs found on the site
- This time, store every link we find
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteUrl):
    html = urlopen(siteUrl)
    domain = '{}://{}'.format(urlparse(siteUrl).scheme, urlparse(siteUrl).netloc)
    bs = BeautifulSoup(html, 'html.parser')
    internalLinks = getInternalLinks(bs, domain)
    externalLinks = getExternalLinks(bs, domain)
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.add(link)
            print(link)
    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.add(link)
            getAllExternalLinks(link)

allIntLinks.add('http://oreilly.com')
getAllExternalLinks('http://oreilly.com')
https://www.oreilly.com
https://www.oreilly.com/sign-in.html
https://www.oreilly.com/online-learning/try-now.html
https://www.oreilly.com/online-learning/index.html
https://www.oreilly.com/online-learning/individuals.html
https://www.oreilly.com/online-learning/teams.html
https://www.oreilly.com/online-learning/enterprise.html
https://www.oreilly.com/online-learning/government.html
https://www.oreilly.com/online-learning/academic.html
https://www.oreilly.com/online-learning/features.html
https://www.oreilly.com/online-learning/custom-services.html
https://www.oreilly.com/online-learning/pricing.html
https://www.oreilly.com/conferences/
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-12-fe3fc94e1803> in <module>
23
24 allIntLinks.add('http://oreilly.com')
---> 25 getAllExternalLinks('http://oreilly.com')
3.4 Crawling with Scrapy
- There is also a library called Scrapy that takes much of this repetitive work off your hands (a minimal spider sketch follows)
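For reference, a minimal sketch of what a Scrapy spider looks like, assuming Scrapy is installed (pip install scrapy); the spider name and selector are illustrative assumptions rather than the book's exact example:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article'
    start_urls = ['http://en.wikipedia.org/wiki/Kevin_Bacon']

    def parse(self, response):
        # Pull the article title out of the h1 heading
        title = response.xpath('//h1//text()').get()
        print('Title is: {}'.format(title))

Saved as article_spider.py, this can be run with scrapy runspider article_spider.py.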