웹 크롤링 문법
urlretrieve, urlopen, urlparse, urlencode

urlretrieve

import urllib.request

imgUrl = "http://imgnews.naver.net/image/002/2017/03/13/0002026985_001_20170313153101670.jpg"
savePath1 = "c:/Users/yeajunseok/Documents/Atom/WebCrawling_python/session2/test1.jpg"

urllib.request.urlretrieve(imgUrl, savePath1)

print("다운로드 완료!")

urlretrieve
내가 정한 경로에 이미지를 다이렉트로 다운받는다.
‘파싱’이 필요없는 데이터(이미지, 엑셀, 여러문서 한번에 다운받을때)는 urlretrieve를 사용한다.

urlopen

import urllib.request

imgUrl = "http://imgnews.naver.net/image/002/2017/03/13/0002026985_001_20170313153101670.jpg"
----------------------------------
# 방법 <1>, close를 해줘야 한다.
f = urllib.request.urlopen(imgUrl).read() #우리의 하드디스크에 저장되기 전에 변수에(메모리)저장 시킨다.

savefile1 = open(savePath1, 'wb') #w : write, r : read, a : add / b: 바이너리로
savefile1.write(f)
savefile1.close()
----------------------------------
# 방법 <2>, 위 방법보다 with문을 사용하는것이 좋다. 왜냐하면 with문은 자동으로 close를 해주기 때문이다.
f2 = urllib.request.urlopen(htmlUrl).read()

with open(savePath2, 'wb') as savefile2:
    savefile2.write(f2)

print("다운로드 완료!")

urlopen
변수(메모리)에 먼저 할당하고 -> 파싱 -> 저장
중간에서 필요로하는 것을 분석할때 urlopen을 사용하자.

urlretrieve과 urlopen의 차이점!!! 중요함!!
urlopen은
변수(메모리)에 먼저 할당하고 -> 파싱 -> 저장 urlretrieve는
바로 다이렉트로 다운받는다.(주소를 내가 정한 경로에 바로 저장한다)

파싱이 필요없는 데이터(이미지, 엑셀, 여러문서 한번에 다운받때)는 urlretrieve를 사용하자.
중간에서 필요로하는 것을 분석할때 urlopen을 사용하자.

다양한 urllib.request.urlopen(url).???()

다양한 urllib.request.urlopen(url).???()의 결과값들
geturl(), getcode(), status, info(), getheaders(), read(), read(숫자)

url = "http://www.encar.com"
mem = urllib.request.urlopen(url)

print(mem)
#결과값: <http.client.HTTPResponse object at 0x00000122130F1518>
print(type(mem))
#결과값: <class 'http.client.HTTPResponse'>
print("geturl : ", mem.geturl())
#결과값: geturl :  http://www.encar.com/index.do
---
print("code : ", mem.getcode())
print("status : ", mem.status)
#결과: 200, 200은 정산, 404은 없음, 403은 거절당함, 500은 서버애러.
---
print("info : ", mem.info())
print("headers : ", mem.getheaders())
#리스트형태로 나온다.
---
print("read : ", mem.read())
print("read : ", mem.read(10)) #숫자만큼만 읽어온다.
print("read : ", mem.read(50).decode("utf-8")) #euc-kr

urlparse

from urllib.parse import urlparse #urllib.parse 중에서 urlparse를 갖고온다.
print(urlparse("www.naver.com"))
#결과값: ParseResult(scheme='', netloc='', path='www.naver.com', params='', query='', fragment='')

from urllib.parse import urlparse
의 의미는 urllib.parse 중에서 urlparse를 갖고온다.

urlencode

import urllib.request
from urllib.parse import urlencode

API = "https://api.ipify.org"
values = {
    'format':'json'
}
print('이전:',values)
#결과값: {'format': 'json'}
params = urlencode(values)
print('이후:',params)
#결과값: format=json

url = API + "?" + params
print("요청 url:",url)
#결과값: 요청 url: https://api.ipify.org?format=json

reqData = urllib.request.urlopen(url).read().decode('utf-8')
print("결과값:",reqData)
#결과값: {"ip":"59.29.245.38"}

Twitter Facebook LinkedIn

웹 크롤링_urllib

Yea JunSeok

urlretrieve

urlopen

다양한 urllib.request.urlopen(url).???()

urlparse

urlencode

공유하기

참고

DevOps_2_webserver

DevOps_1_

파이썬 프로그래밍_5_try except

파이썬 프로그래밍_10_Pandas_DataFrame