[python, BeautifulSoup] An Introduction to Python's BeautifulSoup

What kind of library is BeautifulSoup?

BeautifulSoup is a Python library for parsing HTML and XML files. It is mainly used for web scraping and is very useful for extracting data from complex HTML structures. Because it supports several parsers, it offers flexible and powerful HTML and XML processing.

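Installation

BeautifulSoup can be installed with pip. The package name is beautifulsoup4; lxml and html5lib are optional third-party parsers and are only needed if you plan to use them.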

pip install beautifulsoup4
pip install lxml
pip install html5lib

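Key features

Parser

BeautifulSoup supports several kinds of parsers. Each has its own strengths and weaknesses, so you can pick the one that fits your situation.

html.parser: The HTML parser included in Python's standard library. It is simple, reasonably fast, and needs no extra install, but it may not fully support some HTML5 features.
lxml: A very fast and powerful HTML and XML parser. It offers more features and is the fastest choice in most cases.
html5lib: A parser that fully conforms to the HTML5 specification. It is the slowest but the most lenient, and it is robust when handling malformed HTML.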

from bs4 import BeautifulSoup

html_doc = "<html><head><title>Example</title></head><body><p>Some content.</p></body></html>"

# Use the built-in parser
soup = BeautifulSoup(html_doc, 'html.parser')

# Use the lxml parser
soup = BeautifulSoup(html_doc, 'lxml')

# Use the html5lib parser
soup = BeautifulSoup(html_doc, 'html5lib')
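The difference between parsers shows up most clearly on malformed markup. The sketch below feeds the same broken snippet to each parser and prints the repaired result. The snippet and the `results` dictionary are illustrative names, and lxml and html5lib are skipped when they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup: a stray closing </p> with no opening tag.
broken_html = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # lxml and html5lib are optional installs; record None if missing.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} -> {output}")
```

Each parser repairs the input in its own way, so a scraper that depends on exact document structure should name its parser explicitly rather than rely on whichever one happens to be installed.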

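Navigating and manipulating the document

With BeautifulSoup you can navigate and manipulate the DOM (Document Object Model) structure.

Let's look at the main methods and features.

find() and find_all()

find(): Returns the first tag that matches the condition, or None if nothing matches.
find_all(): Returns all tags that match the condition as a list.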

html_doc = """
<html>
<head>
  <title>The Dormouse's story</title>
</head>
<body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.</p>
  <p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first <a> tag.
a_tag = soup.find('a')
print(a_tag)

# Find all <a> tags.
a_tags = soup.find_all('a')
for tag in a_tags:
    print(tag)
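Both methods also accept filters beyond the tag name. As a minimal sketch on a shortened version of the sample document, the following narrows a search by class, by attribute, and by a result limit. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Filter by CSS class; the argument is class_ because class is a Python keyword.
sisters = soup.find_all("a", class_="sister")

# Filter by an attribute with a keyword argument or with the attrs dict.
lacie = soup.find("a", id="link2")
tillie = soup.find("a", attrs={"href": "http://example.com/tillie"})

# Stop after the first two matches.
first_two = soup.find_all("a", limit=2)

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("table")

print([a.get_text() for a in sisters])
print(lacie.get_text(), tillie.get_text())
print(len(first_two), missing)
```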

select()

You can find tags with CSS selectors, which allows more precise searches.

# Find all <a> tags.
a_tags = soup.select('a')

# Find the tag whose id is link1 (select() always returns a list).
link1 = soup.select('#link1')

# Find all tags whose class is sister.
sisters = soup.select('.sister')
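When only one element is expected, select_one() returns a single tag (or None) instead of a list. The sketch below also combines selectors to show the kind of precise search mentioned above. The small HTML document here is a made-up variation of the sample.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><a href="http://example.com/" id="home">Home</a></p>
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select_one() returns the first match as a Tag, or None if nothing matches.
first_link = soup.select_one("#link1")
no_match = soup.select_one("#does-not-exist")

# Child combinator: only <a> tags directly inside <p class="story">.
story_links = soup.select("p.story > a")

# Attribute selector: <a> tags whose href ends with "tillie".
tillie_links = soup.select('a[href$="tillie"]')

print(first_link.get_text(), no_match)
print([a["id"] for a in story_links])
print([a.get_text() for a in tillie_links])
```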

get_text()

You can extract all the text inside a tag.

# Extract the text inside the first <p> tag whose class is story.
story_paragraph = soup.find('p', class_='story')
print(story_paragraph.get_text())
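By default get_text() joins the text pieces exactly as they appear, including stray whitespace. A short sketch of the separator and strip arguments, using a made-up snippet, shows one way to get cleaner output.

```python
from bs4 import BeautifulSoup

html_doc = "<div><h1> Title </h1><p>Hello <b>world</b>!</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.div

# Default: text pieces are concatenated as-is, whitespace included.
raw = div.get_text()

# strip=True trims each piece; the separator is placed between pieces.
clean = div.get_text(" ", strip=True)

# stripped_strings yields the same trimmed pieces one by one.
pieces = list(div.stripped_strings)

print(repr(raw))
print(clean)
print(pieces)
```

Note that the separator goes between every text piece, so punctuation that sat in its own text node ends up separated by a space as well.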

Accessing tag attributes

A tag's attributes can be accessed like a dictionary.

a_tag = soup.find('a')
print(a_tag['href'])  # http://example.com/elsie
print(a_tag.attrs)  # {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
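Indexing with square brackets raises KeyError when the attribute is missing, so get() is often the safer choice on real pages. A minimal sketch with a made-up tag:

```python
from bs4 import BeautifulSoup

html_doc = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
a_tag = BeautifulSoup(html_doc, "html.parser").a

# get() returns None (or a default you supply) instead of raising KeyError.
href = a_tag.get("href")
title = a_tag.get("title")
title_or_default = a_tag.get("title", "(no title)")

# class is a multi-valued attribute, so it comes back as a list.
classes = a_tag["class"]

# has_attr() checks for an attribute without reading it.
has_id = a_tag.has_attr("id")

print(href, title, title_or_default)
print(classes, has_id)
```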

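Parsing HTML and XML

BeautifulSoup parses HTML and XML documents and converts them into Python objects.
This makes the DOM (Document Object Model) structure easy to navigate and manipulate.

Navigating and searching tags

You can search for and extract specific tags, attributes, text, and more.
Complex queries are also possible with CSS selectors or regular expressions.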
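As a sketch of the regular-expression queries mentioned above, find_all() accepts a compiled pattern for the tag name, for an attribute value, or for the tag's text, and it also accepts a function as a filter. The document and the helper function `has_href_but_no_class` are hypothetical.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<h1>Report</h1>
<h2>Summary</h2>
<p>Total: 42 items</p>
<p>No numbers here</p>
<a href="https://example.com/a.pdf">PDF</a>
<a href="https://example.com/b.html">Page</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Match tag names against a pattern: every heading from h1 to h6.
headings = soup.find_all(re.compile(r"^h[1-6]$"))

# Match an attribute value against a pattern: links ending in .pdf.
pdf_links = soup.find_all("a", href=re.compile(r"\.pdf$"))

# Match a tag's text against a pattern: paragraphs containing a digit.
numbered = soup.find_all("p", string=re.compile(r"\d+"))


def has_href_but_no_class(tag):
    """Custom filter: called once per tag, returns True to keep it."""
    return tag.has_attr("href") and not tag.has_attr("class")


plain_links = soup.find_all(has_href_but_no_class)

print([h.name for h in headings])
print([a["href"] for a in pdf_links])
print([p.get_text() for p in numbered])
print(len(plain_links))
```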

Data extraction

You can extract the data elements you need from a web page and convert them into a structured format.
For example, table data, article text, and image URLs are all easy to extract.
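To make the table and image URL cases concrete, here is a minimal sketch that turns a small table into a list of dictionaries and collects image sources. The table contents and the id `prices` are invented for the example.

```python
from bs4 import BeautifulSoup

html_doc = """
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.95</td></tr>
</table>
<img src="/img/apple.png" alt="apple">
<img src="/img/pear.png" alt="pear">
"""
soup = BeautifulSoup(html_doc, "html.parser")

table = soup.find("table", id="prices")
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows.append(dict(zip(headers, cells)))

# src=True keeps only <img> tags that actually have a src attribute.
image_urls = [img["src"] for img in soup.find_all("img", src=True)]

print(rows)
print(image_urls)
```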

Modifying the document

You can modify an HTML or XML document, for example by adding or removing tags.

Example: web scraping

Web scraping is one of BeautifulSoup's main use cases. For example, the following extracts data from a specific website.

import requests
from bs4 import BeautifulSoup

# Request the web page
url = 'https://example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on HTTP errors (4xx/5xx)

# Create the BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

# Extract the page title
title = soup.title.text
print(f"Title: {title}")

# Extract all links
for link in soup.find_all('a'):
    print(link.get('href'))
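Links on real pages are often relative, and some `<a>` tags have no href at all. The sketch below shows one way to handle both cases; the helper `extract_links` and the sample HTML are hypothetical, and the function takes an HTML string so it can be tried without a network request.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):  # skip <a> tags without href
        links.append(urljoin(base_url, a["href"]))
    return links


sample_html = (
    '<a href="/about">About</a>'
    '<a href="news.html">News</a>'
    "<a>No href</a>"
    '<a href="https://other.example/x">External</a>'
)
links = extract_links(sample_html, "https://example.com/blog/")
for link in links:
    print(link)
```

In a real scraper, `response.text` and `response.url` from requests could be passed in place of the sample string and base URL.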

Advanced features

BeautifulSoup provides a range of advanced features for more complex tasks.

Modifying the document

You can use BeautifulSoup to modify an HTML document.

# Add a new tag
new_tag = soup.new_tag('p')
new_tag.string = "This is a new paragraph."
soup.body.append(new_tag)

print(soup.prettify())
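Besides appending tags, you can change attributes, replace text, and remove or swap elements. This is a minimal sketch on a made-up snippet, not an exhaustive list of the editing methods.

```python
from bs4 import BeautifulSoup

html_doc = '<div><p id="intro">Hello</p><script>track()</script><span>old</span></div>'
soup = BeautifulSoup(html_doc, "html.parser")

# Set an attribute and replace the tag's text.
soup.p["class"] = "lead"
soup.p.string = "Hello, world"

# Remove a tag and its contents from the tree entirely.
soup.script.decompose()

# Swap one tag for another.
new_tag = soup.new_tag("b")
new_tag.string = "new"
soup.span.replace_with(new_tag)

print(soup)
```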

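Navigating parents, children, and siblings

BeautifulSoup lets you navigate the hierarchical relationships between tags.

# Access the parent tag
parent = a_tag.parent

# Access the child nodes
children = list(soup.body.children)

# Access sibling tags (None if there is no such sibling)
next_sibling = a_tag.find_next_sibling()
previous_sibling = a_tag.find_previous_sibling()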
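One detail worth knowing is that the .next_sibling attribute can return a text node, while find_next_sibling() skips ahead to the next tag. The sketch below illustrates this on a small made-up snippet, together with .parents and .children.

```python
from bs4 import BeautifulSoup, Tag

html_doc = '<p id="story"><a id="a1">one</a>, <a id="a2">two</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.find("a", id="a1")

# .next_sibling is the very next node, which here is the text ", ".
raw_next = first.next_sibling

# find_next_sibling() skips text nodes and returns the next matching tag.
next_tag = first.find_next_sibling("a")

# There is no earlier <a>, so this is None.
prev_tag = first.find_previous_sibling("a")

# .parents walks upward to the document root.
ancestors = [parent.name for parent in first.parents]

# .children includes text nodes too, so keep only Tag objects.
child_ids = [child["id"] for child in soup.p.children if isinstance(child, Tag)]

print(repr(raw_next), next_tag["id"], prev_tag)
print(ancestors)
print(child_ids)
```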

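Conclusion

BeautifulSoup is a very useful tool for extracting data from web pages, navigating their structure, and modifying documents.

Combined with its choice of parsers, it handles HTML and XML files flexibly and makes web scraping and data analysis tasks efficient. Its intuitive API and powerful features make a wide range of web data processing tasks easy to solve.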

goodchuck
