How to parse HTML in Python using BeautifulSoup module


In this tutorial, we’ll look at how to use the BeautifulSoup module to parse HTML in Python.

Parse HTML in Python using BeautifulSoup

Assume that we want to parse a simple HTML file with some different tags and attributes like this:

[Image: sample HTML page]

BeautifulSoup is a module that allows us to extract data from an HTML page. You will find working with HTML easier with it than with regex. We will:
– be able to use simple methods and Pythonic idioms to search the parse tree, then extract what we need without boilerplate code.
– not have to think about encoding (or only have to specify the original encoding), because BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
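As a small illustration of the encoding point, here is a minimal sketch. It assumes the module is already installed (see the install step below) and uses a made-up one-line document encoded as Latin-1; the from_encoding argument is only needed when we want to state the original encoding ourselves.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes in a non-UTF-8 encoding (Latin-1 here).
raw = "<p>café</p>".encode("latin-1")

# Tell BeautifulSoup the original encoding; it converts the document to Unicode.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.get_text()           # an ordinary Python str
utf8_bytes = soup.encode("utf-8")  # the document written back out as UTF-8 bytes

print(text)
```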

Use BeautifulSoup for parsing HTML data

Install BeautifulSoup module

Open cmd, then run:
pip install beautifulsoup4

Once the installation is successful, we can see the bs4 package folder (next to the beautifulsoup4 metadata folder) at Python\Python[version]\Lib\site-packages.

Now we can import the module by running import bs4.

Create BeautifulSoup object
From response of a website

When our PC is connected to the internet, we can use the requests module to download an HTML page.
To install the module, run: pip install requests

The raise_for_status() method raises an exception if the download failed (for example, on a 4xx or 5xx status code), so our program halts instead of parsing an error page.
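A minimal sketch of this step might look like the following. The function name fetch_soup and the timeout value are our own choices for illustration, and "html.parser" is Python's built-in parser, picked here so that nothing extra needs installing.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Download a page and return it as a BeautifulSoup object."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the server answered with a 4xx/5xx status.
    response.raise_for_status()
    # Parse the downloaded HTML text with the built-in parser.
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup('https://example.com') (a placeholder URL) would return a BeautifulSoup object for that page, or raise an exception if the download fails.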

From HTML file on PC

We can load an HTML file on our PC by passing a File object to the bs4.BeautifulSoup() function.
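The following sketch shows one way to do this. The helper name load_soup, the file name sample.html, and its contents are assumptions made up for the demo; the demo writes the file to a temporary folder first so that the example runs on its own.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path: str) -> BeautifulSoup:
    """Parse a local HTML file into a BeautifulSoup object."""
    # Open the file and hand the File object straight to BeautifulSoup.
    with open(path, encoding="utf-8") as html_file:
        return BeautifulSoup(html_file, "html.parser")


# Demo: write a tiny hypothetical page to a temporary folder, then load it.
with tempfile.TemporaryDirectory() as folder:
    sample_path = Path(folder) / "sample.html"
    sample_path.write_text("<html><body><h1>Hello</h1></body></html>", encoding="utf-8")
    soup = load_soup(str(sample_path))

print(soup.h1.get_text())  # Hello
```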

Find HTML elements

Once we have a BeautifulSoup object, we can call its select('selector') method, passing a CSS selector string, to search for the elements we need. It returns a list of the matching elements.

Here are some useful selectors:

  • select('certain-tag'): all elements whose HTML tag is certain-tag
  • select('#certain-id'): the element with an id attribute of certain-id
  • select('.certain-class'): all elements that use certain-class as CSS class
  • select('tag-a tag-b'): all tag-b elements which are inside tag-a elements
  • select('tag-a > tag-b'): all tag-b elements which are directly inside tag-a elements (without any element between)
  • select('certain-tag[att]'): all certain-tag elements that have att attribute
  • select('certain-tag[att="val"]'): all certain-tag elements that have att attribute with value val

In addition to the selectors above, we can also combine them into more specific ones, such as: select('.certain-class certain-tag'), select('tag-a tag-b tag-c'), select('.class-a .class-b').

BeautifulSoup also provides the select_one() method, which returns only the first element that matches the selector (or None if nothing matches).
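To make the selectors concrete, here is a short sketch run against a small hypothetical page. The HTML string below is made up for this illustration and is not the sample page from the tutorial.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for this illustration.
html = """
<div id="main">
  <p class="intro">Hello <a href="/a">first</a></p>
  <ul><li><a href="/b" target="_blank">second</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select("a")                # by tag name
main = soup.select("#main")             # by id
intros = soup.select(".intro")          # by CSS class
nested = soup.select("div a")           # <a> anywhere inside a <div>
direct = soup.select("p > a")           # <a> directly inside a <p>
div_direct = soup.select("div > a")     # <a> directly inside a <div>: none here
with_target = soup.select("a[target]")  # <a> that has a target attribute
to_b = soup.select('a[href="/b"]')      # <a> whose href equals "/b"

first_link = soup.select_one("a")       # first match only
missing = soup.select_one("table")      # no <table> in the page, so None

print(len(links), len(nested), len(direct), len(div_direct))  # 2 2 1 0
```

Note how 'div a' matches both links while 'div > a' matches none, because each link sits inside a p or li element rather than directly inside the div.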

Parse data from HTML elements

On an HTML element, we:
– use getText() to get the element’s text (the text content with the tags stripped, not the inner HTML).
– read attrs to get a dictionary of the element’s attributes.
– use get('attr') to access the element’s attr attribute.
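These three accessors can be sketched on a single element as follows. The HTML string is a made-up example; getText() is used to match the text above, and get_text() is the same method under its documented name.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document, made up for this illustration.
html = '<p class="intro" id="first">Hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")
elem = soup.select_one("p")

text = elem.getText()        # 'Hello world' (tags stripped)
attributes = elem.attrs      # {'class': ['intro'], 'id': 'first'}
elem_id = elem.get("id")     # 'first'
missing = elem.get("title")  # None, because there is no title attribute

print(text, attributes, elem_id, missing)
```

The class attribute comes back as a list because an element can carry several CSS classes.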



By grokonez | January 19, 2019.

