How to parse HTML in Python using BeautifulSoup module


In this tutorial, we’re going to look at how to use the BeautifulSoup module to parse HTML in Python.

Parse HTML in Python using BeautifulSoup

Assume that we want to parse a simple HTML file with some different tags and attributes like this:

<html>
<head>
	<title>grokonez</title>
</head>
<body>
	<div>
		<h1 class="gkz-large" site="grokonez.com">grokonez.com</h1>
		<h1 site="javasampleapproach.com">javasampleapproach.com</h1>

		<div class="gkz-header gkz-large">
			<h2 class="gkz-title">Programming Tutorials</h2>
			<p class="gkz-title">Java, Javascript, Python Technology</p>
		</div>

		<p>Be bold in stating your key points. Put them in a list:</p>
		<ul>
			<li>The first item in your list</li>
			<li>The second item; <i>italicize</i> key words</li>
		</ul>

		<p>Improve your image by including an image.</p>

		<h3>A Great HTML Resource</h3>

		<p>Add a link to your favorite Web site. Break up your page with a horizontal rule or two.</p>
		<hr>

		<p id="about">Finally, link to <a href="another_page.html">another page</a> in your own Web site.</p>
	</div>

	<p>© grokonez 2019</p>
</body>
</html>

(Screenshot: the sample HTML page rendered in a browser.)

BeautifulSoup is a module that allows us to extract data from an HTML page. You will find working with HTML easier than with regular expressions. We will:
– be able to use simple methods and Pythonic idioms to search the parse tree, then extract what we need without boilerplate code.
– not have to think about encodings (or only have to specify the original encoding), because BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
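As a minimal sketch of both points, here is a tiny inline HTML string (made up for illustration) parsed with the standard-library html.parser backend; the non-ASCII text comes back as Unicode without any extra work:

```python
import bs4

# A tiny HTML snippet with a non-ASCII character; BeautifulSoup
# converts the incoming document to Unicode automatically.
html = '<html><body><h1 class="gkz-large">Café tutorials</h1></body></html>'

soup = bs4.BeautifulSoup(html, 'html.parser')  # html.parser needs no extra install
h1 = soup.find('h1')
print(h1.getText())   # Café tutorials
print(h1['class'])    # ['gkz-large']
```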

Use BeautifulSoup for parsing HTML data

Install BeautifulSoup module

Open cmd, then run:
pip install beautifulsoup4

Once the installation is successful, we can see the beautifulsoup4 folder at Python\Python[version]\Lib\site-packages.

Now we can import the module by running import bs4.

Create BeautifulSoup object
From response of a website

When our PC is connected to the Internet, we can use the requests module to download an HTML file.
Run cmd: pip install requests to install the module.

>>> import requests, bs4
>>> response = requests.get('https://grokonez.com/wp-content/uploads/2019/01/grokonez.html')
>>> response.raise_for_status()
>>> gkzSoup = bs4.BeautifulSoup(response.text, 'html.parser')
>>> type(gkzSoup)
<class 'bs4.BeautifulSoup'>

The raise_for_status() method ensures that our program halts if a bad download occurs.
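To see what raise_for_status() does without touching the network, we can hand-build a bare Response object and give it an error status (a contrived sketch; real code would get the response from requests.get()):

```python
import requests

# Simulate a bad download: a Response with a 404 status code.
response = requests.Response()
response.status_code = 404

try:
    response.raise_for_status()  # raises HTTPError for 4xx/5xx codes
except requests.exceptions.HTTPError as err:
    print('Download failed:', err)
```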

From HTML file on PC

We can load an HTML file on the PC by passing a File object to the bs4.BeautifulSoup() function.

>>> import bs4
>>> gkzFile = open('grokonez.html')
>>> gkzSoup = bs4.BeautifulSoup(gkzFile.read(), 'html.parser')
>>> type(gkzSoup)
<class 'bs4.BeautifulSoup'>
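In script form, a with-statement closes the file for us; the sketch below writes its own stand-in for grokonez.html first (the file name and contents here are made up), so it runs anywhere:

```python
import bs4, os, tempfile

# Write a stand-in for grokonez.html so the sketch is self-contained.
path = os.path.join(tempfile.gettempdir(), 'grokonez-sample.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write('<html><body><h1 site="grokonez.com">grokonez.com</h1></body></html>')

# A context manager closes the file for us; passing an explicit
# parser avoids BeautifulSoup's "no parser specified" warning.
with open(path, encoding='utf-8') as gkzFile:
    gkzSoup = bs4.BeautifulSoup(gkzFile.read(), 'html.parser')

print(type(gkzSoup))            # <class 'bs4.BeautifulSoup'>
print(gkzSoup.h1.get('site'))   # grokonez.com
```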

Find HTML elements

Once we have a BeautifulSoup object, we can use its select('selector') method, with a selector string as input, to search for the elements we need.

Here are some useful selectors:

  • select('certain-tag'): all elements which HTML tag are certain-tag
  • select('#certain-id'): element with id attribute of certain-id
  • select('.certain-class'): all elements that use certain-class as CSS class
  • select('tag-a tag-b'): all tag-b elements which are inside tag-a elements
  • select('tag-a > tag-b'): all tag-b elements which are directly inside tag-a elements (without any element between)
  • select('certain-tag[att]'): all certain-tag elements that have att attribute
  • select('certain-tag[att="val"]'): all certain-tag elements that have att attribute with value val
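The selectors above can also be exercised as a script against an inline stand-in for the sample page (the markup below is a trimmed reconstruction of the sample file, for illustration only):

```python
import bs4

html = '''
<div>
  <h1 class="gkz-large" site="grokonez.com">grokonez.com</h1>
  <h1 site="javasampleapproach.com">javasampleapproach.com</h1>
  <ul>
    <li>The first item in your list</li>
    <li>The second item</li>
  </ul>
</div>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

print(len(soup.select('li')))          # 2: all <li> tags
print(len(soup.select('div > li')))    # 0: <li> is not a direct child of <div>
print(len(soup.select('div > ul')))    # 1: <ul> is a direct child of <div>
print(len(soup.select('h1[site]')))    # 2: both <h1> tags have a site attribute
print(soup.select('h1[site="grokonez.com"]')[0].getText())  # grokonez.com
```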
# select('tag')
>>> gkzSoup.select('li')
[<li>The first item in your list</li>, <li>The second item; <i>italicize</i> key words</li>]

# select('#id')
>>> gkzSoup.select('#about')
[<p id="about">Finally, link to <a href="another_page.html">another page</a> in your own Web site.</p>]

# select('.class')
>>> gkzSoup.select('.gkz-large')
[<h1 class="gkz-large" site="grokonez.com">grokonez.com</h1>, <div class="gkz-header gkz-large"><h2 class="gkz-title">Programming Tutorials</h2><p class="gkz-title">Java, Javascript, Python Technology</p></div>]

# select('tag-a tag-b')
>>> gkzSoup.select('div li')
[<li>The first item in your list</li>, <li>The second item; <i>italicize</i> key words</li>]

# select('tag-a > tag-b')
>>> gkzSoup.select('div > li')
[]
>>> gkzSoup.select('div > ul')
[<ul><li>The first item in your list</li><li>The second item; <i>italicize</i> key words</li></ul>]

# select('tag[att]')
>>> gkzSoup.select('h1[site]')
[<h1 class="gkz-large" site="grokonez.com">grokonez.com</h1>, <h1 site="javasampleapproach.com">javasampleapproach.com</h1>]

# select('tag[att="val"]')
>>> gkzSoup.select('h1[site="grokonez.com"]')
[<h1 class="gkz-large" site="grokonez.com">grokonez.com</h1>]

In addition to the selectors above, we can also make more custom ones such as: select('.certain-class certain-tag'), select('tag-a tag-b tag-c'), select('.class-a .class-b').

>>> gkzSoup.select('.gkz-large h2')
[<h2 class="gkz-title">Programming Tutorials</h2>]

>>> gkzSoup.select('div ul li')
[<li>The first item in your list</li>, <li>The second item; <i>italicize</i> key words</li>]

>>> gkzSoup.select('.gkz-header .gkz-title')
[<h2 class="gkz-title">Programming Tutorials</h2>, <p class="gkz-title">Java, Javascript, Python Technology</p>]

>>> gkzSoup.select('p#about a')
[<a href="another_page.html">another page</a>]

BeautifulSoup also provides a select_one() method that finds only the first tag that matches the selector.

>>> gkzSoup.select_one('li')
<li>The first item in your list</li>

Parse data from HTML elements

On an HTML element, we:
– use getText() to get the element’s text/inner HTML.
– call attrs to get the element’s attributes as a dictionary.
– use get('attr') to access the element’s attr attribute.

>>> els = gkzSoup.select('div h1')
>>> els
[<h1 class="gkz-large" site="grokonez.com">grokonez.com</h1>, <h1 site="javasampleapproach.com">javasampleapproach.com</h1>]
>>> els[0].getText()
'grokonez.com'
>>> els[0].attrs
{'class': ['gkz-large'], 'site': 'grokonez.com'}
>>> els[0].get('class')
['gkz-large']
>>> els[1].get('site')
'javasampleapproach.com'
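Putting it together, a short standalone script (using an inline snippet in place of the downloaded page, so it runs without a network connection; the markup mirrors the sample file's two h1 elements):

```python
import bs4

html = '''
<div>
  <h1 class="gkz-large" site="grokonez.com">grokonez.com</h1>
  <h1 site="javasampleapproach.com">javasampleapproach.com</h1>
</div>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

for el in soup.select('div h1'):
    # getText() -> the element's text, attrs -> a dict of its attributes
    print(el.getText(), el.attrs)

first = soup.select_one('div h1')   # only the first matching element
print(first.get('class'))   # ['gkz-large']
print(first.get('site'))    # grokonez.com
```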


By grokonez | January 19, 2019.