Data Acquisition 2 | Beautifulsoup

Series: Data Acquisition

1. Installation

To begin using Beautifulsoup, we have to install it first.

$ pip install beautifulsoup4

Then we import the package at the top of our Python file,

from bs4 import BeautifulSoup

2. Create the Soup

We can create the soup from a local HTML file,

with open(<path of the file>) as f:
    text = f.read()
soup = BeautifulSoup(text, 'html.parser')

or we can create the soup from an online webpage,

import requests
reqs = requests.get(<URL>)
reqs.encoding = 'utf-8'  # change this to match the page's encoding
soup = BeautifulSoup(reqs.text, 'html.parser')
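
Both approaches above end up handing HTML text to BeautifulSoup. As a minimal sketch of that idea, the constructor also accepts a plain string or an open file handle directly. The HTML snippet and the file name page.html below are made up for this illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page, used only for this illustration.
html = "<html><head><title>Local page</title></head><body><p>Hi</p></body></html>"

# 1) Build the soup straight from a string.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Build the soup from an open file handle (no explicit f.read() needed).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.title.string)
```

Working from a string like this is handy for trying out the calls in the next section without a network connection.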

3. Play with the Soup

  • format the HTML code to make it easier to read (prettify() returns a string)
print(soup.prettify())
  • get the <title> tag of the page
soup.title
  • get the title text of the page
soup.title.string

or

soup.title.text
  • get all the <a> tags
soup.find_all('a')
  • get the link of all the <a> tags including None
for link in soup.find_all('a'):
    print(link.get('href'))
  • get the link of all the <a> tags, not including None
for link in soup.find_all('a', href=True):
    print(link['href'])
  • print the name of a tag
tag.name
  • print the text of a tag
tag.text
  • print the attributes dictionary of a tag
tag.attrs
  • get the id of a tag (raises KeyError if the tag has no id)
tag['id']

or, to get None instead of an error when the id is missing,

tag.get('id')
  • find all the images
soup.find_all('img')
  • get the src link of all the images
for img in soup.find_all('img'):
    print(img.get('src'))
  • get a list of all the direct children of a tag (tags and text strings)
tag.contents
  • get an iterator over the direct children of a tag
tag.children
  • find a tag with a specific attribute
soup.find(<attr>=<value>)
  • find the first <a> tag
soup.find('a')
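  • find the next element (the tag or string parsed right after this one)
tag.next_element
  • find the previous element (the tag or string parsed right before this one)
tag.previous_element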
  • find <a> tags and <b> tags
soup.find_all(["a", "b"])
  • find the first <a> tag with an id
soup.find('a', id=<ID>)
  • find the first <a> tag with a class
soup.find('a', class_=<CLASS>)
  • find the first n <a> tags
soup.find_all("a", limit=<n>)
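
To see several of the calls above working together, here is a small self-contained sketch. The HTML page, its ids, classes and paths are all made up for this example, so the results shown in the comments apply only to this sample.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for this example.
html = """<html><head><title>Demo page</title></head>
<body>
<p id="intro" class="lead">Hello <b>world</b></p>
<a id="home" class="nav" href="/home">Home</a>
<a class="nav" href="/about">About</a>
<a name="top">No link here</a>
<img src="/img/logo.png" alt="logo">
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Title text of the page.
title_text = soup.title.string            # 'Demo page'

# href of every <a> tag; the third tag has no href, so it yields None.
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# Only the <a> tags that actually carry an href attribute.
real_hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# src of every image.
image_srcs = [img.get("src") for img in soup.find_all("img")]

# Look up a tag by attribute, then inspect it.
intro = soup.find(id="intro")
intro_name = intro.name                   # 'p'
intro_text = intro.text                   # 'Hello world'
intro_children = intro.contents           # a text string and the <b> tag

# First <a> with a given class, and the first two <a> tags.
first_nav = soup.find("a", class_="nav")
first_two = soup.find_all("a", limit=2)

# <a> and <b> tags together, in document order.
a_and_b_names = [t.name for t in soup.find_all(["a", "b"])]

print(title_text)
print(all_hrefs)
print(real_hrefs)
print(image_srcs)
print(first_nav["id"], len(first_two), a_and_b_names)
```

Note that class_ has a trailing underscore because class is a reserved word in Python.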