Data Acquisition 2 | Beautifulsoup

1. Installation
To begin using Beautiful Soup, we first have to install it.
$ pip install beautifulsoup4
Then we import the package at the top of our Python file,
from bs4 import BeautifulSoup
2. Create the Soup
We can create the soup from a local HTML file,
with open(<path of the file>) as f:
    text = f.read()
soup = BeautifulSoup(text, 'html.parser')
or we can create the soup from an online webpage,
import requests
reqs = requests.get(<URL>)
reqs.encoding = 'utf-8'  # this can be changed
soup = BeautifulSoup(reqs.text, 'html.parser')
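The two ways of creating the soup can be wrapped in small helpers. This is only a sketch: the names `soup_from_file` and `soup_from_url`, the 10-second timeout, and the default `utf-8` encoding are assumptions made for illustration, not part of Beautiful Soup or requests.

```python
import requests
from bs4 import BeautifulSoup


def soup_from_file(path, encoding="utf-8"):
    """Parse a local HTML file (hypothetical helper name)."""
    with open(path, encoding=encoding) as f:
        return BeautifulSoup(f.read(), "html.parser")


def soup_from_url(url, encoding="utf-8", timeout=10):
    """Download a page and parse it (hypothetical helper name)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # raise requests.HTTPError on 4xx/5xx
    response.encoding = encoding  # assumed encoding; change it if the page differs
    return BeautifulSoup(response.text, "html.parser")
```

Calling `raise_for_status()` before parsing avoids building a soup from an error page by accident.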
3. Play with the Soup
- format the HTML code to make it easier to read
soup.prettify()
- get the <title> tag of the page
soup.title
- get the title text of the page
soup.title.string
or
soup.title.text
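For a title, .string and .text usually give the same result, but they can differ on tags with mixed content. A small sketch on a made-up page (the HTML below is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Example page</title></head>
  <body><p>Hello <b>world</b></p></body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title            # the whole <title> element
title_string = soup.title.string  # the single string inside it
title_text = soup.title.text      # all text inside it, joined together

# With one text child the two agree; with mixed content they differ.
p_string = soup.p.string  # None, because <p> has more than one child
p_text = soup.p.text      # text of <p> and of its descendants

print(title_tag)
print(title_string, "|", title_text)
print(p_string, "|", p_text)
```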
- get all the <a> tags
soup.find_all('a')
- print the link of every <a> tag, including None for tags without an href
for link in soup.find_all('a'):
    print(link.get('href'))
- print the link of every <a> tag, skipping tags without an href
for link in soup.find_all('a', href=True):
    print(link.get('href'))
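The difference between the two loops is easiest to see on a page where one <a> tag has no href. A minimal sketch with invented HTML:

```python
from bs4 import BeautifulSoup

html = """
<a href="https://example.com/one">One</a>
<a name="anchor-only">No link here</a>
<a href="/two">Two</a>
"""
soup = BeautifulSoup(html, "html.parser")

# .get('href') returns None for an <a> tag that has no href attribute.
all_hrefs = [link.get("href") for link in soup.find_all("a")]

# href=True keeps only the tags that actually carry an href attribute.
real_hrefs = [link["href"] for link in soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```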
- print the name of a tag
tag.name
- print the text of a tag
tag.text
- print the attributes dictionary of a tag
tag.attrs
- print the id of a tag
tag['id']
or, to get None instead of a KeyError when the tag has no id,
tag.get('id')
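The tag properties above can be tried on a single element. This sketch uses a made-up <div>; note that Beautiful Soup returns the class attribute as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

html = '<div id="main" class="box wide">Hello <span>there</span></div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.div

print(tag.name)       # element name
print(tag.text)       # text of the tag and of its descendants
print(tag.attrs)      # attribute dictionary
print(tag["id"])      # raises KeyError if the attribute is missing
print(tag.get("id"))  # returns None if the attribute is missing

span = soup.span
missing_id = span.get("id")  # <span> has no id attribute
```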
- find all the images
soup.find_all('img')
- print the source link of every image
for img in soup.find_all('img'):
    print(img.get('src'))
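Image sources are often relative paths, so they may need to be joined with the page address before they can be downloaded. A sketch under that assumption, using the standard library's urljoin; `base_url` and the HTML are invented for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/gallery/"  # hypothetical page address
html = """
<img src="cat.png" alt="cat">
<img src="/static/dog.png" alt="dog">
<img alt="no source">
"""
soup = BeautifulSoup(html, "html.parser")

# src=True skips <img> tags that have no src attribute.
image_urls = [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]
print(image_urls)
```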
- get a list of all the direct children of a tag
tag.contents
- get an iterator over the direct children of a tag
tag.children
tag.children
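A short sketch of the two ways to reach the children, on an invented list. The HTML is written without whitespace between the tags on purpose: line breaks between tags would show up as extra string children.

```python
from bs4 import BeautifulSoup

html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")
tag = soup.ul

children_list = tag.contents  # a list, so it supports len() and indexing
first_item = tag.contents[0]

# .children is consumed by iterating over it.
child_texts = [child.text for child in tag.children]

print(len(children_list), first_item, child_texts)
```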
- find a tag with a specific attribute
soup.find(<attr>=<value>)
- find the first <a> tag
soup.find('a')
- find the next element (this may be a string rather than a tag)
tag.next_element
- find the previous element
tag.previous_element
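Element navigation follows parse order, so the neighbour of a tag is not always another tag. A sketch on invented HTML; `next_sibling` is not covered in the list above and is added here only for comparison:

```python
from bs4 import BeautifulSoup

html = "<p><a>first</a><b>second</b></p>"
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.a

nxt = a_tag.next_element       # the string "first", parsed right after <a>
prev = a_tag.previous_element  # the enclosing <p>, parsed right before <a>
sibling = a_tag.next_sibling   # the <b> tag at the same level as <a>

print(repr(nxt), prev.name, sibling.name)
```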
- find <a> tags and <b> tags
soup.find_all(["a", "b"])
- find the first <a> tag with a given id
soup.find('a', id=<ID>)
- find the first <a> tag with a given class
soup.find('a', class_=<CLASS>)
- find the first n <a> tags
soup.find_all("a", limit=<n>)
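The search filters above can be compared side by side on one small page. This is a sketch with invented HTML; the ids and class names are placeholders.

```python
from bs4 import BeautifulSoup

html = """
<a id="home" class="nav" href="/">Home</a>
<a class="nav external" href="https://example.com">Example</a>
<b>Bold</b>
<a href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_attr = soup.find(id="home")               # any tag with id="home"
first_a = soup.find("a")                     # first <a> in document order
a_and_b = soup.find_all(["a", "b"])          # every <a> and <b> tag
by_id = soup.find("a", id="home")            # first <a> with that id
by_class = soup.find_all("a", class_="nav")  # matches any one of a tag's classes
first_two = soup.find_all("a", limit=2)      # stop after two matches
missing = soup.find("a", id="nope")          # find() returns None on no match

print([t.name for t in a_and_b])
print([t.text for t in by_class])
```

Because find() returns None when nothing matches, it is worth checking the result before reading attributes from it.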