Posted under » Python on 4 Sep 2023
Before you can start a data analysis project, you need data. Often that data lives on a website, and that is where web scraping comes in.
You can do this with PHP or Selenium, but it is much simpler to use Beautiful Soup. It ships with the Anaconda Python distribution, so if you use that you don't have to install it separately; otherwise, install it with pip install beautifulsoup4.
Here is a basic request that outputs the raw HTML:
from bs4 import BeautifulSoup
import requests

url = "https://www.anoneh.com?page=1"
result = requests.get(url)
print(result.text)
You can make it look pretty with the default html.parser, or a parser of your choice, e.g. html5lib if you have it installed.
url = "https://www.anoneh.com?page=1"
result = requests.get(url)
naise = BeautifulSoup(result.text, "html5lib")
print(naise.prettify())
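Since the page above requires a live request, here is a self-contained sketch of prettify() using a hard-coded snippet (the markup is made up, not from the real site):

```python
from bs4 import BeautifulSoup

# a made-up snippet, standing in for the fetched page
html = "<html><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() re-indents the tree, one tag per line
print(soup.prettify())
```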
This is the most important part: getting data. To begin, let's get the title, which occurs only once.
tit = naise.title
# <title>Beautiful Soup 4 - Web Scraping With Python</title>

tit = naise.title.string
# Beautiful Soup 4 - Web Scraping With Python

tit = naise.title.parent.name
# head
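The same lookups work on any document; here is a quick self-contained check with an invented title:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title)              # the whole <title> tag
print(soup.title.string)       # Demo
print(soup.title.parent.name)  # head
```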
Next, we get the links; there could be many.
dvd = naise.a              # one, the first
dvd = naise.find_all('a')  # many
Now let's look at each one of them.
for link in naise.find_all('a'):
    print(link.get('href'))

# update.php
# newrelease.php
# newentries.php
# wanted.php
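If you would rather collect the links than print them, a list comprehension works too; this sketch uses a stand-in snippet instead of the live page:

```python
from bs4 import BeautifulSoup

# stand-in markup echoing the output above
html = '<a href="update.php">U</a><a href="newrelease.php">N</a>'
soup = BeautifulSoup(html, "html.parser")

hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)  # ['update.php', 'newrelease.php']
```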
In the above example, you can also get the title attribute within an <a> tag.
for link in naise.find_all('a'):
    print(link.get('title'))
If you just want to see the text without any distractions:
# for links
for link in naise.find_all('a'):
    print(link.get_text())

# or the whole page
for link in naise:
    print(link.get_text())
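For the whole page you can also skip the loop and call get_text() on the soup itself; strip=True and a separator keep the result tidy. A self-contained sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p><p>World</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

text = soup.get_text(separator=" ", strip=True)
print(text)  # Hello World
```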
Let's say we want to get the text within a div.
<div class="id">sample-137</div>

# if we want to get everything
for link in naise.find_all("div", "id"):
    print(link)

# but if we just want 'sample-137'
for link in naise.find_all("div", "id"):
    print(link.get_text())
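To verify this without the live page, here is the same lookup run against the snippet itself; note that the second positional argument of find_all() matches the class attribute:

```python
from bs4 import BeautifulSoup

html = '<div class="id">sample-137</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all("div", "id") means: div tags whose class is "id"
for div in soup.find_all("div", "id"):
    print(div.get_text())  # sample-137
```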
Other than searching by tag name, we may also need to find an element by its id attribute.
dvd = naise.find(id="vid_102")
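find(id=...) matches the id attribute on any tag and returns the first hit, or None if nothing matches. A self-contained sketch with an invented snippet modelled on the vid_102 id above:

```python
from bs4 import BeautifulSoup

# invented markup; vid_102 mirrors the id used above
html = '<div id="vid_102"><a href="trailer.php">Trailer</a></div>'
soup = BeautifulSoup(html, "html.parser")

dvd = soup.find(id="vid_102")
print(dvd.get_text())     # Trailer
print(dvd.a.get("href"))  # trailer.php
```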