Posted under » Python on 4 Sep 2023
Before you can start a data analysis project, you need data. Often that data lives on a website, and that is where web scraping comes in.
You can do this with PHP or Selenium, but it is much simpler to use Beautiful Soup. It ships with the Anaconda Python distribution, so if you use that you don't have to install it separately; otherwise, install it with pip install beautifulsoup4.
Here is a basic request that outputs the raw HTML:
from bs4 import BeautifulSoup
import requests

url = "https://www.anoneh.com?page=1"
result = requests.get(url)
print(result.text)
You can make it look pretty with the default html.parser, or a parser of your choice, e.g. html5lib if you have it installed.
url = "https://www.anoneh.com?page=1"
result = requests.get(url)
naise = BeautifulSoup(result.text, "html5lib")
print(naise.prettify())
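Since the page above requires a live request, here is a self-contained sketch of prettify() using a hard-coded snippet (the markup is made up, not from the real site):

```python
from bs4 import BeautifulSoup

# a made-up snippet, standing in for the fetched page
html = "<html><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() re-indents the tree, one tag per line
print(soup.prettify())
```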
This is the most important part: getting data. To begin, let's get the title, which occurs only once.
tit = naise.title
# <title>Beautiful Soup 4 - Web Scraping With Python</title>

tit = naise.title.string
# Beautiful Soup 4 - Web Scraping With Python

tit = naise.title.parent.name
# head
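The same lookups work on any document; here is a quick self-contained check with an invented title:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title)              # the whole <title> tag
print(soup.title.string)       # Demo
print(soup.title.parent.name)  # head
```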
Next, we get the links; there could be many.
dvd = naise.a              # one, the first
dvd = naise.find_all('a')  # many
Now let's look at each one of them.
for link in naise.find_all('a'):
    print(link.get('href'))

# update.php
# newrelease.php
# newentries.php
# wanted.php
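If you would rather collect the links than print them, a list comprehension works too; this sketch uses a stand-in snippet instead of the live page:

```python
from bs4 import BeautifulSoup

# stand-in markup echoing the output above
html = '<a href="update.php">U</a><a href="newrelease.php">N</a>'
soup = BeautifulSoup(html, "html.parser")

hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)  # ['update.php', 'newrelease.php']
```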
In the above example, you can also get the title attribute within an <a> tag.
for link in naise.find_all('a'):
    print(link.get('title'))
If you just want to see the text without any distractions:
# for links
for link in naise.find_all('a'):
    print(link.get_text())

# or the whole page
for link in naise:
    print(link.get_text())
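For the whole page you can also skip the loop and call get_text() on the soup itself; strip=True and a separator keep the result tidy. A self-contained sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p><p>World</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

text = soup.get_text(separator=" ", strip=True)
print(text)  # Hello World
```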
Let's say we want to get the text within a div.
<div class="id">sample-137</div>

# if we want to get everything
for link in naise.find_all("div", "id"):
    print(link)

# but if we just want 'sample-137'
for link in naise.find_all("div", "id"):
    print(link.get_text())
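To verify this without the live page, here is the same lookup run against the snippet itself; note that the second positional argument of find_all() matches the class attribute:

```python
from bs4 import BeautifulSoup

html = '<div class="id">sample-137</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all("div", "id") means: div tags whose class is "id"
for div in soup.find_all("div", "id"):
    print(div.get_text())  # sample-137
```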
Other than searching by tag name, we may also need to find an element by its id attribute.
dvd = naise.find(id="vid_102")
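find(id=...) matches the id attribute on any tag and returns the first hit, or None if nothing matches. A self-contained sketch with an invented snippet modelled on the vid_102 id above:

```python
from bs4 import BeautifulSoup

# invented markup; vid_102 mirrors the id used above
html = '<div id="vid_102"><a href="trailer.php">Trailer</a></div>'
soup = BeautifulSoup(html, "html.parser")

dvd = soup.find(id="vid_102")
print(dvd.get_text())     # Trailer
print(dvd.a.get("href"))  # trailer.php
```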