# Start with the usual imports
# We'll use these throughout
import pandas as pd
from bs4 import BeautifulSoup
import requests
Week 6B
Web Scraping
- Section 401
- Wednesday, October 11, 2023
Week 6 agenda: web scraping
Last time:
- Why web scraping?
- Getting familiar with the Web
- Web scraping: extracting data from static sites
Today:
- Practice with web scraping
- How to deal with dynamic content
Part 1: Web scraping exercises
For each of the exercises, use the Web Inspector to inspect the structure of the relevant web page, and identify the HTML content you will need to scrape with Python.
1. The number of votes for Cherelle Parker in the May mayoral primary
The relevant URL is: https://philadelphiaresults.azurewebsites.us/ResultsSW.aspx?type=MAY&map=CTY
Hint: We’re interested in just a single HTML element so you can inspect the website, identify the right element, and copy the selector for the element.
# Make the request
url = "https://philadelphiaresults.azurewebsites.us/ResultsSW.aspx?type=MAY&map=CTY"
response = requests.get(url)
# Initialize the soup for this page
soup1 = BeautifulSoup(response.content, "html.parser")
# Select the element holding the number of votes
selector = "#ctl01 > div.outside-wrapper > div:nth-child(16) > div:nth-child(1) > div.section.group.winner > div.col.display-results-box-f > h1"
# Select the element
element = soup1.select_one(selector)
# Get the text of the element
votes_str = element.text
votes_str
'81,080'
# Convert to integer!
num_votes = int(votes_str.replace(",", ""))
# Print
print(f"Number of votes for Cherelle Parker = {num_votes}")
Number of votes for Cherelle Parker = 81080
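Selectors copied from the Web Inspector, like the one above, encode the full path to the element and tend to break when the page layout shifts. Here is a minimal sketch of a shorter selector plus a guard for a missing element. It runs against a hand-written HTML fragment, which is an assumption that only mimics the class names in the copied selector, not the real page:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the results page (an assumption, not the real markup)
html = """
<div class="section group winner">
  <div class="col display-results-box-f">
    <h1>81,080</h1>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Rely on the class names rather than on nth-child positions
element = soup.select_one("div.winner div.display-results-box-f > h1")

# select_one() returns None when nothing matches, so check before using .text
if element is None:
    raise ValueError("Selector did not match; the page layout may have changed")

num_votes = int(element.text.strip().replace(",", ""))
print(num_votes)
```

Whether the shorter selector is unique on the live page would still need to be checked in the Web Inspector.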
2. How many millions of people are currently experiencing drought?
Relevant URL: https://www.drought.gov/current-conditions
Hint: We’re interested in just a single HTML element so you can inspect the website, identify the right element, and copy the selector for the element.
# Make the request
url = "https://www.drought.gov/current-conditions"
response = requests.get(url)
# Initialize the soup for this page
soup2 = BeautifulSoup(response.content, "html.parser")
selector = "#block-uswds-drought-content > div > div > div.grid-container.grid-container--standard.padding-top-6 > div:nth-child(5) > div > div:nth-child(3) > div > div > div.u--color--accent.text-center.font-sans-xl.field.field--name-field-number-stat.field--type-string.field--label-hidden"
soup2.select_one(selector).text
'95.8 Million'
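The scraped value is still a string. If a number is needed, the text can be split into its value and its unit. This is a small sketch, assuming the text always has the form "number unit"; the `parse_population` helper and its unit table are hypothetical, not part of the lecture code:

```python
from decimal import Decimal

# Hypothetical multipliers for the unit words we expect to see
UNITS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}


def parse_population(text):
    """Convert a string such as '95.8 Million' to an integer count."""
    value, unit = text.split()
    # Decimal avoids floating-point rounding when scaling the value
    return int(Decimal(value) * UNITS[unit.lower()])


print(parse_population("95.8 Million"))
```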
3. Scrape the Weitzman School directory
The Weitzman School lists its directory of people on this page: https://www.design.upenn.edu/people/list. From this site, let’s extract the following information for each person:
- name
- title
- associated department
This example is similar to the Inquirer Clean Plates demo from last lecture.
The info we want for each person is wrapped up in a <div> element. You can select all of those elements, loop over each one in a “for” loop, extract the three pieces of content we want from each <div>, and then save the result to a list.
# Make the request
url = "https://www.design.upenn.edu/people/list"
response = requests.get(url)
# Initialize the soup for this page
soup3 = BeautifulSoup(response.content, "html.parser")
# Select all rows
rows = soup3.select(".views-row")
len(rows)
536
# Extract out specific content from each row
print(rows[0].prettify())
<div class="views-row">
<a class="list-item profile-item" href="/graduate-admissions/people/anushka-samant">
<span class="text">
<span class="title heading-5">
Anushka Samant
</span>
<span class="meta body-small">
MCP '24
</span>
</span>
<span class="dept body-subhead">
Admissions
</span>
<span class="arrow">
<span aria-hidden="true" class="fa fa-arrow-right">
</span>
</span>
</a>
<div class="views-field views-field-edit-node-1 edit">
</div>
</div>
# Save content here
data = []
# Loop over all rows
for row in rows:
    # Person name
    person = row.select_one(".title").text
    # Title
    title = row.select_one(".meta").text
    # Department
    dept = row.select_one(".dept").text.strip()
    data.append({"person": person, "title": title, "dept": dept})
data = pd.DataFrame(data)
data
 | person | title | dept |
---|---|---|---|
0 | Anushka Samant | MCP '24 | Admissions |
1 | Tom Abel | External Faculty Collaborator | Center for Environmental Building & Design |
2 | Mostafa Akbari | PhD Candidate in Architecture | Architecture |
3 | Masoud Akbarzadeh | Assistant Professor of Architecture | Architecture |
4 | Scott Aker | Lecturer | Architecture |
... | ... | ... | ... |
531 | Dr. Hao Zheng | PhD Architecture Alum, 2023 | PhD Architecture |
532 | Yefan Zhi | PhD Student in Architecture | Architecture |
533 | Cynthia Zhou | Animator Researcher//Design and Fine Arts | Penn Animation as Research Lab |
534 | Jessica Zofchak | Lecturer | Architecture |
535 | Syd Zolf | Artist in Residence CPCW & GSWS, FNAR Lecturer | Fine Arts |
536 rows × 3 columns
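The loop above assumes every row contains all three elements. If a row lacked, say, a title, `select_one()` would return `None` and `.text` would raise an `AttributeError`. Here is a minimal sketch of a more defensive version. It uses a hand-written fragment that follows the structure printed earlier; the second row and the `get_text_or_none` helper are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the second row deliberately has no ".meta" element
html = """
<div class="views-row">
  <span class="title heading-5">Anushka Samant</span>
  <span class="meta body-small">MCP '24</span>
  <span class="dept body-subhead"> Admissions </span>
</div>
<div class="views-row">
  <span class="title heading-5">Example Person</span>
  <span class="dept body-subhead">Architecture</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def get_text_or_none(row, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = row.select_one(selector)
    return element.text.strip() if element is not None else None


records = [
    {
        "person": get_text_or_none(row, ".title"),
        "title": get_text_or_none(row, ".meta"),
        "dept": get_text_or_none(row, ".dept"),
    }
    for row in soup.select(".views-row")
]
print(records)
```

The resulting list of dictionaries can be passed to `pd.DataFrame()` exactly as before; missing values then show up as `None` instead of stopping the loop.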
Part 2: What about dynamic content?
How do you scrape data that is loaded via JavaScript or only appears after user interaction?
You’ll need a web browser installed to use selenium, e.g., Firefox, Google Chrome, Edge, etc.
Selenium
- Designed as a framework for testing webpages during development
- Provides an interface to interact with webpages just as a user would
- Becoming increasingly popular for web scraping dynamic content from pages
Two common use cases
Two cases when requests.get() won’t work:
- Many sites load content via JavaScript, so requests.get() won’t return that content. It doesn’t run any JavaScript on the page; it just returns the static HTML.
- If the site requires user interaction via buttons, dropdowns, etc., then GET requests won’t be able to show that information.
How do we solve them?
- Look on the “Network” tab for the external requests that the JavaScript is using to pull the data.
- Selenium to the rescue!
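For the first case, the request found on the “Network” tab often returns JSON, which can be used directly without parsing any HTML. This is a minimal sketch built on a made-up response body; the field names and structure are hypothetical, and a real endpoint will differ:

```python
import json

# Made-up body standing in for the text an API request might return
body = '{"results": [{"id": 1, "value": 10}, {"id": 2, "value": 20}]}'

# Decode the JSON text into Python dictionaries and lists
payload = json.loads(body)

# Pull out just the fields we care about
values_by_id = {item["id"]: item["value"] for item in payload["results"]}
print(values_by_id)
```

With a real endpoint, `requests.get(api_url).json()` performs the request and the decoding in one step, where `api_url` is whatever URL the “Network” tab reveals.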
Let’s try it out!
Example #1: Philly’s Property App
- URL: https://property.phila.gov/
- Let’s see if we can extract the property assessment values for my old apartment in Fairmount
Problem: We’ll need to enter an address, click a button, and THEN scrape the information.
1. Initialize the web driver
The initialization steps will depend on which browser you want to use!
# Import the webdriver from selenium
from selenium import webdriver
If you are working on Binder, you’ll need to use Firefox in “headless” mode, which prevents a browser window from opening.
If you are working locally, it’s better to run with the default options — you’ll be able to see the browser window open and change as we perform the web scraping.
Using Google Chrome
# UNCOMMENT BELOW TO USE CHROME
driver = webdriver.Chrome()
Using Firefox
If you are working on Binder, use the below code!
# UNCOMMENT BELOW IF ON BINDER
# options = webdriver.FirefoxOptions()
# IF ON BINDER, RUN IN "HEADLESS" MODE (NO BROWSER WINDOW IS OPENED)
# COMMENT THIS LINE IF WORKING LOCALLY
# options.add_argument("--headless")
# Initialize
# driver = webdriver.Firefox(options=options)
Using Microsoft Edge
# UNCOMMENT BELOW TO USE MICROSOFT EDGE
#driver = webdriver.Edge()
3. Select the address input element
After it loads, let’s take a look in the web inspector. We’ll need to identify the search bar so we can input our desired address.
It looks like the search bar has a class of "pvm-search-control-input".
# Use the dot syntax for class selectors
address_input_selector = ".pvm-search-control-input"
We can select it directly in selenium using the driver.find_element() function and telling selenium we are using a CSS selector to do the selection.
from selenium.webdriver.common.by import By
# Select the address input by the element's CSS selector
address_input = driver.find_element(By.CSS_SELECTOR, address_input_selector)
4. Type the address
Use the send_keys() function to enter the text into the input element:
# Input our example address
address_input.send_keys("1739 Wallace St #102")
6. Use BeautifulSoup to parse the results
- Use the page_source attribute to get the current HTML displayed on the page
- Initialize a “soup” object with the HTML
propertySoup = BeautifulSoup(driver.page_source, "html.parser")
From the web inspector, it looks like the table we want has an ID of "ownerProperties". Let’s select all of the elements with that ID:
# Use the # for ID selector
table_selector = "#ownerProperties"
tables = propertySoup.select(table_selector)
len(tables)
2
We selected by element ID, which should be unique but sometimes isn’t! Here we see that multiple tables all share the same ID on the page.
The table we want is the second table on the page with the ID “ownerProperties” — let’s select it.
tables[1]
<table class="stack" data-v-49670532="" id="ownerProperties" role="grid"><thead data-v-49670532=""><tr data-v-49670532=""><th class="" data-v-49670532="" title=""> Year </th><th class="" data-v-49670532="" title=""> Market Value </th><th class="" data-v-49670532="" title=""> Taxable Land </th><th class="" data-v-49670532="" title=""> Taxable Improvement </th><th class="" data-v-49670532="" title=""> Exempt Land </th><th class="" data-v-49670532="" title=""> Exempt Improvement </th></tr></thead><tbody data-v-49670532=""><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2024"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2024</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2024</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="291400"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$291,400</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$291,400</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="43700"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$43,700</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$43,700</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="247700"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$247,700</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$247,700</div><!-- 
--></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2023"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2023</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2023</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="291400"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$291,400</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$291,400</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="43700"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$43,700</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$43,700</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="247700"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b 
data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$247,700</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$247,700</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2022"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2022</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2022</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="264900"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$264,900</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$264,900</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="39735"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$39,735</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div 
data-v-6ed23da8="">$39,735</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="225165"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$225,165</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$225,165</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2021"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2021</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2021</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="264900"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$264,900</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$264,900</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="39735"><div data-v-2ca357c0="" data-v-6ed23da8=""><div 
data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$39,735</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$39,735</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="225165"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$225,165</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$225,165</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2020"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2020</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2020</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="264900"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$264,900</div></b><div data-v-6ed23da8=""><!-- --><div 
data-v-6ed23da8=""><div data-v-6ed23da8="">$264,900</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="39735"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$39,735</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$39,735</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="225165"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$225,165</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$225,165</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2019"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2019</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2019</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="264900"><div data-v-2ca357c0="" 
data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$264,900</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$264,900</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="39735"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$39,735</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$39,735</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="185165"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$185,165</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$185,165</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="40000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$40,000</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$40,000</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2018"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2018</div></b><div 
data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2018</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="240800"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$240,800</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$240,800</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="38528"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$38,528</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$38,528</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="172272"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$172,272</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$172,272</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="30000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$30,000</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$30,000</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" 
sorttable_customkey="2017"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2017</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2017</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="240800"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$240,800</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$240,800</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="38528"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$38,528</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$38,528</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="172272"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$172,272</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$172,272</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="30000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$30,000</div></b><div 
data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$30,000</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2016"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2016</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2016</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="209400"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$209,400</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$209,400</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="20940"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$20,940</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$20,940</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="158460"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$158,460</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$158,460</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" 
sorttable_customkey="30000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$30,000</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$30,000</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2015"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2015</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2015</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="209400"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$209,400</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$209,400</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="20940"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$20,940</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$20,940</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="158460"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$158,460</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$158,460</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: 
none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="30000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$30,000</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$30,000</div><!-- --></div></div></td></tr><!-- --></tbody></table>
table = tables[1]
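Indexing with `tables[1]` depends on the order of the tables on the page. An alternative sketch picks the table by its header text instead. The fragment below is hand-written (only the shared ID and the “Year” and “Market Value” headers come from the page), and the `find_table_with_header` helper is hypothetical:

```python
from bs4 import BeautifulSoup

# Two hand-written tables that share one ID, as on the real page
html = """
<table id="ownerProperties"><thead><tr><th> Owner </th></tr></thead></table>
<table id="ownerProperties"><thead><tr><th> Year </th><th> Market Value </th></tr></thead></table>
"""
soup = BeautifulSoup(html, "html.parser")


def find_table_with_header(tables, header):
    """Return the first table whose <th> cells include the given header text."""
    for candidate in tables:
        headers = [th.text.strip() for th in candidate.select("th")]
        if header in headers:
            return candidate
    return None


matched = find_table_with_header(soup.select("#ownerProperties"), "Market Value")
matched_headers = [th.text.strip() for th in matched.select("th")]
print(matched_headers)
```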
Select all the rows in the table (the <tr> elements):
# Select all row (<tr>) elements in the table tbody
rows = table.select("tbody tr")
Now, we can take a look at each <tr> element and extract each cell’s data from the <td> cells:
print(rows[0].prettify())
<tr class="" data-v-49670532="" data-v-6ed23da8="">
<td class="" data-v-6ed23da8="" sorttable_customkey="2024">
<div data-v-2ca357c0="" data-v-6ed23da8="">
<div data-v-2ca357c0="">
</div>
<!-- -->
</div>
<b data-v-6ed23da8="" style="display: none;">
<!-- -->
<div data-v-6ed23da8="">
2024
</div>
</b>
<div data-v-6ed23da8="">
<!-- -->
<div data-v-6ed23da8="">
<div data-v-6ed23da8="">
2024
</div>
<!-- -->
</div>
</div>
</td>
<td class="" data-v-6ed23da8="" sorttable_customkey="291400">
<div data-v-2ca357c0="" data-v-6ed23da8="">
<div data-v-2ca357c0="">
</div>
<!-- -->
</div>
<b data-v-6ed23da8="" style="display: none;">
<!-- -->
<div data-v-6ed23da8="">
$291,400
</div>
</b>
<div data-v-6ed23da8="">
<!-- -->
<div data-v-6ed23da8="">
<div data-v-6ed23da8="">
$291,400
</div>
<!-- -->
</div>
</div>
</td>
<td class="" data-v-6ed23da8="" sorttable_customkey="43700">
<div data-v-2ca357c0="" data-v-6ed23da8="">
<div data-v-2ca357c0="">
</div>
<!-- -->
</div>
<b data-v-6ed23da8="" style="display: none;">
<!-- -->
<div data-v-6ed23da8="">
$43,700
</div>
</b>
<div data-v-6ed23da8="">
<!-- -->
<div data-v-6ed23da8="">
<div data-v-6ed23da8="">
$43,700
</div>
<!-- -->
</div>
</div>
</td>
<td class="" data-v-6ed23da8="" sorttable_customkey="247700">
<div data-v-2ca357c0="" data-v-6ed23da8="">
<div data-v-2ca357c0="">
</div>
<!-- -->
</div>
<b data-v-6ed23da8="" style="display: none;">
<!-- -->
<div data-v-6ed23da8="">
$247,700
</div>
</b>
<div data-v-6ed23da8="">
<!-- -->
<div data-v-6ed23da8="">
<div data-v-6ed23da8="">
$247,700
</div>
<!-- -->
</div>
</div>
</td>
<td class="" data-v-6ed23da8="" sorttable_customkey="">
<div data-v-2ca357c0="" data-v-6ed23da8="">
<div data-v-2ca357c0="">
</div>
<!-- -->
</div>
<b data-v-6ed23da8="" style="display: none;">
<!-- -->
<div data-v-6ed23da8="">
$0
</div>
</b>
<div data-v-6ed23da8="">
<!-- -->
<div data-v-6ed23da8="">
<div data-v-6ed23da8="">
$0
</div>
<!-- -->
</div>
</div>
</td>
<td class="" data-v-6ed23da8="" sorttable_customkey="">
<div data-v-2ca357c0="" data-v-6ed23da8="">
<div data-v-2ca357c0="">
</div>
<!-- -->
</div>
<b data-v-6ed23da8="" style="display: none;">
<!-- -->
<div data-v-6ed23da8="">
$0
</div>
</b>
<div data-v-6ed23da8="">
<!-- -->
<div data-v-6ed23da8="">
<div data-v-6ed23da8="">
$0
</div>
<!-- -->
</div>
</div>
</td>
</tr>
It looks like the content for each cell is inside a <b> tag inside a <td> cell:
# Empty list to store scraped data
scraped_data = []
# Loop over all rows
for row in rows:
    # Create a list holding the text of each cell in this particular row
    row_data = [cell.text for cell in row.select("td b")]
    # Save it
    scraped_data.append(row_data)
scraped_data
[['2024', '$291,400', '$43,700', '$247,700', '$0', '$0'],
['2023', '$291,400', '$43,700', '$247,700', '$0', '$0'],
['2022', '$264,900', '$39,735', '$225,165', '$0', '$0'],
['2021', '$264,900', '$39,735', '$225,165', '$0', '$0'],
['2020', '$264,900', '$39,735', '$225,165', '$0', '$0'],
['2019', '$264,900', '$39,735', '$185,165', '$0', '$40,000'],
['2018', '$240,800', '$38,528', '$172,272', '$0', '$30,000'],
['2017', '$240,800', '$38,528', '$172,272', '$0', '$30,000'],
['2016', '$209,400', '$20,940', '$158,460', '$0', '$30,000'],
['2015', '$209,400', '$20,940', '$158,460', '$0', '$30,000']]
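Why select "td b" rather than just "td"? In the prettified row above, each value appears twice inside its <td>: once in the hidden <b> tag and once in the visible <div>. The sketch below illustrates this on a trimmed copy of a single cell (the data-v-* attributes, empty placeholder <div>, and comments are removed for brevity, so this is a simplified stand-in rather than the exact page HTML):

```python
from bs4 import BeautifulSoup

# Trimmed copy of one <td> cell from the row printed above (an assumption:
# only the parts relevant to the text content are kept)
cell_html = (
    '<table><tr><td sorttable_customkey="2024">'
    '<b style="display: none;"><div>2024</div></b>'
    "<div><div><div>2024</div></div></div>"
    "</td></tr></table>"
)
cell = BeautifulSoup(cell_html, "html.parser").select_one("td")

# Text of the whole cell: the value shows up twice
whole_cell_text = cell.text
print(whole_cell_text)  # 20242024

# Text of just the <b> tag: the value shows up once
bold_text = cell.select_one("b").text
print(bold_text)  # 2024
```

The sorttable_customkey attribute also carries the raw number, but in the output above it is empty for the $0 cells, so the <b> text is the more consistent thing to read.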
Make it into a DataFrame:
assessments = pd.DataFrame(
    scraped_data,
    columns=[
        "Year",
        "Market Value",
        "Taxable Land",
        "Taxable Improvement",
        "Exempt Land",
        "Exempt Improvement",
    ],
)
assessments
| | Year | Market Value | Taxable Land | Taxable Improvement | Exempt Land | Exempt Improvement |
|---|---|---|---|---|---|---|
| 0 | 2024 | $291,400 | $43,700 | $247,700 | $0 | $0 |
| 1 | 2023 | $291,400 | $43,700 | $247,700 | $0 | $0 |
| 2 | 2022 | $264,900 | $39,735 | $225,165 | $0 | $0 |
| 3 | 2021 | $264,900 | $39,735 | $225,165 | $0 | $0 |
| 4 | 2020 | $264,900 | $39,735 | $225,165 | $0 | $0 |
| 5 | 2019 | $264,900 | $39,735 | $185,165 | $0 | $40,000 |
| 6 | 2018 | $240,800 | $38,528 | $172,272 | $0 | $30,000 |
| 7 | 2017 | $240,800 | $38,528 | $172,272 | $0 | $30,000 |
| 8 | 2016 | $209,400 | $20,940 | $158,460 | $0 | $30,000 |
| 9 | 2015 | $209,400 | $20,940 | $158,460 | $0 | $30,000 |
Success!
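One caveat: every scraped value is still a string, so the dollar columns cannot be summed or plotted yet. A minimal sketch of one way to convert them is below, using the same replace-then-int approach as the vote count in Part 1. The helper name parse_dollars is hypothetical (it does not appear earlier in the lecture), and only two of the scraped rows are repeated here to keep the example self-contained.

```python
def parse_dollars(value):
    """Convert a string such as "$291,400" to the integer 291400."""
    return int(value.replace("$", "").replace(",", ""))


# Two rows copied from scraped_data above
scraped_data = [
    ["2024", "$291,400", "$43,700", "$247,700", "$0", "$0"],
    ["2019", "$264,900", "$39,735", "$185,165", "$0", "$40,000"],
]

# Year becomes an int; every other column is a dollar amount
numeric_data = [
    [int(row[0])] + [parse_dollars(value) for value in row[1:]]
    for row in scraped_data
]
print(numeric_data[0])  # [2024, 291400, 43700, 247700, 0, 0]

# Sanity check: the four land/improvement columns add up to the market value
for year, market_value, *components in numeric_data:
    assert sum(components) == market_value
```

The converted rows can be passed to pd.DataFrame() with the same column names as before.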
Example #2: Scraping the Philadelphia Municipal Courts portal
- URL: https://ujsportal.pacourts.us/CaseSearch
- Given a police incident number, we’ll check whether there is a court case associated with the incident
Problem: we’ll need to click several buttons before we can see the info we want!
Run the scraping analysis
Strategy:
- Rely on the Web Inspector to identify specific elements of the webpage
- Use Selenium to interact with the webpage
  - Change dropdown elements
  - Click buttons
1. Open the URL
# Open the URL
url = "https://ujsportal.pacourts.us/CaseSearch"
driver.get(url)
2. Create a dropdown “Select” element
We’ll need to:
- Select the dropdown element on the main page using a CSS selector based on its ID
- Initialize a selenium Select() object
# Use the Web Inspector to get the CSS selector of the dropdown select element
dropdown_selector = "#SearchBy-Control > select"
# Select the dropdown by the element's CSS selector
dropdown = driver.find_element(By.CSS_SELECTOR, dropdown_selector)
Create the Select object:
from selenium.webdriver.support.ui import Select
# Initialize a Select object
dropdown_select = Select(dropdown)
3. Change the selected text in the dropdown
Change the selected element: “Police Incident/Complaint Number”
# Set the selected text in the dropdown element
dropdown_select.select_by_visible_text("Incident Number")
4. Set the incident number
# Get the input element for the DC number
incident_input_selector = "#IncidentNumber-Control > input[type=text]"
incident_input = driver.find_element(By.CSS_SELECTOR, incident_input_selector)
# Clear any existing entry
incident_input.clear()
# Input our example incident number
incident_input.send_keys("1725088232")
6. Use BeautifulSoup to parse the results
- Use the page_source attribute to get the current HTML displayed on the page
- Initialize a “soup” object with the HTML
courtsSoup = BeautifulSoup(driver.page_source, "html.parser")
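Keep in mind that page_source is a snapshot of the page at the moment it is read. On a dynamic site, the results may not have been added to the page yet, depending on how quickly the server responds. The sketch below shows the general idea of polling until the content appears. The wait_until helper and the fake_page_source stand-in are hypothetical and written only for illustration. Selenium ships its own WebDriverWait class for this purpose, which is likely the better choice in practice.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for driver.page_source: the results grid only shows up on the
# third read, mimicking content that loads after the page first renders
snapshots = iter(
    ["<div></div>", "<div></div>", '<div id="caseSearchResultGrid"></div>']
)


def fake_page_source():
    return next(snapshots)


def results_html():
    """Return the page HTML once the results container is present, else None."""
    source = fake_page_source()
    return source if 'id="caseSearchResultGrid"' in source else None


html = wait_until(results_html, timeout=5, poll_interval=0.01)
print(html)  # <div id="caseSearchResultGrid"></div>
```

With a real driver, the condition would read driver.page_source instead of the stand-in function.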
Now we can:
- Identify the element holding all of the results
- Within this container, find the <table> element and each <tr> element within the table
# Select the results container by its ID
results_table = courtsSoup.select_one("#caseSearchResultGrid")
# Get all of the <tr> rows inside the tbody element
# NOTE: we are using nested selections here!
results_rows = results_table.select("tbody > tr")
Example: The number of court cases
# Number of court cases
number_of_cases = len(results_rows)
print(f"Number of court cases: {number_of_cases}")
Number of court cases: 2
Example: Extract the text elements from the first row of the results
first_row = results_rows[0]
print(first_row.prettify())
<tr class="slide-active">
<td class="display-none">
1
</td>
<td class="display-none">
0
</td>
<td>
MC-51-CR-0030672-2017
</td>
<td>
Common Pleas
</td>
<td>
Comm. v. Velquez, Victor
</td>
<td>
Closed
</td>
<td>
10/13/2017
</td>
<td>
Velquez, Victor
</td>
<td>
09/05/1974
</td>
<td>
Philadelphia
</td>
<td>
MC-01-51-Crim
</td>
<td>
U0981035
</td>
<td>
1725088232-0030672
</td>
<td>
1725088232
</td>
<td class="display-none">
</td>
<td class="display-none">
</td>
<td class="display-none">
</td>
<td class="display-none">
</td>
<td>
<div class="grid inline-block">
<div>
<div class="inline-block">
<a class="icon-wrapper" href="/Report/CpDocketSheet?docketNumber=MC-51-CR-0030672-2017&dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
<img alt="Docket Sheet" class="icon-size" src="https://ujsportal.pacourts.us/resource/Images/svg-defs.svg?v=qJ77ypOpzSMFk7r1gsI6H0xjdteha_ZIjvGslGgQV2M#icon-document-letter-D" title="Docket Sheet"/>
<label class="link-text">
Docket Sheet
</label>
</a>
</div>
</div>
</div>
<div class="grid inline-block">
<div>
<div class="inline-block">
<a class="icon-wrapper" href="/Report/CpCourtSummary?docketNumber=MC-51-CR-0030672-2017&dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
<img alt="Court Summary" class="icon-size" src="https://ujsportal.pacourts.us/resource/Images/svg-defs.svg?v=qJ77ypOpzSMFk7r1gsI6H0xjdteha_ZIjvGslGgQV2M#icon-court-summary" title="Court Summary"/>
<label class="link-text">
Court Summary
</label>
</a>
</div>
</div>
</div>
</td>
</tr>
# Extract all of the <td> cells from the first row
td_cells = first_row.select("td")
# Loop over each <td> cell
for cell in td_cells:
    # Extract the text from the <td> element
    text = cell.text
    # Print the text, skipping empty cells
    if text != "":
        print(text)
1
0
MC-51-CR-0030672-2017
Common Pleas
Comm. v. Velquez, Victor
Closed
10/13/2017
Velquez, Victor
09/05/1974
Philadelphia
MC-01-51-Crim
U0981035
1725088232-0030672
1725088232
Docket SheetCourt Summary
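Two things stand out in this output: the first two values (1 and 0) come from hidden cells with the class display-none, and the last cell runs the two link labels together. A possible refinement, sketched below on a trimmed copy of the first row, is to skip the hidden cells and pull the links out of the <a> tags separately. The shortened HTML (fewer cells, no dnh query parameter, no <img> tags) and the variable names are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# The page we scraped; used to turn relative links into absolute ones
BASE_URL = "https://ujsportal.pacourts.us/CaseSearch"

# Trimmed copy of the first results row printed above
row_html = """
<table><tbody><tr class="slide-active">
<td class="display-none">1</td>
<td class="display-none">0</td>
<td>MC-51-CR-0030672-2017</td>
<td>Common Pleas</td>
<td>Closed</td>
<td class="display-none"></td>
<td>
  <a class="icon-wrapper" href="/Report/CpDocketSheet?docketNumber=MC-51-CR-0030672-2017">
    <label class="link-text">Docket Sheet</label>
  </a>
  <a class="icon-wrapper" href="/Report/CpCourtSummary?docketNumber=MC-51-CR-0030672-2017">
    <label class="link-text">Court Summary</label>
  </a>
</td>
</tr></tbody></table>
"""
row = BeautifulSoup(row_html, "html.parser").select_one("tbody > tr")

# Keep the text of visible cells only, leaving the link cell for later
visible_text = []
for cell in row.select("td"):
    if "display-none" in cell.get("class", []):
        continue  # hidden cell
    if cell.select_one("a") is not None:
        continue  # link cell, handled below
    text = cell.get_text(strip=True)
    if text:
        visible_text.append(text)

# Map each link label to its absolute URL
links = {
    a.get_text(strip=True): urljoin(BASE_URL, a["href"])
    for a in row.select("a.icon-wrapper")
}

print(visible_text)
print(links["Docket Sheet"])
```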
7. Close the driver!
driver.close()
That’s it!
- Next week: Part 2 of “getting data” with APIs
- See you on Monday!