Geospatial Data Science in Python

Week 6B
Web Scraping

  • Section 401
  • Wednesday, October 11, 2023

Week 6 agenda: web scraping

Last time:

  • Why web scraping?
  • Getting familiar with the Web
  • Web scraping: extracting data from static sites

Today:

  • Practice with web scraping
  • How to deal with dynamic content

# Start with the usual imports
# We'll use these throughout
import pandas as pd
from bs4 import BeautifulSoup
import requests

Part 1: Web scraping exercises

For each of the exercises, use the Web Inspector to inspect the structure of the relevant web page, and identify the HTML content you will need to scrape with Python.

1. The number of votes for Cherelle Parker in the May mayoral primary

The relevant URL is: https://philadelphiaresults.azurewebsites.us/ResultsSW.aspx?type=MAY&map=CTY

Hint: We’re interested in just a single HTML element, so you can inspect the website, identify the right element, and copy its selector.

# Make the request
url = "https://philadelphiaresults.azurewebsites.us/ResultsSW.aspx?type=MAY&map=CTY"
response = requests.get(url)

# Initialize the soup for this page
soup1 = BeautifulSoup(response.content, "html.parser")
# Selector (copied from the Web Inspector) for the element holding the number of votes
selector = "#ctl01 > div.outside-wrapper > div:nth-child(16) > div:nth-child(1) > div.section.group.winner > div.col.display-results-box-f > h1"
# Select the element
element = soup1.select_one(selector)
# Get the text of the element
votes_str = element.text

votes_str
'81,080'
# Convert to integer!
num_votes = int(votes_str.replace(",", ""))

# Print
print(f"Number of votes for Cherelle Parker = {num_votes}")
Number of votes for Cherelle Parker = 81080

2. How many millions of people are currently experiencing drought?

Relevant URL: https://www.drought.gov/current-conditions

Hint: We’re interested in just a single HTML element, so you can inspect the website, identify the right element, and copy its selector.

# Make the request
url = "https://www.drought.gov/current-conditions"
response = requests.get(url)

# Initialize the soup for this page
soup2 = BeautifulSoup(response.content, "html.parser")
# Selector (copied from the Web Inspector) for the element holding the statistic
selector = "#block-uswds-drought-content > div > div > div.grid-container.grid-container--standard.padding-top-6 > div:nth-child(5) > div > div:nth-child(3) > div > div > div.u--color--accent.text-center.font-sans-xl.field.field--name-field-number-stat.field--type-string.field--label-hidden"
# Select the element and get its text
soup2.select_one(selector).text
'95.8 Million'
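
The scraped value is still a string, just like the vote count in the first exercise. As a small sketch (the `parse_count` helper and the list of suffixes it understands are assumptions for illustration, not part of the original exercise), the text could be converted to a plain number like this:

```python
# Hypothetical helper: convert scraped text such as "81,080" or "95.8 Million"
# into a number. The set of suffixes handled here is an assumption.
MULTIPLIERS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}


def parse_count(text):
    """Return the numeric value of a scraped count string as a float."""
    # Drop surrounding whitespace and thousands separators, then split
    # the number from an optional trailing word
    parts = text.strip().replace(",", "").split()
    value = float(parts[0])
    if len(parts) > 1:
        # Scale by the word that follows the number, e.g. "Million"
        value *= MULTIPLIERS[parts[1].lower()]
    return value


print(parse_count("95.8 Million"))
print(parse_count("81,080"))
```

An unrecognized suffix raises a KeyError here, which is arguably better than silently returning the wrong magnitude.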

3. Scrape the Weitzman School directory

The Weitzman School lists its directory of people on this page: https://www.design.upenn.edu/people/list. From this site, let’s extract the following information for each person:

  • name;
  • title; and
  • associated department.
Tip

This example is similar to the Inquirer Clean Plates demo from last lecture.

The info we want for each person is wrapped up in a <div> element. You can select all of those elements, loop over each one in a “for” loop, extract the three pieces of content we want from each <div>, and then save the result to a list.

# Make the request
url = "https://www.design.upenn.edu/people/list"
response = requests.get(url)

# Initialize the soup for this page
soup3 = BeautifulSoup(response.content, "html.parser")
# Select all rows
rows = soup3.select(".views-row")
len(rows)
536
# Inspect the HTML of the first row to see its structure
print(rows[0].prettify())
<div class="views-row">
 <a class="list-item profile-item" href="/graduate-admissions/people/anushka-samant">
  <span class="text">
   <span class="title heading-5">
    Anushka Samant
   </span>
   <span class="meta body-small">
    MCP '24
   </span>
  </span>
  <span class="dept body-subhead">
   Admissions
  </span>
  <span class="arrow">
   <span aria-hidden="true" class="fa fa-arrow-right">
   </span>
  </span>
 </a>
 <div class="views-field views-field-edit-node-1 edit">
 </div>
</div>
# Save content here
data = []

# Loop over all rows
for row in rows:
    # Person name
    person = row.select_one(".title").text

    # Title
    title = row.select_one(".meta").text

    # Department
    dept = row.select_one(".dept").text.strip()

    data.append({"person": person, "title": title, "dept": dept})


data = pd.DataFrame(data)

data
person title dept
0 Anushka Samant MCP '24 Admissions
1 Tom Abel External Faculty Collaborator Center for Environmental Building & Design
2 Mostafa Akbari PhD Candidate in Architecture Architecture
3 Masoud Akbarzadeh Assistant Professor of Architecture Architecture
4 Scott Aker Lecturer Architecture
... ... ... ...
531 Dr. Hao Zheng PhD Architecture Alum, 2023 PhD Architecture
532 Yefan Zhi PhD Student in Architecture Architecture
533 Cynthia Zhou Animator Researcher//Design and Fine Arts Penn Animation as Research Lab
534 Jessica Zofchak Lecturer Architecture
535 Syd Zolf Artist in Residence CPCW & GSWS, FNAR Lecturer Fine Arts

536 rows × 3 columns
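
The loop above assumes every row contains all three elements. `select_one()` returns None when nothing matches, so calling `.text` on the result would raise an AttributeError for a row with a missing field. Below is a minimal sketch of a more defensive version, run on an inline snippet that mirrors the row structure printed above. The second row and the `text_or_none` helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the directory page. The first row mirrors the real
# structure; the second row is invented to show a missing ".meta" field.
html = """
<div class="views-row">
  <a class="list-item profile-item" href="/graduate-admissions/people/anushka-samant">
    <span class="text">
      <span class="title heading-5">Anushka Samant</span>
      <span class="meta body-small">MCP '24</span>
    </span>
    <span class="dept body-subhead">Admissions</span>
  </a>
</div>
<div class="views-row">
  <a class="list-item profile-item" href="/people/example-person">
    <span class="text">
      <span class="title heading-5">Example Person</span>
    </span>
    <span class="dept body-subhead">Architecture</span>
  </a>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(html, "html.parser")

# Build one record per row; missing fields become None instead of raising
records = [
    {
        "person": text_or_none(row, ".title"),
        "title": text_or_none(row, ".meta"),
        "dept": text_or_none(row, ".dept"),
    }
    for row in soup.select(".views-row")
]
print(records)
```

Passing `records` to `pd.DataFrame()` works the same way as before; the None values show up as missing entries.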

Part 2: What about dynamic content?

How do you scrape data that is loaded via JavaScript or only appears after user interaction?

Note: web browser needed

You’ll need a web browser installed to use Selenium, e.g., Firefox, Google Chrome, Edge, etc.

Selenium

  • Designed as a framework for testing webpages during development
  • Provides an interface to interact with webpages just as a user would
  • Becoming increasingly popular for web scraping dynamic content from pages

Two common use cases

Two cases when requests.get() won’t work:

  1. Many sites load content via JavaScript. requests.get() doesn’t run any JavaScript on the page; it just returns the static HTML, so the content you want may be missing from the response.
  2. If the site requires user interaction via buttons, dropdowns, etc., a GET request on its own won’t return the information that only appears after that interaction.

How do we solve them?

  1. Look on the “Network” tab of the web inspector for the external requests that the JavaScript is using to pull the data, and make those requests yourself.
  2. Selenium to the rescue!
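
For the first option, the request found on the “Network” tab often returns JSON, which is much easier to work with than HTML. The sketch below is only illustrative: the payload shape and field names are invented, and `api_url` in the comment is a hypothetical placeholder for whatever URL the “Network” tab shows.

```python
import json

# Invented stand-in for a response body copied from the "Network" tab.
# A real endpoint's URL, structure, and field names will differ.
payload = """
{"rows": [
  {"year": 2024, "market_value": 291400},
  {"year": 2023, "market_value": 291400}
]}
"""

# With a live endpoint, this step would look something like:
#   response = requests.get(api_url)   # api_url: hypothetical placeholder
#   data = response.json()
data = json.loads(payload)

# Pull out just the fields we care about
values = {row["year"]: row["market_value"] for row in data["rows"]}
print(values)
```

When this works, no browser automation is needed at all, which is why it is worth checking the “Network” tab before reaching for Selenium.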

Let’s try it out!

Example #1: Philly’s Property App

  • URL: https://property.phila.gov/
  • Let’s see if we can extract the property assessment values for my old apartment in Fairmount

Problem: We’ll need to enter an address, click a button, and THEN scrape the information.

1. Initialize the web driver

The initialization steps will depend on which browser you want to use!

# Import the webdriver from selenium
from selenium import webdriver
Important: Working on Binder

If you are working on Binder, you’ll need to use Firefox in “headless” mode, which prevents a browser window from opening.

If you are working locally, it’s better to run with the default options — you’ll be able to see the browser window open and change as we perform the web scraping.

Using Google Chrome
# USING CHROME (COMMENT OUT THE LINE BELOW IF USING A DIFFERENT BROWSER)

driver = webdriver.Chrome()
Using Firefox

If you are working on Binder, use the code below!

# UNCOMMENT BELOW TO USE FIREFOX (REQUIRED ON BINDER)

# options = webdriver.FirefoxOptions()

# IF ON BINDER, RUN IN "HEADLESS" MODE (NO BROWSER WINDOW IS OPENED)
# COMMENT THIS LINE IF WORKING LOCALLY
# options.add_argument("--headless")

# Initialize
# driver = webdriver.Firefox(options=options)
Using Microsoft Edge
# UNCOMMENT BELOW TO USE MICROSOFT EDGE

# driver = webdriver.Edge()

2. Navigate to the URL

url = "https://property.phila.gov/"
driver.get(url)

3. Select the address input element

After it loads, let’s take a look in the web inspector. We’ll need to identify the search bar so we can input our desired address.

It looks like the search bar has a class of "pvm-search-control-input".

# Use the dot syntax for class selectors
address_input_selector = ".pvm-search-control-input"

We can select it directly in Selenium using the driver.find_element() method, telling Selenium that we are using a CSS selector to do the selection.

from selenium.webdriver.common.by import By
# Select the address input by the element's CSS selector
address_input = driver.find_element(By.CSS_SELECTOR, address_input_selector)

4. Type the address

Use the send_keys() function to enter the text into the input element:

# Input our example address
address_input.send_keys("1739 Wallace St #102")

5. Select the search button and click it

Using the web inspector, it looks like both the clear and search buttons have the class of "pvm-search-control-button". Let’s select both, choose the second button, and click it.

button_selector = ".pvm-search-control-button"

# Get both button elements (clear and search)
buttons = driver.find_elements(By.CSS_SELECTOR, button_selector)
# Unpack them: the first is the clear button, the second is the search button
delete_button, search_button = buttons
# Click it
search_button.click()
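
After the click, the results may take a moment to load, so reading the page immediately can miss them (how long depends on the site and your connection). One way to handle this is to poll until the expected element shows up. The sketch below is a library-agnostic version of that idea: `wait_for_elements` and `FakeDriver` are hypothetical names, the `"#results"` selector is a placeholder, and the fake driver exists only so the example runs without a browser.

```python
import time


def wait_for_elements(driver, by, selector, timeout=10.0, poll=0.25):
    """Poll driver.find_elements() until it returns a match or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements(by, selector)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No element matched {selector!r} within {timeout}s")
        # Wait a little before checking again
        time.sleep(poll)


class FakeDriver:
    """Stand-in for a Selenium driver so the sketch runs without a browser."""

    def __init__(self, ready_after):
        self.calls = 0
        self.ready_after = ready_after

    def find_elements(self, by, selector):
        self.calls += 1
        # Pretend the results only appear after a few polls
        return ["<table>"] if self.calls >= self.ready_after else []


fake = FakeDriver(ready_after=3)
result = wait_for_elements(fake, "css selector", "#results", poll=0.01)
print(result, fake.calls)
```

With the real driver, the call would be `wait_for_elements(driver, By.CSS_SELECTOR, selector)` with the selector of the results you expect. Selenium also ships its own explicit-wait utility, `WebDriverWait` in `selenium.webdriver.support.ui`, which is the more usual choice in practice.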

6. Use BeautifulSoup to parse the results

  • Use the page_source attribute to get the current HTML displayed on the page
  • Initialize a “soup” object with the HTML
propertySoup = BeautifulSoup(driver.page_source, "html.parser")

From the web inspector, it looks like the table we want has an ID of "ownerProperties". Let’s select all of the elements with that ID:

# Use the # for ID selector
table_selector = "#ownerProperties"

tables = propertySoup.select(table_selector)
len(tables)
2
Watch out!

We selected by element ID, which should be unique but sometimes isn’t! Here we see that multiple tables all share the same ID on the page.

The table we want is the second table on the page with the ID “ownerProperties” — let’s select it.

tables[1]
<table class="stack" data-v-49670532="" id="ownerProperties" role="grid"><thead data-v-49670532=""><tr data-v-49670532=""><th class="" data-v-49670532="" title=""> Year </th><th class="" data-v-49670532="" title=""> Market Value </th><th class="" data-v-49670532="" title=""> Taxable Land </th><th class="" data-v-49670532="" title=""> Taxable Improvement </th><th class="" data-v-49670532="" title=""> Exempt Land </th><th class="" data-v-49670532="" title=""> Exempt Improvement </th></tr></thead><tbody data-v-49670532=""><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2024"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2024</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2024</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="291400"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$291,400</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$291,400</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="43700"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$43,700</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$43,700</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="247700"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$247,700</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$247,700</div><!-- 
--></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2023"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2023</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2023</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="291400"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$291,400</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$291,400</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="43700"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$43,700</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$43,700</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="247700"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b 
data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$247,700</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$247,700</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2022"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2022</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2022</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="264900"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$264,900</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$264,900</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="39735"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$39,735</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div 
data-v-6ed23da8="">$39,735</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="225165"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$225,165</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$225,165</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2021"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2021</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2021</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="264900"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$264,900</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$264,900</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="39735"><div data-v-2ca357c0="" data-v-6ed23da8=""><div 
data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$39,735</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$39,735</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="225165"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$225,165</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$225,165</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2020"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2020</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2020</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="264900"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$264,900</div></b><div data-v-6ed23da8=""><!-- --><div 
data-v-6ed23da8=""><div data-v-6ed23da8="">$264,900</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="39735"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$39,735</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$39,735</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="225165"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$225,165</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$225,165</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2019"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2019</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2019</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="264900"><div data-v-2ca357c0="" 
data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$264,900</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$264,900</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="39735"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$39,735</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$39,735</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="185165"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$185,165</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$185,165</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="40000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$40,000</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$40,000</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2018"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2018</div></b><div 
data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2018</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="240800"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$240,800</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$240,800</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="38528"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$38,528</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$38,528</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="172272"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$172,272</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$172,272</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="30000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$30,000</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$30,000</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" 
sorttable_customkey="2017"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2017</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2017</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="240800"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$240,800</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$240,800</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="38528"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$38,528</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$38,528</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="172272"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$172,272</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$172,272</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="30000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$30,000</div></b><div 
data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$30,000</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2016"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2016</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2016</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="209400"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$209,400</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$209,400</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="20940"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$20,940</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$20,940</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="158460"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$158,460</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$158,460</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" 
sorttable_customkey="30000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$30,000</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$30,000</div><!-- --></div></div></td></tr><tr class="" data-v-49670532="" data-v-6ed23da8=""><td class="" data-v-6ed23da8="" sorttable_customkey="2015"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">2015</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">2015</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="209400"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$209,400</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$209,400</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="20940"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$20,940</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$20,940</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="158460"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$158,460</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$158,460</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey=""><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: 
none;"><!-- --><div data-v-6ed23da8="">$0</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$0</div><!-- --></div></div></td><td class="" data-v-6ed23da8="" sorttable_customkey="30000"><div data-v-2ca357c0="" data-v-6ed23da8=""><div data-v-2ca357c0=""></div><!-- --></div><b data-v-6ed23da8="" style="display: none;"><!-- --><div data-v-6ed23da8="">$30,000</div></b><div data-v-6ed23da8=""><!-- --><div data-v-6ed23da8=""><div data-v-6ed23da8="">$30,000</div><!-- --></div></div></td></tr><!-- --></tbody></table>
table = tables[1]

Select all the rows in the table (the <tr> elements):

# Select all row (<tr>) elements in the table tbody
rows = table.select("tbody tr")

Now we can look at each <tr> element and extract the data from each of its <td> cells:

print(rows[0].prettify())
<tr class="" data-v-49670532="" data-v-6ed23da8="">
 <td class="" data-v-6ed23da8="" sorttable_customkey="2024">
  <div data-v-2ca357c0="" data-v-6ed23da8="">
   <div data-v-2ca357c0="">
   </div>
   <!-- -->
  </div>
  <b data-v-6ed23da8="" style="display: none;">
   <!-- -->
   <div data-v-6ed23da8="">
    2024
   </div>
  </b>
  <div data-v-6ed23da8="">
   <!-- -->
   <div data-v-6ed23da8="">
    <div data-v-6ed23da8="">
     2024
    </div>
    <!-- -->
   </div>
  </div>
 </td>
 <td class="" data-v-6ed23da8="" sorttable_customkey="291400">
  <div data-v-2ca357c0="" data-v-6ed23da8="">
   <div data-v-2ca357c0="">
   </div>
   <!-- -->
  </div>
  <b data-v-6ed23da8="" style="display: none;">
   <!-- -->
   <div data-v-6ed23da8="">
    $291,400
   </div>
  </b>
  <div data-v-6ed23da8="">
   <!-- -->
   <div data-v-6ed23da8="">
    <div data-v-6ed23da8="">
     $291,400
    </div>
    <!-- -->
   </div>
  </div>
 </td>
 <td class="" data-v-6ed23da8="" sorttable_customkey="43700">
  <div data-v-2ca357c0="" data-v-6ed23da8="">
   <div data-v-2ca357c0="">
   </div>
   <!-- -->
  </div>
  <b data-v-6ed23da8="" style="display: none;">
   <!-- -->
   <div data-v-6ed23da8="">
    $43,700
   </div>
  </b>
  <div data-v-6ed23da8="">
   <!-- -->
   <div data-v-6ed23da8="">
    <div data-v-6ed23da8="">
     $43,700
    </div>
    <!-- -->
   </div>
  </div>
 </td>
 <td class="" data-v-6ed23da8="" sorttable_customkey="247700">
  <div data-v-2ca357c0="" data-v-6ed23da8="">
   <div data-v-2ca357c0="">
   </div>
   <!-- -->
  </div>
  <b data-v-6ed23da8="" style="display: none;">
   <!-- -->
   <div data-v-6ed23da8="">
    $247,700
   </div>
  </b>
  <div data-v-6ed23da8="">
   <!-- -->
   <div data-v-6ed23da8="">
    <div data-v-6ed23da8="">
     $247,700
    </div>
    <!-- -->
   </div>
  </div>
 </td>
 <td class="" data-v-6ed23da8="" sorttable_customkey="">
  <div data-v-2ca357c0="" data-v-6ed23da8="">
   <div data-v-2ca357c0="">
   </div>
   <!-- -->
  </div>
  <b data-v-6ed23da8="" style="display: none;">
   <!-- -->
   <div data-v-6ed23da8="">
    $0
   </div>
  </b>
  <div data-v-6ed23da8="">
   <!-- -->
   <div data-v-6ed23da8="">
    <div data-v-6ed23da8="">
     $0
    </div>
    <!-- -->
   </div>
  </div>
 </td>
 <td class="" data-v-6ed23da8="" sorttable_customkey="">
  <div data-v-2ca357c0="" data-v-6ed23da8="">
   <div data-v-2ca357c0="">
   </div>
   <!-- -->
  </div>
  <b data-v-6ed23da8="" style="display: none;">
   <!-- -->
   <div data-v-6ed23da8="">
    $0
   </div>
  </b>
  <div data-v-6ed23da8="">
   <!-- -->
   <div data-v-6ed23da8="">
    <div data-v-6ed23da8="">
     $0
    </div>
    <!-- -->
   </div>
  </div>
 </td>
</tr>

It looks like the content for each cell is inside a <b> tag inside a <td> cell:

# Empty list to store scraped data
scraped_data = []

# Loop over all rows
for row in rows:
    # Collect the text of the <b> tag inside each <td> cell in this row
    row_data = [cell.text for cell in row.select("td b")]

    # Save it
    scraped_data.append(row_data)

scraped_data
[['2024', '$291,400', '$43,700', '$247,700', '$0', '$0'],
 ['2023', '$291,400', '$43,700', '$247,700', '$0', '$0'],
 ['2022', '$264,900', '$39,735', '$225,165', '$0', '$0'],
 ['2021', '$264,900', '$39,735', '$225,165', '$0', '$0'],
 ['2020', '$264,900', '$39,735', '$225,165', '$0', '$0'],
 ['2019', '$264,900', '$39,735', '$185,165', '$0', '$40,000'],
 ['2018', '$240,800', '$38,528', '$172,272', '$0', '$30,000'],
 ['2017', '$240,800', '$38,528', '$172,272', '$0', '$30,000'],
 ['2016', '$209,400', '$20,940', '$158,460', '$0', '$30,000'],
 ['2015', '$209,400', '$20,940', '$158,460', '$0', '$30,000']]
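
Why select the <b> tag rather than the whole <td>? In the markup above, each value appears twice inside a cell: once in the hidden <b> tag and once in the visible nested <div>. The sketch below uses a hand-trimmed copy of one row (an assumption for illustration, not the live page) to compare the two selections.

```python
from bs4 import BeautifulSoup

# A trimmed, hand-written copy of one table row (illustrative only)
row_html = (
    "<table><tbody><tr>"
    '<td><b style="display: none;"><div>2024</div></b>'
    "<div><div>2024</div></div></td>"
    '<td><b style="display: none;"><div>$291,400</div></b>'
    "<div><div>$291,400</div></div></td>"
    "</tr></tbody></table>"
)

row = BeautifulSoup(row_html, "html.parser").select_one("tr")

# Text of the <b> tag only: one copy of each value
from_b = [cell.text for cell in row.select("td b")]

# Text of the whole <td>: the hidden and visible copies run together
from_td = [cell.text for cell in row.select("td")]

print(from_b)
print(from_td)
```

Selecting "td b" keeps exactly one copy of each value, which is why the loop above uses it.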

Make it into a DataFrame:

assessments = pd.DataFrame(
    scraped_data,
    columns=[
        "Year",
        "Market Value",
        "Taxable Land",
        "Taxable Improvement",
        "Exempt Land",
        "Exempt Improvement",
    ],
)

assessments
   Year  Market Value  Taxable Land  Taxable Improvement  Exempt Land  Exempt Improvement
0  2024      $291,400       $43,700             $247,700           $0                  $0
1  2023      $291,400       $43,700             $247,700           $0                  $0
2  2022      $264,900       $39,735             $225,165           $0                  $0
3  2021      $264,900       $39,735             $225,165           $0                  $0
4  2020      $264,900       $39,735             $225,165           $0                  $0
5  2019      $264,900       $39,735             $185,165           $0             $40,000
6  2018      $240,800       $38,528             $172,272           $0             $30,000
7  2017      $240,800       $38,528             $172,272           $0             $30,000
8  2016      $209,400       $20,940             $158,460           $0             $30,000
9  2015      $209,400       $20,940             $158,460           $0             $30,000

Success!
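
The scraped values are still strings such as "$291,400", so they cannot be summed or plotted directly. As in the vote-count exercise from Part 1, the formatting characters can be stripped before converting to integers. This is a minimal sketch on two rows and three columns copied from the output above; the helper name dollars_to_int is made up for this example.

```python
import pandas as pd

# Two rows and three columns copied from the scraped output above
sample = pd.DataFrame(
    [["2024", "$291,400", "$43,700"], ["2019", "$264,900", "$39,735"]],
    columns=["Year", "Market Value", "Taxable Land"],
)


def dollars_to_int(series):
    """Strip "$" and "," from a column of strings and convert to integers."""
    return series.str.replace(r"[$,]", "", regex=True).astype(int)


cleaned = sample.copy()
cleaned["Year"] = cleaned["Year"].astype(int)
for col in ["Market Value", "Taxable Land"]:
    cleaned[col] = dollars_to_int(cleaned[col])

print(cleaned.dtypes)
```

The same loop could be run over all five dollar columns of the full assessments table.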

Example #2: Scraping the Philadelphia Municipal Courts portal

  • URL: https://ujsportal.pacourts.us/CaseSearch
  • Given a Police incident number, we’ll see if there is an associated court case with the incident

Problem: we’ll need to click several buttons before we can see the info we want!

Run the scraping analysis

Strategy:

  • Rely on the Web Inspector to identify specific elements of the webpage
  • Use Selenium to interact with the webpage
    • Change dropdown elements
    • Click buttons

1. Open the URL

# Open the URL
url = "https://ujsportal.pacourts.us/CaseSearch"
driver.get(url)

2. Create a dropdown “Select” element

We’ll need to:

  • Select the dropdown element on the main page by its CSS selector
  • Initialize a Selenium Select() object

# Use the Web Inspector to get the css selector of the dropdown select element
dropdown_selector = "#SearchBy-Control > select" 
# Select the dropdown by the element's CSS selector
dropdown = driver.find_element(By.CSS_SELECTOR, dropdown_selector)

Create the Select object:

from selenium.webdriver.support.ui import Select

# Initialize a Select object
dropdown_select = Select(dropdown)

3. Change the selected text in the dropdown

Change the selected option to “Incident Number”, the dropdown entry for searching by police incident number:

# Set the selected text in the dropdown element
dropdown_select.select_by_visible_text("Incident Number")

4. Set the incident number

# Get the input element for the DC number
incident_input_selector = "#IncidentNumber-Control > input[type=text]" 
incident_input = driver.find_element(By.CSS_SELECTOR, incident_input_selector)
# Clear any existing entry
incident_input.clear()

# Input our example incident number
incident_input.send_keys("1725088232")

5. Click the search button!

# Submit the search
search_button_selector = "#btnSearch"
search_button = driver.find_element(By.CSS_SELECTOR, search_button_selector)
search_button.click()
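
One caveat: the results may take a moment to appear after the click, so reading the page immediately can miss them. Selenium ships WebDriverWait (in selenium.webdriver.support.ui) for this purpose. The sketch below only illustrates the underlying polling idea in plain Python so it can run without a browser; wait_for and the stand-in condition are hypothetical names, not part of Selenium.

```python
import time


def wait_for(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in condition that only succeeds on the third call,
# mimicking a page whose results load after a short delay
attempts = []


def ready_on_third_call():
    attempts.append(1)
    return "ready" if len(attempts) >= 3 else None


result = wait_for(ready_on_third_call, timeout=2.0, poll_interval=0.01)
print(result, len(attempts))

# With the lecture's driver, the condition could be (not executed here):
# wait_for(lambda: driver.find_elements(By.CSS_SELECTOR, "#caseSearchResultGrid tbody > tr"))
```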

6. Use BeautifulSoup to parse the results

  • Use the page_source attribute to get the current HTML displayed on the page
  • Initialize a “soup” object with the HTML

courtsSoup = BeautifulSoup(driver.page_source, "html.parser")

Now we can:

  • Identify the element holding all of the results
  • Within this container, find the <table> element and each <tr> element within the table

# Select the results container by its ID
results_table = courtsSoup.select_one("#caseSearchResultGrid")
# Get all of the <tr> rows inside the tbody element
# NOTE: we are using nested selections here!
results_rows = results_table.select("tbody > tr")

Example: The number of court cases

# Number of court cases
number_of_cases = len(results_rows)
print(f"Number of court cases: {number_of_cases}")
Number of court cases: 2

Example: Extract the text elements from the first row of the results

first_row = results_rows[0]
print(first_row.prettify())
<tr class="slide-active">
 <td class="display-none">
  1
 </td>
 <td class="display-none">
  0
 </td>
 <td>
  MC-51-CR-0030672-2017
 </td>
 <td>
  Common Pleas
 </td>
 <td>
  Comm. v. Velquez, Victor
 </td>
 <td>
  Closed
 </td>
 <td>
  10/13/2017
 </td>
 <td>
  Velquez, Victor
 </td>
 <td>
  09/05/1974
 </td>
 <td>
  Philadelphia
 </td>
 <td>
  MC-01-51-Crim
 </td>
 <td>
  U0981035
 </td>
 <td>
  1725088232-0030672
 </td>
 <td>
  1725088232
 </td>
 <td class="display-none">
 </td>
 <td class="display-none">
 </td>
 <td class="display-none">
 </td>
 <td class="display-none">
 </td>
 <td>
  <div class="grid inline-block">
   <div>
    <div class="inline-block">
     <a class="icon-wrapper" href="/Report/CpDocketSheet?docketNumber=MC-51-CR-0030672-2017&amp;dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
      <img alt="Docket Sheet" class="icon-size" src="https://ujsportal.pacourts.us/resource/Images/svg-defs.svg?v=qJ77ypOpzSMFk7r1gsI6H0xjdteha_ZIjvGslGgQV2M#icon-document-letter-D" title="Docket Sheet"/>
      <label class="link-text">
       Docket Sheet
      </label>
     </a>
    </div>
   </div>
  </div>
  <div class="grid inline-block">
   <div>
    <div class="inline-block">
     <a class="icon-wrapper" href="/Report/CpCourtSummary?docketNumber=MC-51-CR-0030672-2017&amp;dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
      <img alt="Court Summary" class="icon-size" src="https://ujsportal.pacourts.us/resource/Images/svg-defs.svg?v=qJ77ypOpzSMFk7r1gsI6H0xjdteha_ZIjvGslGgQV2M#icon-court-summary" title="Court Summary"/>
      <label class="link-text">
       Court Summary
      </label>
     </a>
    </div>
   </div>
  </div>
 </td>
</tr>
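
The last cell of the row holds links rather than plain text, and the href values are relative paths. A minimal sketch of pulling them out, using a trimmed copy of that cell and urljoin from the standard library to build absolute URLs. Treating the search page URL as the base is an assumption made for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base: the search page URL used earlier
base_url = "https://ujsportal.pacourts.us/CaseSearch"

# Trimmed copy of the final <td> cell from the row printed above
cell_html = """
<td>
  <a class="icon-wrapper" href="/Report/CpDocketSheet?docketNumber=MC-51-CR-0030672-2017&amp;dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
    <label class="link-text">Docket Sheet</label>
  </a>
  <a class="icon-wrapper" href="/Report/CpCourtSummary?docketNumber=MC-51-CR-0030672-2017&amp;dnh=%2FGgePQykMpAymRENgxLBzg%3D%3D" target="_blank">
    <label class="link-text">Court Summary</label>
  </a>
</td>
"""

cell = BeautifulSoup(cell_html, "html.parser")

# Map each link label to its absolute URL
links = {
    a.get_text(strip=True): urljoin(base_url, a["href"])
    for a in cell.select("a.icon-wrapper")
}

for label, link in links.items():
    print(label, "->", link)
```
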
# Extract out all of the "<td>" cells from the first row
td_cells = first_row.select("td")

# Loop over each <td> cell
for cell in td_cells:
    
    # Extract out the text from the <td> element
    text = cell.text
    
    # Print out text
    if text != "":
        print(text)
1
0
MC-51-CR-0030672-2017
Common Pleas
Comm. v. Velquez, Victor
Closed
10/13/2017
Velquez, Victor
09/05/1974
Philadelphia
MC-01-51-Crim
U0981035
1725088232-0030672
1725088232
Docket SheetCourt Summary
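
The same idea extends from one row to every row, giving a DataFrame of cases. This is a sketch on a trimmed, hand-written copy of the results table with only a few of the cells shown above; the column names are hypothetical, since the table's header row is not shown here. Cells with the display-none class are skipped.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed copy of the results table with a subset of the cells shown above
html = """
<table id="caseSearchResultGrid"><tbody>
<tr>
  <td class="display-none">1</td>
  <td>MC-51-CR-0030672-2017</td>
  <td>Common Pleas</td>
  <td>Closed</td>
  <td>10/13/2017</td>
  <td class="display-none"></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# One list of cell texts per row, ignoring hidden (display-none) cells
records = [
    [
        cell.get_text(strip=True)
        for cell in row.select("td")
        if "display-none" not in cell.get("class", [])
    ]
    for row in soup.select("#caseSearchResultGrid tbody > tr")
]

# Hypothetical column names chosen for this example
cases = pd.DataFrame(
    records, columns=["docket_number", "court_type", "status", "filing_date"]
)
cases["filing_date"] = pd.to_datetime(cases["filing_date"], format="%m/%d/%Y")

print(cases)
```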

7. Close the driver!

driver.close()
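
Note that close() closes the current browser window, while quit() ends the whole driver session. If the scraping code raises an error partway through, neither call is reached and the browser stays open. One way to guard against that is a small context manager, sketched here with a stand-in driver so it runs without a browser; managed_driver and FakeDriver are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Yield a driver and always call quit() on it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in used only to demonstrate the pattern without a browser."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


# Even when the body fails, quit() still runs
try:
    with managed_driver(FakeDriver) as fake:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(fake.quit_called)

# With a real browser this could be used as (not executed here):
# with managed_driver(webdriver.Chrome) as driver:
#     driver.get(url)
```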

That’s it!

  • Next week: Part 2 of “getting data” with APIs
  • See you on Monday!
Content 2023 by Nick Hand, Quarto layout adapted from Andrew Heiss’s Data Visualization with R course
All content licensed under a Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0)
 