import geopandas as gpd
import holoviews as hv
import hvplot.pandas
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
Week 7A
Getting Data, Part 2: Working with APIs
- Section 401
- Monday Oct 16, 2023
Week #7 Agenda
- Introduction to APIs
- Natural language processing via Philly’s 311 API
- Word frequencies
- Sentiment analysis
- Pulling census data and shape files using Python
Update your local environment!
- Small update to the course’s Python environment
- Update the environment on your laptop using these instructions on course website
# Show all columns
pd.options.display.max_columns = 999
Part 1: Introduction to APIs
Or, how to pull data from the web using Python
Application programming interface (API):
(noun): A particular set of rules and specifications that software programs can follow to communicate with each other and exchange data.
Example APIs
- Socrata Open Data: https://dev.socrata.com/
- Open Data Philly: https://opendataphilly.org
- US Census Bureau: https://www.census.gov/data/developers/data-sets.html
- Bureau of Labor Statistics: https://www.bls.gov/developers/
- US Geological Survey: https://www.usgs.gov/products/data-and-tools/apis
- US Environmental Protection Agency: https://www.epa.gov/enviro/web-services
- Google APIs: https://console.cloud.google.com/apis/library
- Facebook: https://developers.facebook.com/docs/apis-and-sdks/
- Twitter: https://developer.twitter.com/en/docs/api-reference-index.html (RIP 💀)
- Foursquare: https://developer.foursquare.com/
- Instagram: https://www.instagram.com/developer/
- Yelp: https://www.yelp.com/developers
When accessing data via API, many services will require you to register an API key to prevent you from overloading the service with requests.
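For services that require a key, the key is typically passed along with the request as a query parameter or header. A minimal sketch of the pattern using Python's requests library (introduced below); the endpoint and parameter names here are made-up placeholders, so check each service's documentation for the real ones:

import requests

# Hypothetical example: authenticate by passing an API key as a query parameter
# (the endpoint and parameter names are placeholders, not a real service)
api_key = "YOUR-API-KEY"
response = requests.get(
    "https://api.example.com/v1/data",
    params={"api_key": api_key, "format": "json"},
)
print(response.status_code)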
Example #1: Automated data feeds
The simplest form of API is a data file hosted at a fixed URL that the provider automatically updates with new data over time.
USGS real-time earthquake feeds
This is an API for near-real-time data about earthquakes, and data is provided in GeoJSON format over the web.
The API has a separate endpoint for each version of the data that users might want. No authentication is required.
API documentation:
http://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php
Sample API endpoint, for magnitude 4.5+ earthquakes in past day:
http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_day.geojson
GeoPandas can read GeoJSON files from the web directly. Simply pass the URL to the gpd.read_file() function:
# Download data on magnitude 2.5+ quakes from the past week
endpoint_url = (
    "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_week.geojson"
)
earthquakes = gpd.read_file(endpoint_url)
earthquakes.head()
id | mag | place | time | updated | tz | url | detail | felt | cdi | mmi | alert | status | tsunami | sig | net | code | ids | sources | types | nst | dmin | rms | gap | magType | type | title | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | us6000lg18 | 4.50 | Mariana Islands region | 1697495094131 | 1697496425040 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/earthquakes/feed/v... | NaN | NaN | NaN | NaN | reviewed | 0 | 312 | us | 6000lg18 | ,us6000lg18, | ,us, | ,origin,phase-data, | 27.0 | 3.168 | 0.46 | 165.0 | mb | earthquake | M 4.5 - Mariana Islands region | POINT Z (147.53230 17.94240 10.00000) |
1 | ak023daay0wr | 2.50 | Kodiak Island region, Alaska | 1697491317711 | 1697493637040 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/earthquakes/feed/v... | NaN | NaN | NaN | NaN | automatic | 0 | 96 | ak | 023daay0wr | ,us6000lg02,ak023daay0wr, | ,us,ak, | ,origin,phase-data, | NaN | NaN | 1.16 | NaN | ml | earthquake | M 2.5 - Kodiak Island region, Alaska | POINT Z (-153.12000 57.13730 0.00000) |
2 | ak023daavntz | 2.50 | 99 km SE of Old Harbor, Alaska | 1697490628779 | 1697493347040 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/earthquakes/feed/v... | NaN | NaN | NaN | NaN | reviewed | 0 | 96 | ak | 023daavntz | ,us6000lg05,ak023daavntz, | ,us,ak, | ,origin,phase-data, | NaN | NaN | 1.54 | NaN | ml | earthquake | M 2.5 - 99 km SE of Old Harbor, Alaska | POINT Z (-152.13590 56.57760 15.00000) |
3 | hv73610612 | 2.86 | 21 km W of Volcano, Hawaii | 1697485792740 | 1697487909640 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/earthquakes/feed/v... | NaN | NaN | NaN | NaN | reviewed | 0 | 126 | hv | 73610612 | ,us6000lfz9,hv73610612, | ,us,hv, | ,origin,phase-data, | 53.0 | NaN | 0.12 | 68.0 | ml | earthquake | M 2.9 - 21 km W of Volcano, Hawaii | POINT Z (-155.44050 19.46667 6.83000) |
4 | us6000lfyi | 4.30 | La Rioja-San Juan border region, Argentina | 1697482144060 | 1697483277040 | NaN | https://earthquake.usgs.gov/earthquakes/eventp... | https://earthquake.usgs.gov/earthquakes/feed/v... | NaN | NaN | NaN | NaN | reviewed | 0 | 284 | us | 6000lfyi | ,us6000lfyi, | ,us, | ,moment-tensor,origin,phase-data, | 48.0 | 1.728 | 1.08 | 81.0 | mwr | earthquake | M 4.3 - La Rioja-San Juan border region, Argen... | POINT Z (-68.25610 -29.32270 39.71600) |
Let’s explore the data interactively with Folium:
earthquakes.explore()
Lots of other automated feeds are available, updated every minute.
Example #2: The CARTO API
- Philadelphia hosts the majority of its open data on OpenDataPhilly in the cloud using CARTO
- They provide an API to download the data
- You can access the API documentation on the dataset page on OpenDataPhilly
For example: shooting victims in Philadelphia
https://www.opendataphilly.org/dataset/shooting-victims
and the API documentation:
https://cityofphiladelphia.github.io/carto-api-explorer/#shootings
Let’s take a look at the download URL for the data in the GeoJSON format:
The anatomy of an API request
- Base URL: the web address of the API endpoint
- ?: separates the base URL from the query parameters
- Query parameters: the settings that allow the user to customize the data response
- &: the separator between individual query parameters
So, let’s break down the URL into its component parts:
# The API endpoint
carto_api_endpoint = "https://phl.carto.com/api/v2/sql"

# The query parameters
params = {
    "q": "SELECT * FROM shootings",
    "format": "geojson",
    "skipfields": "cartodb_id",
    # Note: we won't need the filename parameter, since we're not saving the data to a file
    # "filename": "shootings"
}
The q parameter is a SQL query. It allows you to select a specific subset of data from the larger database.
CARTO API documentation: https://carto.com/developers/sql-api/
SQL documentation: https://www.postgresql.org/docs/9.1/sql.html
General Query Syntax
SELECT [field names] FROM [table name] WHERE [query]
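For example, a query that follows this pattern against the shootings table used later in this lecture (the column names are taken from that table) might look like:

# Select two columns for shootings that occurred in 2022
query = "SELECT objectid, fatal FROM shootings WHERE year = 2022"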
Let’s try it out in Python
We’ll use Python’s requests library to make a “get” request to the API endpoint with our desired query. Similar to our web scraping requests!
import requests
Let’s make the get request and pass the query parameters via the params keyword:
response = requests.get(carto_api_endpoint, params=params)

response
<Response [200]>
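Under the hood, requests combines the endpoint and the params dictionary into a single URL, following exactly the anatomy described above. You can check the final URL that was requested via the response's url attribute:

# The full URL that was requested: base URL + "?" + encoded query parameters joined by "&"
print(response.url)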
# Get the returned data in JSON format
# This is a dictionary
features = response.json()
type(features)
dict
# What are the keys?
list(features.keys())
['type', 'features']
"type"] features[
'FeatureCollection'
# Let's look at the first feature
"features"][0] features[
{'type': 'Feature',
'geometry': {'type': 'Point', 'coordinates': [-75.162274, 39.988715]},
'properties': {'objectid': 10797984,
'year': 2020,
'dc_key': '202022062735.0',
'code': '411',
'date_': '2020-09-12T00:00:00Z',
'time': '21:04:00',
'race': 'W',
'sex': 'M',
'age': '53',
'wound': 'Foot',
'officer_involved': 'N',
'offender_injured': 'N',
'offender_deceased': 'N',
'location': '18TH & DAUPHIN ST',
'latino': 1,
'point_x': -75.16227376,
'point_y': 39.98871456,
'dist': '22',
'inside': 0,
'outside': 1,
'fatal': 0}}
Use the GeoDataFrame.from_features() function to create a GeoDataFrame.
shootings = gpd.GeoDataFrame.from_features(features, crs="EPSG:4326")
Don’t forget to specify the CRS of the input data when using GeoDataFrame.from_features().
shootings.head()
geometry | objectid | year | dc_key | code | date_ | time | race | sex | age | wound | officer_involved | offender_injured | offender_deceased | location | latino | point_x | point_y | dist | inside | outside | fatal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | POINT (-75.16227 39.98871) | 10797984 | 2020 | 202022062735.0 | 411 | 2020-09-12T00:00:00Z | 21:04:00 | W | M | 53 | Foot | N | N | N | 18TH & DAUPHIN ST | 1.0 | -75.162274 | 39.988715 | 22 | 0.0 | 1.0 | 0.0 |
1 | POINT (-75.14885 39.98795) | 10797985 | 2020 | 202022062782.0 | 111 | 2020-09-13T00:00:00Z | 10:50:00 | W | M | 22 | Multiple/Head | N | N | N | 1000 BLOCK W ARIZONA ST | 0.0 | -75.148851 | 39.987952 | 22 | 0.0 | 1.0 | 1.0 |
2 | POINT (-75.16062 39.98691) | 10797986 | 2020 | 202022063114.0 | 411 | 2020-09-14T00:00:00Z | 20:46:00 | B | M | 42 | Multiple | N | N | N | 1600 BLOCK SUSQUEHANNA AVE | 0.0 | -75.160616 | 39.986907 | 22 | 0.0 | 1.0 | 0.0 |
3 | POINT (-75.16062 39.98691) | 10797987 | 2020 | 202022063126.0 | 411 | 2020-09-14T00:00:00Z | 20:46:00 | B | M | 17 | Buttocks | N | N | N | 1600 BLOCK SUSQUEHANNA AVE | 0.0 | -75.160616 | 39.986907 | 22 | 0.0 | 1.0 | 0.0 |
4 | POINT (-75.15398 39.98373) | 10797988 | 2020 | 202022063470.0 | 111 | 2020-09-16T00:00:00Z | 12:10:00 | B | M | 39 | Head | N | N | N | 2000 BLOCK W SUSQUEHANNA AVE | 0.0 | -75.153980 | 39.983726 | 22 | 0.0 | 1.0 | 1.0 |
At-Home Exercise: Visualizing shootings data
Step 1: Prep the data
- Drop rows where the geometry is NaN
- Convert to a better CRS (e.g., 3857)
- Load city limits from Open Data Philly and trim to those only within city limits (some shootings have wrong coordinates, outside Philadelphia!)
# make sure we remove missing geometries
shootings = shootings.dropna(subset=["geometry"])

# convert to a better CRS
shootings = shootings.to_crs(epsg=3857)

city_limits = gpd.read_file(
    "https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson"
).to_crs(epsg=3857)

# Remove any shootings that are outside the city limits
shootings = gpd.sjoin(shootings, city_limits, predicate="within", how="inner").drop(
    columns=["index_right"]
)
Step 2: Plot the points
A quick plot with geopandas to show the shootings as points, and overlay Philadelphia ZIP codes.
# From Open Data Philly
zip_codes = gpd.read_file(
    "https://opendata.arcgis.com/datasets/b54ec5210cee41c3a884c9086f7af1be_0.geojson"
).to_crs(epsg=3857)

fig, ax = plt.subplots(figsize=(6, 6))

# ZIP Codes
zip_codes.to_crs(epsg=3857).plot(ax=ax, facecolor="none", edgecolor="black")

# Shootings
shootings.plot(ax=ax, color="crimson", markersize=10, alpha=0.4)

ax.set_axis_off()
Step 3: Make a (more useful) hex bin map
# Initialize the axes
fig, ax = plt.subplots(figsize=(10, 10), facecolor=plt.get_cmap("viridis")(0))

# Plot the hexbins from the Web Mercator coordinates
x = shootings.geometry.x
y = shootings.geometry.y
ax.hexbin(x, y, gridsize=40, mincnt=1, cmap="viridis")

# Overlay the ZIP code boundaries
zip_codes.to_crs(epsg=3857).plot(
    ax=ax, facecolor="none", linewidth=0.5, edgecolor="white"
)

ax.set_axis_off()
Example: Count the total number of rows in a table
The SQL COUNT function can be applied to count all rows.
response = requests.get(
    carto_api_endpoint, params={"q": "SELECT COUNT(*) FROM shootings"}
)
response.json()
{'rows': [{'count': 15096}],
'time': 0.006,
'fields': {'count': {'type': 'number', 'pgtype': 'int8'}},
'total_rows': 1}
It’s always a good idea to check how many rows you might be downloading before requesting all of the data from an API!
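If you want to use that number in code, for example to decide how many pages or chunks to request, you can pull it out of the returned JSON. A quick sketch based on the response structure shown above:

# Extract the row count from the JSON response
result = response.json()
total_rows = result["rows"][0]["count"]
print("total rows in the shootings table:", total_rows)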
Example: Select all columns, limiting the total number returned
The LIMIT clause limits the number of returned rows. It is very useful for taking a quick look at the format of a database.
# Limit the returned data to only 1 row
= "SELECT * FROM shootings LIMIT 1"
query
# Make the request
= {"q": query, "format": "geojson"}
params = requests.get(carto_api_endpoint, params=params) response
Create the GeoDataFrame:
df = gpd.GeoDataFrame.from_features(response.json(), crs="EPSG:4326")

df
geometry | cartodb_id | objectid | year | dc_key | code | date_ | time | race | sex | age | wound | officer_involved | offender_injured | offender_deceased | location | latino | point_x | point_y | dist | inside | outside | fatal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | POINT (-75.16227 39.98871) | 1 | 10797984 | 2020 | 202022062735.0 | 411 | 2020-09-12T00:00:00Z | 21:04:00 | W | M | 53 | Foot | N | N | N | 18TH & DAUPHIN ST | 1 | -75.162274 | 39.988715 | 22 | 0 | 1 | 0 |
Example: Select by specific column values
Use the WHERE clause to select the subset of rows where a logical condition is true.
Example #1: Select nonfatal shootings only
# Select nonfatal shootings only
= "SELECT * FROM shootings WHERE fatal = 0" query
# Make the request
= {"q": query, "format": "geojson"}
params = requests.get(carto_api_endpoint, params=params)
response
# Make the GeoDataFrame
= gpd.GeoDataFrame.from_features(response.json(), crs="EPSG:4326")
nonfatal
# Print
print("number of nonfatal shootings = ", len(nonfatal))
number of nonfatal shootings = 11878
Example #2: Select shootings in 2023
# Select based on "date_"
= "SELECT * FROM shootings WHERE date_ >= '1/1/23'" query
# Make the request
= {"q": query, "format": "geojson"}
params = requests.get(carto_api_endpoint, params=params)
response
# Make the GeoDataFrame
= gpd.GeoDataFrame.from_features(response.json(), crs="EPSG:4326")
shootings_2023
# Print
print("number of shootings in 2023 = ", len(shootings_2023))
number of shootings in 2023 = 1413
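WHERE conditions can also be combined with AND or OR. For example, a sketch of a query for fatal shootings reported since the start of 2023 (not run here):

# Combine conditions: fatal shootings since the start of 2023
query_fatal_2023 = "SELECT * FROM shootings WHERE fatal = 1 AND date_ >= '1/1/23'"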
At-Home Exercise: Explore trends by month and day of week
Step 1: Convert the date column to DateTime objects
Add Month and Day of Week columns
# Convert the date column to a datetime object
shootings["date"] = pd.to_datetime(shootings["date_"])
# Add new columns: Month and Day of Week
"Month"] = shootings["date"].dt.month
shootings["Day of Week"] = shootings["date"].dt.dayofweek # Monday is 0, Sunday is 6 shootings[
Step 2: Calculate number of shootings by month and day of week
Use the familiar Groupby –> size()
count = shootings.groupby(["Month", "Day of Week"]).size()
count = count.reset_index(name="Count")
count.head()
Month | Day of Week | Count | |
---|---|---|---|
0 | 1 | 0 | 168 |
1 | 1 | 1 | 157 |
2 | 1 | 2 | 143 |
3 | 1 | 3 | 139 |
4 | 1 | 4 | 143 |
Step 3: Make a heatmap using hvplot
# Remember 0 is Monday and 6 is Sunday
count.hvplot.heatmap(
    x="Day of Week",
    y="Month",
    C="Count",
    cmap="viridis",
    width=400,
    height=500,
    flip_yaxis=True,
)
Trends: more shootings on the weekends and in the summer months
Example #3: GeoServices
A GeoService is a standardized format for returning GeoJSON files over the web
Originally developed by Esri, in 2010 the specification was transferred to the Open Web Foundation.
Documentation: http://geoservices.github.io/
Example: Philadelphia neighborhoods
OpenDataPhilly provides GeoService API endpoints for the geometry hosted on its platform
https://opendataphilly.org/datasets/philadelphia-neighborhoods/
# The base URL for the neighborhoods layer
= "https://services1.arcgis.com/a6oRSxEw6eIY5Zfb/arcgis/rest/services/Philadelphia_Neighborhoods/FeatureServer/0" neighborhood_url
The /query endpoint
Layers have a single endpoint for requesting features: the /query endpoint.
neighborhood_query_endpoint = neighborhood_url + "/query"
neighborhood_query_endpoint
'https://services1.arcgis.com/a6oRSxEw6eIY5Zfb/arcgis/rest/services/Philadelphia_Neighborhoods/FeatureServer/0/query'
The allowed parameters
The main request parameters are:
- where: A SQL-like string to select a subset of features; to get all data, use 1=1 (always True)
- outFields: The list of columns to return; to get all, use *
- f: The returned format; for GeoJSON, use “geojson”
- outSR: The desired output CRS
params = {
    "where": "1=1",  # Give me all rows
    "outFields": "*",  # All fields
    "f": "geojson",  # GeoJSON format
    "outSR": "4326",  # The desired output CRS
}
Make the request:
r = requests.get(neighborhood_query_endpoint, params=params)
Get the features and create the GeoDataFrame:
json = r.json()

features = json["features"]

hoods = gpd.GeoDataFrame.from_features(features, crs="EPSG:4326")
hoods
geometry | FID | NAME | LISTNAME | MAPNAME | Shape_Leng | Shape__Area | Shape__Length | |
---|---|---|---|---|---|---|---|---|
0 | POLYGON ((-75.06773 40.00540, -75.06765 40.005... | 1 | BRIDESBURG | Bridesburg | Bridesburg | 27814.546521 | 7.066581e+06 | 11074.587308 |
1 | POLYGON ((-75.01560 40.09487, -75.01768 40.092... | 2 | BUSTLETON | Bustleton | Bustleton | 48868.458365 | 1.812721e+07 | 19485.686530 |
2 | POLYGON ((-75.18848 40.07273, -75.18846 40.072... | 3 | CEDARBROOK | Cedarbrook | Cedarbrook | 20021.415802 | 3.950980e+06 | 7980.051802 |
3 | POLYGON ((-75.21221 40.08603, -75.21211 40.086... | 4 | CHESTNUT_HILL | Chestnut Hill | Chestnut Hill | 56394.297195 | 1.265385e+07 | 22476.270236 |
4 | POLYGON ((-75.18479 40.02837, -75.18426 40.027... | 5 | EAST_FALLS | East Falls | East Falls | 27400.776417 | 6.434879e+06 | 10907.970158 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
153 | POLYGON ((-75.18373 39.91351, -75.18194 39.913... | 154 | PACKER_PARK | Packer Park | Packer Park | 21816.017948 | 4.584862e+06 | 8444.679631 |
154 | POLYGON ((-75.14654 39.93005, -75.14804 39.921... | 155 | PENNSPORT | Pennsport | Pennsport | 11823.233108 | 1.026917e+06 | 4706.597865 |
155 | POLYGON ((-75.16986 39.92312, -75.17015 39.921... | 156 | NEWBOLD | Newbold | Newbold | 10052.570885 | 9.281435e+05 | 3994.098029 |
156 | POLYGON ((-75.17630 39.92425, -75.17798 39.923... | 157 | WEST_PASSYUNK | West Passyunk | West Passyunk | 10499.291848 | 1.027608e+06 | 4173.301421 |
157 | POLYGON ((-75.15684 39.92897, -75.15712 39.927... | 158 | EAST_PASSYUNK | East Passyunk | East Passyunk | 10987.761846 | 1.028395e+06 | 4368.384350 |
158 rows × 8 columns
Now let’s select data just for East Falls and subset the returned columns:
params = {
    "where": "LISTNAME = 'East Falls'",
    "outFields": ["LISTNAME"],
    "f": "geojson",
    "outSR": "4326",
}

r = requests.get(neighborhood_query_endpoint, params=params)

east_falls = gpd.GeoDataFrame.from_features(r.json()["features"], crs="EPSG:4326")

east_falls
geometry | LISTNAME | |
---|---|---|
0 | POLYGON ((-75.18479 40.02837, -75.18426 40.027... | East Falls |
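The /query endpoint supports other useful parameters as well. For example, the ArcGIS REST API's returnCountOnly parameter asks for just the number of matching features, which is handy before downloading a large layer. A small sketch (the exact response structure may vary by server):

# Ask only for the number of features matching the where clause
count_params = {"where": "1=1", "returnCountOnly": "true", "f": "json"}
r = requests.get(neighborhood_query_endpoint, params=count_params)
r.json()  # e.g. {"count": 158}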
Part 2: Natural Language Processing and the 311 Request API
In part two, we’ll pull data using the API for the Philly 311 system, available at: https://iframe.publicstuff.com/#?client_id=242
We saw this site previously when we talked about web scraping last week:
Let’s take another look at the address where the site is pulling its data from:
https://vc0.publicstuff.com/api/2.0/requests_list?client_id=242&device=iframe&limit=35&page=1
This is just an API request! It’s an example of a non-public, internal API, but we can reverse-engineer it to extract the data we want!
Break it down into its component parts:
- Base URL: https://vc0.publicstuff.com/api/2.0/requests_list
- Query parameters: client_id, device, limit, page
It looks like “client_id” identifies data for the City of Philadelphia, which is definitely a required parameter. The other parameters seem optional, controlling the requesting device, how many requests are returned per page, and which page of results to view.
Let’s test it out. We’ll grab 2 requests from the first page:
r = requests.get(
    "https://vc0.publicstuff.com/api/2.0/requests_list",
    params={"client_id": 242, "page": 1, "limit": 2},
)

json = r.json()

json
{'response': {'requests': [{'request': {'id': 14500812,
'image_thumbnail': '',
'title': 'Traffic Sign Complaint',
'description': 'People speed up the street like they’re all 95 it needs to have street cushions put in or speed bumps whatever you wanna call',
'status': 'submitted',
'address': '3532 Mercer St,Philadelphia, PA 19134',
'location': 'Philadelphia, Pennsylvania',
'zipcode': '19134',
'foreign_id': '16306373',
'date_created': 1697500549,
'count_comments': 0,
'count_followers': 0,
'count_supporters': 0,
'lat': 39.98944,
'lon': -75.095472,
'user_follows': 0,
'user_comments': 0,
'user_request': 0,
'rank': '1',
'user': 'Lisa BUCHER'}},
{'request': {'primary_attachment': {'id': 4618920,
'extension': 'jpeg',
'content_type': 'image/jpeg',
'url': 'https://d17aqltn7cihbm.cloudfront.net/uploads/021404250e358776466e2a1eed0b37b7',
'versions': {'small': 'https://d17aqltn7cihbm.cloudfront.net/uploads/small_021404250e358776466e2a1eed0b37b7',
'medium': 'https://d17aqltn7cihbm.cloudfront.net/uploads/medium_021404250e358776466e2a1eed0b37b7',
'large': 'https://d17aqltn7cihbm.cloudfront.net/uploads/large_021404250e358776466e2a1eed0b37b7'}},
'id': 14500796,
'image_thumbnail': 'https://d17aqltn7cihbm.cloudfront.net/uploads/small_021404250e358776466e2a1eed0b37b7',
'title': 'Abandoned Automobile',
'description': 'This vehicle has been parked on the 5300 block of Sycamore Street. The car appears to be stolen and parts from the vehicle were removed such as Steering wheel – and other items',
'status': 'submitted',
'address': '5414 B St, Philadelphia, PA 19120, USA',
'location': '',
'zipcode': None,
'foreign_id': '16306368',
'date_created': 1697500317,
'count_comments': 0,
'count_followers': 0,
'count_supporters': 0,
'lat': 40.0324275675885,
'lon': -75.1184703964844,
'user_follows': 0,
'user_comments': 0,
'user_request': 0,
'rank': '1',
'user': ''}}],
'count': '2',
'benchmark': 6.994715929031372,
'status': {'type': 'success',
'message': 'Success',
'code': 200,
'code_message': 'Ok'}}}
Now we need to understand the structure of the response. First, access the list of requests:
= json["response"]["requests"]
request_list
request_list
[{'request': {'id': 14500812,
'image_thumbnail': '',
'title': 'Traffic Sign Complaint',
'description': 'People speed up the street like they’re all 95 it needs to have street cushions put in or speed bumps whatever you wanna call',
'status': 'submitted',
'address': '3532 Mercer St,Philadelphia, PA 19134',
'location': 'Philadelphia, Pennsylvania',
'zipcode': '19134',
'foreign_id': '16306373',
'date_created': 1697500549,
'count_comments': 0,
'count_followers': 0,
'count_supporters': 0,
'lat': 39.98944,
'lon': -75.095472,
'user_follows': 0,
'user_comments': 0,
'user_request': 0,
'rank': '1',
'user': 'Lisa BUCHER'}},
{'request': {'primary_attachment': {'id': 4618920,
'extension': 'jpeg',
'content_type': 'image/jpeg',
'url': 'https://d17aqltn7cihbm.cloudfront.net/uploads/021404250e358776466e2a1eed0b37b7',
'versions': {'small': 'https://d17aqltn7cihbm.cloudfront.net/uploads/small_021404250e358776466e2a1eed0b37b7',
'medium': 'https://d17aqltn7cihbm.cloudfront.net/uploads/medium_021404250e358776466e2a1eed0b37b7',
'large': 'https://d17aqltn7cihbm.cloudfront.net/uploads/large_021404250e358776466e2a1eed0b37b7'}},
'id': 14500796,
'image_thumbnail': 'https://d17aqltn7cihbm.cloudfront.net/uploads/small_021404250e358776466e2a1eed0b37b7',
'title': 'Abandoned Automobile',
'description': 'This vehicle has been parked on the 5300 block of Sycamore Street. The car appears to be stolen and parts from the vehicle were removed such as Steering wheel – and other items',
'status': 'submitted',
'address': '5414 B St, Philadelphia, PA 19120, USA',
'location': '',
'zipcode': None,
'foreign_id': '16306368',
'date_created': 1697500317,
'count_comments': 0,
'count_followers': 0,
'count_supporters': 0,
'lat': 40.0324275675885,
'lon': -75.1184703964844,
'user_follows': 0,
'user_comments': 0,
'user_request': 0,
'rank': '1',
'user': ''}}]
We need to extract out the “request” key of each list entry. Let’s do that and create a DataFrame:
= pd.DataFrame([r["request"] for r in request_list]) data
data.head()
id | image_thumbnail | title | description | status | address | location | zipcode | foreign_id | date_created | count_comments | count_followers | count_supporters | lat | lon | user_follows | user_comments | user_request | rank | user | primary_attachment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14500812 | Traffic Sign Complaint | People speed up the street like they’re all 95... | submitted | 3532 Mercer St,Philadelphia, PA 19134 | Philadelphia, Pennsylvania | 19134 | 16306373 | 1697500549 | 0 | 0 | 0 | 39.989440 | -75.095472 | 0 | 0 | 0 | 1 | Lisa BUCHER | NaN | |
1 | 14500796 | https://d17aqltn7cihbm.cloudfront.net/uploads/... | Abandoned Automobile | This vehicle has been parked on the 5300 block... | submitted | 5414 B St, Philadelphia, PA 19120, USA | None | 16306368 | 1697500317 | 0 | 0 | 0 | 40.032428 | -75.118470 | 0 | 0 | 0 | 1 | {'id': 4618920, 'extension': 'jpeg', 'content_... |
Success! But we want to build up a larger dataset, so let’s pull the first 3 pages of data. This will take a minute or two…
# Store the data we request
data = []

# Total number of pages
total_pages = 3

# Loop over each page
for page_num in range(1, total_pages + 1):
    # Print out the page number
    print(f"Getting data for page #{page_num}...")

    # Make the request
    r = requests.get(
        "https://vc0.publicstuff.com/api/2.0/requests_list",
        params={
            "client_id": 242,  # Unique identifier for Philadelphia
            "page": page_num,  # What page of data to pull
            "limit": 200,  # How many rows per page
        },
    )

    # Get the json
    d = r.json()

    # Add the new data to our list and save
    data = data + [r["request"] for r in d["response"]["requests"]]

# Create a dataframe
data = pd.DataFrame(data)
Getting data for page #1...
Getting data for page #2...
Getting data for page #3...
len(data)
600
data.head()
primary_attachment | id | image_thumbnail | title | description | status | address | location | zipcode | foreign_id | date_created | count_comments | count_followers | count_supporters | lat | lon | user_follows | user_comments | user_request | rank | user | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {'id': 4619003, 'extension': 'jpg', 'content_t... | 14500990 | https://d17aqltn7cihbm.cloudfront.net/uploads/... | Graffiti Removal | On the side of the house, the side facing Wood... | submitted | 1320 S Markoe St, Philadelphia, PA 19143, USA | None | 16306421 | 1697503690 | 0 | 0 | 0 | 39.943466 | -75.209642 | 0 | 0 | 0 | 1 | WalkAbout | |
1 | NaN | 14500986 | Homeless Encampment | There is a growing encampment near my house. T... | submitted | 2037 S Broad St, Philadelphia, PA 19148, USA | None | 16306417 | 1697503541 | 0 | 0 | 0 | 39.924470 | -75.169041 | 0 | 0 | 0 | 1 | jimmy.qaqish | ||
2 | {'id': 4618993, 'extension': 'jpg', 'content_t... | 14500969 | https://d17aqltn7cihbm.cloudfront.net/uploads/... | Graffiti Removal | On the west side of S. Markoe Street across fr... | submitted | 1305 S Markoe St, Philadelphia, PA 19143, USA | None | 16306412 | 1697503108 | 0 | 0 | 0 | 39.943874 | -75.209812 | 0 | 0 | 0 | 1 | WalkAbout | |
3 | NaN | 14500967 | Abandoned Automobile | The trailer has been parked on the corner sinc... | submitted | 2535 E Ontario St, Philadelphia, PA 19134, USA | None | 16306411 | 1697503064 | 0 | 0 | 0 | 39.988810 | -75.099584 | 0 | 0 | 0 | 1 | cebonfan | ||
4 | NaN | 14500963 | Abandoned Automobile | Parked on the sidewalk for over 6 months | submitted | 2634 Sears St, Philadelphia, PA 19146, USA | None | 16306410 | 1697502982 | 0 | 0 | 0 | 39.936311 | -75.188364 | 0 | 0 | 0 | 1 |
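If you scale the loop above up to many more pages, it is good practice to pause briefly between requests so you don't overload the service (recall the earlier note about API keys and rate limits). A minimal sketch of the pattern, reusing the same loop structure:

import time

for page_num in range(1, total_pages + 1):
    # ... make the request and store the results for this page, as above ...
    time.sleep(1)  # be polite: wait a second between requests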
Let’s focus on the “description” column. This is the narrative text that the user inputs when entering a 311 request, and it is an example of semi-structured data. For the rest of today, we’ll focus on how to extract information from semi-structured data.
Semi-structured data
Data that contains some elements that cannot be easily consumed by computers
Examples: human-readable text, audio, images, etc
Key challenges
- Text mining: analyzing blocks of text to extract the relevant pieces of information
- Natural language processing (NLP): programming computers to process and analyze human languages
- Sentiment analysis: analyzing blocks of text to derive the attitude or emotional state of the person
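As a quick preview of sentiment analysis (we will come back to it in detail), here is a minimal sketch using NLTK's VADER analyzer; this assumes the vader_lexicon resource has been downloaded and is not part of the original lecture code:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the lexicon that VADER relies on (only needed once)
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

# The "compound" score ranges from -1 (most negative) to +1 (most positive)
sia.polarity_scores("The park is beautiful, but the trash situation is terrible!")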
Twitter is one of the main API examples of semi-structured data, but since Elon Musk overhauled the API access, it’s become prohibitively expensive to access (RIP 💀)
To get started, let’s remove any requests where the description is missing:
data = data.dropna(subset=["description"])
data_final = data.loc[data["description"] != ""]

# Strip out spaces and convert to a list
descriptions = data_final["description"].str.strip().tolist()

descriptions[:10]
['On the side of the house, the side facing Woodland Avenue.',
'There is a growing encampment near my house. The people there have broken into a fenced area. They are yelling, leaving trash, defecating, and attempting to open cars on the street.',
'On the west side of S. Markoe Street across from #1305.',
'The trailer has been parked on the corner since the summer and has not moved. The street facing tires are flat. The trailer seems abandoned.',
'Parked on the sidewalk for over 6 months',
'The street light in front of 8219 Fayette st was out last night and is still out tonight. It very, very dark!',
'On the southwest corner of the intersection of S. 46th Street and Woodland Avenue.',
"On the 'No Dumping' street sign on the south side of Cedar Avenue about 11 steps west of S. 46th Street. Next to the tiny park.",
'Apartment complex is dumping trash on our street again! Please help or tell me the proper channels I need to take. Thank you!',
'People speed up the street like they’re all 95 it needs to have street cushions put in or speed bumps whatever you wanna call']
Use case #1: calculating word frequencies
An example of text mining
Text mining and dealing with messy data
Some steps to clean up our text data:
- Break strings into words
- Remove capitalization
- Remove stop words
- Remove punctuation
1. Break strings into words
Use the .split() command to break a string into words by splitting on spaces.
= "This is an Example"
example_string
example_string.split()
['This', 'is', 'an', 'Example']
descriptions_words = [desc.split() for desc in descriptions]

descriptions_words[0]
['On',
'the',
'side',
'of',
'the',
'house,',
'the',
'side',
'facing',
'Woodland',
'Avenue.']
This is a list of lists, e.g., the first element is a list of words. Let’s flatten this into a list of just words:
descriptions_words_flat = []

for list_of_words in descriptions_words:
    for word in list_of_words:
        descriptions_words_flat.append(word)

descriptions_words_flat[0]
'On'
len(descriptions_words_flat)
9370
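The same flattening can also be written as a nested list comprehension, a common Python idiom that gives an identical result:

# Equivalent flattening using a nested list comprehension
descriptions_words_flat = [
    word for list_of_words in descriptions_words for word in list_of_words
]

len(descriptions_words_flat)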
2. Convert all words to lower case
Use .lower() to make all words lowercase.
descriptions_words_lower = [word.lower() for word in descriptions_words_flat]
descriptions_words_lower[:10]
['on',
'the',
'side',
'of',
'the',
'house,',
'the',
'side',
'facing',
'woodland']
len(descriptions_words_lower)
9370
3. Remove stop words
Common words that do not carry much significance and are often ignored in text analysis.
We can use the nltk package.
The “Natural Language Toolkit” https://www.nltk.org/
Import and download the stop words:
import nltk
"stopwords"); nltk.download(
[nltk_data] Downloading package stopwords to /Users/nhand/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Get the list of common stop words:
stop_words = list(set(nltk.corpus.stopwords.words("english")))

stop_words[:10]
["wasn't",
'will',
'for',
'ourselves',
'while',
'during',
'your',
'here',
'some',
"that'll"]
len(stop_words)
179
descriptions_no_stop = []

for word in descriptions_words_lower:
    if word not in stop_words:
        descriptions_no_stop.append(word)

# Equivalently, as a list comprehension:
descriptions_no_stop = [
    word for word in descriptions_words_lower if word not in stop_words
]
len(descriptions_no_stop)
5439
4. Remove punctuation
Get the list of common punctuation:
import string
punctuation = list(string.punctuation)

punctuation[:5]
['!', '"', '#', '$', '%']
Remove punctuation from words:
descriptions_final = []

# Loop over all words
for word in descriptions_no_stop:
    # Remove any punctuation from the words
    for p in punctuation:
        word = word.replace(p, "")

    # Save it if the string is not empty
    if word != "":
        descriptions_final.append(word)
Convert to a DataFrame with one column:
words = pd.DataFrame({"words": descriptions_final})
words.head()
words | |
---|---|
0 | side |
1 | house |
2 | side |
3 | facing |
4 | woodland |
Calculate the word frequencies
Use a pandas groupby and sort to put in descending order:
N = (
    words.groupby("words", as_index=False)
    .size()
    .sort_values("size", ascending=False, ignore_index=True)
)
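An equivalent shortcut: pandas’ value_counts() returns the same ranking in a single call (as a Series indexed by word rather than a DataFrame):

# Same counts, computed with value_counts()
words["words"].value_counts().head(15)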
The top 15 words by frequency:
top15 = N.head(15)

top15
words | size | |
---|---|---|
0 | street | 145 |
1 | trash | 113 |
2 | side | 66 |
3 | block | 40 |
4 | vehicle | 39 |
5 | sidewalk | 38 |
6 | please | 38 |
7 | property | 35 |
8 | lot | 35 |
9 | corner | 34 |
10 | car | 32 |
11 | front | 31 |
12 | graffiti | 30 |
13 | months | 30 |
14 | st | 29 |
Plot the frequencies
Use seaborn to plot our DataFrame of word counts…
fig, ax = plt.subplots(figsize=(8, 8))

# Plot horizontal bar graph
sns.barplot(
    y="words",
    x="size",
    data=top15,
    ax=ax,
    color="#cc3000",
    saturation=1.0,
)

ax.set_title("Most Common Words Found in 311 Requests", fontsize=16);
Takeaway: Philly cares about trash! They don’t call it Filthadelphia for nothing…
To be continued next time!
See you on Wednesday!