Geospatial Data Science in Python

Welcome to MUSA 550:
Geospatial Data Science in Python

  • Section 401
  • Aug. 30, 2023

Today

  • Course logistics
  • Using Jupyter Notebooks and Jupyter Lab
  • Introduction to Python & Pandas

Who am I?

My day job

  • My name is Nick Hand
  • For the past 5+ years, I have led a small data science team in the City Controller’s Office
  • What did we do?
    • Objective, data-driven analysis of financial policies impacting Philadelphia
    • Increasing transparency through data releases and interactive reports
  • In two weeks, I start as a data scientist working at the Consumer Financial Protection Bureau in the federal government

In my time at the Controller’s Office, we covered a range of policy issues in the city:

  • Analysis of the fairness and accuracy of property assessments
  • Interactive reports for the City’s cash levels
  • Analysis of the 10-Year Tax Abatement program
  • Interactive dashboard of Soda Tax spending
  • Visualization of paving & potholes
  • Series on the impact of COVID-19 on Philadelphia’s small businesses and neighborhoods
  • Visualization of the City’s most recent budget
  • Interactive report on redlining in Philadelphia
  • Analysis of the impact of gun violence on housing prices
  • Interactive dashboard of shooting victims in Philadelphia
  • Interactive dashboard of neighborhood well-being

Note

To see more of our work, check out: https://controller.phila.gov/policy-analysis

Previously:
Astrophysics Ph.D. at Berkeley

How did I get here?

  • Astrophysics/physics to data science is becoming increasingly common
  • Landed a job through Twitter: https://www.parkingjawn.com
    • Dashboard visualization of monthly parking tickets in Philadelphia
    • Data from OpenDataPhilly

A nice example of how exploratory analysis + a well-designed dashboard can lead to insights:

  • The power of cross-filtering: different views of the same data across multiple dimensions
  • See the drop in parking tickets over Jan 24-26, 2016 due to a snowstorm

Parking Jawn is not Python based, but dovetails nicely with one of the main goals of the course:

How can we effectively explore and extract insight from complex datasets?

Course logistics

General Info

  • Two 90-minute lectures per week — mix of lecturing, interactive demos, and in-class lab time
  • My email: nhand@design.upenn.edu
    • Office Hours:
      • 2 hours during the week
      • Office hours will be by appointment and remote via Zoom. You will be able to sign up for one (or more) 15-minute time slots via the Canvas calendar.
      • Time is to be determined
  • Teaching Assistant: Teresa Chang
    • Email: thchang@design.upenn.edu
    • Office hours: TBD

Course Websites

The course has four websites (sorry!). They are:

  • Main Course: https://musa-550-fall-2023.github.io
  • Github: https://github.com/MUSA-550-Fall-2023
  • Canvas: https://canvas.upenn.edu/courses/1740535
  • Ed Discussion: https://edstem.org/us/courses/42616/discussion/

Each will have its own purpose:

Main course website

  • Course schedule with links to weekly slides
  • Resources for learning Python, setting up software, and dealing with common issues
  • General course info and policies
  • Quick links to the other websites for the course

Github

  • Github organization set up for the course
  • Each week and assignment will have its own Github repository
  • Assignments will also be submitted through Github

Canvas

  • Will be used to sign up for remote office hours and to provide Zoom links for office hours
  • Grading will also be tracked here

Ed Discussion

  • Will be used as a question & answer forum for course materials and assignments
  • Announcements will also be made here so make sure you check frequently or turn on your notifications!
  • Main method of communication will be through announcements on this site
  • Participation grade (5% of total grade) will also be determined by user activity on the forum

Main course website

Highlights

  • Syllabus
  • Schedule
  • Resources & Guides:
    • Python resources
    • Initial installation guide
  • Weekly content
  • Assignments
  • Quick links to Canvas, Ed Discussion, GitHub homepage

Course Github

The goals of this course

  • Provide students with the knowledge and tools to turn data into meaningful insights and stories
  • Focus on the modern data science tools within the Python ecosystem
  • The pipeline approach to data science:
    • gathering, storing, analyzing, and visualizing data to tell stories
  • Real-world applications of analysis techniques in the urban planning and public policy realm

What we’ll cover

Module 1

Exploratory Data Science: Students will be introduced to the main tools needed to get started analyzing and visualizing data using Python

Module 2

Introduction to Geospatial Data Science: Building on the previous set of tools, this module will teach students how to work with geospatial datasets using a range of modern Python toolkits.

Module 3

Data Ingestion & Big Data: Students will learn how to collect new data through web scraping and APIs, as well as how to work effectively with the large datasets often encountered in real-world applications.

Module 4

From Exploration to Storytelling: With a solid foundation, students will learn the latest tools to present their analysis results using web-based formats to transform their insights into interactive stories.

Module 5

Geospatial Data Science in the Wild: Armed with the necessary data science tools, the final module introduces a range of advanced analytic and machine learning techniques using a number of innovative examples from modern researchers.

Assignments and grading

  • Grading:
    • 50% homework
    • 45% final project
    • 5% participation (based on class and Ed Discussion participation)
  • While you are required to submit all six assignments, the assignment with the lowest grade will not count towards your final grade.
  • There’s no penalty for late assignments. I would highly recommend staying caught up on lectures and assignments as much as possible, but if you need to turn something in a few days late, there won’t be a penalty.

Note: Homeworks will be assigned (roughly) every two and a half weeks.
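
As a rough sketch of how the weights above combine, the function below computes a final grade on a 0-100 scale. It assumes the five counted homeworks are weighted equally, which is not stated above; the function name and the sample scores are made up for illustration.

```python
def final_grade(homework_scores, project_score, participation_score):
    """Combine scores (each on a 0-100 scale) using the course weights."""
    # Drop the single lowest homework score before averaging
    # (ASSUMPTION: the remaining homeworks are weighted equally)
    counted = sorted(homework_scores)[1:]
    homework_avg = sum(counted) / len(counted)

    # 50% homework, 45% final project, 5% participation
    return 0.50 * homework_avg + 0.45 * project_score + 0.05 * participation_score


# Hypothetical scores for the six assignments
scores = [90, 80, 100, 70, 90, 90]
grade = final_grade(scores, project_score=90, participation_score=100)
print(grade)
```

Here the lowest homework score (70) is dropped, so the homework average is taken over the other five scores.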

The course schedule

Check out the schedule page for the most up-to-date details on lectures, assignment due dates, etc.

Final project

The final project is to replicate the pipeline approach on a dataset (or datasets) of your choosing.

Students will be required to use several of the analysis techniques taught in the class and produce a web-based data visualization that effectively communicates the empirical results to a non-technical audience.

More info will be posted here: https://github.com/MUSA-550-Fall-2023/final-project

Any questions so far?

Initial surveys

Roll call: https://bit.ly/musa550-roll-call

Some initial feedback: https://bit.ly/musa550-initial-feedback

Okay, let’s get started…

The Incredible Growth of Python

Source: A 2017 analysis of StackOverflow posts

The rise of the Jupyter notebook

The engine of collaborative data science

  • First started by a physics grad student around 2001
  • Known as the IPython notebook originally
  • Started getting popular in ~2011
  • First funding received in 2015 — the Jupyter notebook was born

Google searches for Jupyter notebook

Key features

  • Aimed at “computational narratives” — telling stories with data
  • Interactive, reproducible, shareable, user-friendly, visualization-focused
  • Fully open-source and managed by the community

Very versatile: good for both exploratory data analysis and polished finished products

Important

The lecture slides in the course will all be Jupyter notebooks. The preferred interface for editing and executing them will be JupyterLab. That’s what I’m using now!

Tip

For more info on Jupyter notebooks and JupyterLab, check out the guide on the course website.

In particular, I strongly encourage you to go through the official documentation for JupyterLab and Jupyter notebooks:

  • Starting JupyterLab
  • The JupyterLab interface
  • The structure of a notebook document
  • The notebook workflow
  • Working with notebooks in JupyterLab
  • Working with files

Beyond the Jupyter notebook

Google Colab is the most popular alternative to Jupyter notebooks.

  • A fancier notebook experience built on top of Jupyter notebook
  • Running in the cloud on Google’s servers
  • An internal Google product that was released publicly
  • Very popular for Python-based machine learning, since it provides low-barrier access to GPU resources which can be very helpful for training machine learning models
  • We’ll focus on the open-source Jupyter notebook as the foundation for this course

See, for example: https://colab.research.google.com/notebooks/welcome.ipynb

The Binder service

https://mybinder.org

  • A free, open-source service, supported by donors
  • Allows you to launch a repository of Jupyter notebooks on GitHub in an executable environment in the cloud
  • Amazing if you want to make your code immediately reproducible by anyone, anywhere.
  • Note: as a free service, it can be a bit slow sometimes

Class lectures

Weekly lectures are available on Binder! In the README for each week’s repository on GitHub, you will see badges to launch the lecture slides on Binder.


You can also access these links from the content section of the course website. For example, here is the page for week 1:

https://musa-550-fall-2023.github.io/content/week-1/

Suggested weekly workflow
  • Set up local Python environment as part of first homework assignment
  • Each week, you will have two options to follow along with lectures:
    1. Using Binder in the cloud, launching via the button on the week’s repository
    2. Download the week’s repository to your laptop and launch the notebook locally
  • Work on homeworks locally on your laptop — Binder is only a temporary environment (no save features)

Check out the content overview page on the main course website for more info!

Now to the fun stuff…

Jupyter notebooks are a mix of code cells and text cells in Markdown. You can change the type of cell in the top menu bar.

This cell is a Markdown cell.

# Comments begin with a "#" character in Python
# A simple code cell
# SHIFT-ENTER to execute


x = 10
print(x)
10

Python data types

# integer
a = 10

# float
b = 10.5

# string
c = "this is a test string"

# lists
d = list(range(0, 10))

# booleans
e = False

# dictionaries
f = {"key1": 1, "key2": 2}
print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
10
10.5
this is a test string
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
False
{'key1': 1, 'key2': 2}
Note

Unlike R, you’ll need to use quotes more often in Python, particularly around strings and keys of dictionaries.
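
A small sketch of the point about quotes: text in quotes is a string, while a bare name is treated as a variable. The built-in type() function reports the type of any value. The variable names below are only for illustration.

```python
course = "MUSA 550"  # quotes make this a string
section = 401        # no quotes: an integer

# type() reports the data type of a value
print(type(course))
print(type(section))

# Dictionary keys that are strings need quotes too
info = {"course": course, "section": section}
print(info["course"])

# Without quotes, Python looks for a variable with that name
try:
    info[cours]  # 'cours' was never defined
except NameError as err:
    print("NameError:", err)
```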

Alternative method for creating a dictionary

We can use the dict() function, which is built in to the Python language. More on functions in a bit…

f = dict(key1=1, key2=2, key3=3)

f
{'key1': 1, 'key2': 2, 'key3': 3}

Accessing dictionary values

# Access the value with key 'key1'
f['key1']
1
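
Square brackets also work for adding or changing entries. A minimal sketch of a few other common dictionary operations, using a small example dictionary:

```python
f = {"key1": 1, "key2": 2}

# Add a new key or overwrite an existing one with square brackets
f["key3"] = 3
f["key1"] = 100

# Asking for a key that does not exist raises a KeyError...
try:
    f["missing"]
except KeyError:
    print("no such key")

# ...while .get() returns a fallback value instead (None by default)
value = f.get("missing", 0)

# Check whether a key exists with "in"
has_key2 = "key2" in f

print(f, value, has_key2)
```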

Accessing list values

d
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Access the first list entry (0 is the first index)
d[0]  
0

Accessing characters of a string

c
'this is a test string'
# the first character
c[0]
't'
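
Lists and strings share the same indexing rules. The sketch below goes slightly beyond the examples above to show negative indices and slices, re-creating d and c so that it runs on its own:

```python
d = list(range(0, 10))
c = "this is a test string"

second = d[1]         # index 1 is the second entry
last = d[-1]          # negative indices count from the end
first_three = d[0:3]  # slices include the start and exclude the stop

first_word = c[0:4]   # strings slice the same way
last_char = c[-1]

print(second, last, first_three, first_word, last_char)
```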

Iterators and for loops

Important

Be sure to use the right indentation in for loops!

# Variable that will track the sum
result_sum = 0

# Variable i takes on values [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for i in range(0, 10):
        
    # Indented, so it runs for each iteration of the loop
    print(i)

    result_sum = result_sum + i
0
1
2
3
4
5
6
7
8
9
print(result_sum)
45
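
The same loop pattern works for any list, not just range(). A short sketch with made-up values, also showing enumerate() and the built-in sum() function:

```python
values = [2, 4, 6]  # made-up numbers for illustration

total = 0
# enumerate() yields the position and the value on each pass
for position, value in enumerate(values):
    print(position, value)
    total = total + value

# The built-in sum() function computes the same total without a loop
print(total == sum(values))

# Indentation decides what belongs to the loop: this line is not
# indented, so it runs once, after the loop has finished
print(total)
```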

Python’s inline syntax

a = range(0, 10) # this is an iterable range object, not a list
print(a)
range(0, 10)

Use the list() function to iterate over it and make it into a list:

# convert it to a list explicitly
a = list(range(10))

# Output it from the cell
a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# or use the INLINE syntax (a list comprehension); this is the SAME
a = [i for i in range(10)]

a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
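
The inline syntax can also transform or filter values as it builds the list. A minimal sketch:

```python
# Square every number from 0 to 9
squares = [i * i for i in range(10)]

# Keep only the even numbers by adding an "if" clause
evens = [i for i in range(10) if i % 2 == 0]

# The inline version of squares is shorthand for this loop
squares_loop = []
for i in range(10):
    squares_loop.append(i * i)

print(squares)
print(evens)
print(squares == squares_loop)
```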

Python functions

def function_name(arg1, arg2, arg3):

    # .
    # .
    # code lines (indented)
    # .
    # .

    result = ...  # placeholder for whatever the code lines compute
    return result
def compute_square(x):
    
    sq = x * x
    return sq
sq = compute_square(10)
print(sq)
100

Keywords: arguments with a default!

def compute_product(x, y=5):
    return x * y
# use the default value for y
print(compute_product(3))
15
# specify a y value other than the default
print(compute_product(5, 10))
50
# can also explicitly tell Python which arguments are which
print(compute_product(5, y=2))
print(compute_product(y=2, x=5))
10
10
print(compute_product(x=5, y=4))
20
# argument names must match the function signature though!
print(compute_product(x=5, z=5))
TypeError: compute_product() got an unexpected keyword argument 'z'
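
A function can have several defaults, and keywords let you override just the ones you need. The sketch below uses a hypothetical describe_ticket() function with made-up amounts, and shows that the TypeError from a misspelled keyword can be caught:

```python
def describe_ticket(fine, late_fee=0, currency="$"):
    """Return a formatted total; the last two arguments have defaults."""
    total = fine + late_fee
    return f"{currency}{total}"


print(describe_ticket(36))                   # both defaults used
print(describe_ticket(36, late_fee=25))      # override one default by name
print(describe_ticket(36, currency="USD "))  # skip late_fee, set currency

# A misspelled keyword raises a TypeError, which can be caught
try:
    describe_ticket(36, latefee=25)
except TypeError as err:
    print("TypeError:", err)
```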

Getting help in the notebook

Use tab auto-completion and the ? and ?? operators

this_variable_has_a_long_name = 5
# try hitting tab after typing this_ 
this_variable_has_a_long_name
# Forget how to create a range? --> use the help message
range?

Peeking at the source code for a function

Use the ?? operator

# Let's re-define compute_product() and add a docstring between """ """
def compute_product(x, y=5):
    """
    This computes the product of x and y

    This is all part of the docstring.
    """
    return x * y
compute_product??
Note

The question mark operator gives you access to the help message for any variable or function.

I use this frequently and it is a great way to understand what a function actually does.
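
The ? and ?? operators are provided by IPython/Jupyter rather than by Python itself. Outside the notebook, the standard library offers similar information; the sketch below uses the inspect module on a re-defined compute_product():

```python
import inspect


def compute_product(x, y=5):
    """
    This computes the product of x and y
    """
    return x * y


# The docstring is stored on the function; getdoc() cleans up the indentation
doc = inspect.getdoc(compute_product)
print(doc)

# The signature shows the argument names and defaults
signature = inspect.signature(compute_product)
print(signature)

# In an interactive session, help(compute_product) prints similar information
```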

The JupyterLab Debugger

You can enable Debugging mode in JupyterLab by clicking on the “bug” icon in the top right:

This should open the Debugger panel on the right side of JupyterLab. One of the most useful parts of this panel is the “Variables” section, which gives you the current values of all defined variables in the notebook.

Tip

For more information on the debugger, see the JupyterLab docs.

Getting more Python help

This was a very brief introduction to Python and Python syntax. We’ll continue practicing and reinforcing the proper syntax throughout the next few weeks, but it can definitely be frustrating. Hang in there!

Important: DataCamp Tutorials

DataCamp is providing 6 months of complimentary access to its courses for students in MUSA 550. Whether you have experience with Python or not, this is a great opportunity to learn the basics of Python and practice your skills.

It is strongly recommended that you watch some or all of the introductory videos below to build a stronger Python foundation for the semester. The more advanced, intermediate courses are also great — the more the merrier!

For more info, including how to sign up, check out the resources section of the website.

Additional Python resources are listed on our course website under “Resources”

https://musa-550-fall-2023.github.io/resource/python.html

In addition to the DataCamp videos, there are links to lots of online tutorials:

  • Introductory level tutorials
  • More in depth tutorials
  • The r/learnpython subreddit has a great wiki of resources
  • The Berkeley Institute for Data Science has also compiled a number of Python resources

One more thing: working outside the notebook

In this class, we will almost exclusively work inside Jupyter notebooks — you’ll be writing Python code and doing data analysis directly in the notebook.

The more traditional method of using Python is to put your code into a .py file and execute it via the command line (known as the Miniforge/Anaconda Prompt on Windows or the Terminal app on macOS).

See this section of the Practical Python Programming tutorial for more info.
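
As a rough sketch of what a slightly larger script might look like, the file below (with the hypothetical name greet.py) wraps its logic in a function and uses the common `if __name__ == "__main__":` guard, so the print statement runs only when the file is executed from the command line:

```python
# greet.py (hypothetical file name)
def make_greeting(name="World"):
    """Build the greeting text."""
    return f"Hello {name}!"


# This block runs only when the file is executed directly,
# for example with: python greet.py
if __name__ == "__main__":
    print(make_greeting())
```

With this layout, the function can also be imported into a notebook without triggering the print.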

The JupyterLab text editor

There is a file called hello_world.py in the repository for week 1. If we execute it, it should print out “Hello World!” to the command line.

First, let’s open up the .py file in the JupyterLab text editor. Double click on the “hello_world.py” item in the file browser on the left:

This will open the file and allow you to make edits. You should see the following:

# Our first Python program
print("Hello World!")
Tip

See the JupyterLab docs for more info on the text editor.

Using the JupyterLab Terminal

To execute the file, we can use the built-in Terminal feature in JupyterLab using the following steps:

  1. Bring up the “Launcher” tab by clicking on the blue button with a plus sign in the upper left.
  2. Click on the “Terminal” button
  3. When the terminal opens, type the following:
python hello_world.py

And you should see the following output:

Hello World!

Code editors

The JupyterLab text editor will work in a pinch, but it’s not usually the best option when writing software outside the notebook. Other code editors will provide a nice interface for writing Python code and some even have fancy features, like real-time syntax checking and syntax highlighting.

My recommended option is Visual Studio Code.

See you next week!

  • No lecture on Monday next week due to Labor Day
  • Next class a week from today on Wednesday 9/6
  • In the meantime:
    • Follow the guide for setting up your local Python environment
    • Check out the Python resources + DataCamp courses
    • Homework #1 will be posted a week from today (Wednesday 9/6) and will be due on 9/20
Content 2023 by Nick Hand, Quarto layout adapted from Andrew Heiss’s Data Visualization with R course
All content licensed under a Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0)
 