Data + Python @ NYU Stern

Data Bootcamp: Undergrad Fall 2017

This page is your key resource for the course. Everything you need is here! Below are links to key documents such as the syllabus, the book, and the blog, followed by a date-by-date list of topics with links to the material used in each class. Please check this site regularly to stay up to date.

Where and When

  • Who: Michael Waugh (instructor), Felipe Alves (teaching fellow)

  • Meeting times: Monday and Wednesday, 3:30-4:45pm

  • Meeting place: KMC 5-140

  • THE SYLLABUS. All the important details about the course, procedures, important dates, etc.

  • THE BOOK. The topics in the first half are all in the book. We will follow this closely. At the book link, click the large blue Read button to read online – or download the pdf. Both come with links.

  • THE BLOG. Remember this course is a data course that uses Python. In THE BLOG, I’ll discuss interesting uses of data that I find on the web and talk through various issues.

  • GitHub REPOSITORY. All the material is here: code, past projects, the book, etc.

  • DISCUSSION GROUP. This is where you can get help from your classmates and from me.

  • DUE DATES. A summary of the due dates for code practice, the midterm exam, and the final project.

Important Dates

  • Midterm Exam: November 8, 2017. Quick info: 75 minutes, in class, open book, open internet if the wireless is up, bring one page of notes.

  • Final Project Due Date: End of Day December 21, 2017

Week By Week Guide…

Topic 1. Data + Python = Magic!

Handouts: Outline | Book (Click on blue “Read” button) | Three ideas
Examples: Gapminder | cancer screening | Uber in NYC | medical expenditures | mortality | earthquake | Gender pay gap | Fertility | Vaccines
Summary: It’s nice to have skills; installing Anaconda; Spyder and Jupyter/IPython; data; questions; idea machines.

Topic 2. Python fundamentals 1

Handouts: Outline | Book chapter | Code Practice #1 (Due 5pm September 22, hard copy)
Summary: Calculations; assignments; strings; lists; tuples; built-in functions; objects; methods; tab completion.
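
To make the list above concrete, here is a minimal sketch that touches each item once. The variable names and values are made up for illustration and are not taken from the book chapter.

```python
# Calculations and assignments
price = 4.50
quantity = 3
total = price * quantity

# Strings are objects, and objects come with methods
course = "data bootcamp"
title = course.title()          # capitalize each word

# Lists can be changed in place; tuples cannot
topics = ["strings", "lists"]
topics.append("tuples")         # a list method that modifies the list
exam_date = (2017, 11, 8)       # a tuple is fixed once created

# Built-in functions work on many kinds of objects
n_topics = len(topics)
kind = type(exam_date).__name__

print(total, title, topics, n_topics, kind)
```

In Spyder or Jupyter, typing `course.` and pressing Tab lists the methods available on the string.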

Topic 3. Python fundamentals 2

Handouts: Outline | Book chapter | Code Practice #2 (Due 5pm October 4, hard copy) | Markdown template
Summary: True and False; comparisons; conditionals; slicing; loops; function definitions and returns; dictionaries.
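
As a rough illustration of how these pieces fit together, the sketch below combines a function definition, a loop, a conditional, a dictionary, a slice, and a comparison. The function name `count_words` and the word list are hypothetical.

```python
def count_words(words):
    """Return a dictionary mapping each word to the number of times it appears."""
    counts = {}
    for word in words:              # loop over the list
        if word in counts:          # conditional on a True/False test
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


words = ["data", "python", "data", "magic", "python", "data"]
counts = count_words(words)

first_three = words[:3]             # slicing: items 0, 1, 2
is_common = counts["data"] > 2      # a comparison evaluates to True or False

print(counts, first_three, is_common)
```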

Topic 4. Data input: Pandas

Handouts: Outline | Book chapter | Code Practice #3 (Due October 20) (code template)
Summary: Packages; import; Pandas; csv files; reading csv/xls files; dataframes; columns; index; APIs.
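
A small sketch of the csv-to-dataframe workflow, assuming Pandas is installed. To keep it self-contained, the "file" is a made-up string read through `io.StringIO`; with a real file or URL you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up csv content standing in for a file on disk or on the web
csv_text = """name,year,value
A,2015,1.5
A,2016,2.5
B,2015,3.0
"""

df = pd.read_csv(io.StringIO(csv_text))

columns = list(df.columns)            # column labels
df_indexed = df.set_index("name")     # use one column as the index

print(df.shape)
print(columns)
print(df_indexed.loc["A", "value"])   # all rows whose index label is "A"
```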

Topic 5. Python graphics: Matplotlib fundamentals

Handouts: Outline | Book chapter | Code Practice A (try by November 3) (Download “Raw” as ipynb)
Summary: Approach to graphics focused on the fig/ax objects and methods; lines, scatters, bars, horizontal bars, histograms, styles.
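
The fig/ax approach can be sketched as follows: create a figure and an axis object, then build the graph by calling methods on the axis. The data is made up, and the `Agg` backend line is only there so the script also runs without a display; in Jupyter you can drop it.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig, ax = plt.subplots()            # one figure containing one axis
ax.plot(x, y, label="line")         # line plot
ax.scatter(x, y, label="points")    # scatter plot on the same axis
ax.set_title("Squares")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Bars, horizontal bars, and histograms follow the same pattern with `ax.bar`, `ax.barh`, and `ax.hist`.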

In class code/lectures:

Topic 6. Review & applications

Handouts: Outline | Code (review | applications)
Summary: Exam review, followed by applications to get us thinking about interesting datasets and how to work with them.

Topic 7. Midterm Exam: November 8

Posted after class: Exam with answers

Topic 8. Thinking about projects

Handouts: Outline | Project Examples | Code (examples | current indicators | demography | Airbnb)
Summary: Projects: say something interesting with data. Idea machines. Examples.

Topic 9. More Pandas: Cleaning

Handouts: Code
Summary: Pandas has incredible facilities for managing data. We look at fixing numbers misidentified as strings, managing missing observations, selecting variables and observations, and the isin and contains methods. Application: What is the price of Guacamole at Chipotle?
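
Here is a minimal sketch of those cleaning steps on a tiny made-up menu. The items and prices are invented for illustration and are not the class dataset; `pd.to_numeric` is one of several ways to do the conversion.

```python
import pandas as pd

# Made-up data: prices arrive as strings with a dollar sign, one is missing
df = pd.DataFrame({
    "item": ["Chips and Guacamole", "Chicken Bowl", "Side of Chips", "Steak Burrito"],
    "price": ["$4.00", "$8.00", None, "$9.00"],
})

# Fix numbers misidentified as strings: strip the "$", then convert
df["price"] = pd.to_numeric(df["price"].str.replace("$", "", regex=False))

# Missing observations: count them, then drop the affected rows
n_missing = int(df["price"].isna().sum())
clean = df.dropna(subset=["price"])

# Select observations with contains (substring match) and isin (exact match)
has_guac = df["item"].str.contains("Guacamole")
is_main = df["item"].isin(["Chicken Bowl", "Steak Burrito"])

print(clean[clean["item"].str.contains("Guacamole")])
```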

Topic 10. More Pandas: Merging

Handouts: Code
Summary: Often we need to combine data from two or more dataframes. We explore the merge feature of Pandas. Along the way we take an extended detour to review methods for downloading and unzipping compressed files.
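
A short sketch of `pd.merge` on two made-up dataframes that share a `key` column. The `how` argument controls which keys survive, and `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Made-up dataframes that share some, but not all, keys
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [10, 20, 30]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [1, 2, 3]})

# Inner merge: keep only keys present in both dataframes
inner = pd.merge(left, right, on="key", how="inner")

# Outer merge: keep every key; unmatched cells become missing values
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

print(inner)
print(outer)
```

Checking the `_merge` column after an outer merge is a quick way to see how many observations failed to match.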

Topic 10. More Pandas: Census API and Mapping.

Summary: The US government has massive amounts of data that can be easily accessed. We explore this and then merge the census data with election results. Application: Who voted for whom in the 2016 Presidential Election? As we answer this question, we learn some basic mapping skills.

Topic 11. More Pandas: Regression Analysis

Summary: A brief introduction to regression analysis in Python. We continue to explore the question: Who voted for whom in the 2016 Presidential Election?

Topic: Web scraping

Handouts: Code
Summary: Python has great tools for scraping data off websites. We will give a very light introduction to some of the routines we’ve found useful for doing this.

Topic: More Pandas: Combining & summarizing data

Handouts: Code (combining|summarizing)
Summary: Combining dataframes (merge, concatenate). Statistics (mean, median, quantiles), categorical variables, grouping data by categories, counts and statistics by category.
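
The summarizing half can be sketched on a small made-up dataframe: overall statistics first, then counts and statistics by category with `groupby`.

```python
import pandas as pd

# Made-up data: one categorical variable and one numeric variable
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Statistics for the whole column
overall_mean = df["value"].mean()
overall_median = df["value"].median()
quartiles = df["value"].quantile([0.25, 0.5, 0.75])

# Counts by category
counts = df["group"].value_counts()

# Several statistics by category in one table
by_group = df.groupby("group")["value"].agg(["count", "mean", "median"])

print(by_group)
```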

Topic: Advanced graphics with Plotly and Seaborn

Handouts: Code (Plotly | Seaborn)
Summary: We cover more advanced graphics using the Seaborn and Plotly packages, building on our knowledge of Matplotlib to get started quickly with both. Seaborn makes it easy to construct common yet sophisticated graphics with little additional effort. Plotly’s unique features let us do things that are very difficult, or sometimes impossible, to do with Matplotlib.

Topic: Updating and installing packages

Handouts: Book chapter
Summary: Using conda, pip, etc. Updating Anaconda, installing Seaborn, Plotly, and Pandas-Datareader.

The next three topics provide some structure for thinking about data. Distributions is about the frequencies of various outcomes: stock returns, incomes of individuals, medical expenses, movie grosses. Dependence is about connections between two variables, a connection often summarized (incompletely) by their correlation. Dynamics is about the relation between a variable at two different dates. Is strong economic growth followed by the same? How do bond ratings evolve?

Topic. Distributions

Handouts: Outline | Code
Summary: Some data is usefully described not by (say) its mean or median, but by its range of outcomes. Examples include equity returns, the age distribution of the population, size of firms, and incomes of individuals. We describe distributions with histograms, smoothed histograms (KDEs), and so on. We introduce the NumPy package along the way and use ipywidgets in Jupyter to add interactivity to our code.
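
To show what a histogram computes, here is a sketch using NumPy alone on simulated data (random draws from a normal distribution, not real returns). Plotting is left out; `np.histogram` returns the same counts that a histogram plot would draw as bars.

```python
import numpy as np

# Simulated "returns": 10,000 draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
returns = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Counts of observations falling in each of 20 equal-width bins
counts, bin_edges = np.histogram(returns, bins=20)

# With density=True the bar areas sum to one, like a smoothed density would
density, _ = np.histogram(returns, bins=20, density=True)
bin_widths = np.diff(bin_edges)
total_area = (density * bin_widths).sum()

print(counts)
print(total_area)
```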

Topic. Statistics & Machine Learning

Handouts: Outline | Code
Summary: These are whole subjects, not topics, but we thought a brief overview of their history would be useful. We combine it with an application to multivariate regression with two packages, StatsModels (statistics) and Scikit-Learn (machine learning).
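
As a rough idea of what a multivariate regression does, the sketch below fits one by least squares on simulated data. Note the swap: it uses NumPy's `np.linalg.lstsq` rather than StatsModels or Scikit-Learn, so it is not the class code, only the underlying calculation. The data is generated with known coefficients so we can see them recovered.

```python
import numpy as np

# Simulated data with known coefficients: y = 1 + 2*x1 - 3*x2 (no noise)
rng = np.random.default_rng(seed=0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix: a column of ones for the intercept, then the regressors
X = np.column_stack([np.ones(n), x1, x2])

# Least squares solution: intercept and the two slope coefficients
coef, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(coef)
```

StatsModels and Scikit-Learn wrap the same calculation and add standard errors, diagnostics, and prediction tools on top.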