# Data Bootcamp: Undergrad Spring 2018

This page is your key resource for the course. Everything you need is here! Below are links to key documents such as the syllabus, the book, the blog, and my GitHub repository for the class. Moreover, there is a date by date list of topics, and links to material used in each class. **Please watch this site regularly to stay up to date.**

Last update: 1/16/2018

## Where and When

Who: Michael Waugh (instructor), Vineetha Kutty (teaching fellow)

Meeting times: Tuesday and Thursday 11am-12:15pm

Meeting place: KMC 3-70

## Important Links

- **THE SYLLABUS.** All the important details about the course, procedures, important dates, etc.
- **THE BOOK.** The topics in the first half are all in the book. We will follow this closely. At the book link, click the large blue Read button to read online, or download the pdf. Both come with links.
- **THE BLOG.** Remember this course is a data course that uses Python. In THE BLOG, I’ll discuss interesting uses of data that I find on the web and talk through various issues.
- **My GitHub REPOSITORY.** Here I will post notebooks from each class.

## Important Dates

- **Mini-Midterm #1:** **February 22nd, 2018.** Quick info: Python fundamentals 1-2, 30 minutes, in class, open book, open internet if the wireless is up, bring one page of notes.
- **Mini-Midterm #2:** **March 22nd, 2018.** Quick info: Intro to Pandas and Matplotlib, 30 minutes, in class, open book, open internet if the wireless is up, bring one page of notes.
- **Three Project Ideas:** **March 30, 2018.** Jupyter Notebook with three project ideas, briefly fleshed out, and potential data sources. Hard copy due at 5pm.
- **Talk with me about project ideas.** Appointment calendar here.
- **Final Project Proposal + Data Report:** **April 24th, 2018.** Jupyter Notebook with final proposal. More details to be provided. **Hard copy due at 5pm.**
- **Final Project.** Due Date: **End of Day May 14, 2018**

## Week By Week Guide…

## Topic 1. Data + Python = Magic!

**Handouts:** Outline | Book (Click on blue “Read” button) | Three ideas

**Examples:** Gapminder | cancer screening | Uber in NYC | medical expenditures | mortality | earthquake | Gender pay gap | Fertility | Vaccines

**Summary:** It’s nice to have skills; installing Anaconda; Jupyter/IPython; data; questions; idea machines.

## Topic 2. Python fundamentals 1

**Handouts:** Outline | Book chapter | Code Practice #1 (Due 5pm February 2nd, hard copy)

**Summary:** Calculations; assignments; strings; lists; tuples; built-in functions; objects; methods; tab completion.
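As a rough illustration of the ideas listed above, here is a minimal sketch. The variable names and values are made up for this page and do not come from the book or the class notebooks.

```python
# Calculations and assignments
price = 2.50
quantity = 4
total = price * quantity           # arithmetic on numbers

# Strings are objects with methods
course = "data bootcamp"
title = course.title()             # capitalizes each word

# Lists can be changed after creation; tuples cannot
topics = ["strings", "lists"]
topics.append("tuples")            # list method that adds an item
meeting = ("Tuesday", "Thursday")  # a tuple

# Built-in functions work on many kinds of objects
n_topics = len(topics)
print(total, title, n_topics, type(meeting))
```

In Jupyter, typing `topics.` and pressing Tab lists the methods available on the object, which is a quick way to discover things like `append`.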

## Topic 3. Python fundamentals 2

**Handouts:** Outline | Book chapter | Code Practice #2 (Due 5pm February 16, hard copy)

**Summary:** True and False; comparisons; conditionals; slicing; loops; function definitions and returns; dictionaries.
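The pieces in this summary fit together naturally in a few lines. The sketch below is only an illustration: the function name `letter_grade`, the cutoffs, and the scores are all hypothetical.

```python
def letter_grade(score):
    """Return a letter grade for a numeric score (hypothetical cutoffs)."""
    if score >= 90:        # a comparison produces True or False
        return "A"
    elif score >= 80:
        return "B"
    else:
        return "C"

# A dictionary maps keys (names) to values (scores); made-up data
scores = {"ana": 93, "ben": 85, "cat": 71}

# Loop over the dictionary and build a new one
grades = {}
for name, score in scores.items():
    grades[name] = letter_grade(score)

# Slicing takes part of a list: here, the first two names
first_two = list(scores)[:2]

passed = scores["ben"] >= 80
print(grades, first_two, passed)
```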

**Mini-Midterm #1:** **February 22nd, 2018.** Quick info: Python fundamentals 1-2, 30 minutes, in class, open book, open internet if the wireless is up, bring one page of notes.

## Topic 3.5: Updating and installing packages

**Handouts:** Book chapter

**Summary:** Using conda, pip, etc. Updating Anaconda, installing Seaborn, Plotly, and Pandas-Datareader.

## Topic 4. Intro to Pandas

**Handouts:** Outline | Book chapter

**Summary:** Packages; import; Pandas; csv files; reading csv/xls files; dataframes; columns; index; APIs.
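To give a flavor of these steps, here is a small sketch that reads csv data into a dataframe and looks at its columns and index. To keep it self-contained it reads from a string rather than a file or URL, and the numbers are made up.

```python
import io
import pandas as pd

# Made-up csv content; in class this would be a file path or a URL
csv_text = """country,year,gdp
USA,2016,18.7
USA,2017,19.5
CAN,2016,1.5
"""

df = pd.read_csv(io.StringIO(csv_text))  # read csv data into a dataframe
columns = list(df.columns)               # the column labels

df = df.set_index("country")             # use the country column as the index
usa = df.loc["USA"]                      # select rows by index label
print(usa)
```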

## Topic 5. Python graphics: Matplotlib fundamentals

**Handouts:** Outline | Book chapter | Code Practice #3 (Due March 9th)

**Summary:** Approach to graphics focused on the fig/ax objects and methods; lines, scatters, bars, horizontal bars, histograms, styles.
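A minimal sketch of the fig/ax approach, assuming made-up data: create the figure and axes objects first, then call methods on each axis. The `Agg` backend line is only there so the script runs without a display; it is not needed in a Jupyter notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: running as a script)
import matplotlib.pyplot as plt

years = [2014, 2015, 2016, 2017]
values = [1.0, 1.5, 1.2, 1.8]   # made-up numbers

# One figure holding two axes side by side
fig, ax = plt.subplots(1, 2, figsize=(8, 3))

ax[0].plot(years, values)       # line plot on the first axis
ax[0].set_title("Line")
ax[0].set_xlabel("Year")

ax[1].bar(years, values)        # bar chart on the second axis
ax[1].set_title("Bar")

fig.tight_layout()
```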

In class code/lectures:

GDP and its comovement with subcomponents (fundamentals of plotting, line and scatter plots).

Why are some countries rich, others poor? (more histograms, bar charts, fancy scatter plots, data wrangling).

**Mini-Midterm #2:** **March 22nd, 2018.** Quick info: Intro to Pandas and Matplotlib, 45 minutes, in class, open book, open internet if the wireless is up, bring one page of notes.

## Topic 6. Thinking about projects

**Handouts:** Outline | Project Examples | Code (examples | current indicators | demography | Airbnb)

**Summary:** Projects: Say something interesting with data. Idea machines. Examples.

## Topic 7. More Pandas: Cleaning

**Handouts:** Code

**Summary:** Pandas has incredible facilities for managing data. We look at fixing numbers misidentified as strings, managing missing observations, selecting variables and observations, and the `isin` and `contains` methods. Application: What is the price of Guacamole at Chipotle?
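Here is a small sketch of these cleaning steps. The item names and prices are invented for illustration and are not the actual Chipotle data used in class.

```python
import pandas as pd

# Made-up order data; prices arrive as strings and one is missing
orders = pd.DataFrame({
    "item": ["Chicken Bowl", "Chips and Guacamole", "Steak Burrito", "Side of Chips"],
    "price": ["$8.75", "$4.45", None, "$1.69"],
})

# Fix numbers misidentified as strings: strip the "$", then convert
stripped = orders["price"].str.replace("$", "", regex=False)
orders["price"] = pd.to_numeric(stripped, errors="coerce")

# Manage missing observations: drop rows with no price
clean = orders.dropna(subset=["price"])

# contains: rows whose item name includes a substring
guac = clean[clean["item"].str.contains("Guacamole")]

# isin: rows whose item is in a given list
chips = clean[clean["item"].isin(["Side of Chips", "Chips and Guacamole"])]
print(guac)
```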

## Topic 8. More Pandas: Shaping

**Handouts:** Code

**Summary:** Understand and be able to apply the melt/stack/unstack/pivot methods.
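The following sketch shows how these methods relate to one another on a tiny made-up table: `melt` goes from wide to long, `pivot` goes back, and `stack`/`unstack` move labels between the columns and the index.

```python
import pandas as pd

# Wide format: one column per year (made-up values)
wide = pd.DataFrame({
    "country": ["USA", "CAN"],
    "2016": [1.0, 2.0],
    "2017": [3.0, 4.0],
})

# melt: wide -> long, one row per (country, year)
long = wide.melt(id_vars="country", var_name="year", value_name="value")

# pivot: long -> wide, countries as the index and years as columns
back = long.pivot(index="country", columns="year", values="value")

# stack: move the column labels into the index (gives a Series here)
stacked = back.stack()

# unstack: move the inner index level back out to the columns
unstacked = stacked.unstack()
print(long)
```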

## Topic 9. More Pandas: Groupby, Aggregation, Pivot Tables

**Handouts:** Code

**Summary:** Explore the use of `groupby` and related operations.
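As a quick sketch of what `groupby`, aggregation, and pivot tables look like in practice, consider the made-up vote counts below.

```python
import pandas as pd

# Made-up data: votes by state and party
df = pd.DataFrame({
    "state": ["NY", "NY", "CA", "CA", "CA"],
    "party": ["D", "R", "D", "R", "D"],
    "votes": [10, 5, 8, 7, 2],
})

# groupby: split by state, then sum votes within each group
by_state = df.groupby("state")["votes"].sum()

# aggregation: several statistics at once
summary = df.groupby("state")["votes"].agg(["mean", "max"])

# pivot table: states down the side, parties across the top
table = df.pivot_table(index="state", columns="party",
                       values="votes", aggfunc="sum")
print(table)
```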

## Topic 10. More Pandas: Merging

**Handouts:** Code

**Summary:** Often we need to combine data from two or more dataframes. We explore the `merge` feature of Pandas. Along the way we take an extended detour to review methods for downloading and unzipping compressed files.
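A minimal sketch of `merge`, using two small made-up dataframes that share a `county` column. The `how` argument controls which rows are kept, and `indicator=True` adds a column showing where each row came from.

```python
import pandas as pd

# Two made-up dataframes with a common key column
census = pd.DataFrame({"county": ["A", "B", "C"], "income": [50, 60, 70]})
votes = pd.DataFrame({"county": ["B", "C", "D"], "share": [0.4, 0.6, 0.5]})

# inner: keep only counties present in both dataframes
inner = census.merge(votes, on="county", how="inner")

# outer: keep all counties; the _merge column records the source of each row
outer = census.merge(votes, on="county", how="outer", indicator=True)
print(outer)
```

Checking the `_merge` column after an outer merge is a handy way to see which observations failed to match.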

## Topic 11. Census API

**Handouts:** Code

**Summary:** The US government has massive amounts of data that can be easily accessed. We explore this and then merge the census data with election results. Application: Who voted for whom in the 2016 Presidential Election?

## Topic 12. GeoPandas and Mapping

**Handouts:** Code

**Summary:** Here we use the `GeoPandas` package and learn some basic mapping skills.

## Topic 13. More Pandas: Time Series Data

**Handouts:** Code

**Summary:** Time series features of Pandas (when the index is set to a DateTime index).
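To illustrate what a DateTime index makes possible, here is a short sketch with a made-up daily series: slicing by date strings, percent changes, rolling averages, and resampling to a lower frequency.

```python
import pandas as pd

# Made-up daily series with a DateTime index
dates = pd.date_range("2018-01-01", periods=6, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# Slice by date strings (both endpoints are included)
first_three = s["2018-01-01":"2018-01-03"]

# Percent change from one date to the next
growth = s.pct_change()

# Three-day rolling average
rolling = s.rolling(window=3).mean()

# Resample to two-day totals
two_day = s.resample("2D").sum()
print(two_day)
```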

## Topic 14. Putting it All Together…

**Summary:** Examples of projects (from start to finish) with interesting datasets.

## If Time Permits…

## Basic Regression Analysis

**Handouts:** Code

**Summary:** A brief introduction to Regression analysis in Python. We continue to explore Who voted for whom in the 2016 Presidential Election?
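For a sense of what a regression computes, here is a minimal sketch that fits a line by ordinary least squares. It uses NumPy's `lstsq` rather than the packages used in the class code, and the data are constructed to lie exactly on the line y = 2 + 3x so the result is easy to check.

```python
import numpy as np

# Made-up data lying exactly on the line y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix: a column of ones (intercept) next to x (slope)
X = np.column_stack([np.ones_like(x), x])

# Least squares: the coefficients that minimize the sum of squared errors
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
print(intercept, slope)
```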

## More Pandas: Combining & summarizing data

**Handouts:** Code (combining|summarizing)

**Summary:** Combining dataframes (merge, concatenate). Statistics (mean, median, quantiles), categorical variables, grouping data by categories, counts and statistics by category.

## Advanced graphics with Plotly and Seaborn

**Handouts:** Code (Plotly | Seaborn )

**Summary:** We cover more advanced graphics using the seaborn and plotly packages. We show how to leverage our knowledge of matplotlib to jumpstart our usage of these two packages. We also show how seaborn can be used to easily construct common, yet sophisticated graphics with little additional effort. Finally, we show how to leverage plotly’s unique features to do things that are very difficult, or sometimes impossible, to do with matplotlib.

## Web scraping

**Handouts:** Code

**Summary:** Python has great tools for scraping data off websites. We will give a *very light* introduction to some of the routines we’ve found useful for doing this.

The next three topics provide some structure for thinking about data. *Distributions* is about the frequencies of various outcomes: stock returns, incomes of individuals, medical expenses, movie grosses. *Dependence* is about connections between two variables, a connection often summarized (incompletely) by their correlation. *Dynamics* is about the relation between a variable at two different dates. Is strong economic growth followed by the same? How do bond ratings evolve?

## Distributions

**Handouts:** Outline | Code

**Summary:** Some data is usefully described not by (say) its mean or median, but by its range of outcomes. Examples include equity returns, the age distribution of the population, size of firms, and incomes of individuals. We describe distributions with histograms, smoothed histograms (kde’s), and so on. We introduce the Numpy package along the way and use ipywidgets in Jupyter to add interactivity to our code.
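As a small sketch of describing data by its distribution rather than a single number, the code below draws simulated "returns" (random numbers, not real data) with NumPy, bins them into the counts behind a histogram, and computes a few quantiles.

```python
import numpy as np

# Simulated data: 1000 draws from a standard normal (not real returns)
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=1.0, size=1000)

# Histogram: how many observations fall in each of 20 bins
counts, edges = np.histogram(returns, bins=20)

# Quantiles summarize the range of outcomes
q = np.quantile(returns, [0.05, 0.5, 0.95])
print(counts.sum(), len(edges))
```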

## Statistics & Machine Learning

**Handouts:** Outline | Code

**Summary:** These are whole subjects, not topics, but we thought a brief overview of their history would be useful. We combine it with an application to multivariate regression with two packages, StatsModels (statistics) and Scikit-Learn (machine learning).