# Data Bootcamp: Topic outlines & links

A list of topics with links to material used in class. We expect each topic to take roughly one week – maybe a little more.

The topics in the first half follow **THE BOOK**. At the book link, click the large blue Read button to read online – or download the pdf. Both come with links.

## Topic 1. Data + Python = Magic!

**Handouts:** Outline | Book (Click on blue “Read” button) | Three ideas

**Examples:** Gapminder | cancer screening | Uber in NYC | medical expenditures | mortality | earthquake | Gender pay gap | Fertility | Vaccines

**Summary:** It’s nice to have skills; installing Anaconda; Spyder and Jupyter/IPython; data; questions; idea machines.

## Topic 2. Python fundamentals 1

**Handouts:** Outline | Book chapter | Code Practice #1

**Summary:** Calculations; assignments; strings; lists; tuples; built-in functions; objects; methods; tab completion.

## Topic 3. Python fundamentals 2

**Handouts:** Outline | Book chapter | Code Practice #2

**Summary:** True and False; comparisons; conditionals; slicing; loops; function definitions and returns; dictionaries.
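The items in this summary can be sketched in a few lines. This is a minimal illustration rather than the class code; the variable names and the `squares` function are hypothetical.

```python
# Comparisons produce the Boolean values True and False
x = 7
is_big = x > 5

# Conditionals choose a branch based on a Boolean
if is_big:
    label = "big"
else:
    label = "small"

# Slicing extracts part of a string (or list)
word = "bootcamp"
first_four = word[:4]

# A function definition with a loop and a return value
def squares(numbers):
    """Return a list holding the square of each number."""
    result = []
    for n in numbers:
        result.append(n**2)
    return result

# A dictionary maps keys to values
capitals = {"France": "Paris", "Japan": "Tokyo"}
japan_capital = capitals["Japan"]
```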

## Topic 4. Data input: Packages and Pandas

**Handouts:** Outline | Book chapter | Code (Download “Raw”) | Code Practice 3 (coming soon) (code template)

**Summary:** Packages; import; Pandas; csv files; reading csv/xls files; dataframes; columns; index; APIs.

## Topic 5. Python graphics: Matplotlib fundamentals

**Handouts:** Outline | Book chapter | Code (Download “Raw” as ipynb) | Code Practice A (Download “Raw” as ipynb)

**Summary:** Three approaches to graphics: dataframe plot methods, plot(x,y), and fig/ax objects and methods; lines, scatters, bars, horizontal bars, styles.
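The three approaches can be put side by side as follows. This is a sketch under assumptions: the dataframe is made up, and the `Agg` backend is selected only so the code runs without a display (in Jupyter you would normally skip that line).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})

# Approach 1: dataframe plot method
ax1 = df.plot(x="x", y="y", kind="line")

# Approach 2: plot(x, y) applied to the current figure
plt.figure()
lines = plt.plot(df["x"], df["y"])

# Approach 3: create fig/ax objects, then apply methods to them
fig, ax = plt.subplots()
ax.plot(df["x"], df["y"])
ax.set_title("Approach 3")
ax.set_xlabel("x")
```

The third approach takes a little more typing, but the `fig` and `ax` objects give explicit handles for later changes.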

## Topic 6. Review & applications

**Handouts:** Outline | Code (review | applications)

**Summary:** Exam review, followed by applications to get us thinking about interesting datasets and how to work with them.

## Topic 7. Exam

**Posted after class:** Exam with answers

## Topic 8. Thinking about projects

**Handouts:** Outline | Project Examples | Code (examples | current indicators | demography | Airbnb)

**Summary:** Projects: say something interesting with data. Idea machines. Examples.

## Remaining topics will depend upon interest – yours and ours.

We have almost enough for another course. If there’s something you’d particularly like to see, let us know. The plan is to mix data applications and tools in parallel. These topics are generally shorter than the earlier ones.

## Topic 9. More Pandas: Cleaning

**Handouts:** Outline | Code

**Summary:** Pandas has incredible facilities for managing data. We look at fixing numbers misidentified as strings, managing missing observations, selecting variables and observations, and the `isin` and `contains` methods.
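A minimal sketch of these cleaning steps, assuming a small made-up dataframe (the column names and values are hypothetical, not class data). We assume `contains` refers to the string method reached through the `.str` accessor.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, one value missing
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Japan"],
    "gdp": ["18,037", "2,861", None, "4,383"],
})

# Fix numbers misidentified as strings: strip the commas, then convert
df["gdp"] = pd.to_numeric(df["gdp"].str.replace(",", "", regex=False))

# Manage missing observations: drop rows where gdp is missing
clean = df.dropna(subset=["gdp"])

# Select observations whose value appears in a list
picked = clean[clean["country"].isin(["France", "Japan"])]

# Select observations whose string contains a substring
united = clean[clean["country"].str.contains("United")]
```

Note that France does not survive into `picked`, because its row was dropped for the missing value first.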

## Topic 10. More Pandas: Shaping

**Handouts:** Outline | Code

**Summary:** Here we introduce four key methods for “shaping” our data: `df.set_index`, `df.reset_index`, `df.stack`, and `df.unstack`. When we say shaping we mean manipulating the data so we get specific row and column labels.
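The four methods can be sketched as a round trip on a small made-up dataframe. The column names and values are hypothetical.

```python
import pandas as pd

# Made-up "long" data: one row per country-year
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# set_index: move columns into the row labels
indexed = df.set_index(["country", "year"])

# unstack: move one level of the row labels into the column labels
wide = indexed["value"].unstack("year")

# stack: move the column labels back into the row labels
long = wide.stack()

# reset_index: turn the row labels back into ordinary columns
flat = long.reset_index(name="value")
```

Here `wide` has countries as row labels and years as column labels, which is often the shape we want for plotting.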

## Topic 11. Updating and installing packages

**Handouts:** Book chapter

**Summary:** Using conda, pip, etc. Updating Anaconda, installing Seaborn, Plotly, and Pandas-Datareader.

## Topic 12. More Pandas: Combining & summarizing data

**Handouts:** Code (combining | summarizing)

**Summary:** Combining dataframes (merge, concatenate). Statistics (mean, median, quantiles), categorical variables, grouping data by categories, counts and statistics by category.
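A short sketch of combining and summarizing, assuming two made-up dataframes (the store and region names are hypothetical).

```python
import pandas as pd

# Made-up data for illustration
sales = pd.DataFrame({
    "store": ["x", "x", "y", "y"],
    "amount": [10.0, 20.0, 5.0, 15.0],
})
stores = pd.DataFrame({"store": ["x", "y"], "region": ["east", "west"]})

# Merge: attach each store's region to its sales rows
merged = pd.merge(sales, stores, on="store", how="left")

# Concatenate: stack extra rows underneath the existing ones
more = pd.DataFrame({"store": ["z"], "amount": [30.0]})
stacked = pd.concat([sales, more], ignore_index=True)

# Statistics for the whole column
overall_mean = merged["amount"].mean()
overall_median = merged["amount"].median()
quartiles = merged["amount"].quantile([0.25, 0.75])

# Counts and statistics by category
by_region = merged.groupby("region")["amount"].agg(["count", "mean"])
```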

## Topic. Advanced graphics with Seaborn and Plotly

**Handouts:** Code (Plotly | Seaborn)

**Summary:** We cover more advanced graphics using the Seaborn and Plotly packages. We show how to leverage our knowledge of Matplotlib to jumpstart our use of both. Seaborn makes it easy to construct common, yet sophisticated, graphics with little additional effort. Plotly’s unique features let us do things that are very difficult, or sometimes impossible, to do with Matplotlib.

## Topic. Web scraping

**Handouts:** Code

**Summary:** Python has great tools for scraping data off websites. We will give a *very light* introduction to some of the routines we’ve found useful for doing this.

The next three topics provide some structure for thinking about data. *Distributions* is about the frequencies of various outcomes: stock returns, incomes of individuals, medical expenses, movie grosses. *Dependence* is about connections between two variables, a connection often summarized (incompletely) by their correlation. *Dynamics* is about the relation between a variable at two different dates. Is strong economic growth followed by the same? How do bond ratings evolve?

## Topic. Distributions

**Handouts:** Outline | Code

**Summary:** Some data is usefully described not by (say) its mean or median, but by its range of outcomes. Examples include equity returns, the age distribution of the population, size of firms, and incomes of individuals. We describe distributions with histograms, smoothed histograms (kde’s), and so on. We introduce the Numpy package along the way and use ipywidgets in Jupyter to add interactivity to our code.
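As a rough sketch of the idea, the counts behind a histogram can be computed directly with Numpy. The "returns" below are simulated draws, not real data, and the seed and bin count are arbitrary choices.

```python
import numpy as np

# Simulated "returns": 1000 draws from a standard normal (an assumption, not real data)
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=1.0, size=1000)

# A histogram is a count of observations falling in each bin
counts, edges = np.histogram(returns, bins=20)

# With density=True the bar areas sum to one, like a probability density
density, _ = np.histogram(returns, bins=20, density=True)
total_area = (density * np.diff(edges)).sum()
```

The mean and median are single numbers, while `counts` describes the whole range of outcomes.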

## Topic. Dependence

**Handouts:** Outline | Code

**Summary:**

## Topic. Dynamics

**Handouts:** Outline | Code

**Summary:**

## Topic. Statistics & Machine Learning

**Handouts:** Outline | Code

**Summary:** These are whole subjects, not topics, but we thought a brief overview of their history would be useful. We combine it with an application to multivariate regression using two packages, StatsModels (statistics) and Scikit-Learn (machine learning).