Data + Python @ NYU Stern

Data Bootcamp: Data sources & applications

There’s an enormous amount of public data available online: data about countries, about markets, about individuals, and about companies. Here are some of our favorites; we use most of them in class. We link to a larger but less well organized list at the end.

Data about countries

Go-to sources:

  • FRED. Large collection of easy-to-use time series data from the St Louis Fed. Comes with online graphing tools, Excel plug-in, etc. We use the Pandas tool. One thing to keep in mind: it’s a mess if you mix data with different “frequencies” (monthly, quarterly, annual).

  • World Bank. Annual data on the economic and social environments of a broad range of countries. You can download the whole thing as a large csv or (our preference) use the Pandas tool. The latter gives us doubly indexed dataframes, with observations indexed by year and country.
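Two points from the list above can be sketched with made-up numbers: what goes wrong when monthly and quarterly series are mixed, and what a doubly indexed dataframe looks like. This is a minimal illustration, not the code we use in class. The series names (unemp, gdp, gdp_pc) and all values are hypothetical, and no data is downloaded.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for downloaded series (all values are made up).
monthly_idx = pd.date_range("2015-01-01", periods=12, freq="MS")
unemployment = pd.Series(np.linspace(5.7, 5.0, 12), index=monthly_idx, name="unemp")

quarterly_idx = pd.date_range("2015-01-01", periods=4, freq="QS")
gdp = pd.Series([100.0, 101.0, 102.5, 103.0], index=quarterly_idx, name="gdp")

# Naive combination: the quarterly series is missing in two months out of three.
mixed = pd.concat([unemployment, gdp], axis=1)

# One possible fix: average the monthly series within each quarter, then combine.
unemp_q = unemployment.resample("QS").mean()
aligned = pd.concat([unemp_q, gdp], axis=1)

# A doubly indexed dataframe: one row per (country, year) pair.
wb = pd.DataFrame(
    {
        "country": ["Brazil", "Brazil", "China", "China"],
        "year": [2013, 2014, 2013, 2014],
        "gdp_pc": [1.0, 2.0, 3.0, 4.0],  # placeholder values, not real data
    }
).set_index(["country", "year"])

china = wb.xs("China", level="country")    # all years for one country
by_year = wb["gdp_pc"].unstack("country")  # years as rows, countries as columns
```

Averaging within the quarter is only one choice. Taking the last monthly value of each quarter is another, and which one makes sense depends on the variable.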

Others we like:

  • WEO. The IMF’s World Economic Outlook comes out twice a year. Annual macroeconomic data for most countries, 1980 to 5-10 years in the future. We read the whole thing from their spreadsheet link.

  • PWT. The Penn World Table is the best single source of basic macroeconomic data presented on a comparable basis (“PPP adjusted”) for most countries. Annual from 1950 to (roughly) 3-4 years ago. We read the whole thing from their spreadsheet link.

  • UN Population data. Annual data for most countries of the population by age. Includes estimates from 1950 and projections to 2100. We read the whole thing from their spreadsheet link. Also data on births (fertility) and deaths (mortality).

Data about financial markets

  • Fama-French equity returns. The leading source of equity returns for investment research, courtesy of Gene Fama and Ken French. Text files on Ken French’s website are easily read into Excel. The Pandas tool is even better – when it works.

  • Yahoo and Google finance. Pandas also has tools for accessing daily stock prices and other financial data.

  • Quandl. A nice aggregator of economic and financial information. Uses the Quandl package, which comes with Anaconda. Much of it is free, but they also serve as an interface to paid data subscriptions.

Data about individuals

There is lots of survey data online, which gives us individual outcomes (the employment status of people with specific characteristics, for example) as well as the usual average outcomes (the unemployment rate). That allows us to compare outcomes of various groups: rich and poor, young and old, black and white, and so on. Anthony Damico’s asdfree collection includes an extensive list with descriptions. Gianluca Violante’s guide to micro data is a similar list focused on US sources. It’s aimed at PhD students, but you should get the idea. These sources are not necessarily easy to use, but they’re incredibly informative. Keep in mind that we have experts on hand to help with any that interest you.
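To make the distinction between individual and average outcomes concrete, here is a small sketch with invented records. The column names (age_group, status) and the eight people are hypothetical. Real surveys also come with sampling weights, which this example ignores for simplicity.

```python
import pandas as pd

# Hypothetical individual-level records, one row per person.
people = pd.DataFrame(
    {
        "age_group": ["young", "young", "young", "young", "old", "old", "old", "old"],
        "status": [
            "employed", "unemployed", "employed", "not in labor force",
            "employed", "employed", "employed", "unemployed",
        ],
    }
)

# The unemployment rate is computed over the labor force only.
labor_force = people[people["status"] != "not in labor force"]
unemployed = labor_force["status"] == "unemployed"

# Average outcome: one number for everyone.
overall_rate = unemployed.mean()

# Group comparison: the same calculation within each age group.
rate_by_group = unemployed.groupby(labor_force["age_group"]).mean()
```

With individual records in hand, swapping age_group for any other characteristic in the survey gives the same comparison for a different set of groups.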

Here are some we have used:

  • ACS. The American Community Survey from the US Census covers demography (age, sex, ethnicity, location), economics (employment and income), education, and many other subjects. The Public Use Microdata Sample (PUMS) contains individual responses. This guide was written for journalists. Ari Lamstein has a shorter guide to navigating the universe of Census surveys. The Minnesota Population Center has a nice user interface for the ACS and other micro-data sources.

  • CPS. The Current Population Survey collects information about employment status, income, and a broad range of demographic variables (age, education, ethnicity). The Minnesota interface is useful here, too.

  • ATUS. The American Time Use Survey describes how people spend their time: employed, doing housework, watching tv, etc. This article summarizes academic work done on similar surveys in many countries. The Times is unusually fond of this survey.

  • MEPS. The Medical Expenditure Panel Survey is the leading source of information about individual healthcare, including insurance status and expenditures.

Miscellaneous other sources

Some that appeal to us, but please send suggestions:

  • Kaggle datasets. Kaggle, the data competition outfit, has just opened a datasets section that comes with data, documentation, coding environments, and forums. More on their blog.

  • Airbnb. Data on locations, rentals, and reviews. Chase loves this. Good input for a map project?

  • NYC Open Data. Data collected by the City of New York. There’s too much to summarize, but it includes taxis (every taxi ride in the city), restaurant inspections, and much much more. I Quant NY has some applications. 538 combined the taxi data with similar information about Uber, which they posted on their repo.

Data applications

Data journalism:

  • ESPN’s 538 Blog. The high end of data journalism. They often post their data as csv’s in their data repository.

  • NYT’s Upshot. Great graphics, including these two examples. They list sources, but don’t typically post data.

  • Tim Taylor’s Conversable Economist blog. Tim’s a former journalist, so a better writer than most economists. He has a daily post about a topical economic issue, often with graphs we can use to track down data sources. If you don’t recognize a source, ask us about it.

Graphics:

  • Our World in Data. A website devoted to data visualizations. There’s an economic development tenor, but they cover a broad range of topics: population, energy, education, and much much more.

  • Flowing Data. Nathan Yau’s daily graphic. A good source of ideas and advice.

  • VizWiz. Andy Kriebel’s “visualization” blog. A steady stream of examples and advice, including the invaluable Makeover Monday. Tagline: “Friends Don’t Let Friends Use Pie Charts.”

  • Data is beautiful. On Reddit. Relatively unfiltered, but a good source of ideas.

More

This is not for the timid, but we have a huge collection of data sources and applications. Get a cold drink and a comfy chair and see what strikes your fancy. Active investing? Movie grosses? Sports? College Scorecard? Shooting deaths? All this and more. Similar courage is called for if you go to Awesome Public Datasets. There’s way too much there, but one advantage is that it goes beyond economics and finance.