October 15, 2014 by Joshua Wright
EMSI provides a composite dataset that integrates over 90 federal and state labor market data sources into one robust database. This is the foundational dataset that drives our suite of products and services. EMSI data provides insight into regional economies through its look at industries, occupations, postsecondary training programs, demographics, wages, and more. We provide this data at the state, county, and metro area levels, with ZIP code estimates available for core data (employment, earnings, demographics). We update all of this data four times per year.
In this article, we’ll provide a brief explanation of the sources we use, how we deal with suppressions, and how we link disparate datasets. To dig deeper on EMSI data and where it comes from, we encourage you to participate in the EMSI Certification Program.
EMSI starts by downloading data from government sources like the Bureau of Economic Analysis (BEA), the U.S. Census Bureau, the Bureau of Labor Statistics, and others. From these sources come particular datasets (see our complete list).
When we first receive these datasets, they are large, they cover different geographies, they sometimes report ranges rather than specific numbers, and they vary in level of detail. The biggest hurdle we run into is suppressions.
“Suppressions” is EMSI’s term for data points that are not disclosed in the datasets we receive. The government organizations that publish the data suppress these values to comply with laws and regulations that protect the privacy of the businesses that report to them. These organizations publish the datasets primarily for statistical purposes.
Think of it like a Sudoku puzzle. Some numbers are showing, but there are also a lot of empty cells. This is where EMSI’s algorithms come in: they replace the suppressions with mathematically educated estimates. We use numbers that we know from one dataset to inform our estimate for another. We do this many times for each geographic area, at every level from the nation down to the ZIP code, and we repeat the process every quarter. These algorithms have taken EMSI years to develop.
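To make the Sudoku analogy concrete, here is a minimal sketch of one way a suppressed cell could be estimated. It is not EMSI’s actual algorithm, which this article does not describe in detail. It assumes a published total for a parent area, a mix of disclosed and suppressed values for the areas inside it, and a second dataset that supplies rough weights for the suppressed areas. The function name, the area names, and all numbers are hypothetical.

```python
def fill_suppressed(total, values, weights):
    """Estimate suppressed values so that all values sum to the known total.

    total   -- published figure for the parent area (assumed to be disclosed)
    values  -- dict of area -> number, or None where the value is suppressed
    weights -- dict of area -> figure from a second dataset, used only to
               split the undisclosed remainder among the suppressed areas
    """
    known_sum = sum(v for v in values.values() if v is not None)
    # Whatever the disclosed cells do not account for belongs to the gaps.
    residual = max(total - known_sum, 0)
    suppressed = [area for area, v in values.items() if v is None]
    weight_sum = sum(weights.get(area, 0) for area in suppressed)

    filled = dict(values)
    for area in suppressed:
        if weight_sum > 0:
            share = weights.get(area, 0) / weight_sum
        else:
            # No auxiliary information: split the remainder evenly.
            share = 1 / len(suppressed)
        filled[area] = residual * share
    return filled


# Illustrative numbers only: a state total with two suppressed counties.
county_jobs = {"County A": 600, "County B": None, "County C": None}
other_source = {"County B": 30, "County C": 10}
estimates = fill_suppressed(1000, county_jobs, other_source)
print(estimates)
```

In this example, the 400 jobs not accounted for by County A are split three to one, so County B receives 300 and County C receives 100. A real process would repeat this kind of step across many overlapping totals (industry by county, county by state, and so on) until the estimates agree with one another.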
After we have gathered the data and gone through our suppression process, it’s time to connect the data. A large part of our work uncovering suppressions happens when we compile industry data, which covers 1,100 detailed industries for every county and ZIP code in the U.S. After we have completed that portion of the process, we run the data through a staffing pattern, which tells us how occupations are distributed across those industries. Many industries, from hospitals to doctors’ offices to your local school, might employ a nurse, for example. We use these staffing patterns to discover how these jobs are distributed. In the end, we connect these 1,100 industries to over 800 occupations. The final step is connecting each occupation to the training programs tied to it, which we are able to do using education completion data from the National Center for Education Statistics.
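The staffing pattern step can be pictured as a simple calculation: each industry’s job count is multiplied by the share of that industry’s workforce found in each occupation, and the results are summed by occupation. The sketch below illustrates that idea only. The industry names, job counts, and shares are invented for illustration, and the function name is hypothetical.

```python
def occupation_jobs(industry_jobs, staffing_pattern):
    """Distribute industry job counts across occupations.

    industry_jobs    -- dict of industry -> number of jobs
    staffing_pattern -- dict of industry -> {occupation: share of industry jobs}
    """
    totals = {}
    for industry, jobs in industry_jobs.items():
        for occupation, share in staffing_pattern.get(industry, {}).items():
            # The same occupation can appear in many industries, so accumulate.
            totals[occupation] = totals.get(occupation, 0.0) + jobs * share
    return totals


# Invented figures for one region.
industry_jobs = {
    "Hospitals": 2000,
    "Offices of Physicians": 500,
    "Elementary and Secondary Schools": 1000,
}
staffing_pattern = {
    "Hospitals": {"Registered Nurses": 0.30, "Medical Secretaries": 0.05},
    "Offices of Physicians": {"Registered Nurses": 0.10, "Medical Secretaries": 0.20},
    "Elementary and Secondary Schools": {"Registered Nurses": 0.01, "Teachers": 0.50},
}

for occupation, jobs in sorted(occupation_jobs(industry_jobs, staffing_pattern).items()):
    print(f"{occupation}: {jobs:.0f}")
```

Here the nurse total draws on all three industries, which mirrors the point above that hospitals, doctors’ offices, and schools may all employ nurses.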
Put simply, EMSI data comes from many government sources. We take all of these different sources and use the strengths of one set to overcome the weaknesses of another until we have one comprehensive, consistent, and complete dataset.