Should We Offer a Data Science Program?

A New Way to Discover Program Niches
By Dr. Yustina Saleh and Rob Sentz

The massive amount of data generated by modern computing power has in turn generated intense demand for data scientists in many diverse industries and occupations. To better understand this emerging issue and how to respond to it, Emsi applied a new methodology that evaluates and categorizes the major skills emanating from employer demand. Our goal is to provide colleges, universities, and other workforce professionals with actionable insight to inform program development around this and other critical, highly nuanced employer needs.

Introduction

Eastern Washington University (EWU) recently announced a partnership with Microsoft to offer a degree in integrated data analytics—the only program in the country with this integration. Students in the final year of EWU’s data analytics undergraduate program will take Microsoft’s 10-course Professional Program in Data Science, meaning they’ll graduate with a BS in data analytics and a Microsoft certificate.

The Microsoft-EWU partnership illustrates what can happen when employers and higher education team up to design specialized programs to meet niche market needs. Obviously, such alignment isn’t feasible for every area of study (we wouldn’t want Microsoft developing English curriculum), but for specialized skill needs, such partnerships can be very effective.

How can other institutions develop a skills-based program for high-priority, intensely specialized needs like data science? Is this possible without a relationship with a major employer like Microsoft? How do you know if your college should create a data science program (or any program) at all?

Takeaways

  • Using our own data science skills, we analyzed all job postings (aka real-time labor market data) that mention data science, and discovered that the skills in this area tend to cluster in specialized “vertical lanes” that are unique to specific industries and employers, and less specialized “horizontal lanes” that are the core skills relevant to all industries and employers across a region.
  • Nationally, data science is a broad skillset defined by skill clusters in four key areas: Analytics, Software/Web Apps, Business Intelligence, and Statistical Modeling. Regionally, the skill clusters vary tremendously, based on local industries and employers.
  • This has big implications for course design. When considering new programs or courses focused on analytics and data science, colleges and universities can use such analysis to refine and differentiate their courses in order to meet specific economic niches, provide students with better value, and satisfy employer needs.

Let’s explore the answers by digging into the fascinating world of data science.

Defining Data Science

Data science is a byproduct of millions of Americans using our computers, phones, internet, and other ubiquitous technologies to shop, bank, communicate, travel, and find one another. The result is massive treasure troves of raw digital data—“big data”—that must be managed, processed, and interpreted. That’s where data scientists come in. The ability to turn this data into useful information is a key skillset that impacts almost every industry and influences a staggering variety of occupations.

To learn more, check out this data scientist overview by SAS, the global leader in data science tools and education.

While data science manifests in many areas, it is fundamentally a mix of three disciplines: 1) a “horizontal” skill (spanning much of the job market) like mathematics, 2) a “vertical” skill (highly specialized) like computer programming, and 3) subject matter expertise driven by a particular industry.

Figure 1: Data Science Competency

For the past few months, we’ve been digging through job postings to analyze the structure and function of data science in the labor market. Figure 2 shows the top four data science skill clusters at the national level, organized into handy “swim lanes”: Analytics, Software/Web Apps, Business Intelligence, and Statistical Modeling. Within each lane, you have the primary skills that characterize that cluster. These swim lanes define the work of data scientists across every sector in our economy, and help explain the skill needs of employers by region. While most data scientists need familiarity with at least a few of these clusters, the concentration will vary according to experience, industry sector, and region.

Figure 2: The National Data Science Skills Cluster

1. Analytics – The use of advanced statistical techniques, data visualization, and, in many cases, web data applications to discover and interpret large datasets. The emphasis is on data discovery and fine-tuning, which has significant ramifications on how businesses measure success.

2. Software/Web Apps – The development of web applications that sit on the large datasets and help communicate with external users via user-friendly dashboards and tools.

3. Business Intelligence – The use of processes, methods, measurements, and systems to track, analyze, and forecast key performance indicators (KPIs) over time.

4. Statistical Modeling – The use of mathematical models that allow for standardized processes and turn data into generalizable findings using statistics and probability distributions. Such models help describe, explain, and predict certain phenomena.

Two lanes not included on the graphic are Big Data/Cloud Computing and Computer Programming.

Big Data/Cloud Computing refers to accessing and processing large amounts of disparate (frequently unstructured) data, and using cloud technologies to process and store the data. Computer Programming refers to scaling and automating solutions to data problems through computer algorithms. These two lanes are good examples of skill clusters that form their own vertical columns, but also show up horizontally across other lanes. More on this later.

What the Four Lanes Tell Us

Each lane represents a unique area where skills coalesce around a particular need (Analytics, Software, Business Intelligence, etc.). The lanes occur in order (left to right) based on their differentiation in the region. For instance, the most prominent skill cluster within data science at the national level is Analytics, hence Analytics is listed first.

The X (horizontal) axis shows the skill frequency across job postings. If a skill shows up to the right in the lane, the skill occurs more frequently in job postings. If a skill shows up to the left, it is less frequent.

Keep in mind—all the skills in the lane are significant. The right/left orientation simply visualizes the way some skills pop up in job postings more than others. For instance, in the Business Intelligence lane, data warehousing and data integration occur to the right, which means they show up more frequently in job postings. Data governance and metadata show up further to the left, indicating that they do not show up as frequently in the business intelligence postings.

The Y (vertical) axis shows the strength of the relationship between the skill itself (like digital marketing) and the cluster (Analytics). The higher the correlation, the higher the skill appears in the lane. When a skill shows up lower in the lane, it doesn’t mean the skill is less important; rather, it means the skill simultaneously explains/characterizes the lane itself and occurs across other lanes.

In the Analytics lane, for instance, digital marketing is high because it strongly correlates to the Analytics cluster; it explains the requirements and needs of employers within Analytics. Mathematical optimization, on the other hand, is also significant in explaining these requirements and needs, but it is less distinct to the Analytics cluster because it shows up in other clusters as well. Hence, mathematical optimization appears lower in the Analytics lane. Both are important, but one (digital marketing) is more specialized while the other (mathematical optimization) occurs more frequently in other lanes.
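The two axes can be made concrete with a toy calculation. The sketch below is only an illustration under stated assumptions: the postings and lane labels are invented, and the `lift` score (a skill's share inside the lane divided by its share across all postings) is a hypothetical stand-in for the skill-to-cluster relationship, since this paper does not specify how that relationship is measured.

```python
from collections import Counter

# Hypothetical toy postings: (lane label, set of skills mentioned).
POSTINGS = [
    ("Analytics", {"digital marketing", "SEO", "mathematical optimization"}),
    ("Analytics", {"digital marketing", "marketing automation"}),
    ("Analytics", {"digital marketing", "mathematical optimization"}),
    ("Statistical Modeling", {"regression analysis", "mathematical optimization"}),
    ("Statistical Modeling", {"forecasting", "mathematical optimization"}),
    ("Business Intelligence", {"data warehousing", "ETL"}),
]


def lane_metrics(postings, lane):
    """Return {skill: (frequency_in_lane, lift)} for one lane.

    frequency_in_lane stands in for the X axis: the share of the lane's
    postings that name the skill.
    lift is an assumed stand-in for the Y axis: the in-lane share divided
    by the skill's share across all postings.
    """
    in_lane = [skills for label, skills in postings if label == lane]
    lane_counts = Counter(s for skills in in_lane for s in skills)
    all_counts = Counter(s for _, skills in postings for s in skills)
    metrics = {}
    for skill, count in lane_counts.items():
        freq = count / len(in_lane)
        overall = all_counts[skill] / len(postings)
        metrics[skill] = (freq, freq / overall)
    return metrics


analytics = lane_metrics(POSTINGS, "Analytics")
# List skills from most to least distinctive to the lane.
for skill, (freq, lift) in sorted(analytics.items(), key=lambda kv: -kv[1][1]):
    print(f"{skill:28s} frequency={freq:.2f} lift={lift:.2f}")
```

In this toy data, digital marketing scores a higher lift than mathematical optimization because the latter also appears in the Statistical Modeling postings, which mirrors the high/low placement described above.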

What can we do with this? A college can use this cluster order to pinpoint critical areas of specialization. Once you identify the skills common to multiple clusters, you have what could form the core courses in a data science program.

Figure 3. Vertical and Horizontal Skills in Data Science

Figure 3 above shows another way to consider the skills within data science. We identified what we referred to earlier as “vertical” skills—highly specialized skills tied to particular employers and industries. We also identified “horizontal” skills—the core, foundational skills found across all jobs and industries in a particular region (in this case, the entire US).
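One simple way to separate the two kinds of skills is to count how many lanes a skill appears in. The following is a rough sketch, not the method behind Figure 3: the lane contents are hypothetical and abridged, and the `min_lanes` threshold is an assumed parameter chosen for illustration.

```python
# Hypothetical, abridged lane contents for illustration only.
LANES = {
    "Analytics": {"SEO", "digital marketing", "data visualization", "statistical software"},
    "Software/Web Apps": {"HTML", "CSS", "data visualization", "cloud computing"},
    "Business Intelligence": {"ETL", "data warehousing", "data visualization", "cloud computing"},
    "Statistical Modeling": {"regression analysis", "forecasting", "statistical software", "data visualization"},
}


def split_skills(lanes, min_lanes=2):
    """Split skills into horizontal (found in >= min_lanes lanes) and vertical."""
    lane_count = {}
    for skills in lanes.values():
        for skill in skills:
            lane_count[skill] = lane_count.get(skill, 0) + 1
    horizontal = {s for s, n in lane_count.items() if n >= min_lanes}
    vertical = set(lane_count) - horizontal
    return horizontal, vertical


horizontal, vertical = split_skills(LANES)
print("Horizontal (core):", sorted(horizontal))
print("Vertical (specialized):", sorted(vertical))
```

Raising `min_lanes` narrows the core list to skills that span more lanes; the right cutoff would depend on the region and the data.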

Let’s unpack this by taking a closer look at some of the major vertical skills in the national data science skill cluster. Then we’ll dig into the horizontal skills.

Vertical Data Science Skills

Lane 1 - Analytics – In both Figure 2 and Figure 3, we see that Analytics is the most distinctive skillset for data science at the national level. This means Analytics is the most coherent skillset within the larger data science skill cluster. (As we shall see, the order of these lanes switches based on the region or industry. Analytics won’t always be the No. 1 lane.) The order of the lanes has more to do with the lane’s distinctiveness or differentiation in the region, not its importance. (All the lanes are important—the ordering just tells us which lanes are more distinct for the region in question.)

When we zoom in on the lane itself, we observe that it is defined by a tight grouping of skills: search engine optimization (SEO), marketing analytics, personalized advertising/marketing, marketing automation, and digital marketing. All of these are, not surprisingly, related to understanding and enhancing consumer/user interaction through online shopping, which is where the bulk of job posting activity occurs.

Lane 2 - Software / Web Applications – After the need for people who can crunch the data comes another lane, closely tied to Analytics. Software and Web Apps turns data into actionable intel for organizations to use. There’s tremendous need for skills like UI/UX design, HTML, XML, CSS, Agile software development, and service-oriented architecture. Web applications are critical because they help people derive insight from data.

Lane 3 - Business Intelligence – The third lane is about developing data architectures and infrastructure, and establishing proper data governance frameworks to ensure data quality and security. It’s the job of the business intelligence team to facilitate the movement of data across various data systems and to write algorithms that track and measure key performance indicators (KPIs). Skills like ETL (extract, transform, load) and TCP (transmission control protocol) are important in the cluster. Proficiency in reporting tools (like SAP Crystal Reports and Oracle Transactional Business Intelligence) is a must. Many Business Intelligence skills reside within IT departments.

Lane 4 - Statistical Modeling – Here we see what many people associate with data science: math and statistical modeling skills like regression analysis, factor analysis, and forecasting. This lane is also characterized by skills and knowledge in tools used by statisticians: R, SAS, IBM SPSS, etc. While this lane is still vertical (specialized), many of the skills also occur in other lanes, which leads us to our next point.

Horizontal (or Core) Skills

Now let’s talk about the horizontal skills found in our nationwide cluster. While vertical skills cluster together according to unique needs driven by industries (like Amazon’s specialized needs for people who can analyze a lot of consumer transactions), horizontal skills manifest more laterally. They exist across all lanes and are not isolated in any one industry.

The takeaway? These are skills that higher education and training should focus on when preparing students for the job market.

Data Management and Analysis – Data management, a prerequisite to any data project, is focused on collecting, cleaning, and synthesizing data from different sources. Data analysis follows the data cleansing and transformation process, and involves discovering and analyzing a phenomenon or business problem.

Business Analytics – This mix of statistical analysis and operations analysis is similar to business intelligence. However, unlike business intel, which often resides in the worlds of IT and product, business analytics usually lives in operations or finance.

Cloud Computing (or distributed storage and processing) – Here’s the chief big data skill: the ability to use and apply cloud computing techniques and tools to manage and analyze big data.

Open Source Technologies – Analysts are expected to be proficient in finding the right type of open source technology and navigating typical challenges (reliability, efficiency, etc.). Examples of open source tools: R, Python (statistical programming), Apache (cloud computing), Talend (data integration), plot.ly, and googleVis (data visualization).

Data Visualization – This is communicating the results of data analysis through visualization tools like Tableau, Qlik, FusionCharts, HighCharts, or Domo.

Statistical Software – This is a vertical skill within the Statistical Modeling lane, but it is also a key horizontal skill for all data science pros. It entails proficiency in statistical software such as SAS, R, SPSS, or Stata.

The Impact of Amazon on the Data Science Cluster

What drives these national trends? That would be Amazon. With its voracious appetite for workers proficient in analytics, software, business intelligence, and statistical analysis (all four lanes), the mammoth online retailer is shaping the data science cluster across the US. Figure 4 below provides a quick look at Amazon’s dominance in job postings for data science. No other company gets close to this volume.

Figure 4: Top Companies Looking For Data Science Skills

What can we do with this insight? The first lesson is that if we used the national aggregate of all data science job postings, we’d essentially be responding to the needs at Amazon. That’s not necessarily bad (especially if you have a relationship with Amazon like EWU has with Microsoft, or if you want Amazon to locate in your backyard), but it might lead you astray if you wanted to develop a region-specific program.
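To see how much a single employer can tilt an aggregate, one could compare skill rankings with and without that employer's postings. The sketch below uses made-up posting records with placeholder employer names, and the helpers `employer_share` and `top_skills` are hypothetical; it illustrates the check and does not reproduce the data behind Figure 4.

```python
from collections import Counter

# Made-up records for illustration: (employer, skills named in the posting).
POSTINGS = [
    ("Employer A", ["marketing analytics", "SEO"]),
    ("Employer A", ["marketing analytics", "digital marketing"]),
    ("Employer A", ["marketing analytics"]),
    ("Employer B", ["predictive analytics", "data mining"]),
    ("Employer C", ["geospatial analysis", "data mining"]),
]


def employer_share(postings):
    """Share of all postings attributable to each employer."""
    counts = Counter(employer for employer, _ in postings)
    return {employer: n / len(postings) for employer, n in counts.items()}


def top_skills(postings, exclude=None, n=2):
    """Most frequently named skills, optionally leaving one employer out."""
    counts = Counter(
        skill
        for employer, skills in postings
        if employer != exclude
        for skill in skills
    )
    return [skill for skill, _ in counts.most_common(n)]


print(employer_share(POSTINGS)["Employer A"])       # one employer's share of postings
print(top_skills(POSTINGS))                         # ranking with every employer included
print(top_skills(POSTINGS, exclude="Employer A"))   # ranking once that employer is removed
```

If the top skills change sharply when one employer is excluded, the aggregate is mostly describing that employer, which is the caution raised above about relying on national totals.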

So, the question is, how could we use this insight to tailor a data science program to the needs of a specific region? Let’s demonstrate using New York City and Washington D.C. as examples.

Data Science in New York City

The data science cluster in New York City (Figure 5 below) is quite similar to the nation. This isn’t overly surprising. We would expect an economy as big as NYC’s to roughly approximate the nation. The top three lanes are Software/Web Apps, Analytics, and Business Intelligence. Data science jobs are dominated by the usual tech suspects: Amazon, Oracle, and IBM.

Figure 5. The New York City Data Science Skills Cluster

However, there are a couple key differences that alter how we should approach data science in the Big Apple. Take a look at NYC’s fourth lane, Machine Learning. In general, this involves developing systems that can use data to learn continuously, detect new trends without human intervention, and look for trends programmatically. This cluster includes skills like text mining, data mining, and natural language processing (NLP).

In NYC particularly, machine learning has an interesting twist: most of the machine learning job postings are in financial services. This makes sense, given the presence of JPMorgan Chase, Bloomberg, Morgan Stanley, Citigroup, AIG, KPMG and others (see Figure 6 below). NYC’s high concentration of financial services companies is creating a special niche and differentiated data science skills cluster.

Here’s where the rubber meets the road. Institutions that want to offer a data science program in New York should consider including finance, actuarial, and investment-related coursework—assuming postsecondary providers desire to meet the needs of the regional economy, rather than those of the nation as a whole.

Figure 6. New York City Metro Area: The Top Companies Looking for Data Science Skills

Figure 7. New York City Metro Area: Vertical and Horizontal Skills in Data Science

Now let’s break down the vertical and horizontal skills for NYC. See Figure 7 above.

1. Software/Web Apps – Software is NYC’s dominant lane. It is driven by the need for JavaScript, HTML, XML, CSS, Agile software development, UI design, and software planning.

2. Analytics – The Analytics lane is similar to what we saw at the national level: digital analytics, SEO, marketing analytics, demand generation, and marketing automation.

3. Business Intelligence – This lane is also similar to what we saw for the nation: high demand for people who can make data easily and securely accessible.

4. Machine Learning – Here we see results of NYC’s concentration of financial companies. Data mining, text mining, natural language processing (NLP), and predictive analytics result largely from financial companies’ demand for workers who can detect patterns within massive datasets.

Perhaps most importantly, our analysis reveals a unique set of horizontal, core skills for New York City: financial services and investment banking. This indicates that the four vertical lanes (Software, Analytics, Business Intelligence, and Machine Learning) rest on top of finance and investment banking. Hence, data scientists in NYC should be well-versed in these two subjects. These skillsets also jibe with the other horizontal skills we saw at the national level: cloud computing, statistical analysis, open source technologies, and data visualization. Data scientists are also expected to be proficient in relational database analysis tools like MySQL.

Data Science in Washington D.C.

So, what does data science look like in the nation’s capital? Figure 8 below shows us.

Figure 8. The Washington D.C. Area Data Science Skills Cluster

D.C.’s top lane is Social Science Research. Here, the high concentration of public sector and nonprofit consulting companies, as well as defense contractors, is the driving force. We also see acute need for Big Data/Cloud Computing and specific skills like Apache Hadoop.

Top employers for data science workers (Figure 9 below) include the Census Bureau, defense and aerospace contractors such as BAE Systems and Northrop Grumman, and information and consulting firms serving government sectors like CACI and Deloitte.

Figure 9: Washington D.C. Metro Area: Top Companies Looking for Data Science Skills

Let’s take a quick look at the vertical skills. See Figure 10 below.


Figure 10. Washington D.C. Metro Area: Vertical vs. Horizontal Skills in Data Science

1. Social Science Research – Now this is interesting and unexpected. The Social Science lane is actually the driving force for data science in D.C.

2. Software / Web Applications – This lane is being shaped by demand for UI/UX design and software development.

3. Business Intelligence – Business Intelligence has been the third lane for each of our three analyses. The D.C. cluster is very similar to NYC and the nation: data warehousing, ETL, data profiling, and dashboard development.

4. Big Data/Cloud Computing – While most data science jobs in NYC require a solid foundation in Big Data/Cloud Computing, in D.C. it’s a more specialized field. Not every data scientist is expected to have this skill. It’s a plus rather than a basic requirement. For example, social science quantitative analysts may not need to be as well-versed in cloud computing and unstructured data analysis, but if they are, they have a competitive edge over others.

D.C.’s horizontal skills are also quite different from those of NYC or the nation, showcasing a few we haven’t seen before:

Geospatial analysis – The statistical analysis of datasets with a geographic or spatial aspect. Think about advanced mapping systems that help governments and government contractors track medical issues, crime, demographics, labor markets, and the like.

Performance / process improvement – Monitoring and measuring data that results from business processes so that organizations can make modifications or improvements.

Measurement and signature intelligence – Related to intelligence gathering. The data comes from “specific technical sensors for the purpose of identifying any distinctive features associated with the source, emitter, or sender and to facilitate subsequent identification and/or measurement of the same.”

Technical writing – Communicating complex concepts in software development, engineering, finance, medicine, and other specialized fields.

So, what are the ramifications? The best data science programs for D.C. will target social sciences graduates—specifically international relations, political science, sociology, psychology, human services, disabilities, and behavioral health. (Who said you couldn’t do anything with that psychology major?) Data science programs should also key in on intelligence gathering, geospatial analysis, process improvement, and technical writing.
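The regional comparison can be expressed as a simple set difference. This sketch takes the horizontal skills named in this paper for the nation, New York City, and Washington D.C. and lists what each region adds beyond the national core. The labels were normalized by hand, which is an assumption, and the function name `regional_additions` is hypothetical.

```python
# Horizontal skills as named in the text; label spelling is normalized by hand.
NATIONAL = {
    "data management and analysis", "business analytics", "cloud computing",
    "open source technologies", "data visualization", "statistical software",
}
REGIONS = {
    "New York City": {
        "financial services", "investment banking", "cloud computing",
        "statistical analysis", "open source technologies", "data visualization",
    },
    "Washington D.C.": {
        "geospatial analysis", "performance/process improvement",
        "measurement and signature intelligence", "technical writing",
    },
}


def regional_additions(national, regions):
    """For each region, return the horizontal skills absent from the national core."""
    return {name: sorted(skills - national) for name, skills in regions.items()}


additions = regional_additions(NATIONAL, REGIONS)
for region, extras in additions.items():
    print(f"{region}: {', '.join(extras)}")
```

The skills left over for each region are candidates for the region-specific coursework discussed above, layered on top of the shared national core.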

How Should I Use This Analysis?

We recommend the following first steps for various stakeholders:

Colleges and universities can use this insight to tweak current programs and tailor future programs to fit local needs. No matter the skill or industry, this approach will help schools shape their content to the demands of the labor market.

Employers should familiarize themselves with these critical skills in order to find schools producing the right talent.

Students and jobseekers must be aware of both horizontal (core) and vertical (specialized) skills in order to choose the right school and programs. These are also the skills they should feature in their résumés!

Economic developers can use such insight about the critical skill demands in their region to improve business retention, expansion, and recruiting.

Bottom Line: Understand Your Market

Here’s the big E on the eye chart. The various “swim lanes” within the broader data science skill cluster vary wildly because they are oriented to regional industry activity. Since industries drive job creation and the demand for knowledge, skills, and abilities (KSAs), we should root everything in regional economics—program development, recruiting, education planning, and economic development.

Emsi has been banging this drum for years. Program development, workforce development, economic development, and talent acquisition are best done at the local level where organizations can apply a targeted, economics/data-centric strategy. Our research here gives us new handles on program and curriculum design that help us create that targeted strategy.

For colleges and universities in particular, this paper demonstrates a better way to ensure their courses are both broad and narrow enough. Yes, programs should equip students with a basic foundation of data science skills, but they should also be tailored to the unique market they serve.

That is exactly what Eastern Washington University has done. Through its partnership with Microsoft, the university addresses key regional needs, and thus stands out in a sea of data science and analytics programs. Institutions that wish to do likewise will use job posting data (and other sources of data) to keep a pulse on industry needs, pursue industry partnerships, and develop a niche in the market (any market, not just data science). In so doing, such institutions will help students achieve their ultimate goal: success in a highly competitive labor market.

Emsi provides higher education, workforce investment, economic development, and talent acquisition professionals with high-quality labor market data and expert analysis to connect people with education and employment. Since 2000, hundreds of institutions and employers across the US, UK, and Canada have used Emsi to align programs with the region’s needs, recruit and retain students, equip students with the right career vision, support local business activity, and find talent. Learn more at www.economicmodeling.com and follow us @desktopecon or on LinkedIn.

If you would like to explore this analysis further or find out what skills clusters look like for your area, contact Rob Sentz at rob@economicmodeling.com or fill out the contact form below.