Will data analysis be automated?

In 2011, McKinsey & Company released a report about the shortage of analytical talent, marking the beginning of the era of big data. The following year, an article in the Harvard Business Review called data scientist “the sexiest job of the 21st century.” Businesses and organizations in the private and public sectors alike continue to express interest in hiring analytical talent, and some even report difficulty finding it. While the future seems bright for analytical talent, those who have just set their minds on becoming analysts probably share the same deep fear that I do: will my job as a data analyst be automated in the future? This is not just a question for data analysts. The world changes faster than we can keep up with or even imagine. Everyone in their early 20s should consider how technology could affect their career before deciding which one to pursue.

After reflecting on my own experience in the MSA program over the last six months, I would like to share some insights on this question and, I hope, alleviate some of the anxiety.

First, let’s consider why people think analytics jobs could be automated. It is no secret that data analysis and modeling rely heavily on computers. Nowadays, computers can not only report modeling results but also help analysts pick the best model by scoring each candidate on a validation or test set, or by using techniques such as cross-validation. However, to pick the best model we first need to decide on the criteria for choosing it, and this is where human intervention comes in. In analytics there is often no single answer that fits every situation (this is where the running joke “it depends” at the Institute comes from), so the final call still relies on human judgement.
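To make the “it depends” point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the validation values, the two candidate models (model_a and model_b), and the helper pick_best are hypothetical and not taken from any real project. It scores both candidates on the same validation set using two common criteria, mean absolute error and root mean squared error.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: large misses are penalized more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pick_best(actual, candidates, criterion):
    """Return the name of the candidate with the lowest score (hypothetical helper)."""
    return min(candidates, key=lambda name: criterion(actual, candidates[name]))

# Made-up validation data: true values and each candidate's predictions.
actual = [10, 10, 10, 10]
candidates = {
    "model_a": [12, 12, 12, 12],  # always off by a little
    "model_b": [10, 10, 10, 16],  # usually exact, occasionally far off
}

best_by_mae = pick_best(actual, candidates, mae)
best_by_rmse = pick_best(actual, candidates, rmse)
print("Best by MAE: ", best_by_mae)
print("Best by RMSE:", best_by_rmse)
```

With these numbers the two criteria disagree: mean absolute error prefers model_b, while root mean squared error prefers model_a. The computer can calculate both scores instantly, but someone still has to decide whether an occasional large miss is worse than a constant small one.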

However, in my opinion, the biggest obstacle to automating analytical jobs lies in the nature of data itself. To be more specific, let’s talk about two of the Four Vs of Big Data: variety and veracity.

In most real-world analytics projects, a large share of the effort goes into getting the data into an analytics-ready format. How much work this stage takes depends on where your data comes from. Let’s assume you would like to predict power usage over the next week for an energy company. The information you need may all be stored in the same database (in which case you’re extremely lucky), or it could be spread across several databases in different departments of the organization, each with its own format. Sometimes the information you need doesn’t exist in the company’s databases at all. For example, weather is highly related to power usage, so you would like to include weather data in your analysis, but it is not available in the current dataset. You may need to scrape a website that provides this information, convert it into the same format as your current dataset, and integrate the two. All of this is just the tip of the iceberg when it comes to the variety of data. As you can see, pulling all the necessary information together and transforming it into an analytics-ready format requires a lot of human intervention, not to mention all the data cleaning (missing values, etc.) once the data is put together.
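The power usage example can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the field names (date, kwh, day, temp_f), the date formats, the Fahrenheit readings, and all the values are invented, and a real project would more likely use a data library than plain dictionaries.

```python
from datetime import datetime

# Made-up usage records from the company database (ISO dates).
usage_rows = [
    {"date": "2015-01-05", "kwh": 1200.0},
    {"date": "2015-01-06", "kwh": 1350.0},
    {"date": "2015-01-07", "kwh": 1100.0},
]

# Made-up weather records from an outside source: a different date
# format, temperatures stored as text, and one day missing entirely.
weather_rows = [
    {"day": "01/05/2015", "temp_f": "32"},
    {"day": "01/06/2015", "temp_f": "23"},
]

def to_celsius(temp_f):
    """Convert Fahrenheit to Celsius so both sources use one unit."""
    return (temp_f - 32) * 5 / 9

# Normalize the weather data: parse the date and convert the temperature.
weather_by_date = {}
for row in weather_rows:
    date = datetime.strptime(row["day"], "%m/%d/%Y").date()
    weather_by_date[date] = to_celsius(float(row["temp_f"]))

# Attach the temperature to each usage record; None marks a missing value.
combined = []
for row in usage_rows:
    date = datetime.strptime(row["date"], "%Y-%m-%d").date()
    combined.append({"date": date, "kwh": row["kwh"], "temp_c": weather_by_date.get(date)})

missing_dates = [r["date"].isoformat() for r in combined if r["temp_c"] is None]
print("Days with no weather reading:", missing_dates)
```

The code itself is short, but every line encodes a decision a person had to make: which date format each source uses, which unit to standardize on, and which dataset drives the join. The missing reading on the last day is only flagged here. Whether to drop that day, fill it in, or go looking for another source is exactly the kind of call that still needs an analyst.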

Let’s move on to the veracity of the data, one of a data analyst’s biggest nightmares. The quality and accuracy of the data can only be judged by a human. After all, a computer is just a machine; it takes whatever data you feed it and has no ability to question that data’s quality. In many cases, data analysts are not involved in the data collection process, and the data they are given may not be suitable for answering the questions posed. Sometimes the analyst needs to talk with the people who designed the data collection in order to assess the quality of the data. Another factor that complicates the issue is privacy. Here at the Institute, every student is assigned to a practicum project and given the chance to work on a real-world problem for a sponsor. Because of privacy concerns, the data handed to the students must not contain personal identifiers, which sometimes poses extra challenges for the analysis. For example, if you can’t tell that two purchases were made by the same person, how can you find purchasing patterns at the individual level? As a result, analyzing data that has been masked to protect personal privacy requires a lot of human intervention.

So it seems that data analysis is nowhere near being automated, at least not in the next five years, and the demand for analytical talent might be larger than you think. If you think analytics is the right career for you, I would encourage you to pursue this path*.


*Again, one should never underestimate the disruptive power of technology. All of these arguments are based on current technology. If some unexpected technology comes into play, they may no longer be valid.


Columnist: Ellie Lo