To derive insights from data, a person must know where and how to look. This skill could take years of education and experience to hone. And perhaps there is no better example of this than Eddie Kim, chief data scientist at Sysomos.
Kim has a Ph.D. in Time Series Data Mining, a Master’s degree in Electrical Engineering, and an M.B.A. in Operations Research and Financial Engineering. For the past sixteen years, Kim has worked as a data scientist at Teradata and Sysomos.
DMN spoke with Kim about the evolution of data science, current trends and insights, and the future of the industry.
If you can, for those who do not know, please explain exactly what a data scientist does?
Data science is the intersection of three fields: 1) Business (domain) analysis, 2) Machine Learning (ML) expertise, and 3) Scalable Technology. Traditionally, the ML Research Analyst role combines the first two. Given a business requirement, one gets sample data from a database, applies an ML algorithm, and then tests and deploys the model into an operational system. Rinse and repeat.
This ML analysis is now confusingly also called data science due to the trendiness of the new title. However, there is a difference and it has to do with the third point: “scalable technology.”
With the massive volumes and velocity of data today, there is a need to run these ML algorithms on “big data” to meet the (near) real-time needs of business processes. By using parallel and distributed technologies [essentially, technologies which can run many processes simultaneously] such as Hadoop, Spark and/or deep learning neural nets on GPUs, and other massively parallel processing platforms, data scientists are using hardware acceleration to meet these critical time requirements.
This is what is meant by “scalable technology” and what, in my mind, separates a true data scientist from a traditional machine learning analyst. This seems like a nuanced difference. However, the real distinction between a data scientist and an ML analyst is leveraging scalable technology against “big data” in order to meet time-critical business requirements.
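[Editor's illustration, not part of Kim's answer: a minimal sketch of the partition, score and merge pattern behind the parallel processing he describes. The `score` rule, the 1000 threshold and the record layout are hypothetical stand-ins for a real ML model and data set, and the thread pool here only shows the shape of the computation; a production system would hand the partitions to a platform such as Hadoop, Spark or GPUs.]

```python
from concurrent.futures import ThreadPoolExecutor


def score(record):
    """Hypothetical stand-in for an ML model: flag amounts above a fixed threshold."""
    return 1 if record["amount"] > 1000 else 0


def score_partition(partition):
    """Score every record in one partition and return the number flagged."""
    return sum(score(r) for r in partition)


def split(records, n_parts):
    """Deal the records round-robin into n_parts partitions."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_flagged_parallel(records, n_workers=4):
    """Score the partitions concurrently, then merge the partial counts."""
    partitions = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(score_partition, partitions))
    return sum(partial_counts)


# Made-up sample data: 200 records with amounts 0, 10, ..., 1990.
records = [{"amount": a} for a in range(0, 2000, 10)]

sequential = sum(score(r) for r in records)   # one record at a time
parallel = count_flagged_parallel(records)    # same work, split across workers
print(sequential, parallel)
```

[Both paths give the same count; the difference lies only in how the work is divided, which is the distinction Kim draws between the traditional ML analyst workflow and scalable technology.]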
How has the position changed from when you began?
As I discussed above, there was only an ML analyst role when I first started. With big data, and the advent of Hadoop/Spark/GPU on commodity servers, there is now a platform to facilitate massively parallel processing to meet business-critical needs. This has been the biggest change in the data science world over the past 12-15 years.
What are the main focuses for you right now?
We are focused on deep learning in two specific areas as they relate to social media: image recognition (computer vision for object recognition) and natural language processing (NLP) tasks such as sentiment and topic recognition.
What trends and insights can you extrapolate from the recent collection of data?
Every day, people are becoming more and more reliant on near-real-time data science, whether it’s Google traffic maps, Android/Apple cloud photo uploading/sharing, Uber, Airbnb, Netflix recommendations, Twitter/FB/Instagram social advertisements, Cyber Monday shopping, credit card fraud detection, Fitbit/Apple Watch health monitoring, etc. People are enjoying a better quality of life from the insights and actions driven by data science. I believe this will continue to be more relevant and prevalent in our modern world.
What do you think the trends and insights tell us about ourselves?
I think that all of these enabling technologies have turned us into an immediate gratification society. We don’t want to waste our time on the mundane or routine tasks that machines can help with based on the data science of crowdsourcing. We want to be more efficient with our time in order to spend more quality time on things we value, such as family and friends. We get what we want with minimal planning effort.
How do you anticipate the field will change with new technologies?
We are going to see more and more personalized, targeted recommendations. For instance, in retail, we are currently targeted during certain shopping seasons like Christmas and Back to School with crowd-sourced recommendations: “like others who bought that, you might also want this.”
But imagine that the retailer has your purchase history and finds that four years ago you bought a baby item, and then in the same week last year you bought clothing for a four-year-old girl. Then this year, they can recommend gifts for a five-year-old girl. In the extreme, ad targeting becomes a one-to-one conversation between you and the retailer that meets your needs at the right time.
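[Editor's illustration, not part of Kim's answer: a minimal sketch of the arithmetic behind his retail example. The `Purchase` record, the calendar years and the rule of trusting the most recent age-labelled purchase are all assumptions made for illustration, not a description of any retailer's system.]

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Purchase:
    year: int
    item: str
    recipient_age: Optional[int]  # age the item was intended for, if known


def estimate_current_age(history: List[Purchase], current_year: int) -> Optional[int]:
    """Project the recipient's age forward from the latest age-labelled purchase."""
    labelled = [p for p in history if p.recipient_age is not None]
    if not labelled:
        return None  # no age signal in the history
    # Assumption: the most recent purchase is the most reliable age signal.
    latest = max(labelled, key=lambda p: p.year)
    return latest.recipient_age + (current_year - latest.year)


# Hypothetical history mirroring the example, with 2020 as "this year".
history = [
    Purchase(2016, "baby item", 0),
    Purchase(2019, "clothing for a four-year-old girl", 4),
]

age = estimate_current_age(history, current_year=2020)
print(f"Recommend gifts for a {age}-year-old")
```

[With last year's purchase labelled for a four-year-old, the projection for this year is five, matching the recommendation in the example.]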
What do you predict the data scientist of the future will examine?
The Internet of Things (IoT) is going to expand the scope of data science even further. We already have some home and health monitoring devices, but let’s consider self-driving cars, which incorporate many future aspects of IoT data science, from computer vision to voice and language understanding, to recommending the best course of action in real-time environments.
According to all analysis, if we only had self-driving cars in the future, there would be no traffic jams and more importantly, no traffic accidents. Self-driving cars would interact efficiently. Lives would be saved and quality of life would be improved (how many people complain about traffic jams at work?).
The only problem then is that on the journey to a complete self-driving society, the machines would have to interact with and account for human drivers. This is in fact a harder problem than interaction with other computer programs. Computer vision and decision making would have to be processed in fractions of a second to ensure that the brakes are applied when an accident appears imminent.
In the future, I see data scientists becoming even more relevant and important in our everyday lives, improving the quality and efficiency of our lives.