How to science data

James Rubinstein
7 min read · Apr 14, 2018


I recently hired a data scientist for my team at work. This is very exciting. However, it took 6 months to do it. I want to explore why I had such a difficult time finding someone to be a data scientist, through the lens of what it means to be a data scientist.

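What is a data scientist? A data scientist is someone who uses data to answer questions using the scientific method. The scientific method, in case you have forgotten 5th grade, goes like this: find a good problem, formulate a hypothesis, design an experiment, collect data, analyze it, draw conclusions, and generate new hypotheses.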

According to my friend Dan, a good data scientist needs a good problem: https://medium.com/@dfrankow/how-to-become-a-data-scientist-part-1-find-a-good-problem-2971442227cc

That’s step one. Step two is formulating a hypothesis. A hypothesis is an informed supposition about what will happen given a particular basis in theory. It’s a guess. But not just a guess: it’s a guess that is based on observation and theory, and is *measurable*. Saying “blue buttons will be better than green buttons” is not a hypothesis. “Given that our screen background is green and improved contrast will help users see the button, we believe a blue button will increase button clickthrough by 5% over two weeks” is a hypothesis. We have theory, metric, and expected increase in that metric all in one statement. That statement can be falsified, which is to say it can be proven wrong. Us science types are forever trying to prove ourselves wrong.

How do we prove ourselves wrong? We have to design an experiment. Designing an experiment is the art of trying to control for every conceivable other source of change that could invalidate our results. If we think blue>green, have we considered that colorblind users might not see one or the other? If we have all our colorblind participants in one group, that could tank the results of that experimental group. As a result, we randomize our participants, and we ensure we have a large enough sample to be representative of the population we care about. We also need to ensure that we aren’t biasing ourselves in a million other little ways. We need to run the experiment for long enough to give people time to learn the new state of the system. We have to ensure that we don’t put users in one group in the morning and fill the other group in the afternoon (because morning people and afternoon people might be different). And on and on. A scientist has to understand enough about the system she is working on to avoid these pitfalls when designing her experiment.
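One way to picture the randomization step is the sketch below. It is only an illustration, not a description of any particular experimentation platform: the function name `assign_variant`, the experiment label `button-color`, and the user ID format are all made up here. It assigns each user to a button color by hashing the user ID, so the assignment does not depend on when the user shows up (morning or afternoon) and the same user always sees the same color.

```python
import hashlib
from collections import Counter


def assign_variant(user_id, experiment="button-color", variants=("green", "blue")):
    """Map a user to a variant using a hash of the experiment name and user ID.

    The names and defaults here are hypothetical, chosen for illustration.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The hash behaves like a random number, so users spread evenly over variants.
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Sanity check: with many users, the groups should be close to the same size.
counts = Counter(assign_variant(f"user-{i}") for i in range(10_000))
print(counts)
```

Hashing is just one common choice. Shuffling a list of participants with a random number generator does the same job when everyone is known up front.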

Now that we have our experiment running and are assigning users to blue button and green button conditions at random, we need to ensure we are collecting that sweet, sweet data. Are both of our conditions instrumented the same way? I sure hope we aren’t double-counting blue button clicks, because that would skew the results. Not good. The data has to be accessible to our data scientist, too.
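As a small illustration of the double-counting worry, here is a sketch that counts clicks per condition while ignoring repeated copies of the same event. The event format (a dict with `event_id` and `variant` keys) is an assumption made for this example, not something from a real logging system.

```python
from collections import Counter


def count_clicks(events):
    """Count clicks per variant, counting each event_id only once."""
    seen = set()
    clicks = Counter()
    for event in events:
        if event["event_id"] in seen:
            continue  # the same click was logged twice; skip the repeat
        seen.add(event["event_id"])
        clicks[event["variant"]] += 1
    return clicks


# Hypothetical log where click "e1" was recorded twice.
events = [
    {"event_id": "e1", "variant": "blue"},
    {"event_id": "e1", "variant": "blue"},
    {"event_id": "e2", "variant": "green"},
    {"event_id": "e3", "variant": "blue"},
]
clicks = count_clicks(events)
print(clicks)
```

Comparing the raw event count with the de-duplicated count for each condition is a cheap check that both conditions are instrumented the same way.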

Once the experiment has run, our humble data scientist can access the data, run some smart analyses, and draw some conclusions. Maybe CTR for blue buttons is 3.14 and green is 3.02… what does that mean? We have to apply some statistics to ensure that what we see in the sample groups will generalize to the entire population of our site. If we get 3.14 button clicks per page view, but the standard deviation is 12, then the difference between 3.14 and 3.02 probably isn’t meaningful. If the standard deviation is 0.002, then it probably is … if we have enough people to make that inference. Good ol’ statistics are what tell us if we can make that inference. Our data scientist will need to know which test is the right one to apply to support or disprove our hypothesis.
Tangential rant: notice I didn’t say prove, you never prove anything in science, only add support. You can disprove things with one observation, but you can never prove anything. It’s maddening, I know. Take this example: the sky is always blue. You have literally thousands of observations that the sky is always blue. But one observation of a green sky invalidates that statement.
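To make the 3.14-versus-3.02 example concrete, here is a rough sketch of a two-sample z-test computed from summary statistics. The means and standard deviations come from the example above; the group sizes are invented for illustration, and the normal approximation used here is only reasonable for large samples. It is one possible test, not necessarily the right one for every metric.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (z, two-sided p-value) for a difference in means.

    Uses a large-sample normal approximation with unpooled variances.
    """
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


n = 1_000  # hypothetical number of users per group

# Standard deviation of 12: the 0.12 gap is lost in the noise (z is about 0.22).
z_noisy, p_noisy = two_sample_z(3.14, 12, n, 3.02, 12, n)

# Standard deviation of 0.002: the same gap is enormous relative to the noise.
z_tight, p_tight = two_sample_z(3.14, 0.002, n, 3.02, 0.002, n)

# Same noisy data, but a million users per group: now the gap is detectable.
z_big, p_big = two_sample_z(3.14, 12, 1_000_000, 3.02, 12, 1_000_000)

print(f"sd=12,    n=1,000:     z={z_noisy:.2f}, p={p_noisy:.3f}")
print(f"sd=0.002, n=1,000:     z={z_tight:.2f}, p={p_tight:.3g}")
print(f"sd=12,    n=1,000,000: z={z_big:.2f}, p={p_big:.3g}")
```

The third case shows why "if we have enough people" matters: the same difference and the same spread can go from indistinguishable to clearly detectable purely because the sample grew.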

GOOD NEWS EVERYONE, we’ve gained evidence that supports our hypothesis! Now what? We need to make some new hypotheses, of course! If blue buttons were better than green, maybe red would be even better! Or orange, or yellow, or puce! Maybe we can change the call-to-action text on the button, or its size, or placement, or font, or …
A scientist will need to be able to generate new hypotheses based on the outcomes of her previous experiment. Even the simplest case should generate new hypotheses.

How is a data scientist different than any other scientist then? That is an interesting question, I’m glad I asked it! The answer is not much. To me the only difference is the tools that a data scientist has at her disposal. As data scientists, we get to track people who use our services in ways that most other scientists can’t track their research subjects. We have big data on our side. With that, we have to use big data tools. But the fundamental thought process around the scientific method does not change. That’s why you see so many physics, social science, or chemistry majors getting pulled into the orbit of tech giants like Google, Facebook, or Microsoft.

Why was it so hard to hire a good data scientist then? Hiring a good data scientist is hard for the same reason hiring anyone is hard. As the hiring manager, I had to find the person with the right mix of skills, knowledge, abilities, and cultural fit to be successful in my organization.

The other thing that makes hiring a data scientist difficult is that no two people seem to have the same definition of a data scientist or what skills are necessary. So, here is my definition:

A data scientist is a person who can use the scientific method to answer questions for the benefit of an organization through the use of large-scale data.

Let’s break that down. The scientific method we’ve looked at. Answering questions for the benefit of the organization means that a data scientist has to understand that organization, or work with people that do. Those are usually UXers, PMs, engineers, managers, etc. Data scientists do not get to sit alone in the dark and churn through spreadsheets. They have to get out and communicate with other members of the organization. They must have the temperament to interact with others, influence, drive decision making, and know when to back off. It’s a tough challenge. I have not mastered it.

The use of large-scale data is more straightforward. Back when I was a grad student, a large study might have 40 or 50 participants. Now, in the tech world, you might have 40 or 50 million. The scale is different, so the tools are different. Data scientists have to be versed in pulling data from different sources with different tools and synthesizing that data. Tools like Spark, Hadoop, Hive, Presto, S3, Elasticsearch, etc. enable us to make use of millions of rows of data. R, SAS, Python, and other analysis tools have to be employed to clean, munge, and analyze those millions of rows. A data scientist has to be able to use these tools when each is most appropriate. It’s a tough challenge. I have not mastered it.

The main problem I saw when interviewing was that many candidates focused on the data tools problems and not enough on the organizational or scientific problems. Many candidates focused exclusively on hot new technologies like machine learning or artificial intelligence. The ability to run a random forest model in Scikit-learn does not make you a data scientist. It makes you a machine learning engineer.

An engineer is different from a scientist. Engineers apply models to solve problems. They rely on scientists to provide evidence for these models. Let’s look away from tech for an example: bridge building. A civil engineer needs to know how strong the rebar for the concrete structure will be. He must rely on the tensile strength measurement performed by some materials scientist somewhere. He will not go out and test the breaking point of every batch of rebar that comes to the job site.

Applying machine learning to a problem at an organization is similar: we are just applying models developed elsewhere to the situation at hand. In that case we aren’t answering questions, we are solving problems, which is the fundamental difference between a scientist and an engineer in my book. Where machine-learning engineering transitions to data science is when we apply that scientific method with the machine learning tools to solve our problem. Many machine learning engineers become more data-science-oriented when they have to select, test, and tune their models. In a way, those are like experiments: holding back some data to see if the new model outperforms the old at prediction.
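The "holding back some data" idea can be sketched in a few lines. Everything in this example is made up for illustration: the data is synthetic, the "old model" just predicts the average, and the "new model" is a hand-rolled straight-line fit. The point is only the shape of the comparison, where both models are fit on one slice of the data and judged on a slice neither has seen.

```python
import random
from statistics import mean

rng = random.Random(42)

# Synthetic data, invented for illustration: y is roughly 2 * x plus noise.
data = []
for _ in range(500):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + rng.gauss(0, 1)))

# Hold back the last 100 rows; neither model sees them during fitting.
rng.shuffle(data)
train, held_out = data[:400], data[400:]


def fit_mean(rows):
    """'Old' model: always predict the average y from the training rows."""
    average = mean(y for _, y in rows)
    return lambda x: average


def fit_line(rows):
    """'New' model: ordinary least-squares line through the training rows."""
    mean_x = mean(x for x, _ in rows)
    mean_y = mean(y for _, y in rows)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_squared_error(model, rows):
    return mean((model(x) - y) ** 2 for x, y in rows)


old_error = mean_squared_error(fit_mean(train), held_out)
new_error = mean_squared_error(fit_line(train), held_out)
print(f"old model error: {old_error:.2f}, new model error: {new_error:.2f}")
```

A lower held-out error for the new model is evidence in its favor, in the same spirit as the button experiment. It is still a single sample, though, so it says nothing yet about how reliable the difference is.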

Where the data scientist still differs is the application of statistical testing techniques to ensure that the outcomes are reliable. When our data scientist creates an experiment with live users to test her new algorithm and uses the right statistical testing methods to accept/reject her hypothesis, that’s data science.

Too many candidates I interviewed who called themselves ‘data scientists’ were adept at applying machine-learning models, but they hadn’t spent the time analyzing the results. They didn’t know why a t-test is inappropriate for categorical data. They didn’t know an ANOVA from a logistic regression. While we have been able to paper over these shortcomings in our methods with more data, I don’t think that will be a viable state of the world forever. It’s going to be more important than ever to be able to understand the statistical underpinnings of machine learning. It’s already important to be able to answer questions using data. Most importantly, a data scientist has to be able to do science.
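As an example of matching the test to the data: a click is a categorical outcome (clicked or didn't), so a chi-square test on the table of counts is a more natural fit than a t-test on means. The sketch below computes a plain Pearson chi-square for a 2x2 table, without a continuity correction. The click counts are hypothetical, loosely echoing the button example by reading 3.14 and 3.02 as percentages of 10,000 users each.

```python
from math import erfc, sqrt


def chi_square_2x2(clicks_a, users_a, clicks_b, users_b):
    """Pearson chi-square test for clicked vs. not clicked in two groups.

    Returns (chi2, p-value) with 1 degree of freedom, no continuity correction.
    """
    observed = [
        [clicks_a, users_a - clicks_a],
        [clicks_b, users_b - clicks_b],
    ]
    total = users_a + users_b
    row_totals = [users_a, users_b]
    col_totals = [clicks_a + clicks_b, total - clicks_a - clicks_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect in this cell if color made no difference.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 314 of 10,000 blue users clicked, 302 of 10,000 green.
chi2_small, p_small = chi_square_2x2(314, 10_000, 302, 10_000)

# A bigger hypothetical gap: 400 of 10,000 versus 300 of 10,000.
chi2_large, p_large = chi_square_2x2(400, 10_000, 300, 10_000)

print(f"314 vs 302 clicks: chi2={chi2_small:.2f}, p={p_small:.3f}")
print(f"400 vs 300 clicks: chi2={chi2_large:.2f}, p={p_large:.5f}")
```

With these made-up counts, the small gap is well within what chance alone would produce, while the larger gap is not.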

Written by James Rubinstein

Search nerd, data nerd, and all-around nerd-nerd. He has worked at eBay, Apple, and Pinterest, and currently leads the Product Analytics team at LexisNexis