The Dangers of Data-Driven Science - Dr. Piotr Szczurek, Comp Sci.

Lewis SURE Program
Jul 22, 2019

For the past few decades, data science has been revolutionizing the way science is done and has enabled discoveries that would not be possible otherwise. However, one must be careful in how data science is applied. It's a great tool, but it has to be used properly in order to draw meaningful, sound conclusions. Despite its novelty, data science is still just the old, tried-and-true scientific method with a few twists. People who forget this and apply data science carelessly will find themselves making fake science news.

The premise of data science is that clever algorithms and large computational resources can be combined to find patterns that would be hard for humans to derive on their own. This approach to doing science has been in the works for many years, but it has become standard practice since computational power grew to its current scale. The main motivation is that increased record gathering and storage capabilities have made it impossible for a single human to analyze all the data. By using data mining and machine learning techniques, computers can automatically generate a possible model that explains the data and can be applied to other instances.

The most famous, now legendary, example of how this can be useful is the so-called "beer and diapers" story. According to the legend, a supermarket owner was able to improve sales by applying a data mining technique called association rule mining to the database of sales transactions. The data mining revealed that whenever people buy beer, they also tend to buy diapers. The logic then implies that by putting these two products close together within the store, the sales of both can increase. This is known as cross-selling and is a common tool in sales and marketing. The story is often quoted as one of the main motivational examples of data mining. However, as with all legends, it is based on some facts but is largely false. Nevertheless, most people who teach data science (myself included) still use it today to show what is possible.
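
To make the idea concrete, here is a minimal sketch of the three numbers association rule mining usually reports for a rule such as "beer implies diapers": support, confidence, and lift. The transactions are made up for illustration and the function name `rule_metrics` is my own; this is not how any real supermarket system or library does it, just the arithmetic behind the rule.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent in t)
    n_c = sum(1 for t in transactions if consequent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)

    # support: fraction of all baskets that contain both items
    support = n_both / n
    # confidence: fraction of antecedent baskets that also contain the consequent
    confidence = n_both / n_a if n_a else 0.0
    # lift: confidence relative to the consequent's overall frequency
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical toy data: each set is one shopping basket.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
    {"milk"},
]

support, confidence, lift = rule_metrics(transactions, "beer", "diapers")
print(f"support={support:.3f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items show up together more often than they would if they were unrelated. With only eight baskets, though, that number says very little on its own.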

You may think that the danger of using this story is that it's not based on real events, but that's not really the main issue. The most important, and often forgotten, aspect of this story is that it completely violates one of the main rules of science and statistics: one cannot conclude a fact from a single example. Even professionals in data science often overlook this. Just because the data shows that beer and diapers were often bought together, that doesn't mean they will be bought together in the future or in a different store. It could have been a fluke that never repeats itself. Hence, making product placement decisions based on this information alone is problematic.

Yet the data does seem to tell us something. So how do we make use of this? Well, what the data gives us is a possible hypothesis – there’s a chance this rule could be generalized to other settings. To check if it does, we have to gather new data. To be more precise, it doesn’t really have to be new, per se. It just has to be data that wasn’t used to generate the possible hypothesis. So, we can continue gathering sales transactions for the same store or look back at older transactions. Alternatively, we can gather data from other, similar stores. With this new, test data, we can now check if the same pattern emerges. If people still buy diapers with beer, the hypothesis is confirmed and we can conclude a new scientific finding. Otherwise, it was probably just a fluke and we should not pay attention to that pattern. That’s simplifying it a bit, but this is the general procedure. There are still the questions of how much new data we need and how much of a chance we are willing to accept that our conclusion will be false (these are related issues). The good news is that these are known problems and statisticians (and data scientists) know how to solve them. The important thing is to remember that this is a required part of the (data) scientific process.
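
As a rough sketch of what that check could look like, the code below runs a one-sided Fisher's exact test on held-out baskets, asking how likely it is to see at least this many beer-and-diapers baskets if the two purchases were unrelated. The choice of Fisher's test, the 0.05 threshold, the function name, and the counts are all my assumptions for illustration; a statistician might well pick a different test or threshold.

```python
from math import comb


def fisher_one_sided(both, a_only, c_only, neither):
    """One-sided Fisher's exact test p-value for a 2x2 table of basket counts.

    both:    baskets with beer and diapers
    a_only:  baskets with beer but no diapers
    c_only:  baskets with diapers but no beer
    neither: baskets with neither item
    """
    n_a = both + a_only            # baskets containing beer
    n_c = both + c_only            # baskets containing diapers
    n = n_a + c_only + neither     # all baskets
    total = comb(n, n_a)

    # Sum the hypergeometric probabilities of an overlap at least as
    # large as the one observed, assuming the items are independent.
    p_value = 0.0
    for k in range(both, min(n_a, n_c) + 1):
        p_value += comb(n_c, k) * comb(n - n_c, n_a - k) / total
    return p_value


# Hypothetical counts from data NOT used to discover the rule.
p = fisher_one_sided(both=30, a_only=20, c_only=25, neither=125)

ALPHA = 0.05  # assumed tolerance for a false conclusion
if p < ALPHA:
    print(f"p={p:.4g}: the pattern shows up again in the held-out data")
else:
    print(f"p={p:.4g}: could easily be a fluke")
```

The important design choice is not the particular test but the separation of data: the counts fed to the test must come from transactions that played no part in generating the hypothesis.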

In essence, what data science provides is the ability to find new possible hypotheses. With petabytes' worth of data being collected continuously, automated methods for finding models are usually the only feasible way to generate new discoveries. It is simply becoming too much for humans. However, we cannot forget that a hypothesis is not a scientific fact. We have to make sure the finding is repeatable in other experiments, and this requires data that was not used to generate the hypothesis. I hope you won't forget this when doing data-driven science in the future.

Cheers!
