Preventing Data Leakage in Machine Learning: Lessons Learned from My Experience

Avoid Data Leakage in Data Preprocessing Steps

M. Masum, PhD
5 min read · Mar 20, 2022
Photo by Joe Zlomek on Unsplash

The experimental setup is a critical component of the Machine Learning (ML) ecosystem. It is, I believe, the single most important factor (and I cannot emphasize this enough) behind the credibility of your models' results. I review submissions for a couple of machine-learning conferences, and after reviewing a number of studies I noticed that authors tend to make the most mistakes in the experimental-setup section. Now, whenever I start reviewing an article, I skip the other parts and go straight to the experimental setup, because that is where I can judge the quality of the results. To be honest, when I first started developing ML-based frameworks, I made the same mistakes when constructing experiments, and my results were not reliable.

Suppose you created an ML-based framework or method that delivers rewarding performance (for example, high accuracy), but you did not follow the right experimental settings: your work may not be accepted. The results of an ML experiment that is not properly set up cannot be trusted, and relying on them can have unintended repercussions. The question, then, is how to build an experimental setup whose results are robust, reproducible, and acknowledged by experts. Today, in this post, I’ll go…
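To make the preprocessing-leakage pitfall from the title concrete, here is a minimal sketch (my own illustration, not code from the article): standardizing a feature with statistics computed on the full dataset lets the test split influence how the training data is scaled, whereas the correct approach fits the statistics on the training split only. The helper names are hypothetical.

```python
def mean_std(values):
    """Return the (population) mean and standard deviation of a list."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

def standardize(values, m, s):
    """Scale values using externally supplied statistics."""
    return [(v - m) / s for v in values]

# A tiny 1-D feature, already split into train and test.
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # an extreme test point

# LEAKY: statistics computed on train + test together.
# The test point drags the mean far upward, so information
# from the test set shapes the training-data transformation.
m_leak, s_leak = mean_std(train + test)

# CORRECT: statistics fit on the training split only,
# then reused unchanged when transforming the test split.
m_ok, s_ok = mean_std(train)

train_leaky = standardize(train, m_leak, s_leak)
train_clean = standardize(train, m_ok, s_ok)

print(m_ok, m_leak)  # the two means differ once the test set leaks in
```

The same fit-on-train-only discipline applies to any fitted preprocessing step (imputation, encoding, feature selection); in scikit-learn this is typically enforced by placing those steps inside a `Pipeline` so cross-validation refits them on each training fold.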
