Data Science Mistakes to Avoid: Data Leakage | Leah Simpson and Ray McLendon | Skillshare

Data Science Mistakes to Avoid: Data Leakage

Leah Simpson and Ray McLendon, Data Scientists

Lessons in This Class

14 Lessons (24m)
    • 1. Class Trailer

    • 2. What is Data Leakage?

    • 3. How Does a Machine Learning Model Learn?

    • 4. Don't Randomly Split Time Series

    • 5. Don't Include Data from the Future

    • 6. Don't Randomly Split Groups

    • 7. Don't Forget Your Data is a Snapshot

    • 8. Don't Randomly Split Data When Retraining

    • 9. Do Split Data First

    • 10. Do Use Cross Validation

    • 11. Do Be Skeptical of High Performance

    • 12. Do Use scikit-learn Pipeline

    • 13. Do Check for Features Correlated with Target

    • 14. Recap

About This Class

Have you ever trained a machine learning model with amazing accuracy, only to deploy it into production and have it not perform as well? Data leakage may be the culprit! Data leakage occurs when your machine learning model has accidentally learned information about the test set. It can unfortunately be introduced in many different scenarios and can have a huge impact on your model's performance in production.

Leah is a Data Scientist at a large financial institution and discovered there is a serious gap between the skills and techniques students learn in school versus what they actually need on the job in the real world. Data leakage wasn't stressed at all in Leah's undergraduate program. She'll help you avoid making the same mistakes she made at her first job by teaching you how to identify and prevent data leakage.

This course is intended for aspiring data scientists and programmers looking to expand their knowledge of data leakage.

In this course you’ll learn:

  • What data leakage is
  • How to identify data leakage
  • Best practices for preventing data leakage

Leah will walk through several real world examples of data leakage and how to fix them.

No prior knowledge of data leakage is needed for this course. A basic understanding of machine learning concepts such as model training and retraining will be helpful, but is not required.

Music by TimMoor from Pixabay

Images used in this course: leaky faucet, superhero dog, confused dog, telescope

Meet Your Teacher

Leah Simpson and Ray McLendon

Data Scientists


Leah Simpson and Ray McLendon are Data Scientists at a large financial institution and have over 15 years of combined experience. Over the years, they realized that there is a serious gap in the skills students learn in school versus what they actually need in the real world. Data Science programs often use perfectly cleaned data sets and focus on machine learning algorithms, while real world data scientists spend around 80% of their time cleaning data. Leah and Ray began making courses to reduce this knowledge gap and help prepare new data scientists for the workforce. 

1. Class Trailer: Hi everyone, welcome to today's video, which is on data leakage. My name is Leah Simpson and I'm a data scientist at a large financial institution with about four years of experience. My coworker Ray and I really wanted to make these videos because we noticed there's a huge gap between the skills that you learn in school and the ones that you actually need in the real world. So all of our datasets will be focused on real-world problems and on giving you the skills that you need to solve those problems, skills that they don't necessarily teach in school. We are super excited to have you here today and hope you will stick around. In today's lesson, we'll be covering the basics of how to identify and prevent data leakage in the machine learning world. Surprisingly, data leakage wasn't something that was really taught to me in school. We touched on it a little bit, but mostly focused on perfect datasets and on learning the machine learning algorithms. So I didn't really have my first experience with data leakage until I got into the real world and started making some mistakes. Hopefully this course will help you avoid making the same mistakes that I made when I first started making models at my first job. Data leakage occurs when your machine learning model knows too much information, and I like to think about it in two different flavors. The first is when your training set includes information about your test set. The second is when your model is trained on information that it won't actually have access to when it goes into production. Data leakage can be tricky to spot, so we'll walk through a few different scenarios where data leakage could have been an issue and then identify how we can fix them. This course is really meant to be beginner friendly. It's okay if you've never made a machine learning model before; we're going to start from the basics, so you should come away with a good understanding.
However, even though it is meant to be beginner friendly, I do believe that more experienced data scientists could also learn and benefit from this video. And with that, let's get started.

2. What is Data Leakage?: Alright, today's course is on data leakage: when your machine learning model knows too much. Today we'll be covering four sections around data leakage. The first one is all about data leakage. What is it? Why do we care about it? Next, we'll go through a few examples of data leakage and how to prevent them. Then we'll walk through a few best practices for preventing data leakage. And then we'll wrap it up with a recap. So first, of course, we have to cover what data leakage is. I like to think about data leakage in two different flavors. The first is that your training set includes information about your test set. The second is that your model was trained on information that's not actually available when it goes into production. If you've ever heard anything about data leakage or researched it, you might also hear it called leakage or target leakage. Also, if you have done research on data leakage before, you might have seen it pop up in the information security space. Today's course is not going to cover what data leakage means in that context. We'll just be focusing on the machine learning context. Data leakage can be super difficult to spot, so that's why I want to go through a few examples where it might pop up, as well as some best practices for how to prevent it. Now that we've covered what data leakage is, the next logical question is: okay, why should I care about it? When you're testing, your model will actually seem like it's doing really amazing. It's going to have all these crazy high metrics. It's going to look like the superhero model that's going to solve all your problems. But when you get to production, your model will look really confused.
And when that happens, it looks bad on you as a data scientist for putting something into production. You may have over-promised the results that your model would deliver, and then you don't meet the metrics that you're trying to achieve. Ultimately this is going to come back on you as the data scientist. You've probably been talking with your business customers, saying that your model is doing great and reaching these amazing metrics, and you're getting them really excited to put it into production. Then imagine what happens when it doesn't do as well as you had hoped it would in your testing setting. We definitely don't want that to happen, so let's talk through a few examples of data leakage and how we can prevent them.

3. How Does a Machine Learning Model Learn?: Let's talk about how a model learns. What a machine learning model is doing behind the scenes when you train it is this: you take all of the data that's available to you and split it into two sets, your training data and your test data. The model learns from the training data. It finds patterns in the training data and picks up on those. Then, once you want to test your model, you use your test data, which is data that the model hasn't been trained on, to verify your results. When we create our training data and our test data, we are usually doing what's called a random split on that data. This works for a lot of different data types, but we're going to go through a couple of examples where it does not work well. So let's see how a random train test split actually works. Here I've got some example data of movie names along with their genres. If we wanted to do a random split on this data, we would first figure out how much data we want in our training set versus our test set. There are several popular ways you can do this.
So for this example, I'm going to do a 60/40 split, meaning that we have 60 percent of our data in our training set and 40 percent of our data in our test set. But you can do other splits like 80/20, 90/10, or 70/30. Why you might choose one split over another really depends on the data that you're using and on your situation. In this example, we've got this dark green color as our training set and the purple as our test set. We've taken six random data points and put them in the green section, and then four data points are in the purple section.

4. Don't Randomly Split Time Series: Now that we know how a machine learning model works and what a random train test split is, let's walk through our first example of where data leakage might be introduced. Here we're going to take an example with some time series data. We've got some months here, January 2020 through October 2020, as well as the precipitation for each month in inches. Let's say we're ready to train a machine learning model to take in a month and then predict how many inches of precipitation we'll get for that month. If we do a random train test split here of 60 percent training data and 40 percent testing data, this is what that might look like. The issue here is that random train test splits don't work for time series data. Let's say we try to predict how many inches of precipitation we'll get in July. The model actually knows information about future months. You'll see over here that we've already trained our model on August, September, and October, so it already knows what's going to happen in those future months when it really shouldn't. That's why a random train test split does not work for time series data. Instead, what we want to do is use a sliding window. There are several different ways to do sliding windows in time series, but I'm going to show just one example here.
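One way such a window could be sketched in plain Python is shown here. This is only an illustration under stated assumptions: the precipitation values are made up, the helper `expanding_window_splits` is a hypothetical name, and the "model" is a stand-in that predicts the mean of the months seen so far. The window grows by one month each round and always predicts the month that comes next, so no future month is ever in the training data.

```python
# Hypothetical monthly precipitation in inches (illustrative values only).
months = ["Jan", "Feb", "Mar", "Apr", "May"]
precip = [3.1, 2.8, 4.0, 3.5, 4.4]


def expanding_window_splits(n_samples, min_train, n_splits):
    """Yield (train_indices, test_index) pairs in which every training
    month comes strictly before the month being predicted."""
    for i in range(n_splits):
        end = min_train + i
        if end >= n_samples:
            break
        yield list(range(end)), end


splits = list(expanding_window_splits(len(months), min_train=2, n_splits=3))

errors = []
for train_idx, test_idx in splits:
    # Stand-in "model": predict the mean of the months seen so far.
    prediction = sum(precip[i] for i in train_idx) / len(train_idx)
    errors.append(abs(precip[test_idx] - prediction))
    print([months[i] for i in train_idx], "->", months[test_idx])

# Aggregate the per-month errors into one metric (mean absolute error).
mae = sum(errors) / len(errors)
print("mean absolute error:", mae)
```

In practice, scikit-learn's `TimeSeriesSplit` produces splits of this general shape, so a hand-written helper like this is mainly useful for seeing what happens in each round.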
For this example, we start on the left-hand side with our first two months, January and February, as our training data, and we use those to predict March. In our next iteration, we've got January, February, and March as our training data, and we use that to predict April. We keep repeating that cycle for as many times as we want to. In this case, we've done three. We take the errors for each of the months that we're trying to predict, aggregate those, and get our error metric that way. So with time series data, you really always should be using some sort of sliding window approach, whether it's this one or another method of your choice.

5. Don't Include Data from the Future: Next up, let's take this same example and extend it a little bit further. Let's say we want to add some more information to our model to hopefully make it better. So we look up the high temperature for each month and include that in our model. Now, this data would be available online anywhere, or if you're at a company, you might have this information stored in a database for you. The tricky part here is that this is actually data from the future. If we're not careful and we include this, our model won't actually have that information when it runs in production. To prevent this, one thing I like to do is think about my data on a timeline. If we put our data on a timeline, this is what it might look like. Starting on the left-hand side, we have our month beginning, and let's say it's January. On January 1st or maybe January 2nd, our model predicts how much precipitation is going to occur in that month. Later on in that month, every day, we're logging the daily precipitation and temperature data. Then the month ends. And finally, after the month ends, that's when we can calculate what the maximum temperature was for that month.
So when we put our data on a timeline like this, we want to make a note of where our model is actually predicting. And what we want to do is only include information that occurs prior to the model predicting, so we don't accidentally include any information from the future.

6. Don't Randomly Split Groups: All right, so we've got a couple of examples around time series data. Let's take a look at a classification problem. For this example, we've got some different students, students one through four, as well as different essays that they've written and the grades that they've gotten for those essays. Let's say we want to make a model that will take in the text of the essay and then predict what score or grade that student should receive. One thing we might start out by doing is creating a random train test split. In this example, we've done a 60/40 split where we've taken six random data points, in that dark green color, as our training set. Then our model is getting tested on the four light purple rows there. This probably seems innocent enough, and in fact, in my job, I have done this before. But this can actually be an area where you might introduce data leakage. In this example, you'll notice that we have student one in both the training set and the test set, as well as student two and student three. When the model has information from a student in the training set and the test set, it might unintentionally learn some writing behavior specific to that student. So it actually has more information when it goes to test on that student. Instead of randomly splitting our dataset, what we actually want to do is keep track of the student number and put a student only in the test set or only in the training set. You can see we've done that here. Student 1 is only in the training set, student 2 is only in the training set, and students 3 and 4 are in the test set.
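A minimal sketch of such a group-aware split, using scikit-learn's `GroupShuffleSplit`. The student IDs, essay placeholders, and grades below are made up for illustration, and which two students land in the test set depends on the arbitrary `random_state`; the point is only that no student appears on both sides.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: ten essays written by four students.
student_ids = np.array([1, 1, 1, 2, 2, 2, 3, 3, 4, 4])
essays = np.array([f"essay text {i}" for i in range(len(student_ids))])
grades = np.array([88, 92, 85, 70, 75, 72, 95, 90, 60, 65])

# An integer test_size is a number of groups: two of the four students
# go to the test set. random_state is arbitrary, for repeatability.
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=0)
train_idx, test_idx = next(splitter.split(essays, grades, groups=student_ids))

train_students = set(student_ids[train_idx].tolist())
test_students = set(student_ids[test_idx].tolist())
print("train students:", sorted(train_students))
print("test students:", sorted(test_students))
```

Because the split is made on whole groups, every essay by a given student ends up on the same side, which is closer to how the model would meet a brand-new student in production.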
Splitting this way, the model doesn't get extra information about how the student writes, and this more closely mimics a production-like setting. The Python library that's really popular in the machine learning space, scikit-learn, actually has a really simple way to do this. It has a splitter called GroupShuffleSplit, and you can just tell it which column you want to use as your indicator for splitting. In this case, it would be our student number, and it takes care of the rest. I would highly recommend using this if you have some sort of natural grouping in your data.

7. Don't Forget Your Data is a Snapshot: Next up, let's talk about an example of another classification problem, but with some different data. Oftentimes in school you're given a dataset that does not change. However, in the real world, your dataset is changing all the time. In this example, let's say we have a website with a click prediction model. This is going to predict whether or not a user will click on an ad. We could then use this model to decide which ads to show to users. We've got some example data. Let's say on May 16th, 2021, we have user one. They haven't clicked on any ads before, and for this specific ad that we showed them, they did click it. Then, when we come back a couple of weeks later and query our dataset on May 31st, we still have our user one, but this time they have clicked through two ads, and they also clicked this ad that we showed them. The really subtle thing here that I want to stress is that your data is really just a snapshot. In this example, information about future actions of the user has actually been leaked into our training set. If we train our model on the data from May 31st, the model already knows that the user's going to click two ads. And the really tricky thing here is that you have to remember that in the real world, your data is a snapshot in time and it's always changing.
So you have to be on guard for features that you might use in a model that are actually leaking information into your training set. Similar to our original time series example, one thing we can do here to prevent this from happening is to think of our data on a timeline. If we put our example on a timeline, here's what it might look like. First, all the way on the left-hand side, the user visits the site. Next, the model predicts the ad to show to the user. The user clicks on the ad after that, or maybe they don't, and then the user leaves the site. Finally, after that, we calculate the number of ad clicks. So, similar to before, we want to figure out when our model is making this prediction and then only include information that happens prior to that.

8. Don't Randomly Split Data When Retraining: So far we've only talked about data leakage occurring when you are training your model. But what often happens in production is that you need to retrain a model because your model isn't performing as well. Here's an example of when data leakage might occur when you're retraining your model. For this example, let's say we trained a model on June 15th, 2021. We did all of our due diligence on this model and it is good to go into production, so we deploy it. Let's say that we have done a 60/40 train test split, similar to our other examples. So we have six data points, points 1, 2, 4, 5, 6, and 10, that the model was trained on, and then it was tested on points 3, 7, 8, and 9. Now let's say our production data starts to change over time, so we need to retrain our model. We go in on August 30th and we train up a model that's going to be a challenger for the one in production. If it's good enough, we'll replace the production model with this model. And let's say when we trained our challenger model, we did a random train test split on all the data points that we had. This includes the first 10 data points that were used for the production model.
And then we had some new data come in between June 15th and August 30th: data points 11, 12, 13, 14, and 15. Now, if we just take a random split of our data here, what we might be doing unintentionally is introducing some data leakage. In order to compare these models, what we'll actually want to do is take our production model and our challenger model and make predictions on the challenger model's test set. When we do that, the tricky thing that can occur is that the production model might have been trained on data points that are actually in the test set of the challenger model. In this case, data points 1, 2, and 10 are data points that the production model was trained on, on June 15th. So when we go to compare these models on the test set of the challenger model, the production model has already seen data points 1, 2, and 10, and can probably know the answer to those. That gives it an unfair advantage against the challenger. Just to illustrate that a little bit further, let's take our test set, data points 1, 2, 3, 8, 10, and 12, and see what the results would be if the prod model is making its predictions versus the challenger model. Again, data points 1, 2, and 10 are ones that the prod model has already seen, so you can see in this chart that it gets them right. But the challenger model hadn't seen them before, so it gets them wrong. For data points 3, 8, and 12, the prod model gets a couple of those wrong and one of those right, while the challenger model gets a couple of those right but one wrong. So when we go to calculate our metric, in this case accuracy, the prod model has a 67 percent accuracy while the challenger model has a 33 percent accuracy. If we weren't thinking about data leakage and saw this, we would probably think, okay, the prod model is doing better, so let's leave that in production and try a different challenger model. But in this case, we've actually introduced some data leakage for the prod model.
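The overlap in this example can be checked mechanically. The following is a small Python sketch using the data point numbers from the lesson; the variable names are made up for illustration, and a real project would compare row identifiers saved alongside each training run.

```python
# Data point IDs from the lesson's example.
prod_train_ids = {1, 2, 4, 5, 6, 10}        # production model's training set
challenger_test_ids = {1, 2, 3, 8, 10, 12}  # challenger's randomly drawn test set

# Any ID in both sets is a point the production model has already seen.
leaked_ids = prod_train_ids & challenger_test_ids
print("leaked into the comparison:", sorted(leaked_ids))

# Only the remaining points are unseen by both models.
fair_test_ids = challenger_test_ids - prod_train_ids
print("points unseen by both models:", sorted(fair_test_ids))
```

A non-empty intersection like this one is a signal that the head-to-head comparison favors the production model.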
The prod model knows more than it should, so it artificially looks better than the challenger model. To prevent this, what we actually want to do is save off all of our training and test sets. Let's say from the prior example we have the exact same training and test split that we did for the production model. When we go to train our challenger model, what we want to do is load in the training and test sets that we used for the production model, so they're exactly the same. Then, for the new data points, points 11 through 15, we just want to add those evenly into the training and test sets. This makes it so the production model doesn't know any more information than it should, and it doesn't look any better than the challenger model. We have an equal playing field. Now, to further illustrate that, let's take our test set and compare what the production model would predict versus the challenger model. In this case, you can see there are no data points that the production model has already seen, so it gets a couple of them wrong and the rest right. The challenger model, on the other hand, only gets one wrong. So in this case, the challenger model gets a better accuracy than the production model, and we would probably deploy it to production to replace the current production model. Now that we have an equal playing field, the challenger model actually does better.

9. Do Split Data First: Now that we've gone through a few examples of where data leakage might occur, let's talk through a few best practices that you should include whenever you're doing your modeling. First up, I want to talk about the importance of splitting your data immediately. Sometimes it can be tempting to just start in on doing different transformations to your data and then split your data afterwards.
But really what we want to do is first split our data, and I'm going to go through an example here showing why we do not want to make transformations and then split our data. For this example, let's say we have some different videos and the number of views that they've gotten on YouTube. Normally what we would want to do for this numeric data is normalize it so that large raw values don't dominate our machine learning model. In this case, I'm using min-max normalization. You don't have to know how to do that; just trust me that my normalized views column there is the correct calculation. Basically, what min-max normalization does is take the minimum of your dataset and the maximum of your dataset and use those to transform all of your values to be between 0 and 1. Now we've got this new column of normalized data, and then we split our dataset into our training and test sets. Again, we're doing a 60/40 split, similar to what we've done in all of our other examples. We have six data points in our training set in that green color and four data points in our test set in that light purple color. This probably seems innocent enough, but this is an example of where data leakage can occur. What we actually want is for the training set to have its own minimum and maximum and the testing set to have its own minimum and maximum. Right now we've split the minimum and maximum between those two datasets. Our training set contains the minimum, but the largest value in it is 0.6, or 0.7 if you round up, when really it should have a maximum of one for its own dataset. That scale was set by a value that lives in the test set, so the training data has been shaped by test information. In contrast to this, what we actually want to do is split our data first thing. So here we've taken our video and views dataset and split it into a 60/40 train test split, six data points going into the training set and four into the test set. And then after that we can do our min-max normalization.
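As a rough Python sketch of splitting first and normalizing afterwards: the view counts below are made up, and the helper names `fit_min_max` and `apply_min_max` are hypothetical. Note one deliberate difference from the slides: rather than giving the test set its own minimum and maximum, this sketch uses the common variant of learning the minimum and maximum from the training set only and reusing them on the test set. In both versions, no test-set statistic influences the training data.

```python
def fit_min_max(values):
    """Learn the scaling parameters (min and max) from the given values only."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values using previously learned parameters."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical view counts, already split 60/40 before any transformation.
train_views = [120, 4500, 980, 15000, 300, 7200]
test_views = [50, 25000, 2100, 640]

# Fit on the training set only, then reuse those parameters on the test set.
lo, hi = fit_min_max(train_views)
train_scaled = apply_min_max(train_views, lo, hi)
test_scaled = apply_min_max(test_views, lo, hi)

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

With this variant, test values can fall outside the 0 to 1 range when they are smaller or larger than anything in the training set, which mirrors what the model would face with new data in production. scikit-learn's `MinMaxScaler` follows the same fit-then-transform pattern.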
On the slide, you can see that now our training set has its own minimum and maximum and the test set has its own minimum and maximum. This would help us prevent data leakage in this scenario. Just to compare these side by side, on the left-hand side I have where we rescale first, and on the right-hand side I have where we split first. You can compare these values against each other, and you might think, okay, these are really close to each other, so why does this really matter? Well, even though they are really close, rescaling first actually introduces data leakage. What we want to do is what's on the right-hand side.

10. Do Use Cross Validation: The next best practice that I want to talk through is cross-validation. Cross-validation can look kind of crazy if you're just seeing it for the first time, but we'll walk through it so hopefully it makes sense. First we'll start off by taking the dataset that we had in the very beginning of this presentation, and we're going to split it into our training set in that green color and our test set in that light purple color. What cross-validation does is take your training dataset and split it into k folds. Basically, you can think of a fold as a different part of the training set. In this case, we're going to use three-fold cross-validation, so we split our training set into three different parts. Now that our training set is split into three different parts, we're going to take one of those parts and call that our test set. However, this is still from our training split. You might also hear this referred to as your validation set. In this example, what I have done first is mark fold 1 as my validation or testing set, and then folds 2 and 3 are used as my training set. Then we rotate where the validation or test set is. In the next round it moves to fold 2, and then we have folds 1 and 3 as our training data.
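The fold rotation can be sketched in a few lines of plain Python. This is an illustration only: `k_fold_indices` is a hypothetical helper, the folds are contiguous and unshuffled, and the six rows stand for the training portion of a ten-row dataset.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds and yield
    (train_indices, validation_indices), holding each fold out once."""
    fold_size = n_samples // k
    folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
    # Put any leftover samples in the last fold.
    folds[-1].extend(range(k * fold_size, n_samples))
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Six training rows, three folds; the final test set is never touched here.
rotations = list(k_fold_indices(6, 3))
for train_idx, val_idx in rotations:
    print("train:", train_idx, "validation:", val_idx)
```

In practice, scikit-learn's `KFold` and `cross_val_score` handle this bookkeeping, so a hand-rolled version is mostly useful for seeing which rows play which role in each round.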
We keep iterating through that process for the number of folds that we have; in this example, that's just one more time, for three folds. What we're doing through this process is finding which parameters work best for our model. Once we find the optimal parameters, what we actually do is take our final test set and use that to finally evaluate the model. This is data that the model hasn't ever seen before, and ultimately what this does is help prevent data leakage. This might seem kind of complicated, but the scikit-learn library in Python is really great about implementing this. You can do it in one line, and it does all of the fold shuffling behind the scenes for you.

11. Do Be Skeptical of High Performance: My next tip is to be skeptical of high performance. I know that kind of sucks to say, because if you're like me, you get really excited when you see your model having amazing metrics. But you really should start being skeptical when you see such high performance. There are two different scenarios I want to talk about here. The first one is when your train and test sets perform kind of average, but your validation set actually performs better than your train or your test set. This is kind of a red flag that you might have some data leakage. At this point, if you see that happening, I would take a step back and think really long and hard about any areas where you might have introduced data leakage. Just to illustrate that, in our top row here we've got our train set with an F1 score of 80 percent. If you're not familiar with it, F1 score is just another metric; you can think of it like accuracy. We've got our test set with an F1 score of 78 percent, and then our validation set is actually at 85 percent, which is better than either the train or the test set. Another scenario where you might need to be a little bit skeptical is when all of your datasets perform really well.
To illustrate this, we've got our training set on the bottom with an F1 score of 99 percent, our test set with an F1 score of 99 percent, and our validation set also performing with an F1 score of 99 percent. I think younger me would have been really excited about the bottom results there. But now I know that this is a dead giveaway that you probably have some data leakage somewhere and your model's not going to perform well when you implement it in production.

12. Do Use scikit-learn Pipeline: Another best practice is to use scikit-learn's pipeline functionality. Essentially, what a pipeline does is apply a set of steps in a certain order. Here I've included some code straight from the scikit-learn documentation. It imports a few different functions from the scikit-learn library, splits the data into a train and test set, and then sets up a pipeline that applies a scaling mechanism as well as a support vector classifier to make predictions. Pipelines allow you to apply a set of steps to your data all at one time, and they take away some of the mental burden that you might have from some of the other examples that we've talked about today.

13. Do Check for Features Correlated with Target: My last best practice is to check for any features that are correlated with your target, the variable that you're trying to predict. Here we've got some example health care data. We've got the age of the people, their height, their weight, and their gender. We also have a piece of information available to us about whether or not the person took antibiotics. Let's say we use that to make a prediction of whether or not someone gets the flu. You can see here that these variables are really highly correlated with each other. That could be a giveaway that we might be introducing some data leakage if we use this as an input into our model. Another way of thinking about this goes back to the timeline approach that we talked about earlier.
When you start putting this data on a timeline, you probably won't take antibiotics until after you get the flu. By including this as a feature in our model, we would be introducing some data leakage because we are including information from the future.

14. Recap: Let's wrap all this up. Data leakage occurs when your model knows too much information, and it can be really hard to spot, so always try to be on guard for it. Today we've covered a few different methods for how to identify and prevent data leakage, as well as some best practices. Thanks so much for joining today and learning more about data leakage.
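As a closing illustration, here is a hedged Python sketch that combines three of the best practices above: split first (lesson 9), cross-validate on the training portion only (lesson 10), and wrap the scaler and classifier in a scikit-learn pipeline (lesson 12). It follows the general pattern that lesson 12 describes, but it is not the exact code shown in the video; the synthetic dataset, the 60/40 split, and the three folds are assumptions chosen to match the examples in this class.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, random_state=0)

# Split first, before any transformation touches the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Scaling and the classifier are bundled, so the scaler is refit inside
# every cross-validation fold on that fold's training part only.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Three-fold cross-validation on the training portion.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=3)

# Final check on data the pipeline has never seen.
test_score = pipe.fit(X_train, y_train).score(X_test, y_test)

print("cv accuracy per fold:", cv_scores)
print("held-out test accuracy:", test_score)
```

Because the scaler lives inside the pipeline, there is no separate normalization step to accidentally run on the full dataset before splitting.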