Fixing data problems using pandas and Python | Paul O'Neill | Skillshare

Fixing data problems using pandas and Python

Paul O'Neill

Play Speed
  • 0.5x
  • 1x (Normal)
  • 1.25x
  • 1.5x
  • 2x
7 Lessons (40m) View My Notes
    • 1. introduction

    • 2. identify nulls

    • 3. dealing with nulls

    • 4. duplicate rows

    • 5. inconsistent data

    • 6. 6 looking at the data

    • 7. sum up and class project


About This Class

In the real world data is messy. To be useful it often need to be cleaned. The ability to clean data will give you an advantage over other data analysts and data scientists who can't. This class uses pandas to identify and fix data problems within a data set. The class assumes you have a pandas environment set up or that you know how to do that. If not I have another class which explains how to set up pandas. 


1. introduction: Hi. Welcome to this class on using Pandas person to clean did a My name's poll. Um, when most people become interested in did a science, that analytics, they tend to focus on things like data visualization, machine learning, neural nets, all of that good stuff. But sometimes what's overlooked is actually cleaning debtor. It's very unlikely in the real world that you'll be given a data set that is in perfect condition, ready to go. It's quite likely there'll be problems with the data that could it might be missing. Some parts there maybe feels that are blank. You may have duplicated Rose. You may have other problems on the ability to deal with this. To take the data and get it into a condition that you can actually work with is a very useful skill to have on people with this sort of skill. On this sort of experience will stand out from other get analysts and data scientists who haven't really worked with that sort of process. The class itself is not really ah, complete beginner class. You need to be able. You need to have some knowledge of pandas. You need to have a working environment set up that you can run pounders. Um, if you don't have that, I I do have another class that explains how to set up a working environment using a Jupiter notebook. Pandas, I'm getting getting started with Data Analytics, so I hope you enjoy the class. Andi, I hope that it is obvious to you. 2. identify nulls: Okay, so this class is going to be looking at Harvard. Can use pandas. Pandas is ah, Did analytics did a science library for Python. Harvey can use pandas to clean, uh, delis. It's so most people who have an interest in data analytics, theater science, they want to get stuck into things like data, visualization, machine learning, that type of thing. But usually before you can even start to think about those sort of things, you have to clean your data. You have to get it into a four month that's actually going to work. And you have to look at issues like the quality of the data. Um there waas I think it was. IBM released a statistic which said that 80% of well did assigned to spend 80% of their time, um, preparing dinner. So that's getting it ready for did a visualization on machine learning of things like that . I'm only 20% of the time actually doing those sorts of things. I'm somebody else said that Well, yeah, we spend 80% of our time cleaning the data, but most data scientists will then spend the other 20% complaining about the quality of their did. Okay, so where I say using pandas, we're using a data set called Artists at CSP. On this has information about artists who have works of art within the Museum of Modern Art in New York. So first thing get started. Obviously, we need to import pandas that we need to import the data. Okay, so now we can look at the structure of our data, the harmony, for example, how many rows and columns doesn't have you see, it has just over 15,000 rows on six columns. And then if you want to see what those columns are, what's artist I d name? National Day. Okay, so one of the first things I want to look at his knows within the dinner knows air important. They can have an effect on different things. Especially, for example, machine learning machine learning does not like no values. So we're going to do is create 1/7 co. Um, a new column for our data said, And it's going to be called Ronel cunt, and it's simply some the number of knows within that rule, we're going to do that by taking our did it for him dot is no. So this tests of the the value the column value for the road is no are not. If it is, it's going to summit, and then it's going to put this value into this into this cold. Then we can take a look at that. You can use either the head of the tail function. This is argue column. Uh, these are just the last five rows in the dinner for him so you can see have a couple of ones on the nose. Kisses. The death year was no whatever. It makes sense. If the artists, for example, was still alive. Obviously there's no death year, so we can assume maybe at the time of compiling this data set, these orders were still alive. Then he looked last three. We have a 4 304 So in this case, no nationality, no gender, no death here, but also no breast year. This one has. It does have a gender, but again, nationality. Berthier on death year, Orel Blank drone knows so we can explain death. You're being No, but I don't see how we can explain the birth year being a no. So this is just bad didn't. Um obviously there should be. If this is a real person, they should have Ah, year of birth. And since it's not, it's missing. Therefore, being would just say it's bad. Did up what? We can get a better picture off Just how bad the did it is, or maybe how good it is by running this line of code. So again, we're looking at the our new Colmar Ronel Kondakova within our artists data from I'm going to use this function Value cuts. If you're not sure what value counts is or what it does works. Look it up on pandas Documentation pandas Documentation gives lots of very useful information about all of the different functions that you have to use to work with. So he run this. Okay, so now you can see there's 6000 just over 6000 rows that have one no value. So possibly like these rows about 4 to 5000 rows with no nos. Well, again, that's fine, because some of the artists we assume will be dead on. Therefore they will have a death here populated soas, faras, birth years and death. Years ago, maybe about 10,000 of our rose might be okay. But then when we get 2 to 3 or four nails Paro, um, we're getting into this sort of, you know, these sorts of rose here where we have missing nationalities. Genders may be other things. Names. There are at least four to high 5000 rows in this data said that are far from perfect. 3. dealing with nulls: Okay, so we know. I know that are did. It is not perfect. It looks like we have, uh, missing visors. Course. There could be other problems with the data, but for now, at least, we know that there are problems with no values. So then, harm, might we deal with it? If you're trying to run machine learning on the data said or did friend that has knows, it will not like that it will crash. It just doesn't like no values. So how can we deal with that? Well, one way to deal with it is, first of all, to filter out the rose that contend nails Andi rather than filtering out every single rule that contains no, we want to filter it. First of all, maybe the rose were the birth year isn't no, because we know that that's just not possible. Death years. No, let's find. But not having a birth year is just not possible. So these ones are definitely problems. So let's filter right. These ones to start with, um, we're gonna create this filter. We're just saying that within our artists did a frame this column the birth year within our artist did it for him. If it's no, then it becomes true in our filter. So now that we have a filter, let's apply it to our deficit. Our data frames, right. Um, so we're going to crit, and you did it for him. These are I wouldn't call it, but if you want, I've called it on. Born artists did a frame, so I can't be Look at this would use the head function this time. Okay, So you can see all of these artists during date have no birth year. In fact, these ones also have no death here, so I guess it is least consistent within that. See that some of these rose also have nationality and gender is knows. Well, so then how many rows are in this new data for you? So if he run the ship function again? Marcus, we have 3854 Rove's and again our seven colds. So this is one way of beginning to deal with the no values. We have no creditor. I did it for him. With what we think are probably problem rules within our get a set. Their problem rose because the birth year is all nobles on. We just know that that's not possible. So we have isolated those those rows within their own data frame. It may be the As you start to work with your dinner for him, you may decide that you're just going to throw away those rose. Or it may be that they're still useful information within those rules and you don't want to throw them away. So you put them in a separate theater for him, and they're there if you need to use them for something else. It is possible in panders to drop Rose with knows. So if we take our original data for him, artists underscore DF When we apply this drub n A. This has dropped the nose, and with this function, you can you can pass a parameter a subset parameter with a list of all the combs that you want to work with. I want to look at so in this case, we're just going to look at birth year. So in other words, we're going to get rid of all the rows that have a no birth year and that will give us a subset of our original data from which we're going to call artists underscored. Born different If you run that, then again, we can look at when we look at the 1st 10 rolls this time. Okay, so you can see that all of these seem tohave a burst here and again. We can run the value cunts on our new did a friend. So you see, we have fewer problems, Not you have 6 10 Just over 10,000 rose. That haven't most just one. No, on that no. Could just be the death year. We're not sure, but it's a possibility that many of these rules will be the missing death. Here, we still have two and three rose with two or three knows. Although it's done to have yet 100. Um, Creed. Another filter. This is gonna call more than one. No. And then apply that filter to our artists born different. So basically, what we're doing is identifying the rose, uh, these rules that have more than one. No taru. We're putting it into a new data from which we call more than one no different. They never take a look at that one. Okay, so you can see a lot of these have to nose, and it looks a lot of them have one of the noses death here, which may be OK on then, the other one. Things be lots of missing gender field. So again, we need to decide, you know, Is this a problem for us? Um, do we still want to use the data throw that did away? At least now we have identified and isolated thes rose. So we're in a better position to decide what we're going to do. So again, I look at the value counts on this new deal. If it I say we have, But he had 100 rose with two or three nos. So these are the We have isolated thes rose. Okay, So this is her one of one? No, Perot did a friend, and it does look like gender. Is the man missing column? Okay, so one way of dealing with Rose that have, um, no values. It's just simply to throw that deter away to drop those Rose thistle's kind of the nuclear option. But it is an option. So I just wanted to show you how you can do that. Okay, so we have our artists underscore DF did a frame. We're going to drop every single row that has a no value. And that includes all of the roads that have a no for death. You. I'm never going to put the did into a new data for him called artist DF underscore. No, nos. So as you say, this is a bit of, ah, a nuclear option. But it's just to demonstrate that you can drop the rules that have nose. And if we check that so obviously we have it drop down to just 4429 Rose. We've lost a lawful lot of Argueta. Um, but all of these rules hadn't No, no, no values. Okay, so we have seen that one way of dealing with nails is just to drop the roads that have the Knowles. We saw a couple of examples in the 1st 1 We use the subset parameter to specify specific combs so we don't need drop the road if value and this column was enough, um, or nuclear option was then just a drop in a with no parameter no parameters going into the function. And this will just drop every single rule that has at least one? No. Um so are there any alternatives to this? Of course. Yes, there are. Um, one interesting and potentially useful function is thief Phil and a so n a nose on this. This function relies to do various different things. I recommend going to the pandas documentation looking up all the different uses of this Phil and a But what we're going to use just simple example where we have no within the gender field, we're going to replace that. No, with a value. I'm just gonna use upper kiss you to represent on the door. So the first thing we do is set up. It's a simple dictionary. Uh, just the one value pair. So specify gender as the the column and the value that we're going to replace all with bells with. So we just need one more line of code Then so are artists. Underscore D f. Did it for him. Our original did different is equal to artists dot DF. But this time we're going to fill the nose with our values. It's when we run that I mean, look at artist DF we can see than the gender field. We're not getting these use coming through. So before these were knows and they've never been replaced by years. So again, this is useful. If you don't want tohave knows. For example, if you are doing machine learning or something like that, you have to get rid of the nose. Instead of just dropping rose, you can replace them with fixed values. 4. duplicate rows: Okay, So another problem that can sometimes occur is duplicated Rose. Um, so again, using the mental health survey descent. What I've done is I manually went into the Sea SV file. Copy the first row on Pissed that it didn't. So now we have a replica room. Okay, so he just load the data again on take a look. 1st 5 rows. See the first road rose zero on role three. Our identity, I guess I know these are identical because I copied this and pace that it didn't. But in the real world, you may have did a set where you have triplicate rose like this. But of course. How do you know that there are duplicate rose? If you have a data set with a 1,000,000 rose or even 1000 rows, you can't go in and manually look at each road to see if there any identical rose. But pandas offers a function that will help. So our new data for MDF too that has the to triplicate Rose. We're going to use this function duplicated. Now we're just going to look again. The 1st 5 rose. This function returns a Siri's in the first kiss the first instance off this road. Is it replicated? No, it doesn't. But then on the fourth row, it's saying, True, because this is a duplicate off this first row. So I'm then, if it anymore Dip lick its of this rule again. Say number seven was a triplicate would have, ah, false, false. True again. So again, you can critter a new column in your get a set at the Siri's as the new column, you can create filters, and you can remove and eat replicated rose. 5. inconsistent data: Okay, So we're going to look now at another type of problem. We're going to run another Tripler notebook on a different data set. So again, start import Banda's and then we need to import are or did it into a data from I can just check. We have this time smaller data set. 1000 2000 1259 rose. 27 columns. So not somebody rose, but lots of columns. Um, we check These columns are we have timestamp age, gender, country. This did a set, um, deals with a survey on mental health in the world of I T. Okay, we just check some of the data 1st 5 rows just looking at these rows. Time stump. A time stamp age seems okay, but when you look at gender, we have female upper kiss em. Let me have a meal. Male male. So there seems to be some variation in hi people are and high people are stating their gender. I'm assuming that uppercase M is the same as meal, but it's hard to be certain. So again we can look at this column gender of what sort of values air in here. So remember, if you want to specify a specific column, you have to put the name in inverted commas inside square brackets. We're using the function value cuts. Okay, so it looks like maybe when they were gathering, this did a well. They were asking people to feel like the survey. It wasn't. People didn't have ah, drop down to choose from a fixed number of options and said, it looks like they've typed in whatever they wanted to put in. So you have meal spelt with an uppercase M meal with a lower kiss. Just another kid. Sam. Same a female. It could be a upper case. F floor kids upper case F by itself. Lord, there's a lot of things nethers Also, I guess spelling mistakes, Mick. I guess that's supposed to be meal, the K and the L or beside each other on the Quartey keyboards. Um, it was just a man. Um, so someone else has identified Is meal as a postage? Uhm, I'm guessing that is a mistake. No one mile. So the meal or Mylar, it's the mail again, the e. And they are beside each other on the keyboard. So maybe a type of the basic. There's lots of variation here. Um, again, as the did analyst of the data scientist, you need to decide, What are you going to do about this? Um, you could just throw away everything that doesn't meet. What do you think it should be? But you're throwing away a lot of data. For example, if you only choose the roads that are uppercase M. Ailey, you're throwing away all of the roads that have the dorky ists em. And there's also still quite a few rows, which just up Ricky's them on a few rows that are just lower case him. So, throwing away the data, you're going to be losing a lot of Rose. So what other alternatives are there? So one alternative that we can use is to re assign values to try and normalize the values. So everything that's uppercase meal, Lorca's male etcetera. We'll put it all just Aziz. One value on this line of code will do that for us all. We're going to say that our data frame is equal to our did a friend, but we're going to replace and the gender field so uppercase M l E will become nor kiss em daily uppercase m will also become Lord ksm Ailey and so on. Um, I've only included three here, but you know, there is many more that you can choose. For example, the m ai l somebody wrote you could say, Well, it should be m a l e. So if we run this and again, we can look at the value cunts. It's nice e instead of all these different meals, we've not consolidated most of them into this m a l e. It's nigh up a 971 before it was just 206. And again, you can choose more. I only chose a few there to demonstrate it's a little bit tedious typing all this site, but you could include, for example, the M a k a. The what I assume are probably typos. May I I l etcetera include all of this into this dictionary. So using this approach what you'll end up with his, um, meal spelled with old Yorkis Maali. When you do the same with a female lord, kiss F E M A. L E and then all of the other gender types, so they said. This is one way of dealing with inconsistencies within the data, with art using the nuclear option of dropping lots and lots of rose. It's also another example of why it's good idea to take a look, actually printed some rose and take a look at the data. This value cans function is very useful for finding out what sort of values you have within a column, And it is, I think, important to not assume that your data is good quality. Um, may be better to be a little bit pessimistic when it comes to did equality and assume that there probably are problems on. Part of your job is to identify those problems and then decide how you're going to fix them . 6. 6 looking at the data: Okay, So one last example, um illustrates why again? It's important to look at the data and get to know something about the data because you never know what you might discover. So before really working on the data said, it's probably worthwhile. Just spending 30 minutes just looking at Rose and did and see what's in there. So as usual, start off with importing pandas on. We have another new data set. This one is mass shootings in the USA. So again, we need to import the data into a data for him. Then we can look at the ship of the data. How many rows and columns says it's quite a small data set. There's only 390. It rose 13 columns. We could look at the 1st 5 rows so straight away we can see there is missing data in here again. Gender latitude, longitude. We look at the location cold. See if there's any locations that have more rose than others. Okay, so it looks like Seattle Washington head the most number of recorded mass shootings, at least in this data set, so we can take a closer look at, uh, Seattle Rose. So in create a smaller did it for him. Subset of our man did it for him. I'm you called us. DF underscore Seattle This line of code will select those rose where the location column in our men DF did. A frame is equal to Seattle Washington When you're using these bullying conditions and it's equal to equal to a string. In this case, remember to use a double equal sign if you use only one equal sign. The Python thinks you're trying to assign a value, and you'll just get a never message. Okay, so again, I looked the ship of are smaller data set Seattle deficit. So, as expected, we have six rows on our 13 columns. So because there's only six rows, we don't need any head or tail function. Reduce. Look at the entire didn't set. Did it for him. Okay, so all locations Seattle. Washington One thing that stands out is the dit, So we have two rows that are on the same day. It summaries air. Different titles are slightly different, but fatalities injured. See him. There's obviously a problem with this total victims because seven plus two does not equal it. Yes, looking at it more. You can see that six plus one does not equal six either. So I'm assuming total victims should be some of fatalities and injured. So whatever method was used to create the values in this column, it doesn't seem to be working. That's something that could be easily fixed. We could right liner to a pandas. That's going to just take this column, added to this column on Put the value into this cold. But I don't think we can trust the data set values. We wondered that how many other total victim, how many other rose and there's Total Victims column are also incorrect. That would be something we did check because obviously, then, if we're doing any did analysis and we're using that did it in this column, we could get inaccurate results. So another thing, then, is these two rows that occur on the same? Did you have to wonder? Was there to mass shootings in Seattle on the same day? It on the same day? When you look at the latitude and longitude, they're almost identical. So in other words, you had, according to this data set, you had to my shootings in Seattle on the same day within a few feet or a few meters of each other. That seems unlikely, difficult to believe. So there's something else going on. One possibility might be that the person who created this data said, was maybe scripting information from news sources on the Web. So New source one gives you this road news source to get you this road. Andi follow the data is very, very similar. It's not exactly the same, and that would explain small discrepancies. But whatever The Raisin Ford were not aware that there are some problems with this data set , we could have values in total victims that are incorrect. We could also have rose that are not exactly gyp lick. It's so we've already looked at how to identify Duplicate Rose. These These would not be picked up by the duplicated function because they're not the same . They're very similar, and actually, this row on this road probably are referring to the Sciam incident. Sometimes problems are not going to appear when you use functions like value, Constand replicated drop in and all of those sort of functions. Sometimes you just have to open up, the data said, and start looking at some of the did and see if you can see any problems 7. sum up and class project: Okay, so just a very quick sum up of what we have done. Um, yes, we have seen that. You can't just assume that data sets will have perfect dinner. Sometimes you just need to sit down and look at the data. In fact, I would recommend if you're working with data, take some time. Just when you have a new data set, just open it up. Look at the data. I. Sometimes it's going to be easier to look at the dinner because, uh, it's not all just numbers, but you could certainly see if there's missing did and things like that use panders to help . So some of the functions, like value counts, replicated they will help you to find problems within your data. So then the class project is simply for you to go and find deal sets on. There are plenty of places on the Internet now You can find it upsets most national governments, city governments, organizations like W. Chou, the European Union. They produce data sets and they have portals where you can download data sets to work with . There are also other sites like Chicago, which have open source data sets that you can work with the data sets from I'm Working with Hero came from cargo. So once you find it, did I said downloaded patisserie ponders. Take a look at it just visually. Take a look at some of the Rosa's well, and see if you can identify those problems. Upload your findings along to the project part of the class.