Beginner data science with python - from zero to confident beginner | John Harper | Skillshare


Beginner data science with python - from zero to confident beginner

John Harper, Cambridge Programmer, AI engineer

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more


Lessons in This Class

15 Lessons (1h 27m)
    • 1. What is to come DS ML DL AI

      9:26
    • 2. Introduction to data science

      5:35
    • 3. Getting set up for data science

      3:33
    • 4. Accessing our first dataset

      3:17
    • 5. Loading our data - Pandas

      4:51
    • 6. Basic exploration of dataframes

      6:41
    • 7. Accessing columns

      2:36
    • 8. Basic visualisation - count factor

      8:49
    • 9. The four Cs of data science

      3:03
    • 10. Variable types

      4:38
    • 11. PyTanic Creating

      15:03
    • 12. PyTanic Completing

      6:36
    • 13. Pytanic Converting

      4:42
    • 14. PyTanic Correcting

      7:06
    • 15. Titanic data set data science recap

      1:03

Community Generated

The level is determined by a majority opinion of students who have reviewed this class. The teacher's recommendation is shown until at least 5 student responses are collected.

90 Students

-- Projects

About This Class

  1. In this class, we will take you through the ins and outs of beginner data science. Using Python, you will learn how to access, load, visualise and clean big data sets.

Meet Your Teacher


John Harper

Cambridge Programmer, AI engineer

Teacher

Class Ratings

Expectations Met?
  • Exceeded!
    0%
  • Yes
    0%
  • Somewhat
    0%
  • Not really
    0%
Reviews Archive

In October 2018, we updated our review system to improve the way we collect feedback. Below are the reviews written before that update.

Why Join Skillshare?

Take award-winning Skillshare Original Classes

Each class has short lessons, hands-on projects

Your membership supports Skillshare teachers

Learn From Anywhere

Take classes on the go with the Skillshare app. Stream or download to watch on the plane, the subway, or wherever you learn best.

Transcripts

1. What is to come DS ML DL AI: Now that we're coming into the section on data science, we're going to be looking at a range of concepts. We'll start with data science and move on to machine learning, deep learning and AI. There's a lot for us to cover, but don't worry, we'll be going step by step through all of it. I thought it would be good to have a short lecture just to introduce this chapter of the course for you. So with data science, strap in, you're in for a ride, because we're going to be going through some really exciting concepts and an impressive practical project where you're going to be able to integrate data science, machine learning and deep learning. These are cutting-edge topics, and really exciting things are being done with them in the professional world right now. One thing to keep in mind is that all of these concepts are based on maths and statistics. You don't need to get carried away thinking that AI is purely philosophical or anything like that. It's all based on maths and statistics, and then we write code to implement it. So we're going to be going over the maths and statistics. Don't worry, I'll be leading you through it, and I'll make sure we cover the mathematical concepts in a way that is graspable. So don't feel like we're going to be doing some crazy maths you're never going to be able to understand. Just to talk about what we're going to look at here: data science, in a nutshell, is when you take a lot of big data, and you start by cleaning the data, visualising it and engineering it to provide as much actionable insight as possible. We'll go over what that means later on. In a sense, you can think of cleaning data as handling missing values or misspelled values and things like that.
It's very basic things like that. And when I say actionable insight: let's say, for example, we have an insurance company that has lots and lots of data, and they want some insights on whether they should diversify into home insurance or something like that. A data scientist could collate a lot of data on home insurance, clean it up, and use visualisation to show the people who are going to be making the decisions: this is why it may be a good idea to move into home insurance, or this is why it might not be, and things like that. Machine learning is a step up from that. Once you have data that's been cleaned, you're then able to train algorithms, and these can be used for predictive purposes. For example, a machine learning algorithm could be used to make predictions on things like fraud. Let's say a bank is looking at the behaviour of one of their customers and it looks like it might be fraudulent: machine learning algorithms could be developed from behaviour patterns in order to predict the likelihood of it being fraud. Data science and machine learning are very much on the cusp of each other, because in data science we also use algorithms, so sometimes the terms can be interchangeable. Then there's deep learning, and this is something that's being talked about a lot these days. That's the training of neural networks, similarly for predictive purposes, but sometimes also for generative purposes, and we'll talk more about that later in the course. And then finally, artificial intelligence. This is where computer systems integrate machine learning or deep learning concepts in order to perform human-like tasks. Let's go a bit more into each one of these, starting with data science.
Here are a few of the steps that a data scientist might take. First of all, the data scientist is expected to be able to define the scope of a project: for example, the outcomes they're looking for, what they're looking to get out of the data, and what they're looking to show as actionable insights. Then, of course, there's data acquisition. How exactly will the data scientist gain access to their data? Do they need to buy it from a third party? Does their company already have it? This is something a data scientist will need to consider. Then one of the big things, and some data scientists describe this as 90% of their work, is data preparation: cleaning the data and transforming the data so that it's in a format that's actually useful to them. There's also exploratory data analysis. This is the exciting part, where the data scientist can look at all of the different variables in their data and decide which features are actually going to be useful for insights, and which ones they need to engineer in order to provide more insight. Then we have data modelling, which is essentially using algorithms, just like machine learning. The reason I put this in brackets is because it sits in the machine learning domain as well. Finally we have visualisation: creating graphs, charts and things like that in order to better communicate the actionable insights to the stakeholders. So, machine learning. Machine learning is the use of algorithms and statistical models to perform tasks without explicit instructions, instead using patterns and inference. For example, let's say we wanted to build a house price predictor. We have loads of data on houses in a certain city, and we have the prices they've all been sold for. We could then use that in an algorithm that needs to perform the task of predicting the price of new houses. We haven't provided it with explicit instructions.
We just need to get the prediction back out. As you can see in the diagram on the right, there are three different types of machine learning: unsupervised, supervised and reinforcement. I wouldn't worry about that too much at the moment; we'll be going into much more detail in the machine learning part of this course, but you can see there are a lot of different ways of using machine learning. Supervised learning is, for example, what I just talked about: we're given data, we provide labels, and we're saying that the model needs to be able to provide predictions. Unsupervised learning is being given data and being asked to show patterns within the data, but not to do any prediction or anything like that. For example, an algorithm might be given loads of data on the voting patterns of different demographics in an election, and it could group together different demographics and show some sort of insight into how different demographics vote. Finally, reinforcement learning. You can think of this as a robot, for example, that's been given rules and a task: you can move your arm up, down, left and right, you can open and close your hand, and you need to pick up that glass of water over there. The system has to explore all the attempts of moving the arm and opening and closing the hand until it succeeds at the task of picking up the glass of water. Another good example is DeepMind beating the Go champion. Go is a board game, and basically the algorithm was provided with all the rules of the game, and it just played against itself millions and millions of times. So, deep learning. That's a family of machine learning that uses neural networks to achieve the outcome. This is being talked about all the time, and people think it's incredibly complex and only the smartest people can understand it, which just absolutely isn't true.
It does require big data and a large amount of mathematical calculation. Fortunately for all of us, the computer does the calculations, not the human, and that's why deep learning is so useful: it can perform tasks that humans couldn't consciously do, with mathematical calculations at least, though it can require a large amount of computation in production. Finally, let's talk about artificial intelligence, because this is the big buzzword. Artificial intelligence is simply the application of machine learning and deep learning to perform human-like tasks. That's, for example, facial recognition, being able to recognise the face of a specific human, or converting speech into subtitles, or sales forecasting, or self-driving cars. All these things are basically implementations of machine learning and deep learning algorithms. I don't want you to feel too overwhelmed at this point. I know I've just introduced a lot to you, but really, we're going to take it one step at a time. I can completely assure you I'm not going to jump the gun and start talking about concepts that you won't be able to understand. I'm going to be using some fancy labels like deep learning and machine learning, but at the end of the day it's all just maths, with a bit of intuition, which I'll also be nurturing in you as we go along. So what's more important than being really good at maths or being really intelligent? What's most important is that you have patience and perseverance. I think a lot of people who have gone through the steps of learning data science, machine learning and so on will really agree with me on this: as long as you're patient and you persevere, you can definitely master this, and the rewards can be absolutely great.
Not only in terms of the job demand for people with these skills and the salaries you can get, but also because it's a really cool skill to have, and it's really, really interesting once you get into it. The final point I want to make is: if I can do it, then you definitely can, for sure. So up next, let's get started. Let's get stuck in with data science. 2. Introduction to data science: Let's take the exciting first step now into data science. We'll talk about the basic work that a data scientist does and different examples of how we use data science in applications. In the previous lecture, we talked about the basics of what a data scientist does: for example, defining the scope of a project, data acquisition (how they get hold of their data), how they prepare data, how they explore the data through exploratory data analysis, and finally visualisation and communication for all the stakeholders involved in the project. Let's go through these one by one, starting with defining the scope. The importance of deciding the scope is to be able to succinctly communicate the scope and the outcomes of a project to the stakeholders, and that will be really important; it will put you way above the rest in terms of people applying for these sorts of positions. When we talk about the scope, what I'm talking about is the business objectives, whether for the business or whatever team you're working with: what are your objectives? Then also what your expected outcome is. For example, let's say you want to do data science on sales projections for a company. The expected outcome is your prediction for sales, but also maybe some insights into the trends in sales in general, and also how you expect to approach the task. Once you have done a few data science tasks, you will be experienced enough to say: this is how I expect to approach the task.
I'm going to use these kinds of methodologies, for example. Then you need to consider which stakeholders are involved. When defining the scope, you need to talk about what kind of reporting they require. Do we need to provide charts and graphs here? What milestones and deadlines will be required? Those are the basics of what you need to define the scope in data science. Then, of course, we have data acquisition. For example, you can get big data from data brokers; these are third-party companies who basically sell data in large amounts. You can get data in-house, of course; a lot of companies do that. Some just create surveys to get their data from whatever users they're interested in. You can also get data from sensor networks. Let's say, for example, you want to do a big data project that has something to do with forest fires. You could set up sensors in an area, for example California, where there have been a lot of forest fires, for things like gauging the temperature of the soil, the humidity and the water content of the soil, and then you can put that against the information on the forest fires, and there you have your data. Those are a few ways of accessing your data. Now let's talk about data preparation. What data scientists spend most of their time doing is this sort of thing: cleaning and organising their data, compared to actually building their training sets or doing the machine learning. This is what it's all about, and that's why I'm bundling it under data preparation. Although it might seem like data science is always a big, exciting field, there is a lot of preparation involved, and you'll get used to this quite quickly. That's cleaning and transforming your data. For example, you often need to convert your data into the correct formats required for feeding into your machine learning model, and you need to fill in your missing values.
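The cleaning steps described here can be sketched with pandas. This is a minimal illustration using a small hypothetical table (the column names and values are made up for the example), showing one common way to fill missing values and drop an obvious outlier; real projects would choose these rules case by case.

```python
import pandas as pd
import numpy as np

# Hypothetical survey-style data with the problems described above:
# a missing age, a missing fare, and an obviously wrong (outlier) age.
df = pd.DataFrame({
    "age":  [29, np.nan, 41, 350],   # NaN = the respondent left the field blank
    "fare": [7.25, 8.05, np.nan, 13.0],
})

# Fill missing values with a simple choice such as the column median
df["age"] = df["age"].fillna(df["age"].median())
df["fare"] = df["fare"].fillna(df["fare"].median())

# Handle outliers, e.g. drop rows with an implausible age
df = df[df["age"] < 120]

print(df)
```

After this, the frame has no missing values and the impossible age is gone; the median is just one easy default, and you would pick a fill strategy that suits each column.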
Let's say, for example, you get your values from surveys: some people may not actually fill in all of the required fields. You also need to fix misspelled or incorrect inputs, for example if someone writes 14 incorrectly, or puts a completely different value in, like putting 'male' into the age field, things like that. And then there's handling outliers: data that obviously isn't right and may affect your results in not-so-useful ways. Then, exploratory data analysis. This is the fun stuff. For data science teams, this is all about saying: we want to use these features in the data because they seem to be important for our goal. You can then communicate this to the stakeholders by saying: we have found that feature X and feature Y are highly correlated with our stated goals. That could be, for example, if we're doing sales projections for an ice cream company, a very simple thing would be to say we found that the outside temperature and the time of year are highly correlated with the goal of increasing sales of ice cream, something like that. One example of exploratory data analysis is feature selection. Let's say you wanted to do sales projections for ice creams and you had 100 different variables: not only the time of year and the temperature, but also, say, distance from schools, which would be a useful one. Then there are less useful ones, like the number of trees in the area, and you can actually choose to remove those features that won't be useful for your machine learning model. Here's another example, for house prices: yes, number of rooms is useful.
Distance from the nearest school is useful, but distance from Iceland probably isn't. With exploratory data analysis, visualisations are often very helpful as well, not only for yourself, because it's really helpful to have visualisations to show you the correlations, but also later on in communicating your findings to the stakeholders, like I said before. Now that we've gone over the basics of data science, let's get you set up so you can start doing data science yourself. 3. Getting set up for data science: Because we're going to be primarily using Python for data science and its associated libraries, all we really need to do is install a few of these libraries ourselves. First, we need Jupyter Notebook set up. If you don't have that, essentially you just type in pip install jupyter, and you have that set up. The next thing we want to install is scikit-learn, which is a very, very helpful library for a lot of data science: pip install scikit-learn. Now scikit-learn seems to have installed properly, and as you might be able to see on this line here, it also installed joblib, NumPy, SciPy and a few others. So it already comes with NumPy, which is helpful. If yours is still installing, feel free to pause the video here and wait for it to finish, but otherwise we'll press on. The next thing we want to install is NumPy, in case yours didn't install when you installed scikit-learn: pip install numpy. As you can see, it returns saying 'requirement already satisfied', so I've already got it, and that's fine. Next, we want to do pip install pandas. NumPy is a really helpful library for mathematical calculations; pandas is a great way to do data exploration for data science.
When you're cleaning up the data, we're going to be using pandas a lot: to load in data, to view the data, and also to make changes within the data very efficiently. Pandas is a fantastic library for that. I'm going to pause the video here while pandas installs, and you can pause the video as well. Now pandas has been installed. Lastly, for now at least (we're going to install a lot more as we go along), we're going to install something called matplotlib. That's m-a-t-p-l-o-t-l-i-b, so pip install matplotlib. We'll leave this to install, and I'll pause the video while yours installs as well. Now we've installed matplotlib, which is a very good way of visualising data and creating graphs and plots. We now have the basics of what we need for our data science module. If we ever need to install anything else, we can always come back to the command line, use pip install, and type it in here. I'm sure we'll be back here at some point, but for now, let's run Jupyter Notebook as we normally do. Just to remind you, Jupyter Notebook is a very useful graphical user interface we can use to run small bits of Python code at a time. It's very useful in data science, especially for data visualisation, for seeing the graphs in a nice format. It's helpful in a number of different ways. So let's just make sure that Jupyter Notebook is working and that we're able to import all those things we just installed. Let's wait for this to load. I'm going to create a new Python 3 file. Okay, so now let's try importing some of these. I'll try import numpy to start off with. Let's see if that works and if any errors come up. Then let's see if we can import pandas, sklearn and matplotlib. That all seems to be working just fine. We now have Jupyter Notebook working, and we have the correct libraries set up, so we can get started with our data science.
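The import check at the end of the lecture can be done in one cell. This sketch looks up each of the four libraries without crashing if one is absent, and prints a hint for any that are missing:

```python
import importlib.util

# The four libraries installed in this lecture; find_spec returns None
# for any that are not installed, without raising an error.
libraries = ["numpy", "pandas", "matplotlib", "sklearn"]
missing = [name for name in libraries
           if importlib.util.find_spec(name) is None]

for name in libraries:
    if name in missing:
        print(name, "MISSING - try: pip install", name)
    else:
        print(name, "ok")
```

If everything prints "ok", the setup from this lecture is complete.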
So make sure that these are working for you, and then we can get started in the next lecture. 4. Accessing our first dataset: Now that we're set up with the libraries and modules we need for data science, let's gain access to our first dataset. Please go over to kaggle.com. This is a fantastic website that has a wide range of datasets. If I click through to the datasets page here, you'll see how many they have, covering a wide range of different subjects: you have everything from environmental data to crimes in Boston to gas prices in Brazil. These are the hottest at the moment, and you can also sort by the ones with the most votes, the newest ones, the most usable, all those kinds of things. Not only can you gain access to datasets, but if you click through here, you often get a nice description of what a dataset contains. What's also really helpful is if you go to the kernels, which is something you may not have heard of before. Essentially, what you have there is people describing how they went about solving the problem, so you can see whether they used Python or a different language, and you can see how many upvotes they've got. This one, for example, has had 784 people say it's great, so you can click through and take a look at the author's recommendation for how you'd go about solving the problem. Usually, if they're upvoted by a thousand people, they tend to be pretty thorough and pretty good. So Kaggle is a great place to learn about how to do data science, and we're going to use one of their most often used datasets, which is the Titanic dataset. First of all, you want to sign up; I'm just using one of my old profiles. Then go to the search bar and type in Titanic.
This is after you've created yourself an account, a free account I should mention. Just click on the top one here: Titanic - Machine Learning from Disaster. Then, once you've gone through, go to the Data tab. You can read the description as well if you like. We want to download the training and test sets, so what we want to do is find where we download them, which should be around here. Click on Download, and as you can see, we've got a titanic.zip. I'm just going to open that, move it onto my desktop, and then unzip it. Let's take a look at what's inside this Titanic folder: gender_submission, train and test. What we want access to at the moment is the train one, so I'm just going to extract it all. Excellent. Now, what we have to make sure of is that when we try to access this data, we start our Jupyter notebooks from that directory, and we'll be going into that in the next lecture. But right now, what we're sure of is that we've downloaded the correct data. If I open one of these now, what we care about right now is the training dataset, which is just going to be called train. As you can see, it's a large spreadsheet, and we'll go over it in the next lecture. 5. Loading our data - Pandas: Okay, so we now have our dataset. I've got mine in a folder called Titanic, which is on my desktop. Just to make things really easy for myself, I'm going to run Jupyter Notebook from this folder, so it's very easy to access this file here. First of all, we cd to the desktop, we go into my Titanic folder, and from there I'm going to run Jupyter Notebook. This way, when I want to open the file, I don't have to bother putting in some long directory path; I can just put in the exact file I want to open.
Okay. As you can see, we can already see our train.csv file here. This file type, CSV, you might have seen before; XLS, for example, is quite a common one for spreadsheets, but essentially it just contains all of our data in columns and rows, just like you'd expect from a normal spreadsheet. I'm just going to set up a new notebook here. The first thing I want to do is import pandas. Now, because I'm going to be using pandas a lot in the commands, I'm going to shorten it: import pandas as pd. You might remember from the Python module that you can import something, and if you write 'as' followed by whatever name you want, you can then use that name instead of having to write out pandas every time. It just saves me a little bit of time. Once that's been imported, I can write pd, which is now pandas, and this has loads of different functions that I can use, for example to open the CSV, to look at the data, or to change the data. What I want to do right now is to open my CSV. It's as simple as writing the read_csv function and putting in the name of the file. So we now have the data for the Titanic. I have the Titanic folder on my desktop just to make things really easy, and I'm opening Jupyter notebooks from this directory, so it's really easy to just open up the file rather than having to put in some long directory path to access it. So I'm going to cd (change directory) into my desktop and then into Titanic, and from there I'm going to run Jupyter Notebook. Once Jupyter Notebook is open and running, we can essentially just load up our CSV using pandas, and then we can access it immediately. This is quite an exciting time, when you suddenly realise that Python has some cool features where you can suddenly open a large amount of data.
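The loading step described here is a one-liner. In the lecture it is simply pd.read_csv("train.csv"); so that this example is self-contained without the Kaggle download, it reads the same kind of comma-separated text from an in-memory string instead of a file (the three rows below are made up to mirror the Titanic columns):

```python
import io
import pandas as pd

# Stand-in for the contents of train.csv (hypothetical rows)
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,Fare
1,0,3,Braund Owen,male,22,7.25
2,1,1,Cumings Florence,female,38,71.28
3,1,3,Heikkinen Laina,female,26,7.92
"""

# With the real file on disk this would be: df = pd.read_csv("train.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)   # (3, 7): three rows, seven columns
```

The variable df now holds a pandas DataFrame, which is what the following lectures explore.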
So now we're getting into the big data stuff. The first thing I'm going to do is import pandas. This is what we use for opening data, looking at data and changing data; it's just a really, really helpful tool for data science. Because I'm going to be writing pandas a lot, I'm actually going to shorten it to pd, so it's a bit shorter and easier, and a lot of people do this as well: import pandas as pd. We went over this in some of the Python lectures some time ago; hopefully you still remember it. Now that that's imported, I can write pd followed by a dot, and this gives me access to loads of different functions provided within pandas. What I'm going to be using is read_csv, and all we have to do is put the name of the file, train.csv, in the brackets. Like I said before, I'm running Jupyter Notebook from this folder; I cd'd into the folder that contains these files, so all I have to do is write in the name rather than a long directory path. Now, when I run this, you're going to see that we have the data available to us immediately.

And just like that, we have a whole dataset. As you can see, it's only showing a short amount of it, because it's put dot dot dot there; it doesn't want to show it all. But we seem to have 891 rows and 12 columns, so basically 891 data points, that is, different people, and the columns represent different bits of information about the data points. For example, here we have the name, the sex, the age, the ticket, the fare, etcetera. We probably want to store this information in a variable, so that we can refer to it whenever we want to change something in one of the columns or view something in the data. Essentially, this is all we need to do to access the data: we import pandas (I decided to import it as pd just to make it easier to write), then within a variable I've done pd.read_csv, with train.csv in the brackets. So have a go yourself; hopefully you'll be able to load up your data, and when you're ready, I'll see you in the next lecture.

6. Basic exploration of dataframes: So now that we've loaded in the data, let's do some basic exploration of it. What we load up is called a dataframe, so I've just named my variable df, short for dataframe; it's a pandas dataframe that we're working with. I imported pandas like we discussed last time, opened up the CSV, and I'm storing it as a dataframe. It's not that I have to call it df, I can call it whatever I want, of course, but it is a pandas dataframe, so it's just easiest to call it df. If we look at the dataframe, we have these columns, which are essentially the different bits of information we have, and then we have all of these rows, which are the different data points: the different people who were aboard the Titanic. Just in case you don't know the story of the Titanic: essentially, it was a ship that set sail and sank; a lot of people died and a few people survived. The whole point of this dataset is to see if we're able to predict who survived and who didn't, based on the information we have: their sex, their age, their ticket, their fare. We want to see if we can create some kind of model that's able to predict who did survive and who didn't. In our training set we have most of the data, and in the test set we have a little of the data to test on later; we'll go into that in more detail. First of all, let's do some basic data exploration. Let's look at the shape of the data. If I take df, that's the dataframe, and call df.shape, we get 891 by 12. That's saying we have 891 rows, that's how many people we have, and the 12 is how many bits of information we have, for example the sex, the name, the ticket fare, that kind of thing. That's very useful. The next thing we can look at is the size, which is essentially just how many bits of data we have; that's pretty much the number of rows multiplied by the number of columns. After that, let's look at something even more useful: describe. You put in the name we've stored our dataframe as, so I've typed df.describe with empty brackets. Now this is starting to help us out a bit. You'll notice that we don't have all of the columns that we had at the top, so let's just print out the dataframe up here so you can refer to it. As you can see, things like name, sex, ticket and cabin are not appearing over here. That's because they're not numerical; you can see that they have letters in them. Whereas the numerical ones, such as age, survived and Pclass, are all in there. That's interesting to note, and we will be looking at it later on. So we'll just look at the numerical ones here first, where we can see how many we have of each one. Survived has 891, which seems about right; for age we have 714, so it looks like we're missing the age information for a few people. We should also look at the mean. For Survived, you can see zero or one: one is if they survived, I'm guessing, and zero if they didn't. So we can actually look at that number, 0.38. That means that only about 38% of people survived, which is quite a low number. The mean age was 29, and the mean fare was 32. We can also look at std, which stands for the standard deviation. That tells us, basically, how spread out the data is: if it's quite high, it means there's quite a spread in the data.
So, for example, there's a standard deviation of about 14 for age, so there's quite a spread of ages, and a large spread of fares too. Then we have the minimum and maximum: the minimum age was 0.42, so that's obviously a baby, and the maximum is 80, so someone on board was relatively old. So describe is a really helpful way of looking at numerical summaries. Now let's have a look at the next one: .info(). This is very helpful for seeing what data is incomplete. When we talk about something being null, that's essentially when the data just isn't there, and this is a real annoyance in data science. In all of your datasets you'll get some places where someone just hasn't input a value, and half the struggle of being a data scientist is cleaning up your data before you can do any of the cool stuff, like creating models and doing machine learning. You have to sort out your null values first. Here, Cabin only has 204 values filled in, so we're missing a lot there. We're also going to have to clean things up for Age, because we're missing some, and Embarked has 889, so we're missing two. All the rest are at 891, so they're fine. This helps us make a note: if I were going through this myself, I'd note that Age, Cabin and Embarked are missing some information, so I need to go through and clean them up. We'll be going over how to do that in a lecture to come. Finally, let's take a look at a sample, just to see what one row looks like. I do df.sample() with empty brackets, and every time you run it, it takes a random part of the data. Here you have one of the passengers, and you can see the person's age and whether they survived.
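The completeness checks just described can be sketched like this (a hypothetical frame with gaps, mimicking the Age/Cabin/Embarked situation; the isnull().sum() idiom is the counting trick the course comes back to later):

```python
import pandas as pd

# Hypothetical frame with some missing values
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, None, "C85", None],
    "Embarked": ["S", "C", None, "S"],
})

df.info()                 # prints the non-null count and dtype of each column
print(df.isnull().sum())  # missing values per column: Age 2, Cabin 3, Embarked 1
```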
It's just a good way of getting a feel for the data, which is always very helpful to have. Then finally, let's have a look at the columns: df.columns is just a quick way to see the columns we have, so we can quickly see we have PassengerId, Survived, Pclass, Name, Sex, Age, etcetera. So these are very useful methods for taking a quick look at the data. We looked at the shape and the size to get a feel for how much data we have; we used describe to get a feel for the numerical parts of the data; we used .info() to see where we're missing data; we used .sample() to have a quick look at a row; and there was df.columns to see what information we have for each data point. I definitely recommend you have a go at this yourself, take a look at the data, and I'll see you in the next lecture. 7. Accessing columns: Okay. Now, before we get stuck into the real nitty-gritty of data science, let's take a look at how we can access any one of these columns. Just to remind you, this is our DataFrame that we loaded in; these are the rows down here, and across the top are the columns: this one's Name, this one's Sex, this one's Age. The way to access these is quite easy. Let's first run df.head(), or even better df.sample(), so we just have one row and can see the layout. Say we want to access the Sex column: we write df, then in square brackets we put the name of the column, so here we'd put 'Sex', and this returns the whole column. Obviously it doesn't want to show all 891 values, so it just puts dot, dot, dot in the middle. Similarly we can just do df['Fare'], and now, if we wanted to, for example, we could change all the fares to 0.
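The inspection and column-access moves from this section look roughly like this on a hypothetical three-row frame (the names and fares are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
    "Fare": [7.25, 71.28, 8.05],
})

print(df.sample())        # one random row, a quick feel for the data
print(list(df.columns))   # ['Name', 'Sex', 'Fare']

sex_column = df["Sex"]    # access a whole column with square brackets
df["Fare"] = 0            # overwrite every value in a column
del df["Name"]            # delete a column entirely
```

Accessing with `df["..."]`, assigning to it, and deleting with `del` are exactly the three operations the lecture walks through next.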
Obviously we don't want to, but I'm just showing you that once we can access a column, we can start to change things within it. After that we can look at the DataFrame again, or df.sample(), and you can see that all of the fares are now zero. Just to check that it really works, we can do the same to Cabin, where you can see there are still some values; run it, and now Cabin is all zeros too. We can delete a column quite easily just by putting del first. Say we want to delete the name: we write del df, then in square brackets the name of the column. Now you can see that Name has been removed. So the take-away message from this lecture: we already knew from the previous lecture how to load in the data, and we knew how to view it with just df; now we can access a column by writing the name of the DataFrame and then, in square brackets, the name of the column. We can access just one column, for example df['Cabin']; we can set all of a column's values to zero, say df['Embarked'] = 0; and we can delete a column. So these are some of the things we can start to do with the columns in the DataFrame. Have a go yourself at accessing the columns, and I'll see you in the next lecture. 8. Basic visualisation - count factor: We're now going to take a quick look at different ways we can visualize our data, using the Titanic dataset as our example. The first thing I'm going to do is import the standard things: pandas as pd, and I also want to import Seaborn, which is something we've looked at before. If you haven't got it installed yet, you just need to run, in your command line, pip install seaborn.
Or, if you're not using pip and you're using Anaconda, it's conda install seaborn. We're going to import that as sns, just because it's easier to write, and then of course matplotlib.pyplot as plt. Then we want to create our DataFrame: df = pd.read_csv('train.csv'). Okay, let's just make sure that works. It does. The first thing I want to do is show you how crosstab works. Crosstab is built into pandas, and it's just a really easy way to look at some of the data, to compare two different variables. First I want to see the differences in survival rates between male and female, so I'm going to create the crosstab: pd for pandas, then .crosstab. Essentially all I need to do is pass in df['Sex'], the column that says whether someone is male or female, and then we want to compare that with Survived. As you can see, we can then compare against Survived: when it's zero, that means they didn't survive; one means they did. You can see that 233 females survived, while only 109 males survived. So already, just from this basic crosstab, we're able to compare these two variables, the sex of the person and whether they survived, and take a look at whether there's an effect between the two. Now let's add another bit of code underneath. I'll check the columns with df.columns so we can compare something else instead of sex and survival. Let's look at Age, although that's not categorical, so we're going to see a larger range. As you can see, it's better to stick to variables that are categorical, like male/female, rather than something continuous, where we have a lot of different values.
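The crosstab call just described is pure pandas, so it can be sketched without the plotting libraries (hypothetical five-row frame; on the real train.csv the female/survived cell is 233):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 0],
})

# Rows are Sex, columns are Survived: 0 = didn't survive, 1 = survived
table = pd.crosstab(df["Sex"], df["Survived"])
print(table)
```

Each cell of the table is simply the number of passengers with that combination of the two variables.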
Sorry, let's try something different. Let's go for Pclass, the passenger class. We go back up here and change it to Pclass, and that's a bit more readable now. You can see that for those in the upper class, the largest proportion of people survived: if you're in first class, you're more likely to survive, whereas if you're in third class, you're much more likely not to survive. So crosstab is a really simple way of looking at two variables interacting. Now let's say we want to visualize this. We can use Seaborn to make something called a count plot. I'm going to create a variable called ax, equal to sns.countplot, and within here we want our x-axis to be 'Sex', and hue equal to 'Survived'; I'll explain all of these once I've written them out. We have palette='Set1', which is just the colour scheme we want, and then the data is obviously our DataFrame, which we've called df. After that we just want to set the title, the x label, the y label, that kind of stuff. Having created the variable ax, I can do ax.set and give it a title, let's call it 'Survivors according to sex', an x label of 'Sex' and a y label of 'Total'. Then we need plt.show(), which brings in matplotlib. If we run this, you can see we can now compare things a bit more easily, because we can actually visualize them: these are the people who didn't survive, in red, and those who did survive, in blue. It's just a bit easier to see, based on their sex, whether they were more or less likely to survive. So using Seaborn in this way is very helpful for visualizing things. Next up we want to look at something called a factor plot.
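A sketch of the count plot idea follows. The bar heights are just group sizes, so the data side is computed with pandas; the seaborn call mirrors the lecture's parameters and is wrapped so the sketch still runs if seaborn or matplotlib isn't installed (the five rows are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 0],
})

# The bars of the count plot are exactly these group sizes
counts = df.groupby(["Sex", "Survived"]).size()
print(counts)

try:
    # Plotting side, as in the lecture; skipped cleanly if the
    # plotting libraries are not available.
    import seaborn as sns
    import matplotlib.pyplot as plt
    ax = sns.countplot(x="Sex", hue="Survived", palette="Set1", data=df)
    ax.set(title="Survivors according to sex", xlabel="Sex", ylabel="Total")
    plt.show()
except ImportError:
    pass
```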
A factor plot allows us to do a similar visualization, but this time looking at more than just two variables. We're going to do sns.factorplot, and in here let's say we want to look at the passenger class, compare that with how many survived, and also add in sex as the hue. Then we obviously need to say what our data is, which is df, and define what size we want the graph to be, so I gave it aspect 0.9 and size 3.5, which makes it a good size. If I run this now, you can see we can visualize the passenger class, look at the survival rate for each passenger class, and also split by male and female. Here we can see really nicely that between first and third class, whether you're male or female, you're more likely to survive in first class than in third. But you can also see, and in this visualization it really stands out, that if you're female, even in third class, you're more likely to survive than a male in first class. Very interesting stuff. So sns.factorplot is really helpful when you want to look at three variables rather than just two, as above. Let's look at one more of these. Let's add in Embarked instead of the passenger class and write it all out together: sns.factorplot, then we put in the x, which is now 'Embarked'; y is going to be 'Survived'; the hue is going to be 'Sex'; the data is again the DataFrame we created at the start; and then an aspect of 0.9 and a size of 3.5. As you can see, we can now compare Embarked as well, and there's something to notice for the people who embarked from C.
They were actually more likely to have survived, both male and female, and you can see it's relatively the same between male and female, although there's a larger drop for males if they embarked from Q. So that's it for this lecture; let's just revisit what we've talked about. First we looked at crosstab, where we could compare two different variables, for example the passenger class and whether they survived, and very easily look at the data and come to a few basic conclusions. Then we looked at how we can do a similar thing with a nice visualization, using sns.countplot, and don't worry, of course I'll be providing you with the script for this so you can run it yourself. Then we wanted to look at three variables, so we used sns.factorplot. So now I've introduced you to how we can build visualizations not only using pandas and matplotlib, but also using something called Seaborn. Have a go at this yourself; definitely play around with it and get used to using it, because it's so helpful to be able to visualize your data when you're exploring it, and especially when you want to communicate your findings to other people. Have a play, and I'll see you in the next lecture. 9. The four Cs of data science: We're soon going to be doing our first practical data science project, so it's important for us to go over what I like to call the four Cs of data science. What are the four Cs? We have correcting, completing, creating and converting. Let's go over those one at a time. Correcting is essentially just correcting any outlying data that seems to be incorrect and could throw off our models. Say, for example, an age of 80: okay, that might be normal. But if an age is 800,
then that might be quite suspicious to a human, and we'd want to remove outlying data like that. Completing is for when we have null values, like we have here. For example, if you have a missing age, you can replace it with a specific age. Some data scientists might look at the profession of the person, which could give an indication of a more likely age; or you can simply fill it in using the median, the middle value of the ages in your dataset, and just plug that in instead. Either way, it's really important to fill in your null values. Creating is feature engineering, which we've talked a bit about. This is where you use existing features to create new features that could be more helpful to our machine learning models. For example, if we had titles, say Duke, Duchess, Mr, Mrs, Miss, then we could create a new feature for sex: Mr would be male, and so on. This not only helps us create features that add to the predictive power of our model, but it sometimes also helps to simplify our variables into fewer categories. Converting: with this one, what's really important is that we need to convert all of our information into numbers. Using the example of male and female, we need to convert those into zero and one. Something like age is already numeric, so that's absolutely fine, but we always need to convert any text information into categorical numbers that can then be plugged into our machine learning models. Some things can be very difficult.
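Two of these steps, correcting and converting, can each be sketched in a pandas one-liner. This hypothetical mini-frame contains a deliberately implausible age of 800, and the 120 cutoff is an assumed threshold, not something fixed by the course:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 800.0, 26.0],  # 800 is clearly a data-entry error
})

# Correcting: drop rows whose age is implausible
df = df[df["Age"] < 120].copy()

# Converting: turn the text categories into numbers
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
print(df)
```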
For example, if we had names like John, Joe, Elizabeth, there are so many different names that it's difficult to convert all of them into meaningful numbers, so you would often just drop the names. Or we could look at whether we can use the titles, like Mr, Miss, Duke, Sir, whatever it might be; that's one option. Or we could see if there's any correlation between commonly occurring names and whatever outcome we're looking at. So converting is basically taking any information and making sure it's in numbers. Up next, let's put these four Cs into practice. 10. Variable types: Now, before we get into the nitty-gritty of our data and do some real data cleaning and data science, let's take a look at variable types, because it's very important for you to understand and appreciate the different variable types that exist. We're going to run through a few of the basics: dependent and independent variables, and then two different data types, categorical and numerical. These are the very basic data types we'll be looking at, and they'll help you become a better communicator in data science. Dependent versus independent is relatively straightforward. Essentially, the dependent variable is the thing we're testing. For example, in the Titanic dataset we'll be looking at, it's survival, whether someone survives or not. For a house price predictor, it's the house price. The dependent variable is the one that really matters to us, the thing we want to be able to predict in the future. The independent variables are all the other variables involved that might affect our dependent variable.
For example, in the Titanic that might be the age of the person, their class (whether they bought first, second or third class tickets), or whether they're male or female. For house prices, a few good examples are the number of rooms, the location of the house, whether there's a pool in the back, or the age of the house. Finally, we could look at sales forecasts: for example, how many ad buys we get every month, where the things that might affect it could be the time of year, the website traffic, things like that. So just pause the video here for a second and think of a separate example, something that's not one of the three I've put here, and think about, if you created a machine learning model, what the dependent variable might be and what the independent variables around it might be. Just pause the video and have a quick go at that now. So now that we've gone over dependent and independent variables, let's look at the two data types, categorical and numerical. Let's first take a look at categorical. Categorical is kind of in the name: the data are put into categories of some type. Within that, we have two kinds, nominal and ordinal, that fit into the categorical type. Nominal is anything that fits a category where the order isn't very important: male or female, zero or one, or apple, pear or mango; categories where the order doesn't really matter. Ordinal is where we have categories but the order is quite important. For example, educational background (high school, graduate or postgraduate), or extra small, small, medium, large and extra large for clothes. Or you might have categories for how well someone's doing in their class, or score bands in a computer game.
For example, people who score between 0 and 100, 100 and 200, or 200 and 300: those are categories where the order is actually important. What's really important to take away from this slide is that categorical data is basically any data that fits into categories. So pause the video now, think of examples of categorical data, and try to create one example for nominal and one for ordinal. Paused? Now that you've got that, let's go over the other data type: numerical. We have two different kinds here, discrete and continuous. Numerical, as it says in the name, is data that isn't in categories; it's just numbers. Discrete numerical values might be a count of things: the number of people in a room, or the result of rolling a die; anything in whole numbers. Another could be the age of someone in years. Continuous numbers are ones that aren't whole numbers; they can be any value within a range. For example, the height of a person, which could be 1.65 metres, or the length of a leaf, or the speed of a car. So you can think of numerical data as being anything that's pretty much just a number, either whole numbers, which are discrete, or non-whole numbers, which are continuous. Pause the video now and think of one example of numerical data that's discrete and one that's continuous. Once you've done that, you have the basics of the data types we'll be looking at. Well done for going through this and coming up with some of your own examples, and I'll see you in the next lecture. 11. PyTanic Creating: We're now going to look into what I think is the most exciting part of data science, and that's feature engineering. That's where you take existing data
that may not be very useful in terms of our exploratory data analysis or in terms of applying it to machine learning, and make it useful in some kind of way. Whether that's for exploratory data analysis, so you can visualize things better, or for training a machine learning model, feature engineering can be incredibly helpful. Now, for machine learning, what's really important is what happens if you have categories as text. For example, let's say we had a column called Profession, and under it we had plumber, banker, whatever (it doesn't matter), but let's say we had 100 of those. In order to put that through a machine learning algorithm, these have to be turned into numbers: plumber would be 0, for example, banker would be 1, the next would be 2. I'll be going over this in a lot more detail in the next lecture. But in our dataset there are some parts which will be relatively difficult for us to turn into numbers in a way that's actually useful, and the main one I'm talking about is our Name column: the actual names of the people who were on the Titanic. Let's take a quick look at that: df['Name'].sample(20). Here we have a number of names. As you can imagine, every single person on the Titanic is going to have a different name, so if we were to convert them into numbers, it would be 0, 1, 2, 3, 4 and so on, for as many names as there are, and that's not going to be helpful at all. Now, pause the video here, take a look at these names, and think: is there anything in there that could be useful to us in terms of training a machine learning model or exploring the data in more depth? I hope you've taken a quick look. What I notice is that you have all these titles: you have Miss, you have Captain,
you have Mr, you have Miss again, you have Dr. Let's see if there are any other ones in there; we have quite a large range here, and if we sampled 200, I'm sure we'd see even more. So what I want to do in this lecture is show you how we can feature-engineer these names: extract the titles, and get them ready to be turned into numbers in the next lecture, when we're converting everything. So let's have a go. The first thing I'm going to do is create a function where we can get the titles out of the names. I'll create a function called get_title, and obviously the input is going to be the name of the person. What I've noticed is that, if you think about it, every title in a name is followed by a full stop. So the first thing is a condition: if there's a full stop somewhere in the name, that is, if '.' is in the name, just to make sure, because otherwise we don't really want to do what comes next. Next, we're going to split up the name and strip it of whitespace, so from this function I want to return the name, split. We can take this one step at a time. First, we split on the comma: this would be the first part and this would be the second part, so index zero and index one. When you split something, it splits wherever that character occurs, so right here it's going to split the name into two parts, and we want to take the second part, which is index one. For this name here, we're splitting on the comma,
so we'll have everything after the comma: this whole bit here. Then we want to split again, of course, because we just want the 'Miss' part. We don't want anything after the full stop, and we don't want the full stop itself either. So we do another split, and this time we separate on the full stop and take the first part, which is index zero. What we're left with now is essentially the title with some blank space around it, so the last thing we need is .strip(), which removes all the whitespace. And then we can have an else returning something like 'No title in name', which I'm sure isn't going to come up. So there we go: we have a nice function. Now, using the power of functional programming, we can get ourselves a collection of all the different titles we might have. At this point we only really want to see which titles exist, so we don't want a list that contains all the Misses and all the Misters in order; we just want one of each. So we're going to use a set instead of a list. We went over sets and what they are in the Python section, but just to remind you: say we have a list with three occurrences of 'Miss'; if we cast that into a set, we'll have one occurrence, because a set only holds unique values. We'll store this in a variable called titles and create the set. Here we want to use some functional programming, and it may seem a bit long, but essentially we're building a nice collection: 'x for x in' is list comprehension, and if any of these things feel unsure, definitely go back to those lectures.
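The splitting logic just described can be sketched as follows. The example names are written in the dataset's "Surname, Title. First names" pattern, but they are illustrative here rather than taken from the file:

```python
import pandas as pd

def get_title(name):
    # "Braund, Mr. Owen Harris" -> take what's after the comma,
    # then what's before the first full stop, then strip the whitespace
    if "." in name:
        return name.split(",")[1].split(".")[0].strip()
    return "No title in name"

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley",
])

# A set keeps only one occurrence of each title
titles = set(names.map(lambda x: get_title(x)))
print(titles)  # {'Mr', 'Miss', 'Mrs'} in some order
```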
Sets and comprehensions are all covered in the Python section. So: x for x in, and we want df['Name'], the Name column. We could write this a couple of different ways, but for this one I'm just using df['Name'] directly. Then we want to map this, so we use a lambda, pass in x, and apply our get_title function to it. This should work nicely. If I run this and print out titles, you can see these are all the different titles we have: Mrs, Master, the Countess, Don, Major, Dr, Jonkheer, Col, and quite a few others. That's what we were able to extract, which is brilliant; that's the first step. What we've done is create a function that extracts the title, and then, using functional programming, we took the names in the DataFrame, mapped them into a set, applying our get_title function to each name, and that's what we've got in this variable titles. Very, very helpful. The next thing: this is still actually quite a lot of different titles to deal with, so I think we should condense them down even further. I'm going to create a function now called short_title (it's an extra effort to write out a longer name every time), which takes a title. First of all, I'm going to take things like Col, Capt and Major, because those seem to be officer-like titles. So I can say: if the title is one of these, that is, is in this list, for example Capt, Col or Major, then return 'Officer'. Then, else if the title is in the royalty group; there seem to be quite a lot of those. For example, we've got Jonkheer, and we've also got Don.
We've also got the Countess (yes, that's a royalty title), and of course Dona, then Lady and Sir. Those all count as royalty, so I'm going to return 'Royalty'. And don't mind that you'll sometimes see me mix single and double quotes; that's just me not paying too much attention, and it's usually best to stick to one, of course. Then we've got a few others. For Mme we'll return 'Mrs'. I was going to include Lady here too, but we've already used Lady in royalty, as I realised. Then we have Mlle and Ms, which we'll put in the Miss group: else if the title is in Mlle or Ms, return 'Miss'. And finally, for everything else, we just return the title as it is. Okay, fantastic. I'll save this as short_title and run it to make sure the syntax is right; ah, a missing bracket there; okay, that seems to be working fine. Now what we want to do is actually start applying all this, so let's create a new column, which we'll call Title. This is where we again use a bit of functional programming; functional programming can be so helpful in data science. We're going to use map: we take the names from df['Name'], map with a lambda, and apply get_title to each. What this is going to do, essentially, is create a new column that will contain only the titles extracted from the names.
So that's the first thing; let's just run that. Now what we want to do is apply our short_title function onto this new column. We can do df['Title'] with the really helpful .apply, which basically means apply this function to all of the values in this column, so we pass in short_title (I also specify the axis here, which just means go down the column rather than along a row). Okay, fantastic. Next, let's have a look: print out df['Title'] and then what we used in a previous lecture, .value_counts(). We've now been able to condense the number of titles down to just this handful, rather than the long list we had before. So now we have both columns, one for Name and one for Title. And now that we've been able to extract that feature, we don't actually need to keep the Name column at all, because we're probably not going to be using it, so I'm just going to remove it. Before, we used del, which you can totally do: del df['Name']. But I'll show you another way of deleting a column, which is a bit longer; it's always good to show different options. We use df.drop, and this is used quite often, so if you ever see it in code you won't be confused. You say df.drop, then the name of the column, and then you say you want to drop a whole column with axis=1. Then, just like before, inplace=True, which basically means save this change. There we go. So now, in our DataFrame (this is a sample of 20), you can see we no longer have the name, but we have the title: Mr, Mr, Mr, Master, Officer. So we've been able to successfully extract a feature from our data. So hopefully you've been able to do this yourself.
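Putting the whole feature-engineering pipeline from this lecture together, a sketch looks like this. The three names are illustrative examples in the dataset's format, and the royalty/officer groupings follow the lecture's choices:

```python
import pandas as pd

def get_title(name):
    # Title sits between the comma and the first full stop
    if "." in name:
        return name.split(",")[1].split(".")[0].strip()
    return "No title in name"

def short_title(title):
    # Condense the long tail of titles into a few groups
    if title in ("Capt", "Col", "Major"):
        return "Officer"
    if title in ("Jonkheer", "Don", "Dona", "the Countess", "Lady", "Sir"):
        return "Royalty"
    if title == "Mme":
        return "Mrs"
    if title in ("Mlle", "Ms"):
        return "Miss"
    return title

df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Simonius-Blumer, Col. Oberst Alfons",
    "Sagesser, Mlle. Emma",
]})

df["Title"] = df["Name"].map(lambda x: get_title(x))  # extract raw titles
df["Title"] = df["Title"].apply(short_title)          # condense them
df.drop("Name", axis=1, inplace=True)                 # same effect as del df["Name"]
print(df["Title"].value_counts())
```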
If not, have a look at the script, try to recreate it yourself, have a play with it, and then I'll see you in the next lecture. 12. PyTanic Completing: Now, this lecture is especially important for cleaning data, because you're probably going to come across this quite a lot whenever you're dealing with real-world data: null values. A null value is basically when someone hasn't input any information for a specific field. What do I mean by that? Let's say, for example, someone is filling in a form and they put in their first name and their last name, but they don't put anything in for their age. That would end up as a null value, because there's just no value there. You're going to come across this a lot, so it's important that we're able not only to locate where the null values are, but also to fill them with a correct or suitable value. So let's get stuck in. I've just imported pandas here at the top, and we're looking at the same data frame, calling it df as always. The first thing I want to do is see how many null values we have in each column, because if we have loads of null values in one column we may just want to drop that column completely — it may not have enough information for us to work with — or there may be no nulls at all in a column, in which case we don't have to worry about it. Either way, let's take a look. I'm going to use a for loop for this: I'm going to loop through all the columns in the data frame and see which ones have any nulls. For the loop variable I'll use the word column, because that makes sense in our data frame. I want to print out a couple of things. First of all, let's print out what the column name is — so we just need column, and after that we'll add a colon. Then we want to put in the total number of null values, so for this we write df and then the name of the column, whatever it may be.
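The null-counting loop described above, sketched with a toy data frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0],
                   'Embarked': ['S', 'C', np.nan],
                   'Survived': [0, 1, 1]})

# Print the number of null values in every column
for column in df.columns:
    print(column, ':', df[column].isnull().sum())
# Age : 1
# Embarked : 1
# Survived : 0
```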
After that, we put .isnull(), and attached to that we want .sum(), which gives us the sum of all the null values within the column. If I run this, we get this nice output showing how many null values we have in each of our columns. You can see we have quite a lot for Age — but not too many, so that's fine. We have lots for Cabin, though. If we really, really wanted to spend a lot of time with this data, we could look at different ways to still extract useful information from the Cabin data, but with that many values missing we're going to drop that column completely. Now, for Age it's going to be relatively simple: we're basically just going to use the median age of the whole group. Median — m-e-d-i-a-n, you may have come across it before — is the middle value of the ages when they're sorted, and it's a nice, easy choice. Then for Embarked: we don't have numbers there, we have letters, so instead we're going to look for the most commonly occurring letter and replace those two null values with that one. First of all, let's fill in the ages. We take df['Age'], and then all we have to do is use the lovely function called fillna, which basically means "fill the null values". So within our Age column, what do we want to fill the null values with? We want to fill in the median. We've done this before, so it's just df['Age'].median() inside the brackets. Then we need to specify inplace=True. What that means is that we want to save the new values — we want to overwrite those nulls with the median — so we include inplace=True. I'll do this now, and then we'll run the loop above again.
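Filling Age with the median, as described above. A small sketch — note that newer pandas versions prefer assigning the result back over calling inplace=True on a single column, so that's what's shown here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})

# The lecture uses df['Age'].fillna(..., inplace=True); assigning the
# result back does the same job and avoids chained-assignment warnings.
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df['Age'].tolist())  # → [20.0, 30.0, 30.0, 40.0]
```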
You can see we've completely removed the problem for Age. Fantastic. The next thing we want to do is essentially the same thing for Embarked, but first we want to see which value occurs most often. For that, I'm going to print out the data frame's Embarked column and call .value_counts() on it, which essentially gives us the number of occurrences of each of the different values. As you can see, the most commonly occurring value is 'S', so we're just going to replace the nulls with 'S'. We do something very similar to what we did above: df['Embarked'].fillna('S'), and of course we want to state inplace=True. If we run this and then rerun the cell above, Embarked should be completely clean. For Cabin, we're going to delete the whole column, and that is incredibly simple: all we have to do is write del and then the name of the column — del df['Cabin']. Run this again and you can see it's been completely removed. So in this lecture we've gone over how to complete the data set, or complete all our null values. First we looked for the columns that had null values, and the loop told us how many there were in each. For Age it was relatively simple: we used the fillna function, filled with the median age, and added inplace=True to make sure the change was saved within the column. We did almost exactly the same thing for Embarked, except, because it isn't a number, we used .value_counts() to see which value was the most regularly occurring. And finally, because Cabin had so many null values, we just deleted the whole column using del. I hope that makes sense — have a go through this yourself.
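The Embarked fill and the Cabin deletion might look like this (a sketch with toy data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'S', np.nan],
                   'Cabin': [np.nan, 'C85', np.nan, np.nan]})

# value_counts shows 'S' is the most common port of embarkation
print(df['Embarked'].value_counts())

# Fill the missing port with the most common value
df['Embarked'] = df['Embarked'].fillna('S')

# Cabin has too many nulls to be useful, so delete the whole column
del df['Cabin']
print(df.columns.tolist())  # → ['Embarked']
```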
If you like, you could also take a closer look at the range of ages, and for Embarked you could look at those two rows that had nulls — can you see the information for those two data points and find a better way of filling them than just using 'S'? But this is a good standard way to fill in your nulls, so have a go, have a play yourself, and then I'll see you in the next lecture. 13. Pytanic Converting: So now that we've gone through the data set, we've done some initial exploratory analysis, we've been able to fill the null values, we've been able to remove the outlying data, and we've also been able to do some feature engineering, it's time for us to convert all of our information into numbers. That's essentially so we can plug this into a machine learning model. If we just bring up a sample of the data here — you may notice some null values, which is just down to which of the previous lecture's scripts I've run — you can see we have the Title column in here, and we have everything we need in order to do the converting right now. What we need to do is, for example, convert female and male into 0 and 1, because machine learning models need everything to be in the format of numbers in order to actually do all of their calculations and run their algorithms. So everything needs to be turned into numbers. We need to convert Sex (male/female), Embarked, and Title. Ticket we're just going to drop, because we're not going to get much useful information from that — so I'm just going to run this whole thing again here to remove the Ticket column. Now we need to convert Sex, Embarked and Title into values, and it's very simple to do. For example, we access the Sex column and use the .replace() method.
Inside replace, we first put in the existing values that are there — 'male' and 'female' — and then after that we put in what we want them to turn into, just 0 and 1: if it's male, 0; if it's female, 1. Then we basically want inplace=True, as we've done many times before, just to say we want to save these changes. Then we do the exact same thing for Embarked. Because we might not remember what the values are, we can just run df['Embarked'].value_counts() to see the different values: we have S, C and Q, so we put in 'S', 'C' and 'Q'. Now, because this is all categorical information, it doesn't matter which one maps to which number — it doesn't matter if S is 0 or C is 0; that's not a huge deal. What's important is that the machine learning model can differentiate between the different categories, so it's not a problem if you order them in a different way. So we'll have 0, 1 and 2 for this one. Finally, we want to do the titles, so we change Embarked to Title here and do the same thing again. Let's plug the values in — to make things simple I'll copy and paste this a few times. So we've got Mr, Miss, Mrs, Master, Dr, Rev, Officer and Royalty (if I can spell it this time), and then we just want to put in the numbers: 1, 2, 3, 4, 5, 6, 7, 8. Let's just check — 1, 2, 3, 4, 5, 6, 7, 8 — okay, perfect. So now we've essentially been able to take these different categories and change them into numbers, which makes the machine learning model's task a lot easier. And it's as simple as that.
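The three replace calls might be sketched like this. It's toy data, and the exact title→number mapping order is my assumption — as the lecture says, only the distinctness of the codes matters:

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['male', 'female'],
                   'Embarked': ['S', 'C'],
                   'Title': ['Mr', 'Mrs']})

# Categorical → numeric; which category gets which number doesn't
# matter, only that the codes are distinct.
df['Sex'] = df['Sex'].replace(['male', 'female'], [0, 1])
df['Embarked'] = df['Embarked'].replace(['S', 'C', 'Q'], [0, 1, 2])
df['Title'] = df['Title'].replace(
    ['Mr', 'Miss', 'Mrs', 'Master', 'Dr', 'Rev', 'Officer', 'Royalty'],
    [1, 2, 3, 4, 5, 6, 7, 8])
print(df.values.tolist())  # → [[0, 0, 1], [1, 1, 3]]
```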
All you need to do is use the .replace() method to change your categorical values into numbers. So have a go at that, and I'll see you in the next lecture. 14. PyTanic Correcting: Let's get started now with cleaning up the data. To begin with, we're going to start by correcting the data — that's essentially looking for outliers. If we just take a look at the head of the data — I'll display it rather than print it, since that looks nicer — we can think about where things might be outliers. Most likely we're going to find them under Age, probably under Fare as well, and possibly under siblings and spouses. Let's get started by looking at Fare. What we're going to do here is create a graph, which I'll call graph, and we're going to use seaborn. This time we're using something called a FacetGrid, so let's put that in. What we need to pass in, essentially, is the data frame we want to look at, which is just df as we've called it, and also the column we want to facet on, so col='Survived'. Then we just want to map a plot onto it: we take the name of our graph and call graph.map() with a histogram, plt.hist, looking at the Fare column, with bins=20. All that bins means, essentially, is how many little boxes you split the data up into. That's a pretty good facet grid. Just so you know exactly what bins means: if I decrease bins to, say, 2, you'll see the data gets split into just two sections, so it's not very descriptive of the data.
Twenty bins seems to work quite well, so let's stick with that — because there we can see that among the ones who survived, there's someone who apparently spent around five hundred pounds, and back in those days that was a lot of money. It's highly unlikely that one person would have paid that much. We could do a bit more research into how much someone might really have spent, but for the purposes of practising correcting, let's just consider this an outlier straight away. So essentially, what do we want to do? We could just drop that bit of information completely — say, if there's anything over 400, delete it from our data — but we want to preserve the data as much as possible. So for a few outliers like this, what we can do instead is reduce the value: we could cap it at the maximum value below the outlier, which seems to be around 300, or we could just replace it with the median fare in general. What I'm going to do now is change it to the median fare — the median being the middle value of the fares when they're sorted. To replace an existing data point, we use something called df.loc, which locates data given a conditional. What do I mean by this? Let me show you. I want to locate, within the Fare column, any values that are over 400 — let's say 400 is our cut-off point for how much someone would plausibly have spent on a ticket. Then we just need to put in the name of the column again, 'Fare', and we basically want to replace it: wherever the condition is true, we set it equal to the median value, which is df['Fare'].median() with the parentheses. So now when I run this, we're changing it to the median.
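The df.loc replacement described above, as a sketch with a few made-up fares:

```python
import pandas as pd

df = pd.DataFrame({'Fare': [7.25, 71.28, 512.33, 8.05]})

# Locate fares over the 400 cut-off and replace them with the median.
# The right-hand side is evaluated first, so the median is computed
# over the original data, outlier included.
df.loc[df['Fare'] > 400, 'Fare'] = df['Fare'].median()
print(df['Fare'].max())  # the 512.33 outlier is gone
```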
Actually, just to make it really illustrative, let's change this: instead of the median, let's set it to 1000 or something, just so you can see it on the graph up here. Okay — what this has done is locate the one instance where the fare is over 400 and change it to 1000, and when we plot this again you can see there's now one value lying at 1000, so that's worked successfully. Let's change it back to the median fare, and there you go — we've now been able to correct that value within the data. Really make a note of df.loc, because it's incredibly helpful when you're trying to find things like this and make corrections. Next, let's take a look at Age and see if there are any problems there that we want to correct. We're going to do the exact same thing again, just as good practice really. I'll call this one graph_age: sns.FacetGrid, stating the data frame, and the column we want is Survived, so col='Survived' — that sets up the two graphs that are going to appear. Then graph_age.map(), and it's a histogram, plt.hist, looking at the Age column, with bins=20. When I run this, you can again see that there is one outlier, which is 80. Now, that isn't actually all that old — it's quite likely there was someone of that age — but let's pretend it was 800, just so we can practise; we want to change it to something like 70. And let's say there are a few occurrences of this, so we do the same thing we did above: df.loc to locate anything within the data frame's Age column that's over, let's say, 70. Once we've located that, within the Age column we want to change the value — let's just say we'll change it to 70.
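And the same df.loc pattern for Age, sketched with toy data:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22.0, 80.0, 35.0]})

# Cap any age over 70 down to 70
df.loc[df['Age'] > 70, 'Age'] = 70.0
print(df['Age'].tolist())  # → [22.0, 70.0, 35.0]
```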
When we run this and look back at the histogram, we can see that the 80 has been reduced. That's the basics of correcting data. Essentially, what you want to do is look over your overall data frame and take a look at where you think there might be outliers, and then start to drill down. By using the seaborn FacetGrid you can quickly visualise where there's outstanding data — where your big outliers are. You can then use a conditional like df.loc: if a value is over a certain number, replace it with the median, or replace it with the largest plausible value outside of the outliers. So that's what we've done, first for Fare and then again for Age. What I recommend is that you go through the data frame yourself, have a go at this, do some corrections yourself, and then I'll see you in the next lecture. 15. Titanic data set data science recap: Now, before we go into the machine learning side and apply it to our Titanic data set, I just wanted to condense everything we've done into a single small script. It's attached to this lecture, and you're able to use it as well, so there's no confusion. I've just added in the basic things that we've applied to the Titanic data set. Here in this part we've created the functions that not only get the titles from the names in the Name column of the Titanic data set, but also replace them with our shortened groups — this is our feature engineering. Then below, we apply that, and we also fill in the null values here. We also drop some of the columns that we no longer needed — Cabin, Ticket and Name — and then we replace the categorical values with their numerical equivalents. So it's attached to this lecture; I just wanted to make it clear that you can access it yourself when we start to apply it in the machine learning models later on in the course
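Since the attached script itself isn't reproduced in the transcript, here is a hedged sketch of what a consolidated cleaning script covering these lectures might look like. The column names match the standard Kaggle Titanic CSV, but every function name and detail here is my reconstruction, not the attached file:

```python
import pandas as pd

def get_title(name):
    # "Braund, Mr. Owen Harris" → "Mr"
    return name.split(',')[1].split('.')[0].strip()

def shorten_titles(title):
    # Condense the rare titles into grouped categories
    if title in ('Capt', 'Col', 'Major'):
        return 'Officer'
    if title in ('Jonkheer', 'Don', 'Dona', 'the Countess', 'Lady', 'Sir'):
        return 'Royalty'
    if title == 'Mme':
        return 'Mrs'
    if title in ('Mlle', 'Ms'):
        return 'Miss'
    return title

def clean(df):
    # Feature engineering: extract and condense titles
    df['Title'] = df['Name'].map(get_title).apply(shorten_titles)
    # Completing: fill the null values
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna('S')
    # Correcting: cap the fare outliers at the median
    df.loc[df['Fare'] > 400, 'Fare'] = df['Fare'].median()
    # Drop columns we no longer need
    df = df.drop(['Cabin', 'Ticket', 'Name'], axis=1)
    # Converting: categorical values → numbers
    df['Sex'] = df['Sex'].replace(['male', 'female'], [0, 1])
    df['Embarked'] = df['Embarked'].replace(['S', 'C', 'Q'], [0, 1, 2])
    df['Title'] = df['Title'].replace(
        ['Mr', 'Miss', 'Mrs', 'Master', 'Dr', 'Rev', 'Officer', 'Royalty'],
        [1, 2, 3, 4, 5, 6, 7, 8])
    return df

# Try it on a one-row stand-in for the real CSV
sample = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris'],
                       'Sex': ['male'], 'Age': [22.0], 'Fare': [7.25],
                       'Embarked': ['S'], 'Cabin': [None],
                       'Ticket': ['A/5 21171']})
out = clean(sample)
print(out.to_dict('records'))
```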