Getting started in machine learning with Python and Scikit-Learn | Paul O'Neill | Skillshare


Lessons in This Class

6 Lessons (26m)
    • 1. Introduction

    • 2. preparing data

    • 3. useful data

    • 4. categorical data

    • 5. machine learning

    • 6. class project






About This Class

This class focuses on a real-life problem: using machine learning to predict whether hotel bookings will be cancelled. This is a beginner-level class; you do not need any previous knowledge of machine learning, but you do need an environment where you can write and run Python code.

Meet Your Teacher


Paul O'Neill


Hello, I'm Paul. I am an artist, cartoonist, teacher and data analyst. I live in Ireland but I've also lived in Japan for a significant portion of my adult life.





1. Introduction: Hi, welcome to this class on getting started with machine learning using Python, pandas and the Jupyter Notebook. My name is Paul. This is a beginner-level class, so you don't need any previous knowledge of machine learning, but you do need an environment set up in which you can run Python and pandas. As I said, I'm using the Jupyter Notebook; if you're not sure how to get all that set up, I have another class on getting started with pandas, Python and the Jupyter Notebook. In this class I'm using a data set on hotel bookings. Each row in the data set is a booking at one of two hotels. Both hotels are in Portugal: one of them is a resort hotel near the beach and the other is a city hotel. So if you were the manager of either of those hotels, how could machine learning help you? Well, one way it might help is by predicting whether a booking is likely to be cancelled. In this class, we're going to write some machine learning code that helps us make that prediction: is this booking likely to be cancelled, yes or no? In machine learning terminology, this is a classification problem, because we have two classes, "is cancelled" and "is not cancelled", and for each booking we have to decide which of those two classes to put it into.

2. preparing data: Okay, so the first thing we need to do is import pandas. We run this line of code and we're just going to call pandas pd; it's easier to type than writing out pandas each time. Then we need to import our data. Our data is in a CSV file called hotel_bookings.csv, and we need to import it into a DataFrame. A DataFrame in pandas is a data structure you can think of as two-dimensional, with rows and columns. When we look at the shape of the data, the number of rows and number of columns, we have around 119,000 rows and 32 columns. We can check the first five rows. Okay, so you can see the first column is the hotel.
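The import-and-load steps just described can be sketched as follows. This is a minimal sketch: the class reads the full hotel_bookings.csv (around 119,000 rows) with pd.read_csv, so a tiny inline CSV with a few of the real column names stands in here to keep the snippet self-contained.

```python
import io

import pandas as pd

# In the class the data is loaded from a file, e.g.:
#   df = pd.read_csv("hotel_bookings.csv")
# A tiny inline sample with a few of the real columns stands in here.
csv_text = """hotel,is_canceled,lead_time,country
Resort Hotel,0,342,PRT
Resort Hotel,1,85,GBR
City Hotel,1,14,PRT
City Hotel,0,7,FRA
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # the first five rows
```

On the real file, `df.shape` would report roughly 119,000 rows and 32 columns rather than this toy sample's four and four.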
In the hotel column there are two values, Resort Hotel and City Hotel. Ultimately, with our machine learning, we're going to try and predict the value in the is_canceled column: zero means the booking was not cancelled, one means it was cancelled. We can look at all of the columns, all 32 of them; these are the names of all the columns. One of the things we have to do in machine learning is decide which of these columns, which of these features, we're going to use in our predictive model. As I said in the first part, the data set covers two hotels, the city hotel and the resort hotel. If we run this line of code, it will tell us how many rows we have for the city hotel and how many rows we have for the resort hotel. Okay, so you can see there are almost twice as many rows for the city hotel as there are for the resort hotel. One of the things we need to look at, as far as data quality for machine learning goes, is nulls. Machine learning doesn't like nulls, so we're going to have to remove them or replace them with something else. The first thing is to find out how many nulls there are. Are they spread evenly among all the different columns, or are there maybe one or two, or three or four, columns with more nulls than the others? This line just creates a new column in our DataFrame holding the total number of nulls for each row. Now we can use the value_counts function again, this time normalizing the results. Okay, so you can see just over 91% of rows have one null, with a smaller share having two nulls; another way of thinking about it, 99% of the rows have at least one null. So if we go back up to where we looked at the first five rows, if 99% of rows have nulls, it shouldn't be too hard to find at least one row here with at least one null. The first set of columns looks okay... and there you can see the agent column has some nulls, and the company column is all nulls, at least for the first five rows.
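Counting nulls per row as described above might look like this; the small DataFrame is a made-up stand-in for the booking data, with agent and company as the null-heavy columns.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows; agent and company carry the nulls,
# as they do in the real booking data.
df = pd.DataFrame({
    "agent":   [9.0, np.nan, 9.0, 9.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "adults":  [2, 2, 1, 3],
})

# New column holding the total number of nulls in each row
df["number_of_nulls"] = df.isnull().sum(axis=1)

# Share of rows with 0, 1, 2... nulls (normalize=True gives fractions)
print(df["number_of_nulls"].value_counts(normalize=True))
```

On the real data this is where the "just over 91% of rows have one null" figure comes from.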
Just looking at those first rows, it seems these two columns, agent and company, could account for at least some of the nulls we're seeing, so let's focus on the company column. We create a variable, put the number of null values in the column into that variable, and then print it out. Okay, so of all our 119,000-and-something rows, 112,593 are null for this column. We can therefore say that the vast majority of rows in this column provide no information; they're just null values. One simple way to deal with that is to drop the column. It's not providing any data that will help us make a prediction, really, so maybe we can just drop it. So, how do you drop columns with pandas? It's pretty straightforward: we say our DataFrame is equal to our DataFrame with all the columns in this list dropped; if it's only one column, just put that column's name. We also need to specify the axis: you can say axis equals "columns" or axis equals 1; in the case of rows it would be axis equals 0. If we look at our columns again: originally we had deposit_type, agent, company and days_in_waiting_list; now we just have deposit_type and days_in_waiting_list, so the company and agent columns have been dropped. Again, if we look at our number of nulls per row now that we've dropped those two columns, we can see that we've fixed a lot of our problems: 99.6% of our rows have zero nulls and only 0.4% have one null. We can find out which columns still have nulls using this line of code and print it out as a list. The children column and the country column seem to have a few nulls left. So 99.6% of our rows are null-free just by dropping the company and agent columns. Another thing we need to consider is that we have two hotels, the resort hotel and the city hotel.
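A sketch of the column-dropping steps, again on a made-up stand-in DataFrame (the exact lines the class uses may differ slightly):

```python
import pandas as pd

# Made-up stand-in for the booking data
df = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel"],
    "agent": [9.0, None],
    "company": [None, None],
    "lead_time": [14, 342],
    "children": [1.0, None],
})

# How many nulls are in one column?
company_null_count = df["company"].isnull().sum()
print(company_null_count)

# Drop columns: axis="columns" (or equivalently axis=1);
# axis=0 would drop rows instead
df = df.drop(["agent", "company"], axis="columns")

# Which columns still contain nulls?
cols_with_nulls = df.columns[df.isnull().any()].tolist()
print(df.columns.tolist())
print(cols_with_nulls)
```

On the real data the first print would show 112,593 nulls in company, and the last would list the leftover null-bearing columns (children and country).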
Are we going to build a single model that works for both of those hotels, or are we going to try instead to have two models, one for the city hotel and one for the resort hotel? If we decide to go with two different models, then what we can do is split our DataFrame. The way to create a kind of subset of your DataFrame is this: we'll make a new DataFrame, call it city_df for our city hotel, and say it's equal to our DataFrame wherever the row meets this condition, namely that the hotel column in the row is equal to "City Hotel"; we select those rows and put them into our new DataFrame. Remember that when you're using these boolean conditions, if you're saying something must be equal to a given string, you need two equals signs. If there's only one, pandas will think you're trying to assign a value and you'll just get an error message. Then we do exactly the same, this time for the resort hotel. So now we have two new DataFrames, one for the city hotel data and one for the resort hotel data. Let's just check them: 79,330 rows, so that matches up with this number. The resort DataFrame should then have 40,060 rows, and that number matches too. We have 31 columns because we dropped two columns, the company and agent columns, but we also added one column holding the number of nulls in each row; that's why we now have 31 columns. Okay, one last thing to look at before we start moving towards machine learning: the number of cancellations in our resort data and our city data, again using the value_counts function and normalizing. You can see 72% of the bookings for the resort hotel were not cancelled and about 28% were cancelled; for the city hotel the number of cancellations was much higher, with about 42% of bookings cancelled. So again, this is something to consider.
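The boolean-condition split into city and resort DataFrames might look like this (synthetic rows, real column names):

```python
import pandas as pd

# Synthetic stand-in rows using the real column names
df = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "is_canceled": [1, 0, 0, 1],
})

# Two equals signs: a comparison, not an assignment
city_df = df[df["hotel"] == "City Hotel"]
resort_df = df[df["hotel"] == "Resort Hotel"]

print(len(city_df), len(resort_df))

# Fraction of cancelled vs. not-cancelled bookings in a subset
print(city_df["is_canceled"].value_counts(normalize=True))
```

On the real data, `len(city_df)` gives 79,330 and `len(resort_df)` gives 40,060, and the normalized value counts show the roughly 42% versus 28% cancellation rates.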
Should we have one model that tries to predict cancellations for both the resort and the city hotel, or should we separate the two cases and have two predictive models? Personally, I'm going to go down the route of having two predictive models. I think the two situations are very different: there's a much higher rate of cancellation at the city hotel than at the resort hotel, which suggests to me that a single model might be too general and might not be accurate enough.

3. useful data: When we're looking at which columns may be useful clues for our predictive models, the country column, for example, may be quite useful. If we look at our original DataFrame and get the value counts for the country column, just the first ten, you can see the majority of rows are for Portugal, so Portuguese people booking one of the hotels. The next highest is the UK, then France, Spain and so on. But then if we look at just the cancelled bookings, we again see Portugal is high; in fact, there's a very high rate of cancellation, with more than 50% of the bookings by Portuguese guests cancelled. For UK guests, the rate of cancellation is much lower. And take, for example, the American guests who made bookings: the USA doesn't even appear in the top ten for number of bookings. So we can say that maybe some nationalities are more likely to cancel a booking than others, and this country column could be quite useful in predicting whether or not a booking will be cancelled.

4. categorical data: Okay, so we've already said that null values are a potential problem for us with machine learning; machine learning just doesn't like null values. Another problem is categorical data. You can only pass numerical data into machine learning; you cannot pass strings. So, some of our columns need work.
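The country-column exploration from lesson 3 can be sketched like this on a made-up sample. The per-country cancellation rate via groupby is an addition of mine, not necessarily how the class computes it:

```python
import pandas as pd

# Made-up sample bookings; only country and is_canceled matter here
df = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "GBR", "GBR", "FRA"],
    "is_canceled": [1, 1, 0, 0, 0, 1],
})

# Top countries by number of bookings
print(df["country"].value_counts().head(10))

# Same ranking, but only for cancelled bookings
cancelled = df[df["is_canceled"] == 1]
print(cancelled["country"].value_counts().head(10))

# Cancellation rate per country (an extra view, assumed not from the class)
rate = df.groupby("country")["is_canceled"].mean()
print(rate)
```

In this toy sample, as in the real data, Portugal (PRT) tops both rankings and has a cancellation rate above 50%.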
If we look at the meal column, for example, using the value_counts function again, you can see there are four categories in this column, and each category is designated with a string, a two-character string. We cannot pass these values into the machine learning, so we have to replace them in some way. The easiest way, the lowest-technology way if you like, is just to create a replacement map manually; there are only four of them. We say, for the meal column: if it has a value of BB, replace it with the number 1; if it's HB, replace it with the number 2; and so on. Then we actually do the replacements and create a new data set, which we'll call city_df_replace. If we look at the value counts again, you can now see that bed and breakfast, self catering, half board and full board have all been replaced by the numbers 1, 2, 3, 4. That's fine when you have a column with just four categories, but what happens if you have a column with, say, 100 or 1,000 categories? Writing the map out manually like this would be incredibly tedious, so we can do it in a more automated way. For the country codes, there are over 100 different country codes, so the manual method is just not going to work. Instead, we run this line of code to generate a map; we'll call it replace_map_country. Each three-character country code is mapped to a number, and then we run this line of code to replace all of our country codes with whatever number is specified in the map. If we look at the value counts again, just the first ten, you can see the countries have now been replaced with numbers. There's one more column we need to work with: arrival_date_month, which gives the months not as the numbers 1, 2, 3, 4 but as January, February and so on. So again, we need to replace those.

5. machine learning: Okay, so now we're going to get onto the actual machine learning.
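The manual and automated replacement maps from lesson 4 might look like this. The name replace_map_country and the enumerate-based construction are my assumptions about the class's exact code:

```python
import pandas as pd

df = pd.DataFrame({
    "meal": ["BB", "HB", "BB", "SC", "FB"],
    "country": ["PRT", "GBR", "PRT", "FRA", "ESP"],
})

# Manual map: fine for a column with only a few categories
meal_map = {"BB": 1, "HB": 2, "SC": 3, "FB": 4}
df_replace = df.copy()
df_replace["meal"] = df_replace["meal"].replace(meal_map)

# Automated map for a column with many categories: number each
# unique value, then replace the strings with those numbers
replace_map_country = {
    code: n for n, code in enumerate(df["country"].unique(), start=1)
}
df_replace["country"] = df_replace["country"].replace(replace_map_country)

print(df_replace)
```

The same automated approach works for arrival_date_month, mapping January, February and so on to 1, 2, etc.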
Using our city_df_replace DataFrame, we're going to drop a bunch of columns. First of all, hotel: we don't need it because we know this is only the city hotel, so it's all going to be "City Hotel"; we can just drop it. These other ones I'm going to drop for the moment because we still have over 30 columns. If we find the machine learning prediction has very low accuracy, we can always add columns back in, but for now I'm just going to drop these, which brings us down to 13 columns. We just need to check our data types: they should all be numerical. If there were any objects in here, we would need to run through the whole categorical-data process again. So these are the columns in our city_df_replace DataFrame. We also want to remove this column, is_canceled, because it's the one we're trying to predict; we're going to use the other 12 columns to make that prediction. We need to import some things here, and then we create our inputs. This warning just says that this module will be deprecated; we should really get into the habit of using a different one, but for now I'll continue using this. So we have our data inputs and our expected output. The expected output is the is_canceled column, and the data inputs are the other 12 columns. It's very easy to create these: the data inputs equal the city_df_replace DataFrame with the is_canceled column dropped, and for the output you just take is_canceled. Sometimes when you're looking at other people's machine learning, they'll have these as an upper-case X and a lower-case y; that's the more traditional way of representing them, but sometimes the names "inputs" and "output" are more meaningful and easier to read. Then we need to divide our data into training data and test data.
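Creating the inputs and the expected output as just described might look like this (tiny stand-in DataFrame, with only two feature columns instead of the class's twelve):

```python
import pandas as pd

# Tiny stand-in for city_df_replace after the drops
df = pd.DataFrame({
    "lead_time": [342, 14, 85, 7],
    "country": [1, 1, 2, 3],
    "is_canceled": [0, 1, 1, 0],
})

# Inputs: every column except the one we want to predict
data_inputs = df.drop(["is_canceled"], axis="columns")

# Expected output: just the is_canceled column
expected_output = df["is_canceled"]

# Traditionally these would be named X (upper case) and y (lower case)
print(data_inputs.columns.tolist())
print(expected_output.tolist())
```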
One of the problems you can run into with machine learning is bias, and one way around that problem is to divide the data into two parts: one part is your training data, say two-thirds of the data, and the other third is used for testing. You don't need to do that manually; scikit-learn will do all of it for us. Then we're going to create a random forest classifier. If you want to know more about random forests, Wikipedia, for example, has an article on random forests complete with all the mathematics involved; you don't really need to understand all of that to use machine learning. We're going to start off by setting the number of estimators to 100. This is one of the things you can vary to try and improve the accuracy. Then we fit our model using our training data; this can take a few seconds to run. The next few lines of code tell us how important each of the features is in making predictions. You can see lead time has a high value, and so does the country code: roughly 51 or 52% of the importance is down to just these two columns. As I said before, we dropped a lot of columns; we could start adding in and taking out columns to see what gives us a better accuracy for the model. To actually measure the accuracy, these two lines give us a value. Okay, so we have just over 82% accuracy, which is okay, but it's not fantastic; really, you want to be aiming for 90% or above these days, I'd say. One way of improving this would be to vary some of the parameters going into the random forest, for example how many estimators to use, and also which columns. I basically just chose 12 columns somewhat at random; I knew I wanted to include the country column because it seemed to be important.
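The split-fit-score sequence just described can be sketched as follows. The synthetic X and y stand in for the booking inputs and the is_canceled output, and the random_state values are my additions for reproducibility:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the booking inputs/outputs:
# the label depends mostly on feature 0, mimicking the way
# lead time and country dominate in the real model
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)

# Hold out a third of the data for testing, train on the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# How important is each feature to the predictions?
print(model.feature_importances_)

# Accuracy on the held-out test data
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```

Varying n_estimators and the set of input columns, as suggested above, is exactly the kind of experiment this scaffold supports.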
Other ones I just guessed you could take. Some of these are to replace him with some of the combs that had been dropped. I'm just experiment that way. Alternatively, you could just put in all columns, see what if you could get a better accuracy than this? 83% 6. class project: So for a class project you can find your own deficits on. There are plenty of places now you can find it asserts online. They're all free roll open source. So most governments, national governments, city governments, organizations like the United Nations, W chou, European Union. They all have portals with many data sets You can download this particular did a set that I was working with. I got this from Chicago. Chicago also has free open source data sets that you can work with. You neither work with them on their on their site, or you can download the data sets on work with him on your own machine. It's easy enough to set up an account with cargo. You just need email. And I think the first time you activate your account, you have to give, um so you need to give him a cell phone number that they can send you a pin number by text message and you have to enter that pin number. I think the first time you set up your content