
Data Science and Business Analytics with Python

Jesper Dramsch, Machine Learning Researcher

36 Lessons (4h 3m)
    • 1. Introduction (3:22)
    • 2. Class Project (1:35)
    • 3. What is Data Science? (4:44)
    • 4. Tool Overview (4:15)
    • 5. How To Find Help (14:17)
    • 6. | Data Loading | (0:21)
    • 7. Loading Excel and CSV files (6:20)
    • 8. Loading Data from SQL (5:11)
    • 9. Loading Any Data File (5:58)
    • 10. Dealing with Huge Data (10:15)
    • 11. Combining Multiple Data Sources (4:14)
    • 12. | Data Cleaning | (0:54)
    • 13. Dealing with Missing Data (8:05)
    • 14. Scaling and Binning Numerical Data (12:26)
    • 15. Validating Data with Schemas (10:09)
    • 16. Encoding Categorical Data (6:39)
    • 17. | Exploratory Data Analysis | (7:16)
    • 18. Visual Data Exploration (10:19)
    • 19. Descriptive Statistics (5:32)
    • 20. Dividing Data into Subsets (12:30)
    • 21. Finding and Understanding Relations in the Data (5:51)
    • 22. | Machine Learning | (1:08)
    • 23. Linear Regression for Price Prediction (14:30)
    • 24. Decision Trees and Random Forests (6:59)
    • 25. Machine Learning Classification (9:45)
    • 26. Data Clustering for Deeper Insights (8:16)
    • 27. Validation of Machine Learning Models (10:01)
    • 28. ML Interpretability (16:23)
    • 29. Intro to Machine Learning Fairness (7:47)
    • 30. | Visuals & Reports | (0:15)
    • 31. Visualization Basics (7:23)
    • 32. Visualizing Geospatial Information (5:30)
    • 33. Exporting Data and Visualizations (6:42)
    • 34. Creating Presentations directly in Jupyter (2:38)
    • 35. Generating PDF Reports from Jupyter (3:47)
    • 36. Conclusion and Congratulations! (2:04)

About This Class

Business analytics and data science have become important skills across all industries. Knowing how to perform analyses, as well as how to sense-check them and understand the underlying concepts, is key to making decisions today.

Python has become the lingua franca of data science and is therefore the topic of this class. This class assumes some Python knowledge; if you'd prefer a high-level introduction to data science without programming, I recommend this class on Skillshare.

Programming can be intimidating; however, Python excels due to its readability and because it is freely available for all platforms, including Linux, Mac, and Windows. This class assumes some prior knowledge of Python syntax, but to establish a common learning environment, some of the basics will be covered. We will cover the full data science workflow, including:

  • Loading data from files (e.g. Excel tables) and databases (e.g. SQL servers)
  • Data cleaning
  • Exploratory data analysis
  • Machine learning
  • Model validation and churn analysis
  • Data visualization and report generation

In this class, we will use freely and openly available Python libraries, including Jupyter, NumPy, SciPy, pandas, Matplotlib, Seaborn, and scikit-learn, and you will also learn how to quickly pick up new libraries.

Transcripts

1. Introduction: Data science, in a sense, is like a detective story to me. You unravel hidden relationships in the data, and you build a narrative around those relationships. My name is Jesper Dramsch and I'm a machine learning researcher and data scientist. I spent the last three years working towards my PhD in machine learning and geophysics. I have experience working as a consultant, teaching Python and machine learning in places like Shell and the UK government, but also midsize businesses and universities. All this experience has given me the ability to finish my IBM Data Science Professional Certificate in 48 hours, for a course that's supposed to take about a year. I also create exactly the kind of notebooks that you'll learn to create in this course for data science and machine learning competitions on a platform called Kaggle, which is owned by Google. There I reached rank 81 worldwide out of more than 100,000 participants. After this course, you will have gone through every step of the data science workflow. That means you'll be able to recreate all the visualizations and have all the code snippets available for later use with your own data in your own reporting. We'll take a very applied, step-by-step approach. We'll start at the very beginning, with getting your data into Python. That means looking at Excel files and looking at SQL tables, but also looking at those weird little data formats that can sometimes be a bit tricky to work with. Then we'll preprocess our data, clean our data, and do exploratory data analysis, or EDA for short. EDA is the part where you really refine your question, and where we have a look at the relations in our data and answer those questions. Afterwards, for fun, we'll have a bit of a look at machine learning modeling and how to validate those machine learning models, because in this modern time, this is more important than ever. We'll have a look at different data visualizations, how to best tell your story, how to best generate presentations and reports to really convince, to really punctuate the story that you can tell with that data. Finally, we'll have a look at automatically generating presentations and PDF reports directly from Python. I have the unfortunate luck of having graduated into a recession twice now. But Python has given me the ability to finish a PhD while working as a consultant and to make these amazing, world-class data science portfolios for myself that have now generated so much attention. It's amazing. One of my notebooks has been viewed over 50,000 times. I hope to share this with you. Data science, for me, is a super exciting new field and Python is very accessible. So I hope to see you in class.

2. Class Project: Welcome to class, thanks for checking it out, and I'm really happy to have you. This class will be in bite-sized videos that are part of larger chapters, so you can come back and have a look at the small details without having to search through the longer videos. Each chapter will be one of the steps in the data science workflow. In the end, because data science is an applied science, we'll have a project, and in this project, you'll recreate what we do in these video lectures and go through the entire data science workflow, and in the end, generate a PDF or a presentation with your findings, either on your own data or on a dataset that I provide. On top of that, I make all of these notebooks available to you, so you can code along during the videos, because it's best to experiment.
Sometimes you see something, you want to create something different, you want to understand it better, and then experimenting with the code that I have on screen is really the best way to do it. For the first couple of lectures, I want to make sure that everyone has an equal starting ground, so we'll have a look at the tools and we'll have some introductory lectures where we really get everyone up to speed, and then we'll start with the entire data science workflow, from loading and cleaning through exploratory data analysis, all the way to machine learning and report generation.

3. What is Data Science?: In this class we'll look at data science from two different perspectives. In the first one, we'll have a look at what actually constitutes data science and what the important fundamentals are, and the other one is the process approach: how do you actually do data science? Defining data science is a bit tricky because it's such a new discipline that everyone has a slightly different opinion on it. I like the way that Jim Gray, a Turing Award winner, basically defines it as a fourth pillar of science and says that data science, or information technology, really changes everything about science. I think the impact of data-driven decisions on science and business is really showing. One of my favorite ways to look at data science is the data science hierarchy of needs by Monica Rogati. She defines it as this pyramid of base-level needs and then, well, more niche needs as you go higher. At the very base of that hierarchy of needs is collecting data. We have to be aware that already in the collection process, we bias our data. A lot of people like to think that data is unbiased truth, but that's really not the case. A lot of times, even in physical systems, we bias our data already by collecting the data. Then we go on to level two, moving and storing data: making sure that we have reliable storage and a reliable flow of data, and having an ETL (extract, transform, load) process in place to really support the infrastructure of data science. The next level, level three, is exploring and transforming data: doing anomaly detection and cleaning, preparing our data for the actual analysis. Step four is aggregating and labeling the data, deciding on the metrics that we'll use and looking at the features and the training data. The penultimate step is doing the actual modeling: doing A/B testing, testing between one version of the website and another, and experimenting with simple machine learning algorithms to gain insights, to model the data, and to make predictions based on data. The tip of the pyramid is AI and deep learning. The really juicy stuff, but also the stuff that most companies actually don't need. This roughly summarizes how much time you should spend on each step in the pyramid as well. If you don't spend any time acquiring data or thinking about data, then you probably have a problem down the line. Another way to look at data science is asking questions. The data science process is fundamentally about asking questions about your data. It's a very iterative approach, so in the beginning, you pose a question and you acquire your data, but how is the data actually sampled? This goes into the bias of your data and which data is relevant, and then you go on to explore the data. In the exploratory data analysis we inspect the data, but sometimes you have to go back. It's an iterative process. During exploration, you may see that some other data source would really complement the information you have in your data. You go back and forth between steps as well.
Then you model the data. These can be simple machine learning models; just like in the hierarchy of needs, you really gain insights by modeling your data with simple algorithms. Finally, and this is not part of the hierarchy of needs, but it is definitely part of the data science process, is communicating your results. What did we learn? How can we make sense of the data? What are our insights? How can we convince people of the insights that we have? We all know that sometimes knowing the truth isn't enough. You have to tell a compelling story to convince people of your insights and to really make an impact with your data science. This class will show you the entire process and also how to generate those stories out of data.

4. Tool Overview: Let's have an overview of the tools that we're using in this class. Obviously, everything data-science-related will be universal, but learning Python is also extremely valuable to your skill set. Python has gained a lot of popularity because it's free and open source, it's very easy to use, and it can be installed on pretty much any device. So Mac, Windows, Unix, even your phone are not a problem, and Python is code for humans. A lot of places, Google, YouTube, Instagram, Spotify, all use Python at least in part, because it's so easy to get new people on board with Python: if you write good Python code, it can almost be read like text. We'll install Python 3.8 using the Anaconda installation. Anaconda is nice because it distributes a lot of the data science packages that we already need, and it's free. If you own a later version of Python, that should be completely fine as long as it's Python 3. You may be wondering if you need to install some kind of IDE or some kind of compiler for Python, and that's not the case. We'll be using Jupyter, which is a web-based interface to Python and makes teaching Python and learning Python extremely easy, and going from that, you can always move to another editor. One of my favorites is VS Code. It's gotten really good for Python development, and VS Code actually comes with an extension for Jupyter as well, but that's for another day. At the base of everything we do, there's NumPy. It is the scientific computing library in Python, and we won't be interfacing with it directly, but I want you to know it's there, so whenever you need to make some calculation, you can do it in Python. It has been used to find black holes, it is used for sports analytics and for finance calculations, and it is used by every package that we will be using in this course. You'll quickly notice in this course that everything we do depends on pandas. Pandas is this powerful tool that, for me, is a mix between Excel and SQL, and it's really a data analysis and manipulation tool. We store our information with mixed columns in a DataFrame, and this DataFrame can then be manipulated, changed, and added onto, all within this tool. For the machine learning portion and the model validation, we're using scikit-learn and libraries built upon scikit-learn. scikit-learn has changed a lot about how we do machine learning and has enabled part of the big boom in machine learning interest that we see in the world right now. Matplotlib is a data visualization tool, and we'll mostly be using libraries that build upon Matplotlib, but it's very important to know it's there, and it has an extensive gallery with examples where you can have a look at what you want to build.
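To give a feel for how little code a basic Matplotlib plot takes, here is a minimal sketch with made-up values; the numbers are purely illustrative and not from the class materials:

    import matplotlib.pyplot as plt

    # Made-up house prices, purely for illustration.
    prices = [120_000, 250_000, 180_000, 310_000]
    plt.plot(prices, marker="o")
    plt.ylabel("price in USD")
    plt.show()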
Seaborn is one of these libraries that build upon Matplotlib, and it's extremely powerful in that it often takes a single line or a couple of lines to make very beautiful data visualizations of your statistical data. These are the cornerstone tools that we'll be using in data science. They're open source, they're free, and they're the big ones, but we'll be using a couple of other, smaller tools that I've grown to really like as well, and I'll introduce them along the course. The documentation of these open source tools is amazing because it's also built by volunteers like me. I've written part of the pandas and the scikit-learn documentation, and you'll find that it's really helpful, with small, nifty examples that'll really make you understand the code better. If you're using these in a corporate setting, they're still free, but consider becoming an advocate for sponsorship, because these packages really rely on having paid developers and core maintainers.

5. How To Find Help: It can feel really daunting doing a course like this, I totally understand. I'm constantly learning, I'm doing these courses, and being alone in these courses is terrible. But Skillshare has the project page where you can ask for help. In this class, we'll also have a look at all the different ways you can find help and how you can learn to help yourself, because every programmer will tell you that they got increasingly better at programming when they learned how to Google for the right things. To start out, we'll have a look at the Jupyter notebook, because the Jupyter notebook directly wants to help us. If we have any kind of function, even the print function, we can use Shift-Tab, and when we hit it once, it opens up the basic description, so we get the signature of our function. That means print is the name, this is the first argument, the dot dot dot means there can be several values, and these are the keyword arguments. It gives back the first sentence of the documentation, the docstring. We can hit Shift-Tab several times: two times just opens up the entire docstring, three times keeps the docstring open longer so you can click around in it, and four times casts it to the pane at the bottom. There you have it available while you're working, and you can pop it out into its own window, but also just close it down. In addition, we'll be working with pandas. When we start typing, we can often hit Tab for auto-completion. This is really my personal way of being a bit lazy when typing, so when I want to import pandas, I can hit Tab and see which kinds of packages are installed. We import pandas as pd and execute right here. I'll do this with Control-Enter to stay in the same place; Shift-Enter executes it and gets me to the next cell. So pd is now our pandas. When I hit period and Tab, it'll open up all the available methods on pd. Here I can really check out anything: if I want to merge something, I can then hit the parentheses, Shift-Tab into it, and read how to merge. Now, this can be a bit rough to read, even though we can put it all the way at the bottom here. That is why there is the pandas documentation, which is essentially built from the docstrings with a little bit of formatting tricks. You can see right here what this is and the signature of it, and you can read into it and even have a look at the examples and copy the examples over. One thing to know in software is that these kinds of code snippets aren't anything sacred. You don't really have to type them out.
You can just copy them over and say, all right, I needed this; now I have a nice DataFrame with age, etc. Copying something like this is super common. It is just what we do in software. The next way to get help is Google. I sometimes make the joke that in interviews you should just have people Google Python and see if it shows snakes or if it shows the Python logo, because at some point Google starts to get to know you and shows you more and more Python. It's a good way to see that you have a lot of experience in Python. When you want to ask any kind of question, when you're stuck with anything, you have a very obscure data format that you want to load, or you just have an error that you don't really know what to do with, you copy it over; let's say you have a type error, for example. You just have a look here, and there is usually one result highlighted. But of course, Google always changes, and you are often led to the docs. In this case it's the Python docs, and then one of the links is going to be Stack Overflow as well. Stack Overflow is this website that, well, it's extremely helpful, but it's also not the best place for newbies, because there are some of the best experts in the world on this website answering your question, but if your question is not well formulated, some of the people on this website can sometimes be a bit unfriendly about it. However, for browsing and for finding solutions it's great, because your question has probably been asked before. If you can't find it on Stack Overflow, try changing your Google query a little bit so you find different results. Like, what kind of type error did you have? Copy over the entire name of the type error and all that. Then you want to scroll down to the answers, and this one isn't really upvoted that much, but oftentimes you have an upvoted answer that is very popular, and sometimes you even get accepted answers; have a look at this one. Here you have a green check mark on it, which means that the question asker has marked this as the accepted answer. You can see right here that people put a lot of work into answering these questions. You have different ways to see this with code examples, and you can really check out what to do next with your error. Let's go back to Jupyter and close this one out, because there is something else that I do want to show you in Python: errors. We can just readily produce them, and if we have something like this, Python will tell us right away what's going on. There's something weird in the beginning, but what I always do first on any error, however long it is (this one is very short), is scroll to the very last line and have a look. This is a syntax error and it says unexpected EOF while parsing. EOF means end of file. If you don't really know what this is, copy it over, check out Google, and have a look whether Google tells you what this is. Oftentimes the Google search is better than the search on the websites themselves. Here, it means that the end of your source code was reached before all code blocks were completed. Let's have a look at our code again. Here, if we close the parenthesis, our code is now completed and it works quite well. Let's generate another error. Something that we can definitely not do is divide this string by some number. If we execute this, this gives us a type error.
We'll scroll all the way to the bottom to see what's happening right here, and it tells you that the division is not possible between strings and integers. Really, going through errors is your way to discern why Python does not like what you've written. Since we're on the topic of help, and I won't be able to look over your shoulder in this class: a very common error that you can catch yourself is that these Python notebooks do not have to be executed in order. You see the little number right here next to what has been executed and what hasn't. Let's make a small example and add some new cells here. Let's say right here I define a, and here I want to define b, and b is going to be a times 5. I go through here, I experiment with this, I have a look at pd.merge, I have an error here, which is fine, we can leave that for now. I run this code, maybe print something. You can see these numbers are out of order. This is important later. Then I execute this cell and it gives me an error: NameError: name 'a' is not defined, and that's because this cell does not have a number. It has never been executed. Just something to note: you have to execute all the cells that you want to use. When we run this one and then run this one, this is completely fine, so really have a look at the numbers. The next error, which is very related to this, is that sometimes we change something somewhere, like here where we change a to be six. Then we run this code again, and suddenly we have a look and b is 30, although a is 5 here. This is one of the big problems that people have with out-of-order execution, and you have to be careful about it. You just have to track which cells you ran, and especially with numbers like 10, 7, 8, 49, this gets really hard to keep in mind. You can even delete these cells and a is still going to be in memory; we can still execute this despite the cells not existing anymore. Sometimes you just have to go to the Kernel menu and say Restart and Clear Output, which clears all of the variables and all of the outputs that we have right here. We go here, hit this big red button, and now we have a fresh notebook with all the code in it. Now we can execute this in order, get all the errors that we have, and see right here that a is in fact not defined. We basically have to add a new line here and define a again. That way you can catch a lot of errors in Jupyter by having a look at the numbers right here: did you forget to execute something, or did you do it out of order? So, in total, what you want to do to find help in Python is: remember Shift-Tab, and remember that Tab autocompletes your queries and can give you information about what methods are available on basically anything. Then you want to get really good at Googling things. In some of my classes, some of the people that I became friends with laughed at me at some point and said, 'Your class could have essentially been just Google this,' because at some point everyone has to Google stuff, and there are some funny posts on Twitter as well of maintainers of libraries having to Google very basic things about their own libraries, because our brains are only so reliable and things change. If you want to have the newest information, there's no shame in looking it up. When you're done with googling, with looking it up, and with copying over some code from Stack Overflow, you'll be better off for it.
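As a small, hedged aside, the same ideas work outside a notebook too; a minimal sketch of reading an error from its last line and pulling up a docstring (the values are made up):

    # Reproduce the type error discussed above and read its message.
    try:
        "3" / 2        # dividing a string by a number
    except TypeError as err:
        print(err)     # unsupported operand type(s) for /: 'str' and 'int'

    # The same docstring that Shift-Tab shows in Jupyter.
    help(print)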
These are all the tools to find help in Python and to help yourself, and this gives you the necessary tools to dive into data science with Python.

6. | Data Loading |: In the first couple of classes, we'll be getting data into Python. Whether you have data in Excel tables or in your SQL database, it doesn't matter. We'll put it into Python in a tool called pandas, which is essentially Excel on steroids in Python. Let's free your data.

7. Loading Excel and CSV files: This is the first class where we touch code. Open up your Jupyter notebook if you want to code along. We'll start with loading data. I have provided some Excel files and CSVs, so comma-separated value files, and we'll get into loading them. Now, we could write this by hand, and I'll show you on a much simpler example how to write something like this by hand. But luckily, with Python now being over 20 years old, a lot of people have already put a lot of thought into extending Python's functionality. We can use pandas here and extend Python to load data into Python. What we do here is just say import pandas, and this would be enough, but because we'll be using pandas a lot, we usually give it a shorthand, an alias. pd is a very common one that a lot of people use, and then we execute the cell and we now have pandas in Python. To import or read data, we can type pd.read, hit Tab, and see all the different ways you can load data into pandas. In this course, we'll have a look at the most common ones that I found in my work as a data scientist, but I'll show you how to find the others as well, because if we don't really know what we're doing, we can always have a look at the pandas documentation, where we can see everything that we can do with pandas. Since we have read_excel here already, we can also hit Shift-Tab and have a look at its signature. You'll see that this looks eerily similar to the documentation, because pandas, and all of Python actually, comes with its documentation built in, so it's very stand-alone and very user-friendly. In the beginning, we just need to give the filename where we actually have the file, and this is going to be data/housing.xlsx, the new Excel format. We execute this and see that we have all this data now in pandas. We didn't save it into a variable right now, but what we usually do if we just have a temporary dataset is call it df, because in Python this is called a DataFrame. It is basically the representation of a single Excel sheet in Python. Because we want to have a look at it afterwards, we'll just call head on our DataFrame and have a look at the first five rows. We can see here the headers and our data. CSV files are a little bit different because CSV files are raw data. Let's have a look here. I have the data. We can actually have a look at a CSV, so comma-separated values, in Notepad, because it's just text, and it's fantastic for sharing data between systems, especially if you have programmers that might not have Microsoft Office available. This is the best way to share data. We call pd.read_csv and can just give it the file name again, so housing.csv, and let's call head right on this one. It should give us the same data, and we can see they are the same. I want to show you a really cool trick though. If you know some data is online, like this dataset of Medium articles on freeCodeCamp, you can actually call pd.read_csv and just give it the URL. But this is going to fail, I'll show you.
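Before we look at why it fails, here is a minimal recap sketch of the loading calls used so far; the file paths follow the class materials and are assumptions:

    import pandas as pd

    df = pd.read_excel("data/housing.xlsx")   # one Excel sheet becomes a DataFrame
    df_csv = pd.read_csv("data/housing.csv")  # the same data from a comma-separated file
    df.head()                                 # first five rows with headers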
We have to learn that errors in Python are fine; it's totally okay to make errors. Read the last line: ParserError, error tokenizing data, so it was expecting something different. You may already see here that this is not a CSV, this is a TSV file. Someone was actually separating this with tabs, and to account for the tabs, we set the backslash-t character as the separator, and we can import this data right from the internet by just giving the correct keyword. This is something really important to see, really important to learn. If we have a look at this, there are a lot of keywords that we can use. These keywords are extremely useful for cleaning up your data already. You can see right here that there's something called NaN; this is a not-a-number value that we have to clean later. During the loading, we can already have a look at things like: do we want to skip blank lines? Pandas really is very user-friendly. If you want to experiment with this one, I'll leave it in the exercise section and you can check out whether you can already clean it up somewhat; we'll have a dedicated section on cleaning data later as well. Loading data into Python with pandas is extremely easy. Try it out with your own data. If you have an Excel file lying around on your computer, remember, all of this runs on your computer, nothing gets out, so you can just pd.read your data and play around with it. In this class we worked through loading Excel tables and comma-separated value files and even had a look at how to load data from the internet. Next class, we'll have a look at SQL tables. If you've never worked with them, feel free to skip it. The class after that will be there, right for you.

8. Loading Data from SQL: SQL databases are a fantastic way to store data and make it available to data scientists. Covering SQL fully would be too much here; there are entire courses on Skillshare that I'll link to, and you can find them right here in the notebook as well. However, it's good to have an overview, because it's really easy to load the data once you know how to do it. If you work with SQL, this will be really valuable to you. Most companies do not store their data in Excel files, because an Excel file gets copied, gets copied again, and suddenly you end up with a final, final, final version that's probably on someone's PC somewhere, maybe on a laptop. So instead, a lot of places have databases on a server, and this database contains all the information that you need. Usually this way of accessing information is called SQL, which is short for Structured Query Language. This is a language in itself, and it would be too much to explain it in this course. If you want to learn more, there are courses on Skillshare. There are also resources like the one I linked, where you can try it out, do the exercises step by step, learn how to write a query, and get data into Python in a more advanced way. For us, it is absolutely enough to once again import pandas. Then we can have a look whether there is something SQL-related down here. What you can find here are actually three different functions: a general read_sql, a read_sql_query, and a read_sql_table, which are all in the documentation. That's usually a very good place to start. We see that there are two kinds of ways: if we scroll down, we can see the difference between read_sql_table and read_sql_query. Let's have a look at the query one, and this needs you to write an SQL query. Some of them can be very simple and can save you a lot of space if you have a big database; read_sql_table just loads the entire table from your server.
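Putting the pieces together, a minimal sketch of both functions; the connection string, table, and column names are placeholders, and the engine setup with SQLAlchemy is walked through next:

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder PostgreSQL connection string, not real credentials.
    engine = create_engine("postgresql://user:password@localhost:5432/mydatabase")

    # Load a whole table ...
    sales = pd.read_sql_table("sales", engine)

    # ... or run an explicit query.
    query = """
    SELECT customers, total_spend
    FROM sales
    WHERE year = 2019
    LIMIT 1000
    """
    subset = pd.read_sql_query(query, engine)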
In addition to pandas, we actually want to import SQLAlchemy. Then below this we'll create the connection, which is called an engine. Let's have a look at what we need in here. If you have a PostgreSQL database, we can just copy this; this should be the location of your database. Here we go with read_sql_table just to make it easy. Now, if you had your SQL database, you could put your table name here, for example sales, and add the connection here. If we wanted to actually use the SQL language, we would have to use read_sql_query. That means in this case that we need to define a query that goes with our connection. This query can be very, very simple, but of course it can also be as complicated as you want. We'll take a multiline string here in Python, so we can say: select customers and total_spend from sales. Because it's such a big table, we want to limit it to 1,000 entries, because we just want to have an initial look and we don't want to overload our computer. In addition to that, we want to say that the year is 2019. Now we can copy this entire thing over here and select our data from our imaginary database right here using read_sql_query. Hopefully in this class you saw that SQL can be quite easy. You can just get the table from the database and work with it in pandas. The next class is going to be about how to load any kind of data, and we'll show that pandas makes everything a little bit easier.

9. Loading Any Data File: Sometimes you have weird data. I'm a geophysicist, I work with seismic data, and there are packages that can load seismic data into Python, just like our CSV files. In this class, we'll have a look at how to load any data and how to make it available in Python. Pandas is great for tables and structured data like that, but sometimes we have different data formats, like a plain text file, images, or proprietary formats. When I was mentoring a class at the US Python conference, someone asked me about a super specific format that they work with. The first thing I did was Google it. There was a Python library for it, and I'll show you how to use most common Python libraries. We'll use a text file, like the text file we have here. It's a CSV, but it's still a text file. As you can see, what we say is with open, and then we have to give it the place where it is and the name, and now let's Shift-Tab into this. There are different modes; the standard mode is r. Let's have a look at what these modes actually mean, because you can open files on any computer, it's just that most programs do it for you: in read mode, write mode, and in append mode. You want to make sure that if you're reading data that you don't want to change, this is set to r. Let's make this explicit; then we give the file that we open a name. We can just call it housing. In Python, white space is very important. Now we have a block in which our file is opened. Within this block, for example, we can say data equals housing dot read, and this reads our data. Now, if we go out of this block, we can actually work with our variable without having the file open. This is incredibly important. A lot of people that learn programming on YouTube don't know this, but most files can only be opened by one person and one program at a time. If two try to access the same data, it will break the data. It's really important that you open your data, save it into variables, load it into Python, and then close everything. So we have it here in this data variable, we go out of this block, or just execute this, and go to the next cell.
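A minimal sketch of the with-open pattern just described; the file path follows the class materials:

    with open("data/housing.csv", mode="r") as housing:
        data = housing.read()   # read the whole file into one string

    # Outside the block the file handle is closed, but the variable remains.
    print(housing.closed)       # True
    print(data[:100])           # the first characters of the raw text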
We can do stuff with data, like have a look at what is in data. We can see right here that this is a text file, without having the file open, which is just a very easy and accessible way to do it. We can also have a look at housing, our file handle, right here, and we can see that this tells us whether housing is closed or not. Right here, we can see that after this block is executed, it is closed. Let's have a look at how this looks inside the block. Inside, it is not closed. That means we can read different lines and all that. However, instead of just using the standard Python open, we can use a lot of different libraries that also give us a file handle. I can use something like segyio, which you have probably never heard of before. That's why I want to show it to you real quick; it is imported just the same way. After importing this, we can say with segyio.open, give it the file, name it f, and then load all the geophysical data into Python. After this is done, the file once again is closed and we're safe. This is a very general way to go about loading your data into Python. As you can see here, our CSV doesn't look as nice as it does in pandas. But with a bit of processing, we can actually make it look as nice as pandas. We can split it, for example, on the newline character, which is \n, and we can see that this already gives us all these lines in here. We can go on and split each of these lines on the comma, because it is comma-separated, and so on and so on. But that's why I showed you pandas first, because it's so much easier. I think it's very nice to go to these high-level abstractions first, but also to see how the work is done underneath. In this class we had an overview of what the with open statement does and how to load any data: search for data loaders for the weird formats that we sometimes have. I think we definitely saw how easy pandas makes it for us, because splitting a CSV file like that is really cumbersome, and then cleaning the data, like missing values, is even worse. In the next class, we'll have a look at huge datasets. What happens when our files become so large that they don't fit into memory anymore? How can we load this data and how can we deal with it?

10. Dealing with Huge Data: It is quite common, especially in larger companies, that you have datasets that do not fit into your computer's memory anymore, or that if you do calculations with them, the calculation takes so long that essentially you get bored; in some cases it would take longer than the universe has existed. That means we have to find ways to work with the data: to make it smaller in memory, and we'll talk about that, but also to sample the data so you have a subset, because oftentimes it is valid to just take a representative sample of the big data and then make your calculations, do the data science, on that. This is what we are getting into. We'll import pandas as pd and then load our data into the df DataFrame with read_csv. We will do this explicitly now because we can change it later to see the differences between different loading procedures and how we can optimize our loading. This gives us the following memory footprint of our loaded DataFrame. We have to say deep equals true because we have some objects in there that have to be measured. You see right here that ocean proximity is quite a lot larger than everything else, and that's because ocean proximity contains string data. We know it is categorical. We'll have a look at the head real quick. Right here.
It is categorical and everything else is numbers. The numbers are very efficient, but having strings in there can be very memory intensive. If we have a look at the dtypes, so the data types, we see that right now ocean proximity is just an object. Everything else is float, so a number, but the object right here is what makes it so large in memory. Luckily, we can change the data types of our DataFrame after we've loaded it. We'll do this by saying df of ocean proximity, because we want to change ocean proximity, copy all of that, and we'll override our ocean proximity with .astype. We can use a special data type that pandas has available, which is called categorical, or category. This improves our memory usage. We call memory usage with deep equals true again, so we see the true memory footprint of the columns. We can see that this improves the memory usage of ocean proximity significantly, even below the usage of the numbers. This is how you can make your DataFrame more optimal in a simple way. An obvious problem with this is that we already have this data in memory and then we're changing it. The memory footprint of this is still large; we're just reducing it afterwards. What we can do instead is change the data types at load time. Let's have a quick look in the docstring, and there we go: it's dtype, and we'll assign a dictionary where the key is our column, we'll use ocean proximity again, and the value is going to be the data type. I made a typo there and a typo in housing. There we go. Using this, you can also assign the integer type to numbers and really change your loading at loading time. So df_small; let's have a look at the memory footprint of this, using df_small memory usage with deep equals true. We can see right here that this changed the memory footprint of the DataFrame automatically at loading time. What if, instead of loading the entire DataFrame with all columns, all features available, we choose to only take a subset of the columns? Maybe we don't need everything. Maybe we don't need the median house price in this one. We'll define a new DataFrame and load the data as always, but in this case we'll define the columns. That's columns, and in this case we'll need a list. Let's have a look: we use longitude and latitude, and we could also use total bedrooms or something like that, but we'll just use ocean proximity as before. Just paste this in and edit it; it's actually one column name per list entry, and we add ocean proximity. This is going to go wrong, and I want you to learn that it's absolutely okay to make mistakes here, because in Python, mistakes are cheap. We can see that type error; it says that it doesn't recognize one of the keywords. That's because I used columns instead of usecols. I honestly can't remember all the keywords because there are so many, but that's why we have the docstring, and we correct it. Looking at the DataFrame, we only loaded longitude, latitude, and ocean proximity, another very nice way to save some space while loading. This way we can load a lot of rows with only a few columns. Sometimes the problem isn't really loading the data though; all the data fits into our DataFrame, but the problem is doing the calculation. Maybe we have a very expensive function, a very expensive plot that we want to do, so we'll have to sample our data. Pandas makes this extremely easy for us. Each DataFrame has the method sample available. You just provide a number and it gives you that many rows from your DataFrame. Let's have a quick look at the docstring.
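As a compact recap of the loading-time options above, before we look at sampling in more detail, a minimal sketch; the column names follow the housing data:

    import pandas as pd

    df_small = pd.read_csv(
        "data/housing.csv",
        usecols=["longitude", "latitude", "ocean_proximity"],  # load only a few columns
        dtype={"ocean_proximity": "category"},                 # leaner than a plain object column
    )
    df_small.memory_usage(deep=True)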
We can define a number or a fraction of the DataFrame. Since it's a stochastic sampling process, you can provide a random state, which is really important if you want to recreate your analysis and provide it to a colleague or another data scientist. Then you'll have to input the random state right there. We can see right here that the result changes every time I execute the cell. But we can set the random state to a specified number; it can be any integer that you want, I like 42. You see right here that this number is 2048, and if I execute this again, this number does not change. This is a really good thing to get used to. If you have any random process, that random process is great when you use it in production, but if you want to recreate something, you want to fix that random process so it's reproducible. What I often do is go to the very first cell where I import all my libraries, fix a random state there as a variable, and just provide that variable in stochastic processes. That makes it a little bit easier and very easy to read for the next data scientist who gets this. Sometimes you have to get out the big tools though. One of those is Dask; we won't use it right here, but you can try it on the website if you go to Try Now. Dask is basically lazy DataFrames. It doesn't load the entire DataFrame into memory when you point it to the DataFrame or to the data, but it knows where the data is, and when you want to do the execution, it'll do the execution and, in a very smart way, distribute it even over big clusters. In this class, we had a look at how to minimize the memory footprint of data, how we can load less data, and how we can load data more efficiently. I also gave you a quick look at some tools you can use if you want lazy DataFrames, for example: DataFrames that stay at rest when you load them, and then when you do the computation, it is done chunk-wise. It's a smart way to deal with large data at rest. In the next part, we'll have a look at how to combine different data sources, how we can really flourish and bring different information sources together to really do data science.

11. Combining Multiple Data Sources: Often the biggest impact really comes from combining data sources. Maybe you have data from sales and advertisement, and you combine these data sources to generate new insights. In this class we'll have a look at how we can merge data, join data together, and append new data to our DataFrame. As always, we'll import pandas as pd and save our DataFrame in df. Now we'll split out the geo data with latitude, longitude, and the ocean proximity into df_geo. Let's have a look at the head, and we can see it's three columns, exactly like we defined, and now we can join it. Joining data sources means that we want to add a column to our DataFrame. So we'll take our df_geo and join a column from the original dataset into this. Now, this is technically cheating a little bit, but it just makes it easier to show how we do it. We'll choose the median house price for this one; let's have a look at the whole DataFrame, and we can put that into our geo DataFrame. We can see how this now contains the original geo DataFrame joined with the column median house value. This is a little bit easier than normal; normally you don't have all the columns available, but we'll have a look at how to merge DataFrames now, where you can be a little bit more specific. Let's create a price DataFrame first with longitude, latitude, and the median house price, and what we'll do now is merge both of these into one DataFrame.
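As a preview, the merge we are building toward looks roughly like this; the how and on arguments are explained next, and the DataFrame names follow the class materials:

    # Keep only rows whose latitude appears in both DataFrames.
    df_merged = df_geo.merge(df_price, how="inner", on="latitude")
    df_merged.head()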
We take the geo DataFrame and call geo.merge. Let's have a quick look at the docstring for how to actually do this: we want a left DataFrame and a right DataFrame, and we define a method for how to join these. The inner method means that we only keep the data that is available in both left and right. Let's have a quick look at the left and the right DataFrame. The natural join is the inner join: only the rows that are there in both DataFrames are kept. The left join is everything from the left and only the matching rows from the right, and the right join is everything from the right and only the matching rows from the left. The outer join is everything, so we fill it up with a lot of NaNs. We also have to define the column that the left and the right DataFrame are merged on. We'll take latitude in this case, so we have something that we can actually combine our datasets on. If you have your own data sources, left and right should contain the same data in that column, but they can have completely different names. Now, that worked quite well; you can see that everything is now merged. We can also concatenate our data. That means we'll use pd.concat, for concatenate, and provide the DataFrames that we want to combine into a larger DataFrame. In this case we have two, but we can combine as many as we want, and right now you see a good way to add new data or new data points to the rows of the DataFrame. Wherever you don't have data, NaNs are provided. However, since we want to join the data, we provide a join and the axis. You can see everything is now joined into one large DataFrame. In this class, we had an overview of how to combine different data sources and generate one big DataFrame so we can do a combined analysis. That concludes our data loading tutorial. In the next chapter, we'll have a look at data cleaning, probably the most important part of data science.

12. | Data Cleaning |: After loading the data, we have to deal with the data itself. Any data scientist will tell you that 90 percent of their work is done in the cleaning step. If you do not clean your data thoroughly, you will get bad results. That's why we spend a lot of time having a look at different missing values and outliers, how to get rid of them, and how to really improve our dataset after we load it. Because sometimes the measurements are faulty, sometimes data goes missing or gets corrupted, and sometimes we just have someone in data entry who isn't really paying attention. It doesn't matter. We have the data that we have, and we have to improve the data to a point where we can make good decisions based on it.

13. Dealing with Missing Data: The first step in the data cleaning process for me usually is looking at missing data. Missing data can have different sources. Maybe that data isn't available, maybe it got lost, maybe it got corrupted, and usually it's not a problem. We can fill in that data, but hear me out. I think oftentimes missing data is very informative in itself. While we can fill in data with the average or something like that, and I'll show you how to do that, oftentimes preserving the information that there is missing data is much more informative than filling in that data. Like, if you have an online shop for clothes and someone never clicked on the baby category, they probably don't have a kid, and that is a lot of information you can take just from that information not being there. As usual, we'll import pandas as pd. This time we will also import the missingno library as msno. We'll read the data into our df DataFrame.
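A minimal sketch of that setup, assuming the missingno package is installed and using the file path from the class materials:

    import pandas as pd
    import missingno as msno

    df = pd.read_csv("data/housing.csv")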
Missingno is this fantastic library that helps visualize missing values in a very nice way. When we have a look at df, we can see that total bedrooms has some missing values in there. Everything else seems to be quite fine. When we have a look at the bar chart, we can see that, to really see how well this library works, we have to look at another dataset, and there is an example dataset for missingno that we'll now load. We'll load this data from Quilt. You can have this installed as well, and down in the exercise you can see how to get this data. We will load this New York City collision data; it is vehicle collision data that we'll get into our variable. This data has significantly more missing values. We will have a quick look. There are a lot of very different columns, and we can already see that there are a lot of NaNs for us to explore with missingno. We'll replace all the 'nan' strings with the NumPy value np.nan. NumPy is this numeric Python library that provides a lot of utility, and np.nan is just a native data type that represents not-a-number in our data. This is the same thing that pandas uses when it gives you NaN values. In my data, oftentimes this can be a minus 999.25, but it can be anything really, and you can specify anything you want, which is then replaced as NaN, so you know it is a missing value. I'll leave that for later. Let's have a look at the matrix. We see there are more columns in here and the columns are much more heterogeneous. We have some columns with almost all values missing, and on this side we can also see which row has the most values filled out and which row has the fewest values filled out. Sorry about that being a bit slow; let's have a look at the bar chart. We can see which columns have the most data filled out and which have the most missing data. Now, the dendrogram is a fantastic tool to see relationships in missing data. The closer the branching is to zero, the higher the correlation of the missing values. That means on the top right you can see a lot of values that are missing together. This is an easy way to count all the values that are missing in this DataFrame. Let's switch back to our original DataFrame, the house prices, where we can also just count the null values, and we can see that total bedrooms is the only one that has missing values, with 207. In addition to looking at missingno, we can get numeric values out of this. Let's have a look at the total bedrooms right here and add a new column to our DataFrame, which is total bedrooms corrected, because I don't like overwriting the original data; I would rather add a new column to my dataset. Here we say: fill our missing values with the median value of total bedrooms, because total bedrooms is a count, so the mean value, the average value, doesn't quite make sense; we'd rather fill it with the median, a more typical value for the number of bedrooms. There we go, this would be the mean and this is the median. Luckily, pandas makes all of those available as methods, so it's very easy to replace them. We will replace it in place this time, but you have to be careful with that; it's sometimes not the best practice to do this. Now we can see that total bedrooms corrected does not have any missing values. When we have a look at total bedrooms and total bedrooms corrected right here, we can see that these are the same values. The values that did not have any NaNs did not get changed; only the values with NaN were replaced.
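Continuing from that setup, a minimal sketch of the steps just shown, from the visual overview to the filled-in copy of the column; the column names follow the housing data:

    # df and msno as set up above.
    msno.matrix(df)   # which cells are missing, row by row
    msno.bar(df)      # how complete each column is

    # Keep the original column and add a corrected copy filled with the median.
    df["total_bedrooms_corrected"] = df["total_bedrooms"].fillna(
        df["total_bedrooms"].median()
    )
    df["total_bedrooms_corrected"].isna().sum()   # 0 missing values left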
In this class we had a look at missing values. What happens when we have missing data? Can we find relationships between missing values? Does some data go missing when other data is also going missing? Is there a relationship in the missing values themselves? In the next class we'll have a look at formatting our data and also removing duplicates, because sometimes it's very important to not have duplicate entries in our data, so we can actually see each data point for itself.

14. Scaling and Binning Numerical Data: In this class, we'll first have a look at scaling the data. That is really important because sometimes some of our features are in the hundreds and other features are in the tens or even at decimal points, and comparing those features can be really hard, especially when we build machine learning models. Certain machine learning models are very susceptible to scaling factors, so bringing the features onto the same numeric scale can be beneficial for building a better machine learning model. I'll introduce each scaling method as we apply it, so we can learn it in an applied way. The second part of this class is going to be binning data, so that means assigning classes to data based on numeric values. In this example, we'll use the house value and assign low, medium, high, and luxury, just to be able to make an example of how we can assign classes based on numbers. You'll see this can be done with different methods that give different results. As per usual, we're importing pandas as pd and getting our housing data into the df DataFrame. We make a little bit of space so we can actually scale our data, and have a look. We'll start with a very simple method where we scale our data between the minimum and the maximum of the entire data range. The modified x is going to be x minus the minimum of x, divided by the range, so the maximum of x minus the minimum of x, and that'll give us a value between zero and one for the entire column. We'll choose the median house value for this one, so df.median_house_value is our x, and we'll have to copy this a few times, so I'm just going to be lazy about it: x minus the minimum of x, divided by the maximum of x minus the minimum of x. We have to use parentheses here to make this work, because otherwise we would just divide the middle part; you can see it right here. We put our scaled version in the new column that we'll name median_house_value_minmax. Right here, we can clearly spot that I made a mistake by not adding parentheses to the top part. When I add parentheses here, we can see that the data actually scales between zero and one. Now, we can do some actual binning on the data. There are several options available to do binning as well. We'll use the first one, which is the pd.cut method, where you provide the bin values yourself; those are discrete intervals where we bin our data based on thresholds that we set. We're using the minmax column that we just created because that makes our life a little bit easier: we can just define the bins between zero and one. We'll have quarters, so quartiles, and that means we have to provide five values from zero to one in 0.25 increments. When we execute this, we can see that the intervals are assigned. If we don't necessarily want those intervals, we can provide names for them instead. In the case of these values, we can say that the first one is quite cheap, then we have a medium value for the houses, a high value for the houses, and then we are in the luxury segment.
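A minimal sketch of the scaling and binning just described; df is the housing DataFrame and the column names follow the housing data:

    # df is the housing DataFrame loaded earlier; pd is pandas.
    x = df["median_house_value"]
    df["median_house_value_minmax"] = (x - x.min()) / (x.max() - x.min())

    df["price_range"] = pd.cut(
        df["median_house_value_minmax"],
        bins=[0, 0.25, 0.5, 0.75, 1.0],
        labels=["cheap", "medium", "high", "luxury"],
        include_lowest=True,   # so the minimum value falls into the first bin
    )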
Of course, you can define these classes however you want; this is just an example for you to take. We make this a little bit more readable and add the comma there, otherwise you'll get an error. Now, with the labels, we can see that each data point is assigned to a category. Let's actually assign those to a price range column in this case, and indent that correctly. We can see that we now have a new column with new classes that we would be able to predict with a machine learning model later. The second method we'll look at is the qcut method. This is a quantile cut, so we can define how many bins we want and the data will be assigned in equal measure to each bin. We'll use the data from before, so the min-max scaled house value. Now, in the case of qcut it doesn't matter which one we take, because the scaling is linear in this case, so that's fine. But to compare, we can see that the top bin now covers a different range than the 0.75 to one from before. We can assign the labels to make it absolutely comparable, and we can see right here in rows 0, 1, 2, 3, 4 that this is now luxury instead of high as before, so this makes a big difference and you have to be aware of how the quantiles work. They are really useful, but it's something to be aware of. Let's assign that to a price range quantile column and indent it properly. We have a new column that we can work with. Instead of doing this by hand, we can use the machine learning library scikit-learn for the preprocessing, because as you saw, sometimes you make mistakes, like forgetting parentheses, and if it's already in a library, using it will avoid these silly mistakes that can have very severe consequences if you don't catch them. From sklearn, which is short for scikit-learn, we'll import preprocessing, and we'll use the MinMaxScaler so we can compare it to the min-max scaling that we did by hand. We use fit_transform on our data. The fit_transform first estimates the scaling values and then transforms the data accordingly. Now, right here, I'm used to reading these mistakes, but mistakes aren't bad; you quickly find out what happened and you can Google for them. In this case I provided a Series and scikit-learn was expecting a DataFrame instead. Let's have a look and compare our data: some values are equal, some are not, and this seems to be a floating-point error. Let's have an actual look at it. The first value is False, so we can just slice into our array and have a look at what the first values are. Right here we can see that the scikit-learn method provides fewer digits after the comma. Now, this isn't bad, because our numerical precision isn't that high to be honest, so we can use the NumPy method np.allclose to compare our data to the other data. That means our values will be evaluated within numerical precision, whether they match or not. Just copy that over and we can see: yes, in fact they match. Within numerical precision, they are in fact equal. Instead of the MinMaxScaler, we can have a look and there are a ton of preprocessing methods available, like MaxAbsScaler, Normalizer, and QuantileTransformer, but one that is very good and that I use quite often is the StandardScaler. Using that will show you that it is in fact used exactly the same way: just fit_transform and you get your data out. Instead of the StandardScaler, if you have a lot of outliers in your data, you can use the RobustScaler. In this class, we had a look at different ways to scale our data and how to assign classes to our data based on the data.
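For reference, a minimal sketch of the scikit-learn route described above; column names follow the housing data and df is the DataFrame from before:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # scikit-learn expects a 2D input, hence the double brackets.
    scaled = MinMaxScaler().fit_transform(df[["median_house_value"]])
    np.allclose(scaled[:, 0], df["median_house_value_minmax"])   # True: matches the by-hand version

    df["median_house_value_std"] = StandardScaler().fit_transform(
        df[["median_house_value"]]
    )[:, 0]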
We really did a deep dive on how to prepare data for machine learning in the end, and you'll see how we do that in a later class. In the next class, we'll dive into some advanced topics. We'll have a look at how to build schemas for our data so we can actually check if our data is within certain ranges or adheres to certain criteria that we said that the data has to have. If we automate our data science workflow in the end, this is really important because right at the beginning we can say that our data is okay or that our data has changed to what it is before, and that there is data control, quality control issue. 15. Validating Data with Schemas: In this class we'll be looking at schemas. That means when we load our data, we can see if each column that we define fits a certain predefined class or some predefined criteria that we think this feature has to have. We'll be exploring different ways to do this and what to think about when doing this, so we can automate our data science workflow from the beginning to the end. In addition to the usual import of pandas we'll import pandera. This is obviously a play on pandas, and it is the library that we'll use in this example to create schemas and validate our DataFrame. There are other libraries like rate expectations that you can check out, but in this case, pandera will do. First we need to create the schema. The schema is basically our rule set, how our DataFrame is supposed to look like. In this case, we'll use an easy example with "ocean_proximity" and we'll make it fail first. We say that the column is supposed to be integers. So we get a schema error and we can see right here that it tells us all the way in the end that it was expecting an int64, not what it got. If we replace this by string, we can see that now it validates and everything is fine. In addition to the type, we can also provide criteria that we want to check. So we type in "pa.Check", and since we want to check that "ocean_proximity" only has a couple of values, we copy these values over and say it's supposed to be within this list. If we validate the schema, we see everything is fine. Let's make it fail. Delete the "NEAR BAY", and we see that there's a schema error because this could not be validated. Let's run that back, make them work again. Text isn't the only thing that needs to be validated. We can also have a look at other numeric values. If we wanted to check for the latitude to be in a certain area or the longitude to be in a certain area, that totally makes sense. You can check if it's within certain boundaries. Let's have a look at total rooms and check that it is an integer. Right now it is not but we can of course, make the data load as integer and then validate the data, so our loading is always as an integer. What we'll do is we'll define the column and say it has to be an integer. Now, in this case, obviously we get a schema error because it's a float right now, so we have to do a type conversion or we have to reload the data with an integer. We'll get the "housing.csv", and we'll define the datatype for total rooms to be int. The problem here is that there are int32s and int64s. How many bits are in an integer? And these have to be the same. When we look at the error of our schema, we can see that it is expecting an int64. We'll import NumPy and define our loading as int64 right here. Our schema once again validates because we have now matched the type. 
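A minimal pandera sketch along the lines of this lesson; the exact check API can differ a bit between pandera versions, and the value list, ranges, and the lambda check (which comes up next) are assumptions based on the housing data used here.

```python
import numpy as np
import pandas as pd
import pandera as pa

df = pd.read_csv("housing.csv")
# Cast to int64 so the dtype matches the schema (the video reloads the CSV with an int64 dtype instead)
df["total_rooms"] = df["total_rooms"].astype(np.int64)

schema = pa.DataFrameSchema({
    "ocean_proximity": pa.Column(str, pa.Check.isin(
        ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"])),
    "latitude": pa.Column(float, pa.Check.in_range(32.0, 42.0)),
    "total_rooms": pa.Column(np.int64, pa.Check(lambda s: s.notna())),  # arbitrary lambda check
})

validated = schema.validate(df)  # raises a SchemaError if any rule is violated
print(validated.shape)
```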
If we do int64 loading in the beginning, we can match this up with a int64 that we expect in our schema. It's just things to be aware of when you are loading it. Another way to validate our data is using a lambda function, an anonymous function that can do arbitrary checks and return true or false values. In this case, we'll start out with housing median age, do our column and add the check. Now I'm making a mistake here, unfortunately, but you'll see in a second. So "pa.Check", we'll add lambda n is our variable, and we check if n is none, or is not none. We get a type error right here. This is important to note. It is not as schema error and that's because I forgot to add a type check right here. We'll check for "Float" and now everything validates again because none of the values in housing median age are none. We can make it fail by removing the none, and that will break our schema. We can do a lot of other tests, arbitrary function tests in here, like if our squared n is over zero, which it should if math is still working. There are several reasons why you want to do schema validation on DataFrames or on tables. And it is quite common to do those already in databases and it's a good practice to do this in DataFrames. It can be that you just get faulty data or that the data changes in some way. A very simple example right here is percentages. In geophysics, sometimes you have to calculate porosity, for example, of rocks, which can be given as a percentage between zero and one as a decimal, or it can be given as a percentage between zero and 100. Both are completely fine, but you have to take one to have your correct calculations afterwards. Let's create a DataFrame here with mixed percentages where you see that it will throw an error if you validate this data. Save this DataFrame in "df_simple" and we'll create a schema for this making all the data floats between zero and one. So create a DataFrame schema and add percentages for the column. Really, why we're doing this example is for you to see other data than just the housing data that we can do this on physical data as well, and to make you think about your data, how you can validate that your data is in fact correct. We'll have a check right here and we can check that this is less than or equal to one. Once again, we have to validate our DataFrame on the schema and see that it will fail. The nice thing is that our failure cases are clearly outlined right here. We could go in manually and correct the data, or we can correct all the data that we know is wrong in our percentages or drop and get our schema validated with the correct input data. We'll get all the data that is over one and just divide everything by a 100. We have only decimal percentages, and now everything validates easily. In this class, we had a look at different schemas and how we can validate our data already from the beginning. We had a look with a simple example of percentages, why this is so important to do. In the next class we'll have another advanced strategy, which is encoding. A topic that is quite important for machine learning, but also can be applied in a few different ways. 16. Encoding Categorical Data : In this class we'll have a look at encoding our data. If we have a categorical variable like our ocean proximity, our machine learning process often can't really deal with that because it needs numbers, and we'll have a look at how we can supply these numbers in different kinds of ways. 
In addition to that, once we've done that, we can also use these numbers in different ways to segment our data. We'll start with the usual pandas, and then we'll have a look at the ocean proximity because these are strings and our strings are categorical data. Machine learning systems sometimes have problems with parsing strings, so you want to convert them to some kind of number representation. Pandas itself has something called one-hot encoding, and this is a dummy encoding. Essentially each value in the categories gets its own column with true or false. Each value that was near bay now has one in the near bay column and zero in everything else. Let's merge this data to the original DataFrame so we can compare this to other types of encodings and see how we can play around with it. We'll join this into the DataFrame, and we can see right here, near bay is one for near bay, inland is one for inland, and zero everywhere else. Alternatively, we can use the pre-processing package from Scikit-learn. Scikit-learn gives us encoder objects that we can use. We'll assign this one-hot encoder object to enc, and we'll fit this to our data. The nice part about these objects is that they have a couple of methods that are really useful that we'll now be able to explore. Let's fit this to the unique data that we have under our ocean proximity, and then see how this encoder actually deals with our data. After fitting our encoder to our unique values, we can transform our data. If we spell it right. Converting this to an array gives us the one-hot encoding for our unique values, only a one in each column and each row. Now transforming actual data, so not just the unique values, should give us something very similar to what we saved in the DataFrame further up. Convert this to an array. We have values in the fourth column, and right here you can see near bay, same. Now, you may wonder why we're doing this redundant work. But with this encoder object, like I mentioned, we have some really nice things that we can do. Add a couple of lines and we can use the array that we have from before. I'm going to use NumPy because I'm just more used to dealing with NumPy objects. We can convert this array back now, which is not as easy with other methods, but because we have this nice object that has all these methods available, we can use the inverse transform, provide the array to this inverse transform and get back the actual classes because the object remembers the classes that it was fit on. We can also get all the data that is stored within the object without actually providing values to it. Really just a neat way to deal with preprocessing. Obviously, sometimes we want something different than one-hot encoding. One-hot encoding can be a bit cumbersome to work with. We'll have a look at the preprocessing package, and we can see that there is label binarizes, label encoders. But right now we'll just have a look at the ordinal encoder. The ordinal encoder will assign a number instead of the category, and that basically just means that it's -from 0-1, 2, 3, 4 depending on the number of classes. You have to be careful with this. Like in a linear model for example, the numbers matter. So four would be higher than zero, four would be higher than three, so encoding it as an ordinal would be a bad idea in a linear model. But right now, for this, it's good enough. If we use a different kind of model later, then we are completely justified in using an ordinal encoder. This marked our last class in the data cleaning section. 
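As a compact reference for this lesson, the three encodings sketched with pandas and scikit-learn, again assuming the housing.csv data used throughout.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.read_csv("housing.csv")

# Dummy / one-hot encoding with pandas: one 0/1 column per category
df = df.join(pd.get_dummies(df["ocean_proximity"]))

# The same with scikit-learn's encoder object, which remembers its classes
enc = OneHotEncoder()
onehot = enc.fit_transform(df[["ocean_proximity"]])     # sparse matrix
print(onehot.toarray()[:3])
print(enc.inverse_transform(onehot.toarray()[:3]))      # back to the original labels
print(enc.categories_)                                  # stored categories

# Ordinal encoding: one integer per category (be careful in linear models,
# since the integers imply an order and a distance that may not exist)
ordinal = OrdinalEncoder().fit_transform(df[["ocean_proximity"]])
print(ordinal[:3])
```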
We had a look at how we can encode information in different ways so we can use it in machine learning models, but also save it in our DataFrame as additional information. In the next class, we'll have a look at exploratory data analysis, doing that deep dive into our data. 17. | Exploratory Data Analysis |: In this class, we'll have a look at automatically generated reports. Oftentimes that can be enough. You want an overview over your data and the most common insights into your data. We'll generate this report and it'll be reproducible for you on any kind of dataset that you have. This tool is very powerful. Afterwards, we'll have a look how to generate these insights ourselves as well. Because sometimes you want to know more than this report just gives you, and also, if it was only about running this utility, data science wouldn't be paid that well, to be honest. This is a good first step. Getting this overview over your data is really important. But then we need to dive deeper into our data and really dig out the small features that we have to find. We'll import pandas and then get our data frame and the DF variable as we always do. Then we'll import profile report from the pandas profiling Library. I'm pretty sure you will be stunned how hands-off this process actually is of generating this report. If you take anything away from this, I think this is it. This utility really takes away from lot of things that we usually did manually in pandas. I'll show you how to do those anyways, because it's really good to understand what you're actually doing in the background. But this tool is amazing. You automatically generate all the statistics on your data. You see that it counts your variables and gives you the overview, how many are numeric and how many are categorical. Notice that we did not supply any category features or datatype changes. We can get information how our data is distributed. However, it's a bit hard to see in our notebook. That is why we are going to use a notebook specific version, which is profile dot two widgets. Here we have a very nice overview widget with the same information as the profile report from before. We can see right here that it even tells us the size and memory and tells us when the analysis was started and finished. How you can recreate this analysis. It tells you all the warnings like high correlations. Now between latitude and longitude, that's fine. Missing values. Then on variables, you can have a look at the distributions of your data. You can toggle the results and have a look at the histogram. The histogram is also small up there, but it's really nice to have a large look at it as well. You can flip through all your variables, see that it has missing values. On the left, you have warnings about it, and yeah, really get all the information that you need to get an insight into your data. See if there are any common values that show up all the time. Now, this with 55 values really isn't that common? See the minimum and maximum values that you have. Get a feel for the range. When we have a look at our income, which is more of a distribution, we can see the distribution there as well, and on our categorical feature, the ocean proximity, we can see something very important. Island only has five entries. We have an imbalanced dataset here that there are not many homes on the island. Then we'll click over and have a look at the interactions, so see how one variable changes for the other. 
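A short sketch of how this report is generated. The library was imported as pandas_profiling when this class was recorded; newer releases ship as ydata-profiling, the title is just a placeholder, and the to_file export comes up again at the end of this lesson.

```python
import pandas as pd
from pandas_profiling import ProfileReport  # newer versions: from ydata_profiling import ProfileReport

df = pd.read_csv("housing.csv")
profile = ProfileReport(df, title="Housing EDA")

profile.to_widgets()                     # interactive widget view inside Jupyter
profile.to_file("housing_report.html")   # standalone HTML report to share with colleagues
```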
If we have a look at longitude against latitude, that's negatively correlated; longitude against longitude, so the same variable, is always positively correlated. Now, if we have a look at the median house value against everything else, we can really see how these interact, how they change against each other. Total bedrooms against households, for example, is positively correlated, something good to know. This is just a powerful tool to really see each variable against another. Then we'll click over to the correlations. The standard linear correlation measure between one and minus one is the Pearson correlation. Here we can see what we saw before: a variable with itself, longitude against longitude, will always be one, and all the other values should be somewhere between one and minus one. That way you can really see the relationships between data. Spearman is a rank correlation and a bit more non-linear, and usually people prefer Kendall's over Spearman's. Then there's Phik; the classic phi coefficient is a measure of association between two binary variables. You can use the toggle on the top right to read more about these. Have a look at missing values, and this may remind you of something that we did earlier. I'm not the only one that thinks the missingno library is awesome, obviously, because this tab gives very similar insights. Finally, we can also have a look at a sample of our data. We can take our profile report and we can also generate an explorative profile report. This one is more interesting when you have different data types, if you also have text, or files, or images in your DataFrame, in your data analysis; it's really not that applicable right here. In general, however, you can see that this report already goes over a lot of the things that you want to know in your exploratory data analysis. Generally, you want to know the statistics of your data, the correlations of your data, missing values in your data, and really see how the data interacts with each other and what data can predict each other. It's fine if this is the only thing that you take away from this course, but really, let's dive into how we can generate these insights ourselves in the next classes. I'll quickly show you how to get this into a file. You have profile.to_file, then give it a name, and you get this beautiful website where you can click around, and you can share it with colleagues so they can have a look at your analysis. It will say that it is a pandas-profiling report, and that's fine. Don't just use this; use it as a starting point to make a deeper analysis and to really inspect your data. But it takes a lot of work away from our everyday data science work. 18. Visual Data Exploration: For EDA, I like to first look at plots. We'll have a look at visualizations that give us an intuitive understanding of relationships in the data, relationships between features, correlations, and also the distributions of each feature. We'll be using seaborn, which makes all of this extremely easy, usually with one or two lines of code. First, we're importing pandas as usual and loading our data. In addition, we'll load seaborn, the plotting library. Seaborn is commonly abbreviated as sns, and the first plot for our data visualization is going to be a pair plot. Now, a pair plot will plot every column against every column, even against itself. When you plot the total rooms against itself, you will get the distribution of the total rooms, and if you plot it against any other column, you will get a scatter plot.
This scatter plot as well as the distribution can be very informative. One of my favorite plots to do for visualization. Right here we can see that for example, our latitude and longitude data apparently has two spikes, so it seems like our geolocation data is focused around two spots. We can see that there are some very strong correlations in the middle of our plot, that is because we have some linear scattering right here. Every other feature that we see right here is distributed in certain ways like this one is scattered all over the place and we can see some clipping at the edges. Probably someone took like a maximum of some data. In addition to the pair plot, we can create a pair plot that is colored by a class. Right now the only class we have available is the ocean proximity and your exploration for the project, it would be really great if you experiment with this may be combined this with the binning exercise that we did. It takes a bit for this to load, that's why I only sampled 1,000 samples right now because we want to get the plot relatively quick. However, this gives a really good overview how different classes are distributed against each other. The legend on the right gives us which color is which and I want to drop the latitude and the longitude right now because those features are strongly correlated with each other and right now they only take up space in our plots so we can really make more use of our plot by getting rid of these. Now, in the drop, I have to add the axis because we want to drop this from the column and then our plot should be able to plot with a few less plots on the grid so each plot is a little bit larger. That gives us a lot of information already. We can see that our data is relatively equally scattered except for the island data. The island data seems to have a very sharp peak. However, remember that our island data has very few samples so that really skews the results a lot. However, maybe we want to just plot the distribution of our data. For this we can use the KDE plot which is short for the kernel density estimate. We'll have a look at how our median house values are distributed. In addition to this plot, we can also once again split this up by hue. Unfortunately there's no nice in-built way to do this like for the pair plots so we'll iterate over the unique values in our ocean proximity. This is a bit of a work around but I really like this plot so I'll show you how to do this anyways. In my teaching usually this question comes up anyways. I hope this plot works out for you as well. We'll subset our data, use the ocean proximity that is equal to the class which is our iterator over the unique values. That means we get our plot split up by our class. However, right now the legend doesn't look particularly nice. Each legend just says median house value and ideally we want the legend of course to say the class, so we can provide a label right here that contains our class name. That way we have a nice little plot that has all our distributions where we can see that inland has a very different distribution than most of the others and of course, the island is skewed to the right which indicates a higher price but once again, not a lot of data there so it's a bit of a skewed result. Now, maybe we want to have a look at more of the scatter plots. Making a scatter plot is very easy but we can even go a step further. There's something called a joint plot where we have the scatter plots and the undersides, we can plot the distributions of the data. 
Usually those are histograms, but you can use different plots there as well. These are extremely nice to point out how data co-varies. In the case of, for example, total bedrooms and population, we see a very clear distribution that indicates basically a linear trend, so some kind of linear correlation between the two. This plot is very easy: you just give the features, the column names and the data frame, and seaborn plays very well with pandas. Right here you can also see the distributions, and the labels are automatically applied. This plot has a couple of different options. You already saw that there's a hex option, but we can also do a linear regression, so fit a trend line with uncertainty to our data, so we can really see whether a linear model fits our data or something else would be better. Here we can see that outliers skew the results at least a little bit, and in addition, we can have a look at a different feature just to see how our linear regression changes. Households seems to be very strongly correlated to total bedrooms, and this is as linear as real data actually gets. If we now copy this over, replace population with households and fit the line, we can see that the shaded area behind the line is basically not visible, so there is essentially no uncertainty on this data. A really nice way to see how our linear regression fits the data. Instead of the pair plot, we can also do a heat map of the correlation. That just gives us the number representation of our Pearson correlation coefficient, and we can see that the diagonal is one, as it's supposed to be. Our latitude and longitude are negatively correlated, which makes sense for the geography of California, and in the middle we have a square of strong correlations that we should definitely investigate; that is very interesting and generally just a good way to inspect your data as well. We can copy this over and just play around a little bit with it, just to show you that nothing is baked in here; you can really play around with it. It's an open playing field to really explore your visualizations. Setting the magnitude from zero to one now shows us that median income is quite strongly correlated with the median house value. I didn't really see that before, so just checking this out and switching it around a little bit can give you more insights. Going beyond the standard visualizations can be extremely valuable. We can add annotations to this. Now, this is a bit of a mess, so we'll round our numbers to the first decimal and see that this looks much nicer. You can do this with the original data as well. This class gave an overview of different plots that you can use to understand your data better. In the next class, we'll actually look at the numbers underlying these plots and how to extract specific numbers that will tell you more about your data. 19. Descriptive Statistics: In this class, we'll follow up on the visualization that we just did. We'll have a look at the numbers behind the graphs. Statistics can be a bit of a scary word, but really it's just significant numbers or key performance indicators of your data that tell you about the data. The mean, for example, is just the average of all your data, whereas the median is the middle value when you sort your data. The standard deviation, or std, just describes how much your data varies: how likely is it that you find data away from the mean?
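Before we get to the numbers, here is the visualization lesson condensed into one hedged sketch; the plot arguments and the sampling are illustrative choices, not the exact notebook code.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("housing.csv")
clean = df.dropna()
sample = clean.sample(1000, random_state=42)  # a sample keeps the pair plot fast

# Pair plot coloured by class, without the dominant geo features
sns.pairplot(sample.drop(["latitude", "longitude"], axis=1), hue="ocean_proximity")
plt.show()

# KDE plot of the house value, split up by ocean proximity
for cls in clean["ocean_proximity"].unique():
    subset = clean[clean["ocean_proximity"] == cls]
    sns.kdeplot(subset["median_house_value"], label=cls)
plt.legend()
plt.show()

# Joint plot with a linear regression fit and marginal distributions
sns.jointplot(x="households", y="total_bedrooms", data=clean, kind="reg")
plt.show()

# Annotated correlation heat map on the numeric columns, fixed to [-1, 1]
corr = clean.drop("ocean_proximity", axis=1).corr()
sns.heatmap(corr.round(1), annot=True, vmin=-1, vmax=1)
plt.show()
```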
We'll explore all of this in this class and really do a deep dive into descriptive statistics and how to get them from your data. In the beginning, we'll import our data, and then we can actually just calculate statistics on a single column by providing that column, so df.housing_median_age. Then we have the mean, median and standard deviation available as methods to calculate directly on the data. The mean is the average in this case, and the median is the middle value of the sorted data. If we want to get aggregate statistics on the whole DataFrame, we just call .describe on the DataFrame or a subset of it. This gives us the count, the mean, the standard deviation and the quartiles of our data. When you play around with this, make sure to check out the docstring for describe; you can do a lot more with it. Then we can group our data. A group-by action has to be done on something that can be grouped by, so we'll use ocean_proximity in this case, and we can calculate the mean for these groups over each column. This doesn't make much sense for longitude, but for all the other values we therefore get group statistics. In addition, we can use the .agg method, for aggregate. In there, we can basically define a dictionary with all the statistics that we want to calculate on a given column. For longitude, for example, we'll have a look at min, max and mean, and we can copy this over to use it for other features as well. You're not limited to these, and you can even supply functions to this aggregator. The aggregations don't all have to be the same either: for total rooms, you can change this to the median instead of the mean, because, well, that makes a bit more sense. For our median income, we'll just try and get the skew of the distribution. Here, we can see that the new DataFrame that comes out of this is filled with NaN where no values are available, where they don't really make sense, but we can really dive into stats here. Another neat little tool just to get an overview of a column is the value counts method. On ocean_proximity, for example, we can call the value_counts method to get an overview of how many samples are in each of these classes; very good to get a feel for how our data is distributed among classes. For the heat maps that were generated before, we needed to calculate the correlation of each column against each other column. We can see right here that we have this data readily available. The corr method also gives us the opportunity to change the correlation measure that we use. You can change it to Spearman, for example; really very similar to what we had in the automatically generated report. Here you can dive into the data and really see how our data correlates, by the numbers. In this class, we had a look at descriptive statistics, at the actual numbers, average values, and how we can extract these specific numbers and make decisions based on them. In the next class, we'll have a look at subsets of the data. How do we select parts of the data, and how can we calculate these numbers on those parts? Because sometimes, as we saw here, island only has five samples in our entire dataset. So how can we make sure that we extract that data out of our DataFrame and explore it further? 20. Dividing Data into Subsets: In this class, we will be learning how to extract subsets from our dataset, because sometimes, for example, we only want to focus on one certain location, or we want to focus on one subset of customers.
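Before moving on to subsets, a compact sketch of the descriptive statistics from this lesson; the aggregation dictionary is just an example.

```python
import pandas as pd

df = pd.read_csv("housing.csv")

# Statistics on a single column
print(df["housing_median_age"].mean(), df["housing_median_age"].median(), df["housing_median_age"].std())

# Aggregate statistics for the whole DataFrame
print(df.describe())

# Group statistics per ocean proximity
print(df.groupby("ocean_proximity")["median_house_value"].mean())

# Different aggregations per column via a dictionary
print(df.groupby("ocean_proximity").agg({
    "longitude": ["min", "max", "mean"],
    "total_rooms": "median",
    "median_income": "skew",
}))

# Samples per class and the (rank) correlation matrix of the numeric columns
print(df["ocean_proximity"].value_counts())
print(df.drop("ocean_proximity", axis=1).corr(method="spearman"))
```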
Those segments are really easy to extract using pandas and I will show you how to do this. First we load our data, and then we'll take our df DataFrame and have a look at the longitude. Because we can take our df DataFrame and just perform normal logic on it. In this case we want it to be lower than minus 1-2 and we get a series out of it with true and false values so at boolean series, we can use this to choose rows in our original DataFrame and we can see right here that this is only a view so we have to assign it to a new variable. Let's have another look and another way to select subsets. In this case we want to have a look at the ocean proximity because selecting subsets of our categories, is really important for something we'll do later which pertains to the AI fairness and ethical AI. We can choose here that only "near bay" and "inland" should be in there. We get, once again a Boolean series out of this that we can use to slice into our DataFrame or well, get a subset of our DataFrame. You can see this right here, and we can see that it has less rows than before. We can also combine different logics to be arbitrarily complex and what we have to do right here is use the AND operator, ut in this case, it has to be the ampersand. The ampersand is a special operator in Python to do bitwise comparisons. You can see right here that AND will fail because the bitwise operator is just to really short-hand to compare the booleans. You have to be careful that you use parentheses in conjunction with a bitwise operator. Here, we'll just play around a little bit with true and false so you can see how these are combined when we use AND. We can use the same with an OR operator, but of course we have to take the bitwise operator, which is this the pipe symbol. I don't know if you have a different name for it may be, but it is onscreen, you have it in your notebook. Here we get the choice of things that are the choice of ocean proximity that is "near bay" "inland" or df longitude is under minus 122. When we have look at the unique values in our subset, off ocean proximity, we can see that there are values that are not in "near bay" and "inland" because they were in the longitude under minus 122. We can also use the dot loc method. This is a way to select subsets of our DataFrame by using the names of the indices so the index on the columns and the index on the rows. We can copy those right over and I'll show you right here where the difference is to the method before, because this will fail, because it expects us to give slices for all indexes. A DataFrame has two dimensions, the columns and the rows and right now we only gave it the columns. The column right here it is used to just select everything on the row sector and we can of course use this to slice into the rows as well by using the numbers of the index. Right here, we can see that this selected from the index name 5-500 and keep in mind that our index can be anything. We'll have a look at that in a second. Here we can see that this did not change our DataFrame at all. This is just a method to return a view. Of course we can also save this in a variable as always so the dot loc method just works in a different way than our way before. Now let's have a look at indexing, because up there we can see that our index is just a running integer from 0 to whatever the maximum number is, 20,640 in this case. However, we can use the dot set index method to change our index to any other row. 
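Before we get to set_index, here is a sketch of the boolean masking and .loc selection described so far; the thresholds and column choices are just the examples from this lesson.

```python
import pandas as pd

df = pd.read_csv("housing.csv")

# Boolean series from simple comparisons
west = df["longitude"] < -122
bay_or_inland = df["ocean_proximity"].isin(["NEAR BAY", "INLAND"])

# Combine them with the bitwise operators & and | (the parentheses matter)
both = df[west & bay_or_inland]
either = df[west | bay_or_inland]
print(len(df), len(both), len(either))
print(either["ocean_proximity"].unique())

# .loc selects by labels: rows by index label, columns by name
view = df.loc[5:500, ["longitude", "ocean_proximity"]]
print(view.head())
```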
This is really powerful in that we can assign any indexing, even text and select on that text, or in this case, the latitude instead of just a number. You can still use numbers and I'll show you afterwards how to do that but this is a way to change thinking about your DataFrame because right now our rows are indexed by the latitude. We can't do what we did before with the number because our index right now is not the integer anymore. Our index now is the latitude so if we choose the num, well, any number from our latitude, this will work again. Here have a look at the index, just copy a number out of here, like here 37.85 and we can then use this to select a subset using dot loc. Just use all the columns and we can see right here that this just shows everything from our index right here. You can see that indexes in pandas do not have to be unique as well, something really important to think about when you work with them. Now, slicing into our DataFrame like that is extremely powerful because our index will just return the data at that index and whatever sorting we have. We don't really have to be aware how our data is structured. Nevertheless, we can use the iloc method, which is basically index location, where we can still go into our DataFrame and select row 5-500, well, 499 because it's exclusive. We can also use this on the columns. If we think we know exactly where everything is, we can use this kind of slicing as well and to just use the number slicing to get our subsets. I usually recommend using dot loc because with dot loc, you can always be sure regardless of sorting, that you will get the things that you want back with the exact index that it is, and you don't have to make sure that you're sorting of your DataFrame is the correct way. Right here we can see that latitude is now not part of our columns anymore because we have assigned it to be our index. Now if we want to get latitude back into our columns, we can do that as well by resetting the index and then our index will be back to just running integers. This also works when you re-sorted it so you can reset the index back to going from 0-500, or your maximum number when you changed around your column ordering. This is really important to think about when you're doing index slicing that you can always change, the sorting of your data. But when you do dot loc, you will be able to retrieve exactly what's on the index. On the topic of selecting columns of course we can do the standard way of just providing the columns we want but sometimes your DataFrame gets very long like if you would think back to the missing numbers. Example we had, I think over 20 columns so selecting all that you want can be really cumbersome to be honest. We can also go the other way and select which columns we do not want and that is with the drop method. We provide the names of the columns that should be dropped from the DataFrame. Right here we'll just take the inverse of longitude and population, provide the axis that we want to drop it from because we can also drop columns. Yeah, right here you can see how you can change around a lot of the things as well, you can do in place dropping as well if you want to change the DataFrame directly in memory. Right here, you can see that we can drop it. Well, do the exact opposite from what we did before by dropping rows. 
Overall we're doing this because if you select subsets of your data, you can do analysis on the subsets so if we just use the describe method from our descriptive statistics, we can see right here, for example, the standard deviation and the mean of all the columns. We can of course also call the described method on a subset of our data and see how our descriptive statistics change. You can then start plotting on these subsets and do your entire data an analysis on the subsets. This class really deep into how we can select subsets off our data and really decide what to take based on features but also on indexes. We had a look how to switch indices and how to reset it again, because that is really important when you want to do your exploratory data analysis and have a closer look at some subsets of your data. In the next class, we will be looking at how we can generate those relationships in our data and really focus in on what to extract. 21. Finding and Understanding Relations in the Data: In this class, we'll have a look at the relationships within our data. We'll really check out how correlation works within our data, but go beyond this as well. Go beyond linear correlations and really dive deep into dissecting our data. We'll start out again by importing pandas and loading the data frame. We can see that.corr is really central to doing correlation analysis in pandas. We can use corr and change around the correlation coefficient that we actually want to use. Now, the standard Pearson's correlation is a linear correlation. Spearman and Kendall use a rank correlation which can be non linear. In addition to calculating these aggregate correlations, maybe sometimes you just want to find out how one column is correlated with another. Here we can simply provide the column and calculate the correlation on another column, right here. We can even take this one further. Machine learning tools have been really easy to use in the past 10 years, and we can use this machine learning tool to basically predict one feature based on the other features and if we do that with every feature, we can actually see how informative one feature is based on the other. This has been built into a neat little tool that we can use here called discover feature relationships or beyond correlations. It has recently changed name. You'll be able to find it on GitHub as well. This means we can use the discover method of this library to really dive into the relationships in our data. We use the discover method on our data frame and we can supply a method or a classifier but in this case we'll just leave it on standard, you can play around with this later if you're interested in it. It takes a few seconds to execute this, but then, well, I just use the sample from our data frame to make it a little bit faster. You can let it run on larger samples and we get how one feature predicts another feature right here, and we get that for every feature around. We can use the pivot tables that you may know from XO to get out a full grown table that will give you all the information you need right here, very similar to the correlation. However, the central one is not filled out. We'll just fill that with ones because you can predict a feature easily on itself, of course. Then we'll go on to plot this because looking at this as a plot is always quite nice, just like we can look at the heat map from the correlations. This is very similar to the correlations except that we use machine learning to cross predict this time. 
We'll save this into a variable and then make a nice plot out of it. We can see that, as opposed to the correlation plot, this is not fixed between minus one and one, so we'll fix that real quick, and then you can really see how each feature can be extracted from the other features. We do this fixing from minus one to one by using vmin and vmax, and there we see it. For example, analyzing how our population can be predicted by anything else is a really good way to see relationships within the data, where you can dig in further into why something is predictive or not; really a nice tool for data science. This was the last class in our chapter on exploratory data analysis, where we looked at how we can extract information about correlations and relationships in our data. In the next class, we'll actually look at how we build machine learning models, something that we already used implicitly here; we'll now learn how to actually apply it. 22. | Machine Learning |: In this chapter of the data science process, we'll have a look at machine learning. Specifically, we want to model our data and find relationships in the data automatically. Machine learning models are so-called black-box models. That means they don't have any knowledge of your data, but when you show them the data and what you want to get out of the data, they will learn a relationship: how to categorize, or how to find the right numbers to do a regression with your data. Machine learning is really powerful and super easy to apply these days, which is why we'll spend a lot of time on validating our models as well. These models tend to learn exactly what you tell them to learn, which might not be what you want them to learn, and validation is the due diligence you do to make sure that they actually learned what you want them to learn. Let's fire up our notebooks and have a look at machine learning. 23. Linear Regression for Price Prediction: Welcome to the first class in the chapter on machine learning. We'll have a look at how to build simple models, because in machine learning the rule often is: the simpler the model, the better. Simple models are easier to interpret and read, and they are often very robust to noise. Let's dive into it. After loading our data, we can import the linear regression model, because we want to predict house values in this exercise. However, before that we have to prepare our data in a certain way: we need to split our data into two parts. We want one training part and one set of data that the model has never seen during training time, so we can validate that our model learned something meaningful. This is to avoid an effect that is called overfitting, when our model basically remembers the training data and does not learn meaningful relationships in the data that it can then apply to new data it has never seen. So we take our DataFrame and split it into two parts randomly. We could of course do this with the subsetting that we did in the previous section; however, taking random samples that are guaranteed not to overlap in any way is a much better approach. The train test split function that scikit-learn supplies is really good for this, and it has some really neat other features that we can use. This is also a really nice way to select our features. For the simple model, we'll just use housing median age and total rooms as our training features. The house value is going to be our target. Those are usually saved in X and y.
So we know we have x train and x test, and then we have y train and y test. This is quite common. We'll have a look at the shapes. We have a bit over 20,000 rows here. Our train data is going to be about 75 percent of that with 15,000 values and our y train should have the same amount of targets because those are sampled randomly but in the same row so the data obviously matches. Our x tests should now have the remaining rows that are not in the train set. Doing this is extremely important and there's no way around splitting your data for validation. Now it's time to build our model. Our model is going to be the linear regression model that we imported before and Scikit-learn makes it extremely easy for us to build models and assign models. We just have to assign the object to some kind of variable, in this case, we'll just call it model, and you can see that you can change some of the hyper-parameters in the model, but we'll keep it standard right now. Now we fit our model to the data. This is the training step where automatically our model is adjusted and the parameters in our model are changed so that our model can predict y train from x train. To score our model, to test how well it's doing, we can use the score method on our fitted model and provided the data where we know the answers as well. We can use x test and y test to see how well our model is doing on unseen data. In this case, the regression is going to be the r-square. R-square in statistic it's basically a measure of determinism. How well does it really predict our data? The best value there is one, and then it goes down and can even be negative. So 0.03 it's not impressive. When we change our training data to include the median income, we increase the score significantly. Obviously, this is the most important part. We have to find data that is able to give us information on other data that we want. However, once we find that, we can further improve our model by doing pre-processing on our data. We have to be careful here though, because we now do preprocessing and we'll test out different things, if they work or if they don't. What can happen is that we manually overfit our model. That means to do proper data science right here. We want to split our test data into two parts. One validation holdout set and one test set. The test set will not be touched in the entire training process and not in our experimentation, but only in the very very last part of our machine learning journey. Here we define x-val and y-val and I made a little mistake here, leaving that to y train that should of course be x test in the train test split. Changing this means that this works, and this is also a nice part about the train test split function. It really makes sure that everything is consistent, that all our variables match, and we can see right here that our test data set is now quite small with a thousand values. So we can go back to the test split up here, and actually provide a ratio of data that we can use. In your data science and machine learning efforts, you should always see that you can use the biggest test size you can afford really, because that means you'll be able to have more certainty in your results. Here can see that it's now split 50-50 and splitting our test set now further down into the validation set and the test set, shows that our final test set has about 2,500 samples in there. Which it's good enough for this case. We'll define our standard scaler here and our model as the linear regression. 
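Before the scaler comes in, the bare split-fit-score workflow from this lesson could look roughly like this; the feature choice and split sizes are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("housing.csv")
X = df[["housing_median_age", "total_rooms", "median_income"]]
y = df["median_house_value"]

# 50% train, then split the rest 50/50 into a validation set and a final test set
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# R-squared on held-out validation data (1.0 is perfect, values can even be negative)
print(model.score(X_val, y_val))
```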
We fit our scaler onto the training data. That means we can now rescale our entire dataset so that none of the columns are significantly larger than the others. In a linear model, we fit the slope and the intercept. When we scale our data, our linear model can work within the same ranges for every feature and not be biased because one feature is significantly larger than the others. We will create X scaled from our X train data, so we don't have to call the scaler's transform every time we train. We can compare them: here we can see that our scaled data is now centered around zero and all at the same scale, whereas before it was all over the place. We can now fit our model on the scaled data, with the normal labels; the label obviously has to be y train in this case. Then we can do the usual validation on our holdout data; in this case it's going to be X val and y val, so we don't touch the test data while we figure out what kind of scaling and what kind of preprocessing works. We have to transform our validation data as well, because now our model expects scaled data. When we forget that, we get terrible results. And we can see that we improved our model by a small margin, but it is still an improvement just from applying the scaling to the data. If we want to try the RobustScaler instead, we can do that by just experimenting and swapping in a different scaler. This is the part where I mean we need an extra holdout set, because just trying different things is a really good way to see what works, and that is how you do data science. Seeing what sticks is a really important part of building a good machine learning model, because sometimes you might not expect that you have outliers in your data, and you try the RobustScaler and see that it actually performs better, or you realize that it performs worse. Here we can train on our transformed data, with y train again, and score our results to check whether this works. Try the MinMaxScaler that we used in a previous class as well. After we've done the experimentation and found our model, we can use this model to predict on any kind of data that has the same columns and hopefully the same distribution as our training data and the validation set. To do this, we'll use model.predict and provide it with some data. In this case, we'll use the training data just to have a look at how the model stacks up against the actual ground truth, the labeled data. Of course, doing it on the train data isn't the most interesting, because the model has seen this data. Eventually, we will do this on the test set, but finally, you want to do this on completely unseen data to actually get predictions from your machine learning model. Another very nice thing, and why I really like the train test split utility, is that you can provide it with the keyword stratify. Stratification is a means to make sure that some feature is equally represented in each part of the train and test split. If we want to make sure that our ocean proximity, and especially the island class, shows up both in train and in test, we can do this by supplying this feature. A reason why people like linear models so much is that linear models essentially fit a line to your data. If you think back to fifth grade, you may remember that a line is basically just the intercept on the y-axis and a coefficient for the slope. What we can do is interpret our linear model and have a look at these coefficients. Each column has a slope parameter right here.
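A hedged sketch of the scaled workflow, the stratify keyword, and the coefficients we are about to interpret; the features and split sizes are again just the ones used for illustration above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("housing.csv")
X = df[["housing_median_age", "total_rooms", "median_income"]]
y = df["median_house_value"]

# stratify keeps the share of each ocean_proximity class equal in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=df["ocean_proximity"])

# Fit the scaler on the training data only, then transform both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

model = LinearRegression().fit(X_train_scaled, y_train)
print(model.score(X_val_scaled, y_val))

# The fitted line: one slope coefficient per feature plus the intercept
print(dict(zip(X.columns, model.coef_)), model.intercept_)
```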
Well basically, this parameter tells you how much each feature's slope influences the prediction result. Of course, we can also have a look at the intercept with the y-axis, which gives us a full overview of our model; essentially you could write it out on paper now. In this class we learned how to use scikit-learn on a simple machine learning model, a linear regression, basically fitting a line to our data. We had a look at how scaling can improve our model and even predicted on some data that the model has never seen, so we're validating whether it's actually learning something meaningful or just remembering the data. In the next class, we'll have a look at some more sophisticated models, namely decision trees and random forests. 24. Decision Trees and Random Forests: In this class, we'll have a look at decision trees and random forests, which are just a bunch of decision trees that are trained in a specific way to be even more powerful. Decision trees are very convenient learners because you usually don't have to change the basic parameters too much. In this class, you'll see how easy it really is to use scikit-learn to build all kinds of different models and to utilize that in your exploration of the data. For this video, I already prepared all the imports and the data loading, and I split the data into the train set, which is 50 percent, and then a validation and a test set, which are each 25 percent of the total data. Now, we'll go on to build a decision tree to start out. We'll import the tree module from scikit-learn. As always, we'll define our model, in this case it's going to be a DecisionTreeRegressor. To make it comparable, we'll again do a regression on the house value. The training is going to be the same as always: model.fit with X train and y train. I think at this point you really see why scikit-learn is so popular: it has standardized the interface for all machine learning models. Scoring, fitting and predicting with your decision tree is just as easy as with a linear model. Decision trees are relatively mediocre learners, and we really only look at them so we can later look at random forests, which build several decision trees and combine them into an ensemble. The nice thing about decision trees is that they're usually quite scale independent and they work with categorical features. We could actually feed ocean proximity into our training here, but then we couldn't compare the method to the linear model anymore. We won't do that right now, but it's definitely something you can try out later. Scaling this data doesn't cost us anything, so we might as well try. Here you can actually see what happens when you don't transform your validation data: the model now expects scaled data, even though it's a decision tree, so it performs really poorly. When we transform both our train data and our validation data, our score is slightly worse than before. Next, we can build a random forest. A random forest is an ensemble of trees where we use a statistical method called bagging that basically tries to build uncorrelated decision trees that, in ensemble, are stronger learners than each tree individually. We'll import the RandomForestRegressor from the ensemble sub-library of scikit-learn, and just like before, we'll assign the model object to a variable and then we can fit our model to the data. As you can see, the fitting of this is quite fast, and scoring it should give us a really good result.
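A sketch of the decision tree and random forest fits from this lesson, with the same caveat that the feature selection and split are illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("housing.csv")
X = df[["housing_median_age", "total_rooms", "median_income"]]
y = df["median_house_value"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=42)

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

print("tree val:", tree.score(X_val, y_val))
print("forest train:", forest.score(X_train, y_train))  # usually near-perfect: memorization
print("forest val:", forest.score(X_val, y_val))        # the honest number

# Relative feature importances from the forest's introspection
print(dict(zip(X.columns, forest.feature_importances_)))
```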
We can see here that this is slightly better even than the score we got on our linear model after scaling, and if we now look at the score on the training data, you can see why we need validation data. This random forest is extremely strong on the training data itself, but only okay on validation data. We can also have a look at the scaling just to see how it works; it doesn't cost us anything, it's really cheap to do, so you might as well. If it improves your machine learning model or reduces overfitting, it's always worth doing because it's cheap. We'll scale our training data and fit our model to it; we can use the same scaler from before, because the scaler is independent of the machine learning model, it's just the scaler. We see right here that our training score basically didn't change, the difference is in the fourth decimal place, so it's basically random noise at that point. On the validation set, we shouldn't expect too much either, and it slightly deteriorated the result, so it is worth preserving the original data in this case. A fantastic thing about random forests is that they have something called introspection. You can actually have a look at how important a random forest thinks each feature is. These are relative numbers, they might fluctuate a bit, but we can see that these features are weighted differently within the random forest to predict a correct price. This was a really quick one. I think scikit-learn is amazing because it makes everything so easy: you just do .fit, .predict and .score, and those are super useful for all of our machine learning needs. In the next class, we'll have a look at how we not only predict prices but how we can predict categories. In a more business sense, that may be predicting whether someone is creditworthy or not. 25. Machine Learning Classification: In this class we'll have a look at classification. That means assigning our data to different bins depending on what's contained within the data. In our example we'll have a look at ocean proximity. We'll try to predict if one of our houses is closer to or further from the ocean. That basically means that we'll have the chance to test different algorithms and see how they are affected by preprocessing of our data as well. We'll import everything and load our data. Now in the split, we want to replace the house value with ocean proximity, because we want to do classification, so we need to predict classes. In this case, we'll predict how near a house is to the ocean, but generally you can predict almost any class. We'll turn it around this time and use all of the training features. But of course, we need to drop ocean proximity from our DataFrame; if we left that in, it would be a very easy classification task, I'd say. The easiest model, or one of the simplest models, is the nearest neighbor model, or k-nearest neighbors model. Nearest neighbors essentially just takes the closest data points to the point that you want to classify, and usually you just take a majority vote. That means the class that is most prominent around your point is probably the class of your point. For classification, scikit-learn is no different than for regression. We'll assign the object to a variable, and then we'll try to fit our data. But something went wrong: there are NaNs or infinite values in the data, and k-nearest neighbors does not deal well with this. Like I said, I leave all the preprocessing steps in the preprocessing chapter to keep these chapters short and concise.
But in this case, we'll just drop the NaNs without any further preprocessing, so those rows get deleted. That might not be a good idea in most cases, but here it's just to get our data out of the door. Now we can fit our model with the usual training data, and it just works this time. Then we can score our model. Scoring in classification is a little bit different than in regression. We do not have the R-square, because the R-square, the coefficient of determination, only applies to regression. In this case, we have the accuracy, and the accuracy is at 59 percent, which is all right. So about 60 percent of the time, this nearest neighbor model is predicting the correct class. We can probably do better, but that's a start. One thing you can try in your exercise is to change the number of nearest neighbors and have a look at how many neighbors around the point give the best value. We can have a look at many different classification algorithms. On the left you see the input data, which is three different forms of inputs, and then you see the decision surfaces of a binary classification on the right. You can see that the random forest is very jagged, for example, and a Gaussian process is very smooth; just for you to understand how these algorithms see data. We'll try out the random forest because it looks very different in the decision surface, and random forests are once again very powerful models. This is going to be the same scheme as the nearest neighbors, so we'll have a quick chat about scoring functions. The accuracy score is all right, it's a good default, but it essentially just counts how many you get right. Let's say you work in an environment where errors are especially bad, especially expensive; you'd want to have a look at whether another scoring function would be more appropriate. You can have a look at the scikit-learn documentation; there are different scoring functions that you can check. Here we have a look, and the random forest with default values just outperforms anything the nearest neighbors get close to, with 96 percent. That is on unseen data, so it is a very good score. We can once again have a look at the feature importances to see what our model thinks is the most important indicator that something is close to the shore, and obviously part of it is going to be the longitude and the latitude. Let's just drop those as well from our DataFrame, from our training data, because we want to make it a little bit more interesting; maybe something else is a better indicator. If you come to your boss and say, "Hey, I figured out that location tells really well whether my house is close to the ocean," they'll probably give you a slightly pitying look. Have a look, and obviously our random forest score is a little bit worse, but pretty all right. Let's have a look at another linear model. The logistic regression model is a binary model; you can use it for multi-class as well with a couple of tricks. It basically goes between 0 and 1 and finds the transition; you can see it right here in red. Logistic regression models are really interesting because they once again give a good baseline model, because they are linear classifiers. But more interestingly, you saw that there's this transition line between 0 and 1 in the image. You can define a threshold in there. By default, it is at 0.5, but you can test how you want to set the threshold for your logistic regression. This is a really good thing to think about in your machine learning model, and we'll have a look at how to determine this threshold after this segment of programming the logistic regression.
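A compact, hedged sketch of the classifiers from this lesson, including the logistic regression and ROC curve that we set up next. The binary "INLAND or not" target is purely an illustration, since ROC curves are defined for binary problems, and the scaling step is an assumption to help the solver converge.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("housing.csv").dropna()  # k-NN and logistic regression can't handle NaNs

X = df.drop("ocean_proximity", axis=1)
y = df["ocean_proximity"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=42)

# .score returns accuracy for classifiers: the share of correctly predicted classes
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("k-NN:", knn.score(X_val, y_val))
print("random forest:", forest.score(X_val, y_val))
print(dict(zip(X.columns, forest.feature_importances_)))

# Logistic regression with an ROC curve needs a binary target
yb_train, yb_val = y_train == "INLAND", y_val == "INLAND"
scaler = StandardScaler()
logreg = LogisticRegression(max_iter=10_000)
logreg.fit(scaler.fit_transform(X_train), yb_train)

# ROC curve: true positive rate vs false positive rate over all thresholds
probs = logreg.predict_proba(scaler.transform(X_val))[:, 1]
fpr, tpr, thresholds = roc_curve(yb_val, probs)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```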
This threshold is a really good thing to think about in your machine learning model, and we'll have a look at how to determine it after this segment of programming the logistic regression. We'll add the model and take a quick look, because we have a multi-class problem right here and we obviously want it handled. Luckily, the multi_class argument defaults to 'auto', because many real-world problems aren't binary; scikit-learn really tries to set good default values. We'll fit our model with the x-train and y-train data. Unfortunately, it did not converge, so it did not work on the first try. I'll go into the docstring and have a look. There we go: max_iter is the keyword we have to increase so the solver gets more iterations to find the right fit. 1,000 wasn't enough either, so just add a zero. This is going to take a while, so let's think about our optimum threshold in the meantime. In a sense, you want all your positives classified as positive and all your negatives classified as negative, and then you have to think about which kind of mistake is worse: missing a positive or falsely flagging a negative. For logistic regression we can use the ROC curve, where we plot the true positive rate, the positives that really are positive, against the false positive rate, everything that was falsely classified as positive, and then choose our optimum. In this class, we had a look at different classification algorithms. There are many more, as I showed you on that slide, and you can dive into the different kinds of classification really easily. As you saw, it's always .fit on your training data, then .score and .predict on unseen data; in the end, it's always the same. Then it comes down to how you scale your data, which is part of the exercise, and how you choose hyperparameters like k for the k-nearest neighbors algorithm. In the next class we'll have a look at clustering our data, so really seeing the internal structure of the data and how each data point relates to the others. 26. Data Clustering for Deeper Insights: In this class we'll have a look at how we can cluster our data. Sometimes data points cluster really well, sometimes they are harder to discern. We'll have a look at how different algorithms treat the data differently and assign it to different bins. After importing our data, this time we'll skip the part where we split the data, because we'll rather use the clustering algorithm as a data discovery tool. If you want to build clustering models for actual prediction, for assigning new data to classes, you have to do the splitting as well, just so you know the model actually does what it's supposed to do. In our case, we'll just have a look at k-means. K-means is the unsupervised little brother of k-nearest neighbors: it measures closeness and assigns each point to the nearest cluster center. We'll fit our data on the DataFrame and use fit_predict because we want to do everything in one step. The problem right here is that ocean proximity is still in there, with strings in it. We'll actually just look at the spatial data, longitude and latitude, because those are very easy to visualize in 2D; that makes our life a little easier. We'll get labels out for these. What we can do then is plot them using matplotlib. You'll get to know matplotlib in a later class as well.
But just for an easy plot, it does have plt.scatter, which takes x and y coordinates. You can also assign a color, which in our case is the cluster labels. With k-means you can define how many clusters you want to get out; the default is eight. We'll play around with it a little and you can see how the clusters change. The higher you go, the more fragmented it gets, and you can argue how much sense it really makes at some point to cluster data into hundreds of clusters. It's easy enough to show what happens when we actually have proper clusters. We'll split our data a bit, essentially using the subsetting we discussed before to delete some of the middle part of the longitude range. For that we can use the between method, which takes a start point and an end point; when we negate this between, we are left with one cluster on the left of our geographic scatter plot and one on the right. We'll choose -121 and -118 as the left and right borders. We can see right here that this gives us a split dataset; assign it to a variable so we can use it. Let's plot this so we see what's happening with our data. Just remove the color argument with the labels for now, because those don't apply here, and we can see the clear split between two clusters. Then we can use k-means to match these two. I'll copy this over and use fit_predict on the split data, then also copy over our scatter plot and add the labels back in. We can see that with two clusters it's quite easy for k-means to put one cluster on the left and one on the right. If we play around with the numbers, we can really test how it would find sub-clusters and how it interprets the data. And because scikit-learn makes it so easy, let's have a look at other methods. This is a graphic from the scikit-learn website showing different clustering algorithms and how they behave on different data. Spectral clustering comes to mind, but I personally also really like DBSCAN and Gaussian mixture models; they work quite well on real data, and especially the further development of DBSCAN, called HDBSCAN, is a very powerful method. HDBSCAN is a separate library that you have to install yourself, but it's definitely worth a look. We can do the same as before: we'll import DBSCAN from scikit-learn's cluster module and assign it to a variable. It doesn't have many hyperparameters to set; you could change the metric you saw in the docstring, but for now Euclidean is totally fine. We can see right here that, without us setting any number of clusters, it flags some outliers and basically finds three clusters without us telling it much about our data. Let's also have a look at spectral clustering; it works just the same. We'll instantiate it and assign it to an object. We do have to supply the number of clusters for this one, so we want two, then copy all of this over to predict with the spectral clustering object and execute the whole thing. This takes a little longer; spectral clustering can be slow on large datasets. Check out the documentation, it has a really good overview of which clustering method suits which size of data and what you have to think about when applying different clustering methods; a short sketch of this clustering workflow follows below.
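Here is a minimal, hedged sketch of that clustering workflow. It assumes a DataFrame df with longitude and latitude columns like the housing data used throughout the course; the cluster count and the DBSCAN parameters are illustrative, not tuned values.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN

coords = df[["longitude", "latitude"]]

# k-means with an explicit number of clusters
labels_km = KMeans(n_clusters=2, random_state=42).fit_predict(coords)
plt.scatter(coords["longitude"], coords["latitude"], c=labels_km, s=5)
plt.title("k-means clusters")
plt.show()

# DBSCAN needs no cluster count; it also flags outliers with the label -1
labels_db = DBSCAN(eps=0.3, min_samples=10).fit_predict(coords)
plt.scatter(coords["longitude"], coords["latitude"], c=labels_db, s=5)
plt.title("DBSCAN clusters (-1 = noise)")
plt.show()
```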
Since the methods are always evolving and growing, it's a really good idea to check the documentation, because that is always up to date. Here we can see that the clustering is quite good. Clustering data can be really hard, and as you saw, it can lead to very different outcomes depending on which algorithm you use, which assumptions that algorithm makes, and also how your data is made up: is it easy to separate or really hard to separate? In the end, I think it's a tool that can generate new insights into your data that you didn't have before, based on the data you feed it. In the next class, we'll have a look at how to validate machine learning models, because just building the model isn't enough; we have to know whether it actually learned something meaningful. 27. Validation of Machine Learning Models: In this class we'll have a look at validating your machine learning models, and you have to do this every time, because building machine learning models is so easy. The hard part is validating that your model actually learned something meaningful, and in one of the later classes we will also check whether our machine learning models are fair. In this class we'll look at cross-validation, so seeing what happens if we shift our data around: can we still predict meaningful outcomes? Then we'll look at baseline dummy models that are basically a coin flip: does our model perform better than random chance? After importing everything and loading the data, we'll drop the NaNs and split our data. Right now we'll do regression, so we'll build a random forest regressor; this is just to have a model we can compare against the dummy model for validation. We'll fit it right away to our training data and add the labels here. Having a fitted model, we can now score it and then go on to cross-validation. If you just learned about train-test splits, cross-validation takes them to the next level. The idea is that you have your training data and your test data; we keep the test data aside, and as you know, we split our training data into a training set and a validation set. In cross-validation, we now split our training data into folds, so basically into sub-parts, and each part is used once as the validation set against everything else as the training set. So we build five models, because we have five folds. You can also do this in a stratified way, like the stratified train-test split we used before. It is quite easy to do; once again the API, the interface you work with, is kept very simple. We'll import cross_val_score, and it takes your model, the data, that is your x, so the features, and of course the targets, because we have to validate against some value. This takes five times as long, because we train and evaluate five models, and we get five scores, one for each. You may notice that all of these scores are slightly lower than our single score on the whole training data, and that is usually the case and closer to reality. We can also use cross_val_predict to get the predictions of these five models, which is quite nice for model blending, for example: if you have five trained models, you can get their predictions out as well. It is not a good measure of your generalization error, though, so you shouldn't use cross_val_predict to judge how well your model is doing; it's rather there to visualize what these five fold models predict.
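A minimal sketch of that cross-validation step, assuming the placeholder names x_train and y_train for the training split; cv=5 reproduces the five folds mentioned above.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, cross_val_predict

model = RandomForestRegressor(random_state=42)

# Five folds, five models, five R^2 scores
scores = cross_val_score(model, x_train, y_train, cv=5)
print("fold scores:", scores)
print("mean:", scores.mean(), "std:", scores.std())

# Out-of-fold predictions, e.g. for blending or plotting;
# not a generalization estimate
oof_predictions = cross_val_predict(model, x_train, y_train, cv=5)
```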
Another validation strategy is building dummy models. Whether you do this before or after cross-validation is up to you; it is simply another way to validate our model. A dummy model captures the idea that we want our machine learning model to be better than chance. However, knowing what chance actually looks like can sometimes be a bit rough. If you look here, the dummy classifier supports different strategies; you can brute-force it and try them all, but a good one is usually 'prior', and I think that will become the default for the dummy classifier in future versions. But since we are doing regression first, let's look at the dummy regressor. Here you can also define a strategy; the default is the mean, so the dummy model simply returns the mean every time. We'll use this as our dummy model and, just like any machine learning model, fit it to the data, the x-train and y-train. Then we can score it, and we can even run cross-validation on it to see how well this chance model does. These scores are a good gauge of how well your actual model is doing: if the chance model performs as well as or better than your model, you have probably built a bad model and you have to go back and rethink what you're doing. We can do cross-validation here, but plain scoring would arguably be enough, which is something you can try out in the notebooks. We'll do this again and create a really quick classification model using the ocean proximity data. Here we'll build the dummy classifier with the default strategy. I personally think dummy models are really useful, because chance isn't always just 50-50: if you have class imbalances, like we do with the island data, your coin flip is essentially skewed, it is biased, and the dummy classifier is a very easy way to verify that, even with class imbalances, you did not build a useless model. Right here we can score it and get a pretty low accuracy. We can check how the different strategies affect the result: 32 percent is pretty low already, but you should probably take the best dummy result as your baseline, because that is still a chance result, so even 40 percent from a chance prediction is not a good sign, let's put it that way. Right here we'll build a new model using the random forest again, this time the classifier, and we'll fit the data to it directly so we don't have to execute more cells. Scoring on the data shows that the random forest is at least somewhat better than the dummy: about 20 percent better accuracy, so I'd say we're actually learning something significant here about predicting whether a house is closer to or further from the ocean. As I said, proper scoring is appropriate here, so we'll also score our dummy model on the test data. The warning we get is interesting: the cross-validation tells us that the island class of ocean proximity does not have enough data to do a proper split. That is really important to notice and something to take into account. But apart from that, we see that even under cross-validation our model outperforms the dummy model.
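A hedged sketch of the dummy-baseline comparison described above; the split variable names are placeholders, and the strategies shown are just the ones discussed.

```python
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.ensemble import RandomForestClassifier

# Regression baseline: always predicts the mean of the training targets
dummy_reg = DummyRegressor(strategy="mean")
dummy_reg.fit(x_train, y_train)
print("dummy R^2:", dummy_reg.score(x_val, y_val))   # roughly 0 by construction

# Classification baseline on ocean proximity (placeholder split names)
dummy_clf = DummyClassifier(strategy="prior")
dummy_clf.fit(x_train_clf, y_train_clf)
print("dummy accuracy:", dummy_clf.score(x_val_clf, y_val_clf))

# The real model has to beat this clearly to be worth anything
forest_clf = RandomForestClassifier(random_state=42).fit(x_train_clf, y_train_clf)
print("forest accuracy:", forest_clf.score(x_val_clf, y_val_clf))
```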
Validating machine learning models is very close to my heart. It's so important, precisely because it has become so easy to build machine learning models, that you do the work and check that those models have actually learned something meaningful and that you aren't just reading something into noise. These strategies are really the base level of what you have to do with every machine learning model, and in the next classes we'll have a look at how to build fair models and how to make sure our model doesn't disadvantage anyone because of some protected class, which is extremely important whenever your machine learning model touches humans. 28. ML Interpretability: In this class, we'll have a look at machine learning interpretability. We're going to take this black-box model that we built and inspect what it actually learned. If you're like me and you've ever built a model, shown it to your boss and said, yeah, it learned, it scored such and such, it had 60 percent accuracy, they won't be impressed: they want to know what this machine learning model actually thinks. In this class, we'll look at how each feature in our data influences the decisions of our machine learning model, and we'll dive into the really cool plots you can make of those influences. Here we'll pretend we already did the entire model building, validation and data science beforehand, so we can check whether our model is fair. The notion of fairness here is the idea that, even though our model has never seen the ocean proximity class, it may still implicitly disadvantage parts of it. You can check right here in our split that we drop ocean proximity directly, we do not train on it, but maybe our data somehow implicitly disadvantages some class within ocean proximity, so we check for this. Now come a couple of pandas tricks; I've been using pandas for a while, so here you see the kind of things you can do with it. The validation data is only a part of our DataFrame, and I want to find all indices in the full DataFrame that belong to a given class and intersect that index with the validation data. That way I can select the subset of that class within our validation set without the class column actually being present in the DataFrame from our train-test split. Doing this, I play around with it a little and try to make it work, printing over and over so I can see what's happening and verify that this really is the data I want. Right here you see that I take the subset, take its index, and then print the intersection of the x-val index and the class-based index I created before. I save this in idx, and then I can simply subset the DataFrame with that index. We'll use the model's scoring function here, just to get an initial idea of how well our model really performs on this class subset of our validation data. Right here I print this because we're now in a loop. Ah, I have to use .loc here. It's really important for you to see that I still make mistakes very regularly, and you cannot be scared of mistakes in Python because they don't hurt, mistakes are cheap, so just play around. In this case I messed up; I sometimes have trouble keeping the columns and the rows apart.
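This per-class evaluation pattern is easy to express in a few lines. The sketch below uses hypothetical names: df is the full DataFrame including ocean_proximity, x_val and y_val are the validation split, and model is the fitted regressor. It illustrates the index-intersection idea, not the exact notebook code.

```python
# Score the model separately for each ocean_proximity class,
# even though that column was never used for training.
for cls in df["ocean_proximity"].unique():
    class_index = df[df["ocean_proximity"] == cls].index
    idx = x_val.index.intersection(class_index)
    if len(idx) == 0:
        continue  # class not represented in the validation split
    score = model.score(x_val.loc[idx], y_val.loc[idx])
    print(f"{cls}: {score:.3f} (n={len(idx)})")
```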
Interestingly, we see wildly varying values right here. We have three that are around 60 percent, which is nice, but one value is around 23 percent and the last one is zero. Looking at the indices, we can see that the zero must be island, because island only has five samples in total, so we definitely have a prediction problem here despite our model doing well overall: it does terribly on the island data because there simply isn't enough data there. Here you can see that I try to improve the model by stratifying the split on ocean proximity, just to see whether that changes anything. It does not, because now I've made sure the classes are evenly distributed across the splits and we have even less island data in validation; before, I got lucky and had two of the samples in the validation set, and now I have fewer. This is already a subset of the data, so we'll just skip it, because with five samples, trying to spread them out over all the splits is kind of moot. In a case like this you really have to think about whether you can get more samples, collect more somehow, or whether there is a way to take those samples out of the system entirely; but since they are real data, they should usually be represented in your model. So really: get more data in that case. We can see right here that the stratification has improved the model overall, which is nice to see, and I'll format the output with a backslash-n, a newline, just to make it look a little nicer. Here we can see that we get good predictions for everything that is near the ocean, so near bay, near ocean, and under one hour from the ocean, but inland performs significantly worse than the rest. Let's ignore island for now, because we already discussed its problems. Let's look at the test data, because at this point we're not doing model tuning anymore, so this is really the final validation, and we can see that on the test data, inland really has a problem. So our model performs well overall, but something is going on with our inland data. It would also be good to do cross-validation here so we can get an uncertainty on our score and see whether there are fluctuations. But let's move on to eli5. Eli5 is a machine learning explanation package; this is the documentation, and for decision trees we can use explain_weights, which is what we're doing right here. I call it on our model; we have to supply the estimator, so whatever we trained as our model. We can see the weights for each feature and how the features scored. This is an extremely good way to look into your model and explain what influences it the most. We can also use eli5 to explain individual predictions: we supply our estimator object, but then also individual samples, and get an explanation of how the different features influenced that prediction. Right now we'll just use a single sample from our data, and we can see how each individual feature contributes to the outcome of 89,000. Of course you can do this for classifiers as well, or we can iterate over several different samples and see how each is explained by eli5. I'll just use display here; like I said, the formatting is really nice as well, but I don't want to go into it in this class. Here you can interpret how each of these predictions is influenced by the different features.
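A small, hedged sketch of the eli5 calls mentioned here; model and x_val are placeholders for the fitted random forest and the validation features, and this assumes eli5 is installed alongside scikit-learn. show_weights and show_prediction are the notebook-friendly wrappers around explain_weights and explain_prediction.

```python
import eli5
from IPython.display import display

# Global view: which features carry the most weight in the tree ensemble
display(eli5.show_weights(model, feature_names=list(x_val.columns)))

# Local view: why did the model predict this value for one particular sample?
display(eli5.show_prediction(model, x_val.iloc[0]))
```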
After having a look at these explanations, we can turn to feature importance. You may remember that random forests come with introspection via their built-in feature importances; scikit-learn also provides permutation importance, which you can apply to every machine learning model available in scikit-learn. The way it works is that the permutation importance takes each feature in the data and scrambles it, one feature at a time: first it takes, say, households and shuffles it so it is essentially noise, then it measures how much that degrades your predictions. You can see here that it gives you the mean importance, the standard deviation, and the raw importances, so you can really dig into how your model is affected by each feature, and you can also decide to drop certain features based on this. Next, we'll look at partial dependence plots. These plots are really nice because they give you a one-dimensional look at how a single feature affects your model. Inspection tooling is relatively new in scikit-learn, which is why there is Yellowbrick, scikit-yb, which makes these fantastic plots; top middle you can see the precision-recall curve, for example. It's generally a really good way to visualize the different things that explain your machine learning model. Here we see all the different features in our training data and how they influence the prediction result. Bottom right you can see that median income has a strong positive influence. From these plots you can interpret how changing one feature would influence the predicted price; households, for example, gives a slight increase when there are more households, and so on. It's a really neat little plot. But the final library, and my favorite library for machine learning explanation, is shap; the authors even got a Nature paper out of it. Shap has different explainer modules for different kinds of models, each fine-tuned to its model family, and you can even explain deep learning models with shap. We'll use the tree explainer for our random forest model. We get a warning that we have a lot of background samples; we could sub-sample to speed things up, but for now we'll pass our validation data to the explainer object we created and calculate the shap values. This takes a second. I assign the result to a variable so I can reuse it later and don't have to recalculate it. The plot I mainly want to show you is the force plot. It can be used to explain the magnitude and the direction with which each feature shifts the prediction. I really love to use these plots in reports because they are very intuitive, as you'll see in a second. Here we have to pass the expected value from our explainer object, the shap values we calculated before, and one of the data points from the validation data; you can do this for several of your validation points as well. I made a mistake here with a missing underscore, and I also should have activated JavaScript for shap, because it makes such nice plots by falling back on JavaScript, so that has to be initialized first.
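Here is a hedged sketch of those two inspection steps, permutation importance and a shap force plot. model, x_val and y_val are placeholders for the fitted tree model and the validation split, and the shap calls shown follow the classic TreeExplainer interface, which may differ slightly between shap versions.

```python
import shap
from sklearn.inspection import permutation_importance

# Model-agnostic importance: shuffle one feature at a time and measure
# how much the validation score drops.
result = permutation_importance(model, x_val, y_val, n_repeats=10, random_state=42)
for name, mean, std in zip(x_val.columns, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")

# SHAP force plot for a single prediction
shap.initjs()  # enable the JavaScript renderer in the notebook
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(x_val)
shap.force_plot(explainer.expected_value, shap_values[0, :], x_val.iloc[0, :])
```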
With JavaScript initialized, we get this plot, and I hope you try it yourself, because you can see that this particular prediction was most influenced negatively by the median income, then a little less, positively, by the population, and negatively by the number of households; it's just overall a really nice package. We've had a look at different ways to visualize and explain models for your reports and how to generate them, and definitely check out the documentation, it has so much more for you. In this class, we inspected the machine learning model. We looked at how different features influence the model's decisions, how strong that influence is, and how features interact with each other; maybe we can even drop some features from our original data acquisition. In the next class we'll have a look at fairness, an important topic, because machine learning models might actually disadvantage protected classes by learning something they shouldn't learn. We'll look at how to detect this and at some strategies to mitigate it. 29. Intro to Machine Learning Fairness: In this class we'll have an introductory look at machine learning fairness. Machine learning has gotten a bit of a bad rap lately because models have disadvantaged protected classes that should never have been disadvantaged, and this only came out because people noticed, not because the data scientists did the work beforehand. In this class we'll look at how you can do that work: how you can see whether your model performs worse on certain protected characteristics, and also whether your model is less certain in certain areas. Sometimes you get a model that predicts someone is worse off simply because they are in a protected class, and that is a big no-go; if that ever happens, your machine learning model should never reach production. But sometimes your model is just less certain for people in a certain class, and then you should try to increase the model's certainty by doing the machine learning and data science work beforehand. This builds on what we did in the interpretability part, where we already did part of the fairness evaluation. We start with a random forest and do the scoring, so we have a baseline for how well the overall model is doing, and then we start to dissect it by class. We already have stratification on the class, which we keep from before because it improved the model significantly, and then we iterate over the classes and look at how well each one does. In the beginning we want to know the score, basically doing the same work as in the interpretability part. With our classes in place, we can have a look and interpret our data. Right here we do the work of getting our indices: we do the whole intersection routine, getting our class indices first, saving them in idx for index, and then taking the intersection with the union of the validation and test indices, because right now we just want to test our approach, so using both held-out parts of the dataset is fine. Usually you would do this at the very end, after fixing your model and after hyperparameter tuning, to see whether your model is disadvantaging anyone.
We take the intersection with the class index right here, copy it over, make sure the stray dot isn't there, and then we can score our model on the validation data and on the test data. Ideally all of these scores should be equally good, and in the absolutely ideal case they'd all match the overall model score, but we remember that inland was significantly worse than everything else, and of course we have the problem of island not having enough samples to do the validation at all, which is why I just skip island for now. Later I'll also show you how to catch errors in your processing so we can handle this. Then we expand this to include cross-validation, because with cross-validation we can make sure there aren't weird fluctuations in our data; maybe inland just has some funny data in there that makes its predictions vary wildly, and catching that is really important here. This is only the beginning: if you want to play around with it, you can build dummy models as well and really dig into why inland is doing so much worse, using the interpretability tools to investigate what is happening in this model. There we have the scores, and while looking at them is nice, it's getting a bit much with all the numbers. So the first thing we can do is add Python's try-except: if there's an error because island doesn't have enough data, we catch that error, and in the except we just put a pass, so everything else still runs after we've processed island as well. There we go. Now we save these as variables, val and test, so we can calculate statistics on them, the mean and the standard deviation. An indicator of uncertainty, of something funny going on, would be a very high standard deviation across the cross-validation folds, which, interestingly, we don't see. In this class we had a look at how to dissect our machine learning model and evaluate it on different protected classes without training on them. We saw that a model that does quite well overall can perform really poorly on some classes, and that sometimes we don't even have enough data to properly evaluate the model, so we have to go all the way back to data acquisition and get more data to be able to build a good model and do good data science with regard to that class. The business case here is really clear: you never want to disadvantage someone who would be a good customer, because then you lose a good customer. This concludes the chapter on machine learning and machine learning validation. In the next class we'll have a look at visualization and how to build beautiful plots that you can use in your data science reports and presentations. 30. | Visuals & Reports |: In this final chapter, we'll have a look at data visualization and also how to generate presentations and PDF reports directly from Jupyter. That concludes the course. 31. Visualization Basics: In this class, you'll learn the underlying mechanisms of data visualization in Python. We'll import our data as always, and then we'll use the standard plotting library in Python, matplotlib. It underlies most of the other, more high-level plotting libraries like seaborn, and it's really good to know because you can use it to interface with seaborn as well. We'll make an easy line plot of the median house value here; usually, though, you'd want a line plot for data points that are actually related to each other.
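As a reference for this lesson, here is a minimal matplotlib sketch along the lines of what is described next; df is a placeholder for the housing DataFrame, and the column names are assumptions based on the dataset used in the course, so they may need adjusting.

```python
import matplotlib.pyplot as plt

# A figure with an explicit size (in inches), a scatter-style plot made
# with an 'x' marker, axis labels, a title and a legend.
fig = plt.figure(figsize=(10, 6))
plt.plot(df["population"], df["median_house_value"], "x", color="tab:blue", label="house")
plt.xlabel("Population")
plt.ylabel("Median house value")
plt.title("Population vs. house value")
plt.legend()
plt.savefig("scatter.png", dpi=300)  # dots per inch sets the saved resolution
plt.show()
```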
Let's start modifying this basic plot. First we open up a figure and also call plt.show, because the Jupyter notebook makes it a bit easy for us by showing the plot object right after the cell is executed, but calling show explicitly is really the proper way to do it. We can change the figure size by creating our figure object with a different figsize; those are usually measured in inches, just as an aside. Then we can change this away from a line plot and modify it further, since line plots aren't really appropriate here: if we plot these columns against each other as lines, it looks a little funky. We can change the marker to an x and we get a nice scatter plot. You can see that seaborn, for example, makes it much easier to get a nice-looking scatter plot; this one still needs a lot of work, but now you know how to change colors and use different markers. You can look up the markers on the matplotlib website, there's a myriad of them. Then we can start adding labels. The plot object we have is a simple plot object, so we just add labels: the x label is our population and the y label is going to be our house value. We can add a title, because we want our plots to look nice and be informative. Additionally, we can add further layers of plotting on top: instead of population against median house value, we can also plot population against total bedrooms and change the marker size, marker color and marker style. Obviously total bedrooms is scaled very differently than median house value; we can do a quick hot fix for that now, which of course you'd never do in an actual plot, but it shows how to overlay different types of data in the same figure. You can then save your plot and have it available as a normal file. Changing the dpi, the dots per inch, sets the resolution of the saved plot. We can also plot image data; we don't have an image right now, but it works with plt.imshow, where you essentially just give it the image data and it plots the 2D image for you. Let's look at changing this plot even more, like overlaying additional data. If we only have one scatter, that's completely fine, and we can add a legend, but it gives us a warning that there are no labels in this plot object. That means we add a label to our data plot, and we'll just call it house; now we have a legend. It makes more sense once we overlay more data: if we want to plot some data on top, we just call another plot function, change the marker so it looks a little different, and you can see that our legend is updated as well. This is how you make individual plots. As I mentioned, seaborn is a high-level abstraction over matplotlib, which means we can use matplotlib to work a little with seaborn. I'll only give you one example of how to save a seaborn plot, but you can easily look up other ways to add information to, or modify, your seaborn plots through the matplotlib interface. Right here we'll do the pairplot with only 100 samples, because we want it to be quick, and we can once again open a matplotlib figure, change the figure size as we like, and then save the figure. We can see that this is now available as a PNG; open it up and use it wherever you need it. Opening it, it just looks like the plot, but as an image.
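A short, hedged sketch of saving a seaborn plot through the matplotlib interface, as described above; df again stands in for the housing DataFrame, and the sample size and file name are arbitrary.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairplot on a small sample to keep it fast, then save through matplotlib
sns.pairplot(df.dropna().sample(100, random_state=42))
plt.gcf().set_size_inches(10, 10)   # resize the current figure
plt.savefig("pairplot.png", dpi=150)
```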
If you want to make quick plots directly from DataFrames without calling seaborn or matplotlib yourself, you can use the pandas plot method, which interfaces with matplotlib under the hood. So df.plot gives you the ability to make different plots: bar charts, histograms, and also scatter plots, which is what we'll do this time. We just provide the labels of our x and y columns and tell it to make a scatter plot; we'll use population against total rooms again, pass the word scatter, and it plots a scatter plot for us. In this class, we learned the different ways to plot data in Python. We used matplotlib, we used pandas directly, and we saw that these interact with each other because seaborn and pandas both build on matplotlib; you can take the objects returned by those plots, save them, and even manipulate them further. In the next class, we'll have a look at plotting spatial information. 32. Visualizing Geospatial Information: In this class, we'll have a look at mapping geospatial data, data where you have geo-locations available. You can make nice maps and really show where your data is located, which adds another dimension of understanding. We'll start by loading our data. Our data already contains longitude and latitude; however, they're in the reverse order from what the mapping library expects, so we'll have to keep that in mind. We'll import folium. Folium is a way to plot our data on interactive maps. We'll start out with a folium base map and give it a location to center on; an easy way to build the base map for my data is to provide the mean of the latitude and longitude as the center point. Then we can have a look at the interactive display. You can see that it has OpenStreetMap as the background and you can zoom around in it. Then we can start adding things to it, for example markers. Markers are a good way to add locations from your data points and attach tooltips to them, and that is what we'll do right here. Folium has the Marker class, and you should have a look at all the different classes you can use; the library is still growing, but it already has some really neat functionality for building cool maps. We'll add the first data point as a marker to the map, which is what the add_to method is for. We'll copy the base map into this cell, because everything has to be contained in one cell to be able to change it. We can see it right here. We'll rearrange that, add latitude and longitude, and just use iloc to get the very first row from our DataFrame. There we go, add it to our base map m. When we zoom out, we can see our first marker on the map, which is quite nice. We can change the map around and we can change the markers as well; there are different kinds, so you can also add circle markers to your map. Definitely experiment with it, it's quite fun, I think, and quite a neat way to visualize data. Our map was zoomed in far too much at the default value, at zoom level 12 the marker was nowhere to be found, so zoom out a little at the start so you can actually see the markers. We can also add multiple markers by iterating over our DataFrame. For that, we'll just use the iterrows method on the DataFrame, which returns the index of each row together with the row contents; a short sketch of this marker loop follows below.
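Here is a minimal, hedged folium sketch of the base map and the marker loop, including the sampling and tooltips that come up next; df again stands in for the housing DataFrame, and the column names are assumptions.

```python
import folium

# Center the map on the mean coordinates of the data
m = folium.Map(
    location=[df["latitude"].mean(), df["longitude"].mean()],
    zoom_start=6,
)

# A handful of markers, sampled so the map stays readable,
# with ocean_proximity as the hover tooltip
for _, row in df.sample(5, random_state=42).iterrows():
    folium.Marker(
        [row["latitude"], row["longitude"]],
        tooltip=row["ocean_proximity"],
    ).add_to(m)

m  # in a notebook, the last expression renders the interactive map
```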
In the notebook I'll take a sample of the DataFrame, because if we added 20,000 markers to the map it would be quite unreadable; maybe five to begin with. Right here I'll add the index that gets unpacked in the loop itself, and I can remove iloc, because we don't have to access any location ourselves, the iteration already does that for us. We get a nice little cluster of markers. Then you can also go and change these markers and add a tooltip that appears when you hover over them. This tooltip can contain any information we have available; for example, we can use the ocean proximity right here. Now, when you hover over a marker, you can see which class it has according to our data. In this class, we had a look at how to create maps, add markers to them, and make them really nice and interactive. In the next class, we will have a look at more interactive plots, to be able to interact with the data directly and make these nice interactive graphs. 33. Exporting Data and Visualizations: Oftentimes we need to save our data, because we don't want to re-run the analysis every time, or we want to share the data with a colleague. That means we need to export the data and all the visualizations that we make, and that is what we'll do in this class. With our data slightly modified, for example using the one-hot encoding from before just so we have some different data in there, we can use to_csv or one of the other export methods; you can write out Excel as well. We write this to a file so that the processed data is available and we don't have to re-run the computation every time. to_csv takes a lot of different arguments, just like read_csv, which makes it very convenient. You can, for example, replace the NaNs with the word 'nan', so people know this is genuinely not a number and not just a missing value where you forgot to add something. Then we can open the file, search for nan, and we can see that it wrote nan into the file directly; a really convenient wrapper to get our DataFrame into a shareable format again. Instead of this, we can also use the plain write functionality, which is basically how you write out any kind of file you want. We'll use out.txt here, so a .txt file, and set the mode to write mode, w, with f as the file handle. f.write gives us the ability to write a string into this file, and we can convert our DataFrame to a string; there should be a to_string method, so we use that right here, which is in a sense another export method. Really, this is just to show that you can write anything you can format as a string into a file. We see that this is not as nicely formatted as before: we have tabs in between instead of commas, and it would need a bit of string magic to make this work as nicely as the pandas to_csv. But that is how you export files in Python. Something to note here is that write mode will always overwrite your file: if we change the content to anything else and look at the file again, refreshing it shows only the new content, the file is replaced by the writing operation. There's another mode we should look at, which is append mode. Append mode just has the signifier a, and lets us add onto a file.
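A hedged sketch of those export options; the file names are arbitrary and df is again a placeholder DataFrame.

```python
# CSV export with an explicit marker for missing values
df.to_csv("processed_data.csv", na_rep="nan")

# Plain file writing: 'w' overwrites, 'a' appends
with open("out.txt", "w") as f:
    f.write(df.to_string())

with open("out.txt", "a") as f:
    f.write("\nAppended without deleting what was already there.")
```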
Append mode is quite nice if you want to preserve the original data, or if you have some ongoing process that writes out data and adds it to an existing file without deleting it. We can see right here that we wrote out our DataFrame; then we copy this over, change the mode to append, execute it, go over and refresh the file. Looking at the very end, it should now contain the extra text we appended, and it does. So much for files. We already did this in the visualization basics, but in case you skipped that: when you have figures, you can export them with the savefig command. It takes a filename or file handle, and of course you need some kind of plot. I really want to point you to the tight_layout method here, because it is really good at tightening up the layout of your saved plots: if you save your figure and it looks a bit wonky, plt.tight_layout will clean up the borders and usually makes the figure more presentable. I basically run it on almost every exported figure. Here you can see that our figure was exported just fine. You can of course change all these parameters to save the figure exactly the way you need, maybe in a corporate color. In this case I chose black, which of course is a poor choice if the rest of your figure isn't made for a black background, but it shows how to play around with it. We had a look at how easy it is to save data in different formats from Python. In the next class, we'll have a look at how to generate presentations from Jupyter notebooks directly. 34. Creating Presentations directly in Jupyter: It can be complicated to generate whole presentations, but it is possible to get presentations right out of Jupyter, and in this class I'll show you how. You can use any notebook; we'll use the one created in the visual data exploration class. You want to go to "View", "Cell Toolbar", and then "Slideshow". You can change the slide type for each cell, whether you want it displayed or skipped; slide is going to be one of the main slides. Everything you want on its own slide you can mark as slide or sub-slide; sub-slides get a different navigation direction. Notice these plots, we'll look at them in the presentation in a second. Fragment adds another element to the parent slide, essentially; we'll check that out as well. After assigning these, we can go to "File", "Download as", and choose "Reveal.js slides". When we open this up, we get a presentation right in the browser. This is a main slide, scrolling to the right essentially. We can still look at the data, and it shows us the code and everything. Sometimes you have to play around a little with the plots so that they work. These are the sub-slides we talked about before. The fragment adds another element to its slide: this is a fragment, and here is another fragment. In this class, we had an overview of how to generate presentations in JavaScript and HTML from Jupyter. We saw that we can preserve the data and the code in our presentations and even have the plots included automatically, and that we can use sub-slides and fragments to make really interesting presentations that are different from what you usually see. In the next class, you will see how to get PDF reports out of Jupyter.
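If you prefer the command line over the menu, nbconvert can produce the same Reveal.js slides; this is a hedged sketch using the notebook's shell escape, and notebook.ipynb is a placeholder filename.

```python
# Run from a notebook cell (the leading ! is the Jupyter shell escape),
# or drop the ! and run the same command in a terminal.
!jupyter nbconvert notebook.ipynb --to slides

# Re-run all cells top to bottom before exporting, for a clean result
!jupyter nbconvert notebook.ipynb --to slides --execute
```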
35. Generating PDF Reports from Jupyter: In this final class, you'll learn how to generate PDFs directly from a Jupyter notebook and how you can get these beautiful visualizations right into your PDFs without any intermediate steps. We'll start with a Jupyter notebook and go to print preview; from here we can already save it as a PDF if we print it. Alternatively, we can use download as PDF, but for that you have to make sure you have LaTeX installed, and I know a lot of people don't; I don't on this computer, for example, so you get a server error. You can go the extra step of downloading as HTML, opening the HTML, which is equivalent to the print preview, and saving it as a PDF from there. In the PDF you can see that it contains your code and all the information you had available before. Additionally, we have nbconvert, the notebook conversion functionality that comes with Jupyter, and I think that's a really nice way to work. It prints a read-me when you just call jupyter nbconvert, which essentially tells you how to use it. What you'll want to do is go to your files, in my case the code repository for this Skillshare course, and there you can run jupyter nbconvert and choose the notebook you want to generate the report from. HTML is usually the default, so if you just call jupyter nbconvert on your notebook, it converts it to HTML. You can also supply the --to option, but if you say PDF, it runs into the same error as before if you don't have LaTeX installed; install that, and you can easily get these PDF reports directly from Jupyter. Another very nice option: in a Jupyter notebook you often play around a little, and the cell execution counts can run quite high, like 60 or 70, which basically shows how much experimentation you did. If you want a clean notebook that is run top to bottom, you can provide the --execute option, which executes your notebook cell by cell before exporting. This is how you generate PDFs in Jupyter: maybe you have to install LaTeX to be able to do it, or you use the print functionality on the HTML report. And this concludes the class on data science in Python here on Skillshare. Thank you for making it this far, I hope you enjoyed it and I hope it brings you forward in your career. 36. Conclusion and Congratulations!: Congratulations, you made it through the entire course on Data Science with Python here on Skillshare, and I understand that this is a lot. We went through the entire data science workflow, including loading data, cleaning data, doing exploratory data analysis and building machine learning models, then validating them, looking at interpretability, and also generating reports and presentations from our analysis. This is huge, and I understand it can be overwhelming. But you can always go back to the bite-sized videos and have another look to deepen your knowledge. In my opinion, the best data scientists simply build projects: you will learn more about data science by actually applying your knowledge right now, and that is why we have a project at the end of this course. You will build a data science project, an analysis of your own data or of the data I provide here, and produce a PDF with at least one visualization that you like. But honestly, the more you do, the better.
Dive deep into the data, find interesting relationships in your dataset, and really work out how best to visualize them. That is how you will become a better data scientist: by actually applying your knowledge. Thank you again for taking this course, I hope you enjoyed it, and check out my other courses here on Skillshare if you have time. Now make sure to go out and build something interesting, something you really like. Thanks again for taking this course; I've put a lot of work into it and I'm glad you made it to the end. I hope to see you again, and go build something awesome.