Data Science and Machine Learning with Python – Intermediate Hands-On Coding | Jesper Dramsch, PhD | Skillshare


Data Science and Machine Learning with Python – Intermediate Hands-On Coding

Jesper Dramsch, PhD, Scientist for Machine Learning


Lessons in This Class

    1. Introduction to Data Science with Python
    2. Class Project
    3. What is Data Science?
    4. Tool Overview
    5. How To Find Help
    6. | Data Loading |
    7. Loading Excel and CSV files
    8. Loading Data from SQL
    9. Loading Any Data File
    10. Dealing with Huge Data
    11. Combining Multiple Data Sources
    12. | Data Cleaning |
    13. Dealing with Missing Data
    14. Scaling and Binning Numerical Data
    15. Validating Data with Schemas
    16. Encoding Categorical Data
    17. | Exploratory Data Analysis |
    18. Visual Data Exploration
    19. Descriptive Statistics
    20. Dividing Data into Subsets
    21. Finding and Understanding Relations in the Data
    22. | Machine Learning |
    23. Linear Regression for Price Prediction
    24. Decision Trees and Random Forests
    25. Machine Learning Classification
    26. Data Clustering for Deeper Insights
    27. Validation of Machine Learning Models
    28. Machine Learning Interpretability
    29. Intro to Machine Learning Fairness
    30. | Visuals & Reports |
    31. Visualization Basics
    32. 52 Geospatial new
    33. Exporting Data and Visualizations
    34. Creating Presentations directly in Jupyter
    35. Generating PDF Reports from Jupyter
    36. Conclusion and Congratulations!







About This Class

Are you ready to learn the skills that are in high demand across all industries?

Machine Learning and data science are essential for making informed decisions in today's business world. And the best way to learn these skills is through the power of Python.

Follow the free code resource book on: (no e-mail)

Python is the go-to language for data science, and in this course, we will dive deep into its capabilities. This class is designed for both beginners and intermediate students, so don't worry if you're new to programming. This class assumes some Python knowledge, but if you'd prefer a high-level introduction without programming application to data science, I have another class: The No-Code Data Science Master Class.

We will start by covering the basics of Python syntax and then move on to the full data science workflow, including:

  • Loading data from files and databases
  • Cleaning and preparing data for analysis
  • Exploring and understanding the data
  • Building and evaluating machine learning models
  • Analyzing customer churn and validating models
  • Visualizing data and creating reports

We will be using popular and freely available Python libraries such as Jupyter, NumPy, SciPy, pandas, Matplotlib, Seaborn, and scikit-learn.

By the end of this class, you will not only have a solid understanding of data science and analytics but also be able to quickly learn new libraries and tools. So, don't wait any longer and enroll in this Python Data Science Master Class today!


Who am I?

Jesper Dramsch is a machine learning researcher working between physical data and deep learning.

I am trained as a geophysicist and shifted into Python programming, data science and machine learning research during work towards a PhD. During that time I created educational notebooks on the machine learning contest website Kaggle (part of Alphabet/Google) and reached rank 81 worldwide. My top notebook has been viewed over 64,000 times at this point. Additionally, I have taught Python, machine learning and data science across the world in companies including Shell, the UK government, universities and several mid-sized companies. As a little pick-me-up in 2020, I have finished the IBM Data Science certification in under 48h.



Meet Your Teacher


Jesper Dramsch, PhD

Scientist for Machine Learning



a top scientist in machine learning, educator, and content creator.

In my classes, you'll learn state-of-the-art methods to work with AI and gain insights from data, along with over 7,000 other students. This takes the form of exploring data and gaining insights with modelling and visualizations. Whether you're a beginner, intermediate, or expert, these classes will deepen your understanding of data science and AI.


Level: All Levels


1. Introduction to Data Science with Python: Data science, in a sense, is like a detective story to me. You unravel hidden relationships in the data, and you build a narrative around those relationships. My name is Jesper Dramsch and I'm a machine learning researcher and data scientist. I spent the last three years working towards my PhD in machine learning and geophysics. I have experience working as a consultant, teaching Python and machine learning in places like Shell and the UK government, but also mid-sized businesses and universities. All this experience has given me the ability to finish my IBM Data Science Professional Certificate in 48 hours, for a course that's supposed to take about a year. I also create exactly the kind of notebooks that you learn to create in this course for a data science and machine learning competition website called Kaggle, which is owned by Google. There I reached rank 81 worldwide out of more than 100,000 participants.

After this course, you will have gone through every step of the data science workflow. That means you'll be able to recreate all the visualizations and have all the code snippets available for later use with your own data in your own reporting. We'll do a very applied step-by-step. We'll start at the very beginning, with getting your data into Python. That means looking at Excel files and looking at SQL tables, but also looking at those weird little data formats that sometimes can be a bit tricky to work with. Then we'll preprocess our data, clean our data, and do exploratory data analysis, or EDA for short. EDA is that part where you really refine your question, and where we have a look at the relations in our data and answer those questions. Afterwards, for fun, we'll have a bit of a look at machine learning modeling and how to validate those machine learning models, because in this modern time, this is more important than ever.
We'll have a look at different data visualizations, how to best tell your story, and how to best generate presentations and reports to really convince, to really punctuate the story that you can tell with that data. Finally, we'll have a look at automatically generating presentations and PDF reports directly from Python. I have had the unfortunate luck of graduating into a recession twice now. But Python has given me the ability to finish a PhD while working as a consultant and to make these amazing world-class data science portfolios for myself that have now generated so much attention. It's amazing. One of my notebooks has been viewed over 50,000 times. I hope to share this with you. Data science for me is this super exciting new field and Python is very accessible. So I hope to see you in class.

2. Class Project: Welcome to class and thanks for checking it out. I'm really happy to have you. This class will be in bite-sized videos that are part of larger chapters, because then you can come back and have a look at the small details and not have to search in the larger videos. Each chapter will be one of the steps in the data science workflow. And because data science is an applied science, this class has a project. In this project, you'll recreate what we do in these video lectures, go through the entire data science workflow, and in the end generate a PDF or a presentation with your findings. It's good to do this on your own data, or on a dataset that I provide. On top of that, I make all of these notebooks available to you so you can code along during the videos, because it's best to experiment. Sometimes you see something and you want to create something different, or you want to understand it better. And then experimenting with the code that I have on screen is really the best way to do it. For the first couple of lectures, I want to make sure that everyone has an equal starting ground. We'll have a look at the tools.
We'll have some introductory lectures where we really get everyone up to speed. Then we'll start with the dataset and go through loading, cleaning, exploratory data analysis, and all the way to machine learning and report generation.

3. What is Data Science?: In this class we'll look at data science from two different perspectives. There's one where we'll have a look at what actually constitutes data science: what are the important fundamentals? And there's the other one, the process approach: how do you actually do data science? Defining data science is a bit of a beast because it's such a new discipline that everyone has a bit of a different view on it. I like the way that Jim Gray, the Turing Award winner, basically defines it as a fourth paradigm of science, and says that data science, or information technology, changes everything about science. And I think the impact of data-driven decisions on science and business has shown that.

One of my favorite ways to look at data science is the data science hierarchy of needs by Monica Rogati. She really defines it as this pyramid, with all the base-level needs at the bottom and then more niche needs as you go higher. At the very base of that hierarchy of needs is collecting data. And we have to be aware that already in the collection process, bias gets into our data. A lot of people like to think that data is unbiased, that it is true. But that's really not the case. A lot of times even the physical systems bias our data while we are collecting it. Then we go on to level two, moving and storing data. So making sure that we have reliable storage and a reliable flow of data, and having an ETL (extract, transform, and load) process in place to really have the infrastructure for data science. The next level, level three, is exploring and transforming data. So doing anomaly detection, cleaning, and preparing our data for the actual analysis.
Step four is aggregating and labeling the data, deciding on the metrics that we'll use, and looking at the features and the training data. The penultimate step is doing the actual modelling. So doing A/B testing, testing between one version of the website and another, and experimenting with simple machine learning algorithms to gain insights, to model the data, and to make predictions based on it. The tip of the pyramid has AI and deep learning. So the really juicy stuff, but also the stuff that most companies actually don't need. This roughly summarizes how much time you should spend on each step within the pyramid as well. So if you don't spend any time acquiring data or thinking about data, then you'll probably have a problem down the line.

Another way to look at data science is asking questions. The data science process is fundamentally about asking questions about your data, and it's a very iterative approach. In the beginning, you pose the question. Then you acquire the data. But how is the data actually sampled? This goes into biased data. Which data is relevant? Then you go on to explore the data in the exploratory data analysis, and re-inspect. Sometimes you have to go back. It's an iterative process. During exploration, you'll see that some other data source would really help the information you have in your data. So you go back and forth between steps. Then you model the data: you build a simple machine learning model, just like in the hierarchy of needs, and really gain insights by modeling your data with simple algorithms. Finally, and this is not part of the hierarchy of needs, but it is definitely part of the data science process, there is communicating your results. What did we learn? How can we make sense of the data? What are our insights? And how can we convince people of those insights? Because we all know that sometimes knowing the truth isn't enough.
You have to tell a compelling story to convince people of your science and to really make an impact with your data science. So this class will show you the entire process and also how to generate those stories out of data.

4. Tool Overview: Let's have an overview of the tools that we're using in this class. Obviously, everything data-science-related will be universal. But learning Python is also extremely valuable for your skill set. Python has gained a lot of popularity because it's free, it's open source, it's very easy to use, and it can be installed on pretty much any device. So Mac, Windows, Unix, even your phone, not a problem. And Python is code for humans. A lot of places, Google, YouTube, Instagram, Spotify, all use Python at least in part, because it's so easy to get new people on board with Python. If you write good Python code, it can almost be read like text.

We'll install Python 3.8 using the Anaconda installation. Anaconda is nice because it distributes a lot of the data science packages that we need, and it's free. If you're on a later version of Python, that should be completely fine as long as it's Python 3. You may be wondering if you need to install some kind of IDE or some kind of compiler for Python. That's not the case. We'll be using Jupyter, which is a web-based interface to Python and makes teaching Python and learning Python extremely easy. Going from that, you can always move to another editor. One of my favorites is VS Code. It's gotten really good for Python development, and VS Code actually comes with an extension for Jupyter as well. But that's for another day.

At the base of anything we do is NumPy. It is the scientific computing library in Python. We won't be interfacing with it directly, but I want you to know it's there. So whenever you need to make some kind of calculation, you can do it in Python.
It has been used to find black holes. It is used for sports analytics and for finance calculations. And it is used by every package that we will be using in this course. You will quickly notice in this course that everything we do depends on pandas. Pandas is this powerful tool that is kind of a mix between Excel and SQL for me. It's really a data analysis and manipulation tool. We store our information, with mixed columns, in a DataFrame. And this DataFrame can then be manipulated, changed, and added onto, all within this tool.

For the machine learning portion and the model validation, we'll be using scikit-learn and libraries built upon scikit-learn. Scikit-learn has changed a lot about how we do machine learning and has enabled part of the big boom in machine learning interest that we see in the world right now. Matplotlib is a data visualization tool, and we'll mostly be using libraries that build upon Matplotlib. But it's very important to know it's there, and it has an extensive gallery of examples where you can have a look at what you'd want to build. Seaborn is one of these libraries that build upon Matplotlib. It's extremely powerful in that it often takes a single line or a couple of lines to make very beautiful visualizations of your statistical data.

These are the cornerstone tools that we'll be using in data science. They are open source, they are free, and they're the big ones. We'll be using a couple of other smaller tools that I've grown to really like as well, but I'll introduce them along the course. The documentation of these open source tools is amazing, because it's also built by volunteers like me. I've written part of the pandas and the scikit-learn documentation, and you'll find that it's really helpful, with small nifty examples that'll really make you understand the code better. If you're using these in a corporate setting, they're still free.
But consider becoming an advocate for sponsorship, because these packages really rely on having paid developers and core maintainers.

5. How To Find Help: It can feel really daunting doing this course. I totally understand. I'm constantly learning. I'm doing these courses too, and being alone in these courses is terrible. But Skillshare has the project page where you can ask for help. In this class we'll also have a look at all the different ways you can find help and learn to help yourself. Because every programmer will tell you that they got increasingly better at programming once they learned how to Google for the right things.

To start out, we'll have a look at the Jupyter notebook, because the Jupyter notebook directly wants to help us. If we have any kind of function, even the print function, we can use Shift+Tab. When we hit it once, it opens up the basic description, so we get the signature of our function. That means print is the name, this is the first argument, and the dot-dot-dot just means there can be more. And these are the keyword arguments. It also gives back the first sentence of the documentation in the docstring. We can hit Shift+Tab several times: two times opens up the entire docstring. Three times makes the docstring stay open longer, and you can click here as well. And four times will cast it to the bottom here. So you have it available while you're working, and you can pop it out here into its own side, but also just close it down.

In addition, we'll be working with pandas. When we start typing, we can often hit Tab for autocompletion. This is really my personal way of being a bit lazy when typing. So when I want to import pandas, I can hit Tab and see which kinds of things are installed. Import pandas as pd. Executing right here, I'll do with Control+Enter to stay in the same place. Shift+Enter is going to execute it and get me to the next cell.
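As a rough code equivalent of the Shift+Tab pop-up and Tab completion described above, here is a minimal sketch that works outside Jupyter too. It only uses Python built-ins; the variable names are made up for illustration, and the exact docstring text depends on your Python version.

```python
# The docstring is what Jupyter shows when you press Shift+Tab on a function.
# help(print) would display the same text in an interactive session.
doc = print.__doc__
first_line = doc.strip().splitlines()[0]
print(first_line)

# dir() lists the names available on an object, which is roughly
# what Tab completion offers after typing "some_string." in a notebook.
string_methods = [name for name in dir(str) if not name.startswith("_")]
print(string_methods[:5])
```

The same two calls work on any object, so `pd.merge.__doc__` or `dir(pd)` would be the pandas counterpart once pandas is imported.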
And here, pd is now our pandas. When I hit period and Tab, it'll open up all the available methods on pd. So here I can really check out anything. If I want to merge something, I can type the parenthesis, Shift+Tab into it, and read how to merge. Now, this can be a bit rough to read, even though we can put it all the way at the bottom here. That is why there is the pandas documentation, which is essentially built from the docstring with a little bit of formatting tricks. You can see right here what this is, the signature of it. You can read into it and even have a look at the examples and copy over the examples. One thing to know in software is that you don't really have to type out these kinds of code examples. You can just copy them over and say, alright, I needed this. Now I have a nice DataFrame with ages, et cetera. Copying something like this is super common. It is just what we do in software.

The next way to get help is Google. I sometimes make the joke that in interviews you should just have people Google "Python" and see if it shows snakes or if it shows the Python logo. Because at some point Google starts to get to know you and shows you more and more Python. It's a good way to see that you have a lot of experience in Python. So when you want to ask any kind of question, when you're stuck with anything, like you have a very obscure data format that you want to load, or you just have an error that you don't really know what to do with, you copy it over. Let's say you have a type error, for example. Just have a look here, and there is usually a highlighted result. But of course, Google always changes, and you are often led to the docs. In this case it's the Python docs. And then one of the links is going to be StackOverflow as well.
StackOverflow is this website that is extremely helpful, but it's also not the best place for newbies, because some of the best experts in the world are on this website answering your question. But if your question is not well formulated, some of the people on this website can sometimes be a bit unfriendly about it. However, it's great for browsing and for finding solutions, as your question has probably been asked before. If you can't find it on StackOverflow, try changing your Google query a little bit so you find different kinds of results. What kind of type error did you have? Copy over the entire name of the type error and all that. Then you want to scroll down to the answers. This one isn't really upvoted that much, but oftentimes you have an upvoted one that is very, very popular. And sometimes you can even get accepted answers. Have a look at this one. Here you have a green check mark, which means that the question asker has marked this as the accepted answer. You can see right here that people put a lot of work into answering these questions. You have different ways to see this with code examples, and you can really check out what to do next with your kind of error.

Let's go back to Jupyter and close this one out, because this is also something that I do want to show you. In Python, errors are cheap because we can just readily make them. If we have something like this, it'll tell us right away what's going on. There's something weird in the beginning. But what I really do first on any error, however long it is (this is a very short error), is scroll to the very last line and have a look. Oh, okay, this is a syntax error and it says "unexpected EOF while parsing". EOF means end of file. If you don't really know what this is, copy this over, check out Google, and have a look whether Google tells you what this is. Oftentimes the Google search is better than the search on the websites themselves.
And here, it means that the end of your source code was reached before all code blocks were completed. So let's have a look at our code again. If we close the parenthesis here, our code is now complete and it works quite well. Let's generate another error. Something that we can definitely not do is have this string divided by some number. If we execute this, it gives us a type error. So we'll scroll all the way to the bottom and see what's happening right here. It tells you that the division is not possible for strings and integers. Really going through errors is your way to be able to discern why Python does not like what you've written.

Since we're on the topic of help, and I won't be able to look over your shoulder: in the classes that I gave, a very common error that you can catch yourself is that these Python notebooks do not have to be executed in order. You see the little number right here next to each cell, which shows what has been executed and what hasn't. Let's make a small example and add some new things here. Let's say right here I define a, and here I want to define b. And b is going to be a times five. I go through here, I experiment with this. I have a look at pd.merge, have an error here, which is fine. We can leave that for now. Run this code, maybe print something. And you can see these numbers are out of order. This is important later. Then I execute this cell and it gives me an error: NameError, name 'a' is not defined. That's because this cell does not have a number. It has never been executed. So just something to notice: you have to execute all the cells that you want to use. When we run this one and then run this one, this is completely fine. So really have a look at the numbers. The next error, which is very related to this, is that sometimes we change something somewhere, like here. We change a to six. And then we run this code again.
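The "read the last line first" habit can be tried out on the two errors from this lesson. This is a minimal sketch, not the notebook from the video: the variable names are invented, and `compile()` is used only so that the unclosed parenthesis can live inside a runnable script. The wording of the syntax error message differs between Python versions.

```python
# A string divided by a number: Python raises a TypeError.
text = "ten"
try:
    text / 2
except TypeError as error:
    # This is the same information as the last line of the traceback.
    type_error_message = f"{type(error).__name__}: {error}"
    print(type_error_message)

# A missing closing parenthesis: Python raises a SyntaxError.
# compile() parses the source string without running it.
try:
    compile("print('hello'", "<example>", "exec")
except SyntaxError as error:
    syntax_error_name = type(error).__name__
    # Older Pythons say "unexpected EOF while parsing",
    # newer ones report that '(' was never closed.
    print(f"{syntax_error_name}: {error.msg}")
```

Copying the printed line into a search engine is exactly the workflow described above.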
And suddenly we have a look and b is 30, although a is five here. This is one of the big problems that people have with out-of-order execution, and you have to be careful about it. So either you have to track which cells you ran, and especially with execution numbers like 10, 7, 8, 4, 9 this gets really hard to keep in mind. Especially since you can delete these cells, and a is still going to be in memory. So we can still execute this despite the cells not existing anymore. So sometimes you just have to go to the kernel and say "Restart and Clear Output", which clears all of the variables and all of the outputs that we have right here. We can go here, hit this big red button, and now we have a fresh notebook with all the code in here. Now we can execute this in order, get all the errors that we have, and see right here that a is not defined. So we basically have to add a new line here and define a again. That way you can catch a lot of errors in Jupyter by having a look at the numbers right here: did you forget to execute something, or did you do it out of order?

In total, what you want to do to find help in Python is: remember Shift+Tab. Remember that Tab autocompletes your queries and can give you information about what methods are available on basically anything. Then you want to get really good at Googling things. In some of my classes, some of the people that I became friends with laughed at me at some point and said, "Your class could have essentially been 'just Google this'." Because at some point everyone has to Google stuff, and there are some funny posts on Twitter as well of maintainers of libraries having to Google very basic things about their own libraries, because our brains are only so reliable and things change. If you want to have the newest information, there's no shame in looking it up: googling, looking it up on StackOverflow, copying over some kind of code.
You'll be better off for it. Now you have all these tools to find help in Python and help yourself, and this gives you the necessary tools to dive into data science with Python.

6. | Data Loading |: The first couple of classes will be about getting data into Python. Whether you have data in Excel tables or in your SQL database, it doesn't matter. We'll put it into Python in a tool called pandas, which is essentially Excel on steroids in Python. So let's load your data.

7. Loading Excel and CSV files: This is the first class where we touch code. So open up your Jupyter notebook if you want to code along. We'll start with loading data. I have provided some Excel files and CSV (comma-separated values) files, and we'll get into loading them. We could write this by hand, and I'll show you on a much simpler example how to write something like this by hand as well. But luckily, with Python being now over 20 years old, a lot of people have already put a lot of thought into extending Python's functionality. So we can use pandas here and extend Python to load data into Python. What we do here is we just say import pandas. This would be enough, but because we'll be using pandas a lot, we usually give it a shorthand, some kind of alias. pd is a very common one that a lot of people use. Then we execute the cell and we now have pandas in Python. To import or read data, we can type pd.read, hit Tab, and see all the different ways you can load data into pandas. In this course, we'll have a look at the most common ones that I found in my work as a data scientist. But I'll show you how to find the others as well, because if we don't really know what we're doing, we can always have a look at the pandas documentation. While we can have a look at everything that we can do with pandas there, since we have read_excel here already, we can also hit Shift+Tab and have a look at the signature.
And you'll see that this looks eerily similar to the documentation, because pandas and all of Python actually come with their documentation built in. So it's very stand-alone and very user-friendly. In the beginning we just need to give the filename where we actually have the file. This is going to be data/housing.xlsx, the new kind of Excel file. We execute this, and we see we have all this data now in pandas. We didn't save it into a variable right now. But what we usually do if we just have a temporary dataset is call it df, because in pandas this is called a DataFrame. It is basically a representation of a single Excel sheet in Python. Because we want to have a look at it afterwards, we'll just call head on our DataFrame and have a look at the first five rows. We can see the headers and our data right here.

CSV files are a little bit different because CSV files are raw data. Let's have a look here. I have the data. We can actually have a look at CSV, comma-separated values, in Notepad because it's just text. It's fantastic for sharing data between systems, especially if you have programmers that might not have Microsoft Office available. This is the best way to share data. We use pd.read_csv and we can just give it the file name again, so housing.csv. Let's call head right on this one. This should give us the same data, and we can see they are the same.

I want to show you a really cool trick though. If you know some data is online, like this dataset of Medium articles on freeCodeCamp, you can actually call pd.read_csv and just give it the URL. But this is going to fail. I'll show you, because we have to learn about errors in Python, and that's fine. It's totally okay to make errors. Read the last line: ParserError, error tokenizing data. So something like it was expecting something different. And you may already see here that this is not a CSV, this is a TSV file. So someone was actually separating this with tabs.
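The comma versus tab situation can be tried without any files. The following is a minimal sketch under stated assumptions: the two small tables are made-up toy data held in strings (read through `io.StringIO`), standing in for the housing file and the online articles dataset from the video, and the column names are hypothetical.

```python
import io

import pandas as pd

# Toy comma-separated text, standing in for a file like housing.csv.
csv_text = "rooms,price\n3,250000\n4,310000\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())  # head() shows the first five rows (here there are only two)

# Toy tab-separated text with one missing value in the last row.
tsv_text = "title\tclaps\nFirst post\t12\nSecond post\t\n"

# sep="\t" tells pandas that columns are separated by tabs, not commas.
articles = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(articles)  # the missing value is displayed as NaN
```

With a real file or URL, the first argument would simply be the path or address instead of the `io.StringIO` object.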
To pass tabs, we use the backslash-t character as the separator. And we can import this data right from the Internet by just giving the correct keyword. This is something really important to see, really important to learn. If we have a look at this, there are a lot of keywords that we can use. These keywords are extremely useful for cleaning up your data already while loading. You can see right here that there is something called NaN. This means "not a number", and it is something we have to clean later. During the loading of this, we can already have a look at things like: do we want to skip blank lines? So really, pandas is very user-friendly. If you want to experiment with this one, I'll leave it in the exercise section, and you can check out if you can already clean up some NaNs. We'll have a dedicated section on cleaning data later as well. Loading data into Python with pandas is so extremely easy. Try it out with your own data. If you have an Excel file lying around on your computer, remember all of this is on your computer. Nothing gets out. So you can just call pd.read_excel, get your data in, and play around with it. In this class we worked through loading Excel tables and comma-separated values files, and even had a look at how to load data from the Internet. Next class, we'll have a look at SQL tables. If you don't work with them, feel free to skip it. The class after that will be right for you again.

8. Loading Data from SQL: SQL databases are a fantastic way to store data and make it available to data scientists. Working with SQL in depth would be too much here. There are entire courses here on Skillshare that I'll link to. You can find them right here in the notebook as well. However, it's good to have an overview, because it's really easy to load the data once you know how to do it. And if you work with SQL, this will be really valuable to you. Most companies do not store their data in Excel files, because Excel files get copied, and copied again. And suddenly you end up with "final, final, final version".
And it's probably on someone's PC somewhere, maybe on a laptop. Instead, a lot of places have databases. On a server sits a database that contains all the information that you need. Usually this way of accessing information is called SQL, which is short for Structured Query Language. This is a language in itself. It would be too much to explain it in this course. If you want to learn more, there are courses on Skillshare, and there are also resources like this one, which are linked, where you can try it out, do the exercises step by step, and learn how to write a query and get data into Python in an advanced way. For us, it is absolutely enough to once again import pandas. Then we can have a look, and there is SQL down here. There are actually three different functions: a general one, read_sql, then read_sql_query, and read_sql_table. The documentation is usually a very good place to start. We see that there are two kinds of ways. If we scroll down, we can see the difference between read_sql_table and read_sql_query. Let's have a look at the query: this one needs you to write a SQL query. Some of them can be very simple and can save you a lot of space if you have a big database. read_sql_table just loads the entire table from your server. In addition to pandas, we actually want to import SQLAlchemy. Below this, we'll create the connection, which is called an engine. Let's have a look at what we need in here. If you have a PostgreSQL database, we can just copy this. This should be the location of your database. Here we go with read_sql_table, just to make it easy. And now, if you had your SQL database, you can put your table name here, for example sales, and then the connection here. If we wanted to actually use the SQL language, we would have to use read_sql_query. And that means in this case that we need to define a query that goes into our connection. This query can be very, very simple. Of course, this query can be as complicated as you want.
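A minimal sketch of the read_sql_query path, under loud assumptions: instead of a PostgreSQL server and a SQLAlchemy engine, it uses an in-memory SQLite database from the Python standard library, and the sales table, its columns, and its rows are all made up for illustration.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the company server.
# With SQLAlchemy you would instead build an engine, e.g.
# sqlalchemy.create_engine("postgresql://user:password@host:5432/dbname")
# (hypothetical URL), and pass that engine as the connection.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, total_spend REAL, year INTEGER)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("alice", 120.0, 2019), ("bob", 80.5, 2018), ("carol", 42.0, 2019)],
)
con.commit()

# A multiline string keeps the query readable.
query = """
SELECT customer, total_spend
FROM sales
WHERE year = 2019
LIMIT 1000
"""

# Only the rows and columns selected by the query are loaded into pandas.
sales_2019 = pd.read_sql_query(query, con)
con.close()
print(sales_2019)
```

Filtering and limiting inside the query means the database does the heavy lifting and only a small result travels into the DataFrame.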
So we take a multiline string here from Python. We can say: select customers and total spend from sales. And because it's such a big table, we want to limit it to 1,000 entries, because we just want to have an initial look and we don't want to overload our computer. In addition to that, we want to say that the year is 2019. Now we can copy this entire thing over and select our data from our imaginary database right here, using read_sql_query. Hopefully this class showed that SQL can be quite easy. You can just get the table from the database and work with it in pandas. Now, the next class is going to be about how to load any kind of data. And we'll show that pandas makes everything a little bit easier.

9. Loading Any Data File: Sometimes you have weird data. I'm a geophysicist, and I work with seismic data. There are packages that can load seismic data into Python just like our CSV files. In this class, we'll have a look at how to load any data and how to make it available in Python. Pandas is great for tables and structured data like that. But sometimes we have different data formats, like just a text file, or images, or proprietary formats. When I was mentoring a class at the US Python conference, someone asked me about a super specific format that they work with. The first thing I did was google it. There was a Python library for it, and I'll show you how to use the most common Python libraries. We'll use a text file, like the text file we have here. It's a CSV, but it's still a text file, as you can see. What we say is with open, and then we have to give it the place where the file is and its name. Now let's Shift+Tab into this. There are different modes, and the standard mode is r. Let's have a look at what these modes actually mean, because you can open files on any computer, it's just that most programs do it for you: in read mode, write mode, and append mode. So if you're reading data that you don't want to change, you want to make sure this is set to r.
Let's make this explicit. Then we give the file that we opened a name, so we can just call it housing. In Python, whitespace is very important. So now we have a block in which our file is open. Within this block we can, for example, say data = housing.read(), and this reads our data. If we go out of this block, we can actually work with our variable without having the file open. And this is incredibly important. A lot of people who are new to programming don't know this, but most files can only be opened by one person and one program at one time. If two try to access the same data, it will break the data. So it's really important that you open your data, save it into variables loaded into Python, and then close everything. So if we have it here in the data variable and go out of this block, we'll just execute this and go to the next cell. We can do stuff with data, like have a look at what is in data. And we can see right here that this is our text, without having the file open, which is just a very easy and accessible way to do it. We can also have a look at housing, our file handler, right here. And we can see that housing.closed tells us whether housing is closed or not. So right here, we can see that after this block is executed, it is closed. Let's have a look at how this looks inside the block. Inside here, it is not closed. That means we can read different lines and all that. However, instead of just using the standard Python open, we can use a lot of different libraries that also give us file handlers. So I can use something like segyio, which you have probably never heard of before. And that's why I want to show it to you real quick. We just import it. After importing it, we can say with segyio.open, give it the file, name it s, and then load all the physical data into Python. And after the with statement, the file is once again closed and safe.
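The with block described above can be sketched like this. The snippet first writes a tiny stand-in file into a temporary folder (the file content is made up) so it runs without the course data.

```python
import os
import tempfile

# Write a tiny stand-in for the housing CSV so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "housing.csv")
with open(path, mode="w") as f:
    f.write("longitude,latitude\n-122.23,37.88\n")

# Read mode ("r") cannot change the file.
with open(path, mode="r") as housing:
    data = housing.read()            # copy the content into a variable
    inside_closed = housing.closed   # still open inside the block

print(inside_closed)   # False
print(housing.closed)  # True: the file was closed when the block ended
print(data)            # the text is still available in the variable
```

Anything that should outlive the file handle has to be copied into a variable inside the block, as done with `data` here.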
So this is a very general way to go about loading your data into Python. As you can see here, our CSV doesn't look as nice as it does in pandas. But with a bit of processing, we can actually make it look as nice as in pandas. We can split it, for example, on the newline characters, which is backslash n, and we can see that this already gives us all the lines in here. And we can go on and split up each of these lines on the comma, because it is comma-separated, and so on and so on. But that's why I showed you pandas first, because it's so much easier. I think it's very nice to go to these high-level abstractions first, but also to see how to do the work in the back. In this class we had an overview of what the with statement does and how to load any kind of data: search for data loaders for the weird formats that we sometimes have. And I think we definitely saw how easy pandas makes it for us, because splitting a CSV file like that is really cumbersome. And then cleaning the data, like missing values, is even worse. In the next class we'll have a look at huge datasets. So what happens when our files become so large that they don't fit into memory anymore? How can we load this data, and how can we deal with it?

10. Dealing with Huge Data: It is quite common, especially in larger companies, that you have datasets that do not fit into your computer's memory anymore, or that a calculation on them would take far too long, in some cases longer than the universe has existed. So that means we have to find ways to work with the data, either to make it small in memory, and we'll talk about that, or to sample the data so you have a subset. Oftentimes it is valid to just take a sample, a representative sample of big data, and then make calculations, do the data science, on that. And this is what we're getting into.
We'll import pandas as pd, and then we'll load our data into the df DataFrame with read_csv. We do this explicitly now because we can change it later to see the differences between different loading procedures and how we can optimize our loading. df.memory_usage gives us the memory footprint of our loaded DataFrame. We have to say deep=True because we have some objects in there that have to be measured. You see right here that ocean proximity is quite a lot larger than everything else. That's because ocean proximity contains string data, and we know it is categorical. We'll have a look at the head real quick. Right here: it is categorical and everything else is numbers. Numbers are very efficient, but having strings in there can be very memory intensive. If we have a look at the dtypes, so the datatypes, we see that right now ocean proximity is just an object. Everything else is float, so a number. The object right here is what makes it so large in memory. We can change the datatypes of our DataFrame after we loaded it. We'll do this by saying df of ocean proximity, because we want to change ocean proximity. Copy all of that, and we'll overwrite our ocean proximity with this, dot astype. And we can use a special datatype that pandas has available, which is called categorical, or category. Let's see whether this improves our memory usage, with deep=True, so we see the memory footprint of the columns. And we can see that this improves the memory usage of ocean proximity significantly, even below the usage of the numbers. This is how you can make your DataFrame more optimal in a simple way. Now an obvious problem with this is that we already have the data in memory and then we're changing it. So the memory footprint while loading is still large; we're just reducing it afterwards. What we can do is change the datatypes at load time. So let's have a quick look at the docstring. And there we go. It's dtype.
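As a rough sketch of the memory comparison above: the DataFrame below is a made-up miniature of the housing table, and the column names `median_house_value` and `ocean_proximity` are assumed from the lesson's spoken names.

```python
import pandas as pd

# Made-up miniature of the housing data: one numeric and one string column.
df = pd.DataFrame({
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0] * 250,
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR OCEAN", "INLAND"] * 250,
})

# deep=True measures the strings themselves, not just the pointers to them.
before = df.memory_usage(deep=True)

# Convert the repeated strings to the pandas category datatype.
df["ocean_proximity"] = df["ocean_proximity"].astype("category")
after = df.memory_usage(deep=True)

print(before)
print(after)
print(df.dtypes)
```

The category datatype stores each distinct string once plus a small integer code per row, which is why a column with few distinct values shrinks so much.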
We'll assign a dictionary where the key is our column. We'll use ocean proximity again, and the value is going to be the datatype. That means you can specify as many columns as you want. I made a typo there, and a typo in housing; there we go. Using this, you can also assign the integer type to numbers and really change your datatypes at loading time. So, df_small. Let's have a look at the memory footprint of this: df_small.memory_usage with deep=True. And we can see right here that this changed the memory footprint of the DataFrame automatically at loading time. So what if, instead of loading the entire DataFrame with all columns, all features available, we choose to only take a subset of the columns? Maybe we don't need everything. Maybe we don't need the median house price in this one. So we'll define a new DataFrame and load the data as always, but in this case we'll define the columns. So that's columns, and in this case we need a list. Let's have a look: we use longitude and latitude. We could also use total bedrooms or something like that, but we'll just use ocean proximity as before. Just paste this in and edit it, so it's actually one column name per list entry, and add ocean proximity. Now, this is going to go wrong, and I want you to learn that it's absolutely okay to make mistakes here, because in Python mistakes are cheap. We can see a TypeError. It says that it doesn't recognize one of the keywords. And that's because I used columns instead of usecols. I honestly can't remember all the keywords because there are so many, but that's why we have the docstring. And corrected. Looking at the DataFrame, we only loaded longitude, latitude, and ocean proximity. That's another very nice way to save some space while loading, and this way we can load a lot of rows with only a few columns. Sometimes the problem isn't really loading the data though. All the data fits into our DataFrame, but the problem is doing the calculation.
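The two load-time options above, `dtype` and `usecols`, might look like this in a minimal sketch. The CSV text is a made-up stand-in for housing.csv, and the column names are assumptions based on the lesson.

```python
import io

import pandas as pd

# Made-up stand-in for housing.csv.
csv_text = (
    "longitude,latitude,total_rooms,ocean_proximity\n"
    "-122.23,37.88,880,NEAR BAY\n"
    "-122.22,37.86,7099,NEAR BAY\n"
    "-121.97,37.64,1467,INLAND\n"
)

# dtype maps column names to datatypes that are applied while loading.
df_small = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"ocean_proximity": "category", "total_rooms": "int32"},
)
print(df_small.dtypes)

# usecols loads only the listed columns.
df_columns = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["longitude", "latitude", "ocean_proximity"],
)
print(df_columns.head())

# The mistake from the lesson: "columns" is not a read_csv keyword.
try:
    pd.read_csv(io.StringIO(csv_text), columns=["longitude", "latitude"])
    got_type_error = False
except TypeError as err:
    got_type_error = True
    print(err)
```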
Maybe we have a very expensive function, a very expensive plot that we want to make. So we'll have to sample our data, and pandas makes this extremely easy for us. Each DataFrame has the method sample available. You just provide a number, and it gives you as many rows from your DataFrame as you ask for. Let's have a quick look at the docstring. We can define a number or a fraction of the DataFrame. And since it's a stochastic sampling process, you can provide a random state, which is really important if you want to recreate your analysis and provide it to a colleague or another data scientist. Then you have to pass the random state right there. We can see right here that the sample changes every time I execute the cell. But if we set the random state to a specified number, it can be any integer that you want. I like 42. And we see right here that this number is 2048. If I execute this again, this number does not change. So this is a really good thing to get used to. If you have any random process, that randomness is great when you use it in production. But if you want to recreate something, you want to fix that random process so it's reproducible. What I often do is go to the very first cell, where I import all my libraries, and fix the random state there as a variable. Then I just provide that variable to stochastic processes. That makes it a little bit easier, and very easy to read for the next data scientist who gets this. Sometimes you have to get out the big tools though. So we'll mention Dask. We won't use it right here, but you can try it on the website if you go to Try Now. Dask basically has lazy DataFrames, so it doesn't load the entire DataFrame into memory when you point it to the data. But it knows where the data is, and when you want to do the execution, it'll do the execution in a very smart way, distributed over even big clusters.
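The sampling step can be sketched as below. The DataFrame is made up, and `RANDOM_STATE` is just an illustrative name for the notebook-wide variable the lesson suggests defining in the first cell.

```python
import pandas as pd

# Defined once at the top of the notebook and reused everywhere.
RANDOM_STATE = 42

# Made-up data standing in for the housing table.
df = pd.DataFrame({"value": range(1000)})

# Without random_state, each call would return different rows.
sample_a = df.sample(n=5, random_state=RANDOM_STATE)
sample_b = df.sample(n=5, random_state=RANDOM_STATE)
print(sample_a.equals(sample_b))  # the same rows both times

# A fraction works too: here 10% of the rows.
sample_frac = df.sample(frac=0.1, random_state=RANDOM_STATE)
print(len(sample_frac))
```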
In this class, we had a look at how to minimize the memory footprint of data: how we can load less data, or how we can load data more efficiently. I also gave you a quick look at some tools you can use if you want lazy DataFrames, so DataFrames that stay at rest when you load them and then do the computation chunk-wise when you ask for it. It's a smart way to deal with large data at rest. In the next part, we'll have a look at how to combine different data sources, so how we can really flourish and get different information sources to really do data science.

11. Combining Multiple Data Sources: The biggest impact really comes from combining data sources. Maybe you have data from sales and from advertisement, and you combine this data to generate new insights. In this class we'll have a look at how we can merge data, join data together, and append new data to our DataFrame. As always, we'll import pandas as pd and save our DataFrame in df. Now we'll split out the geo data, with latitude, longitude, and ocean proximity, into df_geo. Let's have a look at the head. We can see that's three columns, exactly as we defined. And now we can join it. Joining data sources means that we want to add a column to our DataFrame. So we'll take our df_geo and join a column from the original dataset into it. This is technically cheating a little bit, but it makes it easier to show how we do it. We'll choose the median house price for this one. Let's have a look at the whole DataFrame. And we can put that into our geo DataFrame. We can see how this now contains the original geo DataFrame joined with the column median house value. This is a little bit easier than normal. Normally you don't have all the columns available, so we'll have a look at how to merge DataFrames, where you can be a little bit more specific. Let's create a price DataFrame first, with longitude, latitude, and the median house price.
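A sketch of the split, join, merge, and concat steps of this lesson on a made-up three-row table. The names `df_price` and the column names are assumptions, and the merge here uses both coordinate columns as keys so that every row matches exactly once, which is a choice made for this example.

```python
import pandas as pd

# Made-up miniature of the housing table.
df = pd.DataFrame({
    "longitude": [-122.23, -122.22, -121.97],
    "latitude": [37.88, 37.86, 37.64],
    "median_house_value": [452600.0, 358500.0, 150000.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})

# Split the table into a geo part and a price part.
df_geo = df[["longitude", "latitude", "ocean_proximity"]]
df_price = df[["longitude", "latitude", "median_house_value"]]

# join adds a column by aligning on the row index.
joined = df_geo.join(df["median_house_value"])

# merge matches rows on key columns; "inner" keeps keys found in both tables.
merged = df_geo.merge(df_price, how="inner", on=["longitude", "latitude"])

# concat along rows appends data points; missing columns are filled with NaN.
stacked = pd.concat([df_geo, df_price])

# concat along columns places the tables side by side, matched on the index.
side_by_side = pd.concat([df_geo, df_price], axis=1, join="inner")

print(merged)
print(stacked)
```

If the key columns are named differently in the two tables, `left_on` and `right_on` can be used in place of `on`.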
What we'll do now is merge both of these into one DataFrame. So we take the geo DataFrame and call df_geo.merge. Let's have a quick look at the docstring for how to actually do this. We want a left DataFrame and a right DataFrame, and we define a method for how to join these. The inner method means that we only keep the data that is available in both left and right. Let's have a quick look at the left and the right DataFrame. The natural join is the inner join, so only the rows that are in both DataFrames are there. The left join is everything from the left and only the matching ones from the right. The right join is everything from the right and only the matching ones from the left. The outer join is everything, so we fill it up with a lot of NaNs. And we have to define the column that the left and the right DataFrame are merged on. We'll take latitude in this case, so we have something that we can actually combine our datasets on. In your own data sources, the left and right key columns should contain the same data, but they can have completely different names, and that works quite well. You can see that everything is now merged. We can also concatenate our data. That means we'll use pd.concat, for concatenate, and provide the DataFrames that we want to combine into a larger DataFrame. In this case we have two, but we can combine as many as we want. Right now, you see a good way to add new data points to the rows of the DataFrame. Wherever you don't have data, NaNs are provided. However, since we want to join the data, we provide a join and the axis. And you can see everything is now joined into one large DataFrame. In this class, we had an overview of how to combine different data sources and generate one big DataFrame, so we can do a combined analysis. That concludes our data loading tutorial. In the next chapter, we'll have a look at data cleaning, probably the most important part of data science.

12. | Data Cleaning |: After loading the data, we have to deal with the data itself. Any data scientist will tell you that 90% of their work is done in the cleaning step. If you do not clean your data thoroughly, you will get bad results. That's why we spend a lot of time having a look at missing values, outliers, and how to get rid of them, and at how to really improve our dataset after we loaded it. Because sometimes the measurements are faulty, sometimes data goes missing or gets corrupted, and sometimes we just have someone in data entry who isn't really paying attention. It doesn't matter. We have the data that we have, and we have to improve it to a point where we can make good decisions based on the data.

13. Dealing with Missing Data: The first step in the data cleaning process for me usually is looking at missing data. Missing data can have different sources. Maybe that data isn't available, maybe it got lost, maybe it got corrupted. Usually it's not a problem; we can fill in that data. But hear me out. I think oftentimes missing data is very informative in itself. So while we can fill in data with the average or something like that, and I'll show you how to do that, oftentimes preserving the information that there is missing data is much more informative than filling in that data. If you have an online shop for clothes and someone never clicked on the baby category, they probably don't have a kid. That is a lot of information you can take just from that information not being there. As usual, we'll import pandas as pd. And this time we will import missingno, the library, as msno. And we'll read the data into our df DataFrame. Missingno is a fantastic library that helps visualize missing values in a very nice way. When we have a look at df, we can see that total bedrooms has some missing values in there. Everything else seems to be quite fine.
And when we have a look at the bar chart, we can see that too. To really see how well this library works, we have to look at another dataset, and there is an example dataset in missingno that we'll load now. We'll load this data from Quilt. You would need to have this installed as well, but down in the exercise you can see how to get this data. We will load this New York City collision data. It is vehicle collisions, which we'll get into our variable. This data has significantly more missing values. We'll have a quick look. There are a lot of very different columns, and we can already see that there are a lot of NaNs for us to explore with missingno. We'll replace all the "nan" strings with the NumPy value np.nan. NumPy is the numeric Python library that provides a lot of utility, and np.nan is just a native data type with which we can represent not-a-number in our data. This is the same thing that pandas does when you tell it which values to treat as NaN. In my data, this can oftentimes be -999.25. But it can be anything really, and you can specify anything you want, which is then replaced with NaN, so you know it is a missing value. So let's have a look at... yeah, I'll leave that for later. Let's have a look at the matrix. We see there are more columns in here, and the columns are much more heterogeneous. We have some columns with almost all values missing. And on the side we can also see which row has the most values filled out and which row has the fewest values filled out. Sorry about that being so low. Let's have a look at the bar chart. We can see which columns have the most data filled out and which have the most missing data. Now, the dendrogram is a fantastic tool to see relationships in missing data. The closer the branching is to zero, the higher the correlation of the missing values. So that means on the top right, you can see a lot of values that are missing together.
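A small sketch of the non-visual side of this lesson: turning "nan" strings into real missing values, counting them, and the median fill into a new column that the lesson moves on to. The table is made up, and the column names are assumptions; the missingno plots themselves are left out here.

```python
import numpy as np
import pandas as pd

# Made-up table where missing values arrived as the string "nan".
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": ["129.0", "nan", "190.0", "nan"],
})

# Replace the "nan" strings with the real missing-value marker np.nan,
# then convert the column to numbers.
df["total_bedrooms"] = df["total_bedrooms"].replace("nan", np.nan).astype(float)

# Count missing values per column.
missing = df.isnull().sum()
print(missing)

# Fill the gaps with the median in a NEW column, keeping the original intact.
median_bedrooms = df["total_bedrooms"].median()
df["total_bedrooms_corrected"] = df["total_bedrooms"].fillna(median_bedrooms)
print(df)
```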
This is an easy way to count all the values that are missing in this DataFrame. Let's switch back to our original DataFrame, the house prices, where we can also just count the null values. We can see that total bedrooms is the only column that has missing values, with 207. So in addition to looking at missingno, we can get numeric values out of this. Let's have a look at total bedrooms right here and add a new column to our DataFrame, which is total bedrooms corrected. I don't like overwriting the original data; I'd rather add a new column to my dataset. Here we say: fill our missing values with the median value of total bedrooms. Because total bedrooms is a count, the mean value, the average, doesn't make sense; we'd rather fill with the median, the middle value of the bedroom counts. There we go. This would be the mean, and this is the median. Luckily, pandas makes all of those available as methods, so it's very easy to use them. We'll replace it in place this time, but you have to be careful with that. It's sometimes not the best practice to do this. And now we can see total bedrooms corrected does not have any missing values. When we have a look at total bedrooms and total bedrooms corrected right here, we can see that these are the same values. The values that were not NaN did not get changed. Only the values with NaN were replaced. In this class, we had a look at missing numbers. So what happens when we have missing data? Can we find relationships between missing values? Does some data go missing when other data also goes missing? Is there a relationship in the missing values themselves? In the next class, we'll have a look at formatting our data and also removing duplicates, because sometimes it's very important to not have duplicate entries in our data, so we can actually see each data point for itself.

14. Scaling and Binning Numerical Data: In this class, first we'll have a look at scaling the data.
That is really important because sometimes some of our features are in the hundreds and other features are in the tens, or have decimal points. Comparing those features can be really hard, especially when we're building machine learning models. Certain machine learning models are very susceptible to the scaling of the features, so bringing them onto the same numeric scale can be beneficial for making a better machine learning model. I'll introduce each scaling method as we use it, so we can learn it in an applied way. The second part of this class is going to be binning data. That means assigning classes to data based on the numeric value. In this example, we'll use the house value and assign it low, medium, high, and luxury, just to have an example of how we can assign classes based on numbers. And you'll see this can be done with different methods that give different results. As per usual, we're importing pandas as pd and getting our housing data into the df DataFrame. Let's make a little bit of space so we can actually scale our data. We'll start with a very simple method, where we scale our data between the minimum and the maximum of the entire data range. So the modified x is going to be x minus the minimum of x, divided by the range, so the maximum of x minus the minimum of x. That gives us a value between 0 and 1 for the entire column. We'll choose the median house value for this one, so df.median_house_value is our x. We'll have to copy this a few times, so I'm just going to be lazy about this: x minus the minimum of x, divided by the maximum of x minus the minimum of x. And we have to use parentheses here to make this work, because otherwise it would just divide the middle part. You can see our scaled version right here in the new column, which we'll name median house value minmax. Right here, we can clearly spot that I made a mistake by not adding parentheses to the top part.
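The min-max formula, scaled x = (x - min) / (max - min), and the parenthesis pitfall can be sketched as follows. The four house values are made up, and the new column name is an assumption based on the spoken name.

```python
import pandas as pd

# Made-up house values standing in for the median_house_value column.
df = pd.DataFrame({"median_house_value": [100000.0, 250000.0, 400000.0, 500001.0]})
x = df["median_house_value"]

# Correct: parentheses around both the numerator and the denominator.
df["median_house_value_minmax"] = (x - x.min()) / (x.max() - x.min())
print(df)

# The mistake: without parentheses on top, only x.min() is divided,
# so the result stays on roughly the original scale.
wrong = x - x.min() / (x.max() - x.min())
print(wrong)
```

Checking the minimum and maximum of the new column is a quick sanity test that the scaling landed between 0 and 1.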
So when I add parentheses here, we can see that the data actually scales between 0 and 1. Now we can do some actual binning on the data. There are several options available for binning as well. We'll use the first one, which is the pd.cut method, where you provide the bin values yourself. Those are discrete intervals into which we bin our data based on thresholds that we set. We're using the minmax column that we just created because that makes our life a little bit easier: we can just define the bins between 0 and 1. We'll have four quarters of the range, and that means we have to provide five values, from 0 to 1 in 0.25 increments. When we execute this, we can see that the intervals are provided. Maybe we don't want those intervals, but would rather provide names for them. In the case of these values, we can say that the first one is quite cheap. Then we have a medium value for the houses, a high value for the houses, and then we are in the luxury segment. Of course, you can define these classes however you want. This is just an example for you to take. Let's make this a little bit more readable and add the commas to the list; otherwise we'll get an error. And now with the labels, we can see that each data point is assigned to a category. Let's actually assign those to a column, price range in this case, and indent it correctly. We can see that we now have a new column with new classes that we would be able to predict with a machine learning model later. The second method we'll look at is the pd.qcut method. This is a quantile cut. We can define how many bins we want, and the data will be assigned in equal measure to each bin. We'll use the data from before, so the house value minmax. Now, in the case of qcut, it doesn't matter which column we take, because the scaling is linear, so that's fine. But to compare, we can see that the top bin is now between 0.515 and 1 instead of 0.75 and 1. We can assign the labels to make it absolutely comparable.
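To make the difference between the two binning methods concrete, here is a sketch on eight made-up scaled values. The label names follow the lesson; `include_lowest=True` is an addition in this sketch so that the value 0 falls into the first bin rather than becoming NaN.

```python
import pandas as pd

# Made-up min-max scaled values, skewed toward the low end.
scaled = pd.Series([0.0, 0.05, 0.1, 0.15, 0.2, 0.3, 0.6, 1.0])

bins = [0, 0.25, 0.5, 0.75, 1]                 # five edges make four bins
labels = ["cheap", "medium", "high", "luxury"]

# cut: equal-width bins with edges we choose ourselves.
by_value = pd.cut(scaled, bins=bins, labels=labels, include_lowest=True)

# qcut: edges are chosen so each bin holds the same number of points.
by_count = pd.qcut(scaled, q=4, labels=labels)

print(by_value.value_counts())
print(by_count.value_counts())
```

With skewed data, cut puts most points into the lowest bin, while qcut moves the edges so that every label is used equally often.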
And we can see right here that there is now a lot more luxury: rows 0, 1, 2, 3, and 4 are luxury instead of high as before. So this makes a big difference, and you have to be aware of how the cuts work. They are really, really useful, but it's something to be aware of. Let's assign that to price range quantile and indent it properly. And we have a new column that we can work with. Instead of doing the scaling by hand, we can use a machine learning library, scikit-learn, for the preprocessing. Because as you saw, sometimes you make mistakes and just forget parentheses. If it's already in a library, using it will avoid these kinds of silly mistakes, which have very severe consequences if you don't catch them. From sklearn, which is short for scikit-learn, we will import preprocessing, and we'll use the MinMaxScaler so we can compare it to the min-max scaling that we did by hand. We use fit_transform on our data. fit_transform first estimates the values and then transforms the data with the MinMaxScaler. Now, right here we get an error. I'm used to reading these errors, but with errors like that you quickly find out what happened; you can google the error message. In this case, I provided a Series and scikit-learn was expecting a DataFrame instead. Let's compare our data. Some values are equal, some are not. This seems to be a floating-point error. Let's have an actual look at it. The first value is False. So we can just slice into our array and have a look at the first values. Right here we can see that the scikit-learn method provides fewer digits after the decimal point. Now, this isn't bad, because our numerical precision isn't that precise, to be honest. So we can use the NumPy method np.allclose to compare our data to the other data. That means our values will be evaluated within numerical precision for whether they match or not. Just copy that over. And we can see, yes, in fact they match.
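The comparison between the hand-written scaling and scikit-learn can be sketched as below, again on made-up values with an assumed column name. The double brackets matter: they select a one-column DataFrame (2D), which is the shape the scaler expects.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Made-up house values standing in for the housing column.
df = pd.DataFrame({"median_house_value": [100000.0, 250000.0, 400000.0, 500001.0]})
x = df["median_house_value"]

# Min-max scaling by hand.
by_hand = (x - x.min()) / (x.max() - x.min())

# The same with scikit-learn. Double brackets give a 2D DataFrame;
# a single-bracket Series would be rejected as 1D input.
scaler = preprocessing.MinMaxScaler()
by_sklearn = scaler.fit_transform(df[["median_house_value"]])

# Compare within floating-point tolerance instead of exact equality.
match = np.allclose(by_hand.to_numpy(), by_sklearn[:, 0])
print(match)
```

Other scalers from the same module, such as StandardScaler or RobustScaler, follow the same fit_transform pattern.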
So within numerical precision, they are in fact equal. Instead of the MinMaxScaler, we can have a look, and there are a ton of preprocessing methods available, like MaxAbsScaler, Normalizer, and QuantileTransformer. One that is very good, and that I use quite often, is the StandardScaler. Choosing that will show you that it is in fact used in exactly the same way: just fit_transform, and you get your data out. Instead of the StandardScaler, if you have a lot of outliers in your data, you can use the RobustScaler. In this class we had a look at different ways to scale our data and how to assign classes to our data based on its values. So we really did a deep dive into how to prepare data for machine learning in the end, and you'll see how we do that in a later class. In the next class, we'll dive into some advanced topics. We'll have a look at how to build schemas for our data, so we can actually check whether our data is within certain ranges or adheres to certain criteria that we said the data has to have. If we automate our data science workflow in the end, this is really important, because right at the beginning we can say that our data is okay, or that our data has changed compared to what it was before and that there is a data quality control issue.

15. Validating Data with Schemas: In this class we'll be looking at schemas. That means when we load our data, we can see if each column that we define fits a certain predefined class or some predefined criteria that we think this kind of feature has to have. We'll be exploring different ways to do this, and what to think about when doing this, so we can automate our data science workflow from the beginning to the end. In addition to the usual import of pandas, we'll import pandera. This is obviously a play on pandas, and it is the library that we'll use in this example to create schemas and validate our DataFrame.
There are other libraries like Great Expectations that you can check out, but in this case, pandera will do. First, we need to create the schema. The schema is basically our rule set for how our DataFrame is supposed to look. In this case, we'll use an easy example with ocean proximity, and we'll make it fail first: we say that the column is supposed to be integers. So we get a SchemaError. And we can see right here, all the way at the end, that it was expecting an int64, and that's not what it got. If we replace this with a string, we can see that now it validates and everything is fine. In addition to the type, we can also provide criteria that we want to check. So we type pa.Check. And since we want to check that ocean proximity only has a couple of values, we copy these values over and say it's supposed to be within this list. If we validate this schema, we see everything is fine. Let's make it fail: delete the near bay entry, and we see that there's a SchemaError because this could not be validated. Let's bring that back and make it work again. Text isn't the only thing that needs to be validated. We can also have a look at numeric values. If we wanted to check for the latitude or the longitude to be in a certain area, that totally makes sense; you can check if it's within certain boundaries. Let's have a look at total rooms and check that it is an integer. Right now it is not. But we can of course make the data load as integer and then validate that our loading always comes out as an integer. So what we'll do is define the column and say it has to be an integer. In this case, obviously, we get a SchemaError because it's a float right now. So we have to do a type conversion, or we have to reload the data with an integer type. We'll get housing.csv, and we'll define the datatype for total rooms to be int. The problem here is that there are int32s and int64s.
So how many bits are in an integer? These have to be the same. When we look at the error of our schema, we can see that it is expecting an int64. So we'll import NumPy and define our loading as int64 right here. And our schema once again validates, because we have now matched the type. So if we do int64 loading at the beginning, we can match this up with the int64 that we expect in our schema. It's just one of those things to be aware of when you are loading. Another way to validate our data is using a lambda function, so an anonymous function that can do arbitrary checks and return true or false values. In this case, we'll start out with housing median age. We'll add a column and add the check. Now, I'm making a mistake here, unfortunately, but you'll see in a second. So we add pa.Check with a lambda, where n is our variable, and we check that n is not None. And we get a type error right here. This is important to note: it is not a schema error. And that's because I forgot to add a type check right here. So we'll check for float. And now everything validates again, because none of the values in the housing median age are None. We can make it fail by removing the "not", and that will break our schema. We can do a lot of other tests, arbitrary function tests, in here, like checking that n squared is at least zero, which it should be if math is still working. There are several reasons why you want to do schema validation on DataFrames or on tables. It is quite common to do this already in databases, and it's good practice to do it in DataFrames. It can be that you just get faulty data, or that the data changes in some way. A very simple example right here is percentages. In geophysics, you sometimes have to calculate porosity, e.g. of rocks, which can be given as a decimal from 0 to 1, or as a percentage from 0 to 100. Both are completely fine, but you have to pick one to get correct calculations afterwards.
So let's create a DataFrame here with mixed percentages, and you'll see that it throws an error if you validate this data. We'll save this DataFrame in df_simple and create a schema for it, requiring all the data to be floats from 0 to 1. So we create the DataFrame schema and add percentages for the column. Really, why we're doing this example is for you to see data other than just the housing data, to see that we can do this on physical data as well, and to make you think about how you can validate that your data is in fact correct. So we'll have a check right here, and we can check that this is less than or equal to one. Once again, we have to validate our DataFrame against the schema, and we see that it fails. The nice thing is that our failure cases are clearly outlined right here. So we could go in manually and correct that data, or we can correct or drop all the data that we know is wrong in our percentages and get our schema validated with the correct input data. So we'll get all the data that is over one and divide it by 100, so that we have only decimal percentages. And now everything validates easily. In this class, we had a look at different schemas and how we can validate our data right from the beginning. And we saw, with a simple example of percentages, why this is so important to do. In the next class, we'll have another advanced strategy, which is encoding, a topic that is quite important for machine learning but can also be applied in a few different ways. 16. Encoding Categorical Data: In this class, we'll have a look at encoding our data. If we have a categorical variable like our ocean proximity, our machine learning process often can't really deal with that, because it needs numbers. We'll have a look at how we can supply these numbers in different ways. In addition, once we've done that, we can also use these numbers in different ways to segment our data. We'll start with the usual pandas.
Then we'll have a look at the ocean proximity, because these are strings, and our strings are categorical data. Machine learning systems sometimes have problems parsing strings, so you want to convert them to some kind of number representation. Pandas itself has something called one-hot encoding, which is a dummy encoding. Essentially, each value in the categories gets its own column where it's true or false. So each value that was near bay now has a one in the near bay column and a zero in everything else. Let's merge this data into the original DataFrame, so we can compare this to other types of encodings and see how we can play around with it. We'll join this to the DataFrame. And we can see right here: near bay is one for near bay, inland is one for inland, and zero everywhere else. Alternatively, we can use the preprocessing package from scikit-learn. Scikit-learn gives us encoder objects that we can use. So we'll assign this one-hot encoder object to enc, and we'll fit this to our data. The nice part about these objects is that they have a couple of really useful methods that we'll now be able to explore. Let's fit this to the unique values that we have in our ocean proximity and then see how this encoder actually deals with our data. After fitting our encoder to our unique values, we can transform our data, if we spell it right. Converting this to an array gives us the one-hot encoding for our unique values, so only a single one in each column and each row. Now, transforming the actual data, so not just the unique values, should give us something very similar to what we saved in the DataFrame further up. We convert this to an array, so we have values in the fourth column. Right here you can see near bay. Same. Now, you may wonder why we're doing this redundant work. But with this encoder object, like I mentioned, we have some really nice things that we can do in a couple of lines, and we can use the array that we have from before.
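The two encodings described above can be sketched side by side. The four-row DataFrame is a made-up stand-in for the housing data, and the encoder is fit on the whole column rather than on the unique values, which gives the same categories.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for the ocean proximity column.
df = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND", "INLAND"]})

# pandas: one indicator column per category, joined back onto the DataFrame.
dummies = pd.get_dummies(df["ocean_proximity"])
df_encoded = df.join(dummies)

# scikit-learn: the encoder expects 2D input, hence the double brackets.
enc = OneHotEncoder()
enc.fit(df[["ocean_proximity"]])
encoded = enc.transform(df[["ocean_proximity"]]).toarray()

# The fitted object remembers its categories, so the encoding can be reversed.
decoded = enc.inverse_transform(encoded)
print(enc.categories_)
print(decoded.ravel())
```

The extra value of the encoder object is that the fitted categories travel with it, so the same mapping can be applied to new data later and undone again.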
I'm going to use NumPy because I'm just more used to dealing with NumPy objects. We can now convert this array back, which is not as easy with other methods. But because we have this nice object that has all these methods available, we can use the inverse transform, provide that array to it, and get back the actual classes, because the object remembers the classes it was fit on. We can also get all the data that is stored within the object without actually providing values to it. So it's really just a neat way to deal with preprocessing. Obviously, sometimes we want something different than one-hot encoding. One-hot encoding can be a bit cumbersome to work with. So we'll have a look at the preprocessing package, and we can see that there are label binarizers and label encoders. But right now we'll just have a look at the ordinal encoder. The ordinal encoder will assign a number instead of the category. That basically just means that it's 0, 1, 2, 3, 4, depending on the number of classes. You have to be careful with this: in a linear model, e.g., the numbers matter. So four would be higher than zero, and four would be higher than three. So encoding it as an ordinal would be a bad idea in a linear model. But right now, for this, it's good enough. If we use a different kind of model later, then we are completely justified in using an ordinal encoder. This marks our last class in the data cleaning section. We had a look at how we can encode information in different ways, so we can use it in machine learning models but also save it in our DataFrame as additional information. In the next class, we'll have a look at exploratory data analysis, so doing the deep dive into our data. 17. | Exploratory Data Analysis |: In this class, we'll have a look at automatically generated reports. And oftentimes that can be enough.
You want an overview of your data and the most common insights into your data. We'll generate these reports, and it'll be reproducible for you on any kind of dataset that you have. This tool is very powerful. Afterwards, we'll have a look at how to generate these insights ourselves as well, because sometimes you want to know more than this report gives you. And also, if it was only about running this utility, data science wouldn't be paid that well. To be honest, this is a good first step. Getting this overview of your data is really important. But then we need to dive deeper into our data and really dig out the small features that we have to find. We'll import pandas and then get our DataFrame into the df variable as we always do. Then we'll import ProfileReport from the pandas-profiling library. And I'm pretty sure you will be stunned by how hands-off this process of generating the report actually is. If you take anything away from this, I think this is it. This utility really takes away lots of things that we usually did manually in pandas. I'll show you how to do those anyway, because it's really good to understand what you're actually doing in the background. But this tool is amazing. You automatically generate all the statistics on your data. You see that it counts your variables and gives you an overview of how many are numeric and how many are categorical. Notice that we did not supply any category features or datatype changes. And we even get information on how our data is distributed. However, it's a bit hard to see in our notebook. That is why we are going to use a notebook-specific version, which is profile.to_widgets. Here we have a very nice overview widget with the same information as the profile report from before. We can see right here that it even tells us the size in memory and tells us when the analysis was started and finished, and how you can recreate this analysis.
It tells you all the warnings, like high correlations. Now, between latitude and longitude, that's fine. Missing values. And then under variables, you can have a look at the distribution of your data. So you can toggle the details and have a look at the histogram. The histogram is also up there in small, but it's really nice to have a large look at it as well. You can flip through all your variables, see which ones have missing values, with the warnings about them on the left, and really get all the information that you need to get an insight into your data. See if there are any common values that show up all the time. Now, this one, with 55 values, really isn't that common. See the minimum and maximum values that you have, so you kind of get a feel for the range. When we have a look at our income, which is more of a distribution, we can see the distribution there as well. And on our categorical feature, the ocean proximity, we can see something very important: island only has five entries. So we have kind of an imbalanced dataset here, in that there are not many homes on the island. Then we'll click over and have a look at the interactions, so we see how one variable changes with the other. If we have a look at longitude against latitude, that's negatively correlated. Longitude against longitude, so the same variable, is always positively correlated. Now, if we have a look at median house value against everything else, we can really see how these interact, how these change against each other. Total bedrooms against households, e.g., is positively correlated. Something good to know. This is just a powerful tool to really see each variable against another. Then we'll click over to the correlations. The standard linear correlation measure between one and minus one is the Pearson correlation. Here we can see what we saw before: a variable with itself, so longitude against longitude, will always be one, and all the other values should be somewhere between one and minus one.
That way you can really see the relationships in the data. Spearman is a bit more non-linear, but usually people prefer Kendall's to Spearman's, and then there's Phik. Phi is a measure between two binary variables. You can use the toggle on the top right to read more about these. Let's have a look at missing values. This may remind you of something that we did earlier. I'm obviously not the only one who thinks the missingno library is awesome, because this tab gives very similar insights. And then we can also have a look at a sample of our data. Finally, leading on from this, we can take our profile report and generate an explorative profile report. This one is more interesting when you have different data types, so if you also have text or files or images in your DataFrame, in your data analysis. So it's really not that applicable right here. In general, however, you can see that this report already goes over a lot of the things that you want to know in your exploratory data analysis. Generally, you want to know the statistics of your data, the correlations in your data, the missing values in your data, and really see how the features interact with each other and which features can predict each other. It's fine if this is the only thing that you take away from this course. But really, let's dive into how we can generate these kinds of insights ourselves in the next classes. I'll quickly show you how to get this into a file. So you have profile.to_file and then give it a name. Then you get this beautiful website where you can click around, and you can share it with colleagues so they can have a look at your analysis. It will say that it is a pandas-profiling report, and that's good. Don't just use this on its own; use it as a starting point to make a deeper analysis and to really inspect your data. But this takes a lot of work away from our everyday data science work. 18. Visual Data Exploration: For EDA, I like to first look at plots.
So we'll have a look at visualizations that give us an intuitive understanding of relationships in the data: relationships between features, correlations, and also the distributions of each feature. We'll be using Seaborn, which makes all of this extremely easy, usually with just one or two lines of code. First, we import pandas as usual and load our data. In addition, we'll load the Seaborn plotting library. Seaborn is commonly abbreviated as sns. The first plot for our data visualization is going to be a pair plot. A pair plot will plot every column against every column, even against itself. So when you plot the total rooms against itself, you will get the distribution of the total rooms. And if you plot it against any other column, you will get a scatter plot. This scatter plot, as well as the distribution, can be very informative. It's one of my favorite plots to do for a visualization. Right here we can see that, e.g., our latitude and longitude data apparently has two spikes. So it seems like our geolocation data is focused around two spots. We can see that there are some very strong correlations in the middle of our plot; that is because we have some linear scattering right here. Every other feature that we see right here is distributed in certain ways. This one, e.g., is scattered all over the place, and we can see some clipping at the edges, so probably someone capped the data at some maximum. In addition to the pair plot, we can create a pair plot that is colored by a class. Right now, the only class we have available is the ocean proximity. In your exploration for the project, it would be really great if you experiment with this, maybe combine this with the binning exercise that we did. It takes a bit for this to load. That's why I only sampled 1,000 samples right now, because we want to get the plot relatively quickly. However, this gives a really good overview of how different classes are distributed against each other.
The legend on the right tells us which color is which. I want to drop the latitude and longitude right now, because those features are strongly correlated with each other, and right now they only take up space in our plots. So we can really make more use of our plot by getting rid of these. Now, in the drop, I have to add the axis, because we want to drop this from the columns. Then our plot should have a few less plots on the grid, so each plot is a little bit larger. That gives us lots of information already. We can see that our data is relatively equally scattered, except for the island data. That island data seems to have a very sharp peak. However, remember that our island data has very few samples, so it really skews the results a lot. Maybe, however, we just want to plot the distribution of our data. For this, we can use the KDE plot, which is short for kernel density estimate. So we'll have a look at how our median house values are distributed. In addition to this plot, we can also once again split this up by hue. Unfortunately, there's no nice built-in way to do this like for the pair plot. So we'll iterate over the unique values in our ocean proximity. This is a bit of a workaround, but I really like this plot, so I'll show you how to do it anyway. In my teaching, this question usually comes up anyway, so I hope this plot works out for you as well. So we'll subset our data, using the rows where ocean proximity is equal to the class, which is our iterator over the unique values. That means we get our plot split up by class. However, right now the legend doesn't look particularly nice. Each legend entry just says median house value. Ideally, of course, we'd want the legend to say the class. So we can provide a label right here that contains our class name. That way we have a nice little plot that has all our distributions.
Well, we can see that inland has a very different distribution than most of the others. And of course, the island is shifted to the right, which indicates a higher price. But once again, there's not a lot of data there, so it's a bit of a skewed result. Now, maybe we want to have a look at more of the scatter plots. Making a scatter plot is very easy. But we can even go a step further. There's something called a joint plot, where we have the scatter plot and, on the sides, we can plot the distribution of the data. So usually a histogram, but you can do different ones as well. These are extremely nice for pointing out how data co-varies. In the case of, e.g., total bedrooms and population, we see a very clear distribution that basically indicates a linear trend, so some kind of linear correlation between the two. This plot is very easy. You just give the features, so the column names, and the DataFrame, and Seaborn plays very well with pandas. Right here, you can also see the distributions, and the labels are automatically applied. This plot has a couple of different options. You already saw that there's a hex option. We can also do a linear regression, so fit a trend line with uncertainty to our data. That way we can really see if a linear model fits our data or if it should be something else. Now, here we can see that outliers skew the results a little bit at least. In addition, we can have a look at a different feature, just to see how our linear regression changes, e.g. This feature seems to be very strongly correlated to total bedrooms. So we replace population with households, and we can see that this is as linear as real data actually gets, I think. If we now copy this over, replace population with households, and then fit a line, we can see that the shade behind the line is basically not visible, so the uncertainty is basically not there on this data. A really nice way to see how our linear regression fits the data.
Instead of the pair plot, we can also do a heat map of the correlation. That just gives us the number representation of our Pearson correlation coefficient. We can see that the diagonal is one, as it's supposed to be. Our latitude and longitude are negatively correlated because the longitude is negative. And in the middle we have a square of strong correlation that we should definitely investigate. That is very interesting. Generally, this is just a good way to inspect your data as well. We can copy this over and play around with it a little bit, just to show you that nothing is baked in here. It's an open playing field to really explore your visualizations. This version, with the magnitude from 0 to 1, now shows us that median income is correlated quite highly with the median house value. And I didn't really see that before. So just checking this out, switching it around a little bit, can give you more insights. Trying out variations on the standard visualizations can be extremely valuable. We can add annotations to this. Now, this is a bit of a mess. So we'll round our numbers to the first decimal and see that this is looking much nicer. You can do this with the original data as well. This class gave an overview of different plots that you can use to understand your data better. In the next class, we'll actually look at the numbers underlying these plots and how to extract specific numbers that will tell you more about your data. 19. Descriptive Statistics: In this class we'll follow up on the visualization that we just did. So we'll have a look at the numbers behind the graphs. Statistics can be a bit of a scary word, but really it's just significant numbers, or key performance indicators, of your data that tell you about the data. The mean, e.g., is just the average of all your data, whereas the median, e.g., is the middle value when you sort your data.
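A small made-up example may help keep these numbers apart. The values are chosen so that one outlier pulls the mean far away from the median, which is the middle value of the sorted data.

```python
import pandas as pd

# Made-up ages with one extreme outlier.
ages = pd.Series([10, 20, 30, 40, 400], name="housing_median_age")

mean_age = ages.mean()      # average of all values, pulled up by the outlier
median_age = ages.median()  # middle value of the sorted data
std_age = ages.std()        # how much the values spread around the mean

print(mean_age, median_age, std_age)
```

Here the mean is 100 while the median stays at 30, which is why the median is often preferred for data with outliers.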
And the standard deviation, so std, just describes how much your data varies. So how likely is it that you find data away from the mean? We'll explore all of this in this class and really do a deep dive into descriptive statistics and how to get them from your data. At the beginning, we'll import our data, and then we can actually just calculate statistics on columns by providing the column. So df.housing_median_age. Then we have the mean, median, and standard deviation available as methods to calculate directly on the data. The mean is the average in this case, and the median is the middle value. Basically, if we want to get aggregate statistics on all of the DataFrame, we just call .describe on the DataFrame or a subset of the DataFrame. This gives us the count, the mean, the standard deviation, and the quartiles of our data. When you play around with this, make sure to check out the docstring for describe. You can do a lot more with it. Then we can group our data. The group-by action has to be done on something that can be grouped by, so we'll use ocean proximity in this case. We can calculate the mean for these groups over each column. This doesn't really make too much sense for longitude, but for all the other values, we can get group statistics this way. In addition, we can use the .agg method, for aggregate. In there, we can basically define a dictionary with all the statistics that we want to calculate on a given column. For longitude, e.g., we'll have a look at min, max, and mean. We can copy this over to use it for other features as well. You're not limited to these, and you can even supply functions to this aggregator. They don't have to overlap, either. So for total rooms, you can change this to be the median value instead of the mean, because, well, that makes a bit more sense. And for our median income, we'll just try and get the skew of our distribution.
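The steps above can be sketched on a small made-up DataFrame. The column names follow the housing data, the values are invented, and the choice of statistics per column is just one possible combination.

```python
import pandas as pd

# Made-up stand-in for the housing data.
df = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR BAY"],
    "longitude": [-119.5, -120.1, -122.2, -122.3, -122.1],
    "total_rooms": [1500, 2500, 900, 1100, 3000],
    "median_income": [2.5, 3.1, 8.3, 7.2, 5.6],
})

# Count, mean, standard deviation, and quartiles of every numeric column.
summary = df.describe()

# Mean of every column within each ocean proximity class.
group_means = df.groupby("ocean_proximity").mean()

# Different statistics per column, given as a dictionary.
stats = df.agg({
    "longitude": ["min", "max", "mean"],
    "total_rooms": ["min", "max", "median"],
    "median_income": ["skew"],
})
print(stats)
```

Statistics that were not requested for a column show up as NaN in the result, because every requested statistic gets its own row.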
Here we can see that the new DataFrame that comes out of this is filled with NaN where no values are available, so where they don't really make sense. But we can really dive into stats here. Another neat little tool to give an overview of columns is the value_counts method. On ocean proximity, e.g., we can call the value_counts method to get an overview of how many samples are in each of these classes. It's very good for getting a feel for how our data is distributed among classes. For the heat maps that we generated before, we needed to calculate the correlation of each column against each other column. We can see right here that we have this data readily available. The corr method also gives us the opportunity to change the correlation that we use. So you can change it to Spearman, e.g., which is very similar to what we had in the automatically generated report. Here you can dive into the data and really see how our data correlates by number. In this class, we had a look at descriptive statistics, so at actual numbers, average values, and how we can extract these specific numbers and make decisions based on them. In the next class, we'll have a look at subsets of that data. So how do we select parts of the data, and how can we calculate these numbers on those parts? Because sometimes, as we saw here, island only has five samples in our entire dataset. So how can we make sure that we extract that data out of our DataFrame and explore it further? 20. Dividing Data into Subsets: In this class, we will be learning how to extract subsets from our dataset, because sometimes, e.g., we only want to focus on one certain location, or we want to focus on one subset of customers. Those segments are really easy to extract using pandas, and I will show you how to do this. So first we'll load our data, and then we'll take our df DataFrame and have a look at the longitude, because we can take our df DataFrame and just perform normal logic on it.
So in this case, we want it to be lower than minus 122, and we get a Series out of it with true and false values, so a Boolean Series. We can use this to choose rows in our original DataFrame. We can see right here that this only returns the selection without changing the original, so we have to assign it to a new variable if we want to keep it. Let's have a look at another way to select subsets. In this case, we want to have a look at the ocean proximity, because selecting subsets of our categories is really important for something we'll do later, which pertains to AI fairness and ethical AI. So we can choose here that only near bay and inland should be in there. We once again get a Boolean Series out of this that we can use to slice into our DataFrame, or get a subset of our DataFrame. We can see this right here, and we can see that it has fewer rows than before. We can also combine different kinds of logic to be arbitrarily complex. What we have to do right here is use the AND operator. But in this case, it has to be the ampersand. The ampersand is a special operator in Python to do bitwise comparisons. You can see right here that the plain "and" will fail, because it cannot compare two Boolean Series element by element; the bitwise operator is just a really short hand that does. You have to be careful to use parentheses in conjunction with a bitwise operator. Here, we'll just play around a little bit with True and False, so you can see how these are combined when we use "and". We'll do the same with an OR operator. But of course we have to take the bitwise operator, which is this, well, pipe symbol. I don't know if you have a different name for it, maybe, but it is on screen, and you have it in your notebook. And here we get the choice of ocean proximity that is near bay or inland, or where df longitude is below minus 122. We have a look at the unique values in our subset of ocean proximity.
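The selections described above can be sketched on a few made-up rows. The column names follow the housing data and the values are invented.

```python
import pandas as pd

# Made-up stand-in for the housing data.
df = pd.DataFrame({
    "longitude": [-122.2, -121.9, -118.3, -122.5, -117.1],
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR OCEAN", "INLAND"],
})

# Each comparison gives a Boolean Series with one True/False per row.
west = df["longitude"] < -122
selected = df["ocean_proximity"].isin(["NEAR BAY", "INLAND"])

# & and | combine the masks row by row; the parentheses matter
# because these operators bind tighter than comparisons.
both = df[(df["longitude"] < -122) & selected]
either = df[(df["longitude"] < -122) | selected]

# The plain Python keyword cannot combine two Series and raises an error.
try:
    df[west and selected]
    keyword_worked = True
except ValueError:
    keyword_worked = False

print(either["ocean_proximity"].unique())
```

The `|` selection keeps the NEAR OCEAN row even though it is not in the list, because that row satisfies the longitude condition.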
We can see that there are values that are not near bay or inland, because they were in the longitude under minus 122. We can also use the .loc method. This is a way to select subsets of our DataFrame by using the names of the indices: the index on the columns and the index on the rows. We can copy this right over. And I'll show you right here where the difference is to the method before, because this will fail, since it expects us to give slices for all indexes. A DataFrame has two dimensions, the columns and the rows. Right now we only gave it the columns. The colon right here is used to just select everything in the row section. And we can of course use this to slice into the rows as well, by using the numbers of the index. Right here, we can see that this selected from the index name 5 to 500. Keep in mind that our index can be anything. We'll have a look at that in a second. Here we can see that this did not change our DataFrame at all. This is just a method that returns a selection. And of course we can also save this in a variable, as always. So the .loc method just works in a different way than our way before. Now let's have a look at indexing, because up there we can see that our index is just a running integer from zero to whatever the maximum number is, 20,640 in this case. However, we can use the .set_index method to change our index to any other column. This is really powerful in that we can assign any kind of index, even text, and select on that text, or in this case, the latitude, instead of just a number. You can still use numbers, and I'll show you afterwards how to do that. But this is kind of a way to change how you think about your DataFrame, because right now our rows are indexed by the latitude. So we can't do what we did before with the number, because our index right now is not the integer anymore. Our index now is the latitude.
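A sketch of label-based selection and of switching the index, on a few made-up rows. Note that a label slice with `.loc` includes both ends.

```python
import pandas as pd

# Made-up stand-in for the housing data.
df = pd.DataFrame({
    "latitude": [37.88, 37.85, 37.85, 34.05],
    "population": [322, 496, 558, 1200],
    "households": [126, 177, 219, 410],
})

# Rows first, then columns: the colon selects every row.
cols = df.loc[:, ["population", "households"]]

# Slicing rows by index label; with .loc both ends are included.
rows = df.loc[1:2]

# Use the latitude column as the index instead of the running integer.
by_lat = df.set_index("latitude")
print(by_lat.index)
```

Neither `.loc` nor `.set_index` changes `df` itself here; each returns a new object that has to be assigned to a variable to be kept.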
So if we choose any kind of number from our latitude, this will work again. Here I have a look at the index and just copy a number out of there, like 37.85. We can then use this to select a subset using .loc, and just use all the columns. We can see right here that this shows everything from our index with that value. You can see that indexes in pandas do not have to be unique. That's something really important to think about when you work with them. Slicing into our DataFrame like that is extremely powerful, because our index will just return the data at that index, whatever sorting we have. So we don't really have to be aware of how our data is structured. Nevertheless, we can use the iloc method, which is basically integer location, where we can still go into our DataFrame and select rows 5 to 500, so up to 499, because the end is exclusive. We can also use this on the columns. So if we think we know exactly where everything is, we can use this kind of number slicing as well to get our subsets. I usually recommend using .loc, because with .loc you can always be sure, regardless of sorting, that you'll get back the things that you want, with the exact index that they have. You don't have to make sure that the sorting of your DataFrame is correct. Right here we can see that latitude is now not part of our columns anymore, because we have assigned it to be our index. Now, if we want to get latitude back into our columns, we can do that as well by resetting the index, and then our index will be back to just running integers. This also works when you have re-sorted it. So you can reset the index back to going from 0 to 500, or whatever your maximum number is, when you have changed the ordering of your rows. This is really important to think about when you're doing index slicing: you can always change the sorting of your data, but when you use .loc, you will be able to retrieve exactly what's on the index.
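The difference between label lookup and position lookup can be sketched on a few made-up rows indexed by latitude. The duplicate latitude is deliberate, to show that an index does not have to be unique.

```python
import pandas as pd

# Made-up stand-in for the housing data, indexed by latitude.
df = pd.DataFrame({
    "latitude": [37.88, 37.85, 37.85, 34.05],
    "population": [322, 496, 558, 1200],
    "households": [126, 177, 219, 410],
})
by_lat = df.set_index("latitude")

# Label lookup: returns every row whose index value is 37.85.
same_lat = by_lat.loc[37.85, :]

# Position lookup: rows 1 and 2, the end of the slice is excluded.
by_position = by_lat.iloc[1:3]

# Move latitude back into the columns and restore the running integer index.
restored = by_lat.reset_index()
print(restored.index.tolist())
```

After sorting or shuffling the rows, `same_lat` would still hold the same two rows, while `by_position` would depend on the new order.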
On the topic of selecting columns: of course, we can do the standard way of just providing the columns we want. But sometimes your DataFrame gets very wide. If you think back to the missingno example, we had, I think, over 20 columns. So selecting all the ones you want can be really cumbersome, to be honest. We can also go the other way and select which columns we do not want. That is done with the drop method. We provide the names of the columns that should be dropped from the DataFrame. Right here, we'll just take the inverse, dropping longitude and population, and provide the axis that we want to drop from, because we can also drop rows. Right here, you can see how you can change around a lot of the things as well. You can also do in-place dropping if you want to change the DataFrame directly in memory. And right here you can see that we can do the exact opposite from what we did before by dropping rows. Overall, we're doing this because if you select subsets of your data, you can do analysis on the subsets. So if we just use the describe method from our descriptive statistics, we can see right here, e.g., the standard deviation and the mean of all the columns. We can of course also call the describe method on a subset of our data and see how our descriptive statistics change. You can then start plotting on these subsets and do your entire data analysis on these subsets. This class really went deep into how we can select subsets of our data and decide what to take based on features, but also on indexes. We had a look at how to switch indexes and how to reset them again, because that is really important when you want to do your exploratory data analysis and have a closer look at some subsets of your data. In the next class, we will be looking at how we can find relationships in our data and really focus in on what to extract. 21. 
Finding and Understanding Relations in the Data: In this class, we'll have a look at the relationships within our data. So we'll really check out how correlation works within our data, but go beyond this as well. So we'll go beyond linear correlations and really dive deep into dissecting our data. We'll start out again by importing pandas and loading the DataFrame. We can see that .corr is really central to doing correlation analysis in pandas. We can use .corr and change the correlation coefficient that we actually want to use. Now, the standard Pearson correlation is a linear correlation. Spearman and Kendall use a rank correlation, which can also capture non-linear, monotonic relationships. In addition to calculating these aggregate correlations, sometimes you just want to find out how one column is correlated with another. Here we can simply take that column and calculate the correlation with another column. We can even take this one step further. Machine learning tools have become really easy to use in the past ten years. And we can use a machine learning tool to basically predict one feature based on the other features. If we do that with every feature, we can actually see how informative one feature is about another. This has been built into a neat little tool that we can use here, called Discover Feature Relationships, or Beyond Correlations. It has recently changed its name, so you'll be able to find it on GitHub as well. And this means we can use the discover method of this library to really dive into the relationships in our data. So we use the discover method on our DataFrame. We can supply a method or a classifier, but in this case we'll just leave it on the default. You can play around with this later if you're interested. It takes a few seconds to execute this, so we will just use a sample from our DataFrame to make it a little bit faster. You can let it run on larger samples.
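The correlation calls described at the start of this class can be sketched as follows. The data is synthetic (an assumption made so the example is self-contained), with an exponential relationship chosen so that the rank correlation comes out higher than the linear one; the column names only mirror the housing data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: house value grows exponentially with income.
rng = np.random.default_rng(0)
income = rng.uniform(1, 10, size=200)
df = pd.DataFrame({
    "median_income": income,
    "median_house_value": np.exp(income) + rng.normal(0, 1, size=200),
    "population": rng.integers(100, 5000, size=200),
})

# Full correlation matrices; Pearson is the default and measures linear correlation.
pearson = df.corr()
# Spearman is a rank correlation; method="kendall" is accepted in the same way.
spearman = df.corr(method="spearman")

# Correlation of a single column with one other column.
pair = df["median_income"].corr(df["median_house_value"])
```

On data like this, the Spearman value for income against house value sits close to 1, while the Pearson value is noticeably lower, because the relationship is monotonic but not a straight line.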
And we get how one feature predicts another feature right here. And we get that for every feature. We can use the pivot tables that you may know from Excel to get out a full-grown table that will give you all the information you need, very similar to the correlation. However, the diagonal is not filled out, so we'll just fill that with ones, because you can easily predict a feature from itself, of course. Then we'll go on to plot this, because looking at this as a plot is always quite nice, just like we can look at the heat map of the correlations. This is very similar to the correlations, except that we use machine learning to cross-predict this time. So we'll save this into a variable and then make a nice plot of all of those. We can see that, as opposed to the correlation plot, this is not fixed between -1 and 1. So we'll fix that real quick, and then you can really see how each feature can be predicted from the other features. We do this fixing from minus one to one by using vmin and vmax. And there we see it. So e.g. analyzing how our population can be predicted by anything else is a really good way to see relationships within the data, where you can dig in further into why something is predictive or not. Really a nice tool for data science. This was the last class in our chapter on exploratory data analysis. We had a look at how we can extract information about correlations and relationships in our data. In the next class, we'll actually look at how we build machine learning models. So something that we already used implicitly here, we'll now learn how to actually apply. 22. | Machine Learning |: In this chapter of the data science process, we'll have a look at machine learning. Specifically, we want to model our data and find relationships in the data automatically. Machine learning models are so-called black box models. That means they don't have any knowledge of your data.
But when you show them the data and what you want to get out of the data, they will learn the relationships and how to categorize, or how to find the right kind of numbers, so do a regression with your data. Machine learning is really powerful and super easy to apply these days, which is why we'll spend a lot of time on validating our models as well. Because these models tend to learn exactly what you tell them to learn, which might not be what you want them to learn. And validation is the due diligence for you to do, to make sure that they actually learned what you want them to learn. So let's fire up our notebooks and have a look at machine learning. 23. Linear Regression for Price Prediction: Welcome to the first class in the chapter on machine learning. We'll have a look at how to build simple models. Because in machine learning, often the rule is: the simpler the model, the better. Simple models are easier to interpret and are often very robust to noise. So let's dive into it. After loading our data, we can import the linear regression model, because we want to predict house values in this exercise. However, before that, we have to prepare our data in a certain way. We need to split our data into two parts. We want one training part, and we want one set of data that the model has never seen during training time, so we can validate that our model learns something meaningful. This is to avoid an effect that is called overfitting. That is when our model basically remembers the training data and does not learn meaningful relationships in the data that it can then apply to new data it has never seen. So we take our DataFrame and we split it into two parts randomly. We could of course do this with the subsetting that we did in the previous section. However, taking a random sample that is absolutely sure to not overlap in any way is a much better way.
And the train_test_split function that scikit-learn supplies is really good for this, and it has some really neat other functions that we can use. This is also a really nice way to select our features. For the simple model, we'll just use the features housing median age and total rooms as our training features. And the house value is going to be our target. Those are usually saved in X and y. So we know we have X train and X test, and then we have y train and y test. This is quite common. And we'll have a look at the shapes. So we have a bit over 20,000 rows here. Our train data is going to be about 75% of that, with 15,000 values. And our y train should have the same number of targets, because those are sampled randomly but in the same rows. So the data obviously matches. Our X test should now have the remaining rows that are not in the train set. Now, doing this is extremely important, and there's no way around splitting your data for validation. Now it's time to build our model. Our model is going to be the linear regression model that we imported before. And scikit-learn makes it extremely easy for us to build models. We just have to assign the object to some variable. In this case, we'll just call it model. And you can see that you can change some of the hyperparameters in the model, but we'll keep it at the defaults right now. Now we fit our model to the data. This is the training step, where our model is automatically adjusted and the parameters in our model are changed so that our model can predict y train from X train. And to score our model, so to test how well it's doing, we can use the score method on our fitted model and provide it with data where we know the answers as well. So we can use X test and y test to see how well our model is doing on unseen data. In the case of regression, the score is going to be the R-squared. R-squared is, well, in statistics it's called the coefficient of determination.
So how well does it really predict our data? And the best value there is one. And then it goes down and can even be negative. 0.03 isn't really, well, it's not impressive. When we change our training data to include the median income, we increase the score significantly. Obviously, this is the most important part: we have to find data that is able to give us information about the other data that we want. However, once we find that, we can further improve our model by doing preprocessing on our data. We have to be careful here though, because when we do preprocessing and test out different things to see whether they work or not, what can happen is that we manually overfit our model. That means, to do proper data science right here, we want to split our test data into two parts: one validation holdout set and one test set. The test set will not be touched in the entire training process and not in our experimentation, but only in the very, very last part of our machine learning journey. Here we define X val and y val. And I made a little mistake here, leaving that at y train; that should of course be X test in the train_test_split. Changing this means that this works. And this is also a nice part about the train_test_split function: it really makes sure that everything is consistent, so that all our variables match. And we can see right here that our test dataset is now quite small, with 1,000 values. So we can go back to the train_test_split up here and actually provide a ratio for the split. In your data science and machine learning efforts, you should always try to use the biggest test size you can afford, really, because that means you'll be able to have more certainty in your results. Here we can see that it's now split 50/50, and splitting our test set further down into the validation set and the test set shows that our final test set has about 2,500 samples in there, which is good enough for this case.
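The split into training, validation, and test data described above can be sketched like this. The DataFrame is synthetic (an assumption, so the example runs without the housing CSV), and the variable names such as `X_rest` are illustrative rather than the ones used in the class notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data; the target depends mostly on income.
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "housing_median_age": rng.uniform(1, 52, n),
    "total_rooms": rng.uniform(100, 8000, n),
    "median_income": rng.uniform(0.5, 15, n),
})
df["median_house_value"] = 40_000 * df["median_income"] + rng.normal(0, 50_000, n)

X = df.drop(columns="median_house_value")
y = df["median_house_value"]

# First split: 50 % training data, 50 % held back.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
# Second split: the held-back half becomes validation and test set (25 % each).
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

# R-squared on the validation set; the test set stays untouched until the very end.
val_r2 = model.score(X_val, y_val)
```

Passing `X` and `y` together to `train_test_split` is what keeps features and targets in the same rows.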
We'll define our standard scaler here and our model as the linear regression. And we fit our scaler to the training data. That means we can now rescale our entire data so that none of the columns are significantly larger than the others. Well, in a linear model, we fit the slope and the intercept. And when we scale our data, that means that our linear model can now work within the same ranges of data and is not biased because one feature is significantly larger than the others. We'll create X scaled from our X train data, so we don't have to call the scaler transform in the training call. We can compare them here: we can see that our scaled data is now centered around zero and all at the same scale, whereas before it was all over the place. We can now fit our model on the scaled data. And, well, the labels are as usual; obviously the label has to be y train in this case. And then we can do the usual validation on our holdout data. In this case it's going to be X val and y val. So we don't touch the test data while we see what kind of scaling and what kind of preprocessing works. We have to transform our validation data, because now our model expects scaled data. So when we forget that, we get terrible results. And we can see that we improved our model by a small margin, but it is still improved by just applying this scaling to the data. If we want to try the robust scaler instead, we can do this by just experimenting and using a different scaler. And this is the part I mean, where we need an extra holdout set, because just trying different things is a really good way to see what works. And it is how you do data science. Just seeing what sticks is really key to building a good machine learning model. Because sometimes you might not expect that you do have outliers in your data. And you try the robust scaler and you'll see that it actually performs better.
Or you realize that it performs worse. Here we can train on our transformed data with our y train again and score our results to check whether this works. Try the min-max scaler that we used in our previous class as well. After we've done the experimentation and trained our final model, we can use this model to predict on any kind of data that has the same columns, and hopefully the same distribution, as our training data and the validation set. To do this, we'll use model.predict and provide it with some kind of data. In this case, we'll use the training data, just to have a look at how the model stacks up against the actual ground truth data, the labeled data. But of course, doing it on the train data isn't the most interesting, because the model has seen this kind of data. Eventually, we will do this on the test set. But finally, you want to do this on completely unseen data to actually get predictions from your machine learning model. Another very nice thing, and a reason why I really like the train_test_split utility, is that you can provide it with the keyword stratify. Stratification is a means to make sure that some kind of feature is equally represented in each part of the train and test split. So if we want to make sure that the island class of our ocean proximity is partly in train and partly in test, we can do this by supplying this kind of feature. And a reason why people like linear models so much is that linear models essentially fit a line to your data. So if you think back to, like, fifth grade, you may remember that a line is basically just the intercept on the y axis and a coefficient for the slope. So what we can do is interpret our linear model and have a look at these coefficients. Each column has a slope parameter right here, and basically, this parameter tells you how much this feature influences the prediction result.
And of course, we can have a look at the intercept with the y axis, which gives us a full overview of our model. So essentially, you could write it out on paper. Now, in this class we learned how to use scikit-learn on a simple machine learning model, a linear regression. So basically fitting a line to our data. We had a look at how scaling can improve our model, and even predicted on some data that the model has never seen, so validating whether we're actually learning something meaningful or whether the model just remembers the data. In the next class, we'll have a look at some more sophisticated models, namely decision trees and random forests. 24. Decision Trees and Random Forests: In this class we'll have a look at decision trees and random forests, which are just a bunch of decision trees that are trained in a specific way to be even more powerful. Decision trees are very good learners because you usually don't have to change the basic parameters too much. In this class, you'll see how easy it really is to use scikit-learn to build all kinds of different models and to utilize that in your exploration of the data. For this video, I already prepared all the imports and the data loading. And I split the data into the train set, which is 50 percent, and then a validation and a test set, which are each 25 percent of the total data. And now we'll go on to build a decision tree to start out. So we'll import the decision tree from the tree module of scikit-learn. As always, we'll define our model. In this case, it's going to be a decision tree regressor, because, to make it comparable, we'll again do a regression on the house value. Model training is going to be the same as always, with X train and y train. And I think at this point you really see why scikit-learn is so popular. It has standardized the interface for all machine learning models. So scoring, fitting, and predicting with your decision tree is just as easy as with a linear model.
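To illustrate the shared interface, here is a small sketch that trains a linear regression and a decision tree regressor with exactly the same calls. The data is synthetic (an assumption for a self-contained example), so the scores say nothing about the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with three numerical features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 3))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

scores = {}
for name, model in [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=0)),
]:
    model.fit(X_train, y_train)                # same training call for both models
    scores[name] = model.score(X_val, y_val)   # R-squared on held-out data
    preds = model.predict(X_val)               # same prediction call as well
```

Swapping in another scikit-learn regressor only requires changing the entry in the list.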
Decision trees are relatively mediocre learners, and we really only look at them so we can later look at the random forest, which builds several decision trees and combines them into an ensemble. And the nice thing about decision trees is that they're usually quite scale independent, and they work with categorical features. So we could actually feed ocean proximity into our training here. But then of course we couldn't compare the method to the linear model as well. So we'll not do this right now, but this is definitely something you can try out later. Scaling this data doesn't cost us anything, so we might as well try. Here, you can actually see what happens when you don't transform your validation data. Even the decision tree now expects scaled data, so it performs really poorly. When we transform our train data and we transform our validation data, our score is slightly worse than before. Next, we can build a random forest. A random forest is an ensemble of trees where we use a statistical method called bagging, which basically tries to build uncorrelated decision trees that, as an ensemble, are stronger learners than each tree individually. So we'll import the random forest regressor from the ensemble module of scikit-learn. And just like before, we'll assign our model, well, the model object, to a variable. And then we can fit our model to the data. As you can see, the fitting of this is quite fast, and scoring this should give us a really good result. We can see here that this is slightly better even than the score we got on our linear model after scaling. And if we now look at the score on the training data, you can see why we need validation data. This random forest is extremely strong on the training data itself, but only okay on the validation data. We can also have a look at the scaling, just to see how it works. It doesn't cost us anything, it's really cheap to do. So you might as well.
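The gap between training score and validation score for a random forest can be reproduced with a short sketch. Again the data is synthetic and deliberately noisy (an assumption), which makes the effect easy to see.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, noisy regression problem.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(600, 4))
y = X[:, 0] * X[:, 1] + rng.normal(0, 5.0, size=600)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

# The score on data the forest has seen is typically much higher ...
train_r2 = forest.score(X_train, y_train)
# ... than the score on held-out validation data.
val_r2 = forest.score(X_val, y_val)
```

Judging the model by `train_r2` alone would overstate how well it generalizes, which is why the validation score is the one to watch during experimentation.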
If this improves your machine learning model or reduces overfitting, it's always worth doing, because, yeah, it's cheap. So we'll scale our training data and fit our model to it. We can use the same scaler from before, because the scaler is independent of the machine learning model; it's just the scaler. And we see right here that our training score basically didn't change; the difference is in the fourth decimal place. So it's basically random noise at that point. On the validation set, we shouldn't expect too much either. It slightly deteriorated the result, so it is worth keeping the original data in this case. A fantastic thing about random forests is that they have something called introspection. So you can actually have a look at how important the random forest thinks each feature is. These are relative numbers, and they might fluctuate a bit, but we can see that these features are weighted differently within the random forest to predict a correct price. This was a really quick one. I think scikit-learn is amazing because it makes everything so easy. You just use .fit, .predict, and .score. And those are super useful for all of our machine learning needs. In the next class, we'll have a look at how we can not only predict a price, but how we can predict categories. So in a more business sense, that may be predicting whether someone is creditworthy or not. 25. Machine Learning Classification: In this class we'll have a look at classification. That means assigning our data to different bins depending on what's contained within the data. In our example, we'll have a look at ocean proximity. So we'll try to predict whether one of our houses is closer to or further from the ocean. And that basically means that we'll have the chance to test different algorithms and how they are affected by preprocessing of our data as well. We'll import everything and load our data.
Now in this split, we want to replace the house value with ocean proximity, because we want to do classification, so we need to predict classes. In this case, we'll predict how near a house is to the ocean. But generally you can predict almost any class. We'll turn it around this time and use all of the training features. But of course, we need to drop ocean proximity from our DataFrame. If we left that in, that would be a very easy classification task, I'd say. So the easiest model, or one of the simplest models, is the nearest neighbor model, or k-nearest neighbors model. Nearest neighbors essentially just takes the closest data points to the point that you want to classify and, well, usually you just take a majority vote. So that means the class that is most prominent around your point is probably the class of your point. And for classification, scikit-learn is no different than for regression. We'll assign the model, well, the object, to a variable. And then we'll try to fit our data. But something went wrong: there are NaNs or infinities or something like that in the data, and k-nearest neighbors does not deal well with this. Like I said, I leave all the preprocessing steps in the preprocessing chapter to keep these chapters short and concise. But in this case, we'll drop the NaNs without any other preprocessing, just so those rows get deleted. That might not be a good idea in most cases, but in this case it's just to get our data out of the door. Here we can fit our model with the usual training data. It just works this time. And then we can score our model. Now, scoring in classification is a little bit different than in regression. We do not have the R-squared, because the R-squared is the coefficient of determination in regression. In this case, we have the accuracy. And the accuracy is at 59 percent, which is alright, so about 60% of the time this nearest neighbor model is predicting the correct class. We can probably do better, but that's a start.
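The steps for the nearest neighbor classifier, including dropping rows with missing values, can be sketched as follows. The DataFrame is synthetic, with made-up class labels derived from longitude and a few injected NaNs (all assumptions for illustration), so the accuracy is not comparable to the number from the video.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: the class depends only on longitude.
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "longitude": rng.uniform(-124, -114, n),
    "median_income": rng.uniform(0.5, 15, n),
})
df["ocean_proximity"] = np.where(df["longitude"] < -120, "NEAR OCEAN", "INLAND")

# Inject missing values into every 50th row; k-nearest neighbors cannot fit on NaNs.
df.loc[df.index[::50], "median_income"] = np.nan

# Quick fix used here: drop every row that contains a NaN.
clean = df.dropna()
X = clean.drop(columns="ocean_proximity")   # all features except the target
y = clean["ocean_proximity"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# For classifiers, score returns the accuracy, not R-squared.
accuracy = knn.score(X_val, y_val)
```

Changing `n_neighbors` is the natural first experiment, since it controls how many surrounding points take part in the majority vote.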
One thing you can try in your exercise is to change the number of nearest neighbors and have a look at how many nearest neighbors to the point give the best value. We can have a look at many different classification algorithms. On the left you see the input data, which is three different forms of inputs. And then you see the decision surfaces of a binary classification on the right. So you can see that the random forest is very jagged, e.g., and a Gaussian process is very smooth. That is just for you to understand how these models understand data. We'll try out the random forest, because its decision surface looks very different. And random forests are once again very powerful models. This is going to be the same scheme as the nearest neighbors. So we'll have a quick chat about scoring functions. Because the accuracy score is alright, it's a good default, but it essentially just counts how many you get right. And let's say you work in an environment where errors are especially bad, especially expensive. Then you want to have a look at whether another scoring function would be more appropriate. You can have a look at the scikit-learn documentation. There are different scoring functions that you can check. Here we have a look, and the random forest with default values just outperformed anything the nearest neighbors get close to, with 96 percent. That is on unseen data, so it is a very good score. We can once again have a look at the feature importances to see what our model thinks is the most important indicator that something is close to the shore. And obviously part of it is going to be the longitude and the latitude. So let's just drop those as well from our training data. Because we want to make it a little bit more interesting; maybe something else is a better indicator. If you come to your boss and say, Hey, I figured out that location tells us really well that my house is close to the ocean.
They'll probably look at you with a little bit of pity. So have a look. And obviously our random forest score is a little bit worse, but pretty alright. So let's have a look at another linear model. The logistic regression model is a binary model. You can use it for multi-class as well with a couple of tricks. It basically goes from 0 to 1 and finds that transition. You can see it right here in red. Logistic regression models are really interesting because they once again give a good baseline model, since they are linear classifiers. But more interestingly, you saw that there's this transition line from 0 to 1 in this image. And you can define a threshold there. By default it is at 0.5, but you can do tests on how you want to set the threshold for your logistic regression. And this is a really good thing to think about in your machine learning model. We'll have a look at how to determine this threshold after this segment of programming the logistic regression. So we'll add this and have a quick look, because we have a multi-class problem right here, and we want this multi-class problem to be solved, obviously. Luckily, multi-class is automatically set to auto, because most people don't deal with binary problems in real life. So yeah, scikit-learn really tries to set good default values. We'll fit our model with the X train and y train data. And unfortunately, it did not converge. So it did not work in this instance. So I'll go into the docstring and have a look. There we go: max_iter is going to be the keyword that we have to increase, so it gets more iterations to find the optimum, to find where the logistic regression is supposed to be. A thousand wasn't enough either. Just add a zero. This is going to take a while. So we'll think about our optimum threshold. Because in a sense, when you have machine learning, you want all your positives to be positively classified and all your negatives to be negatively classified.
And then you have to think about which kind of mistake is worse. In this case, we can use the ROC curve for logistic regression, where we can plot the true positive rate, so the positives that are classified positive, against the false positive rate, so everything that was classified positive falsely, and then choose our optimum. In this class, we had a look at different classification algorithms. There are many more, as I showed you on that slide. And you can really dive into the different kinds of classification really easily. As you see, it's always .fit on the training data. And then you score and you predict on unseen data. And really, in the end, it's always the same. And then it comes down to how you scale your data, which is part of the exercise, and also how you choose hyperparameters like k for the k-nearest neighbors algorithm. In the next class, we'll have a look at clustering our data. So really seeing the internal structure of our data and how each data point belongs to the others. 26. Data Clustering for Deeper Insights: In this class, we'll have a look at how we can cluster our data. Sometimes data points cluster well, sometimes they are harder to discern. And we'll have a look at how different algorithms treat the data differently and assign it to different bins. After importing our data, this time we'll skip the part where we split the data, because we'll rather look at the clustering algorithm as a data discovery tool. If you want to build clustering algorithms for actual prediction or for assigning new classes, you have to do the splitting as well, so you know that it actually does what it's supposed to do. In our case, we'll just have a look at k-means. K-means is kind of the unsupervised little brother of k-nearest neighbors: it essentially measures the closeness to other points and just assigns them to a cluster if they're close enough. We'll fit our model on the DataFrame.
And we'll use fit_predict because we want to do everything in one step. Now the problem right here is that we have ocean proximity in there, with strings in there. So we'll be dropping, no, we'll actually just have a look at the spatial data. So longitude and latitude, because those are very easy to visualize in 2D. That just makes our life a little bit easier. We'll get out some labels for these. And what we can do then is plot these using matplotlib. You'll get to know matplotlib in a later class as well. But just for an easy plot, it does have plt.scatter, which takes an x and a y coordinate. And then you can also assign a color, which is the labels in our case. In k-means, you can define how many clusters you want to get out. Now the default is eight. We'll play around a little bit with it, and you can see how the clusters change. The higher you go, the more fragmented it gets. And you can argue how much sense it really makes at some point to still cluster data with, like, hundreds of clusters. I'm going with three. It's easy enough just to show what happens when we actually have, like, proper clusters. We'll split our data a little bit. So essentially we use the subsetting that we discussed before to delete some of the middle part in the longitude. For that, we can use the between method, which basically defines a start point and an end point. When we negate this between, we are left with a cluster on the left of our geographic scatter plot and one on the right of our geographic scatter plot. For that, we'll just choose -121 and -118 as the left and the right borders. We can see right here that this gives us a split dataset. We assign that to a variable so we can use it. Let's have a look and plot this, actually, so we see what's happening with our data. We just delete the colors for now, because the labels don't apply here. And we can see the clear split between two clusters.
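Here is a minimal sketch of the two ideas from this part: removing the middle band of longitudes with a negated `between`, and getting cluster labels in one step with `fit_predict`. The coordinates are random stand-ins (an assumption), and the variable name `split` is only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random stand-in coordinates instead of the real housing locations.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "longitude": rng.uniform(-124, -114, 500),
    "latitude": rng.uniform(34, 38, 500),
})

# Negating between() removes the middle band and leaves two separated groups.
split = df[~df["longitude"].between(-121, -118)]

# fit_predict fits the model and returns one cluster label per row.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(split[["longitude", "latitude"]])
```

The `labels` array can be passed as the `c` argument of matplotlib's `plt.scatter` to color the points by cluster.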
Then we can use our k-means to match these two. I'll just copy this over and use fit_predict to get our labels on the split data, and also copy over our scatter plot and add the labels back in. We can see that with two clusters, it's quite easy for k-means to get one cluster on the left and one cluster on the right. If we play around with the numbers, we can really test the behavior of how it finds sub-clusters in this and how it interprets the data. But because it's so easy with scikit-learn, let's have a look at other methods. This is a graphic from the scikit-learn website where you have different kinds of clustering algorithms and how they work on different kinds of data. Spectral clustering comes to mind. But I personally also really like DBSCAN and Gaussian mixture models. They work quite well on real data, and especially the further development of DBSCAN called HDBSCAN is a very powerful method. HDBSCAN is a separate library that you have to have a look at and then install yourself. But yeah, definitely worth a look. So we can do the same as before. We'll import DBSCAN from the cluster module in scikit-learn and assign it to the db variable. It doesn't have a lot of different hyperparameters that we can set. Maybe change the metric in there that you saw in the docstring. But for now, Euclidean is totally fine. We can see right here, without setting any number of clusters, that there are the outliers on the right. And basically it finds three clusters without us telling it much about our data. Let's also have a look at spectral clustering right here. It works just the same. We'll assign it to an object and instantiate it. We have to supply the number of clusters for this one. We just copy over all of this, change the prediction to use our sp object, and execute the entire thing. This takes a little bit longer. Spectral clustering can be a bit slower on large datasets. Check out the documentation.
They have a really good overview of which clustering method is best for which size of data. And also, yeah, basically what you have to think about when applying different clustering methods. Since the methods are always evolving and always growing, it's a really good idea to just check the documentation, because that is always up to date. Here we can see that the clustering is quite good. Clustering data can be really hard. As you saw, it can lead to very different outcomes depending on what kind of algorithm you use, so which kind of underlying assumptions are in that algorithm, but also on how your data is made up. Is it easy to separate the data, or is it really hard to separate the data? In the end, I think it's a tool that can generate new insights into your data that you didn't have before, based on the data you feed to it. In the next class, we'll have a look at how we validate machine learning models. Because just building the model isn't enough. We have to know if it's actually learning something meaningful. 27. Validation of Machine Learning Models: In this class we'll have a look at validating your machine learning models. And you have to do it every time. Because building machine learning models is so easy, the hard part is now validating that your machine learning model actually learned something meaningful. And in one of the later classes, we'll also see whether our machine learning models are fair. In this class we'll have a look at cross-validation. So seeing what happens if we shift our data around: can we actually predict meaningful outcomes? And then we'll have a look at baseline dummy models that are basically a coin flip. So does our model perform better than random chance? After importing everything and loading the data, we'll drop the NaNs and split our data. Right now, we'll do the regression. So we'll build a random forest regressor. And this is just to have a model that we can compare to the dummy model to do the validation.
So we'll fit it right away to our training data and add the labels here. Having a fitted model, we can now score it and then go on to do cross-validation. Cross-validation is a very interesting approach. If you've just learned about train-test splits, this is going to take train-test splits to the next level. Cross-validation is the idea that you have your training data and your test data, and we'll keep the test data as the test data. As you know, we split our training data into a training set and a validation set. In cross-validation, we're now splitting our training data into folds, so basically just into sub-parts. And each part is used once as the test set against everything else as the training set. So we're building five models if we have five folds. You can also do this in a stratified way, like the train-test split that we used before. Now, it is quite easy to do this. Once again the API, so the interface that you work with, is kept very simple. So we'll import cross_val_score. And right here, cross_val_score takes your model, the validation data, so your X, the features, and of course the targets, because we have to validate it on some kind of number. This takes five times as long because we build five models, or yeah, we evaluate five models, and we get out five scores, one for each model. You may notice that all of these scores are slightly lower than our score on the entire data, and this is usually the case and closer to reality. We can also use cross_val_predict to get predictions from these five models. This is quite nice for doing model blending, for example. So if you have five trained models, you can get predictions out as well. It is not a good measure of your generalization error, so you shouldn't take cross_val_predict as a way to see how well your model is doing, but rather as a way to visualize how these five models on the k folds of the cross-validation predict. Another validation strategy is building dummy models.
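The cross-validation workflow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the data is synthetic (it is not the housing dataset from the class), and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split

# Synthetic stand-in for the housing features and target (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# The test set is held back; cross-validation only touches the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# Five folds: five models are fitted, each scored on the fold it did not see.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("fold scores:", cv_scores)
print("mean score:", cv_scores.mean())

# One out-of-fold prediction per training sample, mainly useful for plotting.
cv_pred = cross_val_predict(model, X_train, y_train, cv=5)
print("predictions:", cv_pred.shape)
```

The fold scores are what you would compare against a baseline; the out-of-fold predictions are for visual inspection, not for estimating the generalization error.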
Whether you do this before or after doing cross-validation is up to you. It is one way to validate our model. A dummy model is essentially the idea that we want our machine learning model to be better than chance. However, sometimes knowing what chance looks like is a bit rough. So we'll have a look here. There are different strategies you can use in your classifier. You can just do it brute force and try them all, but a good one is usually prior. I think this will become the default in future versions of the dummy classifier. But since we're doing regression first, let's have a look at the regressor. Right here you can also define a strategy. The default strategy is just to have the mean. So the worst kind of model just returns the mean. We'll use this as our dummy model. Just like any machine learning model, you fit this to the data with X_train and y_train. Then we can score this model. We can even do cross-validation on this and see how well this chance model does. Based on these scores, it's a good gauge of how well your actual model is doing. If the chance model is performing better than or equal to your model, you've probably built a bad model and you have to go back and rethink what you're doing. We can do a single score, but obviously cross-validated scoring would be more appropriate here, which is something you can try out in the notebooks. We'll do this again and we'll create a really quick classification model using the ocean proximity data again. Here we'll build the classifier, just with the normal strategy. I personally think the dummy models are really useful because sometimes chance isn't just 50/50. If you have class imbalances, like we do with the island data, for example, your coin flip is essentially skewed. It is biased. And the dummy classifier is just a very easy way to validate that, even with class imbalances, you did not build a useless model.
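As a rough illustration of the dummy baselines discussed above, here is a minimal sketch. The data is synthetic and the class labels are only borrowed from the ocean proximity example; none of this is the notebook code from the class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, not the housing dataset).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.3, size=400)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

# Baseline that always predicts the mean of the training targets.
dummy_reg = DummyRegressor(strategy="mean").fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_train, y_train)

dummy_r2 = dummy_reg.score(X_val, y_val)
forest_r2 = forest.score(X_val, y_val)
print("dummy R^2:", dummy_r2, "forest R^2:", forest_r2)

# Imbalanced classes: "chance" is not 50/50 here.
labels = np.array(["INLAND"] * 90 + ["ISLAND"] * 10)
features = np.zeros((100, 1))  # the dummy classifier ignores the features
dummy_clf = DummyClassifier(strategy="prior").fit(features, labels)
baseline_acc = dummy_clf.score(features, labels)
print("baseline accuracy:", baseline_acc)
```

With a 90/10 class split, the prior strategy already reaches 90% accuracy by always predicting the majority class, which is why a real classifier has to be compared against this number rather than against 50%.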
So right here we can score this and we get an accuracy of, well, a pretty low accuracy right here. And we can check out how the different strategies affect our result. So 32% is pretty bad already. But you should probably take the dummy classifier with the best score as your baseline, because that is still a chance result. So these 40% on the chance prediction are not a good sign, let's say that. Right here we'll build a new model using the random forest again. There we go with a classifier, and we'll directly fit it to the data so we don't have to execute more cells. Now scoring on the data will show that our random forest is at least a little bit better than the dummy. So 20% better accuracy. I'd say we're actually learning something significant here to predict if we're closer to or further away from the ocean. Now, as I said, cross-validated scoring is more appropriate, so we'll use it right here with our dummy model on the test data. And the warning right here is interesting, because our cross-validation score tells us that in ocean proximity, the island class does not have enough data to actually do a proper split. So this is really important to notice, and it is something to take into account. But apart from that, we see that even in cross-validation, our model is outperforming the dummy model. Validating machine learning models is very close to my heart. Since it's become so easy to build machine learning models, it's so important that you do the work and see that those models have actually learned something meaningful and aren't just reading something into noise. These strategies are really the base level that you have to do with every machine learning model. In the next class, we'll have a look at how to actually build fair models and how to make sure that our model doesn't disadvantage anyone because of some protected class, for example,
and that will be extremely important if you touch humans with your machine learning model.

28. Machine Learning Interpretability: In this class we'll have a look at machine learning interpretability. So we're going to look at this black-box model that we built and we'll inspect what it actually learned. And if you're like me and you've ever built a model, shown it to your boss and said, "Yeah, it learned, and it performed such and such on this score, like it had 60% accuracy," they won't be impressed. They want to know what this machine learning model actually thinks. In this class, we'll have a look at how each feature in our data influences the decision of our machine learning model. And we'll dive deep into the really cool plots that you can make to show what influences the decision in a machine learning model. So here we'll pretend we already did the entire model validation, model building and data science before, so we can check if our model is fair. The notion of fairness really is the idea that even though our model has not seen the ocean proximity class, it may implicitly disadvantage this class. You can check right here: in our split, we directly drop ocean proximity. We do not train on it. But maybe our data somehow implicitly disadvantages some class that is within ocean proximity. To check for this, I'm doing a couple of pandas tricks, since I've been using pandas for a bit. So here you see the stuff you can do with pandas. Because right here you have the validation data, and that is only a part of our DataFrame. I want to find all indices from our DataFrame that are in this class and find the intersection of this index with the validation data. That way, I can choose the subset of the class in our validation set without actually having that column present in the DataFrame from our train-test split.
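The index trick described above can be sketched like this. It is a toy example under stated assumptions: the DataFrame is made up, and the names `x_val`, `class_index` and `subsets` are hypothetical, chosen only to mirror the description.

```python
import pandas as pd

# Toy stand-in for the full DataFrame (an assumption, not the course data).
df = pd.DataFrame(
    {
        "median_income": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "ocean_proximity": ["INLAND", "ISLAND", "INLAND", "NEAR BAY", "ISLAND", "INLAND"],
    }
)

# Validation features: a subset of the rows, with the class column dropped,
# as it would come out of a train-test split.
x_val = df.drop(columns="ocean_proximity").loc[[1, 2, 5]]

subsets = {}
for cls in df["ocean_proximity"].unique():
    # All row labels of the full DataFrame that belong to this class.
    class_index = df[df["ocean_proximity"] == cls].index
    # Keep only those labels that also ended up in the validation set.
    subsets[cls] = x_val.index.intersection(class_index)
    print(cls, list(subsets[cls]))
```

Each entry of `subsets` can then be passed to `x_val.loc[...]` to score the model on one class at a time, even though the model never saw the class column.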
While doing this, I'm playing around a little bit with it and trying to make it work, just printing it over and over to see what's happening, so I can actually validate that this is the data that I want. Right here you see that I'm now taking the subset, taking its index, and then printing the intersection of X_val and the class-based index that I created before. I'll save this in idx, so I can just go into the DataFrame and subset it with the index. We'll use the model's score function in this case to get an initial idea of how well our model is really performing on the class subset of our validation data. I'll print this because we're now in a loop. I have to use .loc here. It's really important for you also to see that I still make mistakes very commonly, and you shouldn't be scared of mistakes in Python, because they don't hurt. Mistakes are cheap. So just try things out. In this case I messed up because I sometimes have problems keeping the columns and the rows apart. And right here, interestingly, we see wildly varying values. We have three that are around 60%, which is nice. But one value is around 23% and the last one is zero. So I have a look at the indices. And we can see right here that this must be island, because island only has five values in total. So we definitely have a prediction problem here, despite our model doing well overall. It does terribly on the island data because it doesn't have enough data right there. Here you can see that I try to improve the model by stratifying on ocean proximity, just to see if this changes anything. And it does not, because now I made sure that the classes are equally distributed and we have even less data. So before, I got lucky that I had two of the samples in the validation set, and now I have fewer.
Now I can't do this, because this is already a subset of the data, so we'll just skip it: with five samples, trying to spread them out over all the datasets is kind of moot. Really, in this case you have to think about whether you can get more samples, somehow create more samples, collect more samples, or whether there's any way to get the samples out of the system. But since they are data, they should usually be represented in your model. So really, in this case, get more data. We can see right here that the stratification has improved the model overall, which is nice to see. The backslash n is a newline, just to make it look a little bit nicer. Here we can see that this is giving us good predictions for everything that is near the ocean: near bay, near ocean, and under one hour from the ocean. But inland is performing significantly worse than the other data. Let's ignore the island for now, because we discussed the problems with the island. Let's have a look at the test data, because right now we're not doing model tuning anymore. So this is really the end validation, and we can see that on the test data, inland really has a problem. So our model is performing well overall, but something is going on here with our inland data. It would also be good here to do cross-validation, so we can actually get an uncertainty on our score and see if there are any fluctuations in there. But let's move on to ELI5. ELI5 is a machine learning explanation package. This is the documentation. And for decision trees, it says right here that we can use explain_weights. So this is what we're doing right here. I'm calling this on our model. We have to supply the estimator, so the model that we trained, and we can see the weights for each feature and what this feature is called.
And this is an extremely good way to look into your model and be able to explain what influences it the most. We can also use ELI5 to explain individual predictions. So right here, we obviously have to supply our estimator object, but then also supply individual samples and get an explanation of how the different features influenced this prediction. Right now we'll just use a single sample from our data. We can see right here how each individual feature contributes to the outcome of 89,000. And of course you can do this for classifiers as well. Or we can iterate over several different samples and see how these are explained by ELI5. I'll just use the display here. Like I said, the format is really nice as well, but I don't really want to get into it in this class. And here you can interpret how each of these is influenced by the different model parameters. After having a look at these, we can have a look at feature importance. You may remember from before, from the random forest, that you can do introspection on the feature importance. Scikit-learn also provides a permutation importance for everything, so you can apply this to every single machine learning model available in scikit-learn. The way this works is that the permutation importance essentially looks at your model, takes each feature in the data, and one by one scrambles those features. So first it goes to households and scrambles those, so they are essentially noise, and then sees how much this influences your prediction. You can see here that it gives you the mean importance, the standard deviation, and also the overall importance. So you can really dive deep into how your model is affected by each feature. You can also decide to drop certain features here. Next, we'll have a look at partial dependence plots. These plots are really nice because they give you a one-way look into how a feature affects your model.
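To make the permutation importance described above more concrete, here is a minimal sketch on synthetic data. The feature names are hypothetical and only one of them carries signal, so it is easy to see what the scrambling does; this is not the notebook code from the class.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): only the first column carries signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=400)
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=2)
model = RandomForestRegressor(n_estimators=50, random_state=2).fit(X_train, y_train)

# Shuffle each feature n_repeats times and measure how much the score drops.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=2)

for name, mean, std in zip(feature_names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")

# result.importances holds every individual repeat, one row per feature.
print(result.importances.shape)
```

Features whose importance stays close to zero after shuffling are candidates for dropping, which is the decision the class mentions.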
Introspection is relatively new in scikit-learn. That's why there's scikit-yellowbrick, scikit-yb, which makes these fantastic plots. In the top middle you see the precision-recall curve, for example, and generally it's a really good way to visualize different things that explain your machine learning. Here we see all the different features in our training and how they influence the prediction result. So bottom right, the median income, you can see, has a strong positive influence. And right here you can interpret how changing one feature would influence the price outcomes. So households, for example, has a slight increase when there are more households, et cetera. It's a really neat little plot. But the final library, and my favorite library for machine learning explanation, they even got a Nature paper out of this, is SHAP. SHAP has different explainer modules for different models, so they're basically very fine-tuned to each. You can even do deep learning explanation with SHAP. We'll use the TreeExplainer for our random forest model. And we have a warning that we have a lot of background samples, so we might actually be able to subsample this to speed it up. But right now, we'll pass our validation data to this explainer object that we created and calculate this. This takes a second, and I'll actually cancel it right here, because I want to save those in a variable so I can reuse them later and don't have to recalculate them. But generally, the plot that I want to show you is a force plot. This plot can be used to basically explain the magnitude and the direction with which each feature of your data shifts the prediction of your machine learning model. I really love to use these plots in reports because they are very intuitive. You'll see that in a second. So here we have to pass the expected value from our explainer object, the SHAP values that we calculated before, and one of our data points in the validation data.
So you can once again do this for several of your validation data points. I made a mistake here by not having an underscore. And also, I should have activated JavaScript for SHAP, because they make some nice plots and they fall back on JavaScript to do this. It has to be initialized right here. But afterwards we have this plot, and I hope you can try it yourself. Because here you can see that this particular prediction was most influenced by the median income, negatively, then by the population a little bit less, positively, and by the number of households, negatively. Just overall, a really nice package. So we've had a look at different ways to visualize and include data in your reports and how you can generate them. Definitely check out the documentation, because they have so much more for you. In this class, we inspected the machine learning model. We had a look at how different features influence our machine learning decision, but also how strong this influence on the decision is and how different features influence other features. So maybe we can even drop some from our original data acquisition. In the next class, we'll have a look at fairness, so this important part where machine learning models might actually disadvantage some protected classes because they learn something that they shouldn't learn. We'll have a look at how to detect this and at some strategies for mitigating it.

29. Intro to Machine Learning Fairness: In this class we'll have an introductory look at machine learning fairness. Machine learning has gotten a bit of a bad rap lately because it has disadvantaged some protected classes that shouldn't have been disadvantaged by the model. And this has only come out through people noticing, not through data scientists doing the work beforehand. In this class, we'll have a look at how you can do the work, how you can see if your model is performing worse on certain protected characteristics.
We'll also have a look at whether your machine learning model is less certain in certain areas. Sometimes you get a model that predicts that someone is worse off because they're in a protected class. And that is a big no-go. If that ever happens, your machine learning model may never reach production. But sometimes your machine learning model is just less certain for some people if they are of a certain class. Then you should try to increase the certainty of the model by doing the machine learning and data science work beforehand. This will be building on what we did in the interpretability part, where we already did part of the fairness evaluation. We start with a random forest and we do the scoring, so we have a baseline for knowing how well our overall model is doing. Then we'll start to dissect it by class. We already have this stratification on the class, which we'll keep from before because it improved the model significantly. But then we'll iterate over the classes and actually have a look at how well they're doing. In the beginning, we want to know the score and basically do the same work that we did in the interpretability part. So we'll get our classes right here, and then we can actually have a look and interpret our data. Right here we'll do the work of getting our indices. So we do the whole thing with the intersection and getting our class indices in the beginning. This is going to be saved in idx, for index. And then we're taking the intersection, not the union, of these values for validation and for our test. Because right now we just want to really test our algorithms, having both holdout datasets is actually fine for this part. Usually, you do it at the end, after fixing your model and after hyperparameter tuning, to see if your model disadvantages anyone. So we take the intersection with the class index right here, copy this over, and make sure that data is in there.
And then we can score the model on the validation data and on the test data. Ideally, all these scores should be equally good, and absolutely ideally, all of them are as good as the overall model score. But we remember that inland was significantly worse than everything else. And of course, we have the problem with island not having enough samples to actually do the validation. That's why I just included a skip for now. Later, I'll also show you how to catch errors in your processing so we can deal with this. And then we'll expand this to include cross-validation, because with cross-validation we can really make sure that there aren't weird fluctuations within our data. Maybe inland just has some funny data in there that really makes it vary wildly in its prediction. So getting that out is really important right here. And this is only the beginning for you. If you want to have a play around with it, you can build dummy models as well and just really dig into why inland is doing so much worse, using interpretability to really investigate why something is happening in this model. So now we have the scores. And while looking at these scores is nice, it's getting a bit much with the numbers. So what we can do, first of all, is add the try-except that Python has. If there's an error because island doesn't have enough data, we can catch that error. And with the except, we'll just put a pass so everything else still runs after we've processed island as well. There we go. Now we'll save these as variables, as val and test, because then we can actually just calculate statistics on this, so get the mean and the standard deviation. An indicator for uncertainty, or for something funny happening here, would be if the standard deviation of our cross-validation were very high. Which, interestingly, it isn't.
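The per-class cross-validation with try-except described above might look roughly like this. It is a sketch under stated assumptions: the data is synthetic, the group sizes are made up so that one class is too small for five folds, and a linear model replaces the random forest purely to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with one tiny class (an assumption, not the housing data).
rng = np.random.default_rng(3)
n = 103
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 2.0 * df["x"] + rng.normal(scale=0.2, size=n)
df["group"] = ["INLAND"] * 50 + ["NEAR BAY"] * 50 + ["ISLAND"] * 3

model = LinearRegression()
stats = {}

for cls, part in df.groupby("group"):
    try:
        # Raises ValueError when a class has fewer samples than folds.
        scores = cross_val_score(model, part[["x"]], part["y"], cv=5)
    except ValueError:
        # Not enough data for this class: skip it so the loop keeps running.
        pass
    else:
        stats[cls] = (scores.mean(), scores.std())

for cls, (mean, std) in stats.items():
    print(f"{cls}: mean={mean:.3f}, std={std:.3f}")
```

A class that is silently skipped here is itself a finding: it means there is not enough data to say anything about how the model treats that class.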
In this class we had a look at how to dissect our machine learning model and evaluate it on different protected classes without training on them. We saw that a model that does quite well overall can perform really poorly on some classes, and that sometimes we don't even have enough data to really evaluate our model. So we have to go all the way back to the data acquisition and get more data to really be able to build a good model and do good data science with regard to that class. And the business case here is really clear: you never want to disadvantage someone who would be a good customer, because you lose a good customer. This concludes the chapter on machine learning and machine learning validation. In the next class we'll have a look at visualization and how to build beautiful plots that you can use in your data science reports and presentations.

30. | Visuals & Reports |: In this final chapter we'll have a look at data visualization and also at how to generate presentations and PDF reports directly from Jupyter. And that concludes this course.

31. Visualization Basics: In this class, you'll learn the underlying mechanisms for data visualization in Python. We'll import our data as always, and then we'll use the standard plotting library in Python, Matplotlib. It underlies most of the other, more high-level plotting libraries, like Seaborn, as well. So it's really good to know, because you can use it to interface with Seaborn too. We'll make an easy line plot with the median house value here. Now, usually you want a line plot for data that is actually related to each other. So we'll start modifying this one. First, we'll open up a figure. And we'll also call plt.show, because the Jupyter notebook is making it a bit easy for us by showing us the plot object right after the cell is executed, but this is really the way to do it. We can change the figure size by initiating our figure object with a different figure size.
Those are usually measured in inches. Just an aside. And then we can change this from a line plot. We'll modify this further. Since line plots are, well, inappropriate here, because if we plot these against each other it looks a little bit funky, we can change the marker to an x, and we get a nice scatter plot right here. And you can see that Seaborn, for example, makes it much easier for us to get a nice-looking scatter plot. This plot still needs a lot of work. But just so you know, this is how we change the colors and change different markers. You can look up the markers on the Matplotlib website. There is a myriad of different markers. Then we can also start adding labels to this. The plot object that we have is a simple plot object, so we can just add a label. The x label is our population, and the y label is going to be our house value right here. And we can add a title, because we want our plots to look nice and be informative. Additionally, we can add further layers of plotting on top of this. So besides population against median house value, we can plot population against our total bedrooms and change the marker size, marker color, and marker style. But obviously, our total bedrooms is scaled very differently from our median house value, so we can just make a hot fix right now. Of course you'd never do that in an actual plot, but it's just to show how to overlay different types of data in the same plot. You can modify this in this way. You can save your figure and have it available as a normal file. Changing the DPI means that we change the dots per inch, which is kind of the resolution of the plot that is saved. Then we can also plot image data. We don't have an image right now for this, but it is imshow, and essentially you just give it the image data and it'll plot the 2D image for you. Let's have a look at how to change this plot further, like overlaying different data on it as well.
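The Matplotlib steps described so far (figure size, x markers, axis labels, title, saving with a chosen DPI) can be put together in a short sketch. The data is synthetic, the file name is hypothetical, and the headless backend is only selected so the sketch also runs outside a notebook.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the population and house value columns (assumption).
rng = np.random.default_rng(4)
population = rng.integers(100, 5000, size=200)
house_value = 50_000 + 40 * population + rng.normal(scale=20_000, size=200)

fig = plt.figure(figsize=(8, 5))  # width and height in inches
plt.plot(population, house_value, "x", color="tab:blue")  # x markers, no line
plt.xlabel("Population")
plt.ylabel("Median house value")
plt.title("House value against population")

# Hypothetical output location; dpi controls the resolution of the saved file.
out_path = os.path.join(tempfile.gettempdir(), "population_vs_value.png")
plt.savefig(out_path, dpi=150)
plt.close(fig)
print("saved to", out_path)
```

In a notebook you would typically end the cell with `plt.show()` instead of closing the figure.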
If we only have one scatter, this is completely fine, and we can add a legend. But it gives us a nice warning that there are no labels in this plot object. That means we can add a label to our data plot right here, and we'll just call it house. So now we have a legend here. It makes more sense if we overlay more data. If we want to plot some data on top, we can just call another plot function right here and change the marker so it looks a little bit different, and you can see that our legend is updated as well. So this is how you can make singular plots. As I mentioned, Seaborn is a high-level abstraction for Matplotlib. That means we can actually use Matplotlib to work a little bit with Seaborn. I will only give you one example, how to save a Seaborn plot, but you can easily look up other ways to add information to your Seaborn plots or modify them through the Matplotlib interface. So right here we'll do the pair plot with only 100 samples, because we want it to be quick. We can once again open up a Matplotlib figure, change the figure size as we want, and then save the figure. Here we can see that this is now available as a PNG. We can open up the PNG and use it wherever we need it. When opened, this just looks like the plot, but as an image. If you want to make quick plots directly from DataFrames without calling Seaborn or Matplotlib yourself, you can actually call the pandas plot function, which interfaces with Matplotlib. So df.plot gives you the ability to make different kinds of plots. You can make bar charts and histograms and also scatter plots, which we'll be doing this time. This means we'll just provide the labels of our x and our y data and tell it to make a scatter plot. We'll use the population against our house value again. Sorry, total rooms again. And we just provide the word scatter, and it makes a scatter plot for us. In this class, we learned the different ways to plot data in Python. We used Matplotlib, we used pandas directly, and we even saw that these interact with each other, because Seaborn and pandas both depend on Matplotlib. So you can take the objects returned by those plots and save them and even manipulate them further. In the next class, we'll have a look at plotting spatial information.

32. 52 Geospatial new: In this class we'll have a look at mapping geospatial data, so data where you have geolocations available. You can make nice maps and really show where your data is located, which adds another dimension of understanding to your data. We'll start out by loading our data. Our data already contains longitude and latitude. However, they're in the wrong order, so we'll have to keep that in mind. We'll import Folium. Folium is a way to plot our data on maps, interactively actually. So we'll start out with a Folium base map and give it the location of our first data. A quick and easy way for me to just make the base map for my data is providing the mean of the latitude and longitude as the center point of our map. And then we can have a look at the output of display. You can see that this has OpenStreetMap as the background, and you can move around in it. And then we can start adding things to it. One example is adding markers. Markers are a good way to just add locations from your data points and attach tooltips to some sample locations. This is what we'll do right here. Folium has the Marker class, and do have a look at all the different classes that you can use. The library is still growing, but it already has some really neat functionality to build some really cool maps. We'll use the Marker to add the first data point from our data to the map. This is why it has the add_to method. And we'll copy over the base map into the cell, because everything has to be contained in one cell to be able to change it. And we can see it right here.
So we'll change that around, add latitude and longitude, and just use iloc to get the very first row from our DataFrame. There we go, add it to our base map. And when we zoom out, we can see our first marker on the map. So this is quite nice. We can change the map around. We can change the markers around as well. There are different kinds of markers, so you can add circle markers to your map as well. Definitely experiment with it. It's quite fun, and I think it's another quite neat way to visualize data. Our map was zoomed in way too much at the standard value, so at 12 the marker was nowhere to be found. So we zoom out a little bit in the beginning so we can actually see the markers. We can also add multiple markers by iterating over our DataFrame. For that we just use the iterrows method we have on the DataFrame, and this will return the number of our row and also the row content itself. For that to work, I'll add a sample to the df, because if we added 20,000 markers to our map, that map would be quite unreadable. There we go. So maybe do five for the beginning. Right here, I'll add the i, so that is unpacked in the loop itself. And I can remove iloc right here because we don't have to access any location ourselves; the iteration already does that for us. And we have a nice cluster of a few markers right here. Then you can also go and change these markers, like adding a tooltip when you hover over them. This tooltip can contain any information that we have available. So for example we can use the ocean proximity right here. Now when you hover over it, you can see what kind of class this marker has, according to our data. In this class, we had a look at how to create maps, add markers to these maps, and make them really nice and interactive.
In the next class, we'll have a look at more interactive plots, bar plots and all that, to be able to interact with the data directly and make these nice interactive graphs.

33. Exporting Data and Visualizations: Oftentimes we need to save this data because we don't want to rerun the analysis every time, or we want to share the data with a colleague. That means we need to export the data or the visualizations that we make. This is what we'll do in this class. Having our data slightly modified, we can for example use the one-hot encoding that we used before, just so we know that we have some different data in there. And we can use to_csv, or one of these other methods, you can write out Excel as well, and write this to a file. That way we have the data available after processing it and don't have to rerun the computation every time. And to_csv takes a lot of different arguments, just like read_csv. It's very convenient, and that way you can really replace the NaNs in here with, for example, just the word nan. So people know that this is a not-a-number and not just a missing value where you forgot to add something. Then we can have a look at this file in the Jupyter browser as well and search for nan. We can see right here that it added nan into the file directly. So really a convenient wrapper to get our DataFrame into a shareable format. Instead of this, we can also use Python's file-writing functionality. This is basically how you can write out any kind of file that you want. We'll use open with a .txt file and switch the mode to write mode, so w, and we'll just use f as the file handle right now. And f.write should give us the possibility to write out a string into this file. And we can convert our DataFrame to a string with all the values. I think there should be a to_string method, yeah, a to_string method right here, which is in a sense another export method.
But really this is just to show that you can write out anything that you can format as a string into a file. So we'll see right here that, well, this is not as nicely formatted as before. We have tabs in between instead of the commas. So really it needs a bit of string magic to make this work as nicely as the pandas to_csv. But yeah, this is how you export files in Python. Now something to notice here is that write mode will always overwrite your file. So if you change the content to anything else and have another look at the file, we can see that refreshing this gives us only this. So the file is replaced by the writing operation. There's another mode that we should have a look at, which is append mode. Append mode just has the signifier a, and with it we can add onto a file. This is quite nice if you want to preserve the original data, or if you have some kind of ongoing process that writes out data and adds it to an existing file without deleting that file. So we can see right here that we wrote out our DataFrame. And then we can copy this over, change this to append, execute it, go over, refresh this, and have a look at the very end. It should say "anything" there. And it does. So yeah, those are files. We already did this in the basics of visualization, but in case you skipped that: when you have figures, you can export these figures, usually with the savefig command. This one takes a filename, a file handle, or some kind of signifier, and of course you need some kind of plot. I really want to point you to the tight_layout method right here, because that one is really good for tightening up the layout of your saved plots. So if you save your figure and it looks a bit wonky, tight_layout will really clean up the borders of your figure and usually make it more presentable. I basically run it on almost any exported figure. And here you can see that our figure was exported just fine.
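Both points from this part, write mode versus append mode and saving a figure with tight_layout, can be sketched as follows. This is only an illustration under assumptions: the file names notes.txt and figure.png, the strings written, and the plotted numbers are all made up, and the Agg backend is chosen just so the sketch also runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt

out_dir = Path(tempfile.mkdtemp())  # temporary folder for the demo files
txt_path = out_dir / "notes.txt"  # hypothetical file name

# "w" replaces whatever the file contained before.
with open(txt_path, "w") as f:
    f.write("first version\n")
with open(txt_path, "w") as f:
    f.write("anything\n")
after_write = txt_path.read_text()  # only the second string is left

# "a" keeps the existing content and adds to the end of the file.
with open(txt_path, "a") as f:
    f.write("appended line\n")
after_append = txt_path.read_text()

# Saving a figure: tight_layout trims the borders before savefig writes the file.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8], label="example")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.tight_layout()
fig_path = out_dir / "figure.png"  # hypothetical file name
fig.savefig(fig_path, facecolor="white")  # facecolor sets the figure background
plt.close(fig)
```

Append mode is the safer default for logs or long-running processes, since a second run in write mode silently discards the earlier content.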
You can change around all these parameters, of course, to save the figure in exactly the way you need. Maybe you have a corporate color that you want your figure to be in. In this case, I chose black, which of course is a poor choice if your legends are in black. But yeah, this is just to show you how to do it and how to play around with it. We had a look at how easy it is to save data in different formats from Python. And in the next class we'll actually have a look at how we can generate presentations from Jupyter notebooks directly. 34. Creating Presentations directly in Jupyter: It can be complicated to generate whole presentations, but it is possible to get presentations right out of Jupyter. And in this class, I'll show you how. You can use any kind of notebook. We'll use the one created in the visual exploration class. So you want to go to View, Cell Toolbar, and then Slideshow. And you can change the slide type for each cell, so whether you want to have it displayed or skipped. Slide is going to be one of the main slides. So everything you want on its own slide, you can mark as Slide or Sub-Slide. And Sub-Slide is going to be a different navigation direction. So notice these plots when we look at the presentation in a second. And Fragment is going to add another element to the parent slide, essentially. So we'll check that out as well. After marking these, we can go to File, Download as, and choose Reveal.js slides. When we open this up, we get a presentation right in the browser. Let's get rid of that. So this is a main slide, scrolling to the right essentially, and we can still have a look at the data, and it shows us the code and everything. Sometimes you have to play around a little bit with the plots so that they work. These are the sub-slides that we talked about before. And now the fragment is going to add another element to your slide, essentially. So this is also a fragment, and another fragment.
In this class we had an overview of how to generate presentations in JavaScript and HTML from Jupyter. We saw that we can really preserve the data and the code in our presentations, and even have these plots automatically included. We saw that we can do sub-slides and fragments and really make these super interesting presentations that are different from what you usually see. In the next class, we'll see how to get PDF reports out of Jupyter. 35. Generating PDF Reports from Jupyter: In this final class, you'll learn how to generate PDFs directly from a Jupyter notebook, and how you can get these beautiful visualizations right into your PDFs without any intermediate steps. We'll start out with a Jupyter notebook and go to Print Preview. Here we can already save it as a PDF if we print this. Alternatively, we can download as PDF, but here you have to make sure that you have LaTeX installed. And I know a lot of people don't. I don't on this computer, e.g., so you get a server error. You can go the extra step of going to Download as HTML, opening the HTML, which is equivalent to the print preview, and saving it as a PDF. And in the PDF you can see that this now contains your code and all the information that you had previously available as well. Additionally, we do have nbconvert, the notebook convert functionality that comes with Jupyter Notebook, and I think that's a really nice way to work with it. It shows a help text when you just call jupyter nbconvert, and it'll tell you how to use it, essentially. So what you'll want to do is go to your data, right here in my code repository for this Skillshare course. And there you can just call jupyter nbconvert and then choose the format you want to generate the report in. HTML is usually the default. So if you just call jupyter nbconvert on your notebook, it'll convert it to HTML. You can also supply the --to option.
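The nbconvert calls from this lesson can also be assembled from Python. The following is a minimal sketch under assumptions: the helper nbconvert_command and the notebook name analysis.ipynb are hypothetical, while --to and --execute are regular jupyter nbconvert options (--execute re-runs the notebook before exporting). The conversion is only attempted when the notebook file and the jupyter command are actually present.

```python
import shutil
import subprocess
from pathlib import Path


def nbconvert_command(notebook, to="html", execute=False):
    """Build the argument list for a `jupyter nbconvert` call (hypothetical helper)."""
    cmd = ["jupyter", "nbconvert", "--to", to]
    if execute:
        cmd.append("--execute")  # run all cells top to bottom before exporting
    cmd.append(str(notebook))
    return cmd


notebook = Path("analysis.ipynb")  # hypothetical notebook name

html_cmd = nbconvert_command(notebook)  # HTML report
pdf_cmd = nbconvert_command(notebook, to="pdf", execute=True)  # needs LaTeX installed
slides_cmd = nbconvert_command(notebook, to="slides")  # Reveal.js slides

print(" ".join(pdf_cmd))

# Only run the conversion when the notebook and the jupyter command actually exist.
if notebook.exists() and shutil.which("jupyter"):
    subprocess.run(html_cmd, check=True)
```

Building the command as a list keeps file names with spaces intact and makes it easy to export the same notebook to several formats in one script.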
But if you say PDF, it'll run into the same error as before, that you don't have LaTeX installed. So if you install that, you can easily get these PDF reports directly from Jupyter. Another very nice thing is that, well, if you're in a Jupyter notebook, you often play around a little bit, and the cell execution counts can run to quite high numbers, so like 60 or 70. And that basically shows how much experimentation you did. If you want a clean notebook that gets run top to bottom, you can provide the --execute option, which executes your notebook cell by cell before exporting. And this is how you generate PDFs in Jupyter. So maybe you have to install LaTeX to be able to do it, or you use the print functionality from the HTML reports. But this concludes the class on data science and Python here on Skillshare. Thank you for making it this far. I hope you enjoyed it and I hope this brings you forward in your career. 36. Conclusion and Congratulations!: Congratulations! You made it through the entire course on Data Science with Python here on Skillshare. And I understand that this is a lot. We went through the entire data science workflow, including loading data, cleaning data, then doing exploratory data analysis and building machine learning models, afterwards validating them, looking at interpretability, and also generating reports and presentations from our analysis. This is huge and I understand that this can be overwhelming. But you can always go back to the bite-sized videos and have another look at those to understand them better and deepen your knowledge. In my opinion, the best data scientists just build projects. You will learn more about data science by actually applying your knowledge right now, and that is why we have a project. At the end of this course, you will build a nice data science project, an analysis of your own data or the data that I provide here, and build a PDF with at least one visualization that you like.
Honestly, the more you do, the better. Deep dive into the data, find interesting relationships in your dataset, and really work out how to visualize those best. And this is how you will become a better data scientist, by really applying your knowledge. Thank you again for taking this course. I hope you enjoyed it. And check out my other courses here on Skillshare if you have time. Now, make sure to go out and build something interesting, something that you really like. Thanks again for taking this course. I've put a lot of work into this and I am glad that you made it through to the end. I hope to see you again, and go build something else.