Data Science with Pandas in Python | Lazy Programmer Inc | Skillshare

Data Science with Pandas in Python

Lazy Programmer Inc

Lessons in This Class

    • 1.

      Introduction Video

      3:12

    • 2.

      Pandas Outline

      1:08

    • 3.

      Loading In Data

      3:43

    • 4.

      Selecting Rows And Columns

      9:39

    • 5.

      The apply() Function

      2:23

    • 6.

      Plotting With Pandas

      2:36

    • 7.

      Pandas Exercise

      2:01

    • 8.

      Where to get discount coupons and FREE machine learning material

      5:31

42

Students

About This Class

In this self-paced course, you will learn how to use Pandas to perform critical tasks related to data science and machine learning. This involves loading in, selecting, transforming, and manipulating data using dataframes.

The course includes video presentations, coding lessons, hands-on exercises, and links to further resources.

This course is intended for:

  • Anyone interested in data science and machine learning
  • Anyone who knows Python and wants to take the next step into Python libraries for data science
  • Anyone interested in acquiring tools to implement machine learning algorithms

Suggested prerequisites:

  • Decent Python programming skill
  • Experience with NumPy and Matplotlib

Meet Your Teacher

Level: Beginner

Transcripts

1. Introduction Video: Hey everyone, and welcome to my latest course, Data Science with Pandas in Python. So who am I and why should you listen to me? Well, my name is the Lazy Programmer and I'm the author of over 30 online courses in data science, machine learning, and financial analysis. I have two master's degrees, in engineering and statistics. My career in this field spans over 15 years. I've worked at multiple companies that we now call Big Tech and at multiple startups. Using data science, I've increased revenues by millions of dollars with the teams I've led. But most importantly, I am very passionate about bringing this pivotal technology to you. So what is this course about? This course is all about teaching you foundational skills using the pandas library, which has become standard in the past decade for doing data science with Python. You'll learn how to load in a dataset as a DataFrame and how to manipulate DataFrames in ways that are commonly needed in data science: for example, selecting different rows and columns, applying functions to entire columns, and even making plots. These skills are critical if you want to do data science with Python in the real world. So who should take this course and how should you prepare? This course is designed for students who are interested in data science and machine learning and already have some experience with numerical computing libraries such as NumPy and Matplotlib. The second skill you need is some basic programming. Any language is fine, but since this course uses Python, that would be ideal. Luckily, Python is a very easy language to learn. If you already know another language, you should have no problem catching up. For both of these topics, a high-school-level understanding should be sufficient, and an undergraduate understanding would be even better. So in terms of resources, what will you need in order to take this course? Luckily, not much.
You'll need a computer, a web browser, and a connection to the Internet. And if you're watching this video, then you already meet these conditions. Now, let's talk about why you should take this course and what you should expect to get out of it. Well, what I've noticed after teaching machine learning for many years is that there is a huge gap in knowledge. Students come to a machine learning course wanting to learn machine learning. They'll understand the concepts, but then have no idea how to put those concepts into code, because they don't really know how to code. This course is intended to fill that gap by creating a bridge between regular coding and the type of coding you'll need for data science and machine learning, specifically loading in and manipulating your datasets. By the end of this course, you will have learned enough to go out and use what you've learned on a real dataset. In fact, this is what we'll do as our final project. So I hope you're just as excited as I am to learn about this amazing library. Thanks for listening, and I'll see you in the next lecture.

2. Pandas Outline: In this lecture, we're going to introduce the next section of this course, which is on pandas. Pandas is a library that makes it very easy to read, write, and manipulate data. Now, although there is a lot of functionality in pandas, we will not be able to get to all of it in this short section. This section is focused purely on the fundamentals. We want to answer questions like: how do you load in a CSV, and how do you write a CSV? What is a DataFrame, and how is that different from a NumPy array? And by the way, if you have an R background or you come from statistics, you should feel right at home with DataFrames. We're going to look at basic operations on DataFrames, like selecting specific rows and columns. This will be very strange if you come from a pure programming background.
Because, at first glance, it seems to be the opposite of what you would see in NumPy. We'll look at a special function called the apply function, which lets you perform the same operation on each row of your data without having to write a for loop. Finally, we'll look at how pandas makes plotting your data very convenient.

3. Loading In Data: In this lecture, we're going to look at how to load in data using pandas. Pandas is particularly useful for data which is structured as a table. So it won't deal with image data or audio data or unstructured text data, if that's what you were thinking. Tabular data, when it's stored in a file, is usually in the form of a CSV or a TSV. Those stand for comma-separated values and tab-separated values. You can also use pandas for reading from an Excel spreadsheet, since that has a similar structure, although that would be more unusual. So to start, we're going to import pandas: import pandas as pd. Next, we're going to download a CSV from my GitHub repository. Both you and I are going to copy this URL from my pre-written notebook, so don't try to type this out manually, as some of you sometimes do. We're going to do !wget and then this URL. As you can see, the file we just downloaded is called sbux.csv. Next, we're going to read in the CSV, so that's df = pd.read_csv('sbux.csv'). Note that this command works directly with URLs as well. So if we copy the URL directly, we can do df = pd.read_csv and then paste in that URL. Alright, so that works as well. Next, we can check the type of object that df is by doing type(df). As you can see, it's a DataFrame object and not a NumPy array. Now, just for comparison's sake, we should look at what's in the actual file we downloaded. So let's use the Linux head command. So that's !head sbux.csv. As you can see, there's a header row with the headers date, open, high, low, close, volume, and Name.
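A minimal sketch of the loading step described above. The download URL is not reproduced in this transcript, so the example reads a small in-memory stand-in for sbux.csv instead. The column names follow the lecture; the row values are made up purely for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for sbux.csv: the header matches the lecture,
# but the rows are illustrative values, not real prices.
CSV_TEXT = """date,open,high,low,close,volume,Name
2013-02-08,27.9,28.3,27.9,28.2,7146296,SBUX
2013-02-11,28.3,28.4,28.0,28.1,5457354,SBUX
2013-02-12,28.1,28.3,27.9,28.0,8665592,SBUX
"""

# pd.read_csv accepts a file path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(CSV_TEXT))

print(type(df))          # a pandas DataFrame, not a NumPy array
print(list(df.columns))  # column names taken from the header row
```

With the real file on disk, the same call would simply take the filename, as in the lecture.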
It should be clear that these are stock prices for Starbucks starting in February 2013. Next, we're going to go back to pandas. Pandas has an analogous command, df.head(). Let's try that. As you can see, if you're in a notebook, it shows you a nicely formatted preview of the top of your DataFrame. You can also set the number of rows you want to see as an argument. So we can do df.head(10), and that shows us the first ten rows instead of the first five. Just like in Linux, there's a tail command. So we can do df.tail(), and this shows us the end of the DataFrame. Finally, there's an info function, df.info(), and this tells us some important information about the DataFrame. As you can see, it tells you things like what kind of index the DataFrame uses, how many columns it has, the data types for those columns, and how much memory it takes up.

4. Selecting Rows And Columns: In this lecture, we're going to discuss how to select rows and columns from our DataFrame. This is analogous to indexing an array. So, for example, with a NumPy array, I can ask for the element at row zero, column zero. In that case, I would use the square bracket notation and pass in 0, 0. So let's see if that works with a DataFrame: df[0, 0]. As you can see, this does not work. So before we do anything else, let's check the columns of the DataFrame by using the attribute called columns. So that's df.columns. This returns an Index object with the column names. Note that you can also do assignment on this attribute with a list of column names. So let's say I don't like the fact that the Name column is the only one that is capitalized, since it offends my sense of uniformity. So let's change that to lowercase. We can do df.columns = and then just assign it a list, with Name changed to lowercase. And there we go. We can also check that it worked. Alright, so it works. Alright, so here's an idea. What if I pass in one of these column names into the square brackets?
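The inspection calls and the column rename described above can be sketched as follows. This assumes a small made-up table in place of the lecture's sbux.csv; only the column names come from the lecture.

```python
import io

import pandas as pd

# Hypothetical stand-in for sbux.csv with made-up rows.
CSV_TEXT = """date,open,high,low,close,volume,Name
2013-02-08,27.9,28.3,27.9,28.2,7146296,SBUX
2013-02-11,28.3,28.4,28.0,28.1,5457354,SBUX
2013-02-12,28.1,28.3,27.9,28.0,8665592,SBUX
2013-02-13,28.0,28.2,27.8,28.1,7022056,SBUX
2013-02-14,28.0,28.3,27.9,28.2,8899188,SBUX
2013-02-15,28.2,28.3,27.8,27.9,6109522,SBUX
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

print(df.head())    # first five rows by default
print(df.head(2))   # or pass the number of rows you want
print(df.tail(2))   # rows from the end of the DataFrame
df.info()           # index type, columns, dtypes, memory usage

print(df.columns)   # an Index object holding the column names

# Assign a full list of names to rename; here only Name is lowercased.
df.columns = ["date", "open", "high", "low", "close", "volume", "name"]
```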
So let's try df['open']. As you can see, this returns the open column of the DataFrame. We can also select multiple columns by using a list of column names. So let's try df[['open', 'close']]. And that returns both columns. Now, just out of curiosity, let's check the data type of the open column. So that's type(df['open']). Interesting, so it's a Series. Now let's check the type of the open and close columns together. So this is a DataFrame. The lesson here is that when you only have one dimension in pandas, it's typically stored as a Series. If it's two-dimensional, it's a DataFrame. At this point, you might be thinking pandas is very weird, because square brackets are used to select columns, whereas in NumPy and every other kind of array, the square brackets would usually select the rows. The obvious question now is, how do we select a row in a DataFrame? The answer is that we can accomplish this using the iloc and loc attributes. So we can do df.iloc[0], and this returns the zeroth row of the DataFrame. You might want to double-check that. We can also do df.loc[0], and this also returns that same row. So you might be wondering, what's the difference? The difference is that iloc always uses integer positions, no matter what, whereas loc selects the row by its index label. It just so happens that in our DataFrame they are one and the same. To demonstrate this difference, let's load in our DataFrame again, but this time we'll specify that the date column should be the index. So we'll do df2 = pd.read_csv('sbux.csv', index_col='date'). By the way, you are strongly encouraged to read the documentation for pandas. There are many arguments for the many functions that pandas has, and you'll basically never be able to remember them all. So get used to using the documentation. Now let's do df2.head(). As you can see, the date column now appears to have some kind of special status.
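A short sketch of the column selection and the iloc versus loc distinction discussed above, again assuming a small made-up table in place of sbux.csv (with the name column already lowercased).

```python
import io

import pandas as pd

# Hypothetical stand-in for sbux.csv with made-up rows.
CSV_TEXT = """date,open,high,low,close,volume,name
2013-02-08,27.9,28.3,27.9,28.2,7146296,SBUX
2013-02-11,28.3,28.4,28.0,28.1,5457354,SBUX
2013-02-12,28.1,28.3,27.9,28.0,8665592,SBUX
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

open_col = df["open"]              # one column name -> Series
two_cols = df[["open", "close"]]   # list of column names -> DataFrame

# With the default integer index, position 0 and label 0 are the same row.
first_by_position = df.iloc[0]
first_by_label = df.loc[0]

# Using the date column as the index makes the labels date strings,
# so loc needs a date while iloc still takes an integer position.
df2 = pd.read_csv(io.StringIO(CSV_TEXT), index_col="date")
row_by_date = df2.loc["2013-02-08"]
row_by_position = df2.iloc[0]

print(type(open_col), type(two_cols), type(row_by_date))
```

Both row_by_date and row_by_position refer to the first row here; they only differ in how the row is addressed.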
In fact, it's the index for this DataFrame. So now we can do df2.loc and pass in one of these index labels, here the first date, and this returns the first row of the DataFrame. By the way, if we check the type of this row, we can see that it's also a Series. So both individual rows and individual columns are Series objects. Now, let's talk about how we can select multiple rows of the DataFrame. Suppose I want all the rows where the open price was greater than 64. So I can do df[df['open'] > 64]. Alright, so these are all the rows where the open price is greater than 64. Now suppose I want all the rows where the name is not SBUX. So I can do df[df['name'] != 'SBUX']. Okay, so we have no rows where the name is not SBUX. So it seems that, using the square bracket notation, I can pass in something like a Boolean. Code like this works from the inside out, so let's check what this Boolean thing actually is. Let's check the type. So, perhaps unsurprisingly, it's a Series containing Boolean values. So the square brackets on a DataFrame accept a Boolean Series as input. Now, oddly, this behavior does match how NumPy arrays work. In my opinion, NumPy is more consistent here, because this involves row selection and not column selection. So let's do this. Let's do import numpy as np, and then a = np.array(range(10)). Let's just see what a is. So this is an array of the integers from 0 to 9. Now let's say I just want to keep the even numbers. Then I can do a[a % 2 == 0]. This gives me all the even numbers in that array. Now, as homework, you can check the data type of the thing we just passed into the square brackets, so a % 2 == 0. Now, in building machine learning algorithms, you usually want to work with arrays of numbers and not DataFrames, which can contain all kinds of objects. So how can we convert a DataFrame into a NumPy array? We can use the values attribute. So that's just df.values.
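The Boolean row selection described above, next to its NumPy counterpart, might look like the sketch below. The table is a made-up stand-in whose open prices straddle 64 so that the filter has something to do.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in data; values chosen so some opens exceed 64.
CSV_TEXT = """date,open,high,low,close,volume,name
2017-06-01,63.5,64.2,63.1,63.9,7000000,SBUX
2017-06-02,64.3,64.9,64.0,64.6,6500000,SBUX
2017-06-05,64.6,65.0,64.2,64.4,6000000,SBUX
2017-06-06,63.9,64.5,63.6,64.1,7200000,SBUX
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# The inner comparison produces a Boolean Series, one value per row.
mask = df["open"] > 64
print(type(mask), mask.dtype)

# Passing that Series into the square brackets keeps the True rows.
high_open = df[mask]
not_sbux = df[df["name"] != "SBUX"]   # empty: every row is SBUX

# NumPy arrays accept a Boolean array in the same way.
a = np.array(range(10))      # integers 0 through 9
evens = a[a % 2 == 0]
print(evens)
```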
Now unfortunately, this gives us the dtype object, which is not what we want if we're doing machine learning, because now there are strings inside of this array. So let's see what happens if we only select the numerical columns. So let's do a = df[['open', 'close']].values, and we'll check what it is. Okay, so now we have a proper array of numbers. Let's check the type of a. Alright, so it's a NumPy array, as expected. Alright, so suppose now that we've done what we needed to do with our DataFrame and we would like to save it to a file. This is accomplished with the to_csv function. Let's say I want to keep only the open and close columns. Then I can do small_df = df[['open', 'close']]. And then I can do small_df.to_csv('output.csv'). Okay, and that just saved my DataFrame to a file called output.csv. Now we can use the Linux head command to see what's in our file, so we can do !head output.csv. Now unfortunately, there seems to be a pretty useless index column in our file. Luckily, we can get rid of this. So we'll run this same line and add a new argument, index=False. Now we can try the head command again, and the index column is gone.

5. The apply() Function: In this lecture, we're going to discuss the apply function. The typical use case for the apply function is similar to the scenario where we want to do some operation on each element of a list, for example, if we wanted to square each item. Of course, in Python, we know that for loops are slow, so we would like to avoid them if possible. The apply function can be used if you want to do the same operation on each row of a DataFrame or each column of a DataFrame. In other words, it does what you want to do with a for loop without having to actually write a for loop. So let's do an example. Suppose I want to have a column called year, where I take the existing date column, parse out the year, and convert it to an integer.
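The conversion to a NumPy array and the save step described above can be sketched like this. The data is a made-up stand-in for sbux.csv, and the output goes to a temporary directory rather than the working directory so the example cleans up after itself.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for sbux.csv with made-up rows.
CSV_TEXT = """date,open,high,low,close,volume,name
2013-02-08,27.9,28.3,27.9,28.2,7146296,SBUX
2013-02-11,28.3,28.4,28.0,28.1,5457354,SBUX
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Mixed string and numeric columns give an array of dtype object.
all_values = df.values
# Selecting only numeric columns gives a proper floating-point array.
a = df[["open", "close"]].values
print(all_values.dtype, a.dtype, type(a))

small_df = df[["open", "close"]]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.csv")

    # Default: the index is written as an extra, unnamed first column.
    small_df.to_csv(path)
    with open(path) as f:
        header_with_index = f.readline().strip()

    # index=False drops that column from the file.
    small_df.to_csv(path, index=False)
    with open(path) as f:
        header_without_index = f.readline().strip()

print(header_with_index)
print(header_without_index)
```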
So let's start by writing a function that accepts as input a single row of a DataFrame. So that would be def date_to_year, and it takes in a row. We're going to return int of row['date'], where we split that string on a dash and then grab the zeroth element. Now, if you can't see how this works right away, I would suggest trying this on a dummy date string. Recall that the format is year, dash, month, dash, day. Next, we're going to apply this function to every row of our DataFrame. So we do df.apply(date_to_year, axis=1). The first argument is the function, and then axis=1. The axis=1 is necessary; otherwise, this will operate column-wise instead of row-wise. So let's run this. And as you can see, we get out a Series containing only the year, of dtype int64. Note that we can also assign this Series to a new column. So we can do df['year'] = what we have above. Alright, now let's check what this did to our DataFrame. As you can see, there's a new column called year.

6. Plotting With Pandas: In this lecture, we're going to look at how to plot with pandas. Pandas makes this very easy, since it provides instance methods on both Series and DataFrames that automatically generate plots. So let's try a few. We're going to do df['open'].hist(). As you can see, this creates a histogram. How about df['open'].plot()? As you can see, this creates a line chart. By the way, these method names correspond to their Matplotlib versions, which makes them easy to remember. We can also do more interesting plots, like the box plot. Now of course, the box plot is useful for numerical columns. So let's select open, high, low, and close. So that would be df[['open', 'high', 'low', 'close']], and we'll do .plot.box(). And so this is a box plot. Another plot that is very useful for getting a quick summary of your data is the scatter matrix. So let's plot this first and then we'll discuss what we're seeing. So we're going to import scatter_matrix.
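Going back to the apply() example from the previous lecture, the date_to_year function and the row-wise apply call can be sketched as below. The data is a made-up stand-in; only the date format (year-month-day) is taken from the lecture.

```python
import io

import pandas as pd

# Hypothetical stand-in for sbux.csv with made-up rows.
CSV_TEXT = """date,open,close
2013-02-08,27.9,28.2
2013-12-31,39.0,39.2
2014-01-02,39.1,38.6
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))


def date_to_year(row):
    """Take one row and return the year part of its date as an int."""
    # '2013-02-08'.split('-') -> ['2013', '02', '08']
    return int(row["date"].split("-")[0])


# axis=1 passes each row to the function; without it, apply would
# pass each column instead.
years = df.apply(date_to_year, axis=1)
print(years)

# The resulting Series can be stored as a new column.
df["year"] = years
print(df.head())
```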
So that's from pandas.plotting import scatter_matrix. And then we're going to call this function. So scatter_matrix, with the same DataFrame as above, and then we'll say alpha=0.2 and figsize=(6, 6). Okay, this is a scatter matrix. As you may know, alpha=0.2 makes the dots have transparency, and figsize makes the plot a little bigger so we can see it better. So what is this plot? Basically, this plot shows the linear correlation between each pair of data columns. On the diagonal, we get a histogram of each individual column, so it lets us see the distribution of our data. In other words, this is a statistical summary of the data. We see what kind of distribution each column has and how the columns are related to one another.

7. Pandas Exercise: In this lecture, we're going to go over the next exercise, for the pandas section. In this exercise, you will combine what you've learned in the previous sections and take it a step further. You will also have to make use of the pandas documentation in order to complete this exercise. By the way, using documentation is very important, because these libraries are constantly being updated and the APIs are always changing. You could take an entire weeks-long bootcamp about pandas and still not know half of what pandas has to offer. And even if you did study the entire pandas API, which by the way is very unlikely, you still wouldn't be able to memorize the entire thing anyway. Even if you could memorize all that information, why would you want to, when what you've memorized might change? So don't try to memorize syntax or get too attached to any particular way of doing things. Just learn to use the documentation. And another note: you should do this without using blogs or tutorials. Use only the official pandas and NumPy documentation. Alright, so what's the exercise? In this exercise, you will generate the donut dataset, or the concentric circles dataset.
Once you've generated the dataset, which of course will be stored in an array, you're going to create a DataFrame from that array. You'll want to name the columns x1 and x2. Then you'll want to derive new columns based on x1 and x2. We call this a quadratic feature expansion. So you want to generate three new columns: x1 squared, x2 squared, and x1 times x2. You may find that the apply function is useful here. Also, you'll want to name these columns appropriately. Finally, after you've completed your DataFrame, save it to a CSV without any headers and without any index column. Thus, your CSV should contain only the numbers that would be stored if it were a NumPy array. Good luck, and I'll see you in the next lecture.

8. Where to get discount coupons and FREE machine learning material: Hey everyone, and welcome back to this class. In this lecture, I'm going to answer one of the most common questions I get: where can I get discount coupons and free deep learning material? Let's start with coupons. I have several ways for you to keep up to date with me. The absolute number one, best way for you to keep up to date with newly released discount coupons is to subscribe to my newsletter. There are several ways you can do this. First, you can visit my website, lazyprogrammer.me. At the top of the page, there is a box where you can enter your email and sign up for the newsletter. Another website I own and operate is deeplearningcourses.com. This website largely contains the same courses as you see on this platform, but it also contains extra VIP material. More on that later. If you scroll to the bottom of this website, you'll find a box to enter your email, which will sign you up for the same newsletter as on lazyprogrammer.me. So you only have to do one of these. Now let's do a small digression, because this is another common question I get: what's this VIP material all about, and how can I get it? So here's how the VIP thing works.
Usually when I release a course, I'll release it with temporary VIP material, which is exclusive to those early birds who signed up for the course during my announcement. This is a nice little reward for those of you who stay up to date with my announcements and, of course, actually read them. It's important to note that VIP material can come out at any time. For example, I could make major updates to a course three years after starting it and do another VIP release. The purpose of deeplearningcourses.com is to have a permanent home for these VIP materials. So even though it could be temporary on the platform you signed up on, if you sign up for the VIP version of the course, then you'll get access to the VIP materials on deeplearningcourses.com permanently, upon request. Here are some examples of materials you might find in the VIP sections. In my TensorFlow 2 course, there are three extra hours of material on Deep Dream and object localization. Usually I don't release the VIP content in video format, but this was an exception. Another example, in my cutting-edge AI course, was an extra written section on the TD3 algorithm. This course covered three algorithms in total, so the extra section gives you one more, or in other words, 33% more material. Another example, in my advanced NLP and RNNs course, is a section on speech recognition using deep learning. In addition, there is an all-new section of the course on stock predictions or memory networks, depending on which version of the course you are taking. The reason for this is that I might release slightly different versions of each course on different platforms. Because of how the rules on all these platforms work, I must differentiate the courses. However, since I own deeplearningcourses.com, this is the only platform that contains the most complete version of the course, which includes all the sections. Please note that this is rare, so depending on what course you are taking, it might not affect you.
Alright, so let's get back to discount coupons and free material. Other places where I announce discount coupons are Facebook, Twitter, and YouTube. You might want to pause this video so you can go to these URLs and follow me or subscribe to me on these sites, if they are websites that you use regularly. So for Facebook, that's facebook.com/lazyprogrammer.me. For Twitter, that's twitter.com/lazy_scientist. For YouTube, youtube.com/c/lazyprogrammerx. Occasionally, I still release completely free material. This is nice if I want to just talk about a singular topic without having to make an entire course for it. For example, I just released a video on stock market predictions and why most other blogs and courses approach this problem completely wrong. That's another benefit of signing up for these things: I get to expose fake data scientists who are really marketers, whereas I wouldn't ever make an entire course about that. Sometimes this can be in written form and sometimes it can be in video form. If it's in written form, it will either be on lazyprogrammer.me or deeplearningcourses.com. If it's a video, it will be on YouTube. So make sure you subscribe to me on YouTube. If I release a video, I may also make a post about it on lazyprogrammer.me, and I may also announce it using the other methods I previously discussed. So that's the newsletter, Facebook, Twitter, and obviously YouTube itself. Now, I realize that's a lot of stuff and you probably don't use all these platforms. I certainly don't, at least not regularly. So if you want to do the bare minimum, here's what you should do. First, sign up for my newsletter. Remember, you can do that either on lazyprogrammer.me or deeplearningcourses.com. Second, subscribe to my YouTube channel at youtube.com/c/lazyprogrammerx. Thanks for listening, and I'll see you in the next lecture.