Transcripts
1. Introduction Video: Hey everyone and welcome
to my latest course, Data Science with
pandas and Python. So who am I and why
should you listen to me? Well, my name is the Lazy
Programmer and I'm the author of over 30 online
courses in data science, machine learning, and
financial analysis. I have two master's degrees in engineering and statistics. My career in this field
spans over 15 years. I've worked at multiple
companies that we now call Big Tech and multiple startups. Using data science,
I've increased revenues by millions of dollars
with the teams I've led. But most importantly,
I am very passionate about bringing this
pivotal technology to you. So what is this course about? This course is all
about teaching you foundational skills
using the pandas library, which has become standard in the past decade for doing
data science with Python. You'll learn
how to load in a dataset as a
DataFrame and how to manipulate DataFrames
in ways that are commonly needed in data science, e.g. selecting different
rows and columns, applying functions
to entire columns, and even how to make plots. These skills are critical
if you want to do data science with Python
in the real world. So who should take this course and how should you prepare? This course is designed
for those students who are interested in data
science and machine learning and already have some experience with numerical
computing libraries, such as NumPy and Matplotlib. The second skill you need
is some basic programming. Any language is fine, but since this
course uses Python, that would be ideal. Luckily, Python is a very
easy language to learn. If you already know
another language, you should have no
problem catching up. For both of these topics, a high-school level understanding
should be sufficient, and an undergraduate understanding
would be even better. So in terms of resources, what will you need in
order to take this course? Luckily, not much. You'll need a computer, a web browser, and a
connection to the Internet. And if you're
watching this video, then you already meet
these conditions. Now, let's talk about
why you should take this course and what you should
expect to get out of it. Well, what I've noticed after teaching machine learning for many years is that there is
a huge gap in knowledge. Students come to a
machine-learning course wanting to learn
machine learning. They'll understand the concepts, but then have no idea how
to put those concepts into code because they don't
really know how to code. This course is intended to fill that gap by creating
a bridge between regular coding
and the type of coding you'll need for data
science and machine learning, specifically loading in and
manipulating your datasets. By the end of this course, you will have learned enough
to go out and use what you've learned on
a real dataset. In fact, this is what we'll
do as our final project. So I hope you're
just as excited as I am to learn about
this amazing library. Thanks for listening,
and I'll see you in the next lecture.
2. Pandas Outline: In this lecture, we're going to introduce the next
section of this course, which is on pandas. Pandas is a library that
makes it very easy to read, write, and manipulate data. Now, although there is a lot
of functionality in Pandas, we will not be able
to get to all of it in this short section. This section is focused
purely on the fundamentals. We want to answer
questions like, how do you load in a CSV
and how do you write a CSV? What is a DataFrame and how is that different from
a NumPy array? And by the way, if you have an R background or
you are from statistics, you should feel right at
home with DataFrames. We're going to look at
basic operations on DataFrames like selecting
specific rows and columns. This will be very
strange if you come from a pure programming
background, because at first glance
it seems to be the opposite of what you would see in NumPy. We'll look at a special function called the apply function, which lets you perform the
same operation on each row of your data efficiently without
having to use a for-loop. Finally, we'll look
at how Pandas makes plotting your data
very convenient.
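To make the DataFrame versus NumPy array question concrete before the walkthrough, here is a minimal sketch. The column names and values are made up for illustration and are not from the course dataset.

```python
import numpy as np
import pandas as pd

# A NumPy array stores one dtype for all elements and is indexed by position.
arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# A DataFrame has labeled columns, and each column can have its own dtype.
# "price" and "ticker" are hypothetical names chosen for this sketch.
df = pd.DataFrame({"price": [1.0, 3.0], "ticker": ["AAA", "BBB"]})

print(arr.dtype)     # a single dtype for the whole array
print(df.dtypes)     # one dtype per column
print(arr[0, 0])     # positional indexing: row 0, column 0
print(df["price"])   # label-based column selection
```

The rest of this section looks at each of these DataFrame operations in more detail.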
3. Loading In Data: In this lecture, we're
going to look at how to load in data using pandas. Pandas is particularly useful for data which is
structured as a table. So it won't deal
with image data or audio data or
unstructured text data, if that's what you
were thinking. Tabular data, when
it's stored in a file, is usually in the form
of a CSV or a TSV. These stand for comma-separated values and
tab-separated values. You can also use pandas for reading from an
Excel spreadsheet, since that has a
similar structure, although that would
be more unusual. So to start, we're
going to import pandas: import pandas as pd. Next we're going to download a CSV from my GitHub repository. Both you and I are going to copy this URL from my
pre-written notebook. So don't try to type this out manually as some of
you sometimes do. We're going to do !wget and then this URL. As you can see, the file we just downloaded is called
sbux.csv. Next we're going to
read in the CSV, so that's df = pd.read_csv('sbux.csv'). Note that this command works
directly with URLs as well. So if we copy the URL directly, we can do df = pd.read_csv( and then
paste in that URL. Alright, so that works as well. Next, we can check the
type of object that df is by doing type(df). As you can see, it's a DataFrame object and
not a NumPy array. Now, just for comparison's sake, we should look at what's in the actual file we downloaded. So let's use the Linux head command. So that's !head
sbux.csv. So as you can see, there's a header row
with the headers date, open, high, low, close,
volume, and Name. It should be clear that
these are stock prices for Starbucks starting
in February 2013. Next we're going to
go back to pandas. Pandas has an analogous
command, df.head(). Let's try that. And
as you can see, if you're in a notebook,
this shows you a nicely formatted preview of
the top of your DataFrame. You can also set the number of rows you want to
see as an argument. So we can do df.head(10). And that shows us
the first ten rows instead of the first five. Just like in Linux,
there's a tail command. So we can do df.tail(). And this shows us the
end of the DataFrame. Finally, there's an info
function, df.info(). And this tells us some
important information about the DataFrame. So as you can see, it tells you things
like what kind of index the DataFrame uses, how many columns it has, the datatypes for those columns, and how much memory it takes up.
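The loading steps from this lecture can be sketched as follows. This is a minimal sketch under one assumption: instead of downloading the real file, it reads a few made-up rows from an in-memory string that has the same column layout as described above, so it runs without a network connection.

```python
import io
import pandas as pd

# Made-up rows in the column layout the lecture describes.
# The real file has many more rows and different values.
csv_text = """date,open,high,low,close,volume,Name
2013-02-01,10.0,11.0,9.5,10.5,1000,SBUX
2013-02-02,10.5,12.0,10.0,11.5,1200,SBUX
2013-02-03,11.5,12.5,11.0,12.0,900,SBUX
2013-02-04,12.0,12.2,10.8,11.0,1500,SBUX
2013-02-05,11.0,11.8,10.5,11.2,800,SBUX
2013-02-06,11.2,11.9,11.0,11.7,1100,SBUX
"""

# read_csv accepts a file path, a URL, or a file-like object such as this one.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))     # a pandas DataFrame, not a NumPy array
print(df.head())    # first five rows by default
print(df.head(2))   # first two rows
print(df.tail(2))   # last two rows
df.info()           # index type, columns, dtypes, memory usage
```

With the real file, the only change would be passing the file name or URL to read_csv in place of the StringIO object.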
4. Selecting Rows And Columns: In this lecture, we're
going to discuss how to select rows and columns
from our DataFrame. This is analogous to
indexing an array. So e.g. with a NumPy array, I can ask for the element at
row zero, column zero. And in that case, I would use the square bracket notation and pass in 0, 0. So let's see if that
works with a DataFrame: df[0, 0]. As you can see,
this does not work. So before we do anything else, let's check the columns
of the DataFrame by using the attribute
called columns. So that's df.columns. This returns an Index
object with the column names. Note that you can
also do assignment on this attribute with a
list of column names. So let's say I don't
like the fact that the Name column is the only
one that is capitalized, since it offends my
sense of uniformity. So let's change
that to lowercase. We can do df.columns = and
then just assign it a list, with Name changed to lowercase. And there we go. And we can also check
that it worked. All right, so it works. Alright, so here's an idea. What if I pass one of these column names into
the square brackets? So let's try df['open']. As you can see, this returns the open column
of the DataFrame. We can also select
multiple columns by using a list of column names. So let's try df[['open', 'close']]. And that returns both columns. Now, just out of curiosity, let's check the datatype
for the open column. So that's type(df['open']). Interesting, so it's a Series. Now let's check the type of
the open and close columns together. So this is a DataFrame. The lesson here is that
when you only have one dimension, in pandas it is
typically stored as a Series. If it's two-dimensional,
it's a DataFrame. At this point, you might be
thinking pandas is very weird because square brackets are
used to select columns. Whereas in NumPy and every
other kind of array, the square brackets would
usually select the rows. The obvious question now is, how do we select a
row in a DataFrame? The answer is that we can
accomplish this using the iloc and loc attributes. So we can do df.iloc[0]. And this returns the zeroth
row of the DataFrame. You might want to
double-check that. We can also do
df.loc[0]. And this also returns
that same row. So you might be wondering
what's the difference? The difference is
that iloc is used for integer positions
no matter what, whereas loc selects the
row by the index label. And it just so happens that in our DataFrame they
are one and the same. To demonstrate this difference, let's load in our
DataFrame again, but this time we'll specify that the date column
should be the index. So we'll do df2 = pd.read_csv('sbux.csv', and then we'll say
index_col='date'). By the way, you are
strongly encouraged to read the documentation
for pandas. There are many arguments for the many functions that Pandas has and you'll basically never be able to
remember them all. So get used to using
the documentation. Now let's do df2.head(). So as you can see, the date column appears to now have some kind of
special status. In fact, it's the index
for this DataFrame. So now we can do df2.loc, and then we can pass in
one of these index values. Passing in the first date returns the first
row of the DataFrame. By the way, if we check
the type of this row, we can see that
it's also a Series. So both individual rows and individual columns
are Series objects. Now, let's talk about how we can select multiple rows
of the DataFrame. Suppose I want all
the rows where the open price was
greater than 64. So I can do df[df['open'] > 64]. Alright, so these are
all the rows where the open price is
greater than 64. Now suppose I want
all the rows where the name is not Starbucks. So I can do df[df['name'] != 'SBUX']. Okay, so we have no rows where
the name is not Starbucks. So it seems that using the
square bracket notation, I can pass in something like a Boolean. Code like this
works from the inside out. So let's check what this
Boolean thing actually is. Let's check the type. So perhaps unsurprisingly, it's a Series containing
Boolean values. So the square brackets
on a DataFrame accept a Boolean
Series as input. Now, oddly, this behavior does match how
NumPy arrays work. In my opinion, NumPy is more
consistent here because this involves row selection and
not column selection. So let's do this. Let's do import numpy as np, and then
A = np.array(range(10)). Let's just see what A is. So this is an array
of the integers 0 through 9. Now let's say I just want
to keep the even numbers. Then I can do A[A % 2 == 0]. This gives me all the even
numbers in that array. Now as homework, you can
check the data type of the thing we just passed
into the square brackets. So A % 2 == 0. Now in building machine
learning algorithms, you usually want to work
with arrays of numbers and not DataFrames, which can
contain all kinds of objects. So how can we convert a
DataFrame into a NumPy array? We can use the values attribute. So that's just df.values. Now unfortunately, this
gives us the dtype object, which is not what we want if we're doing machine learning, because now there are strings
inside of this array. So let's see what
happens if we only select the numerical columns. So let's do A = df[['open', 'close']].values, and
we'll check what it is. Okay, so now we have a
proper array of numbers. Let's check the type of A. Alright, so it's a NumPy
array as expected. Alright, so suppose
now that we've done what we needed to do
with our DataFrame, we would like to
save it to a file. This is accomplished with
the to_csv function. Let's say I want to keep only the open and
close columns. Then I can do small_df
= df[['open', 'close']]. And then I can do
small_df.to_csv('output.csv'). Okay, and that just
saved my DataFrame to a file called output.csv. Now we can use the Linux head command to see what's in our file, so we can do !head output.csv. Now unfortunately,
there seems to be a pretty useless index
column in our file. Luckily, we can get rid of this. So we'll do this same line, and we'll add a new argument, index=False. Now we can try the
head command again. And the index column is gone.
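The selection steps from this lecture can be sketched in one place. This is a minimal sketch that assumes a tiny made-up table with the same columns as the Starbucks file, so the threshold of 11 stands in for the 64 used above, and the CSV is written to an in-memory buffer rather than to output.csv.

```python
import io
import numpy as np
import pandas as pd

# Made-up rows with the same columns as the lecture's file.
csv_text = """date,open,high,low,close,volume,Name
2013-02-01,10.0,11.0,9.5,10.5,1000,SBUX
2013-02-02,10.5,12.0,10.0,11.5,1200,SBUX
2013-02-03,11.5,12.5,11.0,12.0,900,SBUX
2013-02-04,12.0,12.2,10.8,11.0,1500,SBUX
"""
df = pd.read_csv(io.StringIO(csv_text))

# Rename columns by assigning a full list (this lowercases "Name").
df.columns = ["date", "open", "high", "low", "close", "volume", "name"]

open_col = df["open"]             # one column name -> Series
two_cols = df[["open", "close"]]  # list of column names -> DataFrame

row_by_position = df.iloc[0]      # integer position
row_by_label = df.loc[0]          # index label (default labels are 0, 1, 2, ...)

# With dates as the index, loc takes a date label; iloc still takes a position.
df2 = pd.read_csv(io.StringIO(csv_text), index_col="date")
row_by_date = df2.loc["2013-02-01"]

# A Boolean Series inside the square brackets filters rows.
mask = df["open"] > 11
high_open = df[mask]

# Selecting only numeric columns gives a numeric array.
A = df[["open", "close"]].values
print(isinstance(A, np.ndarray), A.dtype)

# index=False leaves the row index out of the output.
# A file path such as "output.csv" can be passed instead of the buffer.
buffer = io.StringIO()
df[["open", "close"]].to_csv(buffer, index=False)
print(buffer.getvalue())
```

The pandas documentation also describes a to_numpy() method as an alternative to the values attribute; the sketch keeps values to match the lecture.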
5. The apply() Function: In this lecture, we're going to discuss the apply function. The typical use case for the apply function
would be similar to the scenario where we want to do some operation on each
element of a list, e.g. if we wanted to
square each item. Of course, in Python we know that for loops are slow, so we would like to
avoid them if possible. The apply function can be
used if you want to do the same operation
on each row of a DataFrame or each
column of a DataFrame. In other words, it does what
you want to do with a for loop without having to
actually write a for loop. So let's do an example. Suppose I want to have
a column called year, where I take the
existing date column, parse out the year and
convert it to an integer. So let's start by writing
a function that accepts as input a single
row of a DataFrame. So that would be def date_to_year,
and it takes in a row. So we're going to return int
of row['date'], where we split that
string on a dash
and grab
the zeroth element before converting, i.e. int(row['date'].split('-')[0]). Now if you can't see how
this works right away, I would suggest trying this
on a dummy date string. Recall that the format is
year-month-day. Next, we're going to apply this function to
every row of our DataFrame. So we do
df.apply(date_to_year, axis=1). So the first argument is the
function, followed by axis=1. The axis=1
is necessary. Otherwise, this will operate column-wise instead of row-wise. So let's run this. And as you can see, we get out a Series containing only the year, of dtype int64. Note that we can also assign this Series
to a new column. So we can do df['year']
= what we have above. Alright, now let's check what
this did to our DataFrame. As you can see, there's a
new column called year.
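The apply example above can be sketched as follows, assuming a small made-up DataFrame with a date column in year-month-day format. The last line shows one alternative using the .str accessor, which is not part of the lecture and is included only for comparison.

```python
import pandas as pd

# Made-up data; only the date format matches the lecture's file.
df = pd.DataFrame({
    "date": ["2013-02-01", "2014-03-15", "2015-12-31"],
    "open": [10.0, 20.0, 30.0],
})

def date_to_year(row):
    # "2013-02-01" -> ["2013", "02", "01"] -> "2013" -> 2013
    return int(row["date"].split("-")[0])

# axis=1 passes each row to the function; without it, each column is passed.
df["year"] = df.apply(date_to_year, axis=1)
print(df)

# Alternative (not from the lecture): string methods applied to the whole column.
year_alt = df["date"].str.split("-").str[0].astype(int)
```

Note that apply with axis=1 still calls the Python function once per row, so for large tables it may be worth comparing against a column-wise approach like the one on the last line.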
6. Plotting With Pandas: In this lecture,
we're going to look at how to plot with pandas. Pandas makes this
very easy since it provides instance methods on both series and DataFrames that automatically
generate plots. So let's try a few. So we're going to do
df['open'].hist(). As you can see, this
creates a histogram. How about df['open'].plot()? As you can see, this
creates a line chart. By the way, these method names correspond to their
Matplotlib versions, which makes them
easy to remember. We can also do more interesting
plots like the boxplot. Now of course, the boxplot is useful for numerical columns. So let's select open,
high, low and close. So that would be df[['open', 'high', 'low', 'close']], and we'll do .plot.box(). And so this is a boxplot. Another plot that is
very useful for getting a quick summary of your
data is the scatter matrix. So let's plot this first and then we'll discuss
what we're seeing. So we're going to
import scatter_matrix. So from pandas.plotting
import scatter_matrix. And then we're going
to call this function. So scatter_matrix, same
DataFrame as above, and then we'll say alpha=0.2 and figsize=(6, 6). Okay, this is a scatter matrix. So as you know, alpha=0.2
makes the dots have transparency and
figsize makes the plot a little bigger
so we can see it better. So what is this plot? Basically, this plot shows the linear correlation between
each of the data columns. On the diagonal, we get a histogram of each
individual column. So it lets us see the
distribution of our data. In other words, this
is a statistical summary of the data. We see what kind of
distribution each column has and how they are
related to one another.
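The plotting calls from this lecture can be sketched as follows. This is a minimal sketch under two assumptions: the price columns are randomly generated stand-ins for the Starbucks data, and the Agg backend is selected so the script also runs outside a notebook without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up price-like columns standing in for the real dataset.
rng = np.random.default_rng(0)
base = 50 + rng.normal(size=200).cumsum()
df = pd.DataFrame({
    "open": base,
    "high": base + 1.0,
    "low": base - 1.0,
    "close": base + rng.normal(scale=0.5, size=200),
})

plt.figure()
ax_hist = df["open"].hist()   # histogram of one column

plt.figure()
ax_line = df["open"].plot()   # line chart of one column

cols = ["open", "high", "low", "close"]
ax_box = df[cols].plot.box()  # one box per numerical column

# Grid of pairwise scatter plots, with histograms on the diagonal.
axes = scatter_matrix(df[cols], alpha=0.2, figsize=(6, 6))
print(axes.shape)
```

In a notebook the figures are displayed automatically; in a script you would typically call plt.show() or save each figure to a file.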
7. Pandas Exercise: In this lecture, we're
going to go over the next exercise for
the pandas section. In this exercise, you will
combine what you've learned in the previous sections and
take it a step further. You will also have to
make use of the pandas documentation in order to
complete this exercise. By the way, using documentation
is very important because these libraries are
constantly being updated and the APIs
are always changing. You could take an entire
weeks-long boot camp about pandas and still not know half of what pandas
has to offer. And even if you did study
the entire pandas API, which by the way
is very unlikely, you still wouldn't be able to memorize the entire
thing anyway. Even if you could memorize
all that information, why would you want to when what you've memorized might change? So don't try to memorize
syntax or get too attached to any particular
way of doing things, just learn to use
the documentation. And another note,
you should do this without using blogs
or tutorials. Use only the official Pandas
and NumPy documentation. Alright, so what's the exercise? In this exercise,
you will generate the doughnut dataset, or the
concentric circles dataset. Once you've generated
the dataset, which of course will
be stored in an array, you're going to create a
DataFrame from that array. You want to call the
columns x1 and x2. Then you'll want to derive new columns based on x1 and x2. We call this a quadratic
feature expansion. So you want to generate
three new columns, x1 squared, x2 squared, and x1 times x2. You may find that the apply
function is useful here. Also, you want to name these
columns appropriately. Finally, after you've
completed your DataFrame, save it to a CSV without any headers and without
any index column. Thus, your CSV should contain only the numbers that would be stored if it were a NumPy array. Good luck and I'll see
you in the next lecture.
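If you want something to compare against after attempting the exercise yourself, here is one possible approach, offered as a sketch and not as the official solution. The radii, noise level, number of points, and new column names are all assumptions, and it uses plain column arithmetic where the lecture suggests apply may also be useful.

```python
import io
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100  # points per circle; an arbitrary choice

def ring(radius, n_points):
    # Points around a circle of the given radius, with a little radial noise.
    theta = rng.uniform(0, 2 * np.pi, size=n_points)
    r = radius + rng.normal(scale=0.1, size=n_points)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# Two concentric circles stacked into one array of shape (2n, 2).
X = np.vstack([ring(1.0, n), ring(3.0, n)])

df = pd.DataFrame(X, columns=["x1", "x2"])

# Quadratic feature expansion using column-wise arithmetic.
df["x1^2"] = df["x1"] ** 2
df["x2^2"] = df["x2"] ** 2
df["x1*x2"] = df["x1"] * df["x2"]

# header=False and index=False leave only the numbers in the output.
# A file path can be passed to to_csv in place of the buffer.
buffer = io.StringIO()
df.to_csv(buffer, header=False, index=False)
print(buffer.getvalue().splitlines()[0])
```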
8. Where to get discount coupons and FREE machine learning material: Hey everyone and welcome
back to this class. In this lecture,
I'm going to answer one of the most common
questions I get. Where can I get discount coupons and free deep learning material? Let's start with coupons. I have several ways for you
to keep up to date with me. The absolute number one, best way for you to
keep up to date with newly released discount coupons is to subscribe
to my newsletter. There are several
ways you can do this. First, you can visit my
website, lazyprogrammer.me. At the top of the page, there is a box
where you can enter your email and sign up
for the newsletter. Another website I
own and operate is deeplearningcourses.com. This website largely contains the same courses as you
see on this platform, but it also contains extra VIP material.
More on that later. So if you scroll to the
bottom of this website, you'll find a box to
enter your email, which will sign you up for the newsletter as you would
on lazyprogrammer.me. So you only have to
do one of these. Now let's do a small
digression because this is another common
question I get. What's this VIP
material all about, and how can I get it? So here's how the
VIP thing works. Usually when I release a course, I'll release it with
temporary VIP material, which is exclusive for those early birds
who signed up for the course during
my announcement. This is a nice little reward for those of you
who stay up-to-date with my announcements and of
course, actually read them. It's important to note that VIP material can
come out at any time. E.g., I could make
major updates to a course three years after starting it
and do another VIP release. The purpose of
deeplearningcourses.com is to have a permanent home
for these VIP materials. So even though it could be temporary on the platform
you signed up on, if you sign up for the VIP
version of the course, then you'll get access
to the VIP materials on deeplearningcourses.com
permanently upon request. Here are some examples of
materials you might find in the VIP sections. In my
TensorFlow 2 course, there are three extra
hours of material on Deep Dream and
object localization. Usually I don't release the
VIP content in video format, but this was an exception. Another example in my
cutting-edge AI course was an extra written section
on the TD3 algorithm. This course covered three
algorithms in total. So the extra section
gives you one more,
or in other words,
33% more material. Another example in my
advanced NLP and RNNs course is a section on speech recognition
using deep learning. In addition, there is
an all-new section of the course on stock
predictions or memory networks, depending on which version of
the course you are taking. The reason for this
is I might release slightly different versions of each course on
different platforms. Because of how the rules on
all these platforms work, I must differentiate
the courses. However, since I own
deeplearningcourses.com, this is the only
platform that contains the most complete
version of the course, which includes all the sections. Please note that this is rare, so depending on what
course you are taking, it might not affect you. Alright, so let's
get back to discount coupons
and free material. Other places where I announce discount coupons are Facebook,
Twitter, and YouTube. You might want to pause
this video so you can go to these URLs and follow me or subscribe to me on these sites if they are websites
that you use regularly. So for Facebook,
that's facebook.com/lazyprogrammer.me. For Twitter, that's twitter.com/lazy_scientist. For YouTube,
that's youtube.com/c/lazyprogrammerx. Occasionally, I still release
completely free material. This is nice if I want
to just talk about a singular topic without having to make an
entire course for it. E.g., I just released a video on stock
market predictions and why most other blogs and courses approach this problem
completely wrong. That's another benefit of
signing up for these things. I get to expose fake data scientists who are
really marketers, whereas I wouldn't ever make
an entire course about that. Sometimes this can be in written form and sometimes
it can be in video form. If it's in written form, it will either be on
lazyprogrammer.me or deeplearningcourses.com. If it's a video, it
will be on YouTube. So make sure you subscribe
to me on YouTube. If I release a video, I may also make a post about it on lazyprogrammer.me. And I may also announce it using other methods I
previously discussed. So that's the
newsletter, Facebook, Twitter, and obviously
YouTube itself. Now I realize that's a
lot of stuff and you probably don't use
all these platforms. I certainly don't, at
least not regularly. So if you want to do
the bare minimum, here's what you should do. First, sign up for
my newsletter. Remember you can do that
either on lazyprogrammer.me or
deeplearningcourses.com. Second, subscribe to my YouTube
channel at youtube.com/c/lazyprogrammerx. Thanks for listening and I'll see you in the next lecture.