Transcripts
1. Introduction: Good day and welcome to this introduction or Excel to Python using
Jupiter Part two. This is a skill shared course. In this introduction,
this course will continue with introducing you to the Python program language. We will assume you
are new to Python, but familiar with Excel
and spreadsheets. We'll be using Jupiter Lab as a Python coding environment. Just note that this is
part two of the course and focus on more Excel
to Python examples. We will show you Excel
operations like sorting, filtering and adding
calculated columns. We use pandas and show
you how to group data, find top end values and
visualize data with plots. We will also cover
advanced topics like how to do V lookup, index matching, and Python. Lastly, we will cover
cleaning data over tables, and performing
statistical operations. We will use the
same Excel data to set the demonstrated
examples we used before. This an example of the data, four columns, ID,
carme, model and price. This is a section of the
examples I'll cover. For example, we commonly
work with average, Max and Min in Excel, and we will show you how to do the same examples
in Python as well. V Lou and index matching is quite popular and
useful in Python, as well, just like you
can use it in Excel. So we will show you
how to do that quite easily in Python. Great. Let's get started.
2. Project: Now, a first hand approach to understand what we'll
cover is actually a project. So here you'll see
what you'll be covering in this part
two of a course. So again, download the
Dam data dot CSV file from the resources
section of this course, and you need to process the data for this project
and plot the data as well. So this is data you'll
be working with city buses, population
in millions. So you're going to
have to extract cities with population above 21.5 million and save
it into a CSV file. Calculate the average price of all the cities and the
histogram of the cities. Again, these are the things
you'll learn in the course. So further things you have
to do cred a bar plot of the city with cities with
population above 21.5 million, add labels for the X and
Y axis for your plot, add a title, and
finally save your plot and upload it as
well as a CSV file. Great. Now you know what you work with and what you'll
be learning in this course. Great. Let's get started.
3. Lesson 01 Getting Started: Great, let's get started
with Lesson one. Let's get familiar with the Python coding
environment you will use, namely Jupiter Lab. You go to this link,
Jupiter lab dot Github IO slash Demo
slashlablash index dot MO, and you'll see a
browser popping up that gives you the option to
click on my Notebook option. So here we're going to be using Python Podide Notebook option. Again, this is what
you'll see when you go to the link and you see the Notebook option and
we'll select Python Piodide. So this is what you done before. But now in this Part two, we will be using card data two dot CSV five found in the resources
section of the course. You just download that
and upload that to your Duple lab environment
like we've shown you in Part one of the course. So click on upload button on left side of a
toolbar to do that. So again, what you need to do when you want to
read in any data, especially CC file, first, I would say import Pandas as PD. And then we're going to store
the data in DF variable. This actually stands
for data frame. This is how Panda stores data. So we can store it in DF and you equate that to how
we're going to do that. So how we're going
to read in data, we're going to use
PD dot read CSV. This will us read in the CSV
and we'll give it a name, card data two dot CSV. Then if you print out DF, you'll get your
dataset that was in the formula CSV file and you can see the data being
represented here. Now, not the difference in
layout when not using print. If you just say DF
and execute that, we get a very nice
table layout in JupraLab which is a bit easier
to read and see as well. We'll be using this
in the course. Now a very useful property for data frames by accessing
any properties, you can say dot and various properties.
We'll cover barriers. But the first one we
should always know about is the head method or
the head function, which shows you the
first five rows of your dataset or
your data frame. Now, we can specify different
number, for example, ten or one or 11 or
however many you want, but it is a very
useful property or method to use and access
with data frames. Now a very useful one ASOR is dot info for your data frame. D info gives you information
about the columns and rows. The most important
thing when you're doing this is how did Pandas
read in your dataset? We can see we have
various columns and we can see they're no now, there are values represented, and the D type is actually
most important one to look at. Because this tells
you how Pandas read in the CSV file and
how it's currently stored. Currently, the ID column
stores as integer. Car mac is stored as object, Pandas represents
strings and texts as object, which is
what you expect. Model year is
represented as int. Any datet fields we don't
actually want as integer, we actually want as
a date time field, which you'll get to
because that allows us to filter and modify
the data how we want. We'll change this but later on. Price column is
represented as a float, which is what we expect as well. So this was a recap
just to get it started to where we
left off from Part one. The only difference
is we're using DF, not to store the data, not the data variable, and we're using car
data two a CSV file. Great. Let's move on
to the next lesson.
4. Lesson 02 Excel Like Operations in Python: Great. Listen to El
operations in Python. So we've read in
the data before. In the previous lesson, now
we're going to do it again. Repeating that step,
input pandas PD, DF equals P dot read CSV and the filename
card data to dot CSV. Just remember the underscores are quite important as well. The first thing we're going
to do is sort by column. Now, this is quite common to do in Excel and we can easily do this in Python as
well, using Pandas. How do we do that? Well, again, we reference our data frame, DF. All our data is stored in DF
and we use the dot to get certain parameters or methods that we can actually change
the data and modify it. So there's a built in one just
like we use head and info. There's one called
sort values as well. So what you're going
to do is sort values. And there are certain
parameters we can select for that
built in method. So we're not just
sorting values, we want to sort values by price. And then how do we want
to? Another property is in terms of how
should it be sorted, while we want to make
it ascending false? Why do we That's order that
we want to sort it by. So let's see once we
run that execute it, we get the sorted by price, and we see the highest the
largest prices on top, and it goes in this order. So I have just shown
you the first five. You can obviously change it, you know, make it ten as well. Then this was just
not run before, let me run that over here. Now let's do it over here. There we go. That's
a good practice of how you can see if things
go wrong, what to do. The previous sort of DF
actually wasn't run. That's why it didn't work. It's a good practice too. When you see a problem,
try to solve it and many times when it happens is you haven't run
the previous cell, you haven't executed
the previous cell. Those are useful to see
how that can be fixed. So next we can filter rose. So for example, we
want to filter rose based on car prices
larger than $20,000. So we can do that quite
easily in Python. Now, we're not necessarily
modifying the data frame. We are filtering it. So how we filter it is we have to create a new
data frame or create a create a subset of data
frame that checks if the values are bigger
and smaller than 20,000. The first one we actually
have to look at is DF price. If we just look at
the inner bracket, what's inside the bracket. What's inside the
bracket is we're just checking is the current value in the price column
bigger than 2000. If it is, it'll give
you a Boolean value. Boolean is either true or false. It's going to actually
give you a new data frame. Now, because we
are actually only accessing one column data price, this is actually
called a series, which is a single table
with a single column. But this gives you
values of true or false. And what we're doing
there is actually we're taking the true
and false values, and we're now applying that
to our outer data frame. By doing that, we'll filter. We only want to see the
rows that are true. That is by default
what will happen. The false ones
will not be shown. And if you run it, this
is what you'll see. You'll see your output
being execute folded DF, you'll see only the cars carmX with value
bigger than 20,000. This is how you filter rows, and you can filter
data in various ways. And this is one way to do that. Now to do some
summary statistics. So functions in
Excel like average and Max and Min are
quite common to use, and we can do that quite
easily with pandas. So if you just look at, if you want to
access, for example, only the column core
price in the data frame. We can access just like
you've done head and info, um, now we can also work out the mean
just for that column. We can work out just the max for the column using dot max. We can work out the minimum
value for that column. These are all the different
various properties or specifically methods, we have access to with
pandas by default. I'm just printing
it out here and this is the results
that we're getting. So adding calculated columns. So we can add a new
column with formulas. For example, the price
of the 10% discount, for example. So how to do that? Well, first, again, let's
access the one column of price. All we do we add a 10%
discount with times by 0.9, and we save that into
a discounted price. We save it into a
discounted price. Let's run that and you can see that if you first
show the first five, this is a discounted price
in the new column over here. So again, this can modify and
add it to your data frame. So we can also use pivot tables. Pivot tables are quite
common and feature in Excel, and we can do that
as well with Python. Now, I do that using again, axis the data frame, and all we do is
we use a group by method and we specify we
want to group by karmic. And how do we want to group it? Well, we want to
average of the prices. So let's see what we get. So there were a few groups. There was Honda and
Kyoto were grouped. The others were unique, but they were grouped in the category. Worked out based
on the average of the Honda and Kyota I make. I think Kia also there were more than one
options for Kia. Now, you can also find
the top end values. In Exil, you can use conditional
formatting or filters to highlight the top end values. Python is pretty
straightforward. Again, you have this
method called dot enlargest and you can specify, okay, give me the top three largest values
based on the price. And when you run that executed, you get the top three
quite quick and easy to do all built independence. Change in case, you might want to use upper and lower
functions in Excel, you have that functionality. In Pandas as well. Again, you're going to access maybe the karmic
because that's strings. That's text. So we
don't just say upper here because now we accessing in different
properties of that. We have to access the
string properties. That's how we say dot string. And part of Dot
string by default, we have a dot upper method, and that just creates all
the letters in uppercase. There are various other
properties of string, and this is just one of them. Great. This
is what you see. The car makes now we
put in uppercase. You can see, as I
mentioned before, two Kyotos 202 kyo. That's the previous
example was just showing that how you group them
together, find an average. Now you can also add a
column based on condition. This is you do quite often
in El using if statement, the condition if true, if false. Now you can do this quite easily using Python, using
a Lambda function. I can start here
say if X is 15,000, then the result is
yes, else, it's no. This is a format that you can write using a single
line function. There are other ways to do
it, but this is just one way. The nice thing with
pans is we can apply this functionality
throughout Again, we act as a column of price and we use a
dot applying method, what we're applying,
applying this function. We're checking if X
bigger than 15,000, place yes, else place no, and we'll store that in a new column cord is it expensive. There you can see,
if you run that, you see yes, no,
yes, no, no, four. An cars above 15,000
would get a yes or no, no, or get a yes. So this order I spoke before
about the model year, converting strings to dates. Exil you can use date value
or format cells as dates. Python, you can easily
do that with to daytime. In this case, what you're
going to do is we're going to convert model year to
a datetime format. I won't be string anymore,
but a datetime format, and we're going to
keep it as a year. We're going to save that
and almost overwrite it into the existing
column model year. When you get that, you see uh, the model has been,
you know, changed. We can get rank as well. Rank is bolt in
function in Excel. Again, for the price column, you can use rank and ascending a false and just showing
the first five. You can use rank as
well. Rolling average. This is the rolling average and
Exil function you can use, you can use similar approach. Now there's bolting
meta quad rolling. You can choose your
Windows three or how wide value you want and
Wout dot mean as well. Then you'll get the
rolling average for the first five values as you can see, as
to be expected. Great. Well done for
getting this far. Hopefully you can see the
benefits of Python and all the different
functionalities you can easily do in
Python from Exl. Great. Let us move on
to the next lesson.
5. Lesson 03 Advanced Operations and Statistics: Let us move on to Lesson three, advance operations
and statistics. So we've covered info examples in Python and how you can do what you can do in
Excel and Python. Now we're going to use filters
to find specific rows. Now, as we've done before, we can read in the CSV
file using this line. And now we want to
get specific rows. And like we've shown you before, if we want something specific, we want to create a table
of true and false um, values, a table of that
that we're going to use to filter out
our main data frame. So we have, one in
the middle mini one inside the inner brackets, then we have the ashore data
frame we want to filter. So first, let's look at
this one. All right. So we want to find all the rows that contain
the comic Toyota. So to do that, we're going to first reference
the column mic. And then we need to access a method that
has string properties, so we say dot string. Then part of dot string is it has another method
called contains. We've done before
upper and lower and string dot string
method has a lot of other methods or sub methods it can use,
contains is one of them. So it checks. Does the string
contain the word Toyota? And this will give us a single
table of true and false. All the rows will give us does this row
content iota or not? True or false? And then we apply this to our
main data frame, DF over here on the
square brackets. The result is a specific rows. And if you run it, we'll
see our new data frame that just has two rows with a
karmic Toyota and Tota. This is a quick and
easy way just to filter data only what you want to
see. Very useful and handy. Find and Replace is a
common tool in Exl, in Python, you can also
do it very easily. So now we are modifying
the column specifically, we just use DF Comic
first reference it, and then we use a
built in method called replace and replacing
Teota with corolla. When you run that, we can see just first
print out the first five, we can see the Choda has
been replaced with corolla. Very quick and easy
way to do that. To remove duplicate rows. You can duplicates in various
ways in El with pandas. If a rose is an exact
duplicate of itself, can use dft drop duplicates and will remove that
row completely. An example if you just
want to print it out, although there are no
duplicates in this example, but if you wanted to,
you can use this. If you want to fall
in missing data, which is sometimes the case. We can use full NA method and apply that to
our data frame. And what do you want
to fill it with? We can fill it with zero,
one, or whatever you want. But something something that is maybe a bit more
applicable is maybe let's fill it if we do find an empty cell,
empty empty field. Let's replace it with the
average other price column. But nice way to fill
an empty value. Again, this example
doesn't have that, but you can always use in
your own if you need to. To maybe manu delete
columns of rows, you can use the drop method. If you want to, for example,
drop the price column. You can just by using DF, applying this to your
whole data frame, drop and then specify
the column name. We're just using the price
column run that you see is no more price column once
you execute this command. Be Lou and index matching, obviously very popular and
useful in Excel and we can do the same in Python. Again, in this case, we are using two inner
and outer data frame. The inner one is going
to give us a table, a single column table or series
of true and false values, and it's just going
to check is ID equal to car ID in this
case, we just made it five. It's going to check,
is it equal to ID? If not, it's going to be false. We're going to get
a list of single column of true and false. We apply that to the
other data frame and we get car details. Then as you can see,
there's only one ID five, and this is the row
that we'll get out. Another useful thing to maybe use with V Lou is to
combine data from two sheets. Now I've added
additional data as well, using the merge functionality. What I did here was, I created
a dictionary of two lists, one ID, so I can match
it to the existing one. And what I'm actually
adding is mileage. I'm just adding it to
the first five rows. I create a new data frame and I can just add this dictionary here as input and to
create a new dataframe, I say PD dataframe add the dictionary and I have
my new dataframe I can use. Now what I'm going to
do is I'm going to merge it pd dot merge, my original data frame
with my new one DF two, and I want to merge it on ID and how it's
going to be left? It's a left drawing, which means my original data frame DF that
has all the data already. That's going to be my main
data and it will stay there and only anything that matches it on
ID will be added. Printed out, you'll see how mileage has been added
to the first five. Create Pivot tables for
summary statistics, something we covered
a bit earlier as of, but this is
a different way. In Python, by default, data frame has
pivotables as a method and has certain parameters that you can use in
terms of values. We can reference the
price column index. It's going to be a model year and Ag funk is going to be me. That's what we're going
to use to specify. This is how you print
it out once you execute this row data. Then transposing data, again, something to flip
rows and columns. Quite quick and easy
to do this just DFT. And if you run that,
you'll get your data frame that's represented here as flip. We did this before manually
using average min and max. You can also use Pandas
built in describe method, which actually gives
you this by default, average deviation and
various other things. If you say DF dot describe
and you run that, you'll see you get
statistics on data. Count to mean, sanar
deviation, min, different quantiles and
max for all the data. In this context, price is the only column that's relevant, but it's quite handy just
to get a feel of your data. Aggregations. If you want to use custom formulas and
pivot tables, you can. For example, if you want
to groups specific column, group buying karmic
and relative to price, can define the
aggregation mean max and min and you can group
that as well you use an aggregation method and
you only get these functions displayed for you in your custom statistics and you can add various
other functionality. But we just using mean Max and M printed out to get custom statistics for
the various fields. If you want to, for example, maybe add a percentage change, you can use formulas like B
two minus B one, a B one. You can do that quite easily
with price change as well. Column if you add that.
You can use built in method for price change, added to the column
called price. If you get that, you'll
get this value over here or you get the
table over here, which is what you expect for the price change for
the various values. You can also work
out cumulative sum, sum range or
cumulative formulas. So using a CM sum, you can apply it to
the price column and you'll get cumulative price. And you print it
out, this is what you'll get with all your data. Correlation as well, quite
important and useful in Excel. Correlate range
one and range two. Again, based on the model year, correlated to the price, and you want to find
the correlation between model and price, and you can see a
correlation of 0.68, which means the R quite
correlated quite closely. Well done, this is some
advanced operations and stats getting this far. Let us move on to
the next lesson.
6. Lesson 04 Plotting Examples: Great. Let us now
go to Lesson four, which is plotting examples. Here's a basic line plot
that we're going to cover and Excel uses line
trot for certain trends. Now, in Python, as
we've shown you before, we use a plotting library
called MatplotLib Pilot is PLT, and we're going to use
that for plotting. Now, let's first read in our CSV file over here
like we've done before. For line plots, specifically, this one we only want to
use a first ten rows. I filtered it by just saying Df dot head ten and
stored ESMA DF two. Now to turn off
interactive mode, this is just for Jupraab. You can do this. And by doing plt dot Iof then how do
we specify the line plot? Well we just say df dot
plot. There's a method. Specify the X values, Y values, and what plot is it? We want a line plot
and add a title, price trend by model year
add a Y label, price. When we run that code,
we get the following. Now to actually plot
that, we have to say plt dot I on and plt dot show. This will give you plot of your data model
year versus price. Now we're going to do a barhot. Now again, start off by
switching of interactive mode. Again, in your own example, you don't really
need to do this, but for my presentation, I will, we're going
to do a bar chart. So great bar charts. It's useful to compare
categories and categorical data. Now, to do that, in this case, we're going to use
DF and say plot, again, comic wise price kind is bar like a bar
plot and the title. Once you run this,
you'll get a bar plot of the various data that we
have subcomk versus price. And then to do a pie chart, and this is obviously useful to visualize proportions
using Pie charts. We first have to group, we
have to group by karmic, and then relative to
price and cumulative sum, we're going to plot
that all in one line. The kind pie chart
or the pie chart. This is just a certain parameter four plotting pie charts. You don't have to worry
about that, but by default 1.1 will give us the um, the value you'll see
in the pie chart and the title the
distribution by Karmic. Once you run that, you'll see this is the
distribution and this was the percentages of that value
that you saw over here. The auto PCT, this is the 1.1 F. This is how the
percentages will be distributed. This is the values at C. This is a typical high
distribution of a mic. So again, we're going to switch of interactive
mode and do histogram. Again, for an Excel, we use histogram for
frequency distribution, and again, we're using DF. Just remember for the line plot, we use DF two, but for
the rest of plots, I'm using DF just to show the more data that
you're working with. So to show histogram
plot is kind just hist, a number of bins
we can choose as five titles spice distribution, color, sky blue,
edge color is black. Then we can add a X label
as well for that and we get to run that
plot that on show, then this is distri
price distribution for the plots for the data. To do scatterplot, also quite useful to show relationships
between variables, DFA plot, X axis, model year or Y axis price or the Y data or Xata
kind is scatter title, price versus model, colors
red and add a label as well. And when you run that,
we're going to get by using these commands you get
the scatter plot of a model year versus price. So we can also do a
box plot as well to visualize data spread using
box and whisker charts. So here we're going to
use df dot box plot. This is a unique type of plot. Column is going to
be price by mic. We're not going to use grid
or vertical properties, just make it false. Add a title, subtitle, label. When you get that, you
get your box plot and various distributions for
Tyota Lexis, Kia, and Honda. These were options that
had multiple rows. And the price distribution. Then we're going to
do a multi line plot. Multiline plot, plot multiple
series in one chart. Again, we actually using DF two. So the line plots we're using
DF two and all plots of DF because the plots will
be not proportionate. So we're using DF two and I'm
just making a copy of that, storing it multi DF. What I want to work out is work out discounted price as well. I have that in the column
and then I'm going to say the model price versus
a discounted price. I can see the comparison between two different
line plots. The purpose of this to show
two different line plots, just with a 10% discount, as we've done earlier,
if we do that, we ri out, we can see we get two different plots and you can see they match quite closely, obviously one just being 10% different price versus
the discounted price. Bar chart also something
quite useful to use to show contribution to total using
stacked bar charts. You can group by, again, cumulative sum kind bar and
stacked ba chart of prices. So you can also use. And there's the type of
plot that you'll see. Great. This is just, again, a bit more in depth introduction to all the various
types of plots you can use in Python and
something you might use in Excel as well and how to jump
from one to the other. Great. Hopefully,
that was useful. Let's move on to
the next lesson.
7. Lesson 05 Converting Data: All right. Listen five, this is where we're going to
focus on converting data. In this section, we're going to look at how to convert
data in different formats. Now if you input the following, we're going to have to convert
data frame to CSV file. We have been working with data, we modified it and change it. In Exil, we usually save data to CSV format
for easy sharing. We can also do that in Python. So if we modified
and filtered and sorted and change our data
frame however we want, now we can actually convert
that and save it as a new CSV file, sat two CSV. And again, we can save
it as cars that data CSV and leave the
index as false. And then you should
when you run that, you'll get this file saved
in your upla lab as well. If you want to convert dataframe
to JSON format as well. Now again, JSN is for web
or database applications, you can easily do that with the two JSN method that's
out of all the data frame. You can just maybe call
it cars data or JSON or by records and lines
equals true parameter. These might change based
on what you output. To convert a data frame to HML, if you need to convert it
for web display or reports, you can easily just
use dot two HML and save it as cars data two HML and discrete index is false. While doing on converting data, you can also convert data
frames to Excel as well, SQL Parke, feather formats. There's a lot of ways to convert the basin based
on what you need. Great. This actually
completed the course, hopefully you will
learn something and now for you to do your exercise, specifically working
on the project.
8. Closing: Now in closing, we're just going to summarize
what you're focused on. In this course, we focus on building our Python
skills from Excel. We showed you the
Jupiter lab and environment, how to
write Python code. We also showed you some
more advanced features of Python for Excel. Win work with a small dataset, we showed you how
to filter, sort, aggregate, group, and
workout statistics. Showed you how to
do various plots and convert data frames
in different formats. There's much more to Python
than we plan to cover. Stay tuned for next course
and happy coding. Thank you.