First Steps in Data Analysis using Python, Pandas and Jupyter Notebook | Paul O'Neill | Skillshare

Lessons in This Class

  • 1. Intro (1:03)
  • 2. class overview (1:24)
  • 3. getting some data to work with (1:09)
  • 4. install anaconda (5:14)
  • 5. open up Jupyter Notebook (6:32)
  • 6. Analyse the data (12:39)
  • 7. cheatsheet and help function (2:44)

329 Students

2 Projects

About This Class

This is a beginner level Data Analytics class using Python, pandas and Jupyter Notebook. In the class I explain what software you need and how to install it, which is quite simple. Everything is open source and free and will work on a Windows, Mac or Linux computer. By the end of the class you will have a working environment where you can use pandas to explore some data. The class does not require any previous programming or data analysis experience.

Meet Your Teacher

Paul O'Neill

Teacher

Hello, I'm Paul. I am an artist, cartoonist, teacher and data analyst. I live in Ireland but I've also lived in Japan for a significant portion of my adult life.

Level: Beginner

Transcripts

1. Intro: Hi. Welcome to this class on data analysis using Python, pandas and the Jupyter Notebook. My name is Paul. I've been a data analyst for about 12 years. This is a beginner level class, so I don't expect people to have any prior knowledge of pandas or the Jupyter Notebook. And indeed, you may not have any prior experience of doing any data analysis. The goal of the class is to get a working environment set up, so you will have Jupyter Notebook and pandas and you will be able to analyze your own data set. The ability to analyze data, to turn raw data into information, is a very valuable skill to have these days. Many employers would value someone with those skills on their team. So I hope you'll be able to follow along during the class, and that you'll be able to create your own project and do your own data analysis at the end of the class.

2. class overview: Okay. This class has four main parts after the introduction. Part one is to get a data set to work with. In that video, I'll show you different places where you can download your own data set. They're all open source and they're all free to work with. Number two, then, is to install Anaconda. The Anaconda distribution includes many packages, over 100 packages, and it does include pandas and the Jupyter Notebook, both of which we'll need for this course. Then number three is just to open up the Jupyter Notebook and get ready to do some data analysis. And the last one is to actually start to do some data analysis. The goal at the end of the class is to have a working environment where you can analyze data sets using pandas and Jupyter Notebook. So then, what is pandas? It's a software library written for the Python programming language for data manipulation and analysis. And the Jupyter Notebook is an open source web application that lets you create and share documents that contain live code, equations, visualizations and many other things.
You can use it for data cleaning and transformation, data visualization, etcetera.

3. getting some data to work with: OK, since this is a very practical course, a practical class, I strongly suggest that you follow along as much as you can. To do that, you're going to need to find a data set that you want to work with. It really doesn't matter what the data set is, as long as it's something that you're interested in. There are many places where you can find data sets these days. Governments, for example, are one case. So here we have the EU open data portal. You also have Canadian government data portals and US government data portals. Choose a topic that you're interested in, for example education. Find a data set that you're interested in and download it. That's all you have to do to start with. Once you have your data set, in the next section we're going to look at how we can actually load data sets into a working environment and start doing some data analysis.

4. install anaconda: Okay, so once we've chosen a data set that we want to work with, the next thing is to download it and then find somewhere on our computer where we can store it. I store all of mine on my D drive. This is a Windows machine. I created a folder, or directory, called data analysis. Inside that I have many more directories, many more folders. Each one is for a specific data set. So we have ones on the Armagh weather data set, Brexit, Bitcoin and many others. I advise doing this so that everything is kept separate rather than everything being in one directory, which could get very messy once you start to have lots of data sets and lots of Jupyter notebooks and other things stored there. Okay, so now we've set up our data set. The next thing we need to do is create a working environment so we can actually do some data analysis. I recommend Anaconda, which is a collection of Python data analysis and data science packages. It includes packages for data analysis, machine learning and neural networks.
That type of thing. There are about 100 packages included in this Anaconda distribution, and you can get it on Windows machines, Mac and Unix operating systems also. What you do is go to anaconda.com and click on the download button. It realizes that I'm on a Windows machine, but if you're on a Mac or Linux it will work just as well. You then have to choose which version of Python you want to use. You can use Python 3 or Python 2. My advice would be to use Python 3 and download this version. The reason is that, if you go to python.org, you'll see that they are sunsetting Python 2, which is another way of saying Python 2 is essentially dead. You can continue to use Python 2 if you want to, but it's no longer supported, at least not officially supported. So this means that if there are any problems, any security issues with Python 2, they will not be fixed in the future. This came into effect on January the 1st, 2020. So as I say, I would recommend starting with Python 3. Once you've downloaded and installed the Python 3 version, you should have a working environment. If you're on a Windows machine, go to the start menu and you'll see there's one called Anaconda3. Expand that. There's the Anaconda Navigator, and this is a good place to start when you're just starting out. If we click on the Navigator and open it up, you'll see there are various applications available. The exact applications you have may not be exactly the same as these, or they may be arranged in a different order, but the one that we're really interested in is the Jupyter Notebook. So when you click on launch for the Jupyter Notebook, it will open up in whatever your default browser is. It should look something like this. This is the main directory that it's set to. You'll remember I had all of these different directories for my different data sets. So I'm going to start by looking at the data set that I chose, the Armagh one.
That's the weather data set. If we go into that folder, over here you'll see there's New, and we can expand that. It says Notebook, Python 3. If you did install Python 2, it'll say Python 2 here. So if I click on that, we're going to get a new Jupyter notebook, and this is what it looks like. So if you've got this far, you have a new Jupyter notebook opened up. Well done. That's the first big step towards being able to analyze data using pandas and Jupyter Notebook.

5. open up Jupyter Notebook: Okay, so if you've been following along, you should now have your Jupyter notebook set up. Your data set should now be in the same directory. It's just easier to work with. So now we're going to start. The first thing we have to do is import pandas. We're working in Python, and it's the same as any other Python program: the keyword is import, and the library that we're interested in is the pandas library. We're going to have to reference that library several times, so rather than typing out pandas each time we can be lazy and just give it a new name, an easier name. We're going to call it pd. Okay, we can run this. You'll see a star here. Hopefully that will turn to a number soon. Yeah. Okay, so it's successfully loaded pandas. If there was a problem, you'd get some sort of error message, and then you'd have to try and figure out what's going wrong. There are lots and lots of forums available where people can try to answer your questions, or you can look for similar problems that other people have had and the solutions that they've come up with. But hopefully it all works out. Having successfully imported the pandas library, we can now import our data. We're going to import the data into a thing called a DataFrame. A DataFrame is just a data structure within pandas where the data is stored, and you can think of it as a two-dimensional spreadsheet, like an Excel spreadsheet. It has rows and columns. So we'll call our DataFrame just df.
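As a rough illustration of the import step and of the rows-and-columns idea described above, here is a minimal sketch. The column names and values are invented for the example and are not taken from the class data set.

```python
import pandas as pd  # "pd" is the conventional short alias for pandas

# A tiny hand-made DataFrame. The column names and values are invented
# purely for illustration; they are not the class data.
df = pd.DataFrame(
    {
        "year": [2015, 2015, 2015],
        "month": [1, 2, 3],
        "tmax": [7.1, 7.4, 9.8],
    }
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the column labels
```

Building the DataFrame by hand like this is only a way to see the structure; in the class the data comes from a file instead.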
Again, it's easier to type df than typing DataFrame each time, but you can call it whatever you want. So it's df equals. Now we're going to call the pandas library, so pd, and within the pandas library there is a function called read_csv. Our data is in a CSV file, so we're going to use the read_csv function to get that data, and we call the function using dot notation, so pd.read_csv. Inside the brackets, we need to tell it the name of the file that we want it to go and fetch, and the file that I have is called weather.csv. If your data file, your data set, is not in the same directory as your Jupyter notebook, you're going to have to give it the full directory path so it can go and find it. If you don't, it will just come back with an error message saying it could not find the file, the file is missing, or something similar. Okay, so we run this. Again you see it changes to a number, so it ran correctly. We can just check that by doing a print, so print, open brackets, df. Now, if we just run this, it will return all of the rows, and that's going to take up a lot of the screen because there are thousands of rows, or at least hundreds of rows anyway. So we can just look at the top few rows using another function, this time called head. Open the brackets. If we put a number in there, it'll bring back that number of rows. If you just leave it blank, the default is, I think, five rows. So we try running this. Yep, it brings back five rows. This column of numbers is not actually part of your data set. This is an index that pandas gives the DataFrame, so the first row is 0, the second row is 1, then 2, 3, 4. This index can be used then to grab a certain row or a group of rows that you want to analyze later, so it's a useful thing to have. We have seven columns in this data set, starting with the year.
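The read_csv and head steps described above might look something like the sketch below. So that it runs on its own, it reads from a short piece of in-memory text instead of the weather.csv file. The rows are invented, and the column names other than tmin and tmax are assumptions rather than the names used in the class file.

```python
import io

import pandas as pd

# Stand-in for weather.csv: a few invented rows with assumed column names.
csv_text = """year,month,tmax,tmin,af,rain,sun
1948,1,8.1,2.9,5,110.5,40.2
1948,2,8.6,2.3,9,45.1,70.3
1948,3,12.9,4.4,2,30.7,120.9
1948,4,13.5,4.6,1,50.2,150.0
1948,5,15.8,5.9,0,41.3,210.4
1948,6,17.2,9.0,0,60.8,180.6
"""

# With a real file this would be pd.read_csv("weather.csv"), or a full path
# if the file is not in the same directory as the notebook.
df = pd.read_csv(io.StringIO(csv_text))

top = df.head()       # no argument: the first five rows
top_two = df.head(2)  # a number asks for that many rows
print(top)
```

The unlabeled column of numbers on the left of the printed output is the index, not part of the data.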
After the year come the month, the maximum temperature in that month, the minimum temperature, the number of days in that month that had an air frost, the total rainfall for that month in millimeters and the total sunshine, measured in hours, for that month. You see the data goes back to January 1948. You can also look at the bottom part of the data set if you want. It's basically the same command, but instead of head the function is called tail, and again you can put a number in here. If you don't, it will bring back five rows again. So if we run that, you see this is up to the year 2015, with the same year, month, temperatures, etcetera. Okay, so now we've successfully imported our data. It's in a thing called a DataFrame, which is this two-dimensional data structure within pandas. We're now in a position where we can start analyzing the data.

6. Analyse the data: Okay. In the last section, we imported our data into our DataFrame. We checked that everything had loaded correctly using the head and tail functions, and it looked like everything had. So now we're in a position to actually start doing some data analysis. I've given myself four tasks, or four questions, that I'm going to try to answer. When you're doing your own project, you can choose as many tasks as you like. So, number one: I'm going to try and find the lowest and the highest temperature recorded in this data set, and when they occurred, that is, the month and the year. Number two: how does the amount of sunlight vary during the year? Present the analysis as a graph. We're going to try and do some data visualization rather than just grabbing numbers out of the data set. Number three: has the number of air frosts per year changed? Again, present this as a graph. And number four: is there any correlation between the different values in the data set? I'm going to try and present this graphically as well. Okay, so the first task is to get the lowest and highest temperatures in the data set. So I have two variables, lowest_temp and highest_temp.
I'm going to try and get values and put them into those two variables, one for the lowest temperature and one for the highest. So we say lowest_temp equals df, our original DataFrame, then square brackets, quotes, tmin, close the quotes, close the square brackets, dot notation again, and the function is min. This function just gets the minimum value in this particular column. So we're using the square brackets, the quotes and the name to specify which column within the DataFrame we're interested in. You'll remember there were seven columns in this DataFrame; here we're only interested in the minimum temperature. Be careful with square brackets and round brackets. Functions usually take round brackets. If you get them mixed up, you'll get crazy error messages. Okay, so we run this one and then we can print out the two values. So you see, minus 3.8 Celsius was the lowest temperature in this data set and 23.8 was the highest temperature in the data set. Now we also want to know when these occurred, and we can grab the two rows out of the data set, out of the DataFrame, and that will give us the month and the year when the temperatures occurred. To do that, you'll remember I said these numbers on the left-hand side were indexes. The DataFrame gives each row its own unique identifier. That's what we're going to try and find now. So again, two variables. We're telling it to look at the minimum temperature column in the DataFrame and the maximum temperature column in the DataFrame. On those two columns, I'm going to use the functions idxmin and idxmax. So this is the index of the minimum value and the index of the maximum value. Once we run these, the two variables will contain numbers like 4, 8, 11, or whatever number the row was. So we run these and then print out those two numbers. Okay, so we can see our minimum temperature is in row 754 and our maximum temperature is in row 497.
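The min, max, idxmin and idxmax calls described above can be sketched as follows. The data here is a small invented sample, so the numbers it prints are not the class results. The variable names lowest_idx and highest_idx are hypothetical, and tmin and tmax are the assumed column names.

```python
import pandas as pd

# Invented sample values; "tmin" and "tmax" are the assumed column names.
df = pd.DataFrame(
    {
        "year": [2010, 2010, 2011, 2011],
        "month": [11, 12, 6, 7],
        "tmax": [8.0, 3.0, 18.5, 20.1],
        "tmin": [2.1, -3.8, 9.4, 11.0],
    }
)

lowest_temp = df["tmin"].min()   # smallest value in the tmin column
highest_temp = df["tmax"].max()  # largest value in the tmax column

# Hypothetical variable names for the two index lookups.
lowest_idx = df["tmin"].idxmin()   # index label of the row holding the minimum
highest_idx = df["tmax"].idxmax()  # index label of the row holding the maximum

print(lowest_temp, highest_temp)
print(lowest_idx, highest_idx)
```

Note the bracket shapes: square brackets select the column, round brackets call the function.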
So now we want to actually grab those two rows, and we're using a function called loc, or location. When you're passing the index into loc, the index has to be within square brackets, and then that has to be within square brackets again. Then, of course, the print function takes the round brackets. We run this one. Okay, so you can see December 2010 was our minimum temperature. Run the next one. So July 1989 was our highest temperature. Okay, so the next task was: how does the amount of sunlight vary during the year? I'm going to look at it by month, and I'm going to calculate the average, or the mean, number of hours of sunshine for January, for February, for March and so on, and then plot that out in a graph. To do that, I'm going to create a new DataFrame which is a subset of our main DataFrame. I'm going to use a groupby function because I need to group together all of the January results, all of the February results, all of the March results, and so on. I'm looking at this particular column, the hours of sunshine, and again I'm calling a function, mean. So it's going to calculate the average, the mean, for this column for each month. Okay, so we run this one. This next couple of lines is just to set the size of the plot. It enables me to print out the graph within the Jupyter notebook, and it also allows me to define the size. The default values are not very big within the Jupyter notebook, and it's difficult to read the years and other numbers. So this just makes it a little bit bigger, a little bit easier to read. Okay, so we plot it out. The first one is a bar graph. Okay, so we can see at the beginning of the year, January and February, not much sunshine. When you get to May and June, there's more, and then it starts to tail off towards the end of the year again. The interesting thing on this one is that the maximum amount of sunshine seems to be in May, but the longest day is in June. So there's something interesting going on there. When you're working with data, a result like this deserves a closer look.
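The row lookup with loc and the monthly average described above can be sketched roughly like this. It uses a small invented sample, the column name sun is an assumption, and the figure size is an arbitrary choice. The exact code shown on screen in the class may differ.

```python
import pandas as pd

# Invented sample data: two years of three months each, assumed column names.
df = pd.DataFrame(
    {
        "year": [2014, 2014, 2014, 2015, 2015, 2015],
        "month": [1, 5, 6, 1, 5, 6],
        "tmin": [1.0, 7.0, 10.0, -0.5, 6.5, 9.5],
        "sun": [40.0, 200.0, 170.0, 50.0, 220.0, 190.0],
    }
)

# Double square brackets: the inner list of labels makes loc return a
# one-row DataFrame rather than a Series.
coldest_row = df.loc[[df["tmin"].idxmin()]]
print(coldest_row)

# Mean hours of sunshine for each calendar month across all years.
sun_by_month = df.groupby("month")["sun"].mean()
print(sun_by_month)

# Bar chart of the monthly means; figsize is in inches (needs matplotlib).
ax = sun_by_month.plot(kind="bar", figsize=(10, 5))
ax.set_ylabel("mean hours of sunshine")
```

Selecting a single column after groupby gives a pandas Series indexed by month, which is enough for plotting.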
There could be several possibilities. The data itself could be corrupted in some way; it could be incorrect. The code could be incorrect. I don't think this code is, but that's another possibility you would have to check. Then another possibility is just that there's something interesting going on. We would expect the amount of daylight, the hours of daylight, to be greatest in June. But this is looking at sunshine, the amount of sunshine, which isn't exactly the same thing. It may just be that in June, although the days are longer, there's more cloud in general, and May tends to be sunnier. As well as bar graphs, you have other choices. If we comment out this line and uncomment this one and run this, the plot function on its own just gives you a line graph, and you can change the color. You can just put the first letter of some of the main colors, like r is red, g is green. In this one, I also changed the transparency, which made the red more of a pink, a salmon pink color. Again, you can vary that from zero up to one. Okay, so that's our second question answered. The next one was: has the number of air frosts per year changed? Again, present this as a graph. We're doing something very similar. We're using a groupby function. This time we're grouping by year. We're looking at the number of days that had an air frost, and we're totaling up, or summing up, those days. Okay, if we run this one and then check it with a print. Okay, we have the year and the number of air frosts per year. Run this one. Again, it's just making sure that the size of the graph is readable. Then we plot it out. Okay, so you can see there's quite a lot of variation, from maybe 20 up to over 80 days with air frost. There doesn't seem to be any strong pattern. But again, you could do further investigation, further data analysis, to see whether or not there is any variation over time, if there are any trends in your data. Okay, so that's our third question.
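The yearly air frost totals and the line graph with a color letter and transparency, as described above, might be written along these lines. The data is invented, the column name af is an assumption, and alpha is the matplotlib name for the transparency setting.

```python
import pandas as pd

# Invented sample data; "af" is an assumed name for the air frost column.
df = pd.DataFrame(
    {
        "year": [2013, 2013, 2014, 2014, 2015, 2015],
        "month": [1, 2, 1, 2, 1, 2],
        "af": [10, 12, 4, 6, 8, 9],
    }
)

# Total number of air frost days in each year.
frost_by_year = df.groupby("year")["af"].sum()
print(frost_by_year)

# plot() with no kind gives a line graph. color="r" is red, and alpha sets
# the transparency from 0 (invisible) to 1 (solid). Needs matplotlib.
ax = frost_by_year.plot(color="r", alpha=0.5, figsize=(10, 5))
ax.set_ylabel("air frost days per year")
```

With real data covering many years, this one line graph is usually enough to see whether a trend is worth investigating further.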
Then there's the last one. We're going to look for any correlations between the different values, the different columns in the data set. So we run this code. Okay, so this code produces this psychedelic-looking data visualization. These are the correlations. We have our seven columns, year, month, temperatures and so on, along the top and down the side, and each of these squares is the correlation between two columns. From the top left to the bottom right, you're going to get maximum correlation, the yellows, because the year obviously correlates with the year, the month with the month and so on. It's these other squares that we're interested in. You can see the year is not really strongly correlating with anything. But if you look at month, there is some correlation with maximum temperature, minimum temperature and air frosts. And then, if we look at the temperatures themselves, there is a very strong positive correlation between tmax and tmin, and also between tmin and tmax. There's also a strong negative correlation between tmax and the number of days that have an air frost. So, in other words, as the temperature, the tmax, increases, the number of days that have an air frost decreases, which is what we'd expect, and vice versa: as the number of days with an air frost increases, the maximum temperature will decrease. So this kind of visualization is very good for looking for correlations within your data. If you had business data, for example, it may be that some data on your customers is correlating with some other data, which maybe no one had ever realized within your company. It could be a useful way of discovering these things.
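The transcript does not say which calls produce the correlation picture, so the following is only one possible sketch. It assumes pandas' corr method for the correlation table and matplotlib's matshow for the colored grid, and it uses invented data with assumed column names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample data with assumed column names.
df = pd.DataFrame(
    {
        "tmax": [7.0, 8.0, 12.0, 15.0, 19.0, 20.0],
        "tmin": [1.0, 2.0, 4.0, 7.0, 11.0, 12.0],
        "af": [12, 10, 5, 1, 0, 0],
    }
)

# Pairwise Pearson correlation between every pair of columns.
corr = df.corr()
print(corr)

# Draw the correlation table as a grid of colored squares.
fig, ax = plt.subplots(figsize=(6, 6))
image = ax.matshow(corr)
positions = list(range(len(corr.columns)))
ax.set_xticks(positions)
ax.set_xticklabels(corr.columns)
ax.set_yticks(positions)
ax.set_yticklabels(corr.columns)
fig.colorbar(image)  # legend mapping colors to correlation values
```

The diagonal is always 1 because each column correlates perfectly with itself, which matches the top-left to bottom-right line described above.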
7. cheatsheet and help function: So you'll remember that one of the first things we did was to read data from a CSV file into a DataFrame using this function, read_csv. If you know the name of a function but you're not sure what all the possible parameters are, there's a useful function you can use in Jupyter Notebook. It's this help. So you have help, open the brackets, pd.read_csv, close brackets. This function will give you lots of information on read_csv, or whatever function you're trying to find out about, including all of the parameters that you can pass in and some notes on what it does. So this one reads a comma-separated values file into a DataFrame, and then there's more information on all of the parameters, including what you can pass in, examples, etcetera. So it's a very useful function to have. Of course, if you don't know the name of a function, that's a different problem. I would suggest getting a pandas cheat sheet, and there are lots of these. One example is here on this website, Kaggle. This cheat sheet has things such as getting rid of duplicate rows, checking for missing rows or missing data within rows, and so on. So this is a useful place to start. And it's not just Kaggle; many other websites have these pandas cheat sheets. This website, Kaggle, is also a good source of data sets. They have competitions, but they also have lots and lots of data sets. Again, these are open source and you can download them and work with them. In order to download them, though, you need to open an account. It's a long time since I opened my account, but from what I remember, you need a valid email address, and you also have to give them your cell phone number, and they will send you by text message a PIN number that you have to put in. I think you only have to do this the first time, when you open your account. It's just to verify your account.
But if you're happy enough to do that, there are hundreds and hundreds of possible data sets on all kinds of topics.
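Going back to the help function and the cheat sheet items mentioned in this lesson, a short sketch might look like the following. The data is invented, and drop_duplicates and isna are standard pandas calls picked here as examples of those cheat sheet items; the cheat sheet shown in the class may list different ones.

```python
import pandas as pd

# Prints the documentation for read_csv, including its parameters.
help(pd.read_csv)

# Invented sample with one duplicated row and one missing value.
df = pd.DataFrame(
    {
        "year": [2015, 2015, 2015, 2015],
        "month": [1, 1, 2, 3],
        "sun": [40.0, 40.0, None, 110.0],
    }
)

deduped = df.drop_duplicates()        # removes rows that are exact repeats
missing_per_column = df.isna().sum()  # counts missing values in each column

print(deduped)
print(missing_per_column)
```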