Data Science and Visualization For Complete Beginners - Part 1 | Lee Falin | Skillshare

Data Science and Visualization For Complete Beginners - Part 1

Lee Falin, Software Developer and Data Scientist

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more

Lessons in This Class

9 Lessons (1h 4m)
    • 1. Introduction

      0:54
    • 2. Working with Google Colab

      10:33
    • 3. Exploring Data with Pandas Dataframes

      13:48
    • 4. Data Preprocessing - First Pass

      5:53
    • 5. Data Preprocessing - Second Pass

      8:43
    • 6. Creating our First Visualization with Altair

      7:15
    • 7. Enhancing and Refining our Visualization

      13:15
    • 8. Reviewing the Visualization Steps

      2:45
    • 9. Conclusion and Next Steps

      1:18
18

Students

--

Projects

About This Class

If you've ever wanted to learn more about data science and visualization, but felt overwhelmed by all of the background knowledge it seemed to require, this class is for you.

In this, the first in a six-part series of introductory classes, you'll learn just how easy it is to get started with data science and visualization.

No programming, math, or statistical experience is necessary. There's nothing to install and no difficult configuration steps. All you need is a computer, a web browser, and a desire to learn.

Throughout this series, you'll learn not just how to use industry-standard tools to employ a variety of data analysis and visualization techniques, but more importantly you’ll learn the reasoning behind those techniques, and when and why to use them in your own work.

I hope you enjoy the course.

Meet Your Teacher

Lee Falin

Software Developer and Data Scientist

Teacher

Hello! I'm Lee Falin, and I'm a software developer, writer, and educator who loves to learn, create, and teach. I'm currently a professor of computer science at BYU-Idaho, where I get to teach courses in software design and development.

One of my favorite things about software development is that it's a skill that enables anyone, regardless of their age or background, to bring their ideas to life and share them with the world. It just takes some time, patience, and a fair amount of hard work.

I've been writing software for almost twenty years, working in the commercial software, telecommunications, and defense industries. I've also founded a couple of software startups, and worked as a data scientist in bioinformatics research.

These days, I spen...

Transcripts

1. Introduction: Hi, I'm Lee Falin. I'm a data scientist and educator, and this is the first in a six-part series on data science and visualization. Now, this course does not assume any background knowledge in math, statistics, data science, programming, or anything like that. We're going to walk you through the very beginnings, the very basics of data preprocessing, data analysis, data analytics, statistical analysis, data visualization, and data science. The six-part series will take you from creating very basic charts and very basic graphs, through finding data insights, all the way up to choropleth maps. We'll also be looking at some of the very beginnings of machine learning, which will lead into a second series of courses I have about machine learning. So hopefully you'll get a lot out of this course. You'll be able to learn a lot. If you have any questions, reach out to me on my website at leefalin.com, or leave a comment in the discussion below.

2. Working with Google Colab: Now, there are lots of ways to carry out data science work, and lots of different software packages you can install. But in order to keep things really simple, we're going to use one of my favorite tools, called Google Colab. Google Colab is a cloud-based data science development environment hosted by Google. It allows you to do data science and machine learning work without having to install anything on your computer, because all you need is a web browser. I'm going to be using Safari, but you can use any web browser. We're just going to go to colab.research.google.com. Now, depending on whether you've been to this website before and whether or not you're signed into a Google account, there are one of two things you'll see when you first connect. You'll either see this opening screen or, if you've never been there before, you will see a screen that looks like this.
Either way, if you're seeing this screen, which is just kind of an introduction to Google Colab, we're going to go up here to File and select New Notebook. Then it will ask you to sign into Google. If you don't have a Google account, just create one really quick. Otherwise, just sign into Google. When you log into Google Colab, you will see a list of all of your files saved in Google Drive. Once you are signed in on this page, you can click New Notebook and it will give you a blank page.

Now, there are two libraries we're going to be using today. The first is called pandas. Pandas is a data science and data exploration library written in Python. If you've never used Python, that's okay. We're not going to assume any Python experience. But if you do have some Python knowledge, that will be an advantage to you. The other library we're going to be using is called Altair, which is a graphing and data visualization library, also written in Python.

So first, in order to use Google Colab, over here on the right there's a button that says Connect. We're going to click that. What that does is connect our Google Colab interface to a virtual machine hosted by Google in the cloud. That means we have access to basically a virtual computer that has a certain amount of memory and storage space available to it. Once we close our web browser, that computer disappears, so any work we haven't saved will be lost. It'll be really important to save our work.

Now, Google Colab is what's called a Jupyter Notebook environment. What that means is there is a series of cells, and there are two types of cells: code cells and text cells. A code cell is a cell that's got a little Run button over here on the left. Any code you type will be executed when you click the Run button, and its output will be displayed below.
So, for example, if I type a math equation here and I execute that, the result will be displayed below. I can clear the result by clicking this Clear Output button, and I can rerun the cell as many times as I want by clicking this button. If I make a change to the cell, in order to see the change I have to rerun the cell by clicking the button. As a shortcut, if you're on Windows you can press Control+Enter to run the cell. You can see that if you hover over this, it tells you Control+Enter is a shortcut. Or on a Mac, you can hit Command+Enter. So that's what I'm going to be doing throughout this exercise: just Command+Enter on a Mac or Control+Enter on Windows to rerun whatever cell I'm currently in.

Now, a cell can have multiple lines in it, but the only thing that you'll see in the output is the result of the last statement. So if you have two statements here, only the last one's result will show up, although both of them will execute. You can see this because, let's say we have 5 plus 3 plus 2, which we know is 10. We can store that result in a variable, and then we can use that variable in a later expression. 5 plus 3 plus 2 is 10. We'll save that result in this variable, and then we'll use that result to ask what 10 plus 2 is. When we do that, we see we have 12.

Now, a lot of times we'll want to keep this information here visible, but we'll want to execute further statements. When we do that, we can add an additional code cell. That code cell will appear below our current one, and then we can execute additional statements below that. If we need to, we can also add a text cell. A text cell is just like it sounds: just some text, and you can add some styling and markup. The text cell in this environment uses what's called Markdown for its styling, but you can just type plain text. This is a text cell. Whatever you type will get previewed over here on the right. I can use Markdown to style it.
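The arithmetic cells described in this lesson might look roughly like the sketch below. The variable names `x` and `result` are assumptions (the transcript only says the result is stored "in a variable"), and `print()` is used so the sketch also runs as a plain script, whereas a notebook would display the last expression of each cell on its own.

```python
# First cell: store the result of an expression in a variable.
x = 5 + 3 + 2

# Second cell: use that variable in a later expression.
# In a notebook, only the value of the last expression in a cell is shown.
result = x + 2

print(x)       # value stored by the first cell
print(result)  # value computed by the second cell
```

Rerunning the first cell with a different expression changes `x`, but `result` only changes once the second cell is rerun as well.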
And when I click out of that cell, it will just display as plain text wherever it is in the document. Now, if there's a cell that I don't want in my document, when I click on it I can go over here to the right, click on the trash can, and remove it.

Now, the way that Google Colab works, and all Jupyter Notebooks work, is that any information we have stored in memory is available later on. And when I say later on, I mean later on when we run a cell, not later on moving down the page. So, for example, if I add another code cell here and I ask what x is, I can see that x is something that it remembers from earlier. I ran this earlier, so I can ask what the value of x is, and it says x is currently 10. I could use that in a statement later on, even though it's in a different cell. But let's say I modify this. I come over here and I say, okay, x is now going to be 5 plus 5 plus 5. So now x is 15 and x plus 2 is 17. If I rerun this cell, since x is 15, x plus 5 will now display as 20. So sometimes we'll make a change earlier on and we need to rerun a bunch of different cells. There's a shortcut for rerunning all our cells. Let's say I make this change so that x is going to be 20. I want to rerun every cell, because they all have something to do with x, and I want to display all of those values. If I go up here to Runtime, I can say Run all, and it will run every cell in order from top to bottom. You can see that that gets changed. There are also a couple of other variations, such as Run before and Run after, things like that. But basically we'll run the current cell, we'll run all the cells, or sometimes we'll run everything before the current cell.

Now, it's important to note that the order of the cells doesn't matter. What matters is the order of execution. You can see this little number that's over here on the left. If I don't hover over it, it tells you the order of execution.
This is the 17th thing that was executed. This is the 18th, this is the 19th. If I run this again, this is now the 20th thing that was executed. So the order on the page doesn't matter. If I add another code cell up here and I ask for x plus 1, it will tell me it's 21, because x is sitting in memory in Google Colab, and it knows that x has the value of 20 because of the last time we ran this cell. If I run this cell again, and then this one, and this one, x is still persisting in memory.

Now, sometimes things will happen where the memory will get all mixed up. We won't remember what values are in what variables, and so we'll need to clear everything out. When that happens, if we go up to Runtime, we can say Factory reset runtime. If I do that, it'll give me a little warning. I'll say yes. Now it will say that I need to reconnect. Once I do, it'll connect me to a new virtual computer. But if I now ask for x, it'll tell me it doesn't know what x is, because we've erased everything. So I have to rerun all the cells in order to recreate those variables. Now, I can't run this one because it's also using x. So I'm going to get rid of this top cell and just have what we had before. I'm going to say, let's redefine x, and then tell me what x is, and then let's use x in an equation. So that's the basics of Google Colab.

Now, if we want to save, we can come up here and select File and then Save. What it'll do is save it directly in our Google Drive. We can also rename it: come up here and say this is "demo file". Once I save it, it'll save it under this name in Google Drive. We can also go up to File, select Download, and select Download .py if we want to download a Python file of this. If you are experienced with Python, what it'll do is download it and display each code cell as a separate section of code. If you have text cells included, when you download your Python file those will be included as well, as comments.
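As a rough illustration of the point made in this lesson that values persist in memory until the runtime is reset, the sketch below imitates the behaviour in plain Python. The `del x` line is only a stand-in for a runtime reset, which is an assumption made so the example can run outside a notebook.

```python
# Cell A: define x (20 is the last value used in the lesson).
x = 20

# Cell B: can sit anywhere on the page; it works because Cell A already ran.
y = x + 1
print(y)

# A runtime reset erases everything in memory. `del x` mimics that here.
del x
try:
    x + 1
except NameError as err:
    # Python no longer knows what x is until the defining cell is rerun.
    message = str(err)
    print("After reset:", message)
```

Rerunning the cell that defines `x` is what makes the later cells work again, which is why Run all is useful after a reset.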
And then finally, if you are familiar with Jupyter Notebooks, you can also download a Jupyter Notebook file, the .ipynb file. That file is in a special format that other applications that know how to use Jupyter Notebooks will be able to deal with. But we will be working exclusively in Google Colab.

3. Exploring Data with Pandas Dataframes: So let's go ahead and clear out these cells. I'm going to delete everything, and then I'm going to do a factory reset, clear everything out, and reconnect. This will be our starting point. We'll start fresh with a brand new code cell here. To get these code cells, you just hover over the empty space between something and something else, and then you click Code. Or you can come over here on the left to insert a code cell below wherever the cursor is.

Now, we're going to be using a collection of libraries. Of course, libraries in data science and in programming mean a set of functions, a set of code that other people have already written for us. The ones we're going to be using are very common in industry. We're going to be using the pandas library. We're going to be using the Altair library. And at the very end, when we talk about machine learning in the sixth and final part of the course, we're going to be using the sklearn, or scikit-learn, library. All of these are Python libraries. Data science is generally done using either Python or R, and it kind of just depends on your background. A lot of people coming from a programming background will use Python, whereas people coming from a statistical background will use R. It doesn't matter which one you use. You can do pretty much everything in data science and visualization with either language, and each language has its own set of libraries to accomplish different tasks. But we're going to be using Python, because that's what my background is in.
We're going to be using the pandas library, which will allow us to explore data, manipulate data, and do some preprocessing that we need to do. We'll be using the Altair library, which is a very common library for data visualization. And then at the very end, as I mentioned, we'll be using the scikit-learn library, which is a library used for machine learning. If you then go on to my machine learning course, we'll be using a lot of pandas, Altair, and sklearn. We'll also introduce in that series the TensorFlow library, which is used for more advanced machine learning things.

So, as I mentioned, we'll be using the pandas library in order to process our data. First, let's take a look at what data we're going to be processing. If you go to the website address leefalin.com/data, you will be rerouted to a GitHub page that has a bunch of different data files. The one we're going to be using today is the Netflix titles data. If you want to know where this data came from, you can scroll down and see the sources and license information for where all this data came from. The Netflix data is a collection of information about movies and TV shows on Netflix as of 2019. If I click this link and come to this page, you'll see that the dataset is too large for GitHub to show it to me, so I'm going to click View raw. When I click View raw, it will show me just a raw view of the data. This is what's called a comma-separated file. Up at the top, we can see the header row, which tells us what information is here. We have a show ID, type, title, director, cast, what country it came from, the date it was added to Netflix, et cetera. And then each row has that information, with each value separated by a comma. So here's the show ID. This is a movie called Norm of the North: King Sized Adventure. This is the set of directors, et cetera. We can't do much with the data like this, so what we need to do is import it into Google Colab.
To do so, I'm going to copy this URL. I want to make sure I'm copying the raw URL; that'll give me the raw data file. Then, first, I'm going to load the pandas library. The way we load libraries is we ask to import them, so we'll say import pandas. Now, I could leave it like this, and then every time I need to use the library I would have to type out the word pandas. But there's a shortcut: we can assign a nickname, or an alias, to pandas. It's pretty conventional for people who do data science to give it a nickname of pd. So we'll import pandas and we will refer to it as pd. Now, when I run this, you'll see that nothing happens. That's because there is no output from the statement. We are loading the library and nothing else. As long as there's no error message, we'll know that everything went okay. If we misspell pandas and try to run this, it'll give us an error and say it doesn't know what that is. So as long as we have spelled everything correctly, we see no output. In this case, no news is good news.

Once we have our library imported, we can load our data. To load our data, I'm going to create a variable, a space to hold our data, and I'm just going to call it data. I'm going to set it equal to the result of the pandas read_csv command. The read_csv command will read a data file either from a URL or from a location on our hard drive, if we have this installed locally, and it will then turn it into what's called a pandas DataFrame. A DataFrame is basically something you can think of like an Excel spreadsheet, where each row in our DataFrame is an entry, or a sample, or a piece of information. It represents one instance of our data. Each column represents one attribute or feature of our data. The columns in a DataFrame we'll often refer to as series. So I'm going to use parentheses here.
And then, in quotes, I'm going to paste our URL. If you are unable to copy that URL for some reason, you can also write data equals pd.read_csv, and in quotes you can type https://leefalin.com/netflix. That will also route you to this raw data file. But if you were able to copy it from the GitHub link, then you should be okay. One thing I should note is that if you put a pound sign, or hashtag, in front of a line, see how it turns green? That means it's a comment, and it will be ignored by Google Colab.

So I'm going to load that data, and then I'm going to write data.head. What data.head does is show me the first five lines of the data file. When I run that, I see that my data file has now been loaded into a DataFrame stored in this variable. I've called that variable data, so I can refer to it later. By using the head command, I can see the first five lines of the file. Each line has a row index, starting with 0 and going forward, and then it has each item in the data. You remember this was a comma-separated file with a bunch of different values. Each one of those values is a column in our DataFrame. We have the show ID, the type, the title, director, cast, et cetera, and a little description. If something is too long to fit on the screen, pandas will use an ellipsis to abbreviate it.

Now, if I want to see the whole data file, instead of writing data.head I can just write data. What pandas will do is show me the first five rows, then skip a bunch of rows and show me the last five rows, and then tell me how many there are. So there are 6,234 rows, and remember we start at 0, so our last row index is always one less than the total. But it doesn't show me all the rows, and usually we don't want to see all the rows. We just want to see a view of what the data looks like, just an expectation.
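The loading steps described in this lesson can be sketched as follows. To keep the example runnable without a network connection, it reads a small in-memory CSV instead of the course's raw file; the rows are made up for illustration, and the column names are assumptions modelled on the header row described in the transcript.

```python
import io

import pandas as pd  # "pd" is the conventional alias mentioned in the lesson

# Hypothetical stand-in for the raw comma-separated file; these rows are made up.
csv_text = """show_id,type,title,director,rating
1,Movie,Example Movie A,Director One,PG
2,Movie,Example Movie B,,R
3,TV Show,Example Show C,,TV-14
4,Movie,Example Movie D,Director Two,TV-MA
5,TV Show,Example Show E,,TV-PG
6,Movie,Example Movie F,Director Three,PG-13
"""

# In the lesson, the quoted raw-file URL is passed to pd.read_csv instead.
data = pd.read_csv(io.StringIO(csv_text))

first_rows = data.head()  # the first five rows by default
print(first_rows)
print(len(data))          # total number of rows in the DataFrame
```

`pd.read_csv` accepts a URL, a local file path, or a file-like object, which is why the same call works for both the in-memory text here and the hosted file used in the class.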
So whenever we see an ellipsis, that's pandas saying, there's more stuff here, I'm just skipping it. There are ways to make it show you everything. But like I said, generally we don't need to see all 6,000 rows printed out here. Most often, we'll use the head command in order to show just the first five.

Now, when we use this, here is the variable we're acting on, the data variable, which holds our DataFrame. If I hover over that, it tells me what exactly it is. At the top of that little box, it says it's a DataFrame called data, and it is a DataFrame with a shape, which tells us the dimensions of the data: 6,234 rows and 12 columns. Now, this dot just says, I'm going to execute a command that belongs to the DataFrame, and the command I'm executing is called head. Whenever we execute a command, or a function as we call it in Python, we end it with a set of parentheses. It's the same thing we did up here with read_csv. We had the pandas library, we are going to execute a command, and the command, or function, we're executing is read underscore csv.

When I hover over this, it shows a whole bunch of stuff. All of this stuff is different settings, or parameters, that we can modify when we run this command, and it tells us what the default value for each one is. We're just using all of the defaults for that one. Head has a smaller set of parameters. It only has one, called n. n is the number of rows that it's going to show us. It tells us here that it is an int, which means a number, an integer, and the default value is five. So if we don't say anything, it will show us the first five rows. We can instead use a different value. We could say we want n equal to 2, and then it will only show us the first two rows. Or we could say we want n equals 7, and it'll show us the first seven rows. But in our case we're just using the default. If we don't put anything, then n will equal five and it'll show us the first five rows.
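A small sketch of the `head` parameter and the `shape` attribute discussed here. The DataFrame below is made up purely for illustration; with the course data the shape would be the one reported in the lesson.

```python
import pandas as pd

# Made-up rows, for illustration only.
data = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "rating": ["PG", "R", "TV-14", "G", "PG-13", "R", "TV-MA", "PG"],
})

print(data.shape)      # dimensions as (rows, columns)
print(data.head())     # n defaults to 5
print(data.head(n=2))  # only the first two rows
print(data.head(7))    # n can also be passed positionally
```

Note that `shape` has no parentheses because it is a stored attribute, while `head()` is a function that has to be called.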
Here it gives you a little summary of what the command does. If we scroll through this, it'll tell us what the parameters do and then give us an example of what we can expect from the command. So we now know what this looks like.

We're going to do one more command to explore this data a little bit further. We're going to ask for the info of the DataFrame. When I run this command, it will give me a little summary telling me again how many rows there are and how many columns there are. Then it will tell me something about each column, and this has got some important information. First it'll tell me the order, then the name of the column, and then how many non-null values there are in the column. A null value means an empty value, a cell in our DataFrame that has no value. Here's an example. This show, Transformers Prime, does not have a director. So instead of the director's name, we see a value called NaN, which means "not a number". This is pandas's way of saying this value was blank when we got it from the data. If we go back to our raw data and find Transformers Prime, we can see this little empty spot here for the director. The director was not listed in this data. You will see that a lot in data, where we'll have missing values, and this is a major thing that we have to deal with in data science, depending on what we're doing with the data. The info command will tell us how many missing values we have. If there are 6,234 total rows, we can see show ID doesn't have any missing values. Type doesn't have any missing. Title doesn't have any missing. But director has a whole bunch missing; there are only 4,000 that are not missing. Then for cast there are 5,000, country 5,700, et cetera. And we can see that release year, rating, duration, and description all have values in every field.
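To make the non-null counts from `info` more concrete, here is a minimal sketch on a made-up DataFrame with two missing directors. The `count()` and `isna().sum()` calls are not used in the lesson; they are added here only as another way to read off the same numbers.

```python
import numpy as np
import pandas as pd

# Made-up rows; np.nan marks a value that was blank in the source data.
data = pd.DataFrame({
    "show_id": [1, 2, 3, 4],
    "title": ["Example A", "Example B", "Example C", "Example D"],
    "director": ["Director One", np.nan, np.nan, "Director Two"],
})

data.info()                  # prints row count, column names, non-null counts, dtypes

non_null = data.count()      # non-null values per column
missing = data.isna().sum()  # missing values per column
print(non_null)
print(missing)
```

For any column, the non-null count plus the missing count equals the total number of rows.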
Over here on the right, we can see what kind of values they are. int64 means it's a number. So show ID is a number, a whole number. int means integer, which means there is no decimal part. It's a number you would use for counting, or it could be a negative number. Release year is also a number. Everything else is an object. Usually, not always, but usually, object means it's just text, or what we refer to in Python as a string. You can see all of these other fields have text in them, except for the ID field and the release year field, which are just numbers. Some of them have numbers and text combined. When that happens, we just refer to it as a text string, or pandas calls it an object. So we can see the two different data types we have: two integers and 10 objects. And then it tells us how much memory it uses to hold all of this data.

4. Data Preprocessing - First Pass: Now that we have all this data, we need to decide what we're going to do with it. In our exercise today, we're going to create a visualization that shows the distribution of movies based on their ratings, whether they're rated PG or PG-13 or R. These are movie ratings used in the United States. What we want to see is a visualization showing us that distribution. In order to do that, one of the first things we need to do is filter our data so it only contains the information we're interested in. In this case, our data contains a mix of movies and TV shows, and we only want movies. So I'm going to hover over here. You can see that as I hover below the output, I get the code line. If I can't find it, I can just click in the cell and then click this Code button, and it will add another cell below for me. What I'm going to do is create a filter, a pandas filter, that will tell me whether or not something is in fact a movie. So I'm going to say data; that's my DataFrame.
Then I'm going to use a square bracket. The square bracket allows me to access a particular item, a particular column, inside my DataFrame. The column I want to access is the type column. If you remember, the type column tells me whether or not it's a movie. We have to make sure we spell this exactly right, including whether it's uppercase or lowercase. This is what's called case sensitive. So I'm going to ask for data, and then in quotes I'm going to write type. If I run that cell, what I get is just the values in the type column, all 6,234 of them, again skipping the ones in the middle. All of the values are there; we're just not displaying all of them.

Now, I don't want the actual values. What I want is to create a filter based on these values. I want pandas to tell me whether this value says Movie or TV Show. So I'm going to ask whether or not that value is equal to something. When I ask if something is equal to something, I use two equals signs. When I want to assign a value, I use a single equals sign, but if I'm asking a question about whether something is equal, I use two. So I'll ask, is the type equal to Movie? Again, I have to be careful with the casing. Since Movie has an uppercase M here, I'm going to use an uppercase M here. If I run this, instead of telling me the actual values, it'll tell me true or false, whether those values were equal to Movie. So the first two items were Movie, the next two are TV Show, and then Movie. If I come up here, I can confirm that that is the case. So this is my filter. Now I'm going to save my filter into a variable, and I'm just going to call it movie filter. I'm going to set it equal to the results of this question. So basically, I'm asking pandas a question about the DataFrame. It's telling me true or false for each row in the DataFrame, and I'm saving those results here.
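A minimal sketch of building the true/false filter described in this lesson, on made-up rows that follow the same Movie, Movie, TV Show, TV Show, Movie pattern. The identifier `movie_filter` is an assumed spelling of the spoken name "movie filter", and `wrong_case` is a hypothetical extra added to show the case sensitivity.

```python
import pandas as pd

# Made-up rows, for illustration only.
data = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "title": ["Example A", "Example B", "Example C", "Example D", "Example E"],
})

type_column = data["type"]              # square brackets select one column

# Two equals signs ask a question; a single equals sign stores the answer.
movie_filter = data["type"] == "Movie"
print(movie_filter.tolist())

# The comparison is case sensitive, so lowercase "movie" matches nothing.
wrong_case = data["type"] == "movie"
print(wrong_case.any())
```

The result of the comparison has one true or false value per row, in the same order as the rows of the DataFrame.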
Once I have a filter, I can then say, okay pandas, I want you to only give me the rows where the filter is true. The way I do that is I say, give me the DataFrame, and remember we use square brackets to ask for something from the DataFrame. What I want is all of the rows where movie filter has been applied. If a row has a true value in movie filter, that row will be given to me. If the row has a false value in movie filter, we'll skip that row. When I run this, every row I get back, and again we're not going to show all of them, is a row with Movie. You can see by the row indexes that I'm skipping some of the rows. That's because we're only getting the rows where this filter has a true value. This is called a view, or a filter, of the data, and I can store that in a variable. I'm just going to call this movie data. Now when I run this, my movie data is stored here, and I can do just about anything I want to it, just like I would with a normal DataFrame. So I can ask for the head of that. If I run this, I can see the first five rows of my newly filtered movie data.

So, to review: here, we are creating a filter that says, tell me yes or no, true or false, for each row in the DataFrame, is the type equal to Movie? I store those results in this variable. Then I say, DataFrame, I want you to give me all the rows where movie filter is true, and I store those results in a variable called movie data. Then here I'm just printing the first five rows of the movie data DataFrame, which I can see are now all movies. So I've filtered out everything that was not listed as a movie. If I want to know how long this is, instead of asking for the head I can ask for the info. It will tell me that this has 4,265 rows. I can see that the row numbers span from 0 to 6,231, again because we're skipping a bunch of rows. And I can see that I still have some missing values.
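Applying the filter might look like the sketch below, again on made-up rows. The names `movie_filter` and `movie_data` are assumed spellings of the spoken variable names. The printed index shows why the row numbers in the lesson appear to skip: filtered rows keep their original labels.

```python
import pandas as pd

# Made-up rows, for illustration only.
data = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "title": ["Example A", "Example B", "Example C", "Example D", "Example E"],
})

movie_filter = data["type"] == "Movie"

# Passing the true/false values inside square brackets keeps only the True rows.
movie_data = data[movie_filter]

print(movie_data.head())
print(movie_data.index.tolist())  # original row labels are kept, with gaps
print(len(movie_data))            # number of rows that passed the filter
```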
But that's okay, because we're not going to be using those columns. So I'm going to change this back to head so that I can see that.

5. Data Preprocessing - Second Pass: The next thing we want to know about is our movie ratings, because our goal is to show a distribution of movie ratings. Over here in the rating column is where I can find my movie ratings. Just as an exploration before I start making my visualization, I want to know what those values are and how many of them there are. From now on, I'm only going to be working with movie data instead of all of the data. So I'm going to say movie data, and I'm going to ask for the rating column. Sometimes you'll see me using single quotes and sometimes I'll use double quotes. It doesn't matter; you can use either one. To be consistent, I'll always use double quotes. Usually I use single quotes because I don't have to hold the Shift key, so I guess it's just a matter of laziness.

If I ask for the rating column, I can see all these different movie ratings. Now, I can ask for a summary of the values, and how many of each value there are, by saying I want to execute the command value_counts. Since I'm executing it on the result of asking for the rating column, I'm only going to get the value counts for that column. In pandas, when we ask a DataFrame to give us just a single column, we call that a series. So I'm executing this command on the rating series, or the rating column. You can think of it either way. When I run this, I get a count of each value. So for the TV-14 rating I've got a thousand of those, for R I've got 506, then PG, et cetera. It goes in order from whichever one has the most to whichever one has the least. Now, if you're familiar with United States movie ratings, you'll know that some of these are not movie ratings.
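The counting step described in this lesson can be sketched with `value_counts` on a single column. The ratings below are made up, so the counts do not match the numbers quoted in the lesson; `movie_data` and `rating_counts` are assumed names.

```python
import pandas as pd

# Made-up ratings, for illustration only.
movie_data = pd.DataFrame({
    "rating": ["TV-14", "R", "TV-14", "PG", "R", "TV-14", "TV-MA"],
})

# Selecting one column gives a Series; value_counts tallies each distinct value.
rating_counts = movie_data["rating"].value_counts()
print(rating_counts)  # sorted from the most frequent value to the least
```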
TV-MA, TV-14, and TV-PG, as you might guess from the names, are actually TV ratings for television shows. And so even though something is listed as a movie, some of those still have TV ratings. These are probably what we call made-for-TV movies. Those are movies that are created and never sent to a movie theater, but instead just released in a way that they're shown either on Netflix or Amazon Prime or on a television broadcast. And so instead of a motion picture rating of R or PG, they will get a TV rating. We're only interested in movie ratings, so we're gonna do some more filtering. I'm gonna add another code cell here, and I'm going to say I want to exclude all of the rows that have these TV ratings, these made-for-TV movies. The way we're gonna do that is first I'm going to make a list of the ratings that I'm interested in. I'm going to call these the MPAA ratings, and that stands for Motion Picture Association of America. Now, there is going to be a whole list of these. And in Python or Google Colab, whenever I'm dealing with a list or collection of things, I always use square brackets. You'll notice we used square brackets up here because this is a collection of Series, a collection of columns. And here I'm going to create a collection of movie ratings. Now, my movie ratings are each going to be just strings, some text values. And so, in quotes, the first movie rating. I'm going to just go through this list from oldest to youngest. So we have the NC-17 rating. That's only for people that are older than 17. And then we have the R rating, the restricted rating. Then we have PG-13. And then we have PG. And finally, G. These are our movie ratings. Now, notice I put spaces here. That's just to make it easier to read. You don't have to use those spaces. You don't even have to have spaces after the commas, but it just makes it easier to read. So now we have a list of five different movie ratings. Okay.
So if I run this cell, and this is important, I have to run this cell in order to store this list in memory. If I run that cell, I can then use that MPAA ratings value, just like we did earlier in the demonstration, and I can see it's a list of five things. Okay? So what are we gonna do with that? We are going to create another filter just like we did here, where we said, I only want rows where the type is equal to movie. I'm going to create a filter that says I now only want rows where the rating is inside this list of ratings, is one of these values. Now, I'm going to continue to operate directly on our filtered data, the movie data that we created from our first filter, and I'm going to filter it further down. So a pattern you'll see a lot when we work in pandas is: we have a DataFrame. We create a filter to kind of scale that down a little. Then maybe we create another filter to work further on that. And we can just keep filtering as we go. That's what we're gonna do here. So I'm gonna say movie data. And here's a little shortcut: as I start typing, you see this auto-complete list show up. This is a list of things that start with what I've already typed. I can use my arrow keys to move up and down here to select something, and if I press Enter, it'll just fill in the rest of that word for me. So it's just a little thing to make things faster. So I'll say movie data, and what I'm going to ask for is the rating column. Let's just look at that for a second. We've seen this before. This is all our ratings. And I'm going to ask for a filter where those ratings are in our list of values. So I'm going to use a command here called isin. This command takes a list of values. If I hover over that, you can see it takes a list of values as its parameter, as its argument here.
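Here is a minimal sketch of the list and the isin command on made-up rows. The names `mpaa_ratings` and `mpaa_filter` are assumptions based on the spoken names, and the ratings are written with hyphens on the assumption that this is how they appear in the data.

```python
import pandas as pd

# Made-up rows standing in for the already filtered movie data.
movie_data = pd.DataFrame({"rating": ["TV-14", "R", "TV-MA", "PG-13", "G"]})

# The ratings we care about, from most restricted to least restricted.
mpaa_ratings = ["NC-17", "R", "PG-13", "PG", "G"]

# isin() takes a list and answers True/False for every row:
# is this row's rating one of the values in the list?
mpaa_filter = movie_data["rating"].isin(mpaa_ratings)
print(mpaa_filter)
```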
And I'm going to say I want all the rows where the rating is in this list of MPAA ratings. Okay? When I do that, it'll tell me true or false for every row in movie data. And you can see again, this is our filtered data from earlier. For every row in movie data, true or false, the rating value is in this list. So this one is not on that list. This one's not on that list, this one's not, this one's true. So six is true. If we come back up here, we can see: not in our list, not in our list, not in our list, is in our list, not in our list. And so now we have a filter that tells us true or false for every row, whether those values are in our list of movie ratings. So just like we did before, we're going to store this filter in a variable. I'm going to call it MPAA filter. And then we are going to say, movie data, I want to apply this filter, MPAA filter. Now when I do that, the rows I get back are just the ones with MPAA ratings, just ratings in our list. And so this is a new DataFrame, or rather a view or a filtered version of our earlier DataFrame. So we've been filtering down again. We started with a DataFrame called data that had all of the data. Then we created a filter, and we stored the results of that filter in movie data. And then we created another filter on movie data, and we're going to store the results of that filter. Let's call it real movies, because these are not just made-for-TV movies. So we'll store the results in that DataFrame. And then, just like we've been doing, same pattern, we'll ask for the head so we can see the first five rows and make sure that those look like what we expect. And they do. Then we can ask for the information about that if we want, and we can see there are 1,013 entries now. So at the very beginning we had 6,000 rows, then we filtered down to the things labeled as movies, and then we filtered further to the movies that have MPAA ratings.
We now have 1,013 rows left that we're dealing with in this DataFrame. 6. Creating our First Visualization with Altair: So now we have all of the movies that we're interested in. Let's try to move on to our visualization. We finally have the data preprocessed. We call this preprocessing. Some people call it data wrangling or data munging, where we are taking our raw data and filtering it and modifying it in order to get it into the shape we need to do something useful with it. In our case, the useful thing that we're gonna do is create a visualization. Now, pandas itself has some really rudimentary visualization capabilities. They are pretty ugly, but they're good sometimes for just a quick visualization. Let's look at that first before we make a nicer one. I'll add another code cell here. From this point on, I'm going to be working on the real movies DataFrame. That's got just our final filtered data in it. So we're gonna say real movies, and I want the rating column. And let's look at it. We'll start out looking at the value counts. If I look at that, I can see the distribution of movie ratings in our recently filtered data. And then what I can do is ask it to plot those value counts, and I can tell it the kind of plot I want is a bar graph. So if I run this, I will see a bar graph. Now, this information up here just tells me that this is what's called a matplotlib graph. If you're familiar with Python, you might have heard of that. It's a plotting library that Python uses. But we can see that for the rated R movies, there are around 500, and for PG-13 there are around 300. Now, there's not a lot we can do with this. There's a bunch of matplotlib commands we could learn, but we're not gonna be using the matplotlib library because the Altair library that we're going to be using instead is more widely used in data science. And it has a nicer set of commands that let us do a lot of neat things with it.
So we'll ignore this plot for now. We'll add another code cell, and we're going to import the Altair library. The way we're gonna do that is very similar to how we first imported the pandas library. What we'll do is say import altair. And just like before, we're going to give it a nickname or an alias so we don't have to say altair dot every time. The conventional nickname for it is alt. So I'll run that. And once again, we won't see any output because we're just importing the library. If I can't remember whether I ran something, I can just click off of it. As long as there's a little number there to the left, it tells me that I have in fact run this cell. Now that I've imported the library, I'm going to create a really basic chart with it. The way we do this is as a series of commands. Altair uses what we call declarative visualization, where we declare a series of commands that build up our chart piece by piece. Okay? So what we're gonna do is say, okay, Altair, I want to make a chart. And when we want to make a chart, we first tell it what data we want to use. The data we want to use is our real movies DataFrame. Okay? Now, if I were just to run this all by itself, it would just give me an error. So there are still some commands I need to add here. I've told it I'm gonna make a chart. I need to tell it now what kind of chart. The kinds of charts in Altair are called marks: how are we marking up the chart? In our case, we're going to make a bar chart, so I'm going to tell it I want to use mark_bar. Okay? So I want to make a bar chart out of this data, but I need to give it further information. I now need to tell it which columns I'm going to be using for different things, how I want to encode that data into my visualization. The way I do that is I add an encode command onto the end. And so you can see we're just adding on.
We start out with Altair: we want to make a chart, a bar chart, and now we're going to tell it how to encode it. So we just chain these commands together. When we do the encoding, I'm going to do this in kind of a weird way at first, and then I'm going to do it the way most people do it. I'll start with the way that you're probably expecting to see it if you've never used Altair. I need to tell it what the x-axis is going to be and what the y-axis is going to be. The x-axis is going to be our movie rating column, and for our y-axis, what I want there is the count of movie ratings. So I can tell Altair that I want the count function. I want to count the number of ratings. Now, for movie rating itself, since it's a column in our DataFrame, I just write its name. But count is a command that Altair understands, and since it's a command or a function, I have to put parentheses after it. So now if I run this, I will get a very basic chart. Even though this is very basic, one thing you'll notice right away is that the text is crisper and clearer. That is because Altair is using a different kind of rendering than what pandas uses. Now, we want to make a chart that looks a little bit nicer than this. So I'm going to add another code cell here, and I'm going to type this out again, but I'm going to type it the way most people using Altair would type it. It's important to do it this way, because there are often a lot of different commands or a lot of different settings we want to put into each of these steps. So first, I'm going to ask Altair to make a chart out of our real movies data. And it's going to be a bar chart. Notice that when I type the opening parenthesis, my cursor shows up between a set of parentheses, and I'm going to press Enter, so that there's one parenthesis here and another parenthesis here. Then I'm going to type encode, parentheses again, and then do the same thing.
So I've got this blank space inside mark_bar and a blank space inside encode. This is how most people will write this out. Now I can say I want my x-axis to be the rating and then, comma, my y-axis to be the count. If you look back at this original version, I have typed this exact same sequence of things in the exact same order with the exact same set of characters. The only difference is I've added blank lines so that they're spaced out differently. So if I run this, I'm going to get the exact same chart, but this is how most people will write this out. If you missed that, go back, rewind the video, and watch that again to see how I got this spacing, because this is how we're going to be doing it from this point forward. This just makes it easier to add different settings to the bar chart and different settings to our encodings. And later on there'll be other things we want to add. So we will be working with this from this point forward. 7. Enhancing and Refining our Visualization: Now we're gonna make a bunch of different transformations to make this chart look nicer. The first thing we're gonna do is look at what are called the axis titles. So this is our y-axis, and this is our x-axis, and our y- and x-axes have labels here and here, and then they have titles. You might have heard these referred to by different terms, but in Altair they're referred to as titles and labels. Usually we need a good title for each axis. Sometimes, depending on the labels of the axis, it may be obvious and we don't need a title, but almost always we should have some kind of title on each axis. Okay? So we've been using what we call shorthand syntax for encoding our x- and y-axes, but there's another syntax we can use. The shorthand syntax just says I want the x-axis to be equal to this column, or I want the y-axis to be equal to this command. So what we're gonna do instead is use the longhand syntax.
In the longhand syntax, we say we want the x-axis to be equal to alt dot and then a capital X, and then we wrap this in parentheses. I'm going to do the same thing with alt dot Y for the y-axis. And if I run this, it will give me the exact same thing. But what this does is allow me to specify a bunch of different values for transforming what the x-axis looks like and what the y-axis looks like. When we had just this version, we could only specify the values that are going to be used to generate those. When I have this version, I can specify a bunch of other settings. The first thing I wanna do is change the title. For the x-axis, I don't want it to just say rating. By default, it's just going to tell me the name of the column or a description of the command that I used. So what I wanna do is explicitly set the title of this axis to be MPAA Rating. Usually when we create a title for a graph, whether it's an axis title or a chart title, we use what's called title case, where the first letter of each word, except for certain words, is uppercase. Now, you can look up how title case works. There are websites where you can look that up. If you search for title case, there's even a Title Case Converter where I can enter some text here, like all the dogs in America, and I'll say convert. Then it will tell me that in the title case version of this, these words should be capitalized and these words shouldn't. Usually prepositions and other small words are not capitalized, so "the" and "in" stay lowercase. So if I'm not sure what something looks like in title case, this website will help me out, and I can tell it which version of title case I want to use based on this collection of settings. It's not 100 percent important for this tutorial, but good to know if you're not sure how title case works. You can, of course, do this however you want. If you wanted to have a really weird casing, you could do that. But we're going to be using standard title casing.
The title of this axis is going to be MPAA Rating. And for the title of this axis, let's think about what it should say. Our goal, when we create a chart or visualization, is for people to be able to look at it at a glance and know exactly what it is we're trying to tell them. So we want to be really explicit here: Number of Movies on Netflix, okay? So this is the MPAA rating, and this is the number of movies on Netflix with that rating. Okay, now that we have that, let's go ahead and add a title to our chart. When we want to operate on the entire chart, we add another command here called properties. And just like before, I'm going to go in between the parentheses and press Enter. Here we do things a little bit differently. Inside these parentheses, we are creating the properties for the x-axis, and the x-axis is a property of the encoding. And here we're creating the properties of the y-axis, which is a property of the encoding. Out here, we are creating the properties of the chart, so we're still going to use title, but we don't need to wrap it in anything like we do with the encoding because we're already working on the entire chart. And so the title of the chart will be Netflix Movie Ratings as of 2019. So when we run that, now we have a title on our chart and we have a title on each axis, and that's starting to look better. It still doesn't look great, but it looks better. One thing we might wanna do is make this chart a little bit wider. So I'm gonna add a comma here. This is going to be another property of my chart, and I'm going to ask for the width of my chart to be, let's say, 500. Notice this is not in quotes because it's a number rather than text. Okay? So when I do that, my chart will get wider. Now notice that when I do that, my bars get wider automatically. I can make the size of my bars a specific number by adding a value up here in the mark_bar settings.
Usually, though not always, it's better to just let the bars auto-fit to fill the size of the chart. So now let's put these in a different order. The order it has now doesn't really make sense. It's not from smallest to biggest or biggest to smallest. It's not going in order of the movie rating scale. All it's doing is going alphabetically by value. That can make sense in some cases, but in this case, it's weird to have NC-17 sandwiched between the G and PG ratings. Usually we expect these ratings to be in order of severity, either from most restricted to least restricted or the other direction. So what I wanna do is say how I want my x-axis values to be sorted. I'll add another parameter here called sort, and I have to tell it the sort order. There are a couple of different ways I can do that, but the easiest way is to just give it a list saying, put this value first, then this value, then this value, then this value. And I already have that list, because if I go back up here, here is my list of movie ratings from most restricted to least restricted. So I'm just going to use the same list that I was using for my filter. I will say for my sort, I want it to be equal to MPAA ratings, which is my list of ratings. And so now my ratings go in order from NC-17 through G, from most restricted to least restricted. Now, if I wanted it to go in a different order, I could just directly specify a list here. Instead of MPAA ratings, I could put square brackets and then say I want G, and then I want PG, and then I want PG-13, et cetera. I could write that all out, and if I did that, it would follow that order. But I'm just going to use the order that we already have in MPAA ratings, so I'll get the most restricted on down through the least restricted. Now, I think it would be neat if each of these bars was a different color. Some of you with more aesthetic tastes may look at that and say that looks like a dumb idea.
Just for the sake of exploring more of what Altair can do, though, we're going to add a color encoding. With the color encoding, I can say that I want the colors of the values to be based on a certain column. Now, just like before, there's a shorthand version and a longhand version. The shorthand version is, I can just say I want the color to be based on the rating column. When I do that, it'll give me a different color for each rating, and it will automatically create a little legend up here that tells me what color goes with which rating. But I can also do the longhand version, which is what I'm gonna do, by saying this is a color encoding and the value I want to use is rating. When I do that, it gives me the exact same results, but now I can set additional properties. One of the properties I want to set has to do with the legend. Even though it's pretty cool that there's a legend here, since each of these colors is already labeled down here, this is kinda superfluous information. So I'm going to tell it I don't want a legend. The way I do that is in the color encoding, I will just say legend equals None. Notice this is a capital N and it is not in quotes. This is a special value in Python that means nothing, literally nothing: not an empty string, not a 0, just nothing. So when I run that, Altair will say, okay, you don't want a legend, so it won't create a legend. And so each of these is a separate color. Now, if I look at this, maybe I'm looking at these titles and these labels, and I feel like these are a little bit small for what I wanna do with it. So I want to make some changes to the label and title font sizes. The way I do that is I'm going to add another command here called configure_axis. The configure_axis command allows me to specify some properties that will apply to both axes. So for example, maybe I want the label font size to be 14. When I do that, my label fonts are now a little bit bigger.
I need to make sure I put a comma here, and then maybe I want my title font size to be 18. So now those are really big. You'll notice this is not affecting the title of the chart. There's a different way we have to do that, and we'll talk about it later. But let's say that these colors are too bold, and maybe I want them to be a little more subdued. So I'm going to set the title color. I need to make sure to put a comma here, but I'll leave it off so you can see what happens when you don't, unless you've already experienced that. When I specify the title color, there are a bunch of different ways I can do this, but the most common way is to use HTML color codes, so I prefix it with a pound sign. If you don't know what HTML color codes are, they are hexadecimal codes that start with a pound sign, and you can see they look like this. If I pick any color here, this tells me the amount of red, the amount of green, and the amount of blue in hexadecimal. Black has none. If I go to white, it has F. F means the full amount, the full amount of color. Not because full starts with F; that's just how hexadecimal works. And as I scroll around through different colors, I can see what those codes are. Some people have certain HTML color codes memorized if they do a lot of web design work. I'm going to use just a nice light gray color, so we'll use D3D3D3 as our color code. And again, make sure it starts with a pound sign. Now, since I didn't put a comma here, when I run this, it's going to tell me this is invalid. It's going to say it doesn't know why this is here. So if you see that, it's usually because you missed a comma or because you spelled something wrong. When I put that comma there, notice now that my title is a little faded, kind of a little subdued there. Whether or not that looks good to you depends on your aesthetics. Also notice this looks a little close, so I can increase the padding.
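Stepping back to the color codes for a moment, this small sketch shows how a code such as #D3D3D3 breaks down into red, green, and blue amounts. The helper `hex_to_rgb` is a hypothetical function written for this illustration; it is not part of pandas or Altair.

```python
def hex_to_rgb(code):
    """Split an HTML color code like '#D3D3D3' into (red, green, blue), each 0-255."""
    digits = code.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected six hexadecimal digits after the pound sign")
    # Each pair of hexadecimal digits is one color channel.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#000000"))  # black: none of any color
print(hex_to_rgb("#FFFFFF"))  # white: the full amount of every color
print(hex_to_rgb("#D3D3D3"))  # the light gray used for the axis titles
```

Equal amounts of red, green, and blue always give a shade of gray, which is why D3D3D3 comes out as a light gray.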
For the padding, I can say maybe I want the title padding to be 10. That'll space that out a little bit on both axes. And now the last thing I wanna do is make the chart title a little bigger. I can't modify it here. These are properties for the whole chart, okay, not for the chart title. There are some ways I could get around it, but what you'll usually see most people do is add another command here called configure_title, just like we did with our axes. We configured the axes. Now we're going to configure the title of the chart. And I'm going to say I want the font size to be, I'll say, 24. So that looks good, but still a little bold. So I'm going to drop the color down, and I'm just going to reuse that same color code that we used, the D3D3D3. And so here is my final chart. Now, again, you may not love the color scheme. In a future lesson, we will talk about how to modify this to use the colors you want. 8. Reviewing the Visualization Steps: So here's the final version of our code. To walk through this again, we're telling Altair to create a chart for us based on this data. The kind of chart we want, the kind of marks we want to make on it, are bars; we want a bar chart. We're going to encode that data such that the x-axis uses these properties. It's based on the rating column of our data. This is its title, and this is how the values are sorted. The y-axis has these properties. It's a count of each value, and here's its title. We also are adding a color encoding so that our values are color-coded based on the rating value, and we don't want a legend telling us what those colors represent. Next, we assign some properties for the entire chart, giving it a title and a specific width. We could also set the height here the same way if we wanted. Then we are going to configure the axes of the chart using these values, which will be applied equally to both axes. And then we're going to configure the title of the chart using these values.
And again, there are other ways to accomplish this, but this is the way we're gonna do it because I think it makes the most sense, especially for people just learning Altair. Now, once we have this, what do we do with it? Let's say we want to put it on a webpage or display it in some report. Over here on the right, whenever we create a chart in Altair, there's this little button we can click, and it will allow us to save this in a couple of different formats. If you're familiar with SVG, you know you can use that in a webpage. You can also use a PNG file. So when I click this, if you're using Safari, it'll ask you, do you want to allow downloads from this page? And I'll say I do. And then it'll download the PNG file for that chart, and it'll just be a file that shows exactly what we see. Now, if I scale this up, you can see it gets blurry, just like all PNGs do. So if I need it to be a different size, before I save it I want to set it to render at whatever size I want. Alternatively, I can download it as an SVG file. An SVG file is a file that we can also put in a webpage. The difference is that when I put this in a webpage and zoom in on it, SVG files do not get blurry. They just re-render the contents at whatever scale we ask for. So some people like SVG files better for that reason. But depending on what you're doing with the file, you may need a PNG file instead, so you can use either of those. These other commands are not really important for the sake of our lesson and are not really used that much. 9. Conclusion and Next Steps: So that concludes the first part of our six-part series on data science and visualization. I hope you'll tune in to the rest of the series. There should be one coming online each week. In each episode we're going to be going a little bit further into data analytics and data visualization, doing some more advanced filtering, more advanced preprocessing, looking at different scenarios of how we handle things.
And looking at more and more capabilities of Altair for visualization. Then we're going to be building up to the point where we can make things like choropleth maps, where we have map-based visualizations, and then getting into some machine learning at the very end, where we'll kinda introduce the topic of machine learning and look at how to do predictive analytics and visualization. As I've mentioned, I also have another series coming out on machine learning. By the time you watch this, it might already be out. If not, head over to my website, leefalin.com, and sign up for my newsletter, and you'll be sure to be notified whenever that course comes out. It's a super low volume, super low spam sort of thing. I just kind of announce things I'm doing, books I've released, courses I'm offering, things like that. So if you're interested at all in data science or visualization, or you just want to ask a question, go over to my website, leefalin.com, sign up for my newsletter, and just reach out with any questions or concerns you have. Thanks.