
Data Science and Visualization For Complete Beginners - Part 3

Lee Falin, Software Developer and Data Scientist


Lessons in This Class

11 Lessons (1h 6m)
  • 1. Introduction (1:00)
  • 2. Dealing with Bad Data (16:51)
  • 3. Statistical Analysis of Data (2:16)
  • 4. Transforming Columns Through Mapping (7:39)
  • 5. Creating a Histogram (4:08)
  • 6. Plotting the Median Value (7:22)
  • 7. Adding Labels (8:59)
  • 8. Creating Box Plots (6:21)
  • 9. Creating Scatterplots (5:38)
  • 10. Creating Interactive Plots for the Web (3:15)
  • 11. Recap and Next Steps (2:39)


About This Class

If you've ever wanted to learn more about data science and visualization, but felt overwhelmed by all of the background knowledge it seemed to require, this class is for you. 

In this class, the third in a six-part series of introductory classes, you'll continue to learn more about data science and visualization. Since this class builds on the techniques and methods from the first two classes in the series, be sure to complete those first.

Throughout this series, you'll learn not just how to use industry-standard tools to employ a variety of data analysis and visualization techniques, but more importantly you’ll learn the reasoning behind those techniques, and when and why to use them in your own work.

Meet Your Teacher


Lee Falin

Software Developer and Data Scientist


Hello! I'm Lee Falin, and I'm a software developer, writer, and educator who loves to learn, create, and teach. I'm currently a professor of computer science at BYU-Idaho, where I get to teach courses in software design and development.

One of my favorite things about software development is that it's a skill that enables anyone, regardless of their age or background, to bring their ideas to life and share them with the world. It just takes some time, patience, and a fair amount of hard work.

I've been writing software for almost twenty years, working in the commercial software, telecommunications, and defense industries. I've also founded a couple of software startups, and worked as a data scientist in bioinformatics research.




Transcripts

1. Introduction: I'm Lee Falin. I'm a data scientist and educator, and this is the third in a six-part series on data science and visualization. Today we're going to be looking at a nutritional dataset. We'll do some more advanced data preprocessing, and we'll talk about techniques for mapping a set of values we were given to a set of values we need for a visualization. We'll also look at more advanced visualizations where we combine different charts together, as well as scatter plots and some of the customizations we can do there. Again, this series doesn't assume any background knowledge in data science or statistics or math or anything like that, but I do assume that you've completed the first two parts of the series and have a fundamental grasp of the principles we talked about there. You don't have to have it all memorized or mastered, but you should have a firm enough foundation that you could go back and look at those exercises and not be too lost. If you have any questions, reach out through my website at leefalin.com. Otherwise, I hope you enjoy Part 3.

2. Dealing with Bad Data: As always, to begin we're going to go to colab.research.google.com and select New Notebook. Then let's go look at our dataset. To find it, we'll go to my educational-datasets repository on GitHub, and we're going to be looking at the cereal data. The cereal data is actually small enough that GitHub will give us a little preview of it, but just like before we're going to go to the raw version to get the raw values, and copy that URL. Now we'll connect to our Cloud instance so we'll have our virtual machine ready, and then we'll import our libraries under their normal aliases. You'll see the same patterns again here; data science is all about learning patterns, learning processes, and learning to apply them in different situations, and we'll repeat this over and over again. So we'll import our libraries, and then we'll read in our data: we'll say df equals pandas read_csv with our URL. Just like before, if for some reason you can't get to that URL, the alternate URL you can use is https://leefalin.com/cereal, which will redirect you to the GitHub URL. We'll go ahead and use our GitHub link, and then we'll say df.head so we can look at the first few rows. We can see that this worked, and we have all kinds of interesting cereal information.
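Since the typed code isn't visible on this page, here's a minimal sketch of this step. The exact raw URL below is a placeholder; use whatever URL you copied from the repository:

```python
import pandas as pd
import altair as alt

# Raw CSV from the educational-datasets repository (placeholder URL)
url = "https://raw.githubusercontent.com/<username>/educational-datasets/main/cereal.csv"

df = pd.read_csv(url)  # read the cereal data into a DataFrame
df.head()              # peek at the first few rows
```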
Now, we can see right away that there are some codes in this data. We talked last time about how, a lot of the time, it's not immediately obvious what a dataset's values represent, and we need some kind of guide — what we usually call a data dictionary — to translate those values for us. In this case, if we go back to our very first GitHub page, we can see there is a cereal dictionary there, and if we click on it we see a list of what the different values represent and what units the numeric values are in. Walking through these: we have the name of the cereal; we have the manufacturer code (for example, N means Nabisco and Q means Quaker Oats); after the manufacturer code we have the type of cereal, which is just cold or hot; and then we've got some nutritional values, and those nutritional values are in different units, based on the USDA nutritional labeling standard. And then for vitamins we actually have a percentage: 0, 25, or 100. Now, if we look at this data, there are three kinds of things we look for when we're doing our data preprocessing and cleanup. First of all, we look for missing data, and we've talked about how those show up as NaN (not a number) values. The easiest way to look for missing data right away is the info printout, because info will tell us the number of rows — we've got 77 rows in this dataset, 77 instances — and the number of non-null, or not missing, values for each column. We can see right away that we have 77 values in every column, which means we don't technically have any missing data as far as pandas is concerned. But there is a second type of missing data: database administrators, or whatever system was used to pull the data out of the database, will sometimes fill in missing data with what are called dummy or default values. We can see that in our potassium column, where there is a value of negative one for Almond Delight. Unless Almond Delight sucks the potassium out of your body when you eat it, this is probably a dummy value: the value was missing, so it was filled in with a default of negative one to show that we don't actually have the real data there. So we've got dummy values, and we've got missing values. The third thing we have to worry about, and this is the most difficult, is bad values: values that look like they could be right, but upon closer analysis probably are not. If we look at the vitamins column here, it doesn't seem right that all of the cereals have the exact same percentage of vitamins. If we expand our head printout to, say, look at the first 10 rows instead of the first five, it would be a little weird for Apple Jacks and 100% Bran to have the exact same percentage of vitamins. It's also weird that Nabisco's 100% Bran has 25 percent of your vitamins, whereas Quaker Oats' 100% Natural Bran has none. So either that is a very unhealthy cereal or there's something funny about this column, and I'm just going to ignore it completely. Fortunately, we don't need this column for our analysis. If we did, our only recourse would be to go back to the database administrator, or whoever gave us the data, and try to track down where these values actually came from. Bad data is where we spend the majority of our time in data science: sifting through how to handle missing values, dummy values, and bad values, and what to do about them. Now, looking at the rest of these columns, we have which shelf the cereal is on. That's obviously going to be different for every store, so I'm going to ignore that column too. Then weight and cups — how much is in a serving — and then something called rating, which is a rating of the cereals, but we're not told where that rating comes from. So for our analysis, we're not interested in any of these last five columns. Today we're going to look only at the nutritional information for the macronutrients, protein through potassium. One thing we do have to figure out, though, is what to do about this negative value.
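As a rough sketch of those checks (the column names, like 'potassium', are taken from how they're spoken in the lesson and may differ from the actual headers):

```python
# info() lists the row count and the non-null count per column:
# 77 entries and 77 non-null values everywhere, so no NaNs to find
df.info()

# Dummy values don't show up as nulls; one way to surface them is to
# eyeball the extremes of a suspicious column
df.sort_values('potassium').head()
```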
When we've got missing values, there are a bunch of different things we can do. We can try to track down the real data: in this case we could run to the store, get a box of Almond Delight, and look up its potassium value. Obviously, in some situations that won't work. The other thing we can do is what's called imputation: we can look at all of the values in the potassium column and say, the median potassium value for a cereal is this much, so anywhere a value is missing we'll just plug in the median. This is a very common technique. We might refine it a little and say, okay, we're only going to look at cereals made by this manufacturer, or only cereals with the same amount of sugar. There are all kinds of conditions we could add: give me the median potassium value for all cereals with eight grams of sugar, and that's the value I'll use for the missing ones. The other thing we can do, and this is pretty common, is to ask: do I actually care about this row, the Almond Delight row? Maybe our analysis really depends on having accurate values for potassium, so we don't want to do imputation; what we actually want is to get rid of any rows that have negative values. Which of those techniques you take — keeping the dummy values, imputation, or dropping the rows — just depends on different factors. How big is the dataset? Do we have so much data that we can afford to drop rows? Is each row super important? Is that column important at all? Maybe we'll just ignore the whole column and say, you know what, I don't care about potassium, it's got too many bad values. If there were only one row with a bad potassium value, maybe we'd drop that row; if there were a hundred rows with bad potassium values, maybe we'd ignore the potassium column and keep all the rows. It really just depends on the shape of the data. So before we make any decisions, I want to see if there are any negative values in any of these other columns, and to do that we're going to build a filter. Last time we built a compound filter where we asked: is it true that this is a certain condition, AND this is a certain condition, AND this is a certain condition? This time we can do something similar using OR: is it true that this is negative, OR this is negative, OR this is negative? Let's see what that looks like. First, let's just do the potassium column, because we know that one's got a negative. 'Potassium' is the name of the column, so: true or false, potassium is less than zero. Just like before, this tells me true or false for each row. I'm going to save that as a negative-value filter and apply it to my data — something we've done a bunch of times now — and this shows me I actually have two different rows where potassium is negative one: Almond Delight and Cream of Wheat. So now let's make it a compound filter. I'm going to wrap this condition in parentheses, just like we did before, but instead of using the ampersand, I'm going to use the vertical bar — on my keyboard that's Shift+Backslash, just above the Enter key.
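Here's a sketch of the single-column filter just described, with the imputation alternative from a moment ago left in as comments (same assumed column names):

```python
import numpy as np

# The imputation route we're NOT taking: treat -1 as missing,
# then fill the gaps with the column's median
# potassium = df['potassium'].replace(-1, np.nan)
# df['potassium'] = potassium.fillna(potassium.median())

# The route we are taking: find the rows with negative potassium
negative_value_filter = df['potassium'] < 0  # True/False for every row
df[negative_value_filter]                    # Almond Delight and Cream of Wheat
```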
This vertical bar means OR. So: true or false, potassium is less than zero, OR — let's go to the next column over — sugars is less than zero. Now I've still got my two rows where potassium is less than zero, plus a row where sugars is less than zero, and that row also has a bad value in carbohydrates. So let's add another condition and check whether carbohydrates is less than zero — we don't get any additional rows there. Now I'm just going to copy and paste, repeating this over and over: is fiber less than zero; is sodium less than zero; is fat less than zero; is protein less than zero; and finally, is calories less than zero. So we've got all these conditions: show me all the rows where potassium is less than zero, or sugars is less than zero, or carbohydrates is less than zero, and so on. Notice we've wrapped every one of these conditions in parentheses; that's required, just like before with our compound filter. The difference between this compound filter and the one we made in the previous class is that instead of the ampersand, which means AND ('show me the rows where potassium and sugars are both less than zero'), we're using the vertical bar, sometimes called a pipe, to ask: show me the rows where potassium is less than zero OR sugars is less than zero — and we're asking this across all the rows. Now, looking at all of these together, you may think: it sure would be nice if we could just say 'show me any column less than zero', and you might be tempted to write something like that against the whole DataFrame. But we'll get an error when we try, saying we can't use the less-than comparison between strings and integers. Here we've got an integer, but if you remember, not all of our columns are numbers: these are all numbers, but we have some strings too, so we can't do it for the whole dataset — we have to pick specific columns. If all of the columns in our dataset were numbers, there would be a way to do this a little more simply, but in our case we have to be explicit for each column. So we can see we've got three different rows here, and again, this comes back to what we're doing with the data. We're going to be looking at things involving sugar, carbohydrates, and fat, but not the potassium column, so we don't care that these two rows have bad potassium values — my analysis won't use that column. But I do care that the Quaker Oats row has missing values for both sugars and carbohydrates.
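The full compound OR filter, sketched with the same assumed column names:

```python
# Every condition in parentheses; | means OR (& would mean AND)
negative_value_filter = (
    (df['potassium'] < 0) | (df['sugars'] < 0) | (df['carbohydrates'] < 0) |
    (df['fiber'] < 0) | (df['sodium'] < 0) | (df['fat'] < 0) |
    (df['protein'] < 0) | (df['calories'] < 0)
)

df[negative_value_filter]  # the three rows containing any negative value
```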
Before I get much further, you'll notice we've got three different datatypes in our DataFrame: object, which we've seen before — those are just strings; integers, which are whole numbers; and floats, and float values just mean decimal numbers. So fiber and carbohydrates are decimal numbers, as opposed to sodium, which is a whole number, an integer. Most of the time, though, we treat integers and floats the same. Okay, so coming back here, we've got this row, row number 57, that we want to get rid of. Let's modify our filter so it only grabs the rows we care about. We don't care about potassium, so I'm going to take that condition off, and now we've only got the one row. That one row is what we want to get rid of, and when we drop rows, instead of the whole row we need to know its row number. To get that information, we ask for the index of those rows, and we get back a list of indices. We've talked about how lists are always wrapped in brackets; this one tells us it's a certain type of list, a row index, and it only has one value. If I hit undo a couple of times and put the potassium condition back for a second, you can see the index list would have three rows — but we don't care about potassium being negative, so I'll put it back to just the one row we care about. I'm going to save this in something called bad row numbers, so I've stored the fact that row 57 is the one we want to drop. Now we're going to create a new DataFrame — I'll just call it clean data — and the clean DataFrame is created by taking our existing DataFrame and dropping; we tell it what to drop, and in this case that's everything in our bad row numbers. Then we'll look at our clean data's info so you can see the result: we now have 76 rows, because we said, create a new DataFrame by dropping whatever rows are in this list from our existing DataFrame. And if we take our clean data and apply the negative-value filter to it, we find no results, because the filter asks for rows where these values are less than zero, and in our clean data that row isn't there — we dropped it. Looking at our first five rows, Almond Delight, with its negative potassium, is still here, because we don't care about negative potassium values.
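A sketch of the drop step under the same naming assumptions; pandas' drop removes rows by index label:

```python
# Rebuild the filter without the potassium condition
bad_value_filter = (
    (df['sugars'] < 0) | (df['carbohydrates'] < 0) | (df['fiber'] < 0) |
    (df['sodium'] < 0) | (df['fat'] < 0) | (df['protein'] < 0) |
    (df['calories'] < 0)
)

bad_row_numbers = df[bad_value_filter].index  # e.g. the index containing row 57
clean_data = df.drop(bad_row_numbers)         # new DataFrame, 76 rows

clean_data.info()
```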
3. Statistical Analysis of Data: All right, now let's look at some statistics for these numeric columns. Using our clean data — and we'll be using the clean-data DataFrame from now on — I'm going to use a command called describe. What describe does is give me a little statistical summary of every numeric column in my DataFrame. The count is how many values there are that aren't blank. Then there's the mean — the "add them up and divide" average — the standard deviation, the smallest value, the biggest value, the 25th percentile, the 75th percentile, and the 50th percentile, which we usually call the median. So we can get a view of the span of our data; for example, sugars go from 0 at the smallest to 15 at the largest. Now, one little shortcut we could have taken earlier: if we run describe on our original dataset, we can look at the minimum values and see that the smallest value for carbohydrates is negative one, and so is the smallest for sugars, and for potassium. We could have said, okay, those are the columns with values less than zero that we need to deal with, and built a smaller filter based on that — there are lots of different ways to accomplish the same thing in pandas. Looking at this summary, we can get a feel for the span of the data and see whether there are any extreme values. We can look down here and say: if the average sodium value is about 180, it doesn't seem too extreme that the maximum is 320 and the minimum is 0 — some cereals just won't have any sodium, or fiber, or fat, and some won't have any sugar, et cetera. And we can see our potassium column is still bad: it still has a negative one here, with a max of 330 and an average of 90. When I say "average", I usually mean the median; when people outside of data science and statistics say average, they almost always mean the mean. I'll try to be better about saying the word median. So that's the kind of description we can get of what our data looks like.
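In code this is a one-liner; the second line below is my own small addition showing the shortcut just mentioned:

```python
clean_data.describe()  # count, mean, std, min, quartiles (50% = median), max

# The shortcut: the min row of describe() on the raw data exposes the -1s
df.describe().loc['min']
```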
4. Transforming Columns Through Mapping: What we're going to do now is create a visualization that shows the breakdown of nutritional value for these different cereal manufacturers. Are Quaker Oats cereals healthier? Are Nabisco cereals healthier? Things like that. To do that, we'll be working with the manufacturer column quite a bit. However, the manufacturer column right now uses these obscure codes, and we've talked before about how, when we create a data visualization, we usually want the values that will appear in it to be human readable. So first, let's look at that column. We'll take our clean data's manufacturer column and see what the breakdown is using value_counts, which will tell us how many times each manufacturer code shows up. We've got a bunch of code K — if we go back to our data dictionary, that's Kellogg's. We've only got one code A, which is American Home Food Products. Then we've got a bunch of G's, which are General Mills, and smaller amounts of the others, so most of our cereals come from those two manufacturers. But again, what we want to do is convert these codes into the actual manufacturer names. The way we'll do that is to create a new column in our dataset, and we'll call it manufacturer name. Anytime there's an N in the manufacturer column, our manufacturer name column should say Nabisco; anytime there's a Q, it should say Quaker Oats; anytime there's a K, it should say Kellogg's; and so on. To do that, we're going to do what's called mapping: we'll map, or translate, these codes into the words those codes represent. This is a two-step process. The first step is to create a mapping dictionary, which I'll call name mapping. A dictionary uses curly braces; depending on your keyboard they could be in different spots — mine are above the square brackets. And notice we have different names for these things: parentheses, square brackets, and curly braces — that's just convention. Dictionaries map from one value to another; usually we say from a key to a value. In this case, our key is going to be the manufacturer code, and the value will be the manufacturer name. So we write code A, then a colon — that says A is the key — and then the value we want, American Home Food Products. Each of these mappings is separated by commas. Our next mapping: code G is going to be General Mills; then code K will be Kellogg's. Notice all of these keys and values go in quotes, and again, it doesn't matter whether you use single or double quotes; I usually use single quotes, but sometimes you'll see me use double quotes — they're typically interchangeable, and if they're ever not, I'll let you know. Very rarely is that the case. I'm just going to go back and forth between the data dictionary and my code here, typing out all the mappings: Q is going to be Quaker Oats, and the last one is R, Ralston, for Ralston Purina. Now, one important thing about this mapping process: we have to make sure we actually cover all of the codes. We have to have a mapping for every value we'll be mapping from; if we don't, the result in the new column will be blank. Counting them up, we have seven codes and seven mappings. To double-check, note that value_counts by default sorts by the counts, but we can sort by index instead, and then it sorts by the code rather than the count. And so we can see we have A, G, K, N, P, Q, R in the data and A, G, K, N, P, Q, R in our dictionary — those match up. Okay, I'm going to run this so it saves that mapping in memory. Now we apply it, and this is the easy part. Just like before: here's our DataFrame, and I want to create a new column called manufacturer name. We set it equal to our clean data's manufacturer column, mapped using the name mapping we created. Then we look at our clean data, and we see that wherever the code is N we now have Nabisco in our new column, code Q is Quaker Oats, code K is Kellogg's, and so on. So we have our mapping set up and our new column with the manufacturer name. To recap what we did: first, we looked at all the different values, sorted by index so we could see them in alphabetical order. We created a mapping dictionary using curly braces, with a set of key-value pairs — an entry, separated by commas, for each code in the manufacturer column and the name we want it mapped to. Then we created the new column by mapping the old column through that dictionary. Now, there are different ways to do this in pandas, but this tends to be the fastest — not just the fastest to write, but typically also the fastest to run.
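The whole mapping step as a sketch, still assuming the spoken column names:

```python
# Key: manufacturer code -> Value: manufacturer name
name_mapping = {
    'A': 'American Home Food Products',
    'G': 'General Mills',
    'K': "Kellogg's",
    'N': 'Nabisco',
    'P': 'Post',
    'Q': 'Quaker Oats',
    'R': 'Ralston Purina',
}

# Check that every code in the data has an entry in the dictionary
clean_data['manufacturer'].value_counts().sort_index()

# Translate the codes into a new human-readable column
clean_data['manufacturer name'] = clean_data['manufacturer'].map(name_mapping)
clean_data.head()
```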
Typically, when we're doing things in data science, we sometimes worry about speed and optimization, but we don't usually need to worry about that in this kind of data exploration phase unless we're dealing with really, really big datasets, with millions of rows. If we have to take our process and apply it to some kind of production system — something that runs all the time, or has to be super fast because it's online and connected to a website — then maybe we have to optimize a little. But typically we don't have to worry about that for a while, during this data exploration phase where we're just seeing what we can pull out of the data.

5. Creating a Histogram: All right, now that we have our data cleaned up — we've got our manufacturer names, and we've gotten rid of the rows of bad data that we don't want to include in our visualization — it's time to make our first visualization. We'll use Altair just like before and create a chart based on clean data, and the chart we're going to create is a histogram showing the distribution of sugars across all manufacturers. A histogram is based on a bar chart, so we'll say mark bar — and just like before, I'll press Enter there in case we want to do something to our bars later — and then we'll encode. For our x- and y-axes, we're going to go ahead and use the long-hand form right away. For the x-axis we want the sugars value, but instead of just showing the sugar value, we're going to bin those values, setting bin to true. This is how we create our histogram. The way a histogram works is that we have a bunch of bins, or buckets, and those buckets hold values: the first bucket will hold any values between 0 and 2, say, and the next will hold any values between 2 and 4. We then count how many cereals fall into each bucket, and that count is what goes on the y-axis, so for the y-axis we say we want it to count those values. If we run that, we see a set of bins: any cereals from 0, including 0, up to but not including 2 fall in the first bucket; any cereals from 2 grams of sugar up to but not including 4 grams are in the next; any from 4 up to but not including 6 are in the one after that; and so we get counts of how many fall into each bucket. Now, we can control the width of these bins and how many there are, but it's usually better to let Altair control that automatically — it applies statistical rules for how the bins should be broken up and how many there should be, and those are generally pretty good rules of thumb.
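A sketch of the histogram as described (the commented line previews the custom bins discussed next):

```python
# Histogram: bin the sugar values on x, count cereals per bin on y
alt.Chart(clean_data).mark_bar().encode(
    x=alt.X('sugars', bin=True),
    y=alt.Y('count()'),
)

# Custom bins, e.g. one bucket every 8 grams:
# x=alt.X('sugars', bin=alt.Bin(step=8))
```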
Still, people often want to create their own bins, and you have to be careful here. Usually when people control the bins, they want something like two buckets: 0 to 8 and 8 to 16. We can do that pretty easily — let me show you really quick. Instead of just saying bin equals true, we can create a Bin object, which takes a bunch of different parameters, and say we want the step to be eight, for example. That creates two bins: how many cereals are between 0 and 8, and how many are between 8 and 16. Or we could ask for a bin every four grams. But you can see that this chart looks kind of chunky. Generally, if you've got a really chunky-looking histogram, you need a good reason to bin things that way — maybe you consider less than eight grams to be low sugar and more than eight grams to be high sugar, so that breakdown makes sense to you — but usually it's better just to let Altair pick the bins. And this is something to watch out for in statistical analysis: sometimes people will manipulate the bin sizes to drive home a certain point. It's really easy to say "hey, look how many things fall in this group" just by controlling the bin size, and that can create misleading visualizations. The chart is based on true data, but by the way you structure the bins, you can manipulate the message a little, so you have to be careful there.

6. Plotting the Median Value: Let's clean this up a bit, just because we like to practice creating presentation-quality visualizations. First, let's change the title of our x-axis to be more descriptive: instead of "sugars (binned)", let's say it's sugar in grams. Whenever we have a visualization, we always want to make sure we include the unit, and conventionally we put the unit in parentheses, so "Sugar (grams)" — we could call it "grams of sugar", but this is more conventional. For the y-axis, we'll also give a title: "Number of Cereal Brands". And then let's add a chart title using properties; we'll call the whole chart "Breakfast Cereal Sugar Distribution". Whenever we create a histogram like this, we usually refer to it as a distribution — some people will actually call this a distribution graph, but it's usually called a histogram. Okay, let's add one more piece to this chart: a line that shows where the average amount of sugar falls, so we can see where that lands. On some histograms it's really obvious; this one is a bit bimodal, so it may not be as obvious where the average value is. Let's get Altair to calculate it for us. Normally, like we've talked about, we want the data to be in shape, with the calculations in place, before we start the visualization. But sometimes we need a single value for the visualization, and in those cases it's sometimes easier to have the graphics library calculate it. So I'm going to create a separate chart — we'll see what we do with it in a minute — also based on clean data, and I'm going to use a different kind of mark. Since we want a straight line, I'll use what's called mark rule; a rule, in visualization lingo, is just a straight horizontal or vertical line. Just like before, I'll press Enter there, and then we encode. Since I want a vertical line, what I need is an x encoding, and I want it to be the median value of sugars. If I run that, you can kind of see it show up over here — just that little line. Let's make it more visible: I'll go up to the rule properties and change the size to three, so it's a little thicker, and maybe let's make the color red.
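Sketched out, with both pieces stored in variables (as happens shortly) so they can be layered:

```python
histogram = alt.Chart(clean_data).mark_bar().encode(
    x=alt.X('sugars', bin=True, title='Sugar (grams)'),
    y=alt.Y('count()', title='Number of Cereal Brands'),
).properties(title='Breakfast Cereal Sugar Distribution')

# A vertical rule at the median sugar value, thickened and colored
median = alt.Chart(clean_data).mark_rule(size=3, color='red').encode(
    x='median(sugars)',
)
```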
So now I've got a thick red line. I could use a hex color here — if you're a CSS or web-design person, I could use a hex code to represent red, or any HTML color code I wanted — but I'll leave it like that. Again, these settings are being applied to the mark itself — the size of the mark, the color of the mark — and the encoding is what determines where it shows up. Now, if I wanted the mean instead, the regular average, I could do that, but in my case I want the median. So I've got the median value of sugars, and the question is: how do I get this line onto the histogram? This is actually a lot easier than you're probably expecting. When I run this chart, it displays it, but I can also create a variable to hold the chart, just like this, so it creates the chart and stores it in histogram. I'll do the same thing here, creating the rule chart and storing it in a variable. Then I tell Google Colab that I want to see both of these charts, and what Altair lets me do is say: draw the histogram, and then on top of that, draw the median. It figures out, based on the axis values, how to combine the two visualizations. So here's my histogram chart with all my customizations, and here's my median — my vertical rule that says, here is the average value. You may be wondering why we use the median instead of the mean. When most people say "the average" they mean the mean: take all the values, add them together, and divide by the number of values. The median, however, is usually statistically better. The median asks: what is the actual middle value? If I lined all the values up, which one falls in the middle (or what would the middle be, if there's an even number of values)? In this case, that's seven. Here's why this is better. Imagine you've got the values 2, 4, 6, 8, 1000. If you use the mean, you add those together and divide by five, and you get 204. What's weird about this value is that, first of all, it's not even one of the numbers on the list, and most of the values are nowhere near it. So it's pretty misleading to use the mean when you've got one big extreme number — what we call an outlier. If we use the median instead and line these values up — 2, 4, 6, 8, 1000 — the middle value is 6, and 6 is a much better representation of the actual distribution of those numbers. So we say the median is a more robust statistic, because it is resistant to outliers, and because of that, statisticians and scientists will often use the median when they mean "the average". There are other ways to deal with this — we can remove outliers and do things like that — but most of the time we just use the median. Now, I'm going to make a change here, because I want to use the mean value, and the reason is so I can show you some more visualization things. I'll change the median to the mean and change the name of my chart variable to be consistent, and when I do that, I first have to rerun this cell and then rerun that one.
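A sketch of the layering and the switch to the mean:

```python
histogram + median  # Altair overlays the rule on the histogram

# For the formatting demo, switch the rule to the mean
mean_line = alt.Chart(clean_data).mark_rule(size=3, color='red').encode(
    x='mean(sugars)',
)
histogram + mean_line
```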
When I do that, it still looks like it's at about seven, but it won't be exactly seven, just because of how the mean is calculated: instead of listing all the values out and finding the middle one, we add them all together and divide by the count, like in that example. The reason I switched is that I want to show the actual value on the chart, and I'm going to show you how to format it.

7. Adding Labels: One of the things we can do is add a text encoding to our rule. The way I'll do that is to take my mean chart and add another mark, mark text. This mark text works just like all the other marks, except the encoding is different: instead of encoding an x value, a y value, or a color value like we've done previously, we're going to encode a text value, and the value we want is the mean of the sugars. If I run this, you can see that I get a really big number. This is why we're using the mean: if I used the median, the number would just be 7, which is nice and clean, but it wouldn't let me show you all the formatting options I want to show you today. So we're using the mean, and the mean average is 7.0263-and-a-whole-lot-more-digits. Just like before, I can save this — I'll create a variable called text — and then just add it right on top of our chart, gluing all these pieces together. Okay, now that I have that, I see the number falls right in the middle: it's centered right in the middle of that rule, based on its size. So I'm going to change how this text is drawn. Just like with all our other encodings, there's a long form and a short form, and I'm going to use the long form, alt.Text. Notice that when we go to the long form, just like with all the other ones, we switch to a capital letter; that's because this is the name of a Text object that we're creating, which has a lot of different parameters. The parameter we care about is the format parameter, which lets us specify how to format this text using what are called standard format specifiers. Now, if you've never used a format specifier, you can find a list of them if you search online, but the one most people use almost all the time in data science is the one that formats decimals to a certain number of places, and that's what we want. We want to take this float — remember, we said this kind of number is called a float — and make it show only two decimal places. So the format we'll use says: after the decimal point I want two digits, and it's a fixed-point number — not scientific notation, not an integer. When I rerun that — remember, we're just saving it here, so I have to rebuild my chart — you can see it now reads 7.03. Just to show that again: this was 7.026-something, and by changing the format to show only two places after the decimal, it rounds the number to two decimal places.
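A sketch of the label, using the '.2f' format specifier (two digits after the decimal, fixed-point):

```python
# Label the rule with the mean value, rounded to two decimal places
text = alt.Chart(clean_data).mark_text().encode(
    text=alt.Text('mean(sugars)', format='.2f'),
    x='mean(sugars)',  # position the label at the rule's x location
)

histogram + mean_line + text
```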
Now let's look at the placement of this: it's not quite where I want it. So here in my mark text, I'm going to change the color of the mark — in this case, the color of the text — and I'll just use the same color we're using for the line. Remember, every time I make a change here, I have to rerun to recreate this text part and then rebuild the visualization, because we're gluing all these parts together: if I make a change to this part, or this part, or this part, I have to run that section and then rerun this one. So now I've got the number in red, which shows me it goes with this line, but it's not very easy to read. Let's change a couple more things. First, the alignment: instead of centering, we want to align it to the left, so that the left edge of the number lines up with the location the number represents. It's still a little hard to read, though, because it's lined up perfectly with the rule, and the rule is a little thick. If our rule were thinner — if we changed the rule size to one, reran that, and rebuilt our visualization — you could zoom in and see the number lined up exactly with the center of that line. But since our line is going to be a little thicker, it overlaps a bit, and either way it's still hard to read. So what we want to say is: align to the left, but offset a little bit in this direction. The way we do an offset in a mark is with what's called dx or dy; dx means delta x — how far do we want to shift from the actual x position? We'll just nudge it over about seven pixels (this is measured in pixels). So that nudges it over; rebuild that, and it looks better. Actually, that looks a little too far, so let's change it to five — that looks better. Now we still have one problem: it would be nice if this number were up here in the white-space area, where it's a little easier to read. We could try to figure out how far up the graph to move it, but that gets really difficult, especially as our graph size changes. By default, the y position for this mark is right in the center of the graph, and I want to move this number up. There are a couple of ways to do that. One thing to note is that in drawing — in pretty much every visualization system computers use — unlike geometry, where the y values start down at the bottom, zero is actually at the top. So if we say something like dy equals 100, hoping to move the label up by 100 pixels, we'll notice that it actually moves down. That's because in drawing — and Altair handles this consistently with how everything else handles drawing — the y value increases as it goes down, and zero is at the top, which is confusing because it's the opposite of how we graph. So if we want this to move up, we have to say negative 100. If we run that, it shows up about where we'd like it. But there's another problem with this: the offset is measured from the center of the graph.
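A sketch of the styled label. The pinned y position on the last line is the fix the lesson arrives at just below; the fragile dy attempt it replaces is left as a comment:

```python
text = alt.Chart(clean_data).mark_text(
    color='red',   # match the rule so the label reads as its annotation
    align='left',  # left edge of the number sits at the x position
    dx=5,          # nudge 5 pixels right so it clears the thick rule
    # dy=-100,     # fragile: offset is measured from the chart's center
).encode(
    text=alt.Text('mean(sugars)', format='.2f'),
    x='mean(sugars)',
    y=alt.value(30),  # pin 30px from the top; drawing y runs top-down
)
```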
To see the problem: say we go to our chart properties and change the height of our graph to something like 1000. As we do that, we're no longer in the white area up top, because the rule is "find the center of the graph and move up 100 pixels from there". Now, our graph isn't going to be that big, but because the graph can change size, we don't want to base the label's position on the center of the graph; we can base it on the top instead. So instead of using a dy offset, down here in the encoding we can specify that we want to encode the y value of this mark as a specific pixel value — and notice that when we do this, we wrap the value, and this one is not capitalized, so that's not very consistent. When we specify a value like this and say 100, it's measured from the top of the chart, so it will always work from a consistent reference point. A hundred pixels down is a little too far; let's go about 30 pixels down from the top, and that puts it exactly where we want it. Now, if we were creating a bigger graph, we might still need to adjust this value a little, but it's easier to adjust an offset from the top than from the center — and we'd probably want to make the font size a bit bigger as well, if the chart were bigger. So, to recap what we've learned here: we created a histogram; we created a rule — a vertical line — showing where the average value is; and we created a label for that line. We styled all of these separately, set their properties, and then concatenated, or combined, all of those charts together to make a single visualization.

8. Creating Box Plots: Now let's create another visualization using this data: what's called a box plot. We'll create a chart, again based on clean data, and then we're going to pick a mark. You might be thinking it's mark box, but it's actually mark boxplot — some of Altair's marks are simple shapes like text, lines, and bars, while others are more complex, like a box plot, which combines a bunch of different things together. We'll encode this by piecing it together. We'll start with our x-axis, which is going to be the manufacturer (we'll change this to the manufacturer name in a minute); then let's set the y-axis to the sugar content again, the amount of sugars; and then let's just run that and see what it looks like. The way a box plot works, it's a really great, compact visualization for showing a lot of information. Take General Mills, for example: as we hover over it, we can get some information. Up here is the maximum value, where this whisker ends — these lines are called whiskers — and down here is the minimum value; between the two is the range, and all of the values fall inside that line. Then there's the box area, which is where most of the data is: this edge is the 25th percentile, this one is the 75th percentile, and the white line in the middle is the median. So for each manufacturer we can see the spread of its sugar distribution. Right away we can see that Nabisco — at least the Nabisco cereals we have — has, on average, less sugar than most of the others. Now, we don't have very many Nabisco cereals, so it's probably fairer to compare Kellogg's and General Mills.
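The first pass at the box plot, as a sketch:

```python
# Sugar distribution per manufacturer code; Altair draws the whiskers,
# quartile box, and median line automatically
alt.Chart(clean_data).mark_boxplot().encode(
    x='manufacturer',
    y='sugars',
)
```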
We can see that Kellogg's has a much bigger range, but the average — the median amount of sugar — for Kellogg's is less than the median amount for General Mills. Now let's make this chart a little more interesting. First, instead of using the manufacturer code, let's use the manufacturer name, so we get the full name there and don't have to decipher what those codes mean. Then let's give each one a different color by adding a color encoding, like we've done before; our color encoding will also be based on the manufacturer name. Now you'll notice we've got a different color for each manufacturer, and a legend is created automatically when we do a color encoding. Because of that, the labels down on the x-axis are superfluous: we either keep the axis labels or keep the legend. In this case, because the names are so long, it's easier to read them over in the legend, so let's get rid of the x-axis labels. We can do that in our x encoding by setting axis equal to None, and that removes the x-axis labels. So now let's clean up our y-axis title — we talked about this before: sugar with its unit, "Sugar (grams)". The other thing: if we look at our legend, the raw column name isn't very descriptive, so let's give our legend a title. To do that, we have to create a legend property — we can't just say "legend title" or something like that; just like we did with our color scale, we have to create a Legend object, and inside that Legend object we can set the title. Let's call it "Manufacturer Name" — or maybe just "Manufacturer", to be a little shorter. So now our legend has a title, and it's easy to read. Let's also make this chart a little bigger: in properties we'll set the width to, say, 800. That's nicer — a little easier to read. We'll add a title to our chart too, and this is going to be very similar to before: "Sugar Distribution by Manufacturer". One more thing: these boxes are pretty narrow, so let's change their width. The way we do that is by setting the size of the mark — very similar to what we did earlier when we changed the size of the rule. We'll set the size of our boxes to about 100, and that looks a little nicer. And if we look here, we've got a color scale we may or may not like. If we go back to the Vega color-schemes list, we can look at the categorical schemes, and it looks like we're using category10 right now. Maybe we want something in the pastels, pastel1 or pastel2 — let's use pastel1. We do that by setting our scale to a Scale object, just like before, and saying the scheme of that color scale should be pastel1. So now maybe you like that better. But the point here isn't so much whether pastel is better than the other color scheme; it's to see how we build up this color object. We've got a color object; here is the legend we're using, and if we want to set properties on the legend, we have to create a Legend object; if we want to set properties on the color scale, we have to create a Scale object and set that. There are other ways to do this, and we'll look at those in more detail when we make some more advanced visualizations in the future.
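All of those refinements together, sketched with the assumed column names:

```python
alt.Chart(clean_data).mark_boxplot(size=100).encode(
    x=alt.X('manufacturer name', axis=None),   # legend replaces axis labels
    y=alt.Y('sugars', title='Sugar (grams)'),
    color=alt.Color(
        'manufacturer name',
        legend=alt.Legend(title='Manufacturer'),
        scale=alt.Scale(scheme='pastel1'),
    ),
).properties(
    width=800,
    title='Sugar Distribution by Manufacturer',
)
```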
9. Creating Scatterplots: The last visualization we're going to make is a scatter plot. Scatter plots are really good when you want to compare two different numeric values on the x- and y-axes. The way we'll do that: once again, alt chart with our clean data, and this time we're going to use mark circle — scatter plots use circles — and then encode. For our x-axis, let's look at the sugars value, and for our y-axis, let's compare the amount of calories. Each dot is a cereal: this one — whatever cereal it is, and we'll look at how to see that in a minute — had two grams of sugar and 90 calories; this cereal had seven grams of sugar and 140 calories; and so on. This is called a scatter plot. Now, as always, we have this data-exploration version, but let's make it look a little nicer for presentation, and we'll see some other things we can do with it along the way. First of all, let's set the title of our x-axis: it's sugar in grams. Then our y-axis title: this is calories, which we measure in kilocalories, the usual measurement for food calories. One thing we notice is that our lowest calorie value is 50, so we've got all this blank space down at the bottom. Most of the time, Altair sets the bottom of the axis to zero, but we can tell it we don't want that by adjusting the scale of this axis — just like we adjust the scale of our color scheme. When we want a scale to do something different, we set the scale of the axis with a Scale object, and the property we set here is zero equals false, which means: don't force the scale to start at zero. With that, we now see the axis start at whatever our lowest y value is. There are other things we could do to adjust it, but for now that's all we want to happen. So now let's once again add a color encoding. I'm going to copy the color encoding from our box plot, so it uses the same code and has the same legend information and all of that; I'll add a comma here and paste it right in. The pastel scheme is a little hard to see here, so let's get rid of the pastel part and just use the default color scheme — a little easier to see. Now, all of these dots are the same size right now, but we can add another encoding for the size of our dots. Just like before, with a Size object, we can say that we want the size to be based on another column: the x value is based on the sugar, the y value on the calories, and we can make the size based on, say, how much fat there is. And so now, over on the right, we get a second legend that tells us the size of the circle corresponds to the amount of fat. Let's customize that legend a little: legend equals a Legend object, and just like before, we can set the title — we'll say this is the amount of fat in grams. So now, the bigger the circle, the more fat the cereal contains. Now, the way Altair works for sizing by default is to say: this base size, some number of pixels, is going to be the smallest circle, and then I'll go up by a certain amount from there. But we can control that — maybe we want these circles to be a little bigger. Whenever we want to adjust the range of values that an encoding uses, whether that's a range of colors or a range of numbers — like we did with this y-axis — we modify the encoding's scale. So we say: a Scale object, and what we want to set is the range of values, using square brackets. It's going to go from 200 to 1,000.
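A sketch of the finished scatter plot, including the tooltip encoding that gets added in a moment ('name' is assumed to be the cereal-name column):

```python
alt.Chart(clean_data).mark_circle().encode(
    x=alt.X('sugars', title='Sugar (grams)'),
    y=alt.Y('calories', title='Calories (kcal)',
            scale=alt.Scale(zero=False)),      # don't force the axis to start at 0
    color=alt.Color('manufacturer name',
                    legend=alt.Legend(title='Manufacturer')),
    size=alt.Size('fat',
                  legend=alt.Legend(title='Fat (grams)'),
                  scale=alt.Scale(range=[200, 1000])),  # smallest/largest circle
    tooltip='name',  # hover over a dot to see which cereal it is
)
```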
Maybe we want these circles to be a little bit bigger. So we can say for our scale, for this graph, this encoding, and whenever we want to adjust the range of values that an encoding uses, whether that's the range of colors or the rage of numbers. Like we did with this y axis, we modify the, the encoding scale. And so we're gonna say alt dot scale. And what we want to do is set the range of values for this. And the range of values is going to go between and we're going to use square brackets here. It's going to go between two hundred and ten hundred. So now our smallest circle is 200 and size in our largest circle is 1000 and size. So that's how it's going to psi, so circles. Now one other thing that might be nice for this visualization would be able to see what the cereal brands are. But to put all that text on the graph would be pretty messy. So one more encoding we can look at is to make this graph a little more interactive. We can add a tooltip encoding where we can say that the tooltip should be the name of the cereal. And so now if I hover over this, I see which cereal it is. 10. Creating Interactive Plots for the Web: Okay, so we've got this great chart with this interactivity. Now you might be asking, how do we make sure how did we get this onto, say, a webpage? Because if I save this as a PNG file and download that, I'm going to get the visualization the exact size that it is. But there's no like hovering your interactivity in this. And so what we can do is, is another way to export this chart in to include all the interactive components. And the way we do that is first we're going to save the chart just like we've done before. I'm going to save it in a variable instead of displaying it. And if I want to see it, I can just execute that variable is like this. Instead of displaying it, I'm going to say take that chart and save it. And I'm going to save it as I'll call it serial dot HTML. And it's important that it ends in dot HTML because that'll tell it what format we want. We want to format that will create an interactive webpage. And so if I run that, you'll see nothing prints out. But if I go over to my file listing, I see now I've got the serial dot HTML. So I'm going to click these buttons here, these little three, this little menu expander. And we'll say download. And it's going to download this HTML. And if I click that and open it, I can see HTML page. And it's got all the interactive parts of my chart. And if I look at the HTML code of that page in a text editor, I can see that it's included some JavaScript stuff for the Vega chart library. And that's what else hears based on, as well as all of the data needed for this visualization. And then it's created some HTML in order to embed this chart. And all of that is now interactive so you can put some additional HTML there or put this into an HTML page of your own. Just taking out, making sure you include all the scripts and all of this embedded data. Now one warning to this is that all of the data is embedded in the HTML page when you do this, since we only have 76 rows in our dataset that we're using for this, it's not a ton. But let's say you had a million rows and you are creating a visualization. If you were to try to create an HTML page with all of that data embedded in it. It would be huge. And so we have to be really careful with how much data we have. So kind of the rule of thumb, if you've got a small dataset, whether it's interactive or not, Altair is great. 
So here's the rule of thumb: if you've got a small dataset, whether the chart is interactive or not, Altair is great. With a bigger dataset, Altair is good for data visualization and exploration work, or if you just need a static image — a PNG or SVG file. For a really big dataset that you want to be interactive, people will usually use the D3 library; Altair is not great for that, just because of the way it embeds the data. Anything more than about 5,000 rows and Altair is going to have trouble. That's the rule of thumb, but it also just depends on how you want to handle your data embedding.

11. Recap and Next Steps: All right, let's recap — we've done a ton of stuff today. From the top: we imported our dataset. We looked at ways to find values that aren't quite right, whether they're missing values, dummy values, or bad values, and we talked about different ways to handle them. We talked about using describe to see a statistical summary of our numeric columns. We talked about creating a mapping between an existing column and a new column, translating a set of values using a dictionary of key-value pairs so that we can create a new column with values based on the existing column's values. We talked about creating a compound chart, where we create different pieces of a chart and then piece them together, and we created a histogram using that technique. We also created a box plot, which shows a bunch of different statistical information in a compact form, and we talked about customizing it. And then we looked at our scatter plot, where we encoded not just the x and y values but also color and size, so we had four different data dimensions represented in a single visualization; we talked about customizing the scales it uses, both the color scale and the size scale, and about adjusting different chart properties. We also talked about saving that chart as an HTML page so that any interactivity we built into the visualization is included, and about the pros and cons of doing that versus saving a static image in PNG or SVG format. I hope you've enjoyed this part of our six-part series on data science and visualization. Next time we're going to look at a classic dataset in data science, the Titanic dataset. We'll visualize different statistics about the passengers — which ones survived and which ones didn't — see if we can find any patterns there, and create some more interesting visualizations based on that. As I've mentioned, I also have another series coming out on machine learning; by the time you watch this, it might already be out. If not, head over to my website, leefalin.com, and sign up for my newsletter, and you'll be sure to be notified when that course comes out. It's super low volume — I just announce things I'm doing: books I'm releasing, courses I'm offering, things like that. So if you're interested at all in data science and visualization, or you just want to ask a question, go over to my website, leefalin.com, sign up for my newsletter, and reach out with any questions or concerns you have. Thanks.