Data Science and Visualization For Complete Beginners - Part 2 | Lee Falin | Skillshare


Data Science and Visualization For Complete Beginners - Part 2

Lee Falin, Software Developer and Data Scientist


Lessons in This Class

12 Lessons (33m)
    • 1. Introduction

    • 2. Exploring the Census Data

    • 3. Loading Custom Data

    • 4. Compound Filters

    • 5. A Basic Visualization

    • 6. Sorting the Visualization

    • 7. Creating Additional Columns

    • 8. Cleaning Up the Visualization

    • 9. Modifying the Color Scheme

    • 10. Final Touches

    • 11. Recap

    • 12. Conclusion and Next Steps






About This Class

If you've ever wanted to learn more about data science and visualization, but felt overwhelmed by all of the background knowledge it seemed to require, this class is for you. 

In this class, the second in a six-part series of introductory classes, you'll continue to learn more about data science and visualization. Since this class builds on the techniques and methods from the first class in the series, be sure to complete that one first.

Throughout this series, you'll learn not just how to use industry-standard tools to employ a variety of data analysis and visualization techniques, but more importantly you’ll learn the reasoning behind those techniques, and when and why to use them in your own work.

Meet Your Teacher


Lee Falin

Software Developer and Data Scientist


Hello! I'm Lee Falin, and I'm a software developer, writer, and educator who loves to learn, create, and teach. I'm currently a professor of computer science at BYU-Idaho, where I get to teach courses in software design and development.

One of my favorite things about software development is that it's a skill that enables anyone, regardless of their age or background, to bring their ideas to life and share them with the world. It just takes some time, patience, and a fair amount of hard work.

I've been writing software for almost twenty years, working in the commercial software, telecommunications, and defense industries. I've also founded a couple of software startups, and worked as a data scientist in bioinformatics research.





1. Introduction: Hi, I'm Lee Falin. This is the second in a six-part series on data science and visualization. As I said at the beginning of the first part, we don't assume any background knowledge in statistics, data science, or programming, but we do assume that you're completing each of these courses in order. So make sure you've completed all of the exercises in Part 1. Go back and do them a couple of times if things aren't clicking right away, and then come back here and pick up right where you left off. In this section, we're going to talk about some more advanced data preprocessing and some more advanced filtering, we're going to look at some data from the US Census, and we're going to do some more customized visualizations. 2. Exploring the Census Data: Once again, we're going to start by going to Google Colab in our web browser, at colab.research.google.com. Just like before, once the page loads, we'll select New Notebook to open a blank notebook that we can use for our data science and data visualization activity. We'll start by connecting our instance to a new cloud server. If we don't remember to do that, the first time we run a cell it will connect automatically: if we click Run, Colab connects to a new virtual host. It may take a while for that cell to run, but once we see the green checkmark, we know we're good. For the data we're going to use today, if you go to the /data shortcut link from the first class, you'll be rerouted to the GitHub page that contains our list of different educational datasets. The one we're using today is the US census data for 2019, which contains the population statistics we'll use for our exercise. If you click on that link, you'll see that once again this file is too big for a preview, so we'll click View Raw, and that gives us the URL we'll copy. Then we come back over to Google Colab.
First, I'm going to import both of the libraries we'll be using today. Once again, we're using pandas, which we'll alias as pd, and since I know we'll be using Altair just like last time, we'll import it at the top as well. This is the general convention: if we know ahead of time which libraries we're using, we put all our import statements at the top. We'll run that, and as long as we spelled everything right, we shouldn't see any output. Now we'll add another code cell and load our census data. I'll say data = pd.read_csv(...), because we're reading a CSV file, then come back over, copy that link, and paste it in. If for some reason you're unable to get to that link, there's another shortcut URL you can use instead that takes you to the same dataset. Either one works, but we'll assume we're able to use the GitHub URL. Then, just like before, we'll look at the first five rows of data using the head function on the DataFrame. And here we have our census data. Now, there are a lot of codes in these rows whose meaning isn't immediately obvious. We have summary level, region, division, state, sex, and age, and some of these codes will seem obvious; age, for example. But whenever we're dealing with a new dataset, it's always good to check whether there's what's called a data dictionary: a document that explains what the different values in the data mean and what the different column names represent. We can find one if we go back to the /data GitHub page, where there's a link to the census data dictionary created by the US Census Bureau.
If I click on that, I can see that this dataset is the annual estimate of the civilian population by single year of age and sex for the United States and individual states, from April 2010 to July 2019. I can see where it came from, and I can see what each column represents. SUMLEV is the geographic summary level; then there's the region code, the division code, and the FIPS code, which we'll talk about later in more detail, but which is basically a numeric code that the census assigns to every state in the United States. Then there's the name of the state, the sex of the population under study, and their age. If we scroll down further, we can see the base population estimate, and then each subsequent column is the estimated population for that year as of July 1st, according to the US Census. For each column, we're told what the different codes mean. The ones we're most interested in: for the sex column, 0 represents the total, male is 1, and female is 2. We're also interested in the fact that an age of 999 represents the total across all ages. So let's go back and look at the first five and last five rows of the data. Take this row for Wyoming, with sex 2 and age 999. Going back to our data dictionary, sex 2 is female and 999 means total, so this row gives the total number of females in the State of Wyoming for each of these years, whereas the row above it gives the number of 85-year-old females in the State of Wyoming for those years. This other row shows the total number, both male and female, of four-year-olds in the United States. And remember, the estimate base column is April, and each of the subsequent columns is July; so the number of four-year-olds in the United States that July would have been 4,077,346, and so on. The summary level is another column we might be interested in.
The SUMLEV, or summary level, column tells us whether a given row contains the statistics for the entire country or just for a state. Looking here, we can see that this row is summary level 10 and the name of the geographic unit is United States, which matches up; this other one is summary level 40, and there we have a set of states instead. 3. Loading Custom Data: Let's go back to viewing just the first five rows using the head function, and then look at the column structure using the info function. You'll notice the same pattern over and over when we first come across a new dataset: we import it using read_csv, we look at the data using the head function, and then we look at the column structure using the info function. At this point, you might be wondering: what do you do if you have your own dataset, your own CSV file, and you want to use Google Colab? Over on the left there's a little folder icon. If you click it, you can drag a file into the window and it will be uploaded to your Google Colab instance. For example, say that instead of getting this file from online, I had a local copy on my desktop. I can just drag it over to the upload window; a little arrow appears at the bottom, along with a reminder that these files will be deleted whenever the runtime is recycled. We talked about how we're connected to a cloud computer, basically a virtual machine on a server owned by Google. Whenever we close the browser window, or leave it open too long without any activity so that it times out, the virtual machine gets recycled: it's shut down and allocated to someone else. When that happens, anything we have stored in memory, including any files we've uploaded, is lost. So just keep in mind that this storage is all temporary.
Even if we save a copy of this notebook, the files, since they're stored on the virtual machine, won't be there the next time we load it. Once we load the notebook, all the text we've typed will be there, but if we need to access the files, we'll need to remember to re-upload them. Once a file is uploaded, if I hover over it and click the three dots, I can choose Copy path. Then, instead of using the URL to GitHub, I would say data = pd.read_csv(...) and, inside the quotes, paste in that path, which points to the file on Google Colab's virtual machine. That's how you do the same thing we're doing with our online files when you have a local file: the results are exactly the same, and the only difference is where the file comes from. 4. Compound Filters: Now that you know how to do that, let's go back to our data. According to our data information readout, this dataset has 13,572 rows and 18 columns, and there are no missing values: every column has all 13,572 entries. Further, every column is a number except for the NAME column, which is a string (text) containing the name of the country or the state. Our goal today is to create a graph showing the relative state populations compared to one another for 2019, the final year available in this dataset: the estimated state population as of July 1st, 2019. So the first thing we want to do, and again you'll notice these patterns developing: load the data, preview the data, look at the information about the columns. And now we're going to filter, or preprocess, the data down to just the information we want.
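The local-file workflow above can be simulated end to end. Writing a small CSV to a temporary path plays the role of the uploaded file, and that path plays the role of the one you get from Copy path; in Colab the copied path usually looks something like /content/yourfile.csv, though that exact form is an assumption here.

```python
import os
import tempfile

import pandas as pd

# Simulate an uploaded file: in Colab you would drag the CSV into the
# Files pane and use "Copy path" to get its location on the virtual machine.
csv_text = "SUMLEV,NAME,SEX,AGE,POPEST2019_CIV\n40,Wyoming,0,999,578759\n"
path = os.path.join(tempfile.gettempdir(), "census_sample.csv")
with open(path, "w") as f:
    f.write(csv_text)

# Exactly the same call as before; only the source of the file changes.
data = pd.read_csv(path)
print(data)
```

The population figure is illustrative; the takeaway is that read_csv accepts a local path just as readily as a URL.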
In our case, the information we want is just the totals: we're not interested in the relative populations of males and females, we want the total population, and we want not individual ages but all ages combined. According to the data dictionary, that means we want the state-level summary, sex equal to 0, and age equal to 999. So we're going to use what's called a compound filter, which filters on multiple columns, multiple Series, at once. But we'll do it one step at a time. First, we'll filter on the age column. We ask for the age column just like we've done previously, and notice it's all uppercase: always look at the casing of your column names, because it changes from dataset to dataset. In our age column, we're asking for those rows where the age is equal to 999. Remember, when we test for equality we use a double equals sign; a single equals sign assigns a value. We can see in our dataset (the ellipsis means it's skipping a bunch of rows) that this filter tells us, true or false for every row, whether the value is equal to 999. Now I'm going to save this as a filter. I was going to call it the data filter, but I'll call it the total filter instead; that's more descriptive. Then I apply my total filter to the dataset, and when I do, the only rows that show up are the ones where age is equal to 999. However, there's more I want to filter on, because I also see that I've got entries for each sex: entries for the total, entries for male, and entries for female, and I only want the total.
What I need to do is ask for the rows where age is 999 and sex is equal to 0. To do that, I'm going to create a compound filter. First, I wrap each filter condition in parentheses. Then I use an ampersand (the symbol above the 7 on a US keyboard; it may be in a different location on yours), which means AND, and inside parentheses I put my second filter condition, in this case that sex is equal to 0. So my compound filter now tells me, true or false for every row, whether age is equal to 999 and sex is equal to 0. I save those results in the total filter and then use that filter on my DataFrame. When I do, you can see I've got one entry for each state, which is what we'd expect: these are all the rows where sex is 0, the total of male and female, and where age is 999, the total across all ages. Now there's one more filter I want, because I don't want this United States row; I only want the rows for the states, not the total for the country. So I'll add one more condition. There are a couple of ways I could do this. I could filter on the name, or I could filter on the state code, because the state ID of the United States is 0. But I'm going to use the summary level, because that's exactly what the summary level is for. I'll say that the summary level must be equal to 40. When I run this, you'll see that we get just about what we had before, except we don't get an entry for the United States, and if I scroll all the way to the bottom, Wyoming is my final row. I could also have done it the opposite way and said that the summary level is not equal to 10, which was the summary level for the United States row: an exclamation mark means not.
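The compound filter built up above can be sketched against a few toy rows. The column names and code values (AGE 999 for all ages, SEX 0 for the combined total, SUMLEV 40 for states) follow the data-dictionary discussion and should be checked against your own file.

```python
import pandas as pd

# Toy rows mirroring the census layout (values illustrative).
data = pd.DataFrame({
    "SUMLEV": [10, 40, 40, 40],
    "NAME": ["United States", "Wyoming", "Wyoming", "Wyoming"],
    "SEX": [0, 0, 1, 0],
    "AGE": [999, 999, 999, 85],
})

# Each condition goes in parentheses; & joins them with a logical AND.
total_filter = (data["AGE"] == 999) & (data["SEX"] == 0) & (data["SUMLEV"] == 40)

# Applying the boolean filter keeps only the state-level, all-ages totals.
state_data = data[total_filter]
print(state_data)
```

Only the second row survives: it is the one state-level row covering all ages and both sexes combined.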
Put together with an equals sign, that means "not equal to"; but for consistency, we'll use the double equals and say we only want rows where the summary level is equal to 40. Once again, we've got this data showing up, but let's save this filtered view of our DataFrame as a separate DataFrame, which we'll call state data. Then, just like before, we can ask for the head of that DataFrame, and here are the first five rows. I can also ask for the info of that DataFrame, and when I do, I see I've got 51 rows: 50 states plus the District of Columbia, which is what I would expect. So that shows us that our filtering is working. 5. A Basic Visualization: Now that we have our data filtered the way we want it, so that we only have the rows we're interested in, let's start working on our visualizations. I'm going to add another code cell, and we're going to use Altair to make a very basic chart to start out with. Just like before, we'll say alt.Chart, and inside Chart we'll say which DataFrame we're working with, in our case our filtered state data. We're going to use a bar chart again, so I'll say mark_bar: that's the type of chart we're using, and we talked about this in the previous tutorial. We'll leave a blank line in case we need to customize anything about the mark itself, then add encode, parentheses, and another blank line. We'll use the shorthand syntax first, just to get an idea of what to expect. On the x-axis I want the name of the state, and on the y-axis we'll use the 2019 civilian population estimate. When I run that, here's my bar graph. I can see that California has almost 40 million people, and all of these states are arranged alphabetically. 6. Sorting the Visualization: Now, it doesn't look very nice arranged alphabetically.
Instead, let's switch over to the longhand form for our x-axis and y-axis so that we can customize them a little: alt.X for the x-axis and alt.Y for the y-axis. If I run that again, I'll see the same thing, but now, since we're using the longhand version, I'm able to customize both axes. In this case, I want to change the x-axis. I don't like this sort, the apparent randomness of it; I want it to go in order by size. So I can say how I want the x-axis sorted, and rather than specify a particular column, I can say to sort by the y-axis values. If I run that, I get the data sorted from smallest to biggest. If I want it the other direction, biggest to smallest, then instead of sorting by the y-axis I can sort by the inverse of the y-axis: I put a dash (a minus sign) in front and run it again, and now it's sorted in reverse, from largest to smallest. I think that's the one we'll keep. 7. Creating Additional Columns: Now, the one thing I don't like about this is that the axis is in millions. It would be nicer if we could scale the values so the axis showed just small whole numbers, like 5, 10, 15, 20, and then in the axis title we could specify that the figures are in millions. That's a common thing you'll see in charts. One of the things about Altair is that it has capabilities to modify the data, so you can take the data in one form and transform it into another form when you chart it, and we'll look at some examples of that in the future. But it's usually better to preprocess the data ahead of time so it already contains the information you want, in the form you want it. Altair has that capability, but it's a little clunky compared to what pandas can do. So I'm going to add another code cell just up here.
We're going to create a new column in our data, and that new column will contain this same information, but scaled by one million, so that we're looking at the data in millions rather than the raw numbers. One thing to note is that we're not going to add this column to our filtered data, because that's just a subset of the original DataFrame; we're going to add it to the full DataFrame, the one stored in the data variable. So before we do the filtering, I'm going to add this column up here, in another code block right before our filter code. I'll take data and create a new column just by specifying a new column name; I'll call it 2019 scaled, using capital letters to be consistent. This will be our 2019 scaled population. I'll set it equal to the column we're basing this on, the 2019 population estimate, divided by one million: a one with six zeros. What this does is create a new column whose values are equal to the values of the original column divided by one million. If I run that and look at the first five rows, scrolling over to our new column, we can see that it is in fact equal to the original column divided by a million, which is exactly what we want. Now that I've run this, I'll rerun my filter code, and I can see that my filtered data also contains the new column. So now, instead of using the original column in my chart, I'm going to use this 2019 scaled column we just created.
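The new-column step above boils down to one line of pandas: assign to a column name that doesn't exist yet, and the values are computed element-wise. The column names here (POPEST2019_CIV, 2019_SCALED) are assumptions standing in for the ones used in the notebook.

```python
import pandas as pd

# Illustrative raw population figures.
data = pd.DataFrame({"POPEST2019_CIV": [39_500_000, 580_000]})

# Assigning to a new name creates the column; every value is divided by one million.
data["2019_SCALED"] = data["POPEST2019_CIV"] / 1_000_000

print(data["2019_SCALED"].tolist())  # [39.5, 0.58]
```

Because this runs on the full DataFrame before filtering, any filtered view taken afterwards carries the new column along.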
Now everything looks the same except the y-axis: it's no longer in millions, it's using these smaller whole numbers, which is exactly what I want. Next, we need to change the axis title to indicate that the figures are in fact in millions. So in our alt.Y description, let's change the title so that it says the population is in millions. Now our title tells us exactly what we're looking at and what the scale is, and it's a little easier to understand. Generally, if an axis scale has really big numbers, it's better to scale them down like this, because the human brain is just better with smaller numbers, and the title can remind us that the numbers are in millions: this is 40 million, 35 million, 30 million, and so on. 8. Cleaning Up the Visualization: Now let's add a title to our chart, using the chart's properties. We want to be very descriptive about what this is, and we need to tell people what year it was, so we'll say "2019 Estimated State Populations". Now our chart has a title. If we look at the x-axis, we can see the state names down there, which tell us that this is the population for California, for Texas, and so on. Now we need to decide: should we label this axis? Should we write "state" here? It's pretty obvious what these are, and this is one of those cases where we really don't need a title on this axis, because it's obvious these are the state populations and these are the names of different states; a title here would be superfluous. So, just like we've done previously when we don't want an aspect of the chart, we'll set the title of the x-axis equal to None, and that removes the title completely. 9. Modifying the Color Scheme: Now, there are a couple of other things we might want to do.
First, the color scale. We talked before about how Altair uses a default color scheme. The color schemes it uses come from the Vega color definitions; Vega is the charting specification that Altair is based on. If you go to vega.github.io/vega/docs/schemes, you'll see the list of color schemes available to Altair. There are two different types. We have categorical color schemes, which are for categorical data, where each value is a distinct category. And then we have quantitative (sequential) color schemes, which use a single color with a range of values. One of the things about Altair is that you can pick a different scheme from the built-in ones, or you can create your own, and we'll talk about creating our own in a future video. In this episode, we're just going to switch out the color scheme. In the default scheme, you'll notice that every column is the same blue color. What I'd like instead is a faded gradient based on population size. So let's add a color channel to our encoding. Once again, we'll use alt.Color, the longhand version, and we want the colors to be based on the population values, our scaled population values, so we'll base it on the 2019 scaled column. If we run it just like that, we get a gradient, in the current color scheme, based on population size, and we also see a little legend over here. This legend isn't something we really need, so let's turn it off by setting legend equal to None; once we do, the legend disappears and we still have the colors.
So now we're using this blue color scheme, and what we want is to switch to the purples color scheme. The way we do this is a little bit involved, but not too bad. In our color value, we want to change the scale, the color scale it's using. But we can't just say we want the purples scale directly; instead, we create a scale object to pass in. There are a couple of ways to do this, but the way we'll do it is to say alt.Scale and, within that object, set the scheme equal to purples, "purples" being the name of this color scheme. You can use any of the schemes you like. When I run that, I'm now using the purples color scheme. 10. Final Touches: There's one more thing I want to do, just to make this chart a little more aesthetically pleasing: I want to change the overall view of the chart. To do that, I'm going to add another configuration function. We've looked at configuring the axis and configuring the title; this one is configure view. What I want is to get rid of the border line, so I'll say that I want the stroke width of those lines to be 0. When I run that, the outer line that goes around the edge, outside of the axes, disappears: with the stroke width set to 0, I just have these nice horizontal grid lines. It looks like an open chart, maybe a little more aesthetically pleasing. Just like in the previous video, we could add further configuration commands for the axis titles: we could change their colors and font sizes, and the label colors and font sizes. There are lots of things we could do. 11. Recap: But for today, let's review what we've done. First, we loaded our dataset.
We talked about how to load a dataset we have locally, using the folder icon and dragging the file into Google Colab. We also noted that anything we put there will disappear when we close it down, even if we save the notebook. When we save, Colab saves a copy of what we've typed into Google Drive, along with a copy of the last results, so even though the data won't be there, the results of running those commands will be. But if we rerun them without the data, the commands won't work, so we'll have to re-upload any local files we want. We also talked about how to get the path to the local data: we click the three dots and choose Copy path, and that's what we paste into the read_csv command. We talked about how, when we want to create a new column in our data, we specify the DataFrame and then the name of the new column, and we set it equal to some existing column plus whatever transformation we want to apply. We'll be doing some more advanced transformations in the future, but in this case we created a column whose numbers are the scaled values you get by dividing the original numbers by one million. Then we created a compound filter by wrapping each of our filter conditions in parentheses and joining them together with an ampersand, which means AND. This filter says: show me every row where age is equal to 999, and sex is equal to 0, and the summary level is equal to 40. We stored that filter in a variable, applied it to the DataFrame, and looked at our new DataFrame. Finally, we created a visualization, another bar chart, using Altair, where we specified the data we want to use, how we want the x-axis encoded, with those values sorted based on the y-axis but in reverse direction, and our y-axis.
We then specified a custom color scheme by creating a color encoding and setting its scale equal to a new Altair scale object whose color scheme was set to purples; that scheme is one of those defined in the Vega color scheme definitions, and there are a whole bunch of different ones you could use. We also showed another configuration command: we can configure the view, the main section of the chart, by setting its outside stroke width to 0. 12. Conclusion and Next Steps: That concludes the second in our six-part series on data science and visualization. In the next episode, we'll look at some more advanced filtering, and we'll map column values from one set of values to another in order to give our visualizations more descriptive information. As for the visualizations themselves, aside from some more advanced bar charts, we'll look at combining multiple charts in the same visualization, adding text to our charts, and making some different types of charts, including scatterplots. I hope you've enjoyed this part of our six-part series on data science and visualization. As I've mentioned, I also have another series coming out on machine learning. By the time you watch this, it might already be out; if not, head over to my website and sign up for my newsletter, and you'll be notified whenever that course comes out. It's super low volume, nothing spammy; I just announce the things I'm doing: books I'm releasing, courses I'm offering, things like that. So if you're interested at all in data science and visualization, or you just want to ask a question, go to my website, sign up for my newsletter, and reach out with any questions or concerns you have. Thanks.