Data Cleaning Fundamentals: Shape Your Data for Exploration | Ginette Methot & Curtis Seare | Skillshare

Data Cleaning Fundamentals: Shape Your Data for Exploration

Ginette Methot & Curtis Seare, Data Crunch Podcast Cohosts

Lessons in This Class

16 Lessons (41m)
    • 1. Course Trailer (1:51)
    • 2. Three Data Cleaning Principles (1:09)
    • 3. Installing Trifacta (3:04)
    • 4. Flows (1:20)
    • 5. Downloading Data (2:23)
    • 6. Inside a Flow (0:50)
    • 7. Grid Panel Overview (4:52)
    • 8. Data Recipe Overview (2:43)
    • 9. Ready-Made Recipe Steps (3:37)
    • 10. Quick Recipe Changes (1:25)
    • 11. Suggestion Cards (6:01)
    • 12. Keep and Delete (2:34)
    • 13. Drop-Down Menus Changes (5:27)
    • 14. Exporting Results (1:58)
    • 15. Project Explanation (1:20)
    • 16. We're Here for You! (0:12)
183 Students, 1 Project

About This Class

Join Data Crunch Podcast's cohosts, Curtis Seare and Ginette Methot, for a fun intro to data cleaning. This class is for anyone looking to start working with data for the first time, or anyone who simply wants an introduction to the free version of Trifacta software, Curtis and Ginette's favorite data-cleaning tool.

By the end of this course, you'll be able to use three basic data principles and many transforms to create a six-step data recipe—Trifacta's term for the changes you make to your data. We'll see you in class!

Meet Your Teacher

Ginette Methot & Curtis Seare

Data Crunch Podcast Cohosts

Teacher

Hi there! I'm Curtis Seare, and I'm Ginette Methot, and we cohost an Austin-based podcast called Data Crunch. We talk to people who do amazing things with data, often growing from their deeply passionate involvement with a subject, from detecting eye cancer in little children to saving the lives of honeybees. These world-changers are in every industry and every subject. There is no area or corner of the world that won't eventually be touched by the power of data.

We are passionate that you, no matter where you are or what work you do, can learn to be data literate in a data-focused world, not only to be able to understand the changing world culture, but also to do fascinating things while fusing your passions with data, because you can with the right tools and instruction. We're ...

Transcripts

1. Course Trailer: Hi, I am Curtis Seare, and I am one of the instructors for this course. In this course, we're gonna be teaching you the principles of data preparation. I've been in the data space working professionally for about eight years. I'm the director of analytics at a startup here in Austin. I have a master's degree in analytics, and I also cohost a podcast called Data Crunch that interviews people doing interesting things with data. So data is really central to what I do, and I'm really excited to share some of these skills with this class. We're gonna be talking about the three basic principles of cleaning data, and we're also gonna teach you the technical skills you need to be able to do that. We're gonna use a software program called Trifacta, which is free. You can download Trifacta and get going right away. The reason we chose this tool is that we think it's the easiest to work with. I've looked at a lot of data preparation software tools, and Trifacta is by far the easiest to use. So that's why we're going to use it in this beginner's course to show you how to do it. So don't worry if you don't have a lot of experience in data or even technical skills. This really can be done by anybody. The software makes it really simple, and we're gonna take you step by step through what it takes to clean and prepare your data. And I'm not the only one that's gonna be teaching this. I'm joined by my podcast cohost, Ginette. I am Ginette Methot, and I'm the other teacher who will be teaching this course. I got my degrees in English and in humanities and only recently started working with data, so I can probably speak to this more than most: you do not need to have a background in data to work with data, so there are no requirements for you to have a background to take this course. You're gonna build your very own six-step data recipe, and a data recipe, in Trifacta's terms, is basically just the different steps you take to clean up your data. So we're very excited to work with you, and we're very excited to see what you create.

2. Three Data Cleaning Principles: Hello and welcome to the course. We're really excited that you decided to join us. Just for a little bit of background, we're going to be using a data set of volcanic eruptions. This comes from the Smithsonian Institution's Global Volcanism Program, and it's all the confirmed eruptions that have happened in the world. So that's pretty interesting. We're going to dive into it, and we're going to teach you the three basic principles of data cleaning. One, we're going to teach you how to look for errors in your data set and how to remove them. Two, we're going to show you how to look for data that doesn't really need to be in your data set, that's irrelevant, and remove it. And three, we're going to show you how to look for ways that you can clarify the data set so it makes more sense to an end user. So those are the three principles that we're gonna be going over, and again, we're gonna be using Trifacta. So we're also gonna be teaching you the skills that you need in Trifacta to be able to do that. We'll show you how to upload data, we'll show you how to build a data recipe, and we'll show you how to build transforms that move your data through the steps that are necessary to transform it. So without any further ado, we're just going to jump right into it, and we'll show you how to download Trifacta.

3. Installing Trifacta: Hi there, Ginette here. So before we do anything, you will need to download Trifacta Wrangler, and I'll walk you through it, step by step. And if you've already downloaded it in the past and you happen to have it on your computer, skip this lesson and head to the next.
And keep in mind that while we move through these lessons, feel free to pause the video at any point if you need more time to download or follow the steps. For those of you who haven't downloaded this onto your computer, let's get started. OK, so first you'll go to Trifacta's home page. So open an Internet browser and type in trifacta.com. From here, you're going to look to the top right of the screen, and you'll see that there's a button that says Download. Select it, and it will bring you to another page that has a little bit of information for you. Just a quick highlight: if you do have questions about system requirements, you can go to the link right here. From here, come over to this button on the right that says Trifacta Wrangler Download, and you'll notice there's a little beta tag across the button. Keep that in mind, because this is software that's still in development, and as it updates, we'll update the course so you have the most recent information. Next, go ahead and select this download button, and it will pop up a registration screen. Now, it asks for a company and a job title, but you're a student in the course, so I would recommend that you put in whatever you're comfortable with. As a suggestion, something you could put in here under company would be Trifacta Training, and job title could be Student. From there, it does require a phone number. Now, I've put my phone number in in the past, and I've never had them call me, to my knowledge, and I've never had a voicemail from them, so I don't think that they'll call you. Here, put in your country and your state and then your email, and from there you'll create a password. You'll also select that you're not a robot and accept the licensing agreement. After you submit, you'll head to this last page, and here you'll select what you'll need for your PC or your Mac. Now, I'm gonna save this to my desktop as Trifacta Wrangler and let the download happen.
Once your download is complete, just go ahead and select the file that you downloaded. My computer is requesting that I stick it in my Applications folder, so I'm gonna go ahead and do that. You can tell up here that it's just copying it to my Applications folder. So once it's loaded and you know it's ready to go, find the application and open up Trifacta. Now, on a Mac, it's going to ask you if you want to open this application because it's from the Internet. Go ahead and say yes by selecting the Open button. Congratulations, you are ready to play with data. We're gonna dive right into using basic data principles and exploring a use case of what this tool can do.

4. Flows: Here we've stepped into the Flows screen, and a flow is basically a package that holds both your data and the changes you make to it. The screen will eventually show you a list of all your flows as you create them over time. For right now, since we haven't started a flow yet, it's totally blank. But you'll see that there are three tabs up here. One's the Flows tab, which we're on; then the Datasets tab, which will show you the data sets you've uploaded once you've uploaded data sets; and then Results, once you've actually run jobs and you have results to show. But let's go to the Flows screen, and we're going to select Create Flow. So here you can write any flow name and any flow description, but we're going to be importing a volcano data set today. So for the flow name, I'm gonna go ahead and write "The World's Volcanic Eruptions." You can put whatever you want there that makes sense for you. And then as far as the description, this is a place you'd write anything that helps describe the flow and any other words that you might want there. So here I'm gonna write something to the effect of "All confirmed volcanic eruptions of all time." Go ahead and press Create, and this will create your very first flow.

5. Downloading Data: Now we need to add a volcano data set to Trifacta.
And in order to do this, you'll need to find and download the data Excel document from Skillshare. So go to your Projects tab under this course and look at the right sidebar to find the attachment. Download the data file called volcano_eruptions_dataset. Once you've done that, in Trifacta select the button Import and Add Datasets. On this data import screen, you have a few options for adding your data. You can either drag and drop the file or choose the file from your computer. Choose whatever upload method works best for you. Now, I have the file on my desktop, so I'm gonna go ahead and drag and drop it into Trifacta. This file will take a little bit of time to upload because it has multiple tabs in it, and Trifacta needs to identify and display them separately to give you options for upload. So feel free to pause your video while the computer is uploading. For this class, we're only going to work with the first tab, by selecting the plus sign next to this top tab here. So while you have the option to open other data sources listed here, we're going to stick with this one tab for now. But for your knowledge, if you go to the plus sign for the whole file up here, it offers to concatenate the data set into one data set, and that basically means it combines the tabs together and creates one data file. You want to be very careful about doing this; it may not turn out to be quite what you're looking for. Also, if you wanted to, you could select all the tabs by checking the plus sign next to each one of them, and this would keep them separate. You'll also notice that there's a little eye symbol on the right side that you can use to show a preview if you don't remember which tab has what data, and that can be a really helpful tool. Now, as you select the plus signs on whichever data sets you want, you'll see them load up here over on the right, and that's getting them ready for the upload, so nothing has actually been uploaded yet when you select this plus symbol. Now, here you can rename them, you can describe them, you can delete them through this trash can symbol, and you again have the option to preview them through this other little eye symbol over here. So go ahead and select the Import Datasets button.

6. Inside a Flow: We've arrived inside a flow, and we see three icons. The first symbol represents your imported data, the second, scroll-like symbol represents a list of changes to your original data, and the third symbolizes your cleaned-up data with those changes applied. As you select the options, you can see that there is some file information associated with each of them in this details panel. Also, as we select each symbol, there's a blue action button for each of them that we can select, and your options are either Swap or Edit Recipe. Swap just means to swap out your data for a different data set, and the other option, which you'll see on the other two symbols, is Edit Recipe. For us to go in and play with the data, let's select this button, and now we get to go do the fun stuff.

7. Grid Panel Overview: Now we're in where we can make the magic happen, so let's take a quick look around at some of the basic tools at your fingertips. Here is your data in grid view, like you've seen before in Excel. You have rows here and you have columns here, and you can scroll around the grid with easy touch scrolling, if your computer supports that, or you can scroll simply by using the scroll bars on the right or bottom of the grid. One other thing of note is that as you scroll over the hyphen next to the rows, it shows you what row number this is after Trifacta imported it and what row it is from the original data source, which comes in handy at times. So while this row is row number one now, it shows that it was actually row two from the original source, which then makes us ask, where's row number one?
And that is your header row. Now, up here, the header row has the column names, and flanking each column name on either side are two drop-down menus, which we'll explore in more detail later. But as a brief overview, the one on the left quickly changes the data type, such as zip code or Social Security number or whatever data type it is, and this is where you'd categorize it. As we shift to the right drop-down menu, we see it offers a myriad of ways you can change your data. Now, under the header row with the two drop-down menus, you have two really awesome tools. First, there's the data quality bar, which gives you a rough overview of a column's data quality. It's a limited quality check, but it shows you dark gray for all the missing values in a column, as you'll see right here; it shows you red for all mismatched values, or in other words, values that don't match the data type that the column has been categorized as, which you'll see right here; and green for all valid values. But please keep in mind that green does not mean that the data is perfect. There still could be a lot of things wrong with it, even if it's marked green. All this really indicates to you is that a cell is not empty and that it matches the data type for that respective column. The second fabulous tool here is the column histogram, which shows you a graphical representation of the data in each column. Each bar here represents one word, value, or category in the column, and right below the histogram, you'll see a white information box that actually looks like a row. But if you'll notice, it doesn't have a dash next to it like these other rows, so that's the clue that it's not a row. This information box actually changes content according to your actions. For example, let's find out which volcano is named the most. As I scroll over these bars with my crosshairs, take a look at the content box below it. As you can see, Etna is written 197 times, making up about 2% of the column.
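For anyone who wants to see the idea behind the quality bar and the histogram outside of Trifacta, here is a minimal Python sketch. It is not Trifacta's implementation: the function names `quality_summary` and `histogram` and the tiny sample columns are made up for illustration. The last line only redoes the arithmetic from the lesson, using the 197 count and the 9,815-row total shown in the grid.

```python
from collections import Counter


def quality_summary(values, is_valid):
    """Count missing, mismatched, and valid cells, like the gray/red/green bar."""
    summary = {"missing": 0, "mismatched": 0, "valid": 0}
    for value in values:
        if value is None or value.strip() == "":
            summary["missing"] += 1
        elif not is_valid(value):
            summary["mismatched"] += 1
        else:
            summary["valid"] += 1
    return summary


def histogram(values):
    """Return (value, count, share of all rows), most frequent value first."""
    filled = [v for v in values if v is not None and v.strip() != ""]
    total = len(values)
    return [(v, n, n / total) for v, n in Counter(filled).most_common()]


# Hypothetical sample columns, invented for this sketch.
volcano_names = ["Etna", "Etna", "Fuego", "", "Etna", "Kilauea"]
year_column = ["1669", "unknown", "", "2018"]

# Treat the year column as an integer column: digits only count as valid.
print(quality_summary(year_column, str.isdigit))

for name, count, share in histogram(volcano_names):
    print(f"{name}: {count} rows ({share:.0%})")

# The figure quoted in the lesson: 197 of 9,815 rows.
print(f"{197 / 9815:.1%}")
```

Note that "valid" here only means the cell is filled in and passes the type check, which is the same caveat the lesson gives about green in the quality bar.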
All this is really telling us is that Etna currently appears in more rows in this column of the data set than anything else. If we look at the rows, we realize each row represents a distinct eruption, so we know it's where the most eruptions have occurred, according to our current knowledge of the data set. Keep in mind that there may be more to the context of the data set that we haven't discovered yet, but at a cursory look, it looks like Etna is the winner for most confirmed eruptions in the world's recorded history. This grid view also tells you a few other things. Here in the top middle, it's showing you that you have a full data set. This is important for Trifacta to tell you, because if you have an incredibly large amount of data, Trifacta will only take a small, random sample of it so that you can work with it. The reason for that is, if the data were too big, it would make your computer really slow, or it would not have enough memory to load it all and work with it. That's why it might just sample the data set. Next to that measure, it's telling you that you have 24 columns, 9,815 rows, and five types of data. Anything that's blue here, like this five, you can select to find out more information. Another thing that you can do over here on the right is filter the grid. So if there is a word or something in particular you're looking for, you can type it in here, and it will filter the grid for you. For the purposes of this class, we're using the grid view, which is the view we're in right now, with columns and rows. But you'll notice that all the way over here to the left, you also have the option of a column browser view. In this column overview, you can do things like quickly assess the data, hide columns you don't want to see in the grid view, or apply really quick changes across several columns, like removing a bunch of columns from the data set.
Now, this is something you should definitely explore in more detail later, but for now, let's focus on our grid view.

8. Data Recipe Overview: All right, it's time to talk about data recipes, which I think is one of the best features that Trifacta has. So if you take a look with me over at the right-hand corner, there's this icon that looks like a scroll, and if you go ahead and click on that, it's gonna open up what's known as the data recipe. Now, a data recipe is a step-by-step list of all the changes that Trifacta is making to your data. So each step is a change that Trifacta makes happen on your data set. In Trifacta's terms, those steps are known as transforms, and a transform does basically what it describes: each step transforms your data somehow. The imagery is really interesting to think about. It's kind of like a baking recipe or something that has a lot of steps that you take. But the fortunate thing here is that if you happen to mess up on one of your steps, you can just go back and easily delete it or change it by hitting the undo button right up here. And you can also redo if you decide you actually want that step to be there. So the reason that recipes are so great is that they give you an audit trail of what you're doing with your data. If you're using Excel, like a lot of people do, to work on your data, you'll often run into the problem that you have done a lot of stuff to your data. You've added columns, you've deleted columns, you've deleted rows, you've changed some data in the cells, and you've done all of these steps and eventually come up with a result. But then you realize, oh, I made a mistake like five steps ago. Unless you are heavily documenting and writing down everything you're doing in Excel, it's really hard to try and go back and figure out what you did and what went wrong and how to fix it.
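To make the recipe idea concrete, here is a small Python sketch of a recipe as an ordered list of described steps that is replayed against the untouched raw data. This is only an illustration of the concept, not how Trifacta stores recipes; the step functions, descriptions, and sample rows are all invented for the example.

```python
def trim_whitespace(rows):
    """Return new rows with leading and trailing spaces removed from every cell."""
    return [[cell.strip() for cell in row] for row in rows]


def uppercase(rows):
    """Return new rows with every cell converted to upper case."""
    return [[cell.upper() for cell in row] for row in rows]


def run_recipe(raw_rows, recipe):
    """Replay every step in order; the raw rows themselves are never modified."""
    rows = raw_rows
    for _description, transform in recipe:
        rows = transform(rows)
    return rows


# Hypothetical raw data, invented for this sketch.
raw = [[" Etna ", " Italy"], ["Fuego ", "Guatemala "]]

# The recipe: each step carries a human-readable description.
recipe = [
    ("Trim whitespace", trim_whitespace),
    ("Uppercase all cells", uppercase),
]

# The descriptions double as the audit trail you can show to someone else.
for number, (description, _transform) in enumerate(recipe, start=1):
    print(f"Step {number}: {description}")

cleaned = run_recipe(raw, recipe)

# Changed your mind about step 2? Remove it and replay from the raw data.
trimmed_only = run_recipe(raw, recipe[:1])
print(cleaned)
print(trimmed_only)
```

Because every result is rebuilt from the raw rows, editing or removing an earlier step never requires remembering what was done by hand afterwards, which is the contrast with Excel that this lesson draws.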
And the other thing is, let's say you do all of your transforms correctly in Excel and you present the result to somebody, and they have a question about whether it's accurate or whether you did something right. If you do it in Excel, there's really no way to show anybody what you actually did to the data, again, unless you painstakingly write out all the steps and everything that you're doing. So there's no transparency there. There's no audit trail. Trifacta helps you to do that. And this is a necessary thing to have when you're dealing with data, because so many things can go wrong and it's so important to have transparency. When you're looking here at these steps, if you happened to make a mistake, you can go back to step three or step four and say, oh, that's actually what I did, that's where my mistake is. I can easily fix it, and all of the steps after that will automatically update. Or if someone wants to know what you did in your data prep or your analysis, you can very easily take them right here to the recipe screen and show them every single thing you did to the data. So it's reproducible, it's transparent, and it's something you have to have when you're working with large, complex data sets. Otherwise you're going to end up wasting a lot of time.

9. Ready-Made Recipe Steps: We've talked a little bit about recipes and transforms, so let's take a look and discover what Trifacta has automatically already done for you. When you load up this data set, it's already taken it through these four steps. And just so you can have a brief overview and understand what Trifacta is doing here, we're going to go through these really quickly. We could go deeper, but for now we're gonna keep it high level. Let's take a look at what your data looks like before you actually put it into Trifacta. This is the plain text file of your data.
You'll notice up here Volcano Number, Volcano Name: those are your headers. And then you start seeing that each of these lines is kind of another row in the data set, and it looks like they're using commas to separate out where the columns should be. So this is what your data actually looks like, but you can't really work with it like that. So Trifacta actually applies some transforms in the recipe to get it into a usable format. Let's take a look at each of those steps. If you click on the first step, what Trifacta is going to do is gray out the rest of these steps, and it's gonna show you what the result was of this first step that it took. So your data comes in, here's the file, Trifacta does this initial step, and this is the result. It's saying it's breaking your data into rows, and it has this little SR symbol in a circle, which means it's the split rows transform. What you'll notice here is that you have rows. The first row here has Volcano Number, Volcano Name; then the second row, third row, and so forth. So all that first step did is just give you rows. Let's see what it does in the second step, so we can just click on that. It takes it out of the gray, and then it shows you the results of the second step that it took your data through. You'll notice it has an SP here. That's the split transform. And then it tells you that it split column one into 24 columns on a comma. You'll notice right there, in between those quotes, it's using a comma, just like we saw here that there's a bunch of commas that seem to be splitting up the data set. It's saying, all right, in this step, we have things separated by commas, and we're gonna take those commas, and we're gonna create column breaks at each of those commas. So now you have your columns. All right, we're getting close, but we're not quite there yet. It still has these ugly quotes. All of these data fields have quotes in them, which isn't really easy to work with.
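The four automatic steps in this lesson (splitting into rows, splitting on commas, and the quote removal and header promotion described next) can be sketched in plain Python. This is a rough illustration under stated assumptions, not Trifacta's code: the sample text and its values are invented, and only the three header names are taken from the lesson.

```python
import csv
import io

# Hypothetical raw file contents; the numbers are made up for illustration.
RAW_TEXT = (
    '"Volcano Number","Volcano Name","Eruption Number"\n'
    '"1001","Etna","501"\n'
    '"1002","Fuego","502"\n'
)

# Step 1 (split rows): break the text into one string per line.
lines = RAW_TEXT.splitlines()

# Step 2 (split): break each line into columns on the comma.
split_rows = [line.split(",") for line in lines]

# Step 3 (replace): replace every quote character with nothing.
unquoted = [[cell.replace('"', "") for cell in row] for row in split_rows]

# Step 4 (header): promote the first row to column names.
header, data = unquoted[0], unquoted[1:]
records = [dict(zip(header, row)) for row in data]

print(split_rows[1])  # quotes are still present after step 2
print(records[0])

# Python's csv module does the row, column, and quote handling in one pass,
# and it also copes with commas inside quoted fields, which the naive
# split in step 2 would not.
parsed = list(csv.reader(io.StringIO(RAW_TEXT)))
print(parsed == unquoted)
```

Doing the steps one at a time mirrors how the recipe lets you click on each step and inspect the intermediate result.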
So in the third step here, Trifacta guesses we probably want to get rid of those quotes. So it uses this RP in the circle, known as the replace transform. It's saying, take all of the quotes and replace them with basically nothing. So it's just using those two quotes here and putting nothing in between them, which basically means we're just getting rid of the quotes, because we're replacing them with nothing. So you'll notice all the quotes that used to be around these numbers and words are now gone. That's great. The last thing here: we will also notice that in this first row, we actually have our column header names. But we don't really want that in our data set; we want those as the column names. Well, Trifacta again guesses that that's what we want to do. So when we look at this last transform, the header transform, what it does is take that first row and convert it into your column names. So now everything in that first row is your column names: Volcano Number, Volcano Name, Eruption Number, and so forth.

10. Quick Recipe Changes: Now, let's talk a little bit more about these transforms. If you don't like any of these steps for some reason, you have some options: you can delete or edit any of them at any time. You'll notice that when I was hovering over them initially, you get these three options over here. The trash can will totally get rid of a step. So say I just want to delete that transform. It's gone from my data set, and now the headers are no longer in the column names; they're down here. I actually want that step, so I'm just gonna hit undo, and it will bring that right on back. You can also edit if you just hit this pencil icon, and you have more options if you hit the ellipsis right here. Under here you have a couple of options. You can copy a step if you want to duplicate it, paste it, or even paste it into another Trifacta window where you're working on another data set, and you also have the option to insert steps before and after the current step. So if I decided I needed a step before this header step, I could just insert one before, and then there's another step that I can build. So now we've seen all of the steps that Trifacta took to prepare your data, and now you know how to edit or delete them if you want to. And now that Trifacta has done these initial four steps, the next six steps that you add are the ones that will count towards your six-step recipe project. Maybe you've already added some as we've been going along here, and if so, go ahead and upload a picture of it.

11. Suggestion Cards: This lesson and the next are the meat of this course. And if you haven't been already, I recommend that you mimic on your computer what I'm doing here as I do it. This will help you get the most out of this section. So one incredibly helpful capability that Trifacta has is that it suggests what it thinks you'd want to do to prepare your data. Let's select some text inside the grid. Now you'll see that there's a section on the bottom that popped up. This section lists several options on transform cards for how you can change the data you've selected. Above these transform cards, you have three options: cancel your selection, modify your selection, or add it to the data recipe. For right now, let's cancel it. Let's select the entire Eruption Category column by going up to the header row and selecting its name. Note here that if we don't cancel our column selection and we choose another column in the grid, Trifacta will add that second column on top of what we've already selected, instead of thinking we're trying to make two different changes. That may be something we want to do, but maybe not. And if we don't want to do it, we can simply deselect the unwanted columns by again selecting the column names.
The first suggestion listed here is Drop, and Trifacta has automatically selected it for us. Drop means that we're removing the whole column from the data set, and there is actually a difference between drop and delete, which we'll get into in more detail in the next lesson. Now, as we look a little more closely at the bottom of the card, there's some light gray explanatory text here. This text explains what this change will affect and/or create. This card confirms that it will only be dropping this one column, and if you look over at your recipe, it's put a temporary step in it that's showing you what your recipe will look like if you choose this option. I actually think dropping this is a great move because, as we can see from the histogram, all the values in this column, except for the column header name, say the exact same thing: Confirmed Eruption. And I don't need or want this information in my data set, because it's obvious and implied information in the data set. So we're gonna go ahead and drop this column. This is an example of how to simplify our data, one of the principles of data cleaning we mentioned in the beginning. Now, Trifacta makes doing this very easy. So I'm gonna go ahead and drop the column, and we can do this by selecting the Add to Recipe button over here on the right, above the option cards. Okay, let's choose another column. How about the VEI column? This time, let's select the Rename card. As we can see, it shows a preview of what this change could look like. Here, Trifacta's put in a placeholder name called "new column name" until we put in our own name value. To change the name, let's select the Modify button. Selecting the Modify button takes us to the Transform Builder, a place where we can modify Trifacta's suggestions. Here, let's rename our column by filling in the new name section right over here. Note that you can't have spaces in your column names, so if you want a space, use an underscore symbol. Also, Trifacta's naming convention is case sensitive, so that's another good tip to keep in mind. Now, since I've learned that VEI stands for Volcanic Explosivity Index, let's spell out the acronym for this data set. This name might be important to change if our audience doesn't know what this acronym stands for. So this change clarifies our data, another principle we mentioned at the beginning of the class. Now that we've renamed the column, you'll notice that Trifacta shows us a preview of what the column would look like if we made this change. To actually make the change, go ahead and select Add to Recipe. Now let's pick another column. How about the Volcano Name column? We see an option to aggregate. Since this is an intermediate transform that we'll cover in a future class, let's skip past this one for now. Let's look for one that has multiple option dots underneath it. This "if missing" transform card here is a good example. We see below the option card that there are these four dots. Each option dot offers a change to the volcano names in this column. The first option, the "if missing" option, offers to replace a cell that's missing a name with something else of our choice. Or we can lowercase all the names here, uppercase all the names here, or even proper-case the names here. You may be asking, why would I want to change the word case? One hypothetical reason is that you might need to combine this data set with another one, and you need to match the word case to keep the capitalization consistent. This would streamline and potentially clarify the data, which is one of our principles of data cleaning. For today, let's proper-case these volcano names by selecting Add to Recipe. You'll notice when you add to the recipe that the preview it shows you goes away, and it actually makes the change and solidifies the recipe step.
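As a rough picture of what the three steps just added do to the data, here is a hedged Python sketch of dropping a column, renaming a column, and proper-casing a column. The helper functions and sample rows are invented for illustration, and the new column name `Volcanic_Explosivity_Index` is an assumption: the lesson spells out the acronym with underscores in place of spaces but does not show the exact name typed.

```python
def drop(rows, column):
    """Remove one column from every row."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def rename(rows, old, new):
    """Rename one column, keeping the column order unchanged."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


def proper_case(rows, column):
    """Capitalize the first letter of each word in one column."""
    return [{**row, column: row[column].title()} for row in rows]


# Hypothetical rows, invented for this sketch.
rows = [
    {"Volcano Name": "ETNA", "Eruption Category": "Confirmed Eruption", "VEI": "2"},
    {"Volcano Name": "santa maria", "Eruption Category": "Confirmed Eruption", "VEI": "3"},
]

# The reason given for dropping: every row holds the same value.
distinct_categories = {row["Eruption Category"] for row in rows}
print(distinct_categories)

cleaned = drop(rows, "Eruption Category")                       # simplify
cleaned = rename(cleaned, "VEI", "Volcanic_Explosivity_Index")  # clarify
cleaned = proper_case(cleaned, "Volcano Name")                  # make consistent

for row in cleaned:
    print(row)
```

Checking the set of distinct values first is the code equivalent of glancing at the histogram before deciding a column carries no information.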
Also note here that the quality bar and data type might change as we're working with the data because we're altering it, and as a result, Trifacta updates accordingly. Also, as Curtis mentioned, we can modify a recipe step at any point in the data recipe. Now that we've built out some steps, let's take a closer look. When we select a step to modify it, the recipe will not preview any steps after the point we've selected, and as Curtis showed us, the steps are light grey when they aren't activated. You can also delete a step at any point along the way, and the recipe will stay at that step, in the last state we were working on, and it's going to stay that way until we select the last step in the recipe, and then it'll activate all our other changes. Also, as a warning, keep in mind that if we delete one of the steps here, it might invalidate future steps. Here's a good example of that. If we take out this step, it doesn't invalidate anything. But if we take out this step, it does.
12. Keep and Delete: I noticed something odd while we were looking at the Volcano Name column. The column is actually missing one value, and that's weird because this is supposed to be a list of all the confirmed volcanic eruptions in the history of the world, so we shouldn't have a blank value in the Volcano Name column. So let's find out which value is missing. To do this, we can select the missing value in the data quality bar here. And once we've done that, we see that there's new information that pops up here next to this filter bar. It has Rows, then a colon, and then the words All and Transformed 1 Row. If we select Transformed 1 Row instead of All, we'll notice that only the missing value is showing up here. And as we scroll across the columns, we see there's nothing else in that row, which makes me wonder where this row was in the original data set.
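The missing-value filter described above amounts to asking two questions: which rows have a blank in a given column, and is anything else filled in on those rows? Here is a minimal Python sketch of that check. It is an illustration only, not Trifacta's own logic, and the sample rows and the function name missing_rows are assumptions.

```python
# Hypothetical sample rows; the last one mimics a completely empty trailing row.
rows = [
    {"Volcano Name": "Etna", "Start Year": "2013"},
    {"Volcano Name": "Unnamed", "Start Year": "1907"},
    {"Volcano Name": "", "Start Year": ""},
]


def missing_rows(rows, column):
    """Return (original_row_number, row) pairs where `column` is blank."""
    return [
        (number, row)
        for number, row in enumerate(rows, start=1)
        if not (row.get(column) or "").strip()
    ]


found = missing_rows(rows, "Volcano Name")

for number, row in found:
    # A row with nothing in any column carries no information at all.
    entirely_empty = all(not (value or "").strip() for value in row.values())
    print(f"row {number}: missing name, entirely empty = {entirely_empty}")
```

Note that a value such as "Unnamed" is not blank, so it is not reported as missing; only a truly empty cell is.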
So even in this preview mode, we have the ability to scroll over this hyphen next to a row to find out more information, and this gives us helpful information now because it's telling us that this row was originally row 9816, which means it was the last row in the data set, and it has no valuable information, so we can go ahead and just delete it. Deleting removes rows from a data set, unlike dropping, which removes columns from the data set. It may seem a little bit odd to have two different terms for what on the surface seems to be the same action, just getting rid of rows or columns. But columns and rows function slightly differently, and so traditionally they've been treated differently. Columns have names, and rows can, and usually should, have a unique ID column that identifies the row. However, it's easy to get rid of a column by simply saying, drop the Volcano Number column. But to get rid of rows, you have to say, delete rows that match a certain set of criteria. For example, in our data set, the criteria to delete a row could be something to the effect of, delete all rows where the column Volcano Number is blank. So the difference is nuanced, and this is why these seemingly identical removal transforms have different names. Now let's see what our options are in the suggestion cards. The first option is the Keep option, which in some circumstances would be helpful because it would keep only the rows that match our specifications, kind of the opposite of delete. But in this instance, we don't want to keep this row, so let's go to the next card, which says Delete. And this is what we want. So let's go ahead and delete it by selecting the Delete card and adding it to our recipe. By deleting this empty row, we employ the important data cleaning principle of removing errors or missing values from the data set when appropriate.
13. Drop-Down Menus Changes: Okay, let's move on to the last way we're going to clean data and add recipe steps in this course. Above the histogram and quality bar, we see that there are two areas for drop-down menus, which we briefly touched on in the beginning of the course. The one to the left is the data type, and by using this drop-down menu, we can quickly change the data type of the column. Don't be fooled by how simple the concept of a data type sounds. It's actually really important to get this right. Now, as we look at the Eruption Number column, we notice that it's typed as a zip code, which is a fun example, because Trifacta made a really good guess here: this is a five-digit number that could pass as a zip code. It's easy to see why this happened, but it's wrong, so we need to fix it. Since we have these drop-down menus, there is a quick and easy way to do that. Go to the drop-down menu on the left, and this is how we change the data type. Here we see some other data types from which we can choose another category for our data. The top categories are the most general and common to almost all programs that work with data. So you have your strings, integers, decimals, which could also be known as floats in many other programs and languages, and we have Booleans. So we have the string category, and this is a category that usually has words and letters grouped under it. But a string can also be numbers that we wouldn't want to do math with, like potentially an eruption number, which you wouldn't add to another one, or subtract or divide, or do any of those mathematical functions with. So let's keep looking here. Next we have integers, which are whole numbers that we would do math with, and then numbers with decimals, and then Boolean, which means the data can only hold two values, typically displayed as true or false.
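To make the general data types described above a bit more concrete, here is a small Python sketch of a naive type guesser. It is purely illustrative and is not how Trifacta infers types: the function name guess_type and its rules are assumptions, and the sample values are made up.

```python
def guess_type(text):
    """Very rough guess at a value's type from its text (illustrative only)."""
    cleaned = text.strip()
    if cleaned.lower() in ("true", "false"):
        return "Boolean"
    try:
        int(cleaned)
        return "Integer"
    except ValueError:
        pass
    try:
        float(cleaned)
        return "Decimal"
    except ValueError:
        return "String"


for sample in ["10001", "4.5", "true", "Kilauea"]:
    print(sample, "->", guess_type(sample))

# A five-digit value fits several types at once, so a guess from the text
# alone can be wrong. Treating an identifier as a number can also change it:
zip_like = "00501"           # kept as a string, the leading zeros survive
as_number = int(zip_like)    # converted to an integer, they are lost
print(zip_like, as_number)
```

The last two lines show one reason the choice matters: a value that is really a label, such as a zip code or an ID, is not something you do arithmetic on, and converting it to a number can silently alter how it looks.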
And then you also have a data type for dates. For this class, we're skipping over Object and Array and heading to More Options, which are specialty types in Trifacta and are self-explanatory, as you can see here. So out of these three options, String seems to be the best fit because we don't want to do math with these numbers. However, here's a pro tip. There's an important reason why, in similar situations, we might want to choose integers. The reason is that since this is an ID column, we may want to join it with another data source, and integers typically join faster than strings. Strings also take up a little bit more memory. So for small data sets, it really doesn't matter if we choose string or integer because the differences in memory and speed are negligible. But if we were working with a really, really big data set, we might want to choose integers. So since our data set is really small, we could choose to make this column either string or integer here. But if you do choose integer, just remember not to do any math transforms on that column and you'll be fine.
Next, let's head over to the down arrow on the right side of the column. When we select this button, we see many quick-select options that offer another way to make some of the changes we've already talked about, like renaming a column or changing the data type, which is a bit redundant. But there are also options we haven't talked about, like editing the column. This is where we can sort the column by selecting whether we want the column's data to ascend or descend. Or we can change the order of the columns around, and we can even duplicate or hide a column here. Now, the sort option can be particularly useful. For instance, ever since we figured out what VEI stands for, I've been really curious about how many confirmed eruptions were higher numbers on this scale, zero being the mildest volcanic eruption and eight being the heaviest hitter.
So let's head over to that column to use this sort function. Let's sort from the highest to lowest numbers, so choose Descending. It's showing us blank values first, because that's what it has placed as the highest value. To select only the rows with entered values, choose the valid values on the data quality bar and then select Transformed above that. You may be wondering why we don't delete the rows with no values here. But since there's other important information along these rows, we don't want to get rid of them. Now we only see the values that have numbers, and this makes it easier to explore the data. Here we can see some powerful volcanic eruptions at the top, and by scrolling across these rows, we find out their names and when they erupted, which offers some really interesting information. Now, one other thing I noticed in the Volcano Name column was that there's an unnamed volcano that erupted at one point, and I'm really curious whether this is the only time this has happened or whether there are other confirmed volcanic eruptions that aren't named. To find out, I'm going to filter our grid for unnamed volcanoes, and it looks like there are 14 rows of them. Now let's check the latitude and longitude to see if they actually have a location, and it looks like they do. And not only that, there are several that are repeat offenders that we can see here. Now I'm curious when these eruptions happened. Let's go over and move the Start Year column next to these columns to compare more easily. We see that these volcanoes didn't erupt that long ago, comparatively, in the world's history. So one hypothesis is that these volcanoes just don't have names. But at the very least, we know these are not errors that we should delete from the data set, so let's leave them alone.
14. Exporting Results: Now we've almost completed our example work on this data set. There's definitely more we could do with it, but that's what you'll be doing in your six-step recipe project.
More work on this data set. So let's just send this Start Year column back to where it was, now that we've taken a look at dates next to the longitude and latitude. And while we're looking at Start Year, I just wanted to clue you in to an oddity before you start working on your project. The dates in Trifacta are a unique situation, and you'll notice that there are some mismatched values here, and when you look at them, they're still dates. They're just dates earlier than 1400 AD. We actually reached out to Trifacta and asked them about this, and they said that programs usually do have a lower limit on their dates, and they chose 1400 AD as their lower limit. So any date before 1400 AD is considered a mismatched value, even though it's a legit date. They also said they'd never heard any feedback that this limit wasn't sufficient. So who knows? Maybe it'll change in the future, and then your dates won't be marked as mismatched. If it really bothers you, you can change the data type to integer. Now let's finish up our final step, which is to run these changes across all our data and get our results. We do this by going up to the Generate Results button up here and selecting it. It leads us to this new screen, and here we can choose whatever file format we want. I'm just gonna choose a CSV file by unchecking the JSON. CSV stands for comma-separated values, which is a file type I can open up in Excel. Then go ahead and press Generate Results. And here are our results. We can look at the results summary here, which is an overview of your data. You can look around at the top 20 values, and you can also see things like median, minimum, and maximum. But let's open the results. As we browse here, we can see some of our changes and choices, like when we dropped the Eruption Category column and renamed the VEI column, and also that we chose not to delete the unnamed volcanoes.
15. Project Explanation: All right.
So we have gone through the course, and now you get to apply what you've learned. In this project, what we're gonna ask you to do is take the volcanic data set and apply six transforms to it, so six steps in that data recipe beyond the four that Trifacta automatically does for you. While you're doing this, keep in mind the three data cleaning principles that we went over. Again, that's finding errors and deleting them, finding data that doesn't really matter, that's irrelevant for what you're trying to do, and getting rid of it, and clarifying the data set. So try to keep those three principles in mind as you're coming up with your transforms. You don't have to use only the transforms we went over in the class. You can really use anything that you want. The point is to be able to apply six new data steps to your recipe and come out with a data set that's cleaner than what it was coming in. To help you work through that, just keep in mind where you want to take the data set. Maybe you want to create a map of volcanic eruptions or do something like that. In that case, you may want to, for example, remove a lot of columns that don't have anything to do with latitude and longitude or positioning. Whatever you're trying to do, just apply those data principles and apply those transforms to try and get there. And once you've done that, go ahead and take a screenshot of it and upload it so that we can see the great work that you're doing.
16. We're Here for You!: Congratulations on finishing the course. We're very excited to see the projects that you come up with, and if you have any questions along the way, please do not hesitate to reach out and ask us. That's what we're here for.