## Transcripts

1. Intro to Data Science: Hi everyone, it's Max, and welcome to my course on the essentials of data science. The first thing we're going to do here is give a short introduction to data science so that we understand what a data scientist is, and then we're going to cover the three big essential areas that you need to be a successful data scientist. All right, so what is data science? Well, you can summarize data science in different ways, but the main part of it is transforming data into information. This is a really big step, because a lot of people talk about data and big data and all of these things, but data by itself isn't really that useful until you can turn it into information. If you just have a bunch of numbers appearing somewhere, and there's just so much of it, no one can make sense of that. That's where you need a data scientist to take all of this vagueness and all of the noise that's going on and extract information from it. That's what a data scientist does. Now, how you get this information is through analyzing your data. A big part of that is cleaning things up and doing some processing on it. Once you've cleaned things up a little bit, you analyze, and that is one of the ways you can get information out of your data. Through this analysis you can continue on and see trends and patterns and, hopefully, all types of correlations, and all of these things build up into this turning-data-into-information component. Ultimately, you also need to contextualize everything that you have, and your computer can't do that for you. The computer can crunch the numbers, but it's your responsibility to make sense of what's in front of you.
And even if you see something, you don't just blindly trust it. You need to understand: Where am I at? Where am I coming from? Where is this data coming from? You need to be able to contextualize these things and then, of course, be able to apply as well as understand them. So once you have this data, it's great, but turning it into information, into great information that you can use and directly apply, that's where the real power lies. And that's also the role of a data scientist. So that's pretty much what data science is. So what does a data scientist do? Well, we already talked about this a little bit, but let's go over it again with some more concrete examples. A data scientist would, for example, get and process raw data and then convert it into something a little bit clearer. You can imagine a data stream coming in: you have a measuring device that is constantly measuring all sorts of data, and because nothing is really constant, everything will be fluctuating up and down. A data scientist would then have to take all of this data, clean it up a little bit, maybe reduce the fluctuation that isn't supposed to be there, that's just background stuff going on, and then put it into a format so that you can easily plot it against other things. And then we already get to the next point: once the data is cleaner, you can start doing some calculations on it, figuring out the core statistical components, like what the average values are. What am I really dealing with? You're getting a first look, a first understanding, of what it actually is that you're tackling.
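
As a rough illustration of the clean-up step just described, here is a minimal Python sketch. The readings are made-up numbers, and `moving_average` is a hypothetical helper written for this example; a moving average is only one of many ways to reduce background fluctuation and is not necessarily the method the course uses.

```python
from statistics import mean


def moving_average(values, window=3):
    """Smooth a sequence by averaging each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        mean(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


# Hypothetical raw readings from a measuring device (invented for illustration).
raw_readings = [10.2, 9.8, 10.5, 9.6, 10.1, 10.4, 9.9]

# Step 1: reduce the fluctuation by smoothing.
smoothed = moving_average(raw_readings, window=3)

# Step 2: take a first look at a core statistic, the average value.
overall_average = mean(raw_readings)

print("smoothed:", [round(v, 2) for v in smoothed])
print("average:", round(overall_average, 2))
```

The smoothed series is shorter than the raw one because each output value needs a full window of inputs; a larger window smooths more but also hides more real change.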
And then, once you have this kind of understanding, you can start to do some visualizations, which help you as a data scientist maybe see some trends or patterns already. But visualizations are also really key because they let you show things to other people, and they're a great means of communication. So they help both you as a data scientist and others when you try to convey this information to them. All right, and then finally, you have to suggest some applications of the information. It's not really enough to just be able to look at it and say, yeah, I see it goes up and down, and that's good. But what does that mean? How does this transfer into something useful? That's also one of the key roles of a data scientist: transferring information into knowledge. So you've got this data-into-information step, but you also need to transfer this information into knowledge, and those are two really powerful things that are worth a lot. That's pretty much what a data scientist focuses on. Then you can go further and take this data and do machine learning with it or something, if you really understand what's going on or if you have some hypothesis of what could happen, so you can take things a lot further. But ultimately, turning data into information and then into knowledge, that's your role. All right, so let's go into the essential techniques, or the essential components, of data science. The first essential component, and we've touched on this already, is statistics. We're going to cover this later on, but let's just give a quick rundown. In statistics you need to understand the different data types that you can encounter. Data can come in different ways, and we'll go into more detail with this later.
But it's not just that you get a bunch of numbers; data can come in very many different ways, depending on the field that you're in. So you need to be prepared, and you need to be aware that data may not always just be a direct number for you. And then, of course, you need to understand some key statistical terms, like the different types of means, and also understand fluctuations in data. The reason this is important is that these key statistical terms give you an overview of how the data is behaving, and depending on how the data is behaving, you may want to approach it differently. If you know that your data is very clean and there is very little fluctuation, then if you visualize things, or maybe fit some curves to it, you can probably trust what's going on. But if you see there's a lot of fluctuation in your data, visualizing it is going to be much more difficult, because you just see jumps everywhere and you're not really sure which of this is actually true and which of this is caused by, say, some interference somewhere, or someone having messed with your system. All of these things will be hinted to you through statistical terms, so it's good to be comfortable with them and to be able to get some meaning out of them. All right, and then finally, in statistics you need to be able to split up, group, or segment data points, so that when you have this big data set, you can split it up into smaller pieces, compare different regions, look in more detail into some things, and maybe isolate certain components because, hey, these things are probably going to be important, and the rest I don't really care about that much. So it's about being able to pinpoint and isolate and meddle with the data a little bit.
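
To make the ideas of fluctuation and segmenting a bit more tangible, here is a small Python sketch. The region names and values are invented for illustration, and using the sample standard deviation as the measure of fluctuation is an assumption of this example rather than something stated in the lecture.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical measurements, each tagged with the region it came from.
measurements = [
    ("north", 20.1), ("north", 19.9), ("north", 20.0),
    ("south", 15.0), ("south", 25.0), ("south", 20.0),
]

# Split the big data set into smaller groups, one list of values per region.
groups = defaultdict(list)
for region, value in measurements:
    groups[region].append(value)

# Summarize each group: the mean says where the data sits,
# the standard deviation says how much it fluctuates.
summary = {
    region: {"mean": mean(values), "stdev": stdev(values)}
    for region, values in groups.items()
}

for region, stats in summary.items():
    print(region, round(stats["mean"], 2), round(stats["stdev"], 2))
```

Both regions have nearly the same mean here, but the "south" values fluctuate far more, which is the kind of hint that tells you to be more careful before trusting a plot or a fitted curve.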
So these are the statistical components that we're going to look into. All right, so the next big thing, and we've already talked about this too, is data visualization. We'll see why data visualization is a really key skill for data scientists, and then we're also going to be covering different types of graphs that you can use and how you can compare different numbers of variables. For example, you can have one-variable graphs, where you only look at one thing and you want to see how it changes. You have your typical two-variable graphs, which you probably know, where you have an x and a y axis and you can see how two variables relate to each other. Or you can have three-variable or even higher-variable graphs, where you plot maybe three different things, or even more if you want, as long as it makes sense, next to each other, so that you can compare multiple things at the same time. All right, and now we come to the other big thing that you're probably going to need as a data scientist, which is the ability to program. Now, not every data scientist can do this, but it is really, really essential, in my opinion, to your role as a data scientist, because knowing how to program is going to make your life so much easier. If you know how to program, you can take your ideas and your thoughts and put them into action on the computer. You can automate everything, you can customize things, you can explore, you can prototype, you can test, and you're not reliant on some application. You don't have to master some application where, if one feature isn't there, you have to contact customer support, and maybe it's not even possible, and then you have to wait for an update. Or maybe something is bugged. With programming, you're just so much more reliant on yourself, and you can really just do whatever it is you want to do.
And you're not reliant on other people or on the tools that other people have built for you. Rather, you can pretty much just go and do what you want to do without there being major roadblocks. And then we'll also look at some essential packages in Python. In programming, you never want to reinvent the wheel; you always want to start off where the last person left off. So the ability to program and to write simple programs is something you would need to teach yourself, but you wouldn't need to write highly complex mathematical packages or data analysis packages. Those are already out there. All you need to do is download them and use them in your code, and they're going to work. They've been tested a lot, and there are huge communities working on them and improving them. All of this is for the community, and so the whole community works together to improve it. No one's really directly trying to make a lot of money off of it, so they're not going to charge you all of these service fees. Everyone's just trying to improve their package, because if it improves, everyone benefits from it. So we'll talk about some of the libraries that you can use, especially in Python, to help you along your way with data analysis and to become a successful data scientist.
2. Statistical Data Types: Hey, everyone, it's Max, and welcome back. In this chapter, we're going to talk about statistical data types. We're going to look at the three different types of data, which are summarized as numerical, categorical, and ordinal. These are the types of data that we talked about before, when we said you can't just expect your data to be numerical. So we'll see numerical data, but we'll also see the two other types of data that you may be encountering in your career as a data scientist. All right, so let's talk about numerical data first. Numerical data is also known as quantitative data, and it's pretty much things that you can measure. It's numerical stuff that you can do math with. You can compare it: saying this plus this makes sense, or saying that A is greater than B. These are all examples of numerical data. Numerical data can be split up into two different segments. One of them is discrete, and discrete means the values only take on distinct numbers. An example of this would be a measurement of IQ or something like that, or, if you do coin tosses, the number of times that you toss heads. You can have 15 heads or 12 heads out of 20 coin tosses. You can have 500 heads out of 1000 coin tosses, or 500 out of 600. All of these are distinct numbers. Now, they don't have to be whole numbers specifically, but they do have to be distinct. That's the very important part: there is a step size that you're dealing with. And, of course, you can still say, hey, flipping eight heads out of 20 is better than flipping seven heads out of 20 if you want to flip heads, or flipping eight out of 20 is worse than flipping seven out of 20 if you're going for as many tails as you can.
So all of these comparisons make sense. That's the discrete part of numerical data. Then we have the continuous part, and continuous really means that values can take on any number, and they're not limited by decimal places. A value that can be, say, 1.1, where the next value would be 1.2, is not continuous. That's still discrete, because you have this step size of 0.1. Continuous means that literally every number from start to finish can be taken on. This doesn't mean every possible number in the universe, from negative infinity to plus infinity and all imaginary numbers and everything that comes with it; that's not required for continuous. It could really be that just every number between zero and one can be taken on. For example, let's say you have a bottle of water, and this bottle can hold one liter. If it starts off empty and you fill it all the way up to the top, the amount of water you've had in it needed to take on every single number between zero and one, because you can't just fill up water in small increments and say, hey, I'm going to put in 0.2 liters every single time. The water doesn't just teleport from A to B. When you're pouring in water, it's more like what we see in the stream here: the water level rises and rises and rises. So the amount of water that we have in our cup needs to take on every value between zero and one. That's an example of continuous data. But you see that we can be limited to being between zero and one. We don't have to start at zero and go all the way up to infinity or something. It's just that, within the range that we're looking at, every single number can happen.
Another good example would be the speed of a car. Say you're standing still at a stoplight and then you want to accelerate, and the speed limit is, say, 50 miles an hour. To get to 50 miles an hour from your starting position, your car has to take on every single speed in between. Of course, you won't see that on your speedometer. It would say something like zero miles an hour, one mile an hour, or maybe it goes 0.1, 0.2, 0.3, or something like that. So it may look discrete to you, but that's not how your car is going. Your car doesn't say, oh, I'm going to go in these step sizes of speed. It's going to accelerate, and it's going to take on every value starting from zero going up to 50 miles an hour, and while you're in this transition, you're going to take on every single one of those speed values. So that's what continuous data looks like. It's important to understand the difference between discrete and continuous, just because you may want to approach them differently. Now, of course, if we're dealing with computers, our computers can't deal with infinitely many decimal places. We have to cut it off somewhere, so usually continuous data is going to be rounded off at some point. But it's still important for you to know that you're dealing with continuous data here rather than discrete, so that you know, hey, there can still be other stuff in between here. With discrete data you have specific step sizes, and all you see is a bunch of lines, one at every step size. With continuous data you can expect that everything is filled up, that values can, and may very well, be in between certain places. So that's the important thing to note between discrete and continuous.
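
The point about a speedometer showing steps while the car's speed is continuous can be sketched in a few lines of Python. The speeds below are invented, and `displayed_speed` is a hypothetical helper for this example; it simply rounds to the nearest step, which is an assumption about how a display might behave.

```python
def displayed_speed(true_speed_mph, step=1.0):
    """Round a continuous speed to the nearest multiple of `step`, as a display might."""
    return round(true_speed_mph / step) * step


# Hypothetical true speeds while accelerating from a stoplight (in mph).
true_speeds = [0.0, 0.37, 12.81, 33.333, 49.96]

# What a speedometer with a step size of 1 mph would show.
shown = [displayed_speed(speed) for speed in true_speeds]

for true_value, shown_value in zip(true_speeds, shown):
    print(f"true: {true_value:7.3f} mph   shown: {shown_value:4.0f} mph")
```

The displayed values look discrete because of the step size of the instrument, but the quantity behind them is still continuous: any value between two displayed steps can occur.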
All right, so the next type of data that we'll have is categorical. Categorical data doesn't really have a mathematical meaning, and you may also know it as qualitative data. Categorical data describes characteristics. A good example of this would be gender. Here there is no real mathematical meaning. Of course, if you have the data, you can say male is zero and female is one, but you can't really compare the two numbers, even though you assign numbers to them. You may just do this so that you can split the data up later on and your computer can understand it, but it doesn't really make any sense to compare. You can say male is not equal to female, but you can't really say that one is greater than the other or that one is approximately equal to the other. Those things don't make sense because they're not well defined. What would that mean? And you can't really add them up, either. You can't say male plus female; that doesn't give you a third category or something. So you can't really apply math to categories, but they're nice ways to split up or group your data, and they provide these qualitative pieces of information that are still important. It's just that you can't really go about plotting them on a line or something like that. Those are the important things to note with categorical data. Another example would be ethnicity, or you could also have nationality. All of these things are examples of categorical types of data. And, like we said, you can assign numbers to them, but that's really just for your code, so that it's easy to split them up. You still can't really compare them. How are you going to compare nationalities? There is really no definition for comparing one category to another. All right.
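
Here is a minimal Python sketch of "assigning numbers to categories just for your code". The list of nationalities is made up, and the numeric codes are arbitrary labels chosen by this example (alphabetical order); they carry no mathematical meaning.

```python
from collections import Counter

# Hypothetical categorical data: one nationality per person.
nationalities = ["US", "Canada", "US", "France", "US", "Canada"]

# Assign an arbitrary integer code to each category so code can split on it.
# The codes are labels only: comparing or adding them would be meaningless.
codes = {label: code for code, label in enumerate(sorted(set(nationalities)))}
encoded = [codes[label] for label in nationalities]

# What does make sense for categories: checking equality and counting members.
counts = Counter(nationalities)

print("codes:", codes)
print("encoded:", encoded)
print("counts:", dict(counts))
```

Grouping and counting work fine on the encoded values, but an expression such as `codes["US"] > codes["Canada"]` only reflects the alphabetical ordering used to build the codes, not anything about the categories themselves.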
And so the third type of data that you can encounter is something called ordinal data. Ordinal data is a mixture of numerical and categorical data, and a good example of this would be hotel ratings. You have star ratings there of zero, one, two, three, four, or five stars, or maybe even six stars, or whatever hotels go up to these days, but it's still not as straightforward to compare. I'm sure you've seen two different types of three-star hotels. One of them had the bare minimum: the beds were okay, but it wasn't really anything special. And then you had those three-star hotels that you could have sworn were at least four-star. Star ratings do make sense. We can say a four-star hotel is probably better than a three-star hotel, because there are standards for these things, and they have been checked. If you go to a four-star hotel, you know roughly what to expect. But still, it's not completely defined. Coming back to this three-star example, if you just say, hey, we're going to a three-star hotel, it's very hard to know exactly what to expect, because there are different types of three-star hotels. There are three-star hotels that have developed to, like, have a swimming pool, maybe, and there are those three-star hotels that are really more like hostels and just made it past the two-star mark. So there it's much harder to define or to know what to expect. Now, if you take averages of these star ratings, though, then you do get a much better idea of what's going on. If you have consumer reviews or something like that, and you say, oh, from 500 reviews our hotel has an average rating of 3.8, then you know that the three-star hotel that you're looking at is pretty much a four-star hotel.
It feels like a four-star hotel, even though it may not have all of those qualifying characteristics. That's the feel you get from it. Whereas another three-star hotel may have a rating of 2.9 or something, and there you know this hotel is more towards the lower end of three stars. Some people may not even consider it to be three stars. And of course, this rating may be a little bit biased, because people went to a different three-star hotel first, and then they went to this one and were expecting something completely different from a three-star hotel. So they said, this can't be three stars, this is two stars. But that's because of the way the ranking system is defined underneath. So when we have these averages of these ordinal numbers, they start to make a little bit more sense. All right, so let's go over a small exercise and see if we can identify what type of data we're dealing with. The first thing we'll look at is a survey response on happiness. You have people filling out a survey, and one of the questions is, how would you rate your happiness? The options are bad, neutral, good, or excellent. What type of data would this be? Well, this would be an ordinal type of data, because it's still in the form of categories and you're asking for a subjective opinion, but it does make sense to compare them. You can say excellent is greater than good, good is greater than neutral, and neutral is greater than bad. But what exactly does it mean to be good or excellent? Where do different people draw the line? There's still a little bit of vagueness involved, but generally it does make sense and you can compare it. And if you have a lot of surveys and you average them, the values you're going to get are probably going to be very representative, or at least pretty representative.
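
The survey exercise can be sketched in Python as follows. The responses are invented, and mapping the labels to the ranks 0 to 3 is an assumption of this example: it treats the gaps between neighbouring labels as equal, which ordinal data does not actually guarantee.

```python
from statistics import mean

# The order of this list is what makes the data ordinal rather than categorical.
LEVELS = ["bad", "neutral", "good", "excellent"]
RANK = {label: position for position, label in enumerate(LEVELS)}

# Hypothetical survey responses to "How would you rate your happiness?"
responses = ["good", "excellent", "neutral", "good", "bad", "good"]

# Comparisons between labels make sense because the labels are ordered.
excellent_beats_good = RANK["excellent"] > RANK["good"]

# Averaging the ranks gives a rough overall picture across many responses.
ranks = [RANK[response] for response in responses]
average_rank = mean(ranks)

print("excellent > good:", excellent_beats_good)
print("average rank:", round(average_rank, 2))
```

An average rank a little below 2 reads as "somewhat under good on average", in the same way that an average star rating of 3.8 reads as "nearly four stars".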
All right, so let's look at the next thing, which is the height of a child. What type of data is that? We can say it's probably numerical, and actually it most definitely is numerical. The height of a child is a numerical value. But let's go a little bit deeper and ask: is the height of a child discrete or continuous? Well, even though when you measure height you get something like five foot three, or 160 centimeters, it's not a discrete value, because to get to that height you have to have reached every single height before it. At the moment you're measuring it, you're rounding it off to what your measuring tape can measure, so your measuring tape is limiting the value. But if you had a super, super precise measuring instrument, you could measure not just five foot three or something like that; you could really go into detail with the inches and the decimal places. So the height of a child would be a numerical data type, and it would be continuous. All right, now let's talk about the weight of an adult. Do you expect the weight of an adult to be discrete or continuous? We can probably agree that it's numerical, because it's a weight value; it's pretty much defined to be a number. And what do you expect it to be, discrete or continuous? Well, the right answer here is going to be continuous again, because to reach a certain weight, they would have had to reach every single weight in between before. So, again, weight is something that we can consider to be continuous. All right, and so finally, let's look at the number of coins in your wallet. We can already tell by the name.
It says number of coins, so we can probably agree that this is a numerical type of data. But would the number of coins in your wallet be discrete or continuous? Well, the answer would be discrete, because it doesn't really matter what denomination your coins are. They could be 50-cent pieces, they could be 25-cent pieces, tens or fives or ones, or a two or something like that. The number of coins that you're going to have is going to sum up to a whole number. You can have one coin, you can have two, you can have three, but you can't have infinite fractions of a coin. You can't have, say, the square root of two coins; that doesn't really make sense. So you have a defined step size. You have one coin, then if you get a second coin you have two, and if you get a third you have a quantity of three: you're going in step sizes of one. So for the number of coins in your wallet, we'd be having discrete numerical data.
3. Types of Averages: Hi everyone, it's Max, and welcome back. In this tutorial, we're going to talk about the different types of averages. We're going to see the three different types of averages, which are the mean, the median, and the mode. All right, let's get started. We'll start off with the mean. The mean is the typical average that you know, and really, what the mean is, is you just sum all of the values up and then divide by the total number of values that you have. The great pros of the mean are that it's very easy to understand and it makes sense: we take everything we have, add it all up, divide by how many values we have, and that should give us a good representation of what the average is. It also takes into account all of the data. Since we're adding everything up and then dividing by how much data we have, we're taking into consideration every single data point. Now, there are some problems with this. One of the problems is that the mean may not always be the best description, and we'll see why when we look at examples of when we should use the median and the mode. The mean is also very heavily affected by outliers. Since we're taking everything into consideration, if we have big outliers, that's really going to change what our mean looks like. If we just have normal values between, say, one and five, and all of a sudden we have 10,000 in there, that's really going to affect our mean. So the mean is heavily influenced by outliers, and the bigger the outlier, the more the mean is influenced by it. All right, so let's see some examples of the mean. We'll go through a worked example first, and we can see our data set here, just a bunch of numbers. What we're going to do to calculate the mean is take every single one of these numbers and add them up, and we can see the total result that we get here.
And then the next thing we're going to do is take this total result, count the number of data points that we have, and divide one by the other, which then gives us our mean, as we can see here. So that's an example calculation of the mean. But let's see some example applications of the mean. When would we use it? Well, a good application would be the time it takes you to walk to the supermarket. Sometimes you walk a little bit faster and maybe it takes 20 minutes to get there. Sometimes you walk a little bit slower and it takes you 25. But on average it takes you somewhere around 22, or maybe 22 and a half minutes, or something like that. So if you say, I'm going to go to the supermarket, you know it's going to take you about this much time to get there. Another good example of the mean would be the exam score for a class. To get a good understanding of how people do in an exam or in a class, you can look at the mean exam score from last year. And since exam scores are in a smaller range, the mean is going to be good to use, because you can get anything between zero and 100, but realistically speaking, no one's probably going to get a zero, so your range is even smaller, and you're less affected by outliers. You also know roughly how hard the class is going to be just by being able to compare the means. If you look at one class and its mean is higher than the other's, and they have a large number of students or something, then you can probably say, hey, it's easier to get a good grade here, or make some of these simpler observations without diving too deep into it. All right.
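
The calculation just described (add everything up, then divide by the number of data points) and the effect of an outlier can be sketched in Python. The data set from the slides is not reproduced in the transcript, so the values below are stand-ins taken from the "between one and five, then 10,000" remark; `arithmetic_mean` is a hypothetical helper name.

```python
def arithmetic_mean(values):
    """Sum all of the values, then divide by how many values there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Stand-in data: "normal" values between one and five.
typical = [1, 2, 3, 4, 5]

# The same data with one large outlier added.
with_outlier = typical + [10_000]

mean_typical = arithmetic_mean(typical)
mean_with_outlier = arithmetic_mean(with_outlier)

print("mean without outlier:", mean_typical)
print("mean with outlier:   ", round(mean_with_outlier, 2))
```

A single extreme value is enough to pull the mean far away from where almost all of the data sits, which is why the mean works best when values stay within a limited range, as with exam scores.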
Another good example of the mean would be to ask, how much chocolate do you require when you get this kind of sweet craving? You're not going to say, oh, I require one chocolate bar, or two chocolate bars, or three. You're going to say, oh, on average I require maybe three quarters of a chocolate bar. Sometimes I may want a little bit more, because when I start eating chocolate I crave it even more. Sometimes I have some at first and the taste just doesn't sit right with me right now, and so I have a little bit less. But this is kind of the average amount. So if you have this craving, either you say, oh, I'm going to try to be strong, or you're like, well, I know this feeling, and I know that if I eat about three quarters of a bar of chocolate, I'm going to feel good and my craving is going to be satisfied. So you know what to expect. These were some examples of how we would deal with the mean, or when we would use the mean. All right, so let's look at the next thing, which is the median. The median represents the middle value in your data set. Now, if you have an even number of data points, you don't really have a middle value. In that case, the median is going to be the mean of the two middle values: the two middle values added together and then divided by two. The pros of using the median are that it can sometimes be more accurate than the mean, and we'll see some examples of this. The median also evenly splits your data. With the mean, if you have an outlier, it drags everything to the right, and it could be that your outlier drags things so far to the right that all of your data is to the left of the mean and only the outlier is to the right. That would be an extreme case, but it can happen. The median is different.
You know, it's always located directly in the center of your data, and the median also doesn't care about outliers. If you have huge outliers at the beginning and at the end, it doesn't really care, because outliers, by definition, aren't very common. If you have some at the beginning or some at the end, they're going to be very few in number, which is what makes them outliers. Therefore, the median doesn't really care about outliers that much. A con, though, is that the median doesn't really give you much information on the rest of the data. Sure, you know what's at the center, but you don't know how everything around it behaves. You only know where the center of the data is. So let's see some examples. We'll do a worked example first, where we see our data set here and we can count how many values we have. Going from left to right, we've got 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 data points. So we've got an odd number, and our median value, our center value, is going to be the seventh data point, because it's six from the beginning and it's also six from the end. It is equally spaced from both the beginning and the end. That's why we see that our median value here is 26; it's located directly in the center. Now, what is the median useful for? Well, the median is often used if you look at household incomes for a country, because if you were to use the mean, then the billionaires would give you a false description of what an average household income really is. Normally, if you have an average value, you can say, oh, the average household income for a family would be, say, $40,000 or something like that, and that would be the median value. But if you were to use the mean instead, then all of the billionaires and all of the millionaires in the country would change that household income.
And then you would say, oh, you know, the average household income for a family would look like 60K. And that's a bad representation, because it doesn't give you a realistic look at what the typical household has. The typical household is really centered at, like, 40K. And sure, there are people below it and there are people above it, but that's what's in the middle. Whereas if you were to use the mean for your average, you would get this inflated household income, which wouldn't be representative of the rest of the country. Another good example of the median would be the distance that people cover to get to work. If you look at this in terms of kilometers, then you can say, oh, some people walk to work and it's one kilometer at most, something like that. And then most people travel around three kilometers to work. And sure, there are some that travel much further because they want to live outside of the city, and there are some that travel very, very short distances because they have a house right next to the office, or their house is the office, depending on where you're working. But then you can look at the middle: how far do people travel to work? What time or what distance do they need to cover? And so that would be another good use of the median. Another good median value is what you usually spend when you buy a new item of clothing. Sure, sometimes you may go to that expensive clothing store and get a jacket that costs, I don't know, north of a couple hundred euros or dollars, whatever currency you want to use. And sometimes you can go to a second-hand store and get it for very cheap. But usually if you go into stores, a jacket will maybe cost you, like, $100 or something like that.
So if you go out, you can expect to pay about $100, not really taking much account of what store you're going into. Most of the stores that you're going to visit are gonna have that price for the jacket, so that would be another good use for the median. All right, let's look at the third type of average that we can use, which is the mode. Now, the mode looks at the most common value in your data, and it's not really defined if there are several most common values. But if there's only one most frequently occurring value, then that's what your mode would be. We'll see an example of this in a second. So one of the pros of using the mode is that it's not only applicable to numerical data. If you look at categories, for example, then you can say, hey, we've got five people from the US, two from Canada and one from France. And you know that the mode is gonna be the US, because there are five people from the US. So the mode is a great average that's not only applicable to numerical data; in this sense we can technically also apply it to categories, or to article numbers if we wanted. So you can say the most common country that we have, or the average kind of country that we would expect here, is the US. And sure, there are other countries, but the average, or the most common one, is going to be the US in this case. And then, of course, the other pro is that it allows us to see what's most common, what pops up the most. So that's a great use of the mode in cases where recurring values happen a lot, which is the case for discrete numbers, for example. With discrete numbers, values recur often, and so it's good to use the mode. A con of the mode is that, again, it doesn't really give you a good understanding of the rest of the data, similar to what we had for the median, but also it's not always applicable.
If you just have a bunch of different values, then there isn't really gonna be a mode. If there's not enough of each value, it's not really good to use the mode. You don't want to have thousands of data points where the most recurring value occurs, like, three times. That's not good. You want to use the mode for situations where data recurs often, like we saw in the country example. But let's actually see a worked example, and also some other examples for the mode. So the worked example here would be, again, we take our data set, and we can count how many times different numbers appear. If we go through the numbers, we'll see that 26 occurs the most, and so that's gonna be our mode here. We've got 22 and 25, which both occur twice, but 26 occurs three times, and so 26 is gonna be our mode. It's gonna be our most frequently occurring value. Now, the mode is gonna be useful for things like the peak of a histogram. If you don't know what a histogram is, don't worry. We'll cover that in a later lecture when we go into data visualization. But the peak of a histogram is going to show you the mode of the data, the most frequently occurring value. Another good use of the mode would be if you look at employee income at a company. Because at a company, you can again have the boss, who throws off the mean, and you can have higher-level employees too, who kind of shift the median. But if 1/3 of the employees earn minimum wage, then that's gonna be the best average. Or say 40% of the employees earn minimum wage. Hopefully not your employees, because that wouldn't be a very good system to have, but if 40% of the employees at the company that you're looking at earn minimum wage, that's not a really good thing to have. And if you look at the mode, you'll easily see that the average in this case would be to earn minimum wage, because that's what most people earn.
And sure, the boss or the CEO may shift the mean up heavily. And with the higher-ups, if you look at the median value, it may well be too far to the right, so that you really don't consider these employees that all earn the same amount. But you really want to get that description, which is what you get here from the mode. And then also, the outcome of an election is something you use the mode for. Sure, sometimes you may only have two values, sometimes you may have three. But if you have different candidates, say five different candidates, then the person with the most votes is gonna win the election, because they have the most. So there again, you'll use the mode.
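The three averages from this lecture can be sketched in a few lines of Python. This is only an illustration: the transcript does not list the values on the slide, so the 13-point data set below is hypothetical, chosen to match what is described (22 and 25 appear twice, 26 appears three times, and the seventh value is 26). The outlier value of 1000 is also an assumption.

```python
from statistics import mean, median, mode

# Hypothetical data set: 13 values chosen to match the lecture's description
# (22 and 25 appear twice, 26 appears three times, the 7th value is 26).
data = [20, 22, 22, 24, 25, 25, 26, 26, 26, 28, 29, 31, 33]

print("mean:  ", round(mean(data), 2))
print("median:", median(data))  # middle (7th) value of the 13 sorted points
print("mode:  ", mode(data))    # most frequently occurring value

# Swap the largest value for an extreme outlier (think: one billionaire).
with_outlier = data[:-1] + [1000]
print("mean with outlier:  ", round(mean(with_outlier), 2))
print("median with outlier:", median(with_outlier))

# The mode also works on categories, as in the country example.
countries = ["US"] * 5 + ["Canada"] * 2 + ["France"]
print("most common country:", mode(countries))
```

With the outlier in place the mean jumps far above every ordinary value, while the median stays where it was, which is the behaviour described for household incomes.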
4. Spread of Data: Hey, everyone, it's Max, and welcome back to my tutorial. In this lecture, we're gonna look at the spread of data, and we're gonna start off by looking at the terms range and domain. Then we're gonna move on to understanding what variance and standard deviation mean. And then finally, we'll look at covariance as well as correlation. All right, so let's start off with the range. The range is basically the difference between the maximum and the minimum value in our data set. So that's kind of simple to think about. Let's go through this with a worked example. Let's set up a company in a town, and this is the only company in the town. The owner of the company earns a salary of 200K a year, and the employees all have different salaries, but the lowest-paid employees, maybe the part-time workers, earn something like 15K a year. So we've got data ranging from 15K up to 200K, and our range is the difference between the maximum and the minimum value. So we take 200K and we subtract 15K from it, and we've got a range of 185K in salary. That's how much our salary can change: if we start at 15K, it can go all the way up to 200K, so that's a 185K range of salaries that people in this company can have. All right. And the domain is going to be the values that our data points can take on, or the region that our data points lie in. So if we look at this example again, our domain is going to start at 15K and go up to 200K. What the domain defines is kind of the starting and ending points; it defines a section in our data.
And so in this case, the domain would start at 15K and end at 200K. What the domain tells us is that all salaries between 15K and 200K are possible, but within this company it's not possible to have salaries outside of this domain. So if our domain is 15K to 200K, then we can't have a salary of 14K, because that's outside of our domain, and we also can't have a salary of 205K, because again, that's outside of our domain. So pretty much all salaries within 15K to 200K are possible, and anything outside of the domain is not possible. All right, so let's move on and look at the variance and standard deviation, and we'll talk about the variance first. What the variance tells us is pretty much how much our data differs from the mean value. It looks at each value, it looks at how different that value is from the mean, and then it gives us the variance from that calculation. We don't really need to know the formula; it's more important right now just to understand the concept of variance. What the variance really tells us is how much our data can fluctuate. If we have a high variance, that means a lot of our values differ greatly from the mean value, and that makes our variance bigger. If we have a low variance, that means a lot of our values are very close to the mean value, and that makes our variance lower. And now, if we turn to the standard deviation, the standard deviation is literally just the square root of the variance. So if you understand one, then you also understand the other. And we can combine this with the range of our data, if we know it, to get a better feel for the data.
And so let's use an example where we have two different countries, just countries A and B, and they have the same mean height for women, which in this case we'll say is 165 centimeters, or five feet four. And we'll say that the range of heights for them is identical. So let's say the range could be like 30 centimeters or so, going anywhere from, say, 150 all the way up to 180, or we can even increase that and say anywhere from as low as 140 up to, like, two meters. But let's just keep the range for these two the same, and they both have the same mean height. Now, if country A has a standard deviation of five centimeters, which is approximately two inches, and country B has a standard deviation of 10 centimeters, which is approximately four inches, then what you can expect, knowing these values, is that if you go into country A, the people that you're going to see are gonna be much more similar in height. Our standard deviation is lower, which means our values differ less from the mean. And so a lot of the women that you're going to see are going to be very close to 165 centimeters, or five feet four, plus or minus two inches. So what you can expect when you go to this country is that a lot of the women are gonna be about that height. Whereas if you go to country B, they have a much larger standard deviation, and so you can't really expect everyone to be about five feet four, because it fluctuates a lot more. If you go to that country, you can expect to see a lot more women of different heights, both taller and shorter than five feet four. All right. And so that's how we can use the standard deviation to give us a little bit more perspective on our data and allow us to infer some things about our data. All right, so let's talk about covariance and correlation.
And so covariance already has the name variance in it, but covariance is measured between two different variables. So let's say we've got two variables: me drinking coffee in the morning, and my general tiredness. I use these two variables and get data points, so this is how much coffee I drank in the morning, and this is how tired I feel this morning, or something like that. What the covariance does is it looks at how much one of these values changes when I change the other one. So what does that mean, for example? Well, if I drink more coffee, what the covariance would look at is how much my tiredness changes. So that's what you do with covariance: you say, I change one, how much does that affect the other thing that I look at? And correlation is very similar to covariance. We kind of normalize the covariance by dividing by the standard deviation of each variable. What that means is we get the covariance for me drinking coffee versus feeling tired, and then we just divide by the standard deviation of me drinking coffee and the standard deviation of me feeling tired. So really, what we're doing with the correlation is bringing it down to relative terms that fit our data better. That's kind of the abstract idea. The important thing to keep in mind is that we're looking at one variable, we're seeing how much it changes, and we're seeing how much that change affects the other one. All right, so there are different correlation values that we can have, and they can range anywhere between negative one and one, so their domain is between negative one and one. A correlation of one means a perfect positive correlation. That means when one variable goes up, the other goes up. So for my coffee example, that would be if I have coffee in the morning, then I also feel more happy.
So the more coffee I have, the happier I feel. And of course, there's gonna be a limit. But let's say I only drink up to two cups of coffee or something like that, and I can drink anything in between. The more I have, the happier I am about it. So that would be a positive correlation: the more I have of coffee, the more I have of happiness, and so they kind of go up together. And then we get closer to zero. Zero is going to mean no correlation, so anything between zero and one is going to be a slightly positive correlation. It's not gonna be super strong, and we'll actually see some examples on the next slide. And the closer you get to zero, the more it means no correlation. An example for the zero case would be that it doesn't matter how much coffee I drink in the morning, it's not gonna affect the weather. They're unrelated. One does not affect the other. So I could drink one cup of coffee during a sunny day and one cup of coffee during a rainy day, and it's not going to change the weather. They're pretty much uncorrelated. And then we can also go down into the negative range. The closer we get to negative one, or if we reach exactly negative one, that correlation of negative one means a perfectly negative correlation. And so here we can take our example of coffee versus tiredness. The more coffee I have, the less tired I'm gonna be. So coffee goes up and tiredness goes down.
So that's how we can understand this correlation, and it comes from the covariance. It was important to understand the covariance, but we usually use the correlation, because the correlation, since we divide by the standard deviation of each variable, is a much better fit to our data. Now, there is one thing that's very important to remember, and that's that correlation does not imply causation. Just because two things are correlated, that does not mean that one causes the other. A good example of this would be if I live in a climate where it's usually cloudy in the morning and sunny in the afternoon. Every morning when it's cloudy, I drink coffee, and then it becomes sunny in the afternoon. Even though me drinking coffee and it becoming sunny may be correlated, me drinking coffee does not cause it to be sunny. It's just because it happens every day, and by chance there is this correlation that appears. But that does not mean that me drinking coffee results in the weather getting better. A causation would be me drinking coffee and me feeling less tired, or me drinking coffee and me feeling happy about it because I like the taste. Those would be causations. So that's an important thing to keep in mind: just because things are correlated does not mean that one causes the other. All right, so let's see these things on a graph. Here we have the examples again that we've talked about, but we can see how the data would look for different types of correlations. We can see the perfect correlation of one, where one goes up and the other goes up, on the left side, and we pretty much get this really nice straight line.
So one value goes up, the other value goes up with it. And then the closer we get to zero, the less correlation there is between them, and the more variance we have in the data. We'll notice that for the case of perfect correlation, which is the one, or the case of perfect anticorrelation, which is the minus one, where again we have the example of more coffee, less tired, we have a very nice thin line and our data doesn't jump around a lot. But the closer we get to zero, the less we can see one tracking the other, and the more we can see our data spread out. And so that's what correlation looks like in graphical terms.
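The lecture skips the formulas, so here is a minimal Python sketch of the spread measures it describes, written out by hand so each step is visible. The function names and all sample numbers are assumptions made for illustration; only the 15K minimum and 200K maximum salary come from the lecture. The variance here divides by the number of points (the population form); dividing by one less than that is the other common convention.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Average squared distance of each value from the mean (population form).
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    # The standard deviation is just the square root of the variance.
    return sqrt(variance(xs))

def covariance(xs, ys):
    # How the two variables move together around their own means.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    # Covariance normalized by both standard deviations; lies in [-1, 1].
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

# Salaries in thousands; only the minimum (15) and maximum (200) are from the lecture.
salaries = [15, 30, 45, 60, 200]
salary_range = max(salaries) - min(salaries)
print("range:", salary_range)

# Hypothetical coffee data: cups per morning vs. tiredness and happiness scores.
coffee = [0, 0.5, 1, 1.5, 2]
tiredness = [10, 8, 6, 4, 2]   # falls as coffee rises
happiness = [1, 2, 3, 4, 5]    # rises as coffee rises

print("coffee vs tiredness:", round(correlation(coffee, tiredness), 3))
print("coffee vs happiness:", round(correlation(coffee, happiness), 3))
```

Because the made-up tiredness and happiness scores change by the same amount for every extra half cup, the two correlations come out at the extremes of the scale, approximately -1 and +1; real measurements would land somewhere in between.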
5. Quantiles and Percentiles: Hey, everyone, it's Max, and welcome back. In this tutorial, we're going to go through quantiles and percentiles. All right, so let's get started. So what are quantiles? Well, quantiles allow us to split our data into certain regions. If we're dealing with probability, they all have the same probability of occurring, or if we're just dealing with the size of the data, we want to split our data into equal regions. So that's what we can do with quantiles: splitting everything up so that every time we split it, we have equal amounts of data. All right. An example of a quantile would be something known as a quartile, and that's when we split our data into four equal regions, hence the name quartile. So quantile is the general name for doing this splitting procedure, and if we say quartile, that means we're doing quantiles for four equal regions. This is something that you would probably often see on, like, university admissions pages. They say the top 25% of our applicants have at least a test score of, like, 90% or something, and then they would say the bottom 25% of admitted students have a test score of at most, you know, 70% or 75% or something like that. And then the median test score is 85%. So that's how you would go about quartiles: you have the lower 25%, then 25 to 50, then 50 to 75, and then you've got the top 25%, so 75% to 100. And so you've got these four equal regions, which also include your minimum value at the very bottom and your maximum at the very top, and in the middle you've got your median value. That's the value directly in the middle, because you're splitting the data up into four equal regions.
And so the value that separates the second quartile, which would be 25 to 50, from the third quartile, which would be 50 to 75, that value would be the median value. All right. And so if we go into percentiles, that may be a name that you've probably heard before. Percentiles are again an example of a quantile. But instead of splitting into four, like a quartile, a percentile means splitting into 100 equal segments, hence the "percent" at the beginning of the name. As you may have noticed, percent means out of 100, so if you are familiar with percent, that's the same kind of reasoning this comes from. So we've got percentiles, which means splitting into 100 equal segments. An example of this is often seen in test scores. If you've ever taken something like the SATs, then you get a test score, but you also get a percentile, and the reason that's done is to judge not you versus the test, but you versus everyone else. And so if it's a difficult test, then something like getting a test score of 60% but being in the 95th percentile means your score is actually a lot better than it looks. What you can say with percentiles, for example, is that whatever percentile you're in means you're better than, you know, that percentage of other people. So if you reach the 99th percentile, that means you're better than 99% of the people that took the test. The 95th percentile would be, you're better than 95% of the people that took the test. And so that's why percentiles are often used for tests, and they're often used for normalization, because they allow you to take into consideration factors like, is it a difficult test, is it an easier test? Maybe more people are scoring higher. So they don't really judge you directly versus the test.
But they normalize you against everyone else that took the test. So you take the test, you get a score, and then the percentile checks where that score lies relative to everyone else. These percentiles give you a good normalization, and they allow you to do great comparisons, because they let you eliminate some of these factors of test difficulty. And of course, there can always be luck involved, and that may not get filtered out on an individual basis. But you do this for a lot of students, and that's also why, in these big standardized tests, you get a percentile along with your score. So if your score is lower but the test was really hard, you can still see, you know, I did really well, because people found this test really hard, and it was even harder for them than it was for me.
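As a rough sketch of the quartile and percentile ideas above, the snippet below uses Python's standard `statistics.quantiles` for the quartile cut points and a small helper for the percentile. The helper `percentile_rank` and the 20 test scores are hypothetical, made up for illustration, and "percentage of scores strictly below yours" is only one of several conventions for defining a percentile rank.

```python
from statistics import median, quantiles

# Hypothetical scores (in percent) of 20 people who took a hard test.
scores = [30, 32, 35, 38, 40, 41, 43, 45, 46, 48,
          50, 51, 52, 54, 55, 56, 57, 58, 59, 60]

# n=4 asks for the three cut points that split the data into four equal groups.
q1, q2, q3 = quantiles(scores, n=4)
print("quartile cut points:", q1, q2, q3)
print("median:", median(scores))  # the middle cut point is the median

def percentile_rank(all_scores, my_score):
    """Percentage of test takers who scored strictly below my_score."""
    below = sum(1 for s in all_scores if s < my_score)
    return 100 * below / len(all_scores)

# A raw score of 60% looks modest, but relative to everyone else it is high.
print("percentile rank of a 60:", percentile_rank(scores, 60))
```

The raw score says how you did against the test; the percentile rank says how you did against the other test takers, which is the normalization described in the lecture.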
6. Importance of Data Visualization: Hey, everyone, it's Max, and welcome back. In this tutorial, we're gonna talk about the importance of data visualization. All right, so first we're gonna look at the role that the computer plays for us and what the computer is actually made for. Then we're gonna look at what role the human should play in terms of data science. Then we're gonna look at presenting data, and finally, we'll talk about interpreting data. All right, so let's get started and talk about the role that the computer plays. Now, a computer is much, much faster at calculating than a human, because that's what it's made for. It's made for crunching numbers. It's made for doing fast calculations. If you think about how fast computers are, they're in the gigahertz range, and giga means billion, so they just do billions of things every second. And so they're really good for doing repetitive things, because they can do them so fast. We can give them these logical tasks in terms of programming, we give them a structure, and they just do it, and they can do it over and over and over again. They're not gonna mess up. They just repeat the same thing. They won't get tired of it, and they're really good and really fast at doing these things. So that's the role the computer should play for you. It should be a means to get this hard number crunching and all of these things done. There's really no need for you to work through all this complicated math, because your computer can do it much better and much faster than you. And it's also less prone to error if you code it correctly. So that's kind of the only part where you come in, and it's only gonna mess up if you mess up. But generally, our computer does exactly what we tell it to do, and it's really good and really fast at it.
What role should a human play in terms of data science? Well, humans have naturally developed to identify patterns, and we've done this first for survival, so that if we're walking around somewhere and we see, I don't know, a big predator hiding, we can identify the pattern of the predator and pick it out, even though it's trying to camouflage itself. So humans by nature have become very, very good at identifying patterns. You can also see this if you look at the clouds and you see animal shapes or other things. Those patterns aren't actually there, but humans have become so good at identifying patterns that we can see things in many, many places. And so that's what humans are really, really good at: we're able to look at things, and we're able to pick out patterns. Now, another thing that humans are really good at is that we're very creative. And through that creativity, we can also use memory and bring in outside knowledge, and we can also use a general understanding. These are all things that computers can't do. So computers are kind of a means of getting stuff to us, but once it's actually there, it's our job to use our pattern recognition abilities. And of course, you can train machine learning algorithms for specific patterns or specific cases later on and make them really good at that. But generally, if you don't know exactly what's gonna come, your first step as the data scientist would be to try to identify these patterns, use your creativity, use your memory, bring in all of these different things that make you human, and use all of that on the data. These are all things that a computer just doesn't have any access to. Okay, so considering all of this, the best way to do it is in terms of data visualization. You can't just show spreadsheets with a bunch of numbers.
That won't really help you, because looking at numbers, it's really hard to pick out patterns. The best way to do it is just to plot the values. If we have these visuals in front of us, then we can really identify patterns. We can see things go up and down, we can see them fluctuating, and we can see them make very thin lines. We can just look at a graph and see things. And of course, we need a little bit of practice to understand what that graph is trying to tell us. But once we understand graphs in general, we can look at new graphs and start to see patterns. They may not always be true, but that doesn't mean we can't pick them out, and later on you would also do some testing to see if those patterns are true, if they make sense. But generally, data visualization is very good for this because it allows you to invoke all of your human characteristics, the things that make us human that we talked about on the last slide, all of the things that the computer can't do. In one sense, data visualization is for you, so that you can see these things, try to pick them out, and use them later on. But it's also for when you're trying to show these things to other people. Maybe you have to do a presentation or a kind of summary, and then you want to make sure that your data visualizations are good, because the people that are gonna be looking at them are much, much less trained at looking at data and analysing data than you are. If you try to convey a message to them and just show them a big spreadsheet with numbers and point out, like, here, look, look, look, these numbers pop up, they're gonna be like, what are you talking about?
So that's why it's really important to have really good data visualization skills. One part of it is to enable you to do your job, but the other part is to show it to other people and to help you convey information to them. And of course, we talked about statistical values, and statistical values are very important. They can give us a good idea about the data and what's going on inside of the data. But visualising data is just taking it to the next level. Statistical values aren't enough. They can help us, they can support us, they can give us ideas. But if you really want to understand what's going on somewhere, you just have to take a look at what's going on. And of course, it's also important to make sure you choose the right visualizations, because otherwise, you know, things may just look extremely weird. But just this skill of being able to present data, both for yourself as well as for other people, is very, very important for a data scientist. And then we go over to interpreting data, and we've kind of touched on this in the last section already. Really, data visualization just allows you to see the data, and it allows you to apply some reasoning to the system. If you look at data, either you see something, which is great, because that means you can try to test something and see if it's actually there, or you don't see anything. And that also tells you something: you aren't really able to pick out a pattern, so there isn't anything obvious going on. There may be something underlying that's more complicated, but nothing obvious to the viewer. All of these things allow you to analyze your data much more easily and to prepare what you're going to do after that. So data visualization really gives you a deep, deep understanding of what's going on with your data.
And then when we interpret this data and we look at these visualizations, maybe you see dips, and maybe you see some hills somewhere. We can try to understand all of this by bringing in our outside knowledge. So again, this is what the human is really good at: we can bring in the context of things. Maybe people are going out to lunch here, and that's why activity decreases. Or maybe everyone is coming to work in the morning, and that's why activity increases compared to, you know, 6 a.m. We can bring in all of this context and all of this understanding to try to interpret the data, to better understand what's going on. And then, of course, we're hopefully going to see some trends or patterns. Like I said, these may not always be there. We're actually so good at pattern recognition that we can sometimes see patterns that aren't really there. A good example of this, again, would be looking at the clouds in the sky. You can see animal patterns, maybe, but they're really not there. That's just our minds identifying all of these patterns. And so that's pretty much why data visualization is so important to a data scientist. It's because this whole human aspect is just key in data science, key in data analytics. To be able to understand what's in front of you, to bring in this outside knowledge, to contextualize, this creativity, that's really key to a good data scientist. And a computer can help you with all of this. The computer can help you do the number crunching. A computer can help you set up the visualizations, and it can plot whatever you want. But ultimately it's up to you to choose the right visualizations, to look at the data, and to be able to communicate the visualization as well. All of those things are up to you.
And so that's why the human is so so important in data science.
7. One Variable Graphs: everyone, it's Max, and welcome back. In this tutorial we're gonna look at one-variable graphs. So we're actually going to see some of the types of graphs that we talked about in our last tutorial, where we just looked at the importance of data visualization. Now we're gonna go into data visualization and look at the types of graphs that you may want to choose from. The one-variable graphs that we're gonna look at are histograms, bar plots, and pie charts. So let's get started with histograms. We can see an example of a histogram on the right. What's really cool about histograms is that they show us the distribution of the data across all the values in our data. A histogram shows us what happens the least, and it also shows us what happens the most. Histograms let us see where our data is concentrated, and they also let us see how it's distributed, and so through those they show the general behavior. Really, what a histogram does is it looks at each value and at how often that value has occurred. What we see here, for example, is that around zero we have the most occurrences of whatever value we're looking at, and as we move to the left and to the right, these values start to drop off, so they become less frequent. That's what a histogram shows us: the frequency, how often these things occur. Generally, a histogram is just this plotting of frequency versus your value, and there are different ways that a histogram can look. One of them is the one that we've just seen, which is a normal distribution.
It's also called a Gaussian-like histogram, because it follows this Gaussian distribution, or this normal distribution. But we can also have something like an exponentially decaying value, where we start off very high, and the further we get away from the initial value, the lower it gets. And you can actually compare that to the Gaussian-like, or normal, distribution. The normal distribution looks more like a bell: it goes up and then curves down slowly. The exponential, on the other hand, drops off very fast and then slows down later on. So they do have different behaviors. And then, of course, we can get not just one peak, like we see in this first case with the Gaussian-like distribution, but also things like two peaks, or even three peaks or more, or we can have very large, extended peaks. So histograms are a means of showing us how the data is distributed, what things occur most frequently, and where our data is concentrated, but that doesn't mean they have to have one specific shape. There are many different shapes that histograms can take on, and depending on what shape you get, that also tells us something very different about our data. All right, so the next one-variable plot that we'll look at is bar plots. Bar plots may look a little bit similar to histograms at first, but they're very different in some sense, because bar plots allow us to compare across different groups. That's what we see on the x-axis down there: we look at different groups. So we use the same variable, but we can compare that variable over different groups. If we look at an example, what we see on the right here is different countries, and what we show is the average income tax.
And so we see that country B, for example, has the highest average income tax, whereas country D has the lowest income tax. Through this, we're still only looking at the income tax variable, but we're able to compare it over different groups, over different categories if you will. Other examples would be control groups and test groups. If you're doing some medical study or maybe some psychology study, you always want to have your control group, and then you can have different types of test groups. You can plot each of these groups in a bar plot and look at the same variable, but see how it changes over the different groups. Another example would be something like comparing male versus female heights: you've got one group that's male and another group that's female, and you just plot their average height. And then there's the income tax of different countries, which is what we've seen on the right over here. All right, and so the last one-variable graph that we're gonna look at is pie charts. What pie charts allow us to do is section up our data and split it into percentages, and because of this, we can see what our data is made up of. The whole pie corresponds to 100%, and then we cut it into different slices. Through that slicing, hopefully also color coding like we've done here, and most definitely labeling so that you know what slice corresponds to what value, we're able to see what categories our data is made up of. We can see what is most prominent, but we can also see what is least prominent. And then again, here we can also see distributions, not as well as in the histogram, but we can still see distributions in terms of dominance and in terms of how many groups there are. Is the data spread evenly?
Or is it heavily concentrated in one part of the pie? All of these things are what we're able to see with pie charts. We get this nice group overview of one variable. Examples of this would be looking at the ethnicity distribution in a university. You can have a pie chart where each slice of the pie represents a different ethnicity, and depending on how much of a percentage they make up of the total university profile, that's how big the slice of pie would be. So you can see the dominance of some ethnicities as well as minorities, but just by how many slices there are, you can also see how many different ethnicity groups there are. Another example would be splitting up star reviews for products. Rather than looking at the average star review, you can use a pie chart and see how many of the reviews are five stars, and how many of them are four, three, two, and one. And so there you again get this nice, different overview of how the review system works.
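To make the counting behind histograms and pie charts concrete, here is a minimal Python sketch using only the standard library. The star ratings are made-up illustration data, and the names `star_reviews`, `frequencies`, and `percentages` are hypothetical; the tutorial itself only shows the finished charts.

```python
from collections import Counter

# Hypothetical star ratings for one product (made-up illustration data).
star_reviews = [5, 4, 5, 3, 5, 1, 4, 5, 2, 4]

# Histogram idea: count how often each value occurs.
frequencies = Counter(star_reviews)

# Pie chart idea: turn each count into a share of the whole pie (100%).
total = len(star_reviews)
percentages = {stars: 100 * count / total for stars, count in frequencies.items()}

# Print a crude text histogram, with the pie-chart share next to each bar.
for stars in sorted(frequencies, reverse=True):
    bar = "#" * frequencies[stars]
    print(f"{stars} stars: {bar} ({percentages[stars]:.0f}%)")
```

The same counts feed both views: the bar lengths are the histogram, and the percentages are the sizes of the pie slices.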
8. Two Variable Graphs: everyone, it's Max, and welcome back. Now we're gonna talk about two-variable graphs. The graphs we're gonna look at are scatter plots, line graphs, 2D histograms (or two-dimensional histograms), and box and whisker plots. All right, so let's start off with scatter plots. For a scatter plot, what we're doing is really scattering all of our data points onto a graph, so pretty much for every data point that we have we put a little dot on the graph. Scatter plots are great because they allow us to see the spread of data between two variables. We're always plotting one variable on the x-axis and another variable on the y-axis, and that pretty much allows us to see how the data is distributed for these two variables. Through that, we can see more dense areas, we can see sparse areas, and we can also look at correlation. Maybe you remember the lecture where we talked about correlations: through scatter plots we're able to see where those correlations were, or where there weren't any. That's what scatter plots are really, really nice for. Of course, we can also use them to show little clusters, like we see here. Not everything needs to be connected by a line or a curve; maybe something is more like a circle, and scatter plots can show us that too. They can show us these groupings. We see one cluster here, but maybe you have bigger plots, and then there would be, say, 10 different little groupings for different things. Scatter plots are really great for that because they just show us where the data points are located for these two variables, and then we can see for ourselves what they look like. Does one variable affect the other? Are there certain groupings that we can see? Where are the dense areas, and where is it sparse?
Where are things concentrated? Is everything spread all over the place, or is it very, very narrow and only in a specific region? Scatter plots allow us to see all of these things very easily. For an example of where we could use scatter plots, if we look at the graph on the right, we can look at something like car prices versus the number of cars sold. Each of these data points pretty much represents a car that's been sold. The x-axis tells us the price that the car has been sold at, and the y-axis tells us the number of cars that have been sold at this price. What we see here very easily is that the higher the car is priced, the less it gets sold. You can think of that in terms of, well, the higher the price, the fewer people want to buy such an expensive car. Maybe they found a cheaper version of it. Maybe it's just a branding thing, which is why it's more expensive, and there's something of just as good quality that's cheaper. Or maybe people just don't have enough money, which is probably a big factor too: people just don't have enough money to buy these expensive cars, and so that's why sales drop off. It may look a little bit different in terms of profits, but the higher the car is priced, the less we see it being sold. So that's one example of a scatter plot. Something else that we can look at is income versus years of education. On the x-axis we would look at how many years someone has been educated, and then we would look at their current income, and that would just be one point on the graph. We can do that for many, many different people, and then we can see how different amounts of education for different people affect their current income. So that's another thing we can do a scatter plot for. We can also go back to one of the examples that we used very early on.
We talked about people traveling to work, and we can just plot the distance traveled versus the time it takes to travel to work. Then we can see that maybe some people travel faster. It could be that some people travel the same distance, but one takes longer than the other because one goes by car, another goes by bike, and another takes public transport. All of that we can see in scatter plots, and we can take these different situations into account and see how it all looks for the general population of our data. So scatter plots are really, really great as a first go-to for identifying trends, identifying regions, and just giving a good overview of the data. Now, the next thing that we'll look at is line plots. Line plots are in some sense similar to scatter plots: we have the same basis of the x and y axes, but the points are connected. It's very important to know when to choose line plots and when to choose scatter plots. Line plots can carry a lot of advantages with them, because this connectedness makes it very easy for us to see trends. We can see where the lines go without trying to connect the points in our heads. Connecting the dots is exactly what a line plot does for us. It's great if we want to see an evolution of something. Maybe you want to see an evolution over time, or over space, or with people, something like that. If our data points are connected, it's great to use a line plot. So if we know that whatever happened before is connected to what happens now, line plots show us how things evolve, because the points are all connected as a line.
But scatter plots are for when we just plot points without that connection. If we go back to our cars sold versus car price example, say an expensive car has been bought five times and a cheaper car has been bought 100 times: there isn't really a logical connection to make between the two. And so if we were to use line plots where we should use scatter plots, what we'd really see is just a bunch of lines all over the place. That's why it's important to know when to use line plots and when to use scatter plots, because choosing well can be very, very helpful. If you use a scatter plot instead of a line plot, it's gonna be a bit more confusing, because you have to try to connect the dots yourself in your head. But if you use a line plot instead of a scatter plot, it's gonna look really weird, because there are just lines all over the place and you can't really see anything. An example where we could use line plots is the typical distance versus time: you look at what time it is and how far someone has traveled. A general curve of distance versus time is very, very common. You can also look at the profit of a company versus the number of employees. The more employees they employ, how does that change their profits? Of course, they have to pay more in wages, but maybe the employees can also do more work, and hopefully that more than cancels out what you pay them and increases company profits. And then what we can see on the right here is creativity and how it changes with stress. You can see that the more stressed out you are, the less creative you are. Here it's also good to use a line plot, because you gradually advance in stress, so each point in stress is related to the next, and the higher you go up in stress, the lower you go down in creativity.
And so there is this kind of relation, and we can see this evolution: the more you get stressed out, the less creative you become. Line plots are really nice here because there's not this chaotic movement everywhere; it's very easy to see this line and to follow it. Okay, so the next graph that we can talk about is two-dimensional histograms. We've seen one-dimensional histograms in the last tutorial, where we looked at the spread of data, at the peaks, and at how things were distributed to the right and to the left, but we can also do a two-dimensional histogram. What a two-dimensional histogram is, it's pretty much a one-dimensional histogram for every single point of the other variable that we're looking at. Really, what these allow us to see is how the distributions of the two variables relate to one another. We can see here, for example, in the red region, that those specific values happen a lot, so that combination of values happens a lot. We're able to pinpoint these frequent occurrences again, and we're also able to look at drop-offs, but now we're able to pinpoint that to two specific values rather than just one, which is what we did with the one-dimensional histogram. These things are much harder to see in scatter plots, because in a scatter plot, if we have a value occurring 100 times, it would just be the same dot, and the dot wouldn't get bigger. Of course, you could make the dot bigger yourself if you wanted to, or you could change the color or something like that. But really, if you do a scatter plot and the same thing happens 100 times, it's just gonna look like one dot.
Whereas for two-dimensional histograms, we can see that it's not just happening once; we can actually see the frequency of those two variables together. An example of a two-dimensional histogram would be ticket prices versus tickets sold. If you look at the lower left corner, we can see this red peak. That's cheaper ticket prices, and the tickets are also sold often, so we know that tickets at that price are sold quite often. These could be new, rising bands, or kind of standard bands that maybe you want to take someone on a date to: you don't want to spend too much money on a ticket, but a concert is still a nice idea. So that's a good ticket price that sells a lot of tickets, because it gives you the pleasure of the event without making it too expensive. And then if you move towards higher ticket prices, which would be these big bands, we can again see how many tickets were sold. If you want to see lots of tickets sold for a high price, the red peaks are going to give us all of these more famous artists. So that's one kind of application, but of course there are many, many better ones. It's just that once you're in the moment, you would realize, oh, this is when a two-dimensional histogram would be a great thing for me to use. So a lot of these graphs are great to know, and once you're in the moment it's much easier for you to pick out which graph would be the best representation. Finally, the last graph we're gonna look at is the box and whisker plot.
What box and whisker plots allow us to do is see the spread within our data. So it's not just like a bar plot, which shows us only one value; we can actually see the statistical spread. We can see median values, which is what we see here, we can see quartiles, and the little dots on the outside actually show us outliers. So box and whisker plots allow us to see the statistical information, but visually, and that makes comparing across different groups, which is what we're doing here, much easier. A good example of that would be ticket prices for football games for different teams. Different teams, of course, use different stadiums and have different popularities, and ticket prices for some teams may be much more expensive than for others, so we can compare these ticket prices using box and whisker plots. We can see the higher end of these costs, which are going to be the more luxurious seats, and at the bottom the less luxurious seats, probably the ones where you stand. Then you have middle values for the standard seats, depending on where you are in the stadium: whether you're close to the field, or further away from the field but still sitting. All of these things we can see here, and that's what gives us a spread. We can compare that spread across different teams. We can also see which teams are more expensive and where the prices vary the most. Maybe some teams have a super lounge, and they also have standing places that are just much cheaper, and so you would see a much larger spread.
Or maybe some teams just have only seats, and so you'd see a much lower spread. All of these things we're able to compare over different groups using box and whisker plots.
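Two of the ideas above can be sketched with numbers rather than pictures: a two-dimensional histogram keeps count of repeated points that a scatter plot would draw as a single dot, and a box and whisker plot is built from the median, the quartiles, and the outliers. The following Python sketch uses only the standard library and made-up data. The bin widths, the helper `to_cell`, and the 1.5 × IQR outlier rule are assumptions for illustration (the rule is a common convention, not something stated in the tutorial), and different tools compute quartiles slightly differently.

```python
from collections import Counter
import statistics

# Hypothetical (ticket_price, tickets_sold) observations; repeats are deliberate.
observations = [(20, 300), (20, 300), (20, 300), (25, 280),
                (80, 150), (80, 150), (120, 90)]

# A scatter plot draws each distinct pair once, so repeats collapse into one dot.
distinct_points = set(observations)

def to_cell(price, sold, price_width=50, sold_width=100):
    """Map a (price, sold) pair onto a grid cell by integer division."""
    return (price // price_width, sold // sold_width)

# A 2D histogram counts how many observations land in each grid cell.
cell_counts = Counter(to_cell(price, sold) for price, sold in observations)
print("dots in a scatter plot:", len(distinct_points))
print("busiest 2D histogram cell:", cell_counts.most_common(1)[0])

# Box and whisker ingredients for one hypothetical team's ticket prices.
ticket_prices = [15, 18, 20, 22, 25, 30, 35, 40, 45, 150]
q1, median, q3 = statistics.quantiles(ticket_prices, n=4)
iqr = q3 - q1  # the height of the box

# Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
outliers = [p for p in ticket_prices
            if p < q1 - 1.5 * iqr or p > q3 + 1.5 * iqr]
print("quartiles:", q1, median, q3, "outliers:", outliers)
```

Running the quartile part once per team would give the numbers needed to compare spreads across teams, which is the comparison the box and whisker plot makes visually.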
9. Three and Higher Variables Graphs: everyone, it's Max, and welcome back. In this tutorial, we're gonna talk about three and higher variable graphs. The graphs that we're gonna look at are heat maps, and then we'll also look at multi-variable bar plots, as well as how we can add more variables to some of the lower-dimensional graphs that we talked about earlier. All right, so let's start with heat maps. What heat maps allow us to do is plot two variables against each other in the x and the y, and then show an intensity or a size or something like that in the z direction, towards us. An example of this, which is what I've tried to illustrate on the right, is a customer moving through a store. We can track the path of the customer in the x and y directions of the store, so you get this bird's-eye view and see where they move to. The darker spots tell us the positions where they spend more time. We can see that they spent a little bit of time at the beginning, they moved in, and then they stopped once, which is where we see the dark spot. Maybe they found the candy aisle or something, and there was a specific piece of candy that they wanted, and then they moved on. Then they started to go around the corner a little bit. Maybe they reached the fruit and vegetable section there and picked out several things. And then they started to head towards the checkout counter, which happens at the very end, and they were moving at a more constant pace. Sometimes they stopped to look a little bit, but they just continued moving on. So the three variables that we've shown here are their x position in the store, their y position in the store, and, through the color, the time that they spent at each position. So that's what we can use heat maps for.
Another example of a heat map would be if you take a flashlight and move it over a screen: what you're showing is the amount of time that you've shone the flashlight onto a specific region. But usually, a heat map allows you to track positions. It's very often used for things like tracking customers through stores, or just tracking where people generally like to spend their time, and the intensity that you see in terms of the color is usually the amount of time that they spent there. All right, so we can also do multi-variable bar plots. These are very similar to a single bar plot, where we just plotted one value over different groups, but rather than plotting just one, we cram several together. An example of this would be that we plot, for each team, the goals scored, the shots taken on goal, and the shots on target. We can see that maybe there are teams that score less, but that's because they also shoot less and therefore also have fewer shots on target. Or maybe there are some teams that score a lot because they shoot a bunch, but they just don't hit the target that often. Maybe there are really good teams that score a lot and also shoot a lot on target. All of these things we're then able to compare over different groups. So that's what we can use multi-variable bar plots for: if there are several variables that together would give us a better understanding of the system than looking at them one at a time, and it would be really cool to compare all of them, we can plot them on the same bar plot. Then we can see how they change within a group.
We can also see how they change over different groups. Okay, and something else that we can do is just add extra dimensions to the lower-dimensional graphs that we've had. We're limited to three dimensions, because that's the number of spatial dimensions that we live in. But if we take a scatter plot, for example, where we started off with just the x and the y axes and points located on them, what we can do is add a third axis. We take the x and the y and add a z, and that gives us an extra depth dimension, which is exactly what we see here. So rather than just plotting on a two-dimensional field, like a plane, we can actually plot in a volume. We can see this kind of scattered ball that we've got here, located at the center of our plot. This can be really cool because it allows us to see depth too. The problem with this is that what we have is a snapshot every time, so really we're looking at two-dimensional snapshots. To get the best understanding, we need to rotate our scatter plots as we make them so that we can also add in our depth perception. Right now, what we're looking at may look three-dimensional, but really it's just a two-dimensional snapshot. Is our scatter plot located more towards us and more towards the left? Is it really high and close to us, or really low and far away? To understand all of these things, we need to be able to rotate our scatter plots so that we can see them from different angles, which then gives us this depth perception. We can do the same thing with a 3D line graph. Here we see an example of maybe the position of a skier as they're skiing down a hill.
And then we can trace that through time, and we see that they're going down the hill in this nice zigzag motion, as you should, and we can just track their position over time. So here we've added this extra dimension to the line graph. Rather than just taking maybe a position and a time, we've added a second position, or actually even a third position. So we've got the x, y, and z position, and then we just trace it over time, and that gives us this whole line here. That's how we can take these lower-dimensional plots that we've looked at before and add extra dimensions to them if we want. As long as it's still easy to see, and as long as it makes sense what we're looking at, we're really able to just slap on another direction there and compare another variable.
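The customer-in-a-store heat map described earlier in this tutorial comes down to three variables: an x position, a y position, and the time spent there. Here is a minimal Python sketch of how that third variable could be computed, using only the standard library. The path, the one-sample-per-second assumption, and the names `path`, `seconds_per_cell`, and `grid` are all hypothetical illustration choices.

```python
from collections import Counter

# Hypothetical (x, y) positions, sampled once per second as a customer
# walks through a store. Repeated positions mean the customer stood still.
path = [(0, 0), (1, 0), (1, 0), (1, 0), (2, 0),
        (2, 1), (2, 2), (2, 2), (3, 2)]

# Third variable: seconds spent in each cell = number of samples there.
seconds_per_cell = Counter(path)

# Lay the counts out on a grid: grid[y][x] is the time spent at (x, y).
width = max(x for x, _ in path) + 1
height = max(y for _, y in path) + 1
grid = [[seconds_per_cell.get((x, y), 0) for x in range(width)]
        for y in range(height)]

# Crude text heat map: bigger numbers are the "darker spots".
for row in reversed(grid):  # print the highest y first so y grows upward
    print(" ".join(str(value) if value else "." for value in row))
```

A plotting library would color each cell by its value instead of printing the number, but the underlying grid is the same.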
10. Programming in Data Science: everyone, it's Max, and welcome back. In this tutorial we're gonna touch on the third major area that is really great for data scientists, or that should be an essential for data scientists, which is the ability to program. Okay, so why do we program? There are different reasons why we want to be able to program. The first one is the ease of automation. The second is the ability to customize. And finally, there are many great external libraries for us to use that make our job so much easier. All right, so let's get started and talk about the ease of automation first. What do I mean by that? Well, being able to program really allows you to prototype fast, it allows us to automate things, and it also gives us the extra benefit that if we have something in our mind, we can just take that and put it into the computer by programming it. We're able to automate everything very fast, and we don't have to do these repetitive tasks, like copy-pasting stuff into or from Excel. If we want to repeat something, or quickly change a small thing, we don't have to do a lot of work. We can just change that in our code, click play, and let the computer take care of all of that for us, rather than having to do everything manually. So it's very easy for us to automate things, and that includes reports. It's very easy to create reports automatically. All you have to do is set up your program to deal with the data that you're going to give it, and then it can automatically create reports every week. The reports can be different because you give it different data. They should still look the same, but the values can be different. That would automatically create all these reports for you, and you don't have to do all of that yourself.
The program does it for you, but you've built the program and you're giving it this different data, so you're still doing all of the analysis. You just get to skip the part of copy-pasting, looking across and carrying over the values, and doing all of the formatting for the same report over and over and over again. All of that is taken care of for you. All you have to do is put in the right data, write out everything that you want to do, and then click play and let the computer handle all of that, because remember, that's what the computer is good at: doing these repetitive tasks. Okay, we also want to be able to program because it really allows us to customize. Once we go into data analysis and we see things, we get ideas that we want to expand on, or different directions that we want to take our analysis in. Being able to program allows us to take all of that, put it into code, and choose that direction, and we can very easily dive much deeper into our analysis and discover things fast, because it's up to us where we want to go. This ability to customize with programming is very, very important because we're not reliant on anything else. We're not reliant on some software that maybe breaks down, or that we don't know how to use perfectly, so that we have to read the manual or a help section. No, we know how to program, and we just type exactly what we want to do, exactly where we want to take it, and exactly what we want to see, and we can customize very, very fast with that. We can also prototype very, very fast with that. And if a visualization is not working, turning a scatter plot into a line plot is very easy. You just change one word.
So all of these things are very, very easy to do with programming, because we have all of that power at our fingertips and we can change everything that we're looking at and everything that's been calculated. Maybe we want to calculate an extra thing, or take out something else because it's irrelevant. All of these things we're able to customize, and all of that we can do because we're able to program. So really what we're doing is making the data ours. We're taking full control of the data, of where we want to go with our analysis, of what we want to see, and of what we want to show. All right, so let's first talk about libraries, and I'll also give you two great Python libraries that you should maybe feel comfortable with, or that you should consider using for data analysis. First of all, what are libraries? Well, libraries are pieces of code that have been pre-written by others that you can just take in and use. A very good example of this is a math library. It has all of the square root functions, taking to a power, taking the exponential, the sine, the cosine, all of these things that you know and want to use but don't want to program yourself. So it pretty much avoids that middle step of you having to program the equation to calculate a sine. Those are things that we don't want to do. We don't want to get distracted from our target. We want to be able to do exactly what we want to do without having to program completely unrelated stuff. That's what libraries are great for. They're developed by the community for everyone to use. Everyone is helping each other, and these libraries bring a lot of power with them.
And so one of these libraries is called pandas, and pandas is pretty much like Excel, but we can do programming with it, which just makes it so much better, because we can do things so fast with it. We can do all this customization. We can do all this automation. Whereas with Excel, if you give it too much to run, it can just start to crash, because it has to handle all of these other things, all of these visual things, the UI, and a lot more. It's not structured as well, whereas in programming, your computer just goes through everything step by step. It doesn't have to take care of all of these visual things. It just does the calculations down below. But we can still do all sorts of data management with it, so we can shift our data around. We can drop columns, and we can split things up, for example by row. We can pick out certain rows. We can even do statistical calculations on our data, so we can say, hey, calculate the mean for this. We don't even have to make our own formula for calculating the mean, or for calculating the standard deviation, or for calculating the correlation between different columns. All of that can be done with pandas with just a couple of keywords. And so it's really easy to do data analysis with it, because all of the functions are there, and when we know exactly what we want to do, we don't have to write the code for all of it. So if we wanted to look at correlations, we just say, hey, pandas, do correlations, rather than having to code all the correlations ourselves, coding that whole algorithm. That makes it really easy and really fast to get results and to get to where you're heading, because you don't have to go into any of these middle places. You can pretty much just skip the middleman of having to write all of those algorithms yourself, and you can just use them.
So you have your start, you have your idea, you know exactly what you want to do, and you can do exactly that to get to your goal. The other library that is very cool is Matplotlib, which is what I use a lot for data visualization. It allows me to create graphs, it allows me to visualize my data, and it allows a bunch of customization, so I can really just move everything around. I can move my spines. I can turn things on and off. All of these things are very easy to do with Matplotlib. There's a lot of great customization that I'm able to do with it. So these are the two basic Python libraries that you should probably get to know, or you can look at some of my other courses. One of them, pandas, would deal with the data analysis part, and Matplotlib would help you deal with the data visualization part of it.
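The pandas operations described above (dropping columns, picking out certain rows, and calculating the mean, standard deviation, and correlation) might look roughly like this. It is a minimal sketch: the column names and values are invented for illustration and do not come from the course.

```python
import pandas as pd

# Hypothetical data set, invented for this example
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 77, 90],
    "notes": ["a", "b", "c", "d", "e"],
})

df = df.drop(columns=["notes"])       # drop a column we don't need
tall = df[df["height_cm"] > 165]      # pick out certain rows

mean_height = df["height_cm"].mean()  # mean, no formula written by hand
std_height = df["height_cm"].std()    # sample standard deviation
correlations = df.corr()              # pairwise correlation between columns

print(mean_height, std_height)
print(correlations)
```

Each statistic is a single method call, which is the "couple of keywords" the speaker is referring to.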