Transcripts
1. Intro to Data Science: Hey everyone, it's Max, and welcome to my course on the essentials of data science. The first thing we're going to do here is give a short introduction to data science so that we understand what a data scientist is, and then we're going to cover the three big essential areas that you need to be a successful data scientist. All right, so what is data science? Well, you can summarize data science in different ways, but the main part of it is transforming data into information. And this is a really big step, because a lot of people talk about data and big data and all of these things, but data by itself isn't really that useful until you can turn it into information. If you just have a bunch of numbers appearing somewhere, and there's just so much of it, no one can make sense of that. That's where you need a data scientist to be able to take all of this vagueness and all of this noise that's going on and extract information from it. And that's what a data scientist does. Now, how you get this information is through analyzing your data. A big part of it is cleaning things up and doing some processing on it. Once you've cleaned things up a little bit, you analyze, and that is one of the ways that you can then get information out of your data. Through this analysis you can continue on and see trends and patterns and, hopefully, all types of correlations, and all of these things again build up into this turning-data-into-information component. And then ultimately, you also need to contextualize everything that you have, and your computer can't do that for you. A computer can crunch the numbers, but it's your responsibility to make sense of what's in front of you.
And even if you see something, you don't just blindly trust it. You need to understand: Where am I at? Where am I coming from? Where is this data coming from? You need to be able to contextualize these things and then, of course, be able to apply as well as understand them. So once you have this data, that's great, but turning it into information, into great information that you can use and directly apply, that's where the real power lies. And that's also the role of a data scientist. So that's pretty much what data science is. And so what does a data scientist do? Well, we already talked about this a little bit, but let's go over it again with some more concrete examples. A data scientist would, for example, get and process raw data and then convert it into something a little bit clearer. You can imagine a data stream coming in: you have this measuring device and it's constantly measuring all sorts of data, and because nothing is really constant, everything will be fluctuating up and down. A data scientist would then take all of this data and clean it up a little bit, maybe reduce the fluctuation that isn't supposed to be there, that's just background stuff going on, and then put it into a format so that you can easily plot it against other things. And then we already get to the next point: once the data is cleaner, you can start doing some calculations on it, figuring out the core statistical components, like what is the average value of these? What am I really dealing with? You're getting a first look, a first understanding of what it actually is that you're tackling.
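As a rough illustration of the clean-up-then-summarize step just described, here is a minimal Python sketch. The readings are made up for the example, and a simple moving average is only one assumed way to reduce fluctuation; the course does not say which method it has in mind.

```python
from statistics import mean


def moving_average(values, window=3):
    """Smooth a sequence by averaging each run of `window` consecutive readings."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


# Hypothetical readings from a measuring device that should sit around 10.
raw_readings = [10.2, 9.7, 10.4, 9.9, 10.1, 9.6, 10.3]

# Step 1: reduce the background fluctuation.
smoothed = moving_average(raw_readings, window=3)

# Step 2: get a first statistical look at the data.
overall_average = mean(raw_readings)

print("smoothed:", [round(v, 2) for v in smoothed])
print("average:", round(overall_average, 2))
```

The smoothed values swing less than the raw ones, which is the "reduce the fluctuation" idea, while the average gives the first overview of what you are dealing with.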
And then once you have this kind of understanding, you can start to do some visualizations, which help you as a data scientist maybe see some trends or patterns already. But visualizations are also really key because they let you show things to other people, and they're a great means of communication. So they help both you as a data scientist and others, when you try to convey this information to them. All right, and then finally, you have to suggest some applications of the information. It's not really enough to just look at it and say, yeah, I see it goes up and down and that's good. What does that mean? How does this transfer into something useful? That's also one of the key roles of a data scientist: transferring information into knowledge. So you've got this data-into-information step, but you also need to transfer this information into knowledge, and those are two really powerful things that are worth a lot. That's pretty much what a data scientist focuses on. And then you can go further and take this data and do machine learning with it or something, if you really understand what's going on or if you have some hypothesis of what could happen, so you can take things a lot further. But ultimately, turning data into information and then into knowledge, that's your role. All right, so let's go into the essential techniques, or the essential components, of data science. The first essential component, and we already touched on this, is statistics. We're going to cover this later on, but let's just give a quick rundown. In statistics you need to understand the different data types that you can encounter. Data can come in different ways, and we'll go into more detail with this later.
But it's not just that you get a bunch of numbers. Data can come in very many different ways, depending on the field that you're in. So you need to be prepared, and you need to be aware that data may not always just be a direct number for you. And then, of course, you need to understand some key statistical terms, like the different types of means, and also understand fluctuations in data. The reason this is important is that these key statistical terms give you an overview of how the data is behaving, and depending on how the data is behaving, you may want to approach it differently. If you know that your data is very clean and there is very little fluctuation, then if you visualize things, or if you want to fit some curves to it, you can probably trust what's going on. But if you see there's a lot of fluctuation in your data, visualizing it is going to be much more difficult, because you just see jumps everywhere and you're not really sure which of this is actually true and which of this is caused by some interference somewhere, or by someone having messed with your system. All of these things will be hinted to you through statistical terms. So it's good to be comfortable with these things and to be able to get some meaning out of them. All right, and then finally, in statistics you need to be able to split up, group, or segment data points. When you have this big data set, you want to be able to split it up into smaller things, compare different regions, look in more detail into some things, and maybe isolate two components because, hey, these things are probably going to be important, and the rest I don't really care about that much. So it's about being able to pinpoint and isolate and meddle with the data a little bit.
So these are the statistical components that we're going to look into. All right, so the next big thing, and we've already talked about this too, is data visualization. We'll see why data visualization is a really key skill for data scientists. And then we're also going to be covering different types of graphs that you can use and how you can compare different numbers of variables. For example, you can have one-variable graphs, where you only look at one thing and you want to see how it changes. You have your typical two-variable graphs, which you probably know, where you have an x and a y axis, and then you can see how two variables relate to each other. Or you can have three-variable or even higher-variable graphs, where you plot maybe three different things, or even more if you want, next to each other, as long as it makes sense, so that you can compare multiple things at the same time. All right, and now we come to the other big thing that you're probably going to need as a data scientist, which is the ability to program. Now, not every data scientist can do this, but in my opinion it's really essential to your role as a data scientist, because knowing how to program is going to make your life so much easier. If you know how to program, you can take your ideas and your thoughts and put them into action in the computer. You can automate everything, you can customize things, you can explore, you can prototype, you can test, and you're not reliant on some application. You don't have to master some application where, if it doesn't work or if one feature isn't there, you have to contact customer support, and maybe it's not even possible, and then you have to wait for an update, or maybe something is bugged. With programming, you're just so much more reliant on yourself, and you can really do whatever it is you want to do.
And you're not reliant on other people or on the tools that other people have built for you. Rather, you can pretty much just go and do what you want to do without there being major roadblocks. And then we'll also look at some essential packages in Python. In programming, you never want to reinvent the wheel. You always want to start off where the last person left off. So the ability to program and to write simple programs is something you would need to teach yourself, but you wouldn't need to write highly complex mathematical packages or data analysis packages. Those are already out there. All you need to do is download them and use them in your code, and they're going to work. They've been tested a lot, and there are huge communities working on them and improving them. All of this is for the community, and so the whole community works together to improve it. No one's really directly trying to make a lot of money off of it, so they're not going to charge you all of these service fees. Everyone's just trying to improve their package, because if it improves, everyone benefits from it. And so we'll talk about some of the libraries that you can use, especially in Python, to help you along your way with data analysis and to become a successful data scientist.
2. Statistical Data Types: Hey everyone, it's Max, and welcome back. In this chapter we're going to talk about statistical data types. We're going to look at the three different types of data, which are summarized as numerical, categorical, and ordinal. These are the types of data that we talked about before, when we said you can't just expect your data to be numerical. So we'll see numerical data, but we'll also see the two other types of data that you may encounter in your career as a data scientist. All right, so let's talk about numerical data first. Numerical data is also known as quantitative data, and it's pretty much things that you can measure. It's numerical stuff that you can do math with. You can compare it: saying this plus this makes sense, or A is greater than B. These are all examples of numerical data. Numerical data can be split up into two different segments. One of them is discrete. Discrete means the values only take on distinct numbers. An example of this would be a measurement of IQ or, if you do a coin toss, the number of times that you toss heads. You can have 15 heads or 12 heads out of 20 coin tosses. You can have 500 heads out of 1000 coin tosses, or 500 out of 600. But all of these are distinct numbers. Now, they don't have to be whole numbers specifically, but they do have to be distinct. That's the very important part: there's a step size that you're dealing with. And of course, you can still say, hey, flipping eight heads out of 20 is better than flipping seven heads out of 20 if you want to flip heads, or flipping eight out of 20 is worse than flipping seven out of 20 if you're going for as many tails as you can. So all of these comparisons make sense.
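To make the coin-toss example concrete, here is a small Python sketch. The function name `count_heads` and the seed are made up for illustration; the point is only that the count of heads always lands on one of a fixed set of distinct values with a step size of one.

```python
import random


def count_heads(num_tosses, rng):
    """Toss a fair coin `num_tosses` times and return how many came up heads."""
    return sum(1 for _ in range(num_tosses) if rng.random() < 0.5)


rng = random.Random(42)  # fixed seed, chosen arbitrarily so runs repeat
heads = count_heads(20, rng)

# With 20 tosses the result is always one of the whole numbers 0, 1, ..., 20.
possible_outcomes = list(range(0, 21))

print("heads:", heads)
print("is a valid discrete outcome:", heads in possible_outcomes)
print("8 heads beats 7 heads:", 8 > 7)
```

Because the values are numerical, comparisons such as `8 > 7` are meaningful, which is what separates discrete numerical data from plain categories.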
So that's the discrete part of numerical data. Then we have the continuous part. With continuous data, values can take on any number and they're not limited by decimal places. If a value can be 1.1 and then the next value would be 1.2, that's not continuous. That's still discrete, because you have this step size of 0.1. Continuous means literally every number from start to finish can be taken on. And this doesn't mean every possible number in the universe, from negative infinity to plus infinity and all imaginary numbers and everything that comes with it. That's not required for continuous. It could really be that just every number between 0 and 1 is taken on. For example, let's say you have a bottle of water, and this bottle can hold one liter. If your bottle starts off empty and you fill it all the way up to the top, the amount of water in it needed to take on every single number between 0 and 1, because you can't just fill up water in small increments and say, hey, I'm going to put in 0.2 liters every single time. The water doesn't just teleport from A to B. When you're pouring in water, it's more like a stream, as we see here, and the water level rises and rises and rises. So the amount of water that we have in our cup needs to take on every value between 0 and 1. That's an example of continuous data, but you see that we can be limited to be between 0 and 1. We don't have to start at 0 and go all the way up to infinity or something. It's just that within the range we're looking at, every single number can happen. Another good example would be the speed of a car. You're standing still at a stoplight, and then you want to accelerate, and the speed limit is, say, 50 miles an hour or something.
To get to 50 miles an hour from your starting position, your car has to take on every single speed in between. And of course, on your speedometer you would see something like 0 miles an hour, 1 mile an hour, or maybe it goes 0.1, 0.2, 0.3 or something like that. So it may look discrete to you, but that's not how your car is going. Your car doesn't say, oh, I'm going to go in these step sizes of speed. It's going to accelerate and take on every value starting from 0 and going up to 50 miles an hour. When you're in this transition, you're going to take on every single one of those speed values. So that's what continuous data looks like. It's important to understand the difference between discrete and continuous, just because you may want to approach them differently. Now of course, if we're dealing with computers, our computers can't deal with an infinite number of decimal places. We have to cut it off somewhere. And so usually continuous data is going to be rounded off at some point. But it's still important for you to know that you're dealing with continuous data rather than discrete, so that you know, hey, there can still be other values in between here, rather than having specific step sizes where all you see is a bunch of lines, one at every step size. With continuous data you can expect that everything is filled in, and values can, and may very well, lie in between certain places. So that's the important thing to note between discrete and continuous. All right, so the next type of data that we'll have is categorical. Categorical data doesn't really have a mathematical meaning, and you may also know it as qualitative data. Categorical data describes characteristics. So a good example of this would be gender.
So here, there is no real mathematical meaning to gender. Of course, if you have data, you can say male is 0 and female is 1. But you can't really compare the two numbers even though you assign numbers to them. You may just do this so that you can split the data up later on and your computer can understand it, but it doesn't really make any sense to compare. You can say male is not equal to female, but you can't really say one is greater than the other, or one is approximately equal to the other. Those things don't make sense because they're not well defined. What would that mean? And you can't really add them up either. You can't say male plus female; that doesn't give you a third category or something. So with categories, you can't really apply math to them, but they are nice ways to split up or group your data, and they provide these nice qualitative pieces of information that are still important. It's just that you can't really go about plotting them on a line or something like that. So those are important things to note with categorical data. Another example would be ethnicity, or you could also have nationality. All of these things are examples of categorical types of data. And like we said, you can assign numbers to them, but that's really just for your code, so that it's easy to split them up. You still can't really compare them. How are you going to compare nationalities? There is really no definition for comparing one category to another. All right, and so the third type of data that you can encounter is something called ordinal data. Ordinal data is a mixture of numerical and categorical data. A good example of this would be hotel ratings. So you have star ratings of 0, 1, 2, 3, 4, or 5 stars, or maybe even six stars or whatever it is. 
Whatever hotels go up to these days. But it's still not as straightforward to compare. I'm sure you've seen two different types of three-star hotels. One of them had the bare minimum: the beds were okay, but it wasn't really anything special. And then you had those three-star hotels that you could have sworn were at least four-star. Star ratings do make sense. We can say a four-star hotel is probably better than a three-star hotel, because there are standards for these things and they have been checked. If you go to a four-star hotel, you know what to expect. But still, it's not completely defined. Coming back to this three-star example, if you just say, hey, we're going to a three-star hotel, it's very hard to know exactly what to expect, because there are different types of three-star hotels. There are three-star hotels that have developed to, like, have a swimming pool maybe or something like that. And then there are those three-star hotels that are really more like hostels that have just made it past the two-star mark. And so there, it's much harder to define or just know what to expect. Now, if you take averages of the star ratings, though, then you do get a much better idea of what's going on. If you have consumer reviews or something like that and you say, oh, from 500 reviews, our hotel has an average rating of 3.8, then you know that the three-star hotel you're looking at is pretty much a four-star hotel. It feels like a four-star hotel; even though it may not have all of those qualifying characteristics, that's the kind of feeling you get from it. Whereas for another three-star hotel, you may have a rating of 2.9 or something, and there you know this hotel is more towards the lower end of three stars. Some people may not even consider it to be three stars.
And of course, this rating may be a little bit biased, because maybe they went to a different three-star hotel first, and then they went to this one and were expecting something completely different from a three-star hotel. So they said, this can't be three stars, this is two stars. But that's because of the way the ranking system is defined underneath. And so when we have these averages, these ordinal numbers then start to make a little bit more sense. All right, so let's go over a small exercise and see if we can identify what type of data we're dealing with. The first thing we'll look at is a survey response on happiness. You have people filling out a survey, and one of the questions is, how would you rate your happiness? The options are bad, neutral, good, or excellent. What type of data would this be? Well, this would be an ordinal type of data, because it's still a form of categories, and you're asking for a subjective opinion, but you can still compare them. You can say excellent is greater than good, good is greater than neutral, and neutral is greater than bad. But what exactly does it mean to be good versus excellent? Where do different people draw the line for this? There's still a little bit of vagueness involved, but generally it does make sense and you can compare it. And if you have a lot of surveys and you average them, the values you get are probably going to be very representative, or at least pretty representative. All right, so let's look at the next thing, which is the height of a child. What type of data is that? Now, we can say it's probably numerical, and well, it actually most definitely is numerical. So the height of a child is a numerical value. But let's go a little bit deeper and ask, is the height of a child discrete or is the height of a child continuous? 
Well, even though when you measure height you get something like five foot, five foot three, or 160 centimeters or something like that, it's not a discrete value, because to get to that height, you have to have reached every single height before it. And so even though at the moment you're measuring it you're rounding it off to what your measuring tape can measure, it's your measuring tape that's limiting the height. If you had a super, super precise measuring instrument, you could measure not just five foot three or something like that. You could really go into detail with the inches and the decimal places and everything. So the height of a child would be a numerical data type, but it would be continuous. All right, now let's talk about the weight of an adult. Do you expect the weight of an adult to be discrete or continuous? We can probably agree that it's numerical, because it's a weight value; it's pretty much defined to be a number. But do you expect it to be discrete or continuous? Well, the right answer here is going to be continuous again, because to reach a certain weight, they would have had to reach every single weight in between before. So again, weight is something that we can consider to be continuous. All right, and so finally, let's look at the number of coins in your wallet. Again, we can already tell by the name: it says the number of coins. So we can probably agree that this is a numerical type of data, but would the number of coins in your wallet be discrete or continuous? Well, the answer would be discrete, because it doesn't really matter what denomination your coins are. They could be 50-cent pieces, they could be 25-cent pieces, 10 or five or ones, or a two or something like that. But the number of coins that you're going to have is going to sum up to a whole number.
So you can have one coin, you can have two, you can have three, but you can't have infinite fractions of a coin. You can't have, say, the square root of 2 coins; that doesn't really make sense. So you have a defined step size. You have one coin, and then if you have a second coin you have two, and with a third you have three. You're going in step sizes of one. So for the number of coins in your wallet, we'd be having discrete numerical data.
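The exercise above can be sketched in Python roughly as follows. This is a minimal illustration rather than a standard encoding: the survey responses, the rank mapping `RANK`, and the helper `is_better` are all made up for the example.

```python
from statistics import mean

# Ordinal: categories with a meaningful order (the survey answers from the exercise).
HAPPINESS_ORDER = ["bad", "neutral", "good", "excellent"]
RANK = {label: position for position, label in enumerate(HAPPINESS_ORDER)}


def is_better(a, b):
    """Return True if response `a` ranks above response `b`."""
    return RANK[a] > RANK[b]


# Hypothetical survey responses; averaging their ranks summarizes many opinions.
responses = ["good", "excellent", "neutral", "good", "bad", "good"]
average_rank = mean(RANK[r] for r in responses)

# Categorical: labels can be checked for equality, but have no order or arithmetic.
nationality_a, nationality_b = "US", "France"
same_nationality = nationality_a == nationality_b

# Discrete numerical: a coin count moves in steps of one.
coins_in_wallet = 7

# Continuous numerical: any value is possible; we only store a rounded measurement.
height_cm = round(160.4372, 1)

print(is_better("excellent", "good"), round(average_rank, 2))
print(same_nationality, coins_in_wallet, height_cm)
```

Note that the ranks 0 to 3 only encode the order. How far "good" is from "excellent" is still a judgment call, which is why the average is a summary of opinions rather than a precise measurement.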
3. Types of Averages: Hey everyone, it's Max, and welcome back. In this tutorial, we're going to talk about the different types of averages. We're going to see the three different types of averages, which are the mean, the median, and the mode. All right, let's get started. We'll start off with the mean. The mean is the typical average that you know. To get the mean, you just sum all of your values up and then divide by the total number of values that you have. The great pro of the mean is that it's very easy to understand. It makes sense: we just take everything we have, add it all up, and then divide it by how many values we have, and that should give us a good representation of what the average is. It also takes into account all of the data. Since we're adding everything up and then dividing by how much data we have, we're taking into consideration every single data point. Now, there are some problems with this. One of the problems is that the mean may not always be the best description, and we'll see why when we look at examples of when we should use the median and the mode. The mean is also very heavily affected by outliers. Since we're taking everything into consideration, if we have big outliers, that's really going to change what our mean looks like. If we just have normal values between 1 and 5, and all of a sudden we have something like 10,000 in there, that's really going to affect our mean. So the mean is heavily influenced by outliers, and the bigger the outlier, the more the mean is influenced by it. All right, so let's see some examples of the mean. We'll go through a worked example first, and we can see our dataset here, which is just a bunch of numbers. What we're going to do to calculate the mean is take every single one of these numbers and add them up. And we can see the total result that we get here. 
And then the next thing we're going to do is take this total result, count the number of data points that we have, and divide one by the other, which then gives us our mean, as we can see here. So that's an example calculation of the mean, but let's see some example applications of the mean. When would we use it? Well, a good application would be the time it takes you to walk to the supermarket. Sometimes you walk a little bit faster and maybe it takes you 20 minutes to get there. Sometimes you walk a little bit slower and it takes you 25. But on average it takes you something like 22, or maybe 22 and a half minutes. So if you say, I'm going to go to the supermarket, you know about how much time it's going to take you to get there. Another good example of the mean would be the exam score for a class. To get a good understanding of how people do in an exam or in a class, you can look at the mean exam score from last year. And since exam scores are in a smaller range, the mean is going to be good to use, because the scores can only fall between 0 and the maximum score. And realistically speaking, no one's probably going to get a 0, so your range is even smaller, and so you're less affected by outliers. You also kind of know how hard the class is going to be just by being able to compare the means. If you look at one class and its mean is higher than the other's, and both have a large number of students or something, then you can probably say, hey, it's easier to get a good grade here. So you get some of these simpler overviews without diving too deep into it. All right, another good example of the mean would be how much chocolate you require when you get a sweet craving. You're not going to say, oh, I require one chocolate bar, two chocolate bars or three. 
But you're going to say, oh, on average, I require maybe three-quarters of a chocolate bar. Sometimes I may want a little bit more because I feel like it, and when I start eating chocolate, I crave it even more. Sometimes I have a bit at first and the taste just doesn't sit right with me right now, and so I have a little bit less. But that's about the amount. So if you have this craving, either you say, oh, I'm going to try to be strong, or you say, well, I know this feeling, and I know that if I eat about three-quarters of a bar of chocolate, I'm going to feel good and my craving is going to be satisfied. So you know what to expect. These are some examples of when we would use the mean. All right, so let's look at the next thing, which is the median. The median represents the middle value in your dataset. Now, if you have an even number of data points, you don't really have a middle value. In that case, the median is going to be the mean of the two middle values: the two middle values added together and then divided by two. One pro of using the median is that it can sometimes be more accurate than the mean, and we'll see some examples of this. The median also evenly splits your data, so you're not affected the way the mean is. If you have an outlier, it drags the mean to the right, and it could be that your outlier drags things so far to the right that all of your data is to the left of the mean and only the outlier is to the right. That would be an extreme case, but it can happen. Whereas the median is always located directly in the center of your data. And the median also doesn't care about outliers. 
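The mean and median rules described so far can be sketched in a few lines of Python. The helper names `mean_by_hand` and `median_by_hand` are made up for this illustration, and the data reuses the "values between 1 and 5 plus a 10,000 outlier" idea from earlier; Python's built-in `statistics.mean` and `statistics.median` do the same job.

```python
from statistics import median


def mean_by_hand(values):
    """Sum all values and divide by how many there are."""
    if not values:
        raise ValueError("need at least one value")
    return sum(values) / len(values)


def median_by_hand(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]  # odd count: e.g. the 7th of 13 points
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count


typical = [1, 2, 3, 4, 5]
with_outlier = typical + [10_000]

print("mean:", mean_by_hand(typical), "->", round(mean_by_hand(with_outlier), 1))
print("median:", median_by_hand(typical), "->", median_by_hand(with_outlier))
```

Adding the single outlier moves the mean from 3 to roughly 1669, so every ordinary value ends up below the mean, while the median only moves from 3 to 3.5.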
So if you have huge outliers at the beginning and at the end, it doesn't really care, because outliers by definition aren't very common. If you have some at the beginning or some at the end, they're going to be very few in number, which is what makes them outliers, and therefore the median doesn't really care about them that much. A con, though, is that the median doesn't really give you much information on the rest of the data. Sure, you know what's at the center, but you don't know how everything around it behaves. You only know where the center of the data is. So let's see some examples. We'll do a worked example first, where we see our dataset here. We can count how many values we have: if we go from left to right, we've got 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 data points. So we've got an odd number, and our median value, our center value, is going to be the seventh data point, because it's six from the beginning and it's also six from the end. It's equally spaced from both the beginning and the end. And so that's why we see our median value here is 26. It's located directly in the center. Now, what is the median useful for? Well, the median is often used if you look at household incomes for a country. If you were to use the mean, then the billionaires would give you a completely false description of what an average household income really is. Normally you might say the average household income for a family is, say, $40,000 or something like that, and that would be the median value. But if you were to use the mean instead, then all of the billionaires and millionaires in the country would change that household income, and then you would say, oh, the average household income per family would look like 60 K. 
And that's a bad representation, because it doesn't actually give you a realistic look at what the average household has. The average household really is centered at something like 40K. Sure, there are people below that and there will be people above it, but that's what's in the middle. Whereas if you were to use the mean for your average, you would get this inflated household income, which wouldn't be representative of the rest of the country.
Another good example with the median would be the distance that people cover to get to work. If you look at this in terms of kilometers, you can say, oh, some people walk to work and it's one kilometer at most, something like that. And you can expect most people to travel around three kilometers to work. Sure, there are some that travel much further because they want to live outside of the city, and there are some that travel very, very short distances because they have a house right next to the office, or their house is the office, depending on where you're working. Then you can look at the middle: how do people travel to work, what time or what distance do they need to cover? So that would be another good use of the median.
Another good median value is what you usually spend when you buy a new item of clothing. Sure, sometimes you may go to that expensive clothing store and get a jacket that costs, I don't know, north of a couple hundred euros or dollars, whatever currency you want to use. And sometimes you can go to a secondhand store and get it very cheap. But usually if you go into stores, a jacket maybe costs about $100 or something like that. And so if you go out, you can expect to pay about $100, not really taking much into account what store you're going into.
So most of the stores that you're going to visit are going to have that price for the jacket. That would be another good use for the median.
All right, let's look at the third type of average that we can use, which is the mode. Now, the mode looks at the most common value in your data. It's not really defined if there are several most common values, but if there's only one most frequently occurring value, then that's your mode. We'll see an example of this in a second. One of the pros of using the mode is that it's not only applicable to numerical data. If you look at categories, for example, you can say, hey, we've got five people from the US, two from Canada, and one from France, and you know that the mode is going to be the US because there are five people from the US. So the mode is a great average that you can also apply to categories or to ordinal numbers if you want. You can say the most common country, or the average kind of country that we would expect here, is the US. Sure, there are other countries, but the most common one is the US in this case. And then, of course, the other pro is that it lets us see what's most common, what pops up the most. A great use of the mode is in cases where values recur a lot, which is the case for discrete numbers, for example. In discrete numbers, values recur often, and so it's good to use the mode. A con of the mode is that, again, it doesn't really give you a good understanding of the rest of the data, similar to what we had for the median. It's also not really applicable if you just have a bunch of different values: there isn't really going to be a mode if there's not enough of each value. You don't want to have thousands of data points where the most recurring value recurs only about three times. That's not good. You want to use the mode in situations where data recurs often.
So we saw the country example, but let's actually see a worked example and also some other examples for the mode. For the worked example, we again take our dataset and count how many times the different numbers appear. If we go through the numbers, we'll see that 26 occurs the most, and so that's going to be our mode here. We've got 22 and 25 that both occur twice, but 26 occurs three times, so 26 is our mode, our most frequently occurring value.
Now, the mode is going to be useful for things like the peak of a histogram. If you don't know what a histogram is, don't worry, we'll cover that in a later lecture when we go into data visualization. But the peak of a histogram shows you the mode of the data, the most frequently occurring datum. Another good use of the mode would be if you look at employee income at a company. Because at a company you can again have the boss, who throws off the mean, and you can have higher-level employees too, who kind of shift the median. But say a third, or say 40 percent, of the employees earn minimum wage. Probably not your employees, because that wouldn't be a very good system to have, but at the company that you're looking at. That's not a really good thing to have, and if you look at the mode, you'll easily see that the average in this case is to earn minimum wage, because that's what the largest group of people earns. Sure, the boss or the CEO may shift the mean up heavily. And because you have the higher-ups, the median value may well be so far to the right that you don't really consider these employees who all earn the same amount. But you really want to get that description, which is what you get here from the mode.
And then the outcome of an election is also something you use the mode for. Sure, sometimes you may only have two values, sometimes you may have three. But if you have, say, five different candidates, then the person with the most votes is going to win the election because they have the most. And so there, again, you use the mode.
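To make the three averages a bit more concrete, here is a minimal Python sketch using the standard `statistics` module. The slide's actual dataset isn't in the transcript, so the 13 values below are made up to match what is described: the seventh value is 26, 26 occurs three times, and 22 and 25 occur twice each. The income figures are likewise only illustrative.

```python
import statistics

# Hypothetical 13-point dataset (the lecture's slide values are not in the
# transcript); built so that the median and the mode are both 26.
data = [20, 22, 22, 24, 25, 25, 26, 26, 26, 28, 30, 31, 35]

mean_value = statistics.mean(data)      # sum of the values divided by the count
median_value = statistics.median(data)  # the 7th of the 13 sorted values
mode_value = statistics.mode(data)      # the most frequently occurring value

# Even number of points: the median is the mean of the two middle values.
even_median = statistics.median([1, 2, 3, 10])  # (2 + 3) / 2

# One extreme outlier drags the mean far to the right, the median barely moves.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 1_000_000_000]
mean_with_outlier = statistics.mean(incomes)
median_with_outlier = statistics.median(incomes)

# The mode also works on categories, not only on numbers.
countries = ["US"] * 5 + ["Canada"] * 2 + ["France"]
country_mode = statistics.mode(countries)

print("mean:", round(mean_value, 2), "median:", median_value, "mode:", mode_value)
print("income mean:", round(mean_with_outlier), "income median:", median_with_outlier)
print("most common country:", country_mode)
```

Note how the income mean ends up far above every ordinary household in the list, while the median stays in the middle of them. That's the billionaire effect from the household income example.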
4. Spread of Data: Hey everyone, it's Max, and welcome back to my tutorial. In this lecture, we're going to look at the spread of data. We're going to start off by looking at the terms range and domain. Then we're going to move on to understanding what variance and standard deviation mean. And then finally, we'll look at covariance as well as correlation.
All right, so let's start off with the range. The range is basically the difference between the maximum and the minimum value in our dataset. That's fairly simple to think about, so let's go through it with a worked example. Let's set up a company in a town, and this is the only company in the town. The owner of the company earns a salary of 200K a year. The employees all have different salaries, but the lowest-paid employees, or maybe the part-time workers, earn something like 15K a year. So we've got data ranging from 15K up to 200K. Our range is the difference between the maximum and the minimum value in our data, so we take 200K and subtract 15K from it, and we've got a range of 185K in salary. That's how much our salary can change: if we start at 15K, it can go all the way up to 200K. So that's a 185K range of salaries that people in this company can have.
And the domain is going to be the values that our data points can take on, or the region that our data points lie in. If we look at this example again, our domain starts at 15K and goes up to 200K. So the domain defines the starting and ending points, or a section, of our data. In this case, it would start at 15K and end at 200K. And what the domain tells us is that all salaries between 15K and 200K are possible.
But within this company, it's not possible to have salaries outside of this domain. So if our domain again is 15K to 200K, then we can't have a salary of 14K, because that's outside of our domain. And we also can't have a salary of 205K, because again, that's outside of our domain. So pretty much all salaries within 15K to 200K are possible, and anything outside is not possible, because that's no longer in our domain.
All right, so let's move on and look at the variance and standard deviation. We'll talk about the variance first. The variance pretty much tells us how much our data differs from the mean value. It looks at each value, it looks at how different that value is from the mean, and then it gives us the variance. It does some calculation, and we don't really need to know the formula; it's more important right now just to understand the concept. What the variance really tells us is how much our data fluctuates. If we have a high variance, that means a lot of our values differ greatly from the mean value, and that makes our variance bigger. If we have a low variance, that means a lot of our values are very close to the mean value, and that makes our variance lower. Now, if we turn to the standard deviation, the standard deviation is literally just the square root of the variance. So if you understand one, then you also understand the other.
And now, if we know the range of our data, we can combine these to get a better feel for the data. Let's use an example where we have two different countries, countries A and B. They have the same mean height for women, which in this case we'll say is 165 centimeters, or about five feet four inches. And we'll say that the range of heights for them is identical. Let's say the range could be something like 30 centimeters.
Heights can go anywhere from, say, 150 all the way up to 180 centimeters. Or we can even widen that and say anywhere from as low as 140 centimeters up to two meters or something like that. But let's just keep the range for both countries the same, and they both have the same mean height. Now, if country A has a standard deviation of five centimeters, which is approximately two inches, and country B has a standard deviation of ten centimeters, which is approximately four inches, then what you can expect, knowing these values, is that if you go into country A, the people that you're going to see are going to be much more similar in height. The standard deviation is lower, which means the values differ less from the mean. And so a lot of the women that you're going to see are going to be very close to 165 centimeters, or five feet four, plus or minus two inches. So what you can expect when you go to this country is that a lot of the women are going to be about that height. Whereas in country B, they have a much larger standard deviation, and so you can't really expect everyone to be about five feet four, because height fluctuates a lot more. If you go to that country, you can expect to see a lot more women of different heights, both taller and shorter than five feet four. All right, and so that's how we can use the variance and the standard deviation to give us a little more perspective on our data and allow us to infer some things about it.
All right, so let's talk about covariance and correlation. Covariance already has the name variance in it, but covariance is measured between two different variables. So let's say we've got, you know, me drinking coffee in the morning and my general tiredness.
So I use these two variables and get data points: this is how much coffee I drank in the morning, and this is how tired I feel this morning, or something like that. What the covariance does is look at how much one of these values changes when the other one changes. So what does that mean? Well, if I drink more coffee, the covariance looks at how much my tiredness changes. That's what you do with covariance: you say, I change one, how much does that affect the other thing that I'm looking at?
Now, correlation is very similar to covariance. We normalize the covariance by dividing by the standard deviation of each variable. What that means is we get the covariance for my drinking coffee versus feeling tired, and then we divide by the standard deviation of me drinking coffee and the standard deviation of me feeling tired. So really what we're doing with the correlation is bringing it down to relative terms that fit our data better. That's kind of the abstract idea. The important thing to keep in mind is that we're looking at one variable, seeing how much it changes, and seeing how much that change affects the other one.
All right, so there are different correlation values that we can have, and they can range anywhere between negative one and one. So their domain is between negative one and one. A correlation of one means a perfect positive correlation: when one variable goes up, the other goes up. For my coffee example, that would be if I have coffee in the morning, then I also feel happier. So the more coffee I have, the happier I feel. Of course there's going to be a limit, but let's say I only drink up to two cups of coffee or something like that, and I can drink anything in between, and the more I have, the happier I am about it. That would be a positive correlation.
The more I have of coffee, the more I have of happiness, and so they go up together. Then, as we get closer to zero, the zero point means no correlation. Anything between zero and one is going to be a slightly positive correlation; it's not going to be super strong. We'll see some examples on the next slide. And the closer you get to zero, the more it means no correlation. An example for the zero case would be that it doesn't matter how much coffee I drink in the morning, it's not going to affect the weather. They're unrelated; one does not affect the other. I could drink one cup of coffee on a sunny day and one cup of coffee on a rainy day, and it's not going to change the weather. So they're pretty much uncorrelated.
And then we can also go down into the negative range. The closer we get to negative one, or if we reach exactly negative one, we have a perfectly negative correlation. Here we can take our example of coffee versus tiredness: the more coffee I have, the less tired I'm going to be. So coffee goes up and tiredness goes down. That's how we can understand correlation, and it comes from the covariance, so it is important to understand the covariance. We usually use the correlation, though, because dividing by the standard deviation of each variable puts it on a fixed scale that's much easier to compare.
Now, there is one thing that's very important to remember, and that's that correlation does not imply causation. Just because two things are correlated, that does not mean that one causes the other. Here's a good example of this.
Say I live in a climate where it's usually cloudy in the morning and sunny in the afternoon. Every morning when it's cloudy, I drink coffee, and then it becomes sunny in the afternoon. Even though me drinking coffee and it becoming sunny may be correlated, me drinking coffee does not cause it to be sunny. It's just that both happen every day, and so this kind of correlation appears. That does not mean that my drinking coffee results in the weather getting better. A causation would be me drinking coffee and feeling less tired, or me drinking coffee and feeling happy about it because I like the taste. Those would be causations. So that's an important thing to keep in mind: just because things are correlated does not mean that one causes the other.
All right, so let's see these things on a graph. Here we have the examples that we've talked about again, and we can see what the data would look like for different types of correlations. We can see a perfect correlation of one on the left side: one value goes up, the other value goes up with it, and we pretty much get this really nice straight line. Then the closer we get to zero, the less correlation there is between them, and the more scatter we have in the data. You'll notice that for the case of perfect correlation, which is the one, or the case of perfect anti-correlation, which is the minus one, where we had the example of more coffee, less tired, we have a very nice thin line and our data doesn't jump around a lot. But the closer we get to zero, the less we can see one following the other, and the more we can see our data spread out. And so that's what correlation looks like in terms of graphs.
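The lecture skips the formulas on purpose, but a short Python sketch can show how the pieces of this section fit together. Everything here is an assumption made for illustration: the salary list (only the 15K and 200K endpoints come from the lecture), the two height samples, the coffee numbers, and the choice of population rather than sample formulas. The `covariance` and `correlation` helpers are written out by hand so each step is visible.

```python
import math
import statistics

# Hypothetical salaries in thousands; only the 15 and 200 endpoints are from the lecture.
salaries_k = [15, 22, 30, 41, 58, 200]
salary_domain = (min(salaries_k), max(salaries_k))  # where the data lies
salary_range = salary_domain[1] - salary_domain[0]  # max minus min

# Hypothetical heights in cm: same mean (165) and same range (30), different spread.
heights_a = [150, 162, 164, 165, 165, 166, 168, 180]
heights_b = [150, 152, 158, 165, 165, 172, 178, 180]
std_a = math.sqrt(statistics.pvariance(heights_a))  # std is the square root of the variance
std_b = math.sqrt(statistics.pvariance(heights_b))

def covariance(xs, ys):
    """Population covariance: average product of the deviations from each mean."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the standard deviation of each variable."""
    return covariance(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))

# Made-up coffee data: tiredness falls and happiness rises in a perfectly straight line.
coffee_cups = [0, 0.5, 1, 1.5, 2]
tiredness = [10, 8, 6, 4, 2]
happiness = [1, 2, 3, 4, 5]

print("salary range:", salary_range, "domain:", salary_domain)
print("std A:", round(std_a, 2), "std B:", round(std_b, 2))
print("coffee vs tiredness:", round(correlation(coffee_cups, tiredness), 3))
print("coffee vs happiness:", round(correlation(coffee_cups, happiness), 3))
```

Because the made-up coffee data lies exactly on a straight line, the two correlations come out at the extremes of minus one and plus one. Real data would scatter and land somewhere in between.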
5. Quantiles and Percentiles: Hey everyone, it's Max, and welcome back. In this tutorial, we're going to go through quantiles and percentiles. All right, so let's get started. So what are quantiles? Well, quantiles allow us to split our data into regions that, if we're dealing with probability, all have the same probability of occurring. Or, if we're just dealing with sizes of data, we split our data into equally sized regions. So that's what we do with quantiles: we split everything up so that every region holds an equal amount of data.
An example of a quantile would be something known as a quartile. That's when we split our data into four equal regions, hence the name quartile. So quantile is the general name for this splitting procedure, and if we say quartile, that means we're doing quantiles for four equal regions. This is something that you would probably often see on university admissions pages or something like that. They say the top 25 percent of our admitted students have a test score of at least 90 percent or something. And then they would say the bottom 25 percent of our admitted students have a test score of, I don't know, 70 or 75 percent or something like that. And then the median test score is 85 percent. So that's how you go about quartiles: you have the lowest 25 percent, then 25 to 50, then 50 to 75, and then the top 25 percent, so 75 to 100. You've got these four equal regions, which include your minimum value at the very bottom and your maximum at the very top. And in the middle, you've got your median value, the value directly in the middle, because you're splitting the data into four equal regions.
And so the value that separates the second quartile, which would be from 25 to 50, from the third quartile, which would be from 50 to 75, is the median value.
All right, now percentiles, which is a name you've probably heard before. A percentile is again an example of a quantile. But instead of splitting into four, as with a quartile, a percentile means splitting into 100 equal segments. That's where the "percent" at the beginning of the name comes from. Percent means out of 100, so if you're familiar with percent, it's the same reasoning. So percentiles mean splitting into 100 equal segments. An example of this is often used in test scores. If you've ever taken something like the SATs, you get a test score, but you also get a percentile. The reason that's done is to judge not you versus the test, but you versus everyone else. So if it's a difficult test, then getting a test score of 60 percent while being in the 95th percentile means your score is actually a lot better than it looks. What you can say with percentiles is that the percentile you're in tells you what share of the other people you did better than. For example, if you reach the 99th percentile, that means you did better than 99 percent of the people that took the test; the 95th percentile means you did better than 95 percent of the people that took the test. And so that's why percentiles are often used for tests, and they're often used for normalization, because they allow you to take into consideration factors like: is it a difficult test, is it an easier test, maybe more people are scoring higher.
So they don't really judge you directly versus the test; they normalize you against everyone else that took the test. You take the test, you get a score, and then the percentile checks where that score lies relative to everyone else. And so these percentiles give you a good normalization and allow you to do great comparisons, because they eliminate some of these factors, like test difficulty. Of course, there can always be luck involved, and that may not get filtered out on an individual basis. But this is done for a lot of students, and that's also why, in these big standardized tests, you get a percentile along with your score: so that even if your score is lower but the test was really hard, you can still see, you know, I did really well, because people found this test really hard and it was even harder for them than it was for me.
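As a rough illustration of quartiles and percentiles, here is a small Python sketch. The 20 test scores are made up, and `percentile_rank` is a hypothetical helper using one simple convention (the share of scores strictly below yours); real testing agencies may define it slightly differently. `statistics.quantiles` is a standard-library function that returns the cut points between equal-sized groups.

```python
import statistics

# Hypothetical results of a hard test: 20 people, scores in percent.
scores = list(range(22, 60, 2)) + [60]  # 22, 24, ..., 58, and one top score of 60

# Quartiles: three cut points that split the data into four equal groups.
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Percentiles: 99 cut points that split the data into 100 equal groups.
percentile_cuts = statistics.quantiles(scores, n=100)

def percentile_rank(all_scores, score):
    """Percentage of test takers who scored strictly below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

my_rank = percentile_rank(scores, 60)

print("quartile cut points:", q1, q2, q3)
print("median:", statistics.median(scores))
print("a raw score of 60 is at percentile rank", my_rank)
```

The middle cut point is the same number as the median, which matches the point above about the value separating the second and third quartiles. And a raw score of only 60 percent still lands ahead of 95 percent of the test takers in this made-up sample, which is the hard-test situation from the lecture.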
6. Importance of Data Visualization: Hey everyone, it's Max, and welcome back. In this tutorial, we're going to talk about the importance of data visualization. First, we're going to look at the role that the computer plays for us and what the computer is actually made for. Then we're going to look at what role the human should play in data science. Then we're going to look at presenting data. And finally, we'll talk about interpreting data.
All right, so let's get started and talk about the role that the computer plays. Now, a computer is much, much faster at calculating than a human, because that's what it's made for. It's made for crunching numbers; it's made for doing fast calculations. If you think about how fast computers are, they're in the gigahertz range. Giga means billion, so they do billions of things every second. And so they're really good at doing repetitive things, because they can do them so fast. We can give them these logical tasks in terms of programming: we give them a structure, and they just do it, and they can do it over and over again. They're not going to mess up, they can just repeat the same thing, and they won't get tired of it. So that's the role that the computer should play for you: it's a means to get the hard number crunching done. There's really no need for you to work out all this complicated math, because your computer can do it much better and much faster than you. It's also less prone to error if you code it correctly. That's the only part where you come in: it's only going to mess up if you mess up. But generally, a computer does exactly what we tell it to do, and it's really good and really fast at it.
Now, what role should a human play in terms of data science?
Well, humans have naturally developed to identify patterns, and we've done this first for survival. If we're walking around somewhere and we see, I don't know, a big predator hiding, we can identify the pattern of the predator and pick it out, even though it's trying to camouflage itself. So humans, by nature, have become very, very good at identifying patterns. You can also see this if you look at the clouds and see animal shapes or other things in them. Those patterns aren't actually there, but humans have become so good at identifying patterns that we see things in many, many places. So that's what humans are really good at: we're able to look at things and pick out patterns. Another thing that humans are really good at is being creative. Alongside our creativity, we can also use memory, bring in outside knowledge, and use a general understanding of things. These are all things that computers can't do. So computers are a means of getting stuff to us, but once it's actually there, it's our job to use our pattern recognition abilities. Of course, you can train machine learning algorithms for specific patterns or specific cases later on and make them really good at that. But generally, if you don't know exactly what's going to come, then your first step as a data scientist would be to try to identify these patterns. Use your creativity, use your memory, bring in all of these different things that make you human, and use all of that on the data, all of these things that a computer just doesn't have access to.
Okay, so considering all of this, the best way to do it is through data visualization. You can't just show spreadsheets with a bunch of numbers; that doesn't really help you, because when you're looking at numbers, it's really hard to pick out patterns.
The best way to do it would just be to plot values. If we have these visuals in front of us, then we can really identify patterns. We can see things go up and down, we can see them fluctuating, and we can see them make very thin lines. We can just look at a graph and see things. Of course, we need a little bit of practice to understand what a graph is trying to tell us. But once we understand graphs in general, we can look at new graphs and start to see patterns. They may not always be true, but that doesn't mean we can't pick them out. Later on, you would also do some testing to see if those patterns are true and if they make sense. But generally, data visualization is very good for this because it allows you to invoke all of your human characteristics, the things that we talked about on the last slide, all of the things that the computer can't do. So in one sense, when you're dealing with just these numbers, data visualization is for you, so that you can see these things, try to pick them out, and use them later on. But it's also for when you're trying to show these things to other people. Maybe you have to do a presentation or some kind of summary. Then you want to make sure that your data visualizations are good, because the people that are going to be looking at them are much, much less trained at looking at and analyzing data than you are. If you try to convey a message to them and just show them a big spreadsheet with numbers and point out, here, look, look at these numbers, they pop up, they're going to be like, what are you talking about? So that's why it's really important to have really good data visualization skills.
One part of it is to enable you to do your job, but the other part is to show it to other people and help you convey information to them. Of course, we talked about statistical values. Statistical values are very important, and they can give us a good idea about the data and what's going on inside of it. But visualizing data is taking it to the next level, and statistical values aren't enough there. They can help us, they can support us, they can give us ideas. But if we really want to understand what's going on, sometimes we just have to take a look. Of course, it's also important to make sure you choose the right visualizations, because otherwise things may just look extremely weird. But this skill of being able to present data, both for yourself and for other people, is very, very important for a data scientist.
And then we go over to interpreting data. We've touched on this in the last section already, but really, data visualization allows you to see the data and apply some reasoning to the system. If you look at data, either you see something, which is great, because that means you can test whether it's actually there, or you don't see anything. And that also tells you something: you aren't able to pick out a pattern, so there isn't anything obvious going on there. Maybe there's something underlying that's more complicated, but something obvious to the viewer just isn't there. All of these things allow you to analyze your data much more easily and prepare what you're going to do after that. So it's data visualization that really gives you a deep understanding of what's going on with your data.
And then when we interpret this data and look at these visualizations, maybe you see dips, and maybe you see some hills somewhere. We can try to understand all of this by bringing in our outside knowledge. Again, this is what the human is really good at: we can bring in the context of things. Maybe people are going out to lunch here, and that's why activity decreases. Or maybe everyone is coming to work in the morning, and that's why activity increases compared to 6 AM. We can bring in all of that context and all of this understanding to try to interpret the data and better understand what's going on. And then, of course, we're hopefully going to see some trends or patterns. Like I said, these may not always be there. We're actually so good at pattern recognition that we can sometimes see patterns that aren't really there. A good example of this, again, would be looking at the clouds in the sky. You can maybe see animal patterns, but they're really not there. That's just our minds identifying all of these patterns.
And so, yeah, that's pretty much why data visualization is so important to a data scientist. It's because of this huge human aspect, which is just key in data science. It's key in data analytics to be able to understand what's in front of you, to be able to bring in this outside knowledge, to be able to contextualize. This creativity is really key to a good data scientist. And a computer can help you with all of this. The computer can help you do the number crunching, it can help you set up the visualizations, and it can plot whatever you want. But ultimately, it's up to you to choose the right visualization, to look at the data, and to be able to communicate the visualization as well. All of those things are up to you. And so that's why the human is so, so important in data science.
7. One Variable Graphs: Hi everyone. It's Max and welcome back. In this tutorial, we're going to look at one variable graphs. So we're actually going to see some of the types of graphs that we talked about in our last tutorial, where we just looked at the importance of data visualization. Now we're going to go into data visualization and look at the types of graphs that you may want to choose from. All right, so the one variable graphs that we're going to look at are histograms, bar plots, and pie charts. Let's get started with histograms. We can see an example of a histogram on the right. What's really cool about histograms is that they show us the distribution of the data across all the values in our data. They show us what happens the least, and they also show us what happens the most. Histograms let us see where our data is concentrated and how it's distributed, and through this they show a general behavior. Really, what a histogram does is look at each value (in practice, each small range of values, called a bin) and count how often that value has occurred. What we see here, for example, is that around 0 we have the most occurrences of whatever value we're looking at, and as we move to the left and as we move to the right, the values start to drop off, so they become less frequent. That's what a histogram shows us: a frequency, how often these things occur. There are different types of histograms that you can encounter. Generally, a histogram is just a plot of frequency versus your value, and there are different ways that this histogram can look.
One of them is the one that we've just seen, which is a normal distribution. It's also called a Gaussian-like histogram, because it follows the Gaussian distribution, or the normal distribution, that you may know. But we can also have an exponentially decaying histogram. There we start off very high, and the further we get away from the initial value, the more it has decreased. You can compare that to the Gaussian-like, or normal, distribution. The normal distribution looks more like a bell: it goes up and then curves down slowly. The exponential, on the other hand, drops off very fast and then slows down later on. So they do have different behaviors. And then, of course, we can get not just one peak, like we see in this first case with the Gaussian-like distribution, but also two peaks, or even three peaks or more. We can have very large, extended peaks. Our histograms are a means of showing us how the data is distributed, what things occur most frequently, and where our data is concentrated. But that doesn't mean they have to have one specific shape. There are many different shapes that histograms can take on, and depending on which shape you get, that tells us something very different about our data. All right, so the next one variable graph that we'll look at is bar plots. Bar plots may look a little bit similar to histograms at first, but they're very different in some sense, because bar plots allow us to compare across different groups. That's what we see on the x-axis down there: different groups. We use the same variable, and we compare that variable over different groups. If we look at an example, what we see on the right here is different countries, and what we show is the average income tax.
And so we see that country B, for example, has the highest average income tax, whereas country D has the lowest. We're still only looking at the income tax variable, but we're able to compare it over different groups, over different categories, if you will. Other examples would be control groups and test groups. If you're doing some medical study, or maybe some psychology study, you always want to have your control group, and then you can have different types of test groups. You can plot each of these groups in a bar plot and look at the same variable, and see how it changes over the different groups. Another example would be comparing male versus female heights. You've got one group that's male and the other group that's female, and you can just plot their average height. And then there's the income tax of different countries, which is what we've seen on the right over here. All right, the last one variable graph that we're going to look at is pie charts. What pie charts allow us to do is section up our data and split it into percentages. Because of this, we can see what our data is made up of. The whole pie corresponds to 100 percent, and then we cut it into different slices. Through that slicing, hopefully with some color coding like we've done here, and maybe even, or most definitely, labeling, so that you know which slice corresponds to which value, we're able to see what categories our data is made up of. We can see what is most prominent, but we can also see what is least prominent.
And then again, here we can also see distributions. Not as well as in the histogram, but we can still see distributions in terms of dominance and in terms of how many groups there are. Is the data spread evenly, or is it heavily concentrated in one part of the pie? That's what we're able to see with pie charts. We get this nice group overview of one variable. An example of this would be the ethnicity distribution at a university. You can have a pie chart where each slice of the pie represents a different ethnicity, and the size of each slice depends on what percentage of the total university profile that group makes up. So you can see the dominance of some ethnicities as well as minorities, and just by how many slices there are, you can also see how many different ethnic groups there are. Another example would be splitting up star reviews for a product. Rather than looking at the average star review, you can use a pie chart and see how many of the reviews were five stars, how many were four stars, and then three, two, and one. And so there you can also get this nice, different overview of how the reviews are distributed.
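To make the counting behind these graphs a bit more concrete, here is a minimal sketch in Python. It is not from the course material: the function names `histogram_counts` and `pie_percentages` and all of the numbers are made up for illustration. It only shows the bookkeeping a histogram (how often values fall into each bin) and a pie chart (what share of the whole each category takes) are built on.

```python
import math
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`.

    Bins are identified by an integer index: index k covers the range
    [k * bin_width, (k + 1) * bin_width).
    """
    counts = Counter(math.floor(v / bin_width) for v in values)
    return dict(sorted(counts.items()))


def pie_percentages(labels):
    """Share of each category as a percentage of the whole pie."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


# Made-up measurements that cluster around 0, like the histogram on the slide
measurements = [-1.5, -0.5, -0.2, 0.1, 0.3, 0.4, 0.8, 1.2, 1.6]
hist = histogram_counts(measurements, bin_width=1.0)
for index, count in hist.items():
    print(f"bin starting at {index * 1.0:>4}: {'#' * count}")

# Made-up star reviews for one product
star_reviews = [5, 5, 4, 5, 3, 1, 4, 5]
shares = pie_percentages(star_reviews)
for stars, percent in sorted(shares.items(), reverse=True):
    print(f"{stars} stars: {percent:.1f}% of the pie")
```

The tallest row of `#` marks is the bin where the data is concentrated, and the percentages always add up to the whole pie, which is why a pie chart suits "what is my data made up of" questions.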
8. Two Variable Graphs: Hey everyone, it's Max and welcome back. Now we're going to talk about two variable graphs. The graphs that we're going to look at are scatter plots, line graphs, 2D histograms (or two-dimensional histograms), and box and whisker plots. All right, so let's start off with scatter plots. For a scatter plot, what we're doing is really scattering all of our data points onto a graph. For pretty much every data point that we have, we put a little dot on the graph. Scatter plots are great because they allow us to see the spread of data between two variables. We're always plotting one variable on the x-axis and another variable on the y-axis, and that allows us to see how the data is distributed for these two variables. Through that, we can see more dense areas, we can see sparse areas, and we can also look at correlations. Maybe you remember the lecture where we talked about correlations. Through scatter plots, we were able to see where those correlations were, or where there wasn't any correlation. That's what scatter plots are really, really nice for. Of course, we can also use scatter plots to show little clusters, like we see here. Not everything needs to be connected by a line or a curve. Maybe something is more like a circle, and scatter plots can show us that too. They can show us these groupings. We see one cluster here, but maybe you have bigger plots with, say, 10 little groupings for different things. Scatter plots are really great for that, because they just show us where the data points are located for these two variables. And then we can see for ourselves what things look like. Does one variable affect the other? Are there certain groupings? Where are the dense areas, and where is it sparse?
Where are things concentrated? Is everything spread all over the place, or is it very, very narrow and only in a specific region? Scatter plots allow us to see all of these things very easily. Some examples where we could use scatter plots: if we look at the graph on the right, we can look at something like car price versus the number of cars sold. Each of these data points pretty much represents a car model. The x-axis tells us the price that the car has been sold at, and the y-axis tells us the number of cars that have been sold at this price. What we see here, very easily, is that the higher the car is priced, the less it gets sold. And then you can think about why. The higher the price, maybe people don't want to buy such an expensive car. Maybe they found a cheaper version of it. Maybe it's just a branding thing, which is why it's more expensive. Maybe there's something of just as good quality that's cheaper. Maybe people just don't have enough money, and that's probably a big factor too: people just don't have enough money to buy these expensive cars, and so sales drop off. It may look a little bit different in terms of profits, but the higher the car is priced, the less we see it being sold. So that's one example of a scatter plot. Something else that we can look at is income versus years of education. On the x-axis we would look at how many years someone has been educated, and then we would look at their current income. That would just be one point on the graph. We can do that for many, many different people, and then we can see how different amounts of education affect current income. So that's another thing we can do a scatter plot for. We can also go back to one of the examples that we used very early on, where we talked about people traveling to work.
And we can just plot the distance traveled versus the time it takes to travel to work. Then we can see that maybe some people travel faster. It could be that some people travel the same distance, but one takes longer than the other because one goes by car, another goes by bike, and another takes public transport. All of that we can see in scatter plots. We can take these different situations into account and see how that all looks for the general population of our data. So scatter plots are really, really great as a first go-to for identifying trends, identifying regions, and just giving you a good overview of your data. Now, the next thing that we'll look at is line plots. Line plots are in some sense similar to scatter plots. We have the same basis of the x and the y axes, but the points are connected. It's very important to know when to choose line plots and when to choose scatter plots. Line plots can carry a lot of advantages, because this connectedness makes it very easy for us to see trends. We can see where these lines go instead of trying to connect the dots in our head. That's exactly what a line plot does: it connects the dots for us. It's great if we want to see an evolution of something. Maybe you want to see an evolution over time, or over space, or across people. If our data points are connected, it's great to use a line plot. If we know that whatever happened before is connected to what happens now, line plots show us how things evolve, because everything is connected as a line. But that's not the case for the kind of data we put on scatter plots, where the points aren't necessarily connected to one another.
If we go back to our car price versus cars sold example: if we look at an expensive car that's been bought, say, five times, and then at a cheaper car that's been bought 100 times, there isn't really a logical connection to make between the two. If we were to use line plots where we should use scatter plots, what we'd see is just a bunch of lines all over the place. That's why it's important to know when to use line plots and when to use scatter plots, because it can be very, very helpful. If you use a scatter plot instead of a line plot, it's going to be a bit more confusing, because you have to try to connect the dots yourself in your head. But if you use a line plot instead of a scatter plot, it's going to look really weird, because there are just lines all over the place and you can't really see anything. An example where we could use line plots is the typical distance versus time. You can look at what time it is and how far someone has traveled. Just a general curve of distance versus time is very, very common. You can also look at the profit of a company versus the number of employees. The more employees they employ, how does that change their profits? Of course, they have to pay more employees, but the employees can also do more work, and hopefully that more than cancels out what you pay them and increases the company's profits. And then what we can see on the right here is creativity and how it changes with stress. We can see that the more stressed out you are, the less creative you are. Here it's also good to use a line plot, because we gradually advance in stress, so each point in stress is related to the next. The higher you go up in stress, the lower you go down in creativity. So there's this relation where we can see an evolution.
So the more you get stressed out, the less creative you become. Line plots are really nice here, because there's not this chaotic movement everywhere. It's very easy to see this line and very easy to follow it. Okay, so the next graph that we can talk about is two-dimensional histograms. We've seen one-dimensional histograms in the last tutorial, where we looked at the spread of data, at the peaks, and at how things were distributed to the right and to the left. But we can also do a two-dimensional histogram. What a two-dimensional histogram is: it's pretty much a one-dimensional histogram for every single value of the other variable that we're looking at. What these allow us to see is how the distributions of the two variables relate to one another. We can see here, for example, in the red region, that those specific values happen a lot. That combination of values happens a lot. So we're again able to pinpoint these frequent occurrences, and we're also able to look at drop-offs, but now we can pinpoint them to two specific values rather than just one, which is what we did with the 1D histogram. These things are much harder to see in scatter plots. In a scatter plot, if we have a value occurring 100 times, it would just be the same dot, and the dot wouldn't get bigger. Of course, you can make the dot bigger yourself if you want to, or you can change the color or something like that. But really, if you do a scatter plot and the same thing happens 100 times, it's just going to look like one dot. Whereas with two-dimensional histograms, we can see that it's not just happening once. We can actually see the frequency of those two variables together. An example of a two-dimensional histogram would be ticket prices versus tickets sold.
And so if you look at the lower left corner, we can see this red peak. That's cheaper ticket prices, where the tickets are also sold often. So we know that tickets at that price are sold quite often. These could be new, rising bands. These could be the kind of standard bands that maybe you want to take someone on a date to, where you don't want to spend too much money on a ticket, but it's still considered a nice idea. That's a good ticket price that sells a lot of tickets, because it gives you the pleasure of the event without making it too expensive. And then if you move towards higher ticket prices, which would be the big bands, we can again see how many tickets have been sold. If you want to see lots of tickets sold for a high price, then the red peaks there are going to give us the more famous artists. So that's one application, but of course there are many, many better ones. It's just that once you're in the moment, you would realize, oh, this is when a two-dimensional histogram would be a great thing for me to use. A lot of these graphs are great to know, and once you're in the moment, it's much easier to pick out which graph would be the best representation. Finally, the last graph that we're going to look at is the box and whisker plot. What box and whisker plots allow us to do is see the spread within our data. It's not like a bar plot, which just shows us one value. We can actually see the statistical spread. We can see median values, which is what we see here. We can see quartiles. The little dots on the outside show us outliers.
And so what box and whisker plots allow us to do is see this statistical information, but see it visually. That makes comparing across different groups, which is what we're doing here, much easier. A good example of that would be ticket prices for football games for different teams. Different teams of course use different stadiums and have different popularities, and some teams' ticket prices may be much more expensive than others. We can compare these ticket prices using box and whisker plots. We can see the higher end of these costs, which are going to be the more luxurious seats. Then we go to the bottom, and those are going to be the less luxurious seats, probably the ones where you stand. And then you have middle values for the standard seats, depending on where you are in the stadium: whether you're close to the field, or further away from the field but still sitting. All of these things we can see here, and that's what gives us the spread. We can compare the spread across different teams, and we can also see which teams are more expensive. Where do the prices vary the most for a specific team? Maybe some teams have a super lounge and also have standing places that are just much cheaper, so you would see a much larger spread. Or maybe some teams only have seats, and so you'd see a much smaller spread. All of these things we're able to compare over different groups using box and whisker plots.
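Two of the ideas from this tutorial can be sketched with a few lines of Python. This is only an illustration under some assumptions: the function name `box_plot_summary`, the ticket numbers, and the rule that points more than 1.5 times the interquartile range beyond the quartiles count as outliers are not from the course (the 1.5 rule is a common plotting convention, and quartile conventions themselves differ slightly between tools). The first part shows why repeated points collapse into one dot on a scatter plot while a two-dimensional histogram keeps their frequency; the second computes the numbers a box and whisker plot draws.

```python
import statistics
from collections import Counter

# (ticket price, tickets sold) pairs, made up for illustration.
# The same combination occurs three times.
pairs = [(10, 500), (10, 500), (10, 500), (80, 50)]
distinct_dots = set(pairs)      # what a plain scatter plot can show: 2 dots
joint_counts = Counter(pairs)   # what a 2D histogram adds: how often each occurs


def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers for one group of values.

    Assumption: outliers are points beyond 1.5 * IQR from the quartiles.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Made-up ticket prices for one team: standing places up to one lounge ticket
team_prices = [20, 25, 30, 35, 40, 45, 50, 55, 200]
summary = box_plot_summary(team_prices)
print(len(distinct_dots), "dots on the scatter plot, but counts of", dict(joint_counts))
print(summary)
```

Running `box_plot_summary` once per team and comparing the results side by side is the numerical version of comparing the boxes across groups.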
9. Three and Higher Variables Graphs: Hey everyone, it's Max and welcome back. In this tutorial, we're going to talk about three and higher variable graphs. The graphs that we're going to look at are heatmaps, and then we'll also look at multivariable bar plots, as well as how we can add more variables to some of the lower-dimensional graphs that we've talked about earlier. All right, so let's start with heatmaps. What heatmaps allow us to do is plot two variables against each other in the x and the y, and also show an intensity, or a size, or something like that in the z direction, towards us. An example of this, which is what I've tried to illustrate on the right, is a customer moving through a store. We can track the path of the customer in the x and y direction of the store, so you get this bird's eye view and see where they move to. The darker spots tell us the positions where they spent more time. We can see that they spent a little bit of time at the beginning, they moved on, and then they stopped, which is where we see that dark spot. Maybe they found the candy aisle, and there was a specific piece of candy that they wanted. Then they moved on and went around the corner a little bit. Maybe they reached the fruit and vegetable section there and picked out several things. And then they started to head towards the checkout counter, which happens at the very end, where they're moving at a more constant pace. Sometimes they stopped to look a little bit, but they just continued moving on. So the three variables that we've shown here are their x position in the store, their y position in the store, and, through the color, the time that they spent at each position. That's what we can use heatmaps for.
Another example of a heatmap would be if you take a flashlight and move it over a screen. What you're showing is the amount of time that you've shone the flashlight onto a specific region. So that's another example of a heatmap. But usually a heatmap allows you to track positions. It's very often used for things like tracking customers through stores, or just tracking where people generally like to spend their time. The intensity that you see in terms of the color is usually the amount of time that they spent there. All right, we can also do multivariable bar plots. Multivariable bar plots are very similar to single bar plots, where we just plotted one value over different groups. But rather than plotting just one, we cram them together and plot several. An example of this would be plotting, for each team, the goals scored, the shots taken on goal, and the shots on target. Maybe there are teams that score less, but that's because they also shoot less, and therefore they also shoot less on target. Or maybe there are some teams that score a lot because they shoot a bunch, but they just don't hit the target that often. Or maybe there are really good teams that score a lot and also shoot a lot on target. All of these things we're then able to compare over different groups, and that's what we can use multivariable bar plots for. If there are several variables that would give us a better understanding of the system than looking at them one at a time, and it would be really cool to compare all of them, we can use a multivariable bar plot and just pop them onto the same plot. Then we can see how they change within a group, and we can also see how they change over different groups. Okay?
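As a rough sketch of what "cramming several bars together" means in practice, the Python below works out where each bar would sit on the x-axis so that the bars of one team stand side by side around the team's position. The helper name `grouped_bar_positions`, the bar width, and all the team numbers are assumptions made up for this illustration; a plotting library would then draw one bar at each computed position.

```python
def grouped_bar_positions(n_groups, n_variables, bar_width):
    """Return x positions so each group's bars sit side by side.

    Group g is centered at x = g; its bars are shifted left and right
    of that center by multiples of `bar_width`.
    """
    positions = []
    for group in range(n_groups):
        row = []
        for k in range(n_variables):
            offset = (k - (n_variables - 1) / 2) * bar_width
            row.append(group + offset)
        positions.append(row)
    return positions


# Made-up season statistics, one list entry per team
teams = ["Team A", "Team B"]
stats = {
    "shots taken": [20, 8],
    "shots on target": [6, 5],
    "goals scored": [2, 3],
}

positions = grouped_bar_positions(len(teams), len(stats), bar_width=0.25)
for team_index, team in enumerate(teams):
    for k, (name, values) in enumerate(stats.items()):
        x = positions[team_index][k]
        print(f"{team}: {name} = {values[team_index]} drawn at x = {x}")
```

In this made-up data, Team A shoots a lot but rarely hits the target, while Team B shoots less and scores more, which is the kind of within-group and across-group comparison the plot is meant to make visible.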
Something else that we can do is add extra dimensions to the lower-dimensional graphs that we've had. We're kind of limited to three dimensions, because that's the number of space dimensions that we live in. But if we take the scatter plot, for example, where we started off with just the x and the y axis and points located on them, what we can do is add a third axis. We take the x and the y, and then we add a z. That gives us an extra depth dimension, which is exactly what we see here. Rather than just plotting on a two-dimensional field, on a plane, we can plot in a volume. We can see this scattered ball that we've done here, which is located at the center of our plot. This can be really cool, because it allows us to see depth too. The problem is that what we have is a snapshot every time, so really we're looking at two-dimensional snapshots. To get the best understanding, we need to rotate our plots as we make them, so that we can add in our depth perception. Right now, if we're looking at it, it may look three-dimensional, but really it's just a two-dimensional snapshot. Is our scatter plot located more towards us, or more towards the left? Maybe it's really high and close to us, or maybe it's really low and far away. To understand all of these things, we need to be able to rotate our scatter plot so that we can see it from different angles, which then gives us this depth perception. We can do the same thing with 3D line graphs. Here we see an example of the position of a skier as they're skiing down a hill. We can trace that through time, and we see that they're going down the hill in this nice zigzag motion, as you should, and we can just track their position over time.
So here we've added this extra dimension to get the 3D line graph. Rather than just taking a time and a position, we've added a second position, and actually even a third position. We've got the x, the y, and the z position, and then we just trace it over time. That gives us this whole line here. And so that's how we can take the lower-dimensional plots that we've looked at before and add extra dimensions to them if we want. As long as it's still easy to see, and as long as it makes sense what we're looking at, we're really just able to slap on another direction there and compare another variable.
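Coming back to the heatmap of a customer walking through a store from earlier in this tutorial, here is a minimal sketch in Python of how the three variables (x position, y position, and time spent) could be turned into a heatmap grid. The function names `dwell_time_grid` and `render`, the grid size, and the path are all made up for illustration, and the "colors" are just text characters, where a denser character stands for more time spent.

```python
def dwell_time_grid(samples, width, height):
    """Add up the seconds spent in each (x, y) cell of the store floor."""
    grid = [[0.0] * width for _ in range(height)]
    for x, y, seconds in samples:
        grid[y][x] += seconds
    return grid


def render(grid, shades=" .:*#"):
    """Draw the grid as text; denser characters mean more time spent."""
    hottest = max(max(row) for row in grid) or 1.0
    lines = []
    for row in grid:
        line = ""
        for seconds in row:
            level = int(seconds / hottest * (len(shades) - 1))
            line += shades[level]
        lines.append(line)
    return "\n".join(lines)


# Made-up path: (x cell, y cell, seconds spent there on that visit)
path = [
    (0, 0, 5), (1, 0, 2), (2, 0, 2),   # walking in
    (2, 1, 30),                         # long stop, e.g. at the candy aisle
    (2, 2, 3), (3, 2, 12), (3, 2, 8),   # two stops in the same cell add up
    (3, 3, 2),                          # heading to the checkout
]

grid = dwell_time_grid(path, width=4, height=4)
print(render(grid))
```

The cell with the long stop comes out darkest, and repeated visits to the same cell accumulate, which is the behavior the darker spots on the slide describe.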
10. Programming in Data Science: Hey everyone, it's Max, and welcome back. In this tutorial we're going to touch on the third major area that is really great for data scientists, or that should be an essential for data scientists, which is the ability to program. Okay, so why do we program? There are different reasons why we want to be able to program. The first one is the ease of automation. The second is the ability to customize. And finally, it's because there are many great external libraries for us to use that just make our job so much easier. All right, so let's get started and talk about the ease of automation first. What do I mean by that? Being able to program really allows you to prototype fast and to automate things, and it also gives us the extra benefit that if we have something in our mind, we can just take that and put it into the computer by programming it. We're able to automate everything very fast, and we don't have to do repetitive tasks, like copy-pasting stuff into or from Excel. If we want to repeat something, or we want to quickly change one small thing, we don't have to do a lot of work. We can just change that in our code, click play, and let the computer take care of all of it for us, rather than having to do everything manually. So it's very easy for us to automate things, and that includes reports. It's very easy to create reports automatically. All you have to do is set up your program to deal with the data that you're going to give it, and then it can automatically create reports every week. The reports can be different because you give it different data. The report should still look the same, but the values in it can be different. That would just automatically create all these reports for you, and you don't have to do that all yourself.
The program does it for you, but you've built the program, and you're giving it this different data. So you're still doing all of the analysis. You just get to skip the part of copy-pasting, looking across and carrying over the values, and doing all of the formatting for the same report over and over again. All of that is taken care of for you. All you have to do is put in the right data, write out everything that you want to do, and then click play and let the computer handle it, because remember, that's what computers are good at: doing these repetitive tasks. Okay, we also want to be able to program because it really allows us to customize. Once we go into data analysis and start seeing things, we get ideas that we want to expand on, or different directions that we want to take our analysis in. Being able to program allows us to take all that, put it into code, and just choose that direction. We can very easily dive much deeper into our analysis and discover things fast, because it's up to us where we want to go. This ability to customize with programming is very, very important, because we're not reliant on anything else. We're not reliant on some software that maybe breaks down, or that we don't know how to use perfectly, so that we have to read the manual or a help section. No, we know how to program. We just type down exactly what we want to do, exactly where we want to take it, and exactly what we want to see, and we can customize very, very fast with that. We can also prototype very, very fast with that. If a visualization is not working, turning a scatter plot into a line plot is very easy. You just change one word.
So all of these things are very, very easy to do with programming, because we have all of that power at our fingertips. We can change everything that we're looking at and everything that's being calculated. Maybe we want to calculate an extra thing, or take out something else because it's irrelevant. All of these things we're able to customize, and we can do all of that because we're able to program. So really what we're doing is making the data ours. We're taking full control of the data, of where we want to go with our analysis, of what we want to see, and of what we want to show. All right, so let's first talk about libraries, and I'll also give you two great Python libraries that you should feel comfortable with, or that you should maybe consider using for data analysis. First of all, what are libraries? Well, libraries are pieces of code that have been prewritten by others and that you can just take and use. A very good example of this is a math library. It has all of the square root functions, taking things to a power, taking the exponential, the sine, the cosine, all of these things that you know and want to use but don't want to program yourself. It pretty much avoids that middle step of having to program the equation to calculate a sine, because those are things that we don't want to do. We don't want to get distracted from our target. We want to be able to do exactly what we want to do without having to program completely different stuff. That's what libraries are great for. They're developed by the community for everyone to use. Everyone is helping each other, and these libraries just bring a lot of power with them.
And so one of these libraries is called pandas. Pandas is pretty much like Excel, but we can do programming with it, which just makes it so much better, because we can do things so fast with it. We can do all this customization and all this automation, whereas with Excel, if you give it too much to run, it can start to crash, because it has to handle all of these other things, all of these visual things, the UI. There's a lot more to it, and it's not structured as well, whereas in programming, your computer just goes through everything step by step. It doesn't have to take care of all of these visual things. It just does the calculations down below. But we can still do all sorts of data management with pandas: we can shift our data around, we can drop columns, we can split things up by row, and we can pick out certain rows. We can even do statistical calculations on our data, so we can say, hey, calculate the mean for this. We don't even have to make our own formula for calculating the mean, or the standard deviation, or the correlation between different columns. All of that can be done with pandas with just a couple of keywords. And so it's really easy to do data analysis with it, because all of the functions are there. We know exactly what we want to do, and we don't have to write the code for all of it. So if we wanted to look at correlations, we just say, hey pandas, do correlations, rather than having to code that whole algorithm ourselves. That makes it really easy and really fast to get results and to get to where you're heading, because you don't have to go into any of these middle places. You can pretty much skip the middleman of having to write all of those algorithms yourself, and you can just use them.
So you have your start, you have your idea, you know exactly what you want to do, and you can do exactly that to get to your goal. The other library that is very cool is Matplotlib, which is what I use a lot for data visualization. It allows me to create graphs, it allows me to visualize my data, and it allows a bunch of customization, so I can really move everything around. I can move my spines, and I can turn things on and off. All of these things are very easy to do with Matplotlib. There's a lot of great customization that I'm able to do with it. So these are the two basic Python libraries that you should probably get to know, or you can look at some of my other courses. One of them, pandas, deals with the data analysis part, and Matplotlib helps you deal with the data visualization part.
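Here is a minimal sketch of how the two libraries fit together, pandas for the analysis part and Matplotlib for the visualization part. The column names and values are invented for illustration; they are not data from the course.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "temperature": [20.0, 22.0, 24.0, 26.0, 28.0],
    "sales": [100, 115, 118, 131, 140],
    "notes": ["a", "b", "c", "d", "e"],
})

# Data management: drop a column, pick out certain rows.
df = df.drop(columns=["notes"])
warm_days = df[df["temperature"] > 23]

# Statistics with a couple of keywords, no hand-written formulas.
mean_sales = df["sales"].mean()
std_sales = df["sales"].std()   # sample standard deviation
correlations = df.corr()        # pairwise correlation between all columns

print(mean_sales, std_sales)
print(correlations)

# Visualization: plot one column against another, then customize the spines.
fig, ax = plt.subplots()
ax.plot(df["temperature"], df["sales"])
ax.set_xlabel("temperature")
ax.set_ylabel("sales")
ax.spines["top"].set_visible(False)    # turn the top spine off
ax.spines["right"].set_visible(False)  # turn the right spine off

# fig.savefig("sales_vs_temperature.png")  # uncomment to write the figure to a file
```

The correlation here is a single call, `df.corr()`, which is the "hey pandas, do correlations" step: the algorithm itself is already written and you only ask for the result.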