Applied Data Science - 2 : Statistics | Kumaran Ponnambalam | Skillshare
5 Lessons (1h 1m)
    • 1. About Applied Data Science Series (8:12)
    • 2. Types of Data (7:29)
    • 3. Summary Statistics (16:10)
    • 4. Statistical Distributions (19:05)
    • 5. Statistics Correlations (10:09)

About This Class

This class is part of the "Applied Data Science Series" on SkillShare presented by V2 Maestros. If you wish to go through the entire curriculum, please register for all the other courses and go through them in the sequence specified.

This course focuses on statistics for data science. It goes through the basic concepts of statistics that are required for performing the data engineering and machine learning operations covered in this series.

Transcripts

1. About Applied Data Science Series: Hey, welcome to the course Applied Data Science with R. This is your instructor, Kumaran Ponnambalam from V2 Maestros. Let's go through and understand what this course is all about. The goal of the course is to train students to become full-fledged data science practitioners — people who can execute an end-to-end data science project, right from acquiring data, all the way to transforming it, loading it into a final destination, performing analytics on it, and finally achieving some business results from that analysis.

What you gain by taking this course: you understand the concepts of data science; you understand the various stages in the life cycle of a data science project; you develop proficiency in R and use R in all stages of analytics, from exploratory data analytics to descriptive analytics to modeling to finally doing prediction with machine learning algorithms; you learn the various data engineering tools and techniques for acquiring, cleansing, and transforming data; and you acquire knowledge of the different machine learning techniques and, most importantly, how to use them. You become a full-fledged data science practitioner who can immediately contribute to real-life data science projects — not to mention that you can take this knowledge to your interviews so that you can get a position in data science.

Theory versus practice: we wanted to touch upon this particular topic. Data science principles, tools, and techniques emerge from different science and engineering disciplines — computer science, computer engineering, information theory, probability and statistics, artificial intelligence, and so on. The theoretical study of data science focuses on the scientific foundations and reasoning of the various machine learning algorithms. It focuses on understanding how these machine learning algorithms work in a deep sense, and on being able to develop your own algorithms and your own implementations of them to solve real-world problems. It dwells on a lot of equations, formal derivations, and reasoning. The applied side of data science, in contrast, focuses on applying the tools, principles, and techniques to solve business problems. The focus is on using existing techniques, tools, and libraries, and on how you can take them and apply them to real-world problems and come out with business results. It requires an adequate understanding of the concepts, a knowledge of what tools and libraries are available, and the ability to use those tools and libraries to solve real-world problems. This course is focused on the practice of data science, and that is why it is called Applied Data Science.

Inclination of the course: data science is a trans-disciplinary subject, and it is a complex subject. It has mainly three technical areas to focus on: math and statistics, machine learning, and programming. This course is oriented towards existing software professionals. It is heavily focused on programming and solution building.
It has limited, as-required exposure to math and statistics, and it covers an overview of machine learning concepts — it gives you an adequate understanding of how these machine learning algorithms work, but the focus is on using existing tools to develop real-world solutions. In fact, 90 to 95% of the work that data scientists do in the real world is the practice of data science, not the theory of data science. This course also strives to keep things simple and very easy to understand. We have stayed away from some of the complex concepts, or toned them down, so that it is easy to follow for people of all levels of knowledge in the data science field. It is a beginner's course, if I may say so.

Course structure: the course goes through the concepts of data science to begin with — what exactly data science is and how it works. It looks at the life cycle of a data science project and its various stages. It then goes into some basics of statistics that are required for doing data science. It then goes into R programming, with a lot of examples of how you would use R for the various stages of a data science project. Then comes the data engineering part — what you typically do in data engineering and what the best practices are. Finally, there is the modeling and predictive analytics part, where we delve into the machine learning algorithms. We also look at end-to-end use cases for these machine learning algorithms, and there are some advanced topics that we touch upon. Finally, there is a resource bundle that comes as a part of this course. The resource bundle contains all the data sets, data files, and sample code for the examples taught in this course. So do download the resource bundle; it has all the data and all the code samples you need to experiment with the same things yourself.

Guidelines for students: the first thing is to understand that data science is a complex subject, and it needs significant effort to understand. So if you are getting stuck, do review the videos and exercises again, and do seek help from other books, online resources, and support forums. If you have queries or concerns, do send us a private message or post your question, and we will be happy to respond as soon as possible. We are constantly looking to improve our courses, so any kind of feedback is welcome — please provide it through private messages or emails at the end of the course. If you like the course, do leave a review; reviews are helpful for other prospective students considering this course. Also, expect maximum discounts on future courses from V2 Maestros — we want to make that easy for our students.

Relationship with the other V2 Maestros courses: our courses are focused on data science related topics — the technologies, processes, tools, and techniques of data science — and we want to make each course as self-sufficient as possible.
What that means is, if you are an existing V2 Maestros student, you may see some content and examples repeated across courses. We want to make the courses self-sufficient: rather than saying at some point in a course, "go look at this topic in another course, register for it and learn about it there," we would rather keep things within the same course, unless the other concept is a big enough topic to deserve a separate course of its own. So you might see some content that is repeated across courses. Finally, we hope this course helps you to advance your career. So best of luck, happy learning, and do keep in touch. Thank you.

2. Types of Data: Hello, welcome to statistics for data science. In this module we are going to look at some of the basics of statistics that are required for machine learning and predictive analytics use cases. The goal of this particular module is to describe the basic statistics required for data science at a very simple level. We are going to explain the concepts at a high level and in a very simple way, and we are going to avoid formulae and mathematical representations as much as possible. We want to keep it simple so that everyone, with different levels of mathematical exposure, can understand what is going on. I hope this is going to be useful to you; if you feel you want to learn more, there are other courses and other materials for that. Here we are trying to keep it to the minimum needed for everybody.

We are going to look at the types of data, what they are, and what we can do with them. Types of data play a very important role in data science, because machine learning algorithms are typically affected by what type of data is passed to them. Some machine learning algorithms work well only with certain types of data, which is something we will see in the predictive analytics module. So it is good to learn what these types of data are and what you can do with them. There are four types of data you typically deal with, and they differ in their meaning and in the operations you can perform on them. These four types are called categorical data (also known as nominal data, or factors), ordinal data, interval data, and ratio data. So what exactly are these?

Let us start with categorical. Categorical data represents categories and types; we see types and categories all over the place. The best example of categorical data is gender: gender being male or female. What is special about categorical data is that there is always a fixed set of values — in the case of gender, just male and female. It has no implicit ordering or sequencing: you cannot sequence or order the values, saying one is better or higher than the other; all of them are considered equal, and you cannot really compare them with a greater-than or less-than symbol. Some examples of categorical data: a list of fruits like apples, oranges, grapes; in a soccer team, the different types of players, like defender, midfielder, forward; or types of cars, like sedan, SUV, coupe, and so on. As you can see, these are all categories:
they typically have a fixed set of values, and you do not have any kind of implicit ordering or sequencing among them. The second type of data is called ordinal data, which is similar to categorical data in that it also has a fixed set of values, but there is an ordering among these values: you can actually order them and say one is better than the other, one is higher than the other. They typically represent a scale of measurement, like a scale of 1 to 10, a scale of 1 to 5, or something like high/medium/low or excellent/very good/good. It usually represents a scale, but it is still categorical data. You can do comparisons like greater than or less than, but you cannot do any kind of arithmetic operations — addition, subtraction, multiplication, or division — on the values. Some examples: review ratings like outstanding, very good, good are ordinal data; pain levels on a scale of 1 to 10, with 10 being the highest; student grades A, B, C, D, E, with A being the highest. So you always have something like a highest and a lowest, and you can actually compare values, as opposed to categorical data.

The third type of data we deal with is called interval data. Interval data is typically numeric data, measurements where the difference between the numbers has some meaning — like distance, for example. Suppose the distance between A and B is 60 miles and between B and C is 70 miles; the difference is 10 miles, and that 10 miles is itself a meaningful number, a meaningful distance. So in interval data, the difference between two values has meaning. It represents things like time, distance, and temperature. The most important thing to notice is that addition and subtraction are possible — you can add or subtract times, distances, or temperatures — but multiplication and division are not: you cannot multiply distance one by distance two and get another distance; you would get distance squared, and it simply does not make sense to multiply or divide distances. Examples are time of day, dates, the distance between two points, temperature, and things like that. This is interval data.

The last type is what is called ratio data. Ratio data is everything else — any kind of numeric data that does not qualify as one of the other three types. All kinds of arithmetic operations are possible with ratio data, and a true zero is possible: zero is a true, valid value in the case of ratio data. Some examples are rate, speed, and amount — the kind of continuous measurement data that you see in real life is generally ratio data.

So how do the four types compare? Here is a nice comparison chart between them. Discrete values are applicable to all four of them; continuous values are applicable only to interval and ratio data, because ordinal and nominal data are just categories. Frequency distributions, which we will see later, are applicable to all of them, while medians and percentiles are applicable only to ordinal, interval, and ratio data. Addition and subtraction are possible with interval and ratio data; multiplication and division are possible only with ratio data. Mean and standard deviation — again, we will look at what these are in the future sessions — are applicable only to interval and ratio data. And of course, you can find a ratio, like the ratio between A and B, in ratio data only, and a true zero is applicable only in the case of ratio data. So this is how they all compare with each other. I hope this presentation helps you understand what the various types of data are. Thank you.
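Since this series uses R, here is a minimal sketch of how these four types of data are typically represented there. The variable names and values below are made up purely for illustration:

```r
# Categorical (nominal) data: a fixed, unordered set of labels
gender <- factor(c("male", "female", "female", "male"))
levels(gender)

# Ordinal data: a fixed set of labels with an explicit ordering
rating <- factor(c("good", "excellent", "very good"),
                 levels = c("good", "very good", "excellent"),
                 ordered = TRUE)
rating[1] < rating[2]            # TRUE: ordered factors allow comparisons

# Interval data: numeric values whose differences are meaningful
temperature_c <- c(18.5, 21.0, 19.2)
diff(temperature_c)              # subtraction makes sense

# Ratio data: numeric values with a true zero, so ratios make sense
distance_miles <- c(60, 70)
distance_miles[1] / distance_miles[2]
```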
3. Summary Statistics: In this module we will look at summary statistics. Summary statistics are a very important part of statistical analysis — something you do as a basic analysis for any kind of data you see. So what are summary statistics? When you have a set of observations — a set of data points, maybe 10 data points, a hundred, a million — you want to somehow characterize them, characterize the spread and the kind of data you are seeing, in two, three, or four numbers. That is why you have a set of summary statistics. For example, take a basketball player who has been scoring a number of points in every match. You want to characterize the performance of that player, so you come up with some kind of summary statistic, like points per game. You look at it and say: in the last 10 games this player averaged 20 points per game, but he has a career average of 15 points per game. What you are doing is taking a number of data points — in this case the points the player scored in each match — and summarizing them into one or two numbers that represent what the individual values look like. That is what we call summary statistics: the observations contain a number of data points, and summary statistics are used to characterize them.

What are the various summary statistics we look at? First there is central tendency, and there are three different measures of it: the mean, which is nothing but the average; the median; and the mode. Then there is variation — the variation between the points is measured by variance and standard deviation. And there is also skew — how the data is skewed towards one end or the other — and for measuring that we have quartiles. Now let us go ahead and see what each of them is.

Let us start with central tendency. Central tendency is a measure of where the data is centered, where it tends towards. We start with the mean. Everybody is familiar with the mean, or the average, and it is computed very simply: you add all the numbers and divide by the count of the numbers. Add all the numbers, divide by the count, and you get the mean, or the average — pretty simple and straightforward. A less commonly used measure of central tendency is the median. The median is nothing but the middle value: you have a set of numbers, and you ask what the middle value in that set is. How do you find the median? You take the set of numbers, order them in ascending order, and find which one is exactly in the middle of that list; that is the middle value. If the count of the numbers is even — say you have 10 numbers — you take the middle two numbers, add them together, and divide by two; the average of the two middle values is what gives you the median. Say you have 10 numbers:
you take the 5th and 6th numbers, add them, and divide by two, and you get the median. Next is the mode. The mode is nothing but the value that occurs most often in the data set. So when you have a data set where the numbers are not unique — they keep repeating — the mode is the number that occurs the most. Which one do you use where? That depends on the situation; it is very situation-dependent which measure of central tendency you use, and a lot of times you might actually look at all three to understand some characteristics of the data.

So how do you compute central tendency? Suppose you have a set of observations: the numbers 1, 3, 4, 5, 5, 7, 8, 9, 9, 9. You have 10 numbers, so the count is 10. The sum of these numbers is 60 — just add them up. How do you find the mean? Sum divided by count, which is 60 by 10, and that is 6. The symbol used for the mean is mu (μ); mu is usually used to represent the mean. The median is nothing but the middle value: if you look at that list of numbers, the middle values — the ones in the 5th and 6th positions of the ordered list — are 5 and 7. You take those two numbers, 5 and 7, add them, and divide by two, and you get 6; 6 is the median. The mode is the most frequently occurring number, and in that list you will see the number 9 occurring three times, so 9 becomes the mode. This is how you calculate these three metrics. Whenever you use any kind of statistical analysis package — in fact, any of the programming languages or tools out there — it provides libraries or functions to compute all three of them, so you pretty much never have to write code to compute them manually. You will always have some kind of function or library to compute any of these things.
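As the lecture notes, you rarely compute these by hand. Here is a minimal R sketch using the worked example above; note that R's built-in mode() refers to an object's storage mode rather than the statistical mode, so a small helper function (an illustrative assumption, not part of base R's statistics functions) is written for it:

```r
x <- c(1, 3, 4, 5, 5, 7, 8, 9, 9, 9)

mean(x)     # 6 -> sum of 60 divided by a count of 10
median(x)   # 6 -> average of the 5th and 6th ordered values (5 and 7)

# Count value frequencies and pick the most frequent one
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(x)  # 9 -> occurs three times
```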
Next comes variance. Variance is used to measure how the values are distributed around the mean. You have the mean, which is the central tendency, but how are the values distributed around it? Are they close to the mean, or are they far away from it? You can have numbers ranging from 4 to 6 with a mean of 5, and you can have numbers ranging from 1 to 10 with a mean of 5. Even though the mean is 5 in both cases, the spread of the numbers is different: 4 to 6 versus 1 to 10. How do you measure that spread? For that you use variance and standard deviation. If the variance or the standard deviation is small, the variability in the data is small; if those values are high, the variability is high.

How do you go about computing variance and standard deviation? On the right side there is an example table. Suppose you have five data points: 4, 6, 3, 5, and 2. The first thing you do is compute the mean of all these values, which is 4. Once you have the mean, you subtract each value from it: 4 minus 4 is 0; 4 minus 6 is minus 2; 4 minus 3 is 1; 4 minus 5 is minus 1; 4 minus 2 is 2. Then you square all these values: the square of 0 is 0, of minus 2 is 4, of 1 is 1, of minus 1 is 1, and of 2 is 4. What squaring does is eliminate the negative values from the list. Once you have the squares, you sum them all up — subtract each number from the mean, square the differences, and add them — and you get a value of 10. How do you get the variance? You divide this number, 10, by the count. There are five values, so the sum of squares, 10, divided by 5 is 2; the variance of this data set is 2. Once again: subtract each value from the mean, square the differences, sum up the squares, and divide the sum of squares by the count — that gives you the variance. To find the standard deviation, you simply take the square root of the variance.

One thing to notice is that the unit of measure of the variance is the square of the unit of the individual values. What I mean is: suppose each of these values represents a distance, say in miles — 4 miles, 6 miles, 3 miles. Then the variance is actually 2 miles squared, because you squared all the values. To get the value back in the same unit, in miles, you take the square root, and that gives you the standard deviation. So for this data set, where the values are distances in miles, the mean is 4 miles and the standard deviation is about 1.41 miles. That is how you represent the data with a mean and a standard deviation.
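A minimal R sketch of this computation. One caveat worth knowing: R's built-in var() and sd() divide by n − 1 (the sample formula), whereas the lecture divides by n (the population formula), so both are shown:

```r
x <- c(4, 6, 3, 5, 2)
m <- mean(x)                            # 4

pop_var <- sum((x - m)^2) / length(x)   # 2     -> dividing by n, as in the lecture
pop_sd  <- sqrt(pop_var)                # ~1.41

var(x)   # 2.5   -> R divides by n - 1 (sample variance)
sd(x)    # ~1.58 -> sample standard deviation
```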
Moving on, the next thing we want to see is quartiles. Quartiles actually give you a measure of a lot of things: a measure of the central tendency, a measure of the range — the range being nothing but the minimum and maximum values in a data set — and a measure of how the data is skewed: is it skewed towards the minimum value or towards the maximum value? All of this is measured using quartiles. Given a set of observations, how do you find quartiles? You order the values in sequence and divide them into four equal sets. Here is a set of values — the same set you saw earlier for computing the mean — divided into four equal sets, so that each set contains 25% of the values in the data set. The first value is the minimum. The value at the 25th percentile is called the first quartile; in this case it is 4. The value at the 50th percentile is called the second quartile, which is the median. The value at the 75th percentile is the third quartile, and the maximum value forms the fourth quartile. So you have five numbers: the minimum, the first quartile, the median, the third quartile, and the maximum. By looking at these five numbers you get a nice visualization of how the data is distributed. Note that between the minimum and the median, 50% of the values occur; similarly, between the median and the maximum, another 50% of the values occur; and, most importantly, between the first quartile and the third quartile, 50% of the values occur.

This is how you look at and characterize the data, so let us go and look at some examples. Here is a set of data sets, all with the same min and max values, which shows how data can vary and how you can interpret it. Look at the first data set: 1, 3, 5, 8, 10. This is fairly evenly distributed: the distance between the min and the first quartile is about 2, between the first quartile and the median about 2, between the median and the third quartile about 3 — roughly even gaps between each of these numbers. In the second one, most values are closer to the center: 1, 4, 5, 6, and 10. Between the first and third quartiles, that is between 4 and 6, 50% of the data occurs — so 50% of the numbers in your data set sit between 4 and 6, whereas the overall range is 1 to 10. That is another kind of variation, where a number of values are packed closely around the median but a bunch of values sit far away from it. The third data set is 1, 2, 3, 7, and 10. Even though the range is 1 to 10, the median is 3, so 50% of the values occur between 1 and 3 and the other 50% between 3 and 10 — the values are bunched towards the lower end. In the next one, the values are bunched towards the upper end, because the median is 7: between 7 and 10 sit 50% of the values, and between 1 and 7 the other 50%. So more values occur between the median and the max, and fewer between the min and the median. This is a nice way of looking at data: by just looking at the quartiles, you can find out a number of things — the central tendency, the range of the data, and whether it is evenly distributed or skewed.

The last thing we want to look at in summary statistics is what we call outliers. What is an outlier? An outlier is an odd value occurring in a data set, typically towards the maximum or the minimum — because it is odd, it is obviously going to sit towards the max or the min of the data set. Outliers are important in analysis because they tend to distort the summary statistics of the data set, and if you use data containing outliers for machine learning, they can distort the behavior of your machine learning algorithms too. That is one of the main things to remember when passing data into a machine learning algorithm. An example: you have a set of observations like 1, 2, 4, 5, and 20. You immediately see that 20 is an odd number sitting there, which is why we call it an outlier. Let us try to compute the mean and standard deviation of this data set with and without the outlier. With the outlier, the mean is 6.4 and the standard deviation is 6.94. Remove the outlier, take just the four numbers 1, 2, 4, and 5, and compute again: the mean is now just 3, and the standard deviation is only about 1.5. This is how much an outlier can distort your numbers, so you have to be very careful about outliers. Whenever you look at a data set, you have to decide whether you want to keep the outliers for your final analysis or not; otherwise you get this kind of distorted result, which can give you the wrong idea, wrong analysis, wrong actions, all kinds of problems. So be very careful about outliers. This completes our discussion of summary statistics.
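A small R sketch of the five-number summary and the outlier effect described above, using the lecture's example numbers (the standard deviations shown are from R's sample formula, which differs slightly from the lecture's divide-by-n figures):

```r
# Five-number summary: min, first quartile, median, third quartile, max
fivenum(c(1, 3, 5, 8, 10))   # an evenly spread example
fivenum(c(1, 4, 5, 6, 10))   # values packed around the median

# Effect of an outlier on the mean and standard deviation
with_outlier    <- c(1, 2, 4, 5, 20)
without_outlier <- c(1, 2, 4, 5)
mean(with_outlier)    # 6.4
mean(without_outlier) # 3
sd(with_outlier)      # ~7.8 (sample formula); ~6.9 when dividing by n as in the lecture
sd(without_outlier)   # ~1.8 (sample formula); ~1.6 when dividing by n
```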
4. Statistical Distributions: Hi, this is your instructor, Kumaran. In this section we are going to look at distributions — statistical distributions and probability distributions. Distributions are a visual way of summarizing data and showing trends; if you have been exposed to analytics, you will see a lot of these distributions in real life. So what is a distribution? Distributions show how the data values are spread in a given observation set. You have a set of data — you collect a number of observations, or samples, or examples — and you try to find out how the values are spread across that data set. How do you build a distribution? Distributions basically contain a set of bins. The bins, or groups, are shown on the X axis. In the example chart on the right side, the bins are the types of feedback you can get: excellent, very good, good, fair, and bad — each one is a bin. Then you count the number of observations that fall into each bin. Suppose you collected feedback from, say, 50 people, and you want to show how many "excellent"s, how many "very good"s, and how many "good"s you got. The way you show this in a distribution is to take each type of rating and put it on the X axis, and put the count you find — say, five — on the Y axis. That is how a distribution looks, and that is how you do it for categorical or ordinal data. What about interval or ratio data? In that case, the bins are usually ranges of values. You convert the data into ranges of values, like 1 to 10, 10 to 20, 20 to 30 — typically equal-sized ranges — and then show how many values occurred in each of these ranges. That is how you build a distribution for interval or ratio data.

Here is an example of how you would build a distribution. At the top you see 10 numbers — this is your data set; it has 10 numbers in it — and I want to build a distribution from it. How do I do that? First, I create bins. The bins shown here are ranges of two numbers each: 1 to 2, 3 to 4, 5 to 6, 7 to 8, and 9 to 10. Then every value is taken and put into the corresponding bin. The first value is 4, so I put it into the bin 3 to 4, because that is the range it falls into. Then I take 7 and put it into the bin 7 to 8, because that is the range it falls into, and I keep doing that for every value in the data set. Finally, I count the number of values in each bin — say the bin 1 to 2 ends up with three values. Once you have the counts and the bins, you plot the bins on the X axis — 1, 2, 3, and so on up to 10 — and the counts on the Y axis, and then you draw points or bars, however you want to show it. This is how you build a distribution chart for a given set of data.
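A minimal R sketch of the same idea — the ten values below are made up for illustration; hist() does the binning and counting for you:

```r
# Ten made-up values, binned into ranges of two (1-2, 3-4, ..., 9-10)
values <- c(4, 7, 1, 2, 5, 9, 2, 6, 3, 8)

# breaks defines the bin edges; hist() counts the values in each bin and plots them
hist(values, breaks = c(0, 2, 4, 6, 8, 10),
     main = "Distribution of values", xlab = "Bins", ylab = "Count")

# For categorical data, a table of counts plus a bar plot plays the same role
feedback <- factor(c("excellent", "good", "very good", "good", "bad"))
barplot(table(feedback), xlab = "Rating", ylab = "Count")
```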
Now, when you build distributions, you end up with distributions of different shapes. When you draw a smooth line over these plots — say you go back to the earlier distribution and draw a smooth line over the tops of all the bars, or over the points — you get a shape, and that shape typically determines what kind of distribution it is. There are different types of distributions. In a J-shaped distribution, the lower bins have a lot more values than the higher bins. In a normal distribution, the middle bins have the most values, and the lower and upper bins do not have that many. In a rectangular distribution, all the bins have an equal number of values. In a bimodal distribution, you see two bumps, or two mountains, in the distribution — that is why it is called bimodal: one bin on the lower side has a lot of values, and one bin on the higher side has a lot of values. Then there are the positive skew and the negative skew: with a positive skew, the lower bins typically have more values; with a negative skew, the higher bins have more values. These are the different types of distributions you typically encounter.

Then comes the most important thing, called a probability distribution. What is a probability distribution? This is a slightly complex concept, so pay a bit more attention to understanding it. It assigns a probability to each measurable subset of the possible outcomes of an experiment. There is a lot packed into that, so let us take it one piece at a time. You have an experiment. An experiment is nothing but collecting data. An experiment might be: among 100 patients, I want to find out what the age ranges of these patients are, and I want to plot a regular distribution. What I would typically do is take the ages of these 100 patients, put them into buckets of 0 to 10, 10 to 20, 20 to 30, and so on, put the counts on the Y axis, and draw a nice plot. That is a regular distribution. What a probability distribution does is: instead of plotting the count on the Y axis, I plot the probability of each of these ranges occurring. How do I do that? Let me go back to the earlier plot and show you. Here is the earlier distribution we built, with the bins shown on the X axis and the counts on the Y axis. Now I just change the count into a probability. How do I find the probability of each bin? It is very simple: take the count in each bin and divide it by the total number of values, and that becomes the probability for that bin. In this case, the total number of values is 10, and the count in the first bin is 3, so 3 by 10, or 0.3, is the probability of the first bin; 0.2 is the probability of the next bin, and so on. You plot the probabilities on the Y axis and the bins on the X axis, and that becomes your probability distribution.
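A tiny R sketch of turning bin counts into a probability distribution, reusing the same made-up values as before:

```r
values <- c(4, 7, 1, 2, 5, 9, 2, 6, 3, 8)

# Count how many values fall into each bin of width two
counts <- table(cut(values, breaks = c(0, 2, 4, 6, 8, 10)))

# Divide each count by the total number of values to get probabilities
probs <- counts / length(values)
probs                                   # first bin: 3/10 = 0.3; probabilities sum to 1
barplot(probs, xlab = "Bins", ylab = "Probability")
```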
Let us go back to the other slide and dig a little deeper. In this case, say I have age on the X axis, and the probability that a patient has that age is plotted on the Y axis — which I measure by collecting data from 100 patients, putting them all into a distribution, and then converting it into a probability distribution with the technique we just covered. Each possible range is plotted along the X axis — as we said, ages 0 to 10, 10 to 20, 20 to 30 — and the probability that a particular age group occurs is plotted on the Y axis. A probability is always a value between 0 and 1. So if 30 patients out of 100 are in the age group 20 to 30, the probability of 20 to 30 is 30 by 100, which is 0.3. Probability distributions can be either discrete or continuous: they can be just bins, or you can use them for a continuous set of values and fit a nice curve over them — you can do both with probability distributions.

One of the most popular distributions you will see is called the normal distribution, or the Gaussian distribution. What is a normal distribution? A normal distribution is one where, when you plot the values of a given data set on a chart, the data takes the shape of what is called a normal-looking curve. What is a normal-looking curve? Look at the chart on the right side. A normal-looking curve is symmetric about the mean: take the mean, which is the middle bar, and the chart is symmetric — both sides look the same. There is no skew to the left or to the right, nothing bimodal, nothing like that; it is equally symmetric about the mean. There are other characteristics of a normal distribution, and let us try to understand what each of them means. It says that about 68% of the values lie within one standard deviation of the mean. What does that mean? The mean is plotted in the middle — here it is written as x-bar; we could also have called it mu; there is a slight difference, but we will not go into that. Then what is x-bar plus one sigma, and what is x-bar minus one sigma? Suppose the mean of the values in your data set is 5 and the standard deviation is 2. Then the mean plus one sigma is 5 plus 2, which is 7, and the mean minus one sigma is 3. So between 3 and 7, 68% of the values occur. If the data set had 100 values with a mean of 5 and a standard deviation of 2, 68 of them would lie between the values 3 and 7. Let me repeat: I have a data set of 100 values with a mean of 5 and a standard deviation of 2; 68 of them will lie between 3 and 7, which is the mean plus one sigma and the mean minus one sigma. It also says that 95% of the values will lie within two standard deviations of the mean, which means 95% of the values are going to be between the mean plus two sigma and the mean minus two sigma — 5 plus 4 and 5 minus 4 — so between the values 1 and 9, 95% of the values occur. This shows you how the values are distributed, and if the values are distributed in this manner, conforming to this particular shape and these numbers, it is a normal distribution.
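A quick R check of those two rules of thumb, using the lecture's example of a mean of 5 and a standard deviation of 2:

```r
# Probability mass within one and two standard deviations of the mean
pnorm(7, mean = 5, sd = 2) - pnorm(3, mean = 5, sd = 2)   # ~0.68
pnorm(9, mean = 5, sd = 2) - pnorm(1, mean = 5, sd = 2)   # ~0.95

# Simulating 100 values with that mean and sd tells a similar story
x <- rnorm(100, mean = 5, sd = 2)
mean(x > 3 & x < 7)   # roughly 0.68, give or take sampling noise
```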
So why would you want to find out whether your data is normal or not? Because once you know your data looks like a normal distribution, there are a lot of standard, ready-made formulas you can start applying. There is a set of standard assumptions you can apply to your data; you do not have to sit and compute a lot of things, because a lot of things are already computed for you — there are formulas and libraries available that you can start using once you know your data is normally distributed. It is also typically said that most of the data sets you will find in real life are normally distributed.

On the left side you see another chart, which shows how the normal distribution looks for different values of mu and for different values of sigma — sigma being the standard deviation and sigma squared the variance. For different values of mu and standard deviation, how do these curves look? The first curve is very narrow; that is because the standard deviation, and hence the variance, is quite small. As the variance keeps increasing, the height of the curve goes down and the spread of the curve goes up. That is how different normal distributions take shape.

Here is an example of a normal distribution, about the employees of a cereal factory. It shows the number of years people have worked and how many people fall into each number-of-years category. The number of years is plotted on the X axis; the frequency — that is, the number of people — is plotted on the Y axis. For example, if you look at the bin for eight years of work, there are around a hundred or so people in that particular bin. The mean of this data is 10.21 and the standard deviation is 4.1. When you look at a figure like this, it immediately gives you a nice picture of how the data is spread. You look at the picture and say: okay, this is what my data looks like; the mean is around 11 — you can easily see that just from the chart — the data is nicely spread around it and not skewed in any way; and maybe you can make some assumptions based on that.

Next comes a very important distribution called the binomial distribution. A binomial distribution is about data where each value is either zero or one — it is not an arbitrary number, it is strictly a 0 or a 1. Suppose you have a test: you have a set of 100 patients, and you want to ask, does this patient have cancer or not? You ask this question 100 times, once for each patient, and each answer is either one or zero, yes or no. Since there are only two possible values, it is called binomial, and the distribution for that kind of data is called a binomial distribution. How do you plot a binomial distribution? It basically describes the probability of a Boolean outcome, which is to say: if I have a hundred patients, what is the probability that 30 of my patients have cancer? What is the probability that 50 of my patients have cancer? What is the probability that 70 of my patients have cancer? That is what a binomial distribution usually tries to answer.
Take the example on the right side. Say we have 10 patients, and we are trying to find the probability that exactly two of these patients have cancer — maybe that probability is 0.2, or 20%. What is the probability that five patients out of ten have cancer? That is somewhere around 0.12, or 12%. So you have a number of trials, n — a trial is nothing but an observation, in this case a patient — and you have k, the number of successes, which is the number of patients among the n who have cancer. You are trying to draw a probability distribution and ask: what is the probability that 10% of my patients, 20% of my patients, 30% of my patients have cancer? It is a plot of all those probabilities: the probability is plotted on the Y axis, the number of successes out of the trials on the X axis, and you draw bars or however else you want to look at it.

Here is an example of a binomial distribution. In this case, you are flipping a coin four times. Flipping a coin four times is nothing but four observations, or four trials, and each trial has a binary output: it is going to be either heads or tails — only two possible outputs per trial. Then you take all four trials together and ask how many heads in total you can get. When you flip a coin four times, there are 16 possible combinations of heads and tails, and they are all shown here: you can get head-head-head-head, or something like head-tail-tail-head, and so on. Out of these 16 combinations, you count how many give each possible number of heads and tabulate it in a chart, with the X axis denoting the number of heads. How many ways can you get zero heads? Only one — all tails — so one out of 16; the probability is 1 divided by 16, which is 0.0625. What is the probability of getting exactly one head? That occurs four times in the list, so it is 4 by 16. Exactly two heads occurs 6 times out of 16, exactly three heads 4 times out of 16, and four heads 1 out of 16. You get these nice probabilities, and then you plot the number of heads on the X axis and the probability on the Y axis, and you get this nice bell-like curve. That is an example of a binomial distribution, and you can use it to figure out, say, what the probability of getting two heads is, and to see how the probability is distributed across the outcomes. That completes our discussion of distributions. Thank you.
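The same coin-flip example in R — dbinom() gives the probability of each possible number of heads in four fair flips:

```r
# Probability of 0, 1, 2, 3, 4 heads in 4 flips of a fair coin
dbinom(0:4, size = 4, prob = 0.5)
# 0.0625 0.2500 0.3750 0.2500 0.0625  (i.e. 1/16, 4/16, 6/16, 4/16, 1/16)

barplot(dbinom(0:4, size = 4, prob = 0.5),
        names.arg = 0:4, xlab = "Number of heads", ylab = "Probability")
```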
5. Statistics Correlations: Hi. In this section we are going to talk about correlation. Correlation is the foundation of data science and machine learning. When you talk about data science — about signals, insight, information, knowledge, all kinds of things — the basis of all of them is correlation. In machine learning, you are trying to predict something using some other things. The thing you are trying to predict is the target; the thing you are using to predict it is called the predictor variable. To predict the target, there has to be a correlation between the predictor and the target. If you have correlation, you can use machine learning to predict; if you do not have correlation, then you can do nothing. Correlation is the foundation of data science, so let us learn a bit more about it.

What is correlation? Correlation is a mutual relationship or connection between two or more things. There are two things, represented by two sets of numbers. When we say correlation, we mean that when the value in one set of numbers goes up, the other also goes up, or when the value goes down on one side, the value also goes down on the other side. On the right side there is a chart of blood pressure against age, and you see that as age increases, the blood pressure also increases along with it. This shows the relationship between the variable age and the variable blood pressure, and that is what we call correlation: when one thing goes up, the other also goes up. It shows the interdependence between two sets of values, between two variables. Correlation between two sets of data, then, is how much — how closely — one changes when the other changes. And it is the basis of data science: as I just explained, correlation is needed between the predictor and target variables in order for you to make accurate predictions. Here we saw the example of age and blood pressure.

How do you measure correlation? There are a number of measures that have been used, but the most important and most popular one is what we call the Pearson correlation coefficient. We will be using this coefficient in all our examples in the predictive analytics modules, so please pay close attention to this. Pearson's correlation coefficient is a number that varies between minus one and plus one: the closer the value is to minus one, the more negative the correlation; the closer the value is to plus one, the more positive the correlation; and the closer the value is to zero, the less correlation there is. Examples are shown in the charts at the bottom. Suppose you take two variables, plot one on the X axis and the other on the Y axis, and plot each data point, each sample, on the chart. With a perfect positive correlation, where the value is one, you see that as X increases, Y also increases, along a perfectly straight line. When the correlation is around 0.8 — a highly positive correlation — the points almost fall on a straight line, with small ups and downs here and there. With a value of 0.3, a low positive correlation, the values start spreading out; they still roughly follow a rising straight line, but there is a lot of variance. When there is no correlation, the values are all over the place and the coefficient is around zero. A negative correlation is when one value goes up and the other goes down. Plotting the same way on X and Y, you see that with a low negative correlation the values are spread out but still fall roughly along a declining straight line; with a high negative correlation it is almost a straight line; and a perfect negative correlation is a perfectly straight declining line. So this is how the correlation between two sets of values can vary.
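A small R sketch of computing the Pearson coefficient — the age and blood-pressure numbers below are invented purely for illustration, not data from the lecture's chart:

```r
# Invented example data: blood pressure loosely rising with age
age <- c(25, 32, 40, 48, 55, 63, 70)
bp  <- c(118, 121, 126, 130, 137, 141, 148)

cor(age, bp)                      # Pearson correlation, close to +1 here
cor(age, bp, method = "pearson")  # "pearson" is the default method

plot(age, bp)                     # scatter plot to eyeball the relationship
```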
It is important to keep plotting your data in these kinds of charts, to keep looking at them and seeing what kind of correlation you observe and what the measured Pearson correlation coefficient is for each pair of variables.

One very important thing to understand about correlation is the relationship between correlation and causation. Correlation, as we saw, is the relationship between two values; causation is the reason for a change in a value. You have variable A and variable B, and when variable A goes up, variable B also goes up. Does that mean variable A is the cause of variable B? Take a simple example: the relationship between weight and cholesterol levels. You would typically see that as weight goes up, cholesterol levels also go up. Is that causation? Is weight a cause of cholesterol levels going up? Yes, because it is scientifically established that the more you weigh, typically the more fat you have in your body, which in turn means there are a number of physiological reasons why you will have more cholesterol. So there is a reason, an explanation, for this causation: when weight goes up, cholesterol level also goes up. Now ask the reverse question: is the cholesterol level the cause of the weight? No — just because your cholesterol level goes up does not mean your weight goes up; it is weight that affects cholesterol, not the other way around. So correlation does not necessarily imply causation; that is another important point. Just because two variables are related — just because when one goes up the other goes up — does not mean one is the cause of the other.

Here is another example: dress size. Compare dress size against cholesterol level. When your dress size goes up, typically your cholesterol level also goes up. Does that mean your dress size is the cause of your cholesterol level? No. They are correlated with each other, but one is not the cause of the other; they actually have a third, common cause, which is nothing but your weight. Weight is the cause: there is a correlation between your weight and your dress size, and between your weight and your cholesterol level, and weight is the reason both your cholesterol level and your dress size go up. So a correlation between two variables may be due to causation, it may be due to a common cause — like dress size and cholesterol level having the common cause of weight — or it may be purely incidental, with no underlying reason at all. Whenever you see a correlation between two variables, you have to go and figure out why they are correlated: is there causation involved, is there a common cause involved, or is it purely incidental? When you try to predict something with machine learning, you want the predictor variables to be the cause of the target variables. That is when the prediction will hold good in the future: when your predictor variables are the cause of your target variables, your machine learning is going to work well. If that is missing — if the relationship is purely incidental — there is no guarantee that the correlation you see today will also hold tomorrow.
So you always want to go and look at the reason why variables are correlated with each other when you are doing machine learning. Here is an example: the relationship between US highway fatality rates and fresh lemons imported into the US from Mexico. The interesting thing you see is that when the imports of fresh lemons from Mexico were lower, the highway fatality rate was higher. So why would the US highway fatality rate from road accidents be high when lemon imports from Mexico are low? This is purely incidental. As you can see, this is on a timeline, so there could be two different, independent things happening over that time period driving each of them, but the US highway fatality rate has no real relationship to the fresh lemons imported from Mexico. Even though you see a correlation here, there is no reason to believe there is an actual relationship between them. So you have to be very careful when you see a correlation between two variables: you have to establish the ground truth as to why you see that kind of relationship. That completes our discussion on correlations. Thank you.