Transcripts
1. Introduction Class Overview: Hi, and welcome to Statistics for Business Analysis and Data Science. My name is Karima, and I'm a business intelligence analyst who has spent a significant amount of time working in business intelligence. That work requires skills such as statistics. I'm really excited to present to you a statistics course that stands out. Descriptive statistics is all about the basics. We'll start by learning the different types of data. We'll distinguish between population and sample data. We'll study the levels of measurement we can use. Then we'll learn the difference between categorical and numerical variables. We'll also learn how to calculate the measures of central tendency: mean, median and mode. And finally, we'll learn how to quantify variability. So what are you waiting for? Let's begin this journey together. See you there.
2. Understanding Population and Sample Data: Before processing any data and making decisions, we should introduce some key definitions. The first step of every statistical analysis you perform is to determine whether the data you are dealing with is a population or a sample. A population is the collection of all items of interest to our study and is usually denoted with an uppercase N. The numbers we obtain when using a population are called parameters. A sample is a subset of the population and is denoted with a lowercase n, and the numbers we obtain when working with a sample are called statistics. Now you know why the field we're studying is called statistics. Let's say we want to perform a survey of the job prospects of students studying at New York University. What is the population? You can simply walk into New York University and find every student, right? Well, that would surely not be the population of NYU students. The population of interest includes not only the students on campus, but also the ones at home, on exchange, abroad, distance education students, part-time students, even the ones who enrolled but are still in high school. Populations are hard to define and hard to observe in real life. A sample, however, is much easier to gather. It is less time consuming and less costly. Time and resources are the main reasons we prefer drawing samples compared to analyzing an entire population. So let's draw a sample then. As we first wanted to do, we can just go to the NYU campus and enter the canteen, because we know it will be full of people. We can then interview 50 of them. Cool. This is a sample drawn from the population of NYU students. Good job. Populations are hard to observe and contact. That's why statistical tests are designed to work with incomplete data, and you will almost always be working with sample data.
You will make data-driven decisions and inferences based on it. Since statistical tests are usually based on sample data, samples are key to accurate insights. They have two defining characteristics: randomness and representativeness. A sample must be both random and representative for an insight to be precise. A random sample is collected when each member of the sample is chosen from the population strictly by chance. A representative sample is a subset of the population that accurately reflects the members of the entire population. Let's go back to the sample we just discussed, the 50 students from the NYU canteen. We walked into the university canteen and violated both conditions. People were not chosen by chance. They were a group of NYU students who were there for lunch. Most members did not even get the chance to be chosen, as they were not in the canteen. Thus we conclude that the sample was not random. But was it representative? Well, it represented a group of people, but definitely not all students in the university. To be exact, it represented the people who have lunch at the university canteen. Had our survey been about the job prospects of NYU students who eat in the university canteen, we would have done well. Okay. You must be wondering how to draw a sample that is both random and representative. Well, the safest way would be to get access to the student database and contact individuals in a random manner. However, such surveys are almost impossible to conduct without assistance from the university. All right. Throughout this course we'll explore both sample and population statistics. After completing this course, samples and populations will be a piece of cake for you. Thanks for watching.
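To make the idea of contacting individuals "in a random manner" concrete, here is a minimal Python sketch. It assumes we already have a complete list of students; the list below is a made-up stand-in for the student database, and the names `population` and `draw_random_sample` are hypothetical, not part of the course.

```python
import random

# Hypothetical stand-in for the student database: 10,000 made-up IDs.
# This is NOT real NYU data; it only plays the role of the population (N).
population = [f"student_{i:05d}" for i in range(10_000)]

def draw_random_sample(items, n, seed=None):
    """Return n distinct items, each chosen from the list strictly by chance."""
    rng = random.Random(seed)      # a seed makes the draw reproducible
    return rng.sample(items, n)    # sampling without replacement

# The sample (n = 50) that we would contact for the survey.
sample = draw_random_sample(population, 50, seed=42)

print("Population size N:", len(population))
print("Sample size n:", len(sample))
```

Because every student in the list has the same chance of being drawn, this avoids the canteen problem: nobody is excluded just because they were not in a particular place at lunchtime.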
3. Various Types of Data and Levels of Measurement: You may be watching this course because you want to use the knowledge acquired as a stepping stone to a career in business analysis, business intelligence or data science. Either way, before we can start any analysis or testing, we have to get acquainted with the types of variables we usually encounter. Different types of variables require different types of statistical and visualization approaches. Therefore, being able to classify the data you're working with is key. We can classify data in two main ways: based on its type and on its measurement level. Let's start from the types of data. We can have categorical and numerical data. Categorical data describes categories or groups. One example is car brands like Mercedes, BMW and Audi. They show different categories. Another example is answers to yes-and-no questions. So if I ask questions like "Are you currently enrolled in a university?" or "Do you own a car?", yes and no would be the two groups of answers that can be obtained. This is categorical data. Numerical data, on the other hand, as its name suggests, represents numbers. It is further divided into two subsets: discrete and continuous. Discrete data can usually be counted in a finite manner. A good example would be the number of children you want to have. Even if you don't know exactly how many, you're absolutely sure the value will be an integer such as 0, 1, 2 or even 10. Another instance is the grade on the SAT exam. You may get 1,000, 1,560 or 2,400. What is important for a variable to be defined as discrete is that you can imagine each member of the data set. Knowing that SAT scores range from 600 to 2,400 and that a fixed number of points separates them, we can picture every possible score. It's easier to understand discrete data by saying it's the opposite of continuous data. Continuous data is infinite and impossible to count. For instance, your weight can take on every value in some range. Let's dig a bit deeper into this. You get on the scale.
The scale shows 150 pounds, or about 68 kilograms, but this is just an approximation. If you gain 0.1 pounds, the figure on the scale is unlikely to change, but your new weight will be 150.1 pounds. Now think about sweating. Every drop of sweat reduces your weight by the weight of that drop. But once again, a scale is unlikely to capture that change. So that is what a continuous variable is: it can take on an infinite amount of values, no matter how many digits there are after the dot. To conclude, your weight can vary by incomprehensibly small amounts and is continuous, while the number of children you want to have is directly understandable and is discrete. Just to make sure, here are other examples of discrete and continuous data. Grades at university are discrete: A, B, C, D, E, F, or 0 to 100%. The number of objects in general, no matter if bottles, glasses, tables or cars, can only take integer values. Money can be considered both discrete and continuous, but physical money like banknotes and coins is definitely discrete. You can't pay $1.243. You can only pay $1.24. That's because the smallest possible difference between two sums of money is one cent. What else is continuous? Apart from weight, other measurements are also continuous. Examples are height, area, distance and time. All of these can vary by infinitely small amounts, incomprehensible for a human. The time on the clock is discrete, but time in general isn't. It can be anything, like 72.123456 seconds. We are constrained in measuring weight, height, area, distance and time by our technology, but in general they can take on any value. All right, these were the types of data. So let's explore the levels of measurement. So far, we have been able to distinguish between categorical and numerical data. Furthermore, we saw that numerical data can be discrete or continuous. So it's time to move on to the other classification: levels of measurement. This classification splits into two groups.
Qualitative and quantitative data. Qualitative data can be nominal or ordinal. Nominal variables are like the categories we talked about just now, Mercedes, BMW and Audi, or the four seasons: winter, spring, summer and autumn. They aren't numbers and cannot be ordered. Ordinal data, on the other hand, consists of groups and categories which follow a strict order. Imagine you have been asked to rate your lunch, and the options are disgusting, unappetizing, neutral, tasty and delicious. Although we have words and not numbers, it's obvious that these preferences are ordered from negative to positive. Thus the level of measurement is qualitative ordinal. Okay, so what about quantitative variables? Well, they are also split into two groups: interval and ratio. Interval and ratio are both represented by numbers, but have one major difference: ratios have a true zero and intervals don't. Most things we observe in the real world are ratios. Their name comes from the fact that they can represent ratios of things. For instance, if I have two apples and you have six apples, you have three times as many as I do. How was this found out? Well, the ratio of six and two is three. Other examples are number of objects in general, distance and time. Intervals are not as common. Temperature is the most common example of an interval variable. Remember, it cannot represent a ratio of things and doesn't have a true zero. Let me explain. Usually temperature is expressed in Celsius or Fahrenheit. They are both interval variables. Say today is 5 degrees Celsius, or 41 degrees Fahrenheit, and yesterday was 10 degrees Celsius, or 50 degrees Fahrenheit. In terms of Celsius, it seems today is twice as cold. But in terms of Fahrenheit, not really. The issue comes from the fact that zero degrees Celsius and zero degrees Fahrenheit are not true zeros. These scales were artificially created by humans for convenience. Now there is another scale, called Kelvin, which has a true zero.
Zero degrees Kelvin is the temperature at which atoms stop moving, and nothing can be colder than zero degrees Kelvin. This equals minus 273.15 degrees Celsius, or minus 459.67 degrees Fahrenheit. Variables shown in Kelvin are ratios, as we have a true zero, and we can make the claim that one temperature is two times more than another. Celsius and Fahrenheit have no true zero and are intervals. Finally, numbers like 2, 3, 10, 10.5, et cetera can be both interval or ratio, but you have to be careful with the context you are operating in. So we've gone through the types of data and the measurement levels. In the next lesson, we'll study the types of charts and graphs that are used most often. Thanks for watching.
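The temperature example from this lesson can be checked with a few lines of Python. This is only an illustrative sketch; the function names are our own. It shows that the "twice as cold" ratio depends entirely on the scale when the scale has no true zero, while Kelvin gives a ratio that actually means something.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    """Convert degrees Celsius to Kelvin (true zero at -273.15 C)."""
    return c + 273.15

today_c, yesterday_c = 5.0, 10.0   # the two days from the lesson

# Ratio "yesterday / today" on each scale.
ratio_c = yesterday_c / today_c
ratio_f = celsius_to_fahrenheit(yesterday_c) / celsius_to_fahrenheit(today_c)
ratio_k = celsius_to_kelvin(yesterday_c) / celsius_to_kelvin(today_c)

print(f"Celsius ratio:    {ratio_c:.3f}")   # looks like 'twice as warm'
print(f"Fahrenheit ratio: {ratio_f:.3f}")   # same two days, different ratio
print(f"Kelvin ratio:     {ratio_k:.3f}")   # the physically meaningful ratio
```

The Celsius and Fahrenheit ratios disagree with each other, which is the sign of an interval scale. Only the Kelvin ratio can be read as "one temperature is so many times another".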
4. Visualisation Techniques for Categorical and Numeric Variables: Now that we have studied the different types of data and levels of measurement, we are ready to explore different graphs and tables which will allow us to visually represent the data we are working with. Visualizing data is the most intuitive way to interpret it, so it is an invaluable skill. It is much easier to visualize data if you know its type and measurement level. As you may recall, there are two types of variables: categorical and numerical. So let's begin with categorical variables. Some of the most common ways to visualize them are frequency distribution tables, bar charts, pie charts and Pareto diagrams. Okay, first, let's see what a frequency distribution table looks like. It has two columns: the category itself and the corresponding frequency. Imagine you own a car shop and you sell only German cars. The table below shows the categories of cars, Audi, BMW and Mercedes, and the number of units sold, or the frequency. By organizing your data in this way, you can compare the different brands and see that Audi has been sold the most. So that is a frequency distribution table. Using the same table, we can construct a bar chart, also known as a column chart. The vertical axis shows the number of units sold, while each bar represents a different category, indicated on the horizontal axis. In this way, it is much clearer that Audi is the best-selling brand. Okay, let's represent the same data as a pie chart. In order to build one, we need to calculate what percentage of the total each brand represents. In statistics, this is known as relative frequency. Naturally, all relative frequencies add up to 100%. Pie charts are especially useful when we want to not only compare items among each other but also see their share of the total. Okay, this example could easily be transformed into the business example of market share.
Market share is so predominantly represented by pie charts that if you search for market share on Google Images, you would only get pie charts. Imagine that the data in our table represents the sales of Audi, BMW and Mercedes in a single German city. The chart shows the market share that each of these brands has. Lastly, we have the Pareto diagram. In fact, a Pareto diagram is nothing more than a special type of bar chart where categories are shown in descending order of frequency. By frequency, statisticians mean the number of occurrences of each item. As we said earlier, in our example that's exactly the number of units sold. Let's go back to our frequency distribution table and order the brands by frequency. Now we can create a bar chart based on the reordered table, and voila, we almost have a Pareto diagram. There is one last touch to make it one: a curve on the same graph showing the cumulative frequency. The cumulative frequency is the sum of the relative frequencies. It starts at the frequency of the first brand, then we add the second, the third and so on until it finishes at 100%. You see how the Pareto diagram combines the strong sides of the bar and the pie chart. It is easy to compare the data both between categories and as a part of the total. Furthermore, if this was a market share graph, you could easily see the market share of the top two out of five companies. The Pareto principle, also known as the 80-20 rule, states that 80% of the effects come from 20% of the causes. A real-life example is a statement by Microsoft that by fixing 20% of its software bugs, they managed to solve 80% of the problems customers experienced. A Pareto diagram can reveal information like that. It is designed to show how subtotals change with each additional category and provide us with a better understanding of our data. Okay, these are the main ways in which we can visually represent categorical data.
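The numbers behind a frequency distribution table, a pie chart and a Pareto diagram can be sketched in Python as follows. The unit sales are hypothetical (the transcript only tells us that Audi sold the most), and the variable names are our own.

```python
# Hypothetical unit sales; only "Audi sold the most" comes from the lesson.
units_sold = {"Audi": 124, "BMW": 98, "Mercedes": 113}

total = sum(units_sold.values())

# Pareto ordering: categories sorted by descending frequency.
ordered = sorted(units_sold.items(), key=lambda item: item[1], reverse=True)

rows = []
running_share = 0.0
for brand, frequency in ordered:
    relative = frequency / total      # relative frequency (pie chart slice)
    running_share += relative         # cumulative frequency (Pareto curve)
    rows.append((brand, frequency, relative, running_share))

print(f"{'Brand':<10}{'Freq':>6}{'Rel.':>8}{'Cum.':>8}")
for brand, frequency, relative, cumulative in rows:
    print(f"{brand:<10}{frequency:>6}{relative:>8.1%}{cumulative:>8.1%}")
```

The second column is the frequency distribution table, the third column gives the pie chart slices, and the last column is the curve that turns the ordered bar chart into a Pareto diagram. It always ends at 100%.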
What about numerical variables? We already know how to create graphs and tables for categorical variables, so let's do the same for numerical variables. Whenever we want to plot data, it is best to first order it in a table. So, as we did with categorical variables, let's start by creating a frequency distribution table. Here is a list of 20 different numbers. If we arranged them in a frequency table like the one we used for categorical variables, we would obtain a table with 20 rows, each of them representing one number, with a corresponding frequency of one, as each number occurs exactly one time. This table would be impractical for any analysis. So when we deal with numerical variables, it makes much more sense to group the data into intervals and then find the corresponding frequencies. In this way, we make a summary of the data that allows for a meaningful visual representation. How do we choose these intervals? Generally, statisticians prefer working with groups of data that contain 5 to 20 intervals. This way, the summary can be useful. However, this varies from case to case, and the correct choice of intervals largely depends on the amount of data we're working with. In our example, we'll divide the data into five intervals of equal length. The simple formula that we use is as follows: the interval width is equal to the largest number minus the smallest number, divided by the number of desired intervals. In our case, the length of the interval should be 100 minus 1, divided by 5. The result is 19.8. Now, we want to round up this number in order to reach a neat representation. Therefore, the intervals will be as follows: 1 to 21, 21 to 41, 41 to 61, 61 to 81 and 81 to 101. Each interval has a width of 20. The most common graph used to represent numerical data is the histogram. First, we'll see how it is created, and then we'll provide a description of the way the data is represented. We're going to use the frequency distribution table from our previous example to plot it. Here it is.
As you can see, it looks like a bar chart but actually conveys very different information. As in any bar chart, the vertical axis is of numerical type and shows the absolute frequency. This time, though, the horizontal axis is numerical too. So each bar has a width equal to the interval and a height equal to the frequency. Notice how the different bars are touching. This is to show that there is continuation between the intervals: each interval ends where the next one starts, whereas in the bar chart we saw earlier, different bars represented different categories, so the bars were completely separate. Okay, sometimes it is useful to plot the intervals against the relative, rather than the absolute, frequency. As you can see, the histogram looks the same visually but gives different information to the audience. Remember, relative frequency is made up of percentages. This is how we can build a histogram in order to represent numerical data. Okay. So far, we've covered graphs that represent only one variable. But how do we represent relationships between two variables? Let's explore cross tables and scatter plots. Once again, there is a division between categorical and numerical variables. Let's start with categorical variables. The most common way to represent them is using cross tables, or, as some statisticians call them, contingency tables. Imagine you're an investment manager and you manage stocks, bonds and real estate investments for three different investors. Each of them has a different idea of risk, and hence their money is allocated in a different way among the three asset classes. A cross table representing all the data looks the following way. You can clearly see the rows showing the type of investment that's been made and the columns with each investor's allocation. It's a good practice to calculate the totals of each row and column, as it is often useful in further analysis.
The totals of the rows give us the total investment in stocks, bonds and real estate. On the other hand, the totals of the columns give us the holdings of each investor. Once we have created a cross table, we can proceed by visualizing the data. A very useful chart in such cases is a variation of the bar chart called the side-by-side bar chart. It represents the holdings of each investor in the different types of assets. Stocks are in green, bonds in red and real estate in blue. The name of this type of chart comes from the fact that for each investor, the categories of assets are represented side by side. In this way, you can easily compare asset holdings for a specific investor or among investors. Easy, right? Graphs are very easy to create and read once you have identified the type of data you're dealing with and decided on the best way to visualize it. Finally, we'd like to conclude with a very important graph: the scatter plot. It is used when representing two numerical variables. For this example, we have gathered the reading and writing SAT scores of 100 individuals. Let me first show you the graph before analyzing it. All right. First, SAT scores by component range from 200 to 800 points, and that's why our data is bounded within the range of 200 to 800. Second, our vertical axis shows the writing scores, while the horizontal axis contains the reading scores. Third, there are 100 students, and their results correspond to specific points on the graph. Each point gives us information about a particular student's performance. For example, this is Jane. She scored 300 on writing but 550 on the reading part. Scatter plots usually represent lots and lots of observations. When interpreting a scatter plot, a statistician is not expected to look into single data points. They would be much more interested in getting the idea of how the data is distributed. So the first thing we see is that there is an obvious uptrend.
This is because lower writing scores are usually obtained by students with lower reading scores, and higher writing scores have been achieved by students with higher reading scores. This is logical, right? Students are more likely to do well on both because the two tasks are closely related. Second, we notice a concentration of students in the middle of the graph, with scores in the region of 450 to 550 on both reading and writing. Remember, we said that scores can be anywhere between 200 and 800. Well, 500 is the average score one can get, so it makes sense that a lot of students fall into that area. Third, there is this group of people with both very high writing and reading scores. The exceptional students tend to be excellent on both components. Finally, we have Jane from a minute ago. She's far away from every other observation, as she scored above average on reading but poorly on writing. This observation is called an outlier, as it goes against the logic of the whole data set. We will learn more about outliers and how to treat them in our analysis later on in this course. So we have gone through the very basics. In statistics, we have covered populations, samples, types of variables, graphs and tables, and it is time for us to dive into the heart of descriptive statistics: measurements of central tendency and variability. Thanks for watching and see you there.
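The interval-width formula used for the histogram in this lesson, (largest number minus smallest number) divided by the number of intervals, can be sketched in Python. The 20 numbers below are hypothetical; only the minimum of 1 and the maximum of 100 come from the lesson. Treating each interval as including its start and excluding its end is also our assumption, since the transcript does not say how boundary values are handled.

```python
import math

# Hypothetical list of 20 different numbers (only min = 1 and max = 100
# are taken from the lesson).
data = [1, 9, 22, 24, 32, 41, 44, 48, 57, 66,
        70, 73, 75, 76, 79, 82, 87, 89, 95, 100]

def make_intervals(values, n_intervals):
    """Split the range of the data into equal-width intervals."""
    low, high = min(values), max(values)
    # (100 - 1) / 5 = 19.8, rounded up to a neat width of 20.
    width = math.ceil((high - low) / n_intervals)
    return [(low + i * width, low + (i + 1) * width)
            for i in range(n_intervals)]

def frequency_table(values, intervals):
    """Count the values in each interval, read as [start, end)."""
    return [sum(start <= v < end for v in values)
            for start, end in intervals]

intervals = make_intervals(data, 5)
absolute = frequency_table(data, intervals)
relative = [count / len(data) for count in absolute]

for (start, end), a, r in zip(intervals, absolute, relative):
    print(f"{start:>3} to {end:<3} | absolute: {a:>2} | relative: {r:.0%}")
```

The absolute frequencies give the bar heights of the histogram, and the relative frequencies give the same shape expressed in percentages. If the range happened to divide exactly by the number of intervals, the largest value would sit on the excluded end of the last interval, so that case would need the last interval to be closed.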
5. Calculating Measures of Central Tendency: In this lesson, we'll introduce you to the three measures of central tendency. Don't be scared by the terminology. We're talking about the mean, the median and the mode. Even if you're familiar with these terms, please stick around, as we will explore their upsides and shortfalls. The first measure we'll study is the mean, also known as the simple average. It is denoted by the Greek letter mu for a population and by x-bar for a sample. We can find the mean of a data set by adding up all of its components and dividing them by their number. The mean is the most common measure of central tendency, but it has a huge downside: it is easily affected by outliers. Let's aid ourselves with an example. These are the prices of pizza at 11 different locations in New York City and 10 different locations in LA. Let's calculate the means of the two data sets using the formula. For the mean in NYC, we get $11, whereas for LA, just $5.50. On average, pizza in New York can't be twice as expensive as in LA, right? Right. The problem is that in our example, we have included one posh place in New York where they charge $66 for pizza, and this doubled the mean. What we should take away from this example is that the mean is not enough to make definite conclusions. So how can we protect ourselves from this issue? We can calculate the second measure, the median. The median is basically the middle number in an ordered data set. Let's see how it works for our example. In order to calculate the median, we have to order our data set in ascending order. The median of the data set is the number at position n plus one, divided by two, in the ordered list, where n is the number of observations. Therefore, the median for NYC is at the sixth position, or $6, much closer to the observed prices than the mean of $11. What about LA? We have just 10 observations in LA. According to our formula, the median is at position 5.5.
In cases like this, the median is the simple average of the numbers at positions five and six. Therefore, the median of the LA prices is $5.50. So we have seen that the median is not affected by extreme prices, which is good when we have posh New York restaurants in a street pizza sample. But we still don't get the full picture. We must introduce another measure: the mode. The mode is the value that occurs most often. It can be used for both numerical and categorical data, but we'll stick to a numerical example. After counting the frequencies of each value, we find that the mode of the New York pizza prices is $3. Now that's interesting. The most common price of pizza in NYC is just $3, but the mean and the median led us to believe it was much more expensive. Okay, let's do the same and find the mode of the LA pizza prices. Hmm, each price appears only once. How do we find the mode then? Well, we say that there is no mode. But can't I say there are 10 modes? Sure you can, but it would be meaningless with 10 observations, and an experienced statistician would never do that. In general, you often have multiple modes. Usually two or three modes are tolerable, but more than that would defeat the purpose of finding a mode. There is one last question we haven't answered, and that is: which measure is best? The example we just saw shows us that the measures of central tendency should be used together rather than independently. Therefore, there is no best, but using only one is definitely the worst. Now you know about the mean, median and mode. In our next lesson, we'll use that knowledge to talk about skewness. Thanks for watching.
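The three measures from this lesson can be reproduced with Python's built-in `statistics` module. The price lists below are illustrative assumptions: the transcript does not list the individual prices, so these values were chosen only to match the figures quoted in the lesson (NYC mean $11, median $6, mode $3, one $66 outlier; LA mean and median $5.50, every price unique).

```python
import statistics

# Assumed prices in dollars, chosen to match the lesson's summary figures.
nyc = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11, 66]   # 11 locations, one posh outlier
la = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]        # 10 locations, all different

def median_position(n):
    """Position of the median in the ordered list: (n + 1) / 2."""
    return (n + 1) / 2

for city, prices in (("NYC", nyc), ("LA", la)):
    ordered = sorted(prices)
    print(city)
    print("  mean:           ", statistics.mean(ordered))
    print("  median position:", median_position(len(ordered)))
    print("  median:         ", statistics.median(ordered))
    # multimode returns every most-frequent value; when all values are
    # unique it returns all of them, which is the "no mode" case.
    print("  mode(s):        ", statistics.multimode(ordered))
```

For LA, `multimode` returns all ten prices, which is the situation the lesson describes as having no mode. Dropping the $66 value from the NYC list and rerunning shows how strongly a single outlier pulls the mean while barely moving the median.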
6. Calculating the Measures of Asymmetry: After exploring the measures of central tendency, let's move on to the measures of asymmetry. The most commonly used tool to measure asymmetry is skewness. This is the formula to calculate it. Almost always, you will use software that performs the calculation for you, so in this lesson we will not go into the computation, but rather into understanding skewness. Skewness indicates whether the observations in a data set are concentrated on one side. This can be confusing in the beginning, so let's see an example. Here we have three data sets and their respective frequency distributions. We have also calculated the means, medians and modes. The first data set has a mean of 2.79 and a median of 2. Hence, the mean is bigger than the median. We say that this is a positive, or right, skew. From the graph, you can clearly see that the data points are concentrated on the left side. Note that the direction of the skew is counterintuitive. It does not depend on which side the line is leaning to, but rather on which side its tail is leaning to. So right skewness means the outliers are to the right. It is interesting to see the measures of central tendency incorporated in the graph. When we have right skewness, the mean is bigger than the median, and the mode is the value with the highest visual representation. In the second graph, we have plotted a data set that has an equal mean, median and mode. The frequency of occurrence is completely symmetrical, and we call this a zero, or no, skew. Most often, you will hear people say that the distribution is symmetrical. For the third data set, we have a mean of 4.9, a median of 5 and a mode of 6. As the mean is lower than the median, we say that there is a negative, or left, skew. Once again, the highest point is defined by the mode. Why is it called a left skew again? That's right, because the outliers are to the left. All right, so why is skewness important? Skewness tells us a lot about where the data is situated.
As we mentioned in our previous lesson, the mean, the median and the mode should be used together to get a good understanding of the data set. Measures of asymmetry like skewness are the link between central tendency measures and probability theory, which ultimately allows us to get a more complete understanding of the data we are working with. Thanks for watching.
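The formula shown on screen in this lesson is not captured in the transcript, so the sketch below uses one common definition as an assumption: the third central moment divided by the variance raised to the power 1.5 (often called the Fisher-Pearson coefficient). The three small data sets are hypothetical and are not the ones from the lesson; they only illustrate the link between the sign of the skewness and the order of the mean and the median.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment / variance ** 1.5.

    This is one common definition; software packages may apply a
    sample-size correction and report slightly different numbers.
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets, not the ones plotted in the lesson.
datasets = {
    "right skew": [1, 1, 2, 2, 2, 3, 3, 4, 5, 7],
    "no skew":    [1, 2, 2, 3, 3, 3, 4, 4, 5],
    "left skew":  [1, 3, 4, 5, 5, 6, 6, 6, 7, 7],
}

for name, values in datasets.items():
    print(f"{name:<10} mean={statistics.fmean(values):.2f} "
          f"median={statistics.median(values):.2f} "
          f"skewness={skewness(values):+.3f}")
```

In the right-skewed set the mean is above the median and the skewness is positive; in the left-skewed set the mean is below the median and the skewness is negative; in the symmetrical set the mean and median coincide and the skewness is zero.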
7. How to quantify Variability: Next on our to-do list are the measures of variability. There are many ways to quantify variability. However, we will focus on the most common ones: variance, standard deviation and coefficient of variation. In the field of statistics, we will typically use different formulas when working with population data and sample data. Let's think about this for a bit. When you have the whole population, each data point is known, so you are 100% sure of the measures you are calculating. When you take a sample of this population and you compute a sample statistic, it is interpreted as an approximation of the population parameter. Moreover, if you extract 10 different samples from the same population, you will get 10 different measures. Statisticians have solved the problem by adjusting the algebraic formulas for many statistics to reflect this issue. Therefore, we will explore both population and sample formulas, as they are both used. You must be asking yourself why the mean, median and mode each had just one formula. Well, actually, the sample mean is the average of the sample data points, while the population mean is the average of the population data points. So technically there are two different formulas, but they are computed in the same way. Okay. Now, after this short clarification, it's time to go on to variance. Variance measures the dispersion of a set of data points around their mean value. Population variance, denoted by sigma squared, is equal to the sum of squared differences between the observed values and the population mean, divided by the total number of observations. Sample variance, on the other hand, is denoted by s squared and is equal to the sum of squared differences between the observed sample values and the sample mean, divided by the number of sample observations minus one. Okay. When you are getting acquainted with statistics, it is hard to grasp everything right away.
Therefore, let's stop for a second to examine the formula for the population variance and try to clarify its meaning. The main part of the formula is its numerator, so that's what we want to comprehend. We take the differences between the observations and the mean and square them. So the closer a number is to the mean, the lower the result we obtain, and the further away from the mean it lies, the larger this difference. But why do we elevate to the second degree? Squaring the differences has two main purposes. First, by squaring the numbers, we always get non-negative computations. Without going too deep into the mathematics of it, it is intuitive that dispersion cannot be negative. Dispersion is about distance, and distance cannot be negative. If, on the other hand, we calculated the differences and did not elevate them to the second degree, we would obtain both positive and negative values that, when summed, would cancel out, leaving us with no information about the dispersion. Second, squaring amplifies the effect of large differences. For example, if the mean is 0 and you have an observation of 100, the squared spread is 10,000. All right, now it's time for a practical example. We have a population of five observations: 1, 2, 3, 4 and 5. Let's find its variance. We start by calculating the mean: 1 plus 2 plus 3 plus 4 plus 5, divided by 5, equals 3. Then we apply the formula we just saw: 1 minus 3, squared, plus 2 minus 3, squared, plus 3 minus 3, squared, plus 4 minus 3, squared, plus 5 minus 3, squared. All of these components have to be divided by 5. When we do the math, we get 2. So the population variance of the data set is 2. But what about the sample variance? This would only be suitable if we were told that these five observations were a sample drawn from a population. So let's imagine that is the case. The sample mean is once again 3. The numerator is the same, but the denominator is going to be 4 instead of 5, giving us a sample variance of 2.5.
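The worked example above can be written out in Python. The two hand-written functions follow the verbal formulas from the lesson (divide by N for a population, by n minus 1 for a sample); their names are our own. Python's `statistics` module provides `pvariance` and `variance`, which serve as a cross-check.

```python
import statistics

def population_variance(values):
    """Sum of squared differences from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

observations = [1, 2, 3, 4, 5]

# Numerator shared by both formulas: 4 + 1 + 0 + 1 + 4 = 10.
print("population variance:", population_variance(observations))  # 10 / 5
print("sample variance:    ", sample_variance(observations))      # 10 / 4

# Cross-check against the standard library.
print("statistics.pvariance:", statistics.pvariance(observations))
print("statistics.variance: ", statistics.variance(observations))
```

The only difference between the two results is the denominator, 5 versus 4, which is exactly the adjustment the lesson describes.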
To conclude this topic, we should interpret the results. Why is the sample variance bigger than the population variance? In the first case, we knew the population. That is, we had all the data and we calculated the variance. In the second case, we were told that 1, 2, 3, 4 and 5 was a sample drawn from a bigger population. Imagine the population of the sample were these ten numbers: 1, 1, 1, 2, 3, 4, 5, 5, 5 and 5. Clearly the numbers are the same, but there is a concentration around the two extremes of the data set, 1 and 5. The variance of this population is 2.96. So our sample variance has rightfully been corrected upwards in order to reflect the higher potential variability. This is the reason why there are different formulas for sample and population data. This was a very important lesson, so please make sure you have understood it well. You can reinforce what you've learned by doing the exercise available in the course resource section. Remember, to better understand statistics, you need continuous practice. Thanks for watching.
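The worked example from this lesson can be checked with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and the ten-value "bigger population" is read from the list of numbers in the lesson, which is an assumption about what was shown on screen.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Same numerator as above, but divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


observations = [1, 2, 3, 4, 5]
print(population_variance(observations))  # 2.0
print(sample_variance(observations))      # 2.5

# Assumed bigger population that the five observations could have come from.
bigger_population = [1, 1, 1, 2, 3, 4, 5, 5, 5, 5]
print(round(population_variance(bigger_population), 2))  # 2.96
```

Dividing by n - 1 instead of n is what pushes the sample figure (2.5) above the population figure (2), moving it towards the variance of the larger population.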
8. Standard Deviation and Coefficient of Variation: While variance is a common measure of data dispersion, in most cases the figure you'll obtain is pretty large and hard to compare, as the unit of measurement is squared. The easy fix is to calculate its square root and obtain a statistic known as standard deviation. In most analyses you perform, standard deviation will be much more meaningful than variance. As we saw in the previous lesson, there are different measures for population and sample variance. Consequently, there is also population and sample standard deviation. The formulas are the square root of the population variance and the square root of the sample variance, respectively. If you have a calculator in your hands, you'll be able to do this job. All right. The other measure we still have to introduce is the coefficient of variation. It is equal to the standard deviation divided by the mean. Another name for the term is relative standard deviation. This is an easy way to remember its formula: it is simply the standard deviation relative to the mean. As you probably guessed, there is a population and a sample formula once again. So standard deviation is the most common measure of variability for a single data set. But why do we need yet another measure, such as the coefficient of variation? Well, comparing the standard deviations of two different data sets is meaningless, but comparing coefficients of variation is not. Here's an example of the comparison between standard deviations. Let's take the prices of pizza at 10 different places in New York. They range from 1 to 11 dollars. Now imagine that you only have Mexican pesos, and to you the prices look more like 18.81 pesos to 206.91 pesos, given the exchange rate of 18.81 pesos for one dollar. Let's combine our knowledge so far and find the standard deviation and coefficient of variation of these two data sets. First, we have to see if it is a sample or a population.
The question is, are there only 10 restaurants in New York? Of course not. This is obviously a sample drawn from all the restaurants in the city. Then we have to use the formulas for sample measures of variability. Second, we have to find the mean. The mean in dollars is equal to 5.5, and the mean in pesos is 103.46. The third step of the process is finding the sample variance. Following the formula that we showed earlier, we obtain 10.72 dollars squared and 3793.69 pesos squared. The respective sample standard deviations are 3.27 dollars and 61.59 pesos. Let's make a couple of observations. First, variance gives results in squared units, while standard deviation gives results in the original units. This is the main reason why professionals prefer to use standard deviation as the main measure of variability. It is directly interpretable. Squared dollars mean nothing, even in the field of statistics. Second, we got standard deviations of 3.27 and 61.59 for the same pizza at the same 10 restaurants in New York City. This seems wrong. Don't worry. It is time to use our last tool, the coefficient of variation. Dividing the standard deviations by the respective means, we get the two coefficients of variation. The result is the same: 0.60. Notice that it is not dollars, pesos, dollars squared or pesos squared. It is just 0.60. This shows us the great advantage that the coefficient of variation gives us. Now we can confidently say that the two data sets have the same variability, which is what we expected beforehand. Let's recap what we've learned so far. There are three main measures of variability: variance, standard deviation and coefficient of variation. Each of them has different strengths and applications. You should feel confident using all of them, as we're getting closer to more complex statistical topics. And remember Aristotle's advice: involve me, and I will understand. So please don't forget to get involved with the exercise.
Also refer to the notes provided for Excel formulas to easily calculate these measures. Thanks for watching.
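The pizza comparison can be sketched in Python with the standard `statistics` module. The individual prices are not listed in the lesson, so the `prices_usd` values below are hypothetical, chosen only so that they run from 1 to 11 dollars across 10 places. The exchange rate is the one quoted in the lesson, and `summarize` is an illustrative helper name.

```python
import statistics

EXCHANGE_RATE = 18.81  # pesos per dollar, the rate quoted in the lesson

# Hypothetical prices (not given in the lesson), from 1 to 11 dollars.
prices_usd = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]
prices_mxn = [price * EXCHANGE_RATE for price in prices_usd]


def summarize(values):
    """Return the sample mean, variance, standard deviation and CV."""
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance, divides by n - 1
    std_dev = statistics.stdev(values)      # square root of the sample variance
    return {
        "mean": mean,
        "variance": variance,
        "std_dev": std_dev,
        "cv": std_dev / mean,               # unit-free ratio
    }


usd = summarize(prices_usd)
mxn = summarize(prices_mxn)
for label, stats in (("dollars", usd), ("pesos", mxn)):
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

With these assumed prices, the dollar figures come out at roughly 5.5, 10.72 and 3.27, and both currencies give a coefficient of variation of about 0.60. The mean and standard deviation scale with the exchange rate and the variance with its square, so the ratio of standard deviation to mean is unchanged.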
9. Measures of the Relationships between two Variables: Okay. We've covered all univariate measures. Now it's time to see measures that are used when we work with more than one variable. We'll explore measures that can help us understand the relationship between variables. Our focus will be on covariance and the linear correlation coefficient. Let's think of an example that is very easy to understand and will help us grasp the nature of the relationship between two variables a bit better. Think about real estate. Which is one of the main factors that determine house prices? Yes, size, right. Typically, larger houses are more expensive, as people like having extra space. The table you see here shows us data about several houses. On the left side, we can see the size of each house, and on the right, we have the price at which it's been listed in the local newspaper. We can present these data points in a scatter plot. The x-axis will show a house's size and the y-axis will provide information about its price. We can certainly notice a pattern. There is a clear relationship between these variables. We say that the two variables are correlated, and the main statistic to measure this correlation is called covariance. Unlike variance, covariance may be positive, equal to zero or negative. To understand the concept better, I would like to show you the formulas that allow us to calculate the covariance between two variables. It's formulas with an s because, once again, there is a population and a sample formula. Here they are. Since this is obviously sample data, we should use the sample covariance formula. Let's apply it in practice for the example we saw earlier. x stands for house size and y stands for house price. We need to calculate the mean size and the mean price. I have also computed the sample standard deviations, in case we need them later on. Okay. Now let's tackle it.
We begin with the numerator of the covariance formula. Starting with the first house, we multiply the difference between its size and the average house size by the difference between the price of the same house and the average house price. Once we're ready, we have to perform this calculation for all houses that we have in the table and then sum the numbers we've obtained. Now we have to divide the sum of products by the sample size minus one. The result is the covariance. It gives us a sense of the direction in which the two variables are moving. If they go in the same direction, the covariance will have a positive sign, while if they go in opposite directions, the covariance will have a negative sign. Finally, if their movements are independent, the covariance between the house size and its price will be equal to zero. However, there is just one tiny problem with covariance. It could be a number like 5 or 50, but it can also be something like 0.23456 or even over 50 million, as in our example. These values are of a completely different scale. How could one interpret such numbers? In the final part of this lesson, I will answer that question. See you there, and thanks for watching.
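The covariance steps described in this lesson can be sketched as follows. The house table itself is not reproduced in the transcript, so the sizes and prices below are hypothetical and the resulting number will not match the course's figure. `sample_covariance` is an illustrative name.

```python
def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: (x_i - mean_x) * (y_i - mean_y), summed over all houses.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return numerator / (n - 1)


# Hypothetical data: house size (square feet) and listed price (dollars).
sizes = [650, 785, 1200, 720, 975]
prices = [772_000, 998_000, 1_200_000, 800_000, 895_000]

cov_dollars = sample_covariance(sizes, prices)
# The same prices expressed in thousands of dollars.
cov_thousands = sample_covariance(sizes, [p / 1000 for p in prices])

print(cov_dollars)    # positive: size and price move in the same direction
print(cov_thousands)  # same sign, but 1000 times smaller
```

The sign is informative, while the magnitude depends entirely on the units chosen. This is the scale problem the lesson ends on.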
10. Correlation Coefficient: Correlation adjusts covariance so that the relationship between the two variables becomes easy and intuitive to interpret. The formulas for the correlation coefficient are the covariance divided by the product of the standard deviations of the two variables. This is either sample or population, depending on the data you're working with. We already have the standard deviations of the two data sets. Now we use the formula in order to find the sample correlation coefficient. Mathematically, there is no way to obtain a correlation value greater than one or less than minus one. Remember the coefficient of variation we talked about a couple of lessons ago? Well, this concept is similar. We've manipulated the strange covariance value in order to get something intuitive. Let's examine it for a bit. We got a sample correlation coefficient of 0.87, so there is a strong relationship between the two values. A correlation of one, also known as perfect positive correlation, means that the entire variability of one variable is explained by the other variable. However, logically, we know that the size determines the price. On average, the bigger the house you build, the more expensive it will be. This relationship only goes this way. Once the house is built, if for some reason it becomes more expensive, its size doesn't increase, although there is a positive correlation. Okay. A correlation of zero between two variables means that they move independently of each other, with no linear relationship between them. We would expect a correlation of zero between the price of coffee in Brazil and the price of houses in London, right? The two variables don't have anything in common. Finally, we can have a negative correlation coefficient. It can be a perfect negative correlation of minus one or, much more likely, an imperfect negative correlation with a value between minus one and zero. Think of the following businesses: a company producing ice cream and a company selling umbrellas.
Ice cream tends to be sold more when the weather is very good, and people buy umbrellas when it's rainy. Obviously, there's a negative correlation between the two, and hence, when one of the companies makes more money, the other won't. All right. Before we continue, we must note that the correlation between two variables x and y is the same as the correlation between y and x. The formula is completely symmetrical with respect to both variables. Therefore, the correlation of price and size is the same as the one of size and price. This leads us to causality. It is very important for any analyst or researcher to understand the direction of causal relationships. In the housing business, size causes the price, and not vice versa. With this, we conclude our class on statistics for business analysis and data science. Thanks for watching.
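As a rough illustration of the correlation formula from this lesson, covariance divided by the product of the two standard deviations, here is a short Python sketch. The house data is hypothetical because the course table is not in the transcript, so the result will not be the 0.87 quoted above. `sample_correlation` is an illustrative name.

```python
import statistics


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of sample standard deviations."""
    n = len(xs)
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return covariance / (statistics.stdev(xs) * statistics.stdev(ys))


# Hypothetical data: house size (square feet) and listed price (dollars).
sizes = [650, 785, 1200, 720, 975]
prices = [772_000, 998_000, 1_200_000, 800_000, 895_000]

r_size_price = sample_correlation(sizes, prices)
r_price_size = sample_correlation(prices, sizes)

print(round(r_size_price, 2))
print(abs(r_size_price - r_price_size) < 1e-12)  # symmetric in x and y
```

The symmetry check is the point to take away: the coefficient is identical whichever variable comes first, so it cannot by itself say which variable causes the other.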