2024-Lean Six Sigma GreenBelt Analyze Phase - Test of Hypothesis using Microsoft-Excel & Minitab | Dimple Sanghvi | Skillshare
2024-Lean Six Sigma GreenBelt Analyze Phase - Test of Hypothesis using Microsoft-Excel & Minitab

Dimple Sanghvi, Master Black Belt, Data Scientist, PMP

Lessons in This Class

    • 1. Data Analytics intro (3:12)
    • 2. Project Work (0:51)
    • 3. Minitab (2:16)
    • 4. what is Descriptive Statistics (4:32)
    • 5. Understand Box Plot Part 1 (5:22)
    • 6. Understand Box Plot Part 2 (7:37)
    • 7. Understand TestofHypothesis (5:27)
    • 8. Understand Types of Errors (4:49)
    • 9. Understand Types of Errors-part2 (5:57)
    • 10. Remember-the-Jingle (4:34)
    • 11. Test Selection (5:40)
    • 12. Understand 1 sample t test (6:57)
    • 13. Understand 2 sample t test example 1 (5:32)
    • 14. Understand 2 sample t test example 2 (3:14)
    • 15. Understand Paired t test (3:59)
    • 16. Understand One Sample Z test (5:16)
    • 17. Understand One Sample proportion test-1p-test (4:01)
    • 18. Understand Two Sample proportion test-2p-test (1:39)
    • 19. Two Sample proportion test-2p-test-Example (2:21)
    • 20. Using Excel = one Sample t-Test (6:51)
    • 21. Understand the Non Normal Data (15:15)
    • 22. Conclusion (2:25)

392

Students

22

Projects

About This Class

This comprehensive Data Analytics Bootcamp curriculum covers the concepts of Statistics foundation, analyzing data using Minitab

  • Learn about
  • Basics of Statistics
  • Descriptive Statistics
  • Graphical Summary
  • Distributions
  • Histogram
  • Boxplot
  • Bar Chart
  • Pie Chart
  • Test of Hypothesis
  • Types of Errors
  • One Sample T-Test
  • Two Sample T-Test
  • Paired T-Test
  • One-Way ANOVA
  • Chi-square test

Who is this class for?

Anyone who is a Lean Six Sigma student or who wants to understand and apply statistics and graphical analysis.

Key Takeaways

  • Understand how to do some basic analysis
  • Understand and apply tools required during the Measure and Analyse Phase of Six Sigma Projects
  • Which graph to use when?
  • Some common mistakes we make when we perform graphical analysis
  • Creating graphs for drawing the conclusion

Meet Your Teacher

Dimple Sanghvi

Master Black Belt, Data Scientist, PMP

Teacher

Empowering People to unleash their brilliance, and create impact | Consultant | Independent Director on Corporate Board, NSE & BSE | Lean Six Sigma Master BlackBelt | Leadership Coach & Mentor | Specializing in AI, ML, Data Science Coaching | Pet Lover

Let's connect on LinkedIn for professional growth and networking opportunities https://www.linkedin.com/in/dimplesanghvi/
Talks about #chatgpt, #dataanalytics, #coachingbusiness, #storytellingwithdata, and #leansixsigmablackbelt

Join my Telegram channel to embark on the journey of Lean Six Sigma and Storytelling, where I'll share my expertise on data-driven insights, process optimization, predictive analytics, AI, ML, data science, and even ChatGPT. My commitment is to help others achieve results by sharing my kn...

Level: All Levels

Transcripts

1. Data Analytics intro: Hello friends. Let's get started on this training program on data analytics using Minitab. What are you going to learn in this course? The skills that you will learn are some basics of statistics. We will be covering descriptive statistics, graphical summary, distributions, histograms, box plots, bar charts, and pie charts. I'm going to set up a new series on test of hypothesis, which I will share as a link in the last video. But let's first understand all the different types of graphical analysis. Who should attend this class? Anyone who is a student of Lean Six Sigma, who wants to get certified as a Green Belt or Black Belt, or who wants to apply statistics and graphical analysis in their place of work. Even if you are an entrepreneur or a student and you want to understand statistics using Minitab, I'm going to cover all of it. We're going to learn what mistakes commonly happen when we are analyzing, because when we do analysis using simple theory-based data points, everything appears to be normal. So I'm going to show you some traps in which our analysis will fail and how you should avoid those traps. What will you take away from this program? You will understand how to do some basic analysis. You'll understand the tools that are required during your Measure phase, like capability calculations and so on, and the tools we will use during the Analyze phase. If possible, I will cover test of hypothesis; otherwise, if the video gets too long, I will put it in a separate series. I will also cover which graph to use when, some common mistakes we make when we perform graphical analysis, creating graphs, and how to derive insights and conclusions from those graphs. This will really help you in understanding this program well. Let's see what Minitab is.
Minitab is a statistical software package, and it is available in multiple versions. So I open a new project. My Minitab screen looks something like this. I have a navigator on the left side. I have my output screen on the top, and I have my data sheet, which is very much like an Excel sheet, which I can work with. I can keep adding these sheets and have lots of data. I can do a lot of analysis using my menu options. We're going to cover basic statistics and regression, and we'll be covering lots of graphs using different types of data. So if you are interested in knowing these things, you should definitely enroll and watch my videos. Thank you so much.

2. Project Work: Let us understand the project work that we're going to do in this data analytics program using Minitab. As I told you, we are going to work with Minitab, and this is the Minitab that I will be using. I will also be sharing with you your project data sheet, where I have multiple examples and where we are doing calculations on capability. We will try to see distributions, and you can see that there are various tabs: Example 1, Example 2, Example 3. We'll try to do some trend analysis. We will try to see Pareto charts. Lots of data has been shared with you, which will give you hands-on experience of working with data. So let's get started.

3. Minitab: In this class, we're going to learn about hypothesis testing. I'm going to teach you hypothesis testing using Minitab. I'm also going to teach you hypothesis testing using Microsoft Office, that is, using Excel. For those who are interested in going for Minitab, let me show you where you can download it: minitab.com, under Downloads. Here we come to the download section. You have Minitab Statistical Software, and it is available for 30 days for free.
I have also downloaded the trial version on my system, done the analysis, and shown it to you. Remember, it is available for 30 days only, so please ensure that you complete the entire training program within the first 30 days. When you feel the value in this, you should definitely go ahead and buy the licensed version of Minitab, which is available over here. I just have to click on Download and the download will start. It starts with a free 30-day trial, and that's enough time for you to practice all the exercises which are given. It will ask you for some personal information so that they can be in touch with you and help you with some discounts, if there are any. You have a section called Talk to Minitab, or you have a phone number. If you're calling from the UK, it will be easy for you to call over there. But if you're calling from other places, Talk to Minitab is a much easier option. This is a very good statistical tool and they keep upgrading the features regularly, so personally I feel that this investment will be worth it. But those who cannot afford the license can use Microsoft Office, where some of the features, not all, are available. So initially I will show you the entire exercise of different types of hypothesis tests using Minitab, and then we will move into Microsoft Excel. Stay connected and keep learning.

4. what is Descriptive Statistics: In today's session, we are going to learn about descriptive statistics. Descriptive statistics means I want to understand measures of center, like the mean, median, and mode. I want to understand measures of spread, which are nothing but the range, standard deviation, and variance. Let's take a simple dataset that I have. I have cycle time in minutes for about 100 data points. I'm going to take the cycle time in minutes from my project data sheet. I'll go to Minitab and paste my data here, where I want to do some descriptive statistics.
I click on Stat, click on Basic Statistics, and select Display Descriptive Statistics. When I do this, it gives me a pop-up window which shows me the available data fields that I have. I have cycle time in minutes, so I am telling it that I want to analyze the variable cycle time in minutes. I'll just click on OK, and immediately you will find the result in my output window. I can just pull this down. In my output window, it is showing me that it has done some statistical analysis for the variable cycle time in minutes. I have 100 data points over here. The number of missing values is 0. The mean is 10.064. The standard error of the mean is 0.103, and the standard deviation is about 1.03. The minimum value is 7.5. Q1, which is nothing but your first quartile, is 9.1. The median, that is, your Q2, is 10.35, Q3 is 10.868, and the maximum value is 12.490. If I need more statistical analysis, I can go ahead and repeat this analysis. This time, I'm going to click on Statistics, and I can look at the other statistics that I need. Suppose I need the range, I don't need the standard error, and I need the interquartile range. I want to identify the mode. I want to identify the skewness in my data and the kurtosis in my data. I can select all of them and click on OK. When I do this, all the other statistical parameters that I have selected will come up in my output window. This is my output window. It again shows me the additional statistics that I selected. Variance is nothing but your standard deviation squared; it is 1.0541. It is telling me the range, that is, maximum minus minimum; it is 4.95. The interquartile range is 1.707. There is no mode in my data, and the number of data points at the mode is 0 because there is no mode. The data is not skewed: the value is very close to 0, it is 0.05. But there is kurtosis, which means my data is not appearing as a perfect normal curve. So would we like to see how my distribution looks? Let's do that.
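For readers following along without Minitab, the same kind of summary can be approximated in Python. This is a minimal sketch, not Minitab's implementation: the `describe` helper and the sample values are hypothetical (the course data sheet is not reproduced here), and the quartiles use the standard library's "exclusive" method, which is assumed here to be the closest match to the quartiles Minitab reports.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers.

    Hypothetical helper for illustration; it is not part of Minitab.
    """
    n = len(values)
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    # Quartiles at (n + 1)-based positions; assumed to approximate Minitab's Q1 and Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    # Adjusted sample skewness; a value near 0 suggests a symmetric distribution.
    skewness = n / ((n - 1) * (n - 2)) * sum(((x - mean) / stdev) ** 3 for x in values)
    return {
        "n": n,
        "mean": mean,
        "se_mean": stdev / n ** 0.5,               # standard error of the mean
        "stdev": stdev,
        "variance": statistics.variance(values),   # standard deviation squared
        "minimum": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "maximum": max(values),
        "range": max(values) - min(values),        # maximum minus minimum
        "iqr": q3 - q1,                            # interquartile range
        "skewness": skewness,
    }


# Made-up cycle times in minutes, NOT the course data.
cycle_times = [8.0, 9.0, 9.5, 10.0, 10.5, 11.0, 12.0]
for name, value in describe(cycle_times).items():
    print(f"{name}: {value:.3f}")
```

The standard error line also explains the numbers quoted in the lesson: with 100 data points, a standard error of 0.103 corresponds to a standard deviation of about 0.103 × 10 = 1.03.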
I click on Stat, I click on Basic Statistics, and I click on Graphical Summary. I'm selecting cycle time in minutes, and I'm saying I want to see the 95% confidence interval. I click on OK. Let's see the output: the summary of the cycle time in minutes. All the statistics are being displayed on the right-hand side: mean, standard deviation, variance, skewness, kurtosis, number of data points, minimum, first quartile, median, third quartile, and maximum. The values which you see as minimum, Q1, median, Q3, and maximum will be covered in the boxplot. The boxplot is framed using these values. And when you look at the bell curve, you can see that the bell is not a steep curve, it is a little fatter curve, and hence the kurtosis value is negative. We will continue our learning in more detail in the next video. Thank you.

5. Understand Box Plot Part 1: In this lesson, we are going to learn more about the boxplot. A boxplot is one of the graphical techniques which helps us to identify outliers. Let us understand how a boxplot gets formed. Let's understand the concept first before we get into the practicals. A boxplot is called a boxplot because it looks like a box, and it has whiskers like the ones a cat has on its face. Now, just as the cat cannot have endless whiskers, the size of the whisker of the box plot is decided by certain parameters. You will see some important terminology when you're forming a boxplot. What is the minimum value? What is quartile one? What is the median? What is quartile three? What is the maximum size of the whisker? And what is the maximum value in the data? Here, the minimum talks about the minimum point to which the whisker can be extended. Q1 stands for first quartile, which means 25% of the data. Let's assume, for ease, that we have 100 data points. Then 25 percent of the data will be below this Q1 mark. Between Q1 and Q2, another 25 percent of your data will be present. Q2 is also called the median, or the center of your data. So if I arrange my data in ascending or descending order, the middle data point is called the median, and it is called Q2. Q3, otherwise also called the upper quartile, talks about the 25 percent of the data after the median. So technically, by now 75 percent of your data is below your third quartile: 25 percent of the data is below Q1, 50 percent is below Q2, and 75 percent is below Q3. It means that out of 100% of the data, 25 percent of my data points will be above Q3. Now the distance between Q1 and Q3 is called the box size, and this box size is also called the interquartile range: Q3 minus Q1 is the interquartile range. As I told you at the beginning of the class, the size of the whisker depends upon the interquartile range, or IQR. From Q3, I can extend this line to 1.5 times the size of the box. So Q3 plus 1.5 times the IQR will be the upper limit for my whisker on the right side, the upper side. If I want to draw the whisker on the left side, it is nothing but the same 1.5 times the interquartile range, but I subtract this value from Q1 and extend the whisker to that value. So that sets up the lower limit. You might have data points which fall below the minimum point, and you might have data points which fall beyond the maximum size of the whisker. These data points are called outliers. The beauty of the boxplot is that it will help you to identify whether there are any outliers in your dataset. Let's see how I can construct a boxplot. I don't have to worry about finding the 25, 50, and 75 percent marks by hand; we will go to Minitab and do the work there. So let's see this data sheet. In our previous class, we did some descriptive statistics on this data and we found the summary values.
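The whisker rule described in this lesson can be sketched as a short calculation. The function names below are hypothetical, the quartiles 9.16 and 10.86 are the ones quoted for the cycle-time data in this lesson, and the second list is made-up data with one extreme value. The "exclusive" quartile method is an assumption; other tools may interpolate quartiles slightly differently.

```python
import statistics


def whisker_limits(q1, q3):
    """Return the (lower, upper) limits a boxplot whisker may extend to."""
    iqr = q3 - q1  # interquartile range, i.e. the size of the box
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Return the data points that fall outside the whisker limits."""
    # Quartile method is an assumption made for this sketch.
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    lower, upper = whisker_limits(q1, q3)
    return [x for x in values if x < lower or x > upper]


# Quartiles quoted for the cycle-time data in this lesson.
lower, upper = whisker_limits(9.16, 10.86)
print(f"whiskers may reach from {lower:.2f} to {upper:.2f}")

# Made-up data with one extreme value, for illustration only.
sample = [21, 25, 27, 30, 31, 33, 36, 100]
print(find_outliers(sample))
```

With the lesson's quartiles the limits come out at roughly 6.61 and 13.41, so the reported minimum (7.5) and maximum (12.49) both sit inside the whiskers, which is consistent with a boxplot that shows no outliers.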
We found the minimum, Q1, Q2, Q3, and the maximum. Let's try to construct a boxplot for the cycle time in minutes. I will click on Graph, go to Boxplot, select a simple boxplot, and click on OK. I'm going to select cycle time in minutes and click OK. Let's see the data view. If you look at this boxplot, the bottom line of the box is Q1; it is 9.16. The median is the middle line, and it need not be exactly in the center. The top of the box is Q3, which is 10.86 in this dataset, and the interquartile range is 1.7. My whisker can extend 1.5 times 1.7 above the box, and it can go 1.5 times 1.7 below the box. And you can see that there are no asterisk marks in this boxplot, very clearly indicating that there are no outliers in my current dataset. Let's pick up some more datasets in our next video to understand how box plots work.

6. Understand Box Plot Part 2: Let us continue our journey of understanding boxplots in more detail. Go to the sheet in your project file which is called boxplot. I have collected cycle time data for five different scenarios. As you can see, in some places I have more data points, about 45, and in some places I have only 14 data points. So let's analyze this in more detail to understand how the boxplot works. I have copied this data into Minitab: Case 1, Case 2, Case 3, Case 4, and so on. The first thing I want to do is some basic descriptive statistics for all the cases. I'm selecting all of them and clicking OK. When I see my output, I can see that in three of the cases I have 45 data points. In the fourth case, I have 18 data points. In the fifth case, I have 14 data points. So the number of data points varies. If you look at my minimum values, they are 1, 1, 21, and 22, and the maximum value is somewhere between 40 and 100. In one scenario I have values from 21 to 40.
In one scenario I have values from 2 to 90, which very clearly shows that the number of data points may be the same but my range of values is wide. If you look at the range, it goes from 18.8 to 99. In case two, my values go from 1 to 100, so the range is 99. The same can also be observed in the standard deviation. You can see that the skewness of the data is different and the kurtosis is different. Let's first understand the box plot in detail. In the next video, when I'm talking about the histogram, we will understand the distribution pattern using the same dataset. Let's get started. I click on Graph, I click on Boxplot, and I click on Simple. What I can do is take up one case at a time to analyze my data. So, case one: it shows me a box plot, and this boxplot very clearly shows that there is no outlier in my data. When I keep my cursor over here, I have 45 data points. My whiskers range from 21.6 to 40.4, and my interquartile range is 5.95. My median is 30.3. My first quartile is 26.9. My third quartile is 32.85. Let's redo this for case two. When I do case two, if you look now, the box looks very small. Here my number of data points is the same, 45. My whiskers again range from 21.6 to 40.4, the same as my previous scenario. But I have outliers over here, which are far beyond. If you remember the descriptive statistics for case two, my minimum value is 1 and my maximum value is 100. My median is the same as in my previous scenario. My Q1 is also similar, not the same, but similar, and Q3 is also similar. But when you look at the box plot, the box is very small, very clearly indicating that my interquartile range is 6.95, my whiskers can only go 1.5 times that, and any data point beyond the whisker will be called an outlier. I can select these outliers, and it very clearly says: case two, the value is 100, and it is in row number 1. In row number 37, I have a value of 90.
In row number 30, I have a value of 88. And in row number 21, I have a value of 1, which is the minimum. So I have outliers on both sides. Let's understand case three. When I look at case three and put my cursor on the boxplot, I have the same 45 data points. My whiskers are from 21.6 to 40.4, the same as in case one and case two. But in this scenario, I have a lot of outliers on the lower end, that is, below my bottom quartile. It is easy for us to click on each one of them and see what the values are. Now, in case four, the beauty is that I have only 18 data points, but I still have an outlier. Let's do it for case five and understand that as well. I have a smaller box. I have only 14 data points, and I have an outlier on the upper end and an outlier on the lower end. Here the value is 23. But seeing these plots separately makes it difficult for me to compare them. Can I get everything on one screen? So I go to Graph, I go to Boxplot, and I select Simple. I'm selecting all the cases together, and under Multiple Graphs I'm saying the scale should be the same, the axes should be the same, and the grid lines should be the same. And I click on OK. I'm getting all five cases in one graph. This will make it easy for me to do the analysis. When I saw case one individually, it was showing as a big box. But when I'm comparing the cases next to each other, I can see that in case two I have outliers on the top and the bottom. In case three, I have outliers on the bottom side. In case four, I have outliers on the top side. In case five, I have outliers on both sides. The number of data points is different, yet the box still gets drawn. The size of the box cannot be determined by the number of data points. I have 45 data points, but my box is very narrow. And I have 14 data points and my box is wide.
So if I have 14 data points, the boxplot is going to divide my data into four parts: roughly three data points below Q1, three data points between Q1 and Q2, three data points between Q2 and Q3, and three data points beyond Q3. Whereas when I had 45 data points, they were getting distributed as about 11 in each part, and my median would be the middle number. So the learning from this exercise is that by looking at the size of the box, you cannot determine the number of data points. What you can definitely determine is whether, in your dataset, you have data points which are extremely high or low. So the purpose of drawing a box plot is to see the distribution and identify outliers, if any. I hope the concept is clear. If you have any queries, you are free to put them up in the discussion group, and I'll be happy to answer them. Thank you.

7. Understand TestofHypothesis: Hello friends. Let us continue our journey on Minitab data analysis. Today we are going to learn about hypothesis testing. You might have heard that we do hypothesis testing during the Analyze and Improve phases of our project. To understand how a hypothesis test works, let us look at a simple scenario. I will come back to this graph again and explain what it is. As you know, the justice system in a court of law can be used to explain the concept of hypothesis testing. The judge always starts with a statement which says the person is assumed to be innocent until proven guilty. This is nothing but your null hypothesis, the status quo. When the court case goes on, the lawyers try to produce data and evidence. And unless and until we have strong data and strong evidence, the person stays in the status of being innocent. So the opposing lawyer is always trying to say that this person is guilty and that he has data and evidence to prove it. He is trying to work on the alternate hypothesis. And the judge says, I'm going with the status quo of the null hypothesis by default.
Let me explain it to you in an easier way. You and I are not taken to the court of law because, by default, we are all innocent; that is the status quo. Who gets pulled to the court of law? People who have a chance of having committed some crime. It could be anything. In the same way, what do we try to do hypothesis testing on when I'm doing the Analyze phase of the project? I have multiple causes which might be contributing to my project Y. We do a root cause analysis and we get to know that maybe the shipment got delayed, maybe the machine is a problem, maybe the measurement system is a problem, or maybe the raw material is not of good quality. We have multiple possible reasons. Now I want to prove it using data, and that is the place where I try to use hypothesis testing. All processes have variation. We know that processes follow the bell curve. We are never exactly at the center; there is some bit of variation in every process. Now, the sample which you have collected: is it a random sample coming from the same bell curve, or is it a sample that's coming from an entirely different bell curve? Hypothesis testing will help you in analyzing exactly that. Whenever we set up a hypothesis test, we have two types of hypothesis. As I told you, the status quo or default hypothesis is your null hypothesis. By default, we assume that the null hypothesis is true. So to reject the null hypothesis, we need to produce evidence. The alternate hypothesis is the place where there is a difference, and this is the reason why the hypothesis test has actually been initiated. We will understand this with lots of examples, so stay connected. When I'm framing the null and alternate hypotheses, let's say I am saying that mu, which is nothing but my population average, is equal to some value. Always remember, your alternate hypothesis is mutually exclusive.
If mu is equal to some value, the alternate hypothesis would say mu is not equal to that value. I may say mu is less than or equal to some value as a null hypothesis. For example, if I'm selling Domino's Pizza, I say my average delivery time is less than or equal to 30 minutes. The customer comes and tells me, no, the average delivery time is more than 30 minutes; that becomes my alternate. Sometimes the null hypothesis is that mu is greater than or equal to some value. For example, my average quality is greater than or equal to 90%. Then the customer comes back and tells me, no, your average quality is less than that percentage. So always remember that the null hypothesis and the alternate hypothesis are mutually exclusive and complementary to each other. We will take up many more examples as we go further.

8. Understand Types of Errors: Let us understand some more examples of null and alternate hypotheses. Suppose my project is about adherence to schedule, and my null hypothesis is a fixed value. I would say my current average adherence to schedule is 70%. The alternate hypothesis would mean that it is not 70%. Suppose I'm thinking about the moisture content of a product. I'm in a manufacturing setup, and 5% is what is acceptable to my customer, so I can say my moisture content is less than or equal to 5 percent. Then the alternate hypothesis would claim that the moisture content is greater than 5 percent. The case where the mean is greater than the null hypothesis value is one we are not interested in for that problem. Let's understand it further. The question was: did a recent change to the small business loan approval process reduce the average cycle time for processing the loan? The answer could be no, meaning the cycle time did not change. Or the manager may say that yes, the mean cycle time is lower than 7.5.
So the status quo is that it is equal to 7.514 minutes. And the alternate says, no, it is less than 7.514 minutes, or days, or whatever unit of measurement we are using. So by default, your status quo is your null hypothesis, and the statement that you want to prove is your alternate hypothesis. Now, there could be some sort of errors when we make decisions. So let's go back to our court case. Let me take up my laser beam. In reality, the defendant is not guilty, and the verdict also comes that the person is not guilty. It's a good decision. So yes, we have made a very good decision that the person is innocent. In reality, the defendant is guilty, and the verdict also comes that he's guilty. The decision is a good decision. But what happens if, in reality, the person is not guilty, but the verdict comes that he's guilty and an innocent person gets convicted? It's an error. It's a very big error. An innocent person is given a sentence and put in jail or given a penalty; that's an error. The error can also happen on the other side, where in reality the person is guilty, but the verdict comes that he's not guilty. A guilty person is declared innocent and he's set free. This is also an error. But which is the bigger error? You can write down in the comment box what you think. Is error A the bigger error, or is error B the bigger error? Is an innocent person getting convicted the bigger error, or is a guilty person moving freely on the roads the bigger error? I hope you have already written your comments. The reality is that this one becomes my bigger error, and it is called a type one error. Because if an innocent person is convicted, we cannot give back the time that he has lost. He would go through a lot of emotional trauma.
If a guilty person is declared innocent, we can take him to the higher court and the Supreme Court and get it proved that yes, he's guilty. So I can still get the decision that the person should be convicted, declared guilty, and punished. This error is called a type two error. If somebody asks you which error is the bigger error, it is the type one error. It is also called an alpha error, and the other one is called a beta error. Let's continue in our next class.

9. Understand Types of Errors-part2: Let us understand the types of errors once again. As we know, if the person is not guilty, that is, the person is innocent, and the verdict also says that the person is not guilty, it's a good decision. If the person is guilty and the verdict is that he's guilty, the decision is again a good decision. The convict has to be sentenced and should be punished. The problem happens when an innocent person is proved guilty and he suffers. The second type of problem happens when a guilty person, a criminal, is declared innocent and is set free. The first is called a type one error: an innocent person getting convicted or punished is a type one error. It is also called an alpha error. A guilty person, a criminal, being set free is called a type two error or a beta error, which is also an error that we want to avoid. The level of significance is set by the alpha value: how confident do you want to be in making the right decision? A type one error happens when the null is true, but we reject it. A type two error happens when, in reality, the null is false, but we fail to reject it. Now how does this help us in a process? Let us understand this with an everyday example. Let's understand this in more detail. This is the actual scenario. Let's write the actual on the top, and this side is the judgment. Now, let's think about the process. The null is that the process has not changed.
That is the null: it has not changed. Now, the alternate will be that the process has changed. The judgment is either "not improved" or "the process has improved". Okay. Now I'm going to ask you a very important question. If a process has not changed and the judgment is that there is no change, this is the correct decision. If the process has changed and the judgment is also that the process has improved, that's also a correct decision. Now, imagine the process has not changed, but we declare that we have an improved process and an improved product, and I inform the customer. Is it correct? No, it's an error. And this is called a type one error, because the same old product is sold to the customer as a new product. Can you understand what will happen to the reputation of the company? The same old product is sold to the customer as a new and improved product. So what will happen to the reputation of the company? It will go for a toss, and hence we say this is not a good decision. Now understand the other case: the process has changed, the process has improved, but the judgment comes as not improved. This is also an error, I don't deny it. This is called a type two error, or it is also called a beta error. What happens here is that we are not communicating to the customer that the improvement has happened, right? So we are keeping the improved product in the warehouse. Now, this is also not correct, but the bigger error is the first one, where we have actually not made an improvement, but I'm informing the customer that the product has improved.

10. Remember-the-Jingle: When we do a test of hypothesis, there are always two hypotheses. One is the default hypothesis, which is the null hypothesis, and the second is the alternate hypothesis, which is what you want to prove. And that's the reason you are doing the hypothesis test. The reason we do a hypothesis test is that we never have access to the whole population.
So when we collect a sample, we want to understand: is the sample coming from the bell curve, the distribution we are studying, and is whatever variation we see just the natural property of the dataset? Sometimes your sample could be at the far end of the bell curve. And that is where we get confused: does this data belong to the original bell curve, or does it belong to a second, alternate bell curve? We will be doing exercises which will make this easier to understand. When you run a hypothesis test, apart from the test statistic results, you also get the p-value. We always compare the p-value with the alpha value that we have set. Suppose you want to be 95% confident; then you set the alpha value as 5%. And if you set the confidence level at 90%, then your alpha value is 10 percent, or 0.10. The reason we use a p-value is this: if you look at this bell curve, the most likely observations are part of the center of the bell, and the very unlikely observations come from the tail. The p-value, the green region, helps you tell whether the sample belongs to the original bell curve or to the alternate bell curve, which is what you are trying to prove through the alternate hypothesis.

To remember this easily, remember the jingle: P low, null go. It means that if the p-value is less than the alpha value, I'm going to reject the null hypothesis. P high, null fly. If the p-value is more than the alpha value, we fail to reject the null hypothesis, concluding that we do not have enough statistical evidence for the alternate hypothesis. We will be doing a lot of exercises, and I will be singing this jingle multiple times so that it's easy for you to remember. P low, null go; P high, null fly. Some of the participants in my workshops get confused; they will ask, what does "null go" mean?
The other thing which I tell them, to make it easy to remember, is F for fly and F for fail. So if P is high, the null will fly. It means you're failing to reject the null hypothesis. The null hypothesis stays, and the alternate hypothesis is not supported. Remember one more thing, which is often asked in interviews. Say the p-value is 0.123. Would you reject the null hypothesis, accept the null hypothesis, or accept the alternate hypothesis? As statisticians, we never accept any hypothesis. Either we reject the null hypothesis or we fail to reject the null hypothesis. We always state it from the point of view of the null, because the default status quo is the null hypothesis. If P is high, we do not say we accept the null hypothesis; we say we fail to reject the null hypothesis. If P is low, we do not say we accept the alternate; we say we reject the null hypothesis, concluding that there is enough statistical evidence that the data is coming from the alternate bell curve. We will continue with a lot of exercises, and this will give you confidence about how to practice, interpret, and use inferential statistics in your analysis.

11. Test Selection: One of the most common questions my participants ask when I'm mentoring a project is: which hypothesis test should I use, and when? So this is a simple guide which will help you understand which test you should be using. It is just like when a patient goes to a doctor: the doctor does not prescribe all the tests. He prescribes the appropriate test based on the problem that the patient is facing. If the patient says, I met with an accident, the doctor would say, I think you should get your X-ray done. He would not ask him to go for a COVID test or RT-PCR test. If the person is coughing and suffering from fever, then RT-PCR is suggested.
And at that point, the X-ray is not suggested. In a similar way, when we do hypothesis testing, say we are trying to compare a sample with the population, we want to understand which test we should be performing. If I'm testing for means, that is, averages, and I compare the mean of a sample with an expected value, so I'm comparing the sample with my population, then I go for my one-sample t-test. I have only one sample that I'm comparing. I want to check if the average sales is equal to x amount, which is the expected value. So we were expecting the sales to be, let's say, 5 million. My average is coming to, say, 4.8. Have I met the target or not? Then I can go and do a one-sample t-test.

Next, compare the means of two different samples. Say I have two independent groups. Let's say I'm conducting a training online and I'm conducting a training offline. It is the same training. I have a set of students who are attending my online program, and I have a different set of students who are attending my program offline. I want to compare the effectiveness of the training. So I have two samples, and these are two independent samples because the participants are different. Then I go for a two-sample t-test.

Now suppose people come for my training. I do an assessment before my training program about their understanding of Lean Six Sigma. Then I conduct the training program, and the same set of participants take the test after the training program. So the participants are the same, but the change which has happened is the training which was imparted to them. I have the test results before the training and the test results after the training, and I want to check whether the training is effective. Then I go for a paired t-test. Progressing further.
Suppose I am testing for frequency. I have discrete data and I want to test the frequency, because with discrete data I do not have averages; I take frequencies. So when I'm comparing the count of some variable in a sample to the expected distribution, just like I had the one-sample t-test, the equivalent for discrete data is my chi-square goodness of fit. By default I expect a particular value, an expected value, and I'm checking how far my data is from it. So I go for a chi-square goodness of fit. This test is available in Minitab; in Excel, it is not available. So I will be creating a template and giving it to you, which will make it easy for you to do all three types of chi-square test using the Excel template. If I have to compare the counts of some variable between two samples, it will be the chi-square homogeneity test. If I'm checking a single sample to see if the discrete variables are independent, I do the chi-square independence test.

If I have a proportion in the data, like good versus bad applications, or accepted versus rejected, and I am saying that, okay, 50% of the applications get accepted, or 25 percent of the people get placed, I have a proportion which I want to test. If I have only one sample, I go for the one-proportion test. If I want to compare the proportion of commerce graduates versus science graduates, or the proportion of finance MBA people with marketing MBA people, I have two different samples, so I can go for the two-proportion test.

So to summarize: am I testing for averages, am I testing for frequencies, as with discrete data, or am I testing for proportions? Depending upon that, you pick the appropriate test and work on it. We're going to practice all of it using Minitab and using Excel. The dataset is available in the description section.
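The selection guide summarized above can be written down as a small lookup table. This is only a sketch: the function name `choose_test` and the scenario labels are made up for illustration, while the test names are the ones mentioned in the lesson.

```python
# Hypothetical lookup table summarizing the lesson's test-selection guide.
# The keys (what is being tested, the sampling scenario) are labels invented
# for this sketch; the values are the test names mentioned in the lesson.
TEST_GUIDE = {
    ("means", "one sample vs expected value"): "one-sample t-test",
    ("means", "two independent samples"): "two-sample t-test",
    ("means", "same subjects before and after"): "paired t-test",
    ("frequencies", "one sample vs expected distribution"): "chi-square goodness of fit",
    ("frequencies", "counts compared between two samples"): "chi-square homogeneity test",
    ("frequencies", "discrete variables within one sample"): "chi-square independence test",
    ("proportions", "one sample"): "one-proportion test",
    ("proportions", "two samples"): "two-proportion test",
}


def choose_test(testing_for: str, scenario: str) -> str:
    """Return the test the lesson suggests for the given situation."""
    key = (testing_for, scenario)
    if key not in TEST_GUIDE:
        known = "; ".join(f"{t} / {s}" for t, s in TEST_GUIDE)
        raise ValueError(f"Unknown combination {key!r}. Known combinations: {known}")
    return TEST_GUIDE[key]


if __name__ == "__main__":
    # Same participants assessed before and after a training program.
    print(choose_test("means", "same subjects before and after"))
```

The table only encodes the first question to ask (averages, frequencies, or proportions) and the sampling scenario; it does not check any assumptions of the tests themselves.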
In the project section, I invite you all to practice it and post your projects and your analysis. If you have any doubts, you can put them in the discussion section and I'll be happy to answer them. Happy learning.

12. Understand 1 sample t test: Let us understand which hypothesis test to use. In Minitab, you have an Assistant which can help you make that decision. If you go to Assistant, hypothesis testing, it will help you identify the test based on the number of samples that you have. If you have one sample, you might be doing a one-sample t-test, one-sample standard deviation, one-sample percent defective, or chi-square goodness of fit. If you have two samples, then you have the two-sample t-test for different samples, the paired t-test if the before and after items are the same, two-sample standard deviation, two-sample percent defective, and chi-square test of association. If you have more than two samples, then we have one-way ANOVA, the standard deviations test, chi-square percent defective, and chi-square test of association. We will be practicing all of it with lots of examples.

So let's come to the first example. We have the AHT, the average handle time, of calls in minutes. We have taken a sample of 33 data points. The average is seven, the minimum value is four minutes, and the maximum value is ten minutes. The reason we have to do a hypothesis test is that the manager of the process says that his team is able to reach a resolution on the call in seven minutes, and the process average is also seven minutes, with a minimum of four minutes. But the customer says that the agents keep them on hold and it takes more than seven minutes on the call. So now I want to statistically validate whether that is correct or not. Whenever we are setting up a hypothesis test, we have to follow the five- or six-step approach. Step number one, define the alternate hypothesis. Define the null hypothesis, which is nothing but your status quo. What is the level of significance, or your alpha value?
If nothing is specified, we set the alpha value as five percent. We first set the alternate hypothesis. So in our case, what is the customer saying? The customer says that the average handle time is more than seven minutes. The status quo, the SLA agreed, is that the AHT should be less than or equal to seven minutes. As I told you, the null and the alternate hypotheses will be mutually exclusive and complementary to each other. Now, identify the test to be performed. How many samples do I have? I have only one sample, the AHT of the contact center. So I am going to pick the one-sample t-test. Okay? Now I need to compute the test statistic and identify the p-value. If you remember the previous lesson, we said if the p-value is less than the alpha value, we reject the null hypothesis. If the p-value is greater than five percent, the alpha value, we fail to reject the null hypothesis. Let us put this understanding into practice.

So if you remember, we have our project data. In the project data, we have the test of hypothesis sheet. Over here, I have given you the AHT of calls in minutes. I have copied this data into Minitab. So let's do it in two ways. First I'm going to show it to you using the Assistant. Second, I will show it to you using the Stat menu. So I go to Assistant, hypothesis testing. What is the objective I want to achieve? I have one sample. Is it about the mean? Is it about the standard deviation? Is it about percent defective or discrete numbers? We're talking about the average handle time, so I'm going to take the one-sample t-test. For the data column, I have selected this one. What is my target value? My target value is seven. The alternate hypothesis is that the mean AHT of calls in minutes is greater than seven. This is what the customer is complaining about. The alpha value is 0.05 by default. I click on OK. Let's see the output. To see the output, you click on View and Output Only. If you see the p-value, the p-value is 0.278.
You remember: P low, null go; P high, null fly. Is this value of 0.278 greater than the alpha value of 0.05? Yes, it is. Hence, I can conclude that the mean AHT of calls is not significantly greater than the target. Whatever you are seeing as greater than the target is only by chance. So there is not enough evidence to conclude that the mean is greater than seven at the five percent level of significance. And it also shows me how the pattern is. There are no unusual data points. Because the sample size is at least 20, normality is not an issue. The test is accurate, and it is fine to conclude that the average handle time is not significantly greater than seven minutes. I can go ahead and reject the claim made by the customer. The few calls that we see with high values could be only by chance.

I can also do the same test by clicking on Stat, Basic Statistics, and selecting the one-sample t-test, with one or more samples, each in one column. I select AHT. I want to perform a hypothesis test. The hypothesized mean is seven. I go to Options and define the alternate hypothesis: the actual mean is greater than the hypothesized mean. Click on OK. If I need graphs, I can add these graphs. Click on OK, and click on OK. I get this output. So here are the descriptive statistics: this is the mean, this is the standard deviation, and so on. The null hypothesis is mu is equal to seven. The alternate hypothesis is mu is greater than seven. The p-value is 0.278. P high, null fly: we fail to reject the null hypothesis, concluding that the average handle time is around seven minutes. Let's continue. We received our output, we saw all of this, and we have concluded that the average handle time is not significantly greater than seven minutes.

13. Understand 2 sample t test example 1: Let's do one more example, with two teams, two samples. So in this example, there are two teams whose performance needs to be measured.
The manager of team B claimed that his team is a better performing team than team A. The manager of team A argues that this claim is invalid. Let's go to our dataset. If you go to the project file, you will have something called team A and team B. So let me just copy that data. Okay. Let me go here and paste the data on the right side. What I can also do is take a new sheet and paste the data. Right? So let's come to Assistant, hypothesis testing, two-sample t-test. Let me delete this value and select team A and team B. I could say team A is different from team B. I can also go by the claim, which is that the manager of team B says his team is better than A, so I can say A is less than team B. And I click on OK. Again, in this example, I get an output which says that team A is not significantly less than team B. You have the values of 27.7 and 27.3. There is no statistical difference between the two teams, right? So both the examples we have done came out like that.

So let's go and see one more example. I have taken the cycle time of process A and the cycle time of process B. So let's just copy this data. This is another dataset. What's my alternate hypothesis? Both the teams are different. What is the null hypothesis? Both the teams are the same. Because these are two different teams, I will go ahead and do my two-sample t-test. The data of each team is in a separate column. I'm saying A is different from B, the alpha value is 5%, and then I click on OK. Now, if you see the output this time, it says that yes, the cycle time of A is significantly different from the cycle time of B. The means here are 26.8 and 27.6. And if I look at the distribution, this red interval is not overlapping with that red interval. So there is a difference in the cycle time of the two teams. If I have to do the same thing using Stat, Basic Statistics, two-sample t-test: cycle time of process A and cycle time of process B; in Options, they are different. I can have my graphs. I don't want an individual value plot.
I will only take the boxplot and say OK. Mu 1 is the population mean of cycle time of process A, and mu 2 is that of process B. Now if you see, there are the standard deviations, and there is the difference. The p-value is 0, telling us that, yes, there is a significant difference between the two teams. P low, null go. So here we are rejecting the null hypothesis, saying that there is a significant difference between A and B. Right? I can see the same thing in the distributions: there is a wider distribution over here and a narrower distribution over there. I can do the graphical summary that we learned earlier and then see how each team is performing. So this is the summary of team A. The mean is 26, the standard deviation is 1.5. And if I scroll down, I get the one for team B, and it comes out this way. Now I want to compare these graphs, so I click on Graph and Histogram, choose the one with fit, and say OK. I will select these two columns, in separate panels of the same graph, with the same Y and the same X. Click on OK. Click on OK. Can you see that the bell curves of the two are different? Let's do an overlapping histogram: in Multiple Graphs, overlaid on the same graph. Can you see that between the blue and the red there is a difference? And hence, yes, the kurtosis is different, the skew is different, and that matches the conclusion of my two-sample t-test, which says that there is a statistically significant difference between the cycle time of process A and the cycle time of process B. We will learn about the paired t-test in an upcoming example.

14. Understand 2 sample t test example 2: Let's come to our example two. There are two centers whose performance needs to be measured. The manager of center A claimed that his team is a better performing team than center B. The manager of center B argues that the claim is invalid. Again, I will follow my five-step process. What is the alternate hypothesis? A is better than B.
Let's make it easier: it is not equal. Team A is not equal to team B, or rather center A is not equal to center B. What is the null hypothesis? Center A is equal to center B. Level of significance: five percent. How many samples do I have? I have two samples, center A data and center B data. Because I have two samples, I need to go for the two-sample t-test. Let's go to our Excel sheet. I have the data for center A and center B. I'm going to copy it into Minitab. I'm pasting my data here. Let's do the two-sample t-test. So I go to Stat, Basic Statistics, and say two-sample t-test. Are both samples in one column? No, each sample has its own column, so I'm going to select that option. Sample one is center A, sample two is center B. In Options, the hypothesized difference is zero, that is, no difference between A and B. And I go ahead and do it. I can have my individual value plot and boxplot, say OK, and say OK. Let's see the output. So the center A data is here and the center B data is here. And if you see the p-value, the p-value is high. Again, I got an example which says P high, null fly, meaning there is no difference between center A and center B. If you see the individual value plot, you see the same thing. Let's see the boxplot. The boxplot says that the mean is not significantly different. Because we have taken a sample, you are seeing a value of 0, which is an outlier, so we should be considering that.

Let me do the same thing using the Assistant: hypothesis testing, two-sample t-test, sample means. The mean of center A is different from the mean of center B, and say OK. So, do the means differ? The mean of center A is not significantly different from the mean of center B. Right? If you see this distribution, you can see that the red parts are completely overlapping with each other, telling us that there is not enough evidence to conclude that there is a difference. There is a difference when you look at the means, 6.8 and 6.5. But that could be because of chance.
And there is a standard deviation also. Hence, they show it using the red bars, telling us that there is not a significant difference between center A and center B. We will continue learning with other examples in the coming video.

15. Understand Paired t test: Let us understand one more example. This is an example of a paired t-test. If you look at this case study, a psychologist wanted to determine whether a particular running program has an effect on resting heart rate. The heart rates of 15 randomly selected people were measured. The people were then put on the running program and measured again after one year. So are the participants the same before versus after? Yes. And that is the reason it is not a two-sample t-test but a paired t-test: the before and after measurements of each person are a pair of observations. So if I go back to my dataset, I have columns called before and after, and there's a difference column; I'm not taking the difference values. I've taken the data for the 15 people and put it in Minitab. Right? Now, because it's the same persons before and after, I'm going to take the paired t-test. The first question is, what's the alternate hypothesis? Before and after are different. If you remember, they want to determine whether the running program has an effect. Measurement one is before, measurement two is after. The mean of before is different from the mean of after. So that's my alternate hypothesis. What's my null hypothesis? The mean of before is the same as the mean of after; there is no change. The alternate says before is different from after. The alpha value is 0.05. Let's click on OK. Let's see the output. So, do the means differ? The p-value is 0.007. The mean of before is significantly different from the mean of after. If you look at the mean values, they are 74.5 and 72.3, so there is a difference. And if you see, the difference is more than 0.
And if I look at these values of before versus after, the blue dot is after and the black dot is before. For most of the participants, the heart rate had reduced after the running program. A few of them were exceptions. There are no unusual paired differences. Because our sample size is at least 20, normality is not an issue. The sample is sufficient to detect a difference in the mean. So I can say that, yes, there is a difference between the two. Wonderful. So again, quick revision: P low, null go. As the p-value is less than the significance level, we conclude that there is a significant difference between the two readings.

If I have to do the same thing with the Stat menu, I click on Stat, Basic Statistics, Paired t, each sample in one column: before, after. In Options, they are different. I don't want to pick the histogram; I'll only take the boxplot. The null hypothesis is that the difference is 0. The alternate hypothesis is that the difference is non-zero. The p-value is low, so I reject the null hypothesis, concluding that there is a difference from adopting the program. If you look at the null value, the red dot is far away from the confidence interval for the mean on the boxplot, concluding that there is a difference from undergoing the program by this heart specialist, right? So in the next lesson, we will take up more examples.

16. Understand One Sample Z test: A quick recap of the different types of tests that we have learned: if I'm looking at how different my group is from the population, I go for a one-sample t-test. When I have two different groups of samples, and these samples are independent, I go for a two-sample t-test. I go for a paired t-test if the group is the same set of people measured at different points in time, like we saw in the example of the heartbeat. The people were measured on their heartbeat, they were put through a running program, and they were measured again after the running program.
How was their resting heartbeat, right? So those are the things that we saw. Now let's continue with more examples. We are on use case number five, fat percentage analysis. The scientists for a company that manufactures processed food want to assess the percentage of fat in the company's bottled sauce. The advertised percentage is 15%, and the scientists measure the percentage of fat in 20 random samples. The previous measurement of the population standard deviation is 2.6. Now, this is the population standard deviation. The standard deviation of the sample is 2.2. When I know the population parameter, I can go ahead and use the one-sample z-test, because the number of samples I have is one and I have the known standard deviation of the population. Now, again, I'm going to apply the same steps. Define the alternate hypothesis, right? What's the alternate hypothesis? Fat percentage is not equal to 15%. What is the null hypothesis? Fat percentage is equal to 15%. Level of significance: five percent. Because I know it's a one-sample test and I have the population standard deviation, I'm going to use the one-sample z-test.

Let's do the analysis. I have opened the project file, and I have the sample IDs and the fat percentage data over here. Let me copy this data into Minitab. I've copied the fat percentages that the scientists measured. Because we know the population standard deviation, I can go ahead and use the one-sample z-test. My data is in a column; it's the fat percentage. The known standard deviation is 2.6. I want to perform a hypothesis test. The hypothesized mean is 15%. So my null hypothesis is that the fat percentage is equal to 15. My alternate hypothesis is that the fat percentage is not equal to 15. I can pick the boxplot and histogram graphs and say OK. I will show you the output. So the null hypothesis is that the fat percentage is equal to 15. The alternate hypothesis is that the fat percentage is not equal to 15. The alpha value is 0.05.
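For readers without Minitab, the one-sample z-test set up above can be sketched in plain Python. The hypothesized mean (15), the known population standard deviation (2.6), and the sample size (20) come from the lesson. The transcript does not state the sample mean, so `ASSUMED_SAMPLE_MEAN` below is a placeholder chosen only for illustration, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, hypothesized_mean, population_sd, n,
                      alternative="two-sided"):
    """Return (z, p_value) for a one-sample z-test with known population SD."""
    standard_error = population_sd / sqrt(n)
    z = (sample_mean - hypothesized_mean) / standard_error
    cdf = NormalDist().cdf(z)  # standard normal CDF at z
    if alternative == "two-sided":
        p_value = 2 * min(cdf, 1 - cdf)
    elif alternative == "greater":
        p_value = 1 - cdf
    elif alternative == "less":
        p_value = cdf
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, p_value


def decide(p_value, alpha=0.05):
    """P low, null go; P high, null fly."""
    return "reject the null" if p_value < alpha else "fail to reject the null"


# Assumption: the sample mean is NOT given in the transcript; this value is
# a placeholder for illustration only.
ASSUMED_SAMPLE_MEAN = 16.46

z, p_value = one_sample_z_test(ASSUMED_SAMPLE_MEAN, 15.0, 2.6, 20)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}: {decide(p_value)}")
```

Passing `alternative="greater"` gives the one-sided version of the same test, which asks only whether the fat percentage is above the hypothesized mean.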
My p-value is 0.012. As my p-value is less than the alpha value, P low, null go. So I reject the null hypothesis, concluding that the fat percentage is not equal to 15. If you see over here, the fat percentage is more than 15. I can redo the same test, and this time check whether my fat percentage is greater than the hypothesized mean. Let's do it. And I get an even lower p-value, 0.006, very far from my alpha value. So yes, the null value, the hypothesized mean, is 15, but the sample gives strong evidence that the fat percentage in the sauce is more than 15. What advice will we give to the company? We will advise the company that it cannot sell the product claiming that it contains 15% fat, because the fat percentage is more than 15%. So to be safe, you can change the label of the product to say that the fat percentage is 18, right? Because with the five percent we are going up toward 20. A consumer will be happier to receive a product which contains less fat than stated than to receive a product which contains more fat, because we are all health-conscious, right? So let's continue in the next class.

17. Understand One Sample proportion test-1p-test: We will continue with our hypothesis testing. Sometimes we might have a proportion in the data, right? We do not have averages or standard deviations or variances to measure. Let's take example six. The marketing analyst wants to determine whether the mailed advertisement for the new product resulted in a response rate different from the national average. Normally, whenever you put an advertisement in the paper, the advertising company usually says that it will be able to achieve a 6% outcome or a 10% outcome or some such number, right? Here it's the same type of scenario. They took a random sample of 1,000 households who had received the advertisement.
And out of these 1,000 sampled households, 87 made purchases after receiving the advertisement. So this company, which is an advertising company, is claiming that it has made a better impact than other advertisers. The analyst has to perform the one-proportion z-test to determine whether the proportion of households that made a purchase was different from the national average of 6.5%, because here it is 8.7%. In this case, what is your alternate hypothesis? The alternate hypothesis is that the response to the advertisement is different from the national average. For the null we will say there is no difference; they are both the same. The alpha value is five percent. And we're going to take up the one-proportion z-test, the one-proportion test.

Let me take you to Minitab. I can go to Stat, Basic Statistics, one proportion. I do not have data in a column; I have summarized data, right? So let me close this, cancel, let me close this. So I have taken the one-sample proportion test. I have summarized data. How many events are we observing? We are observing 87 events. The sample is of a thousand. I need to perform a hypothesis test, and the hypothesized proportion is 6.5%, right? So it is 0.065. The alternate is that the proportion is not equal to the hypothesized proportion. I say OK, and OK again. Now, the null hypothesis is that the proportion is equal to 6.5 percent. The alternate hypothesis is that the proportion is not equal to 6.5 percent. The p-value is 0.008. What does it mean? Yes, P low, null go. So we reject the null hypothesis, concluding that the effect of the advertisement is not 6.5 percent but more, because if you see the 95 percent confidence interval, it runs from about 7% to 10%, right? You have got a proportion of 8.7%, and the 95% confidence interval of the proportion is well above 6.5%; it starts from 7%.
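The one-proportion test above (87 purchases out of 1,000 households, tested against 6.5%) can be sketched with a normal-approximation z-test in plain Python. This is an illustration under that assumption, with a hypothetical function name; statistical packages may use an exact binomial method instead, so the p-value from this sketch need not match the one quoted in the lesson, although the decision at the 5% level is the same.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(events, n, hypothesized_proportion):
    """Two-sided one-proportion z-test using the normal approximation.

    Returns (sample proportion, z statistic, p-value).
    """
    p_hat = events / n
    # Standard error under the null hypothesis (uses the hypothesized proportion).
    standard_error = sqrt(hypothesized_proportion * (1 - hypothesized_proportion) / n)
    z = (p_hat - hypothesized_proportion) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, z, p_value


# Figures from the lesson: 87 purchases, 1,000 households, national average 6.5%.
p_hat, z, p_value = one_proportion_z_test(events=87, n=1000, hypothesized_proportion=0.065)

ALPHA = 0.05
verdict = "reject the null" if p_value < ALPHA else "fail to reject the null"
print(f"sample proportion = {p_hat:.3f}, z = {z:.2f}, p = {p_value:.4f}: {verdict}")
```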
So we can conclude that there is a significant impact of the advertisement, and we can go with this advertising company. Let's continue in our next lesson.

18. Understand Two Sample proportion test-2p-test: Let's do this exercise one more time using the Assistant. So we have 802 products from supplier A that we have checked. 725 are perfect, that is, non-defective. So how many are defective? If I do the subtraction, 802 minus 725 is 77. 712 products from supplier B were sampled, and 573 were perfect. So how many are defective? 139. So let's do our two-proportion test using the Minitab Assistant: Assistant, hypothesis testing, two-sample percent defective. Supplier A: 802 and 77. Supplier B: 712 and 139. The alternate is that the percentage defective of supplier A is less than the percentage defective of supplier B. I will go ahead and click on OK. And I get that, yes, the percentage defective of supplier A is significantly less than the percentage defective of supplier B. And if I scroll down, yes, it shows the difference: supplier A's percentage is lower. From the test you can conclude that the percentage defective of supplier A is less than that of supplier B at the 5% level of significance. When you look at the percentages, you can also see that clearly. We will continue with the next hypothesis test in the next video.

19. Two Sample proportion test-2p-test-Example: Now let us understand the next example. This is an example where an operations manager samples a product manufactured using raw material from two suppliers, to determine whether one supplier's raw material is more likely to produce a better quality product. So 802 products were sampled from supplier A; 725 are perfect, that is, non-defective. 712 products were sampled from supplier B; 573 are perfect, that is, not defective. What we have here is the perfect, that is, non-defective, percentage. Yes, I have got two proportions, supplier A and supplier B.
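The supplier comparison can also be sketched outside Minitab. The example below runs a pooled two-proportion z-test (normal approximation) on the counts given above: 725 perfect out of 802 for supplier A and 573 perfect out of 712 for supplier B. The function name is hypothetical, and the pooled standard error is one common choice, not necessarily the one a given package uses by default.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(events_1, n_1, events_2, n_2):
    """Two-sided two-proportion z-test with a pooled standard error.

    Returns (proportion 1, proportion 2, z statistic, p-value).
    """
    p_1 = events_1 / n_1
    p_2 = events_2 / n_2
    # Under the null hypothesis both samples share one proportion, so pool them.
    pooled = (events_1 + events_2) / (n_1 + n_2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_1, p_2, z, p_value


# Counts of perfect (non-defective) products from the lesson.
p_a, p_b, z, p_value = two_proportion_z_test(725, 802, 573, 712)

ALPHA = 0.05
verdict = "reject the null" if p_value < ALPHA else "fail to reject the null"
print(f"supplier A = {p_a:.1%}, supplier B = {p_b:.1%}, z = {z:.2f}: {verdict}")
```

Using defective counts (77 and 139) instead of perfect counts flips the sign of z but leaves the two-sided p-value unchanged.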
Let's go to Minitab. I can go to Stat, Basic Statistics, 2 Proportions. I have my summarized data. The events for the first sample are 725 perfect out of 802. So let's enter 725 and 802, then 573 and 712. The option that I am choosing is that there is a difference, and let's find it out. So the null hypothesis is that there is no difference between the proportions. The alternate hypothesis is that there is a difference between the two proportions. When I look at the p-value, the p-value comes out to be zero; P low, null go. It is concluding that I have to reject the null hypothesis. There is a difference in the performance of the two suppliers. Now, because I'm talking about perfect or non-defective products, sample one has 90% perfect and sample two has 80% perfect. So we conclude that supplier A is a better supplier than supplier B. Right? So, thank you so much. We will continue in the next lesson.

20. Using Excel = one Sample t-Test: Many times we understand test of hypothesis, but there is a challenge that we have. The challenge is that I do not have Minitab with me. Can I not do a test of hypothesis in an easy way, rather than going through a manual calculation using a statistical calculator? Do not worry, that is possible. I'm going to show you how to do a test of hypothesis using Microsoft Excel. Go to File. Go to Options. When you go to Options, go to Add-ins. When you click on Add-ins, let me click here, you have an option called Excel Add-ins in the Manage box. So select Excel Add-ins and click on Go. Click on Analysis ToolPak and ensure that this tick mark is on. Once you have that, you will find that in your Data tab you have Data Analysis available. Let me click on it for you to understand what's possible. In Data Analysis I have ANOVA, correlation, covariance, descriptive statistics, histogram, t-Test, z-Test, random number generation, sampling, regression and all those things. So it is becoming very easy for you to do hypothesis testing.
At least the continuous data hypothesis tests can be done easily through Microsoft Excel as well. I'm going to take you through a step-by-step exercise now. Let's go back to the presentation. Let's take the first problem. That is, I have the descriptive statistics for the AHT of the call. The manager of the process says that his team is working to close the resolution on the call in seven minutes. But the customer says that he's kept on hold for a long time, and hence he is spending more than seven minutes. If I look at the descriptive statistics, it is telling me ten minutes, the median is seven, and the average is 7.1. Now I would want to do this analysis using Microsoft Excel. So let's get started. I have this use case in the project data which I have uploaded. Click on AHT of call and it takes you to this place. Now, I will first teach you how to do descriptive statistics using Microsoft Excel. I'm going to click on Data Analysis under the Data tab. I'm going to look for Descriptive Statistics. Click on OK. My input range is from here to the bottom. I have selected it. My data is grouped by columns. The label is present in the first row. And I want my output to go to a new workbook. I want summary statistics and I want the confidence level for the mean. I click on OK. Excel is doing some calculation and getting it ready. Yes. Here is my output. I click on format over here to see the output. So you can see your mean, median, mode, standard deviation, kurtosis, skewness, range, minimum, maximum, sum, count, and confidence level. All these things are easily calculated at the click of a button. I do not have to write so many formulas. Now, let us go back to our dataset. I want to do the hypothesis testing. What is my null hypothesis? The null hypothesis is that the AHT is equal to seven minutes. The alternate hypothesis is that the AHT is not seven minutes; there is a difference. The alpha value I'm setting as 5%.
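The hypotheses just stated (null: mean AHT equals seven minutes; alternate: mean AHT is not seven minutes; alpha of 5%) can be sketched outside Excel as a one-sample t statistic. The handle times below are hypothetical values invented for illustration, not the course dataset, and the critical value 2.262 is the standard two-sided 5% value for 9 degrees of freedom, which applies only because this made-up sample has 10 observations.

```python
import math
import statistics

# Hypothetical call handle times in minutes (NOT the course dataset)
aht_minutes = [6.5, 7.2, 7.8, 6.9, 7.4, 7.0, 6.8, 7.5, 7.1, 6.8]
hypothesized_mean = 7.0

n = len(aht_minutes)
mean = statistics.mean(aht_minutes)
sd = statistics.stdev(aht_minutes)          # sample standard deviation (n - 1)
standard_error = sd / math.sqrt(n)
t_stat = (mean - hypothesized_mean) / standard_error

# Two-sided 5% critical value of Student's t with n - 1 = 9 degrees of freedom
T_CRITICAL = 2.262
decision = "reject H0" if abs(t_stat) > T_CRITICAL else "fail to reject H0"

print(f"n = {n}, mean = {mean:.2f}, sd = {sd:.3f}, t = {t_stat:.3f}")
print(decision)
```

Comparing the absolute t statistic with the critical value is equivalent to comparing the two-sided p-value with alpha, which is the comparison the lesson makes from the Excel output.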
And the test that I'm going to conduct is a one-sample t-test. When you are doing a one-sample t-test using Microsoft Excel, you will have to follow a small trick. The trick is, I'm going to insert a column over here, and I'm going to call it dummy, because Microsoft Excel comes with an option of a two-sample t-test. I have AHT of the call in minutes and dummy, where I have written down zeros. The average, median, everything for zero is always zero. Click on Data Analysis. I will go down and I will select t-Test: Two-Sample Assuming Equal Variances. I'm going to select this and click on OK. My input range one is this column. My input range two is this dummy. My hypothesized mean difference is seven minutes. Labels are present in both. The alpha value is set as five per cent. And I'm specifying that my output needs to be in a new workbook. I click on OK; it is doing the calculation and getting me the output. As a practice, I just click on the comma in the Format section so that the numbers are visible. I'm changing the view. Because dummy does not have any data, I am free to go ahead and delete this column. Now let's understand what we always look for. We look for this value, the p-value. Do you remember the formula? Let me get my formulas over here. Yes. What is the conclusion? The conclusion is P high; I fail to reject the null hypothesis, concluding that the AHT of the call is seven minutes. I'm rejecting the alternate hypothesis because my p-value is above 0.05. I'll be taking up more examples in the following lessons, so I'm looking forward to you continuing this series. If you have any questions, I would request you to drop them in the discussion section below, and I will be happy to answer them. Thank you.

21. Understand the Non Normal Data: Is our data normal or not? Let us try to understand how we work when my data is not normal.
Or even before getting there, let me introduce you to this gentleman. Any guesses? Who is the gentleman? You can type in the chat window if you know. And even if you do not know, that's perfectly fine. There are no penalty points for wrong guesses. Yes. Some of you have guessed it right. He's the famous person behind our normal distribution, Mr. Carl Gauss. He's the great mathematician, and he was the person who came up with the concept of the Gaussian distribution, or the normal distribution. So here is the brain behind the concept of the normal distribution and all the parametric tests that we are taking. If my data is not normal, then it can be skewed. It could be negatively skewed or positively skewed. Negatively skewed means it has a tail on the left side. Positively skewed means a tail on the right side. It means my data is not behaving in a normal way. My data can be not normal because it is following a uniform distribution, or a flat distribution like this. Then also it's not following the normal distribution. My data can have multiple peaks, something like this, which represents that there are multiple data groups in my dataset. That is also not normal behavior. Because my data has all these things, I need to treat this data differently when I'm doing my hypothesis testing. And why is this data not normal? It could be because of the presence of some outliers. It could be because of the skewness of my data, or it could be because of the kurtosis that's present in the data. So the reason for your data not behaving in a normal way could be one of these. Let us summarize what we learned. My data is not normal if the distribution has skewness; if it is not unimodal, but is in fact a bimodal or multimodal distribution; if it is a heavy-tailed distribution containing outliers; or if it is a flat distribution like a uniform distribution. These are some basic reasons why my data is not behaving in a normal way.
Or, if it is not a normal distribution, there are multiple other distributions as well: the exponential distribution, which models the time between events; the log-normal distribution, which says that if I apply the logarithm on the data, then my data will follow a normal distribution; the Poisson distribution, the binomial distribution, and the multinomial distribution. Let us understand some examples, real-life scenarios where the non-normal distributions can be applied. Whenever I am trying to predict something over a fixed time interval, I use the Poisson distribution for my analysis and hypothesis. Some examples of the Poisson distribution are the number of customer service calls received in the call center, the number of patients that present at a hospital emergency room on a given day, the number of requests for a particular item in an online store in a given day, the number of packages delivered by the delivery company in a given day, and the number of defective items produced by a manufacturing company in a given day. If you observe, there is a common behavior here. Whenever we are trying to understand something over a particular time period, it could be a given day, a given month, or a given week, we prefer to do our analysis using the Poisson distribution. Some examples of the log-normal distribution: the size of file downloads from the internet, the size of the particles in a sediment sample, the height of trees, the size of financial returns, the size of insurance claims. If I take the example of financial returns from investments, you might see that out of my portfolio of investments, some investments gave me a very good return of 100%, 150%, or 80%. And you will also see that some investments in my portfolio resulted in a zero return or a negative return, where I am in loss.
But overall my portfolio is giving me a return of 12 to 15% or 15 to 20%. You are trying to say that your distribution is technically not a normal distribution. You have very low returns and very high returns. But if you apply the logarithm on your data, then it behaves like a normal distribution, so that overall your portfolio will result in a return of some X percent. The same applies to insurance claims. Let us try to understand the application of the exponential distribution: the time between arrivals of customers in a queue, the time between failures of a machine in your factory, the time between purchases in the retail store, the time between phone calls in the contact center, the time between page views on the website. Now if you look at the Poisson distribution and the exponential distribution, there is one common element. What is the common element? We're trying to study something with reference to time. Whenever you're using a normal distribution, it's not with reference to time. Right? So these are some applications. But the difference between a Poisson and an exponential is this: in a Poisson distribution, we count on a particular day, on a given day, in a given week or a given month. Here we are trying to understand the time between two events. What is the time gap between the two events? Then the exponential distribution can help you out. Let's understand the application of the uniform distribution, like the heights of the students in a class, or the weights of packages in a delivery truck. Some packages are very big, some packages are small. If you put them in a distribution, you will find that it's a flat distribution or a uniform distribution, because for each category of packages, you'll have approximately the same number of packages or goods that you're delivering. The distribution of test scores for a multiple choice exam.
The distribution of waiting time at a traffic light, the distribution of the arrival time of a customer at a retail store. So if you see, all these examples follow a uniform distribution; it's not a bell curve, because you have people continuously arriving at the retail store. It's not that there is a sudden peak. And the real-world scenarios of heavy-tailed distributions, meaning distributions where outliers are present: the size of financial losses in the insurance industry, or other sizes of financial loss. If you ask a trader, they would say that there are extremely high and extremely low numbers. The size of extreme rainfall. We do not have extreme rainfall every year, so we would be able to say that whatever has happened is because of an outlier. And heavy-tailed distributions are usually impacted by the presence of outliers. So if your data has outliers, then you can also see that the distribution followed is a heavy-tailed distribution. And we will understand in the next session what type of non-parametric test I should be performing, depending upon the type of non-normal data that we are studying. The size of power consumption, the size of economic fluctuations, a stock market crash. These are all examples of heavy-tailed distributions. Examples of bimodal data: here you need to understand that bimodal means there are two groups that we're trying to study. The distribution of exam scores of students who studied and those who did not. The distribution of ages of individuals in a population from two distinct age groups, the heights of two different species, the salary distribution of employees from two different departments, car speeds on a highway with two groups of slow and fast drivers. So here you can see that I have two groups of data which are different.
And I'm trying to understand the behavior, or I will go ahead and do my investigation as part of my hypothesis or the research that I'm trying to do. If I have more than two different groups, like three different groups or four different groups, then it becomes a multimodal distribution. Right? So I think by now you would have gotten an idea of the different distributions which are not normal distributions. So how do I determine if my data is not normal? The first thought that comes to our mind is a normality test. But even before doing a normality test, you can use simple graphical methods to find out if your data is normal or not. You can use a histogram. And here the histogram is clearly showing multiple modes. So I can clearly see that this is not a normal distribution. If I try to put a fit line, then also I can see that there is skewness in my data. I can also use a box plot to determine if my data is not normal. So here you can see that I have a heavy tail on the left side, telling me that my data is skewed. I can also have outliers, which a boxplot can easily highlight. So I can identify a heavy-tailed distribution using the boxplot also. I can use simple descriptive statistics, where I can see the numbers for mean, median and mode. And when I see that these numbers are not overlapping or not close to each other, that also simply indicates that my data is not normal. I can look at the kurtosis and skewness of my data distribution and then come to a conclusion about whether my data is behaving normally or not. So I have shown you other ways of identifying whether your data is following a non-normal distribution or a normal distribution. Now I would say one more thing. Do not kill yourself if your mean is 23.78, your median is 24, and the mode is 24.2 or 24. If there is a slight deviation, we still consider it to be normal. Right?
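The descriptive checks mentioned above (comparing mean and median, and looking at skewness and kurtosis) can be sketched in a few lines. The two datasets are hypothetical, and the helper functions `skewness` and `excess_kurtosis` are written for this example using simple moment formulas; spreadsheet functions such as Excel's SKEW and KURT apply sample-size adjustments, so their values will differ somewhat on small samples.

```python
import statistics


def central_moment(data, order):
    """Average of (x - mean) ** order over the data."""
    m = statistics.fmean(data)
    return sum((x - m) ** order for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: 0 for a perfectly symmetric sample."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal curve is near 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


# Hypothetical samples, invented for illustration only
symmetric = [2, 3, 4, 4, 5, 6]
right_skewed = [1, 2, 2, 3, 3, 4, 15]   # one large value pulls the tail right

for name, data in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(
        f"{name}: mean = {statistics.fmean(data):.2f}, "
        f"median = {statistics.median(data):.2f}, "
        f"skewness = {skewness(data):.2f}, "
        f"excess kurtosis = {excess_kurtosis(data):.2f}"
    )
```

In the symmetric sample the mean and median coincide and the skewness is zero, while in the right-skewed sample the mean is pulled above the median and the skewness is clearly positive.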
Skewness close to zero is an indication that my data is normal. But if my skewness is beyond minus two or plus two, it is definitely a proof of non-normality. Kurtosis is one more way of identifying if my data is following a normal distribution. Most of the time we prefer the kurtosis number to be between 0 and 3. But if your kurtosis is negative, it means that it's a flat curve, or it follows a uniform distribution, or it could be a heavy-tailed distribution. A high kurtosis could also be an indication that your data is too perfect, and maybe you need to investigate whether someone has manipulated your data before handing it over. Our favorite is the AD test, or Anderson-Darling test, where we try to understand if my data is normal or not. So the basic null hypothesis whenever I'm doing an AD test is that my data follows a normal distribution. So this is the only test where I want my p-value to be greater than 0.05, so that I fail to reject the null hypothesis, concluding that my data is normal, and I fall back on my favorite parametric tests, which make it easy for me to do the analysis. But what if during the AD test your data analysis shows that the p-value is significant, that it is less than 0.05, maybe 0.02? Then it concludes that my data is not normally distributed, and I need to investigate what type of non-normality it has. Accordingly, I will have to pick the test and then take it further. We will continue our session next Wednesday. I hope you liked it. If you have any questions, please feel free to comment in the WhatsApp or Telegram channel, or in the comments section over here. If there is any topic which you would like to learn as part of the Wise Wednesday sessions, I would be happy to look into that if you can put those comments in the chat box, in the WhatsApp group, or on Telegram. I really love teaching you, and I thank you for being wonderful students. Take care.

22.
Conclusion: I would like to thank you very much for completing the program. It shows that you are highly committed to your learning journey. You want to upskill yourself, and I trust you have learned a lot. I hope all your concepts are also clear. I want to tell you about the other programs that I run on Skillshare. On Skillshare, I have many other programs which are already there, and many more will come up in the coming weeks and months. The programs include storytelling with data, how to use analytics, data visualization, predictive analytics without coding, and many more. Apart from this, I also work as a corporate trainer. I ensure that all my programs are highly interactive and keep all the participants very much engaged. I design books which are customized for my workshops, which also ensures that all the concepts are clearly understood by the participants. My games are designed in such a way that the concepts get learned while they play. There are a lot of games which are designed for my programs, and if you are interested, you are free to contact me. I have also done more than 2,000 hours of training in the past two years during the pandemic. These are just a few of the workshops. So if your organization wants to take up any corporate training program, offline or online, or if you feel that you personally want to upskill, you're free to contact me on my e-mail ID. Stay connected with me on LinkedIn. If you liked my training, please ensure that you write a review on LinkedIn as well. I also run a Telegram channel where I post a lot of questions from which people can learn the concepts, and they might take just a few seconds to do. Apart from that, please ensure that you leave a review on Skillshare about how your training experience was. Please do not forget to complete your project. I love people when they are committed, and you have proved that you are one of them.
Please stay connected. Stay safe, and God bless you.