2025-Lean Six Sigma GreenBelt Analyze Phase - Test of Hypothesis using Microsoft-Excel & Minitab

Dimple Sanghvi, Master Black Belt, Data Scientist, PMP


Lessons in This Class

    • 1.

      Analyze Phase of DMAIC- Data Analytics introduction

      3:12

    • 2.

      Project Work

      0:51

    • 3.

      Basics of Statistics

      4:34

    • 4.

      Importance of Levels of Measurement or Data Types

      15:57

    • 5.

      Measures of Center and Measures of Dispersion

      9:13

    • 6.

      Minitab

      2:16

    • 7.

      What is Descriptive Statistics

      4:32

    • 8.

      Descriptive vs Inferential Statistics

      9:13

    • 9.

      Concepts of Inferential Statistics Part 2

      7:01

    • 10.

      Concepts of Hypothesis testing in detail

      12:22

    • 11.

      Introduction to 7 QC Tools

      1:34

    • 12.

      Checksheet

      5:03

    • 13.

      Boxplot

      8:33

    • 14.

      Understand Box Plot Part 1

      5:22

    • 15.

      Understand Box Plot Part 2

      7:37

    • 16.

      Pareto analysis

      19:20

    • 17.

      Concept of hypothesis testing and statistical significance

      5:56

    • 18.

      Understand Test of Hypothesis

      5:27

    • 19.

      Null and alternate Hypothesis concept

      7:01

    • 20.

      Statistics Understanding P value

      7:48

    • 21.

      Understand Types of Errors

      4:49

    • 22.

      Understand Types of Errors-part2

      5:57

    • 23.

      Remember-the-Jingle

      4:34

    • 24.

      Test Selection

      5:40

    • 25.

      Concepts of T Test in detail

      19:02

    • 26.

      Understand 1 sample t test

      6:57

    • 27.

      Understand 2 sample t test example 1

      5:32

    • 28.

      Understand 2 sample t test example 2

      3:14

    • 29.

      Understand Paired t test

      3:59

    • 30.

      Understand One Sample Z test

      5:16

    • 31.

      Understand One Sample proportion test-1p-test

      4:01

    • 32.

      Understand Two Sample proportion test-2p-test

      1:39

    • 33.

      Two Sample proportion test-2p-test-Example

      2:21

    • 34.

      Using Excel - One Sample t-Test

      6:51

    • 35.

      Correlation analysis

      27:56

    • 36.

      Pearson's Correlation analysis concept

      15:50

    • 37.

      Point Biserial correlation

      11:17

    • 38.

      Logistic Regression

      19:43

    • 39.

      Logistic Regression practice

      20:01

    • 40.

      ROC Curve

      18:49

    • 41.

      Understand the Non Normal Data

      15:15

    • 42.

      Kruskal-Wallis test: 3 or more groups, non-normal data

      13:20

    • 43.

      Design of Experiments

      4:23

    • 44.

      The areas of application for a DOE

      4:01

    • 45.

      Types of Designs in a DOE

      4:42

    • 46.

      How to reduce the number of runs

      5:23

    • 47.

      Type of Effects

      4:30

    • 48.

      Fractional Factorial design

      10:48

    • 49.

      Plackett Burman Central Composite design

      3:13

    • 50.

      Conclusion

      2:25


447 Students

25 Projects

About This Class

This comprehensive Data Analytics Bootcamp curriculum covers the foundations of statistics and how to analyze data using Minitab.

Learn about:
  • Basics of Statistics
  • Descriptive Statistics
  • Graphical Summary
  • Distributions
  • Histogram
  • Boxplot
  • Bar Chart
  • Pie Chart
  • Test of Hypothesis
  • Types of Errors
  • One Sample T-Test
  • Two Sample T-Test
  • Paired T-Test
  • One-Way ANOVA
  • Chi-square test

Who is this class for?

Anyone who is a Lean Six Sigma student, or anyone who wants to understand and apply statistics and graphical analysis.

Key Takeaways

  • Understand how to do some basic analysis
  • Understand and apply the tools required during the Measure and Analyze phases of Six Sigma projects
  • Know which graph to use when
  • Recognize common mistakes made when performing graphical analysis
  • Create graphs for drawing conclusions

Meet Your Teacher

Dimple Sanghvi

Master Black Belt, Data Scientist, PMP

About Me

I am dedicated to empowering individuals to unlock their potential and make a meaningful impact. As a Consultant and Independent Director on a Corporate Board (NSE & BSE), I bring a wealth of experience to my roles, including being a Lean Six Sigma Master Black Belt and a Leadership Coach & Mentor. My expertise extends to AI, ML, and Data Science Coaching.

Let's connect on LinkedIn (https://www.linkedin.com/in/dimplesanghvi/) to explore opportunities for professional growth and networking. I often discuss topics such as #ChatGPT, #DataAnalytics, #CoachingBusiness, #StorytellingWithData, and #LeanSixSigmaBlackBelt.

Join my Telegram channel to embark on a journey through Lean Six Sigma and Storytelling.

Level: All Levels

Class Ratings

Expectations Met?
  • Exceeded!: 0%
  • Yes: 0%
  • Somewhat: 0%
  • Not really: 0%


Transcripts

1. Analyze Phase of DMAIC - Data Analytics Introduction: Hello friends. Let's get started on this training program on data analytics using Minitab. What are you going to learn in this course? The skills you will learn are some basics of statistics. We will be covering descriptive statistics, graphical summary, distributions, histograms, box plots, bar charts, and pie charts. I am also setting up a new series on test of hypothesis, which I will share as a link in the last video. But let's first understand all the different types of graphical analysis. Who should attend this class? Anyone who is a student of Lean Six Sigma, who wants to get certified as Green Belt or Black Belt, or who wants to apply statistics and graphical analysis in their place of work. Even if you are an entrepreneur or a student who wants to understand statistics using Minitab, I am going to cover all of it. We are going to learn what mistakes commonly happen when we analyze data, because when we do analysis using simple textbook data points, everything appears to be normal. So I am going to show you some traps in which our analysis can fail and how you should avoid those traps. What will you take away from this program? You will understand how to do some basic analysis. You will understand the tools that are required during your Measure phase, like capability calculations, and the tools we will use during the Analyze phase. If possible, I will cover test of hypothesis here; otherwise, if the video gets too long, I will put it out as a separate series. I will also cover which graph to use when, some common mistakes we make when we perform graphical analysis, creating graphs, and how to derive insights and conclusions from those graphs. This will really help you understand this program well. Let's see what Minitab is. Minitab is a statistical software that is available in multiple versions. When I open a new project, my Minitab screen looks something like this: I have a navigator on the left side, my output screen on the top, and my data sheet, which is very much like an Excel sheet, which I can work with. I can keep adding these sheets and hold lots of data, and I can do a lot of analysis using the menu options. We are going to cover basic statistics and regression, and we will be covering lots of graphs using different types of data. So if you are interested in learning these things, you should definitely enroll and watch this class. Thank you so much. 2. Project Work: Let us understand the project work that we are going to do in this data analytics program using Minitab. As I told you, we are going to work with Minitab, and this is the Minitab version that I will be using. I will also be sharing a project data sheet with you, where I have multiple examples: we do calculations on capability, we look at distributions, and you can see that there are various tabs - example one, example two, example three. We will try some trend analysis and we will look at Pareto charts. Lots of data has been shared with you, which will give you hands-on experience working with data. So let's get started. 3. Basics of Statistics: Welcome to our next important topic, Basics of Statistics.
In this video, you will learn what statistics is, what descriptive statistics is and what inferential statistics is. Let's start with the first question. What is statistics? Statistics deals with the collection, analysis, and presentation of data. For example, if we want to investigate whether gender has an influence on the preferred newspaper, then gender and newspaper are our so called variables that we want to analyze. To analyze whether gender has an influence on the preferred newspaper. We first need to collect data. To do this, we create a questionnaire that asks about gender and preferred newspaper. We will then send out the survey and wait two weeks. Afterwards, we can display the received answers in a table in this table. We have one column for each variable, one for gender and one for newspaper. On the other hand, each row represents the response of one person. For example, the first respondent is male and stated the times of India. The second is female, and stated the Hindu, and so on. Of course, the data does not have to come from a survey. The data can also come from an experiment in which. For example, want to study the effect of two drugs on blood pressure. Let's consider another real life example. Imagine you are a store manager and want to know if a new product display increases sales. You could collect data on sales before. And after the new display is set up, this data will help you analyze the effectiveness of the display, or suppose your school administrator, and want to understand if extra tutoring sessions are helping students improve their math scores. You could collect as scores before? After the tutoring sessions to analyze the impact. Now the first step is done. We have collected data and we can start analyzing the data. But what do we actually want to analyze? We did not survey the entire population but took a sample. Now, the big question is, do we just want to describe the sample data, or do we want to make a statement about the whole population? If our aim is limited to the sample itself. That is, we only want to describe the collected data. We will use descriptive statistics. Descriptive statistics will provide a detailed summary of the sample. For example, if we surveyed 100 people about their preferred newspaper, descriptive statistics would tell us how many people prefer times of India or the Hindu. However, if we want to draw conclusions about the population as a whole. We use inferential statistics. This approach allows us to make inferences about the population based on our sample data. For instance, using inferential statistics, we might estimate the proportion of all adults in a city who prefer a specific newspaper based on a sample of 500 respondents. Inferential statistics can also help us determine if a certain demographic, like gender, significantly influences newspaper preferences. By analyzing our sample data, we can make inferences about the entire population's newspaper preferences. By using both descriptive and inferential statistics, we can gain a deeper understanding of our findings and make informed decisions about marketing strategies or content creation for different newspapers. In the next lesson, we will dive deeper into practical applications of statistics. Stay tuned. 4. Importance of Levels of Measurement or Data Typs: Importance of levels of measurement. Understanding the level of measurement is crucial for several reasons. Appropriate analysis. Different levels of measurement require different statistical techniques. 
Using the wrong method can lead to incorrect conclusions. Data interpretation. Knowing the level helps incorrectly interpreting results. For example, mean values are meaningful for interval, and ratio data, but not for nominal or ordinal data. Visualization, effective data visualization techniques vary based on the level of measurement. Bar charts are suitable for nominal data, while histograms are better for interval and ratio data. Let's delve deeper into each level of measurement. Nominal level of measurement. Nominal variables categorize data without establishing any meaningful order. For example, asking respondents about their mode of transportation to school, bus, car, bicycle, or walk is nominal. Each category is distinct, but there's no inherent ranking or order among them. Analyzing nominal data involves counting frequencies or using bar charts to visualize distributions. Ordinal level of measurement, ordinal variables introduce a meaningful order or ranking among categories, but the differences between ranks are not consistently measurable. For instance, asking students to rate their satisfaction with their mode of transportation as very satisfied, satisfied, neutral, satisfied, or very satisfied demonstrates ordinal measurement. While we can rank these responses from least to most satisfied, the numerical difference between satisfied and very satisfied isn't quantifiable. Analysis typically involves median calculations and non parametric tests. Interval and ratio levels of measurement, metric variables. Interval and ratio variables are considered metric variables. They share the characteristic that the intervals between values are equally spaced, but ratio variables also have a true zero point, making all arithmetic operations valid. Examples include measuring age, weight, or income. For instance, asking respondents about the number of minutes it takes to get to school measures interval data, where the intervals between responses, EG, 10 minutes, 20 minutes are consistent and meaningful. This allows for statistical measures such as calculating averages and using advanced statistical techniques like regression analysis. Summary. Understanding these levels of measurement is crucial for designing surveys and choosing appropriate statistical analyses. Nominal data informs us about categories without any order. Ordinal data allows for ranking but not precise measurement of differences, and metric data interval and ratio enables precise measurement and supports a wide range of statistical analyses. Whether creating frequency tables, bar charts, or histograms, selecting the right level of measurement ensures accurate interpretation of data and meaningful insights across various fields of study and research. Let's take a closer look at each level of measurement. Nominal level of measurement. Nominal data is the most basic level of measurement. Nominal variables categorize data, but do not allow for meaningful ranking of the categories. Examples include gender, male, female, types of animals, dog, cat, bird, preferred newspapers. In all these cases, you can distinguish between values, but cannot rank the categories meaningfully. For instance, investigating whether gender influences the preferred newspaper involves nominal variables. In a questionnaire, you would list possible answers for both variables. Since there's no inherent order, the arrangement of categories in the questionnaire does not matter. 
Data collected can be displayed in a table and frequency tables or bar charts can be used to visualize the distributions. Ordinal level of measurement. Ordinal data can be categorized and ranked in a meaningful order, but the differences between ranks are not mathematically equal. Examples include rankings, first, second, third, satisfaction ratings, very unsatisfied, unsatisfied, neutral, satisfied, very satisfied, levels of education, high school, bachelors, masters, in this case, while the order is meaningful. The intervals between ranks are not necessarily equal. For example, if a questionnaire asks, how satisfied are you with your current job with options ranging from very unsatisfied to very satisfied? The response categories are ordered, but the exact difference between each level of satisfaction is not quantifiable. Analysis of ordinal data often involves calculating medians and using non parametric tests. Interval level of measurement. Interval data has equal intervals between values, but lacks a true zero point. Examples include temperature in Celsius or Fahrenheit. Interval data allows for the measurement of differences between values. But because there is no true zero, Ratios are not meaningful. Statistical operations such as calculating averages, and using techniques like regression analysis are possible. Ratio level of measurement. Ratio data has equal intervals between values and includes a true zero point. Examples include age, weight or income, because ratio data includes a true zero. All arithmetic operations are valid. This level allows for the calculation of ratios and averages and enables the use of advanced statistical methods. Oh. What we've learned so far using an example. Imagine you're conducting a survey in a school to understand how pupils get to school. Here are questions you might ask. Each corresponding to a different level of measurement. The first question could be, what mode of transportation do you use to get to school? Options might include bus, car, bicycle, or walk. This is a nominal variable. The answers can be categorized, but there is no meaningful order. This means that bus is not higher than bicycle. Walk is not higher than car and so on. If you want to analyze the results of this question, you can count how many students use each mode of transportation and present it in a bar chart. Next, You might ask, how satisfied are you with your current mode of transportation? Choices might include very unsatisfied, unsatisfied, neutral, satisfied, or very satisfied. This is an ordinal variable. You can rank the responses to see which mode of transportation ranks higher in satisfaction. But the exact difference between satisfied and very satisfied. For example, is not quantifiable. For the final question, how many minutes does it take you to get to school? Here, minutes to get to school is a metric variable. You can calculate the average time it takes to get to school and use all standard statistical measures. We can visualize this data with a histogram showing the distribution of times it takes to get to school and compare the different transportation modes. So Using nominal data, we can categorize and count responses, but cannot infer any order. Ordinal data allows us to rank responses, but not measure precise differences between ranks. Metric data enables us to measure exact differences between data points. As already mentioned, metric levels of measurement can be further subdivided into interval scale and ratio scale. 
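As a quick companion to the school-transport example above, the sketch below shows how the level of measurement drives the choice of summary statistic. It is a minimal Python illustration, not part of the course's Minitab project files; the column names and survey values are invented for demonstration.

    import pandas as pd

    # Hypothetical responses to the three survey questions described above
    survey = pd.DataFrame({
        "transport": ["bus", "car", "bicycle", "walk", "bus", "car", "car", "bus"],  # nominal
        "satisfaction": [4, 5, 3, 2, 4, 5, 4, 3],  # ordinal: 1 = very unsatisfied ... 5 = very satisfied
        "minutes_to_school": [15, 25, 10, 30, 20, 22, 18, 12],  # metric (ratio)
    })

    # Nominal: only counting categories is meaningful (frequency table / bar chart)
    print(survey["transport"].value_counts())

    # Ordinal: ranking is meaningful, so report the median rather than the mean
    print("Median satisfaction:", survey["satisfaction"].median())

    # Metric: equal intervals (and a true zero here), so the mean and further
    # statistics such as the standard deviation are meaningful
    print("Mean travel time:", survey["minutes_to_school"].mean(), "minutes")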
But what is the difference between interval and ratio levels? Let's explore the difference between interval and ratio levels of measurement using an example. Interval versus ratio level of measurement. In a marathon, the time taken by runners to complete the race serves as a practical example. Consider a scenario where the fastest runner finishes in 2 hours and the slowest finishes in 6 hours. Here's how we classify the measurement level based on the information provided. Ratio level of measurement. A ratio level of measurement is characterized by having a true zero point where zero represents the absence of the quantity being measured. In the Marathon example, all runners start at the same 0.0 time when they begin the race. With a true zero point, we can make meaningful comparisons such as stating that the fastest runner took three times less time than the slowest runner, 2 hours vis 6 hours. This level allows for meaningful multiplication and division operations. For instance, if one runner finishes in 4 hours and another in 12 hours, we can accurately say that the first runner was three times faster than the second. Interval level of measurement. An interval level of measurement lacks a true zero point. In the marathon context, if the stopwatch starts late and we only measure the time differences from the fastest runner who started on time, we lose the true zero reference. While intervals between values are still equally spaced and arithmetic operations like addition and subtraction are valid, multiplication and division may not be meaningful. For example, saying a runner finished 4 hours ahead of another is meaningful. But we cannot state that one runner was four times faster than another without knowing the total time for both. In summary, interval level measurement allows for equal intervals between values and supports operations like addition and subtraction, but does not possess a true zero point necessary for meaningful ratios. Now, a little exercise to check whether everything is clear to you. First, we have state of the US, which is a nominal level of measurement. This means the data is used for labeling or naming categories without any quantitative value. In this case, the states are names with no inherent order or ranking. Next, we have product ratings on a scale 1-5. This is an example of ordinal data. Here, the numbers do have an order or rank. Five is better than one, but the intervals between the ratings are not necessarily equal. Moving on to names of departments like the procurement, sales, operations, finance, this is also nominal. The categories here, such as different departments are for categorization and do not imply any order. Next, we have CO two emissions in a year, which is measured on a metric ratio scale. This level allows for the full range of mathematical operations, including meaningful ratios. Zero emissions mean no emissions at all. Then we have telephone numbers. Although telephone numbers are numeric, they are categorized as nominal. They are just identifiers with no numerical value for analysis. Level of comfort is another ordinal example. This might include levels such as low, medium, and high care, which indicate an order, but not the exact difference between these levels. Living space in square meters is measured on a ratio scale. Like CO two emissions, square meters mean there is no living space and comparisons like double or half are meaningful. Lastly, we have job satisfaction on a scale 1-4. This is ordinal data. 
It ranks satisfaction levels, but the difference between each level isn't quantified. In the next lesson, we will dive deeper into practical applications of design of experiments. Stay tuned. 5. Measures of Center and Measures of Dispersion: Let's examine both methods, starting with descriptive statistics. Why is descriptive statistics important? For instance, if a company wants to understand how its employees commute to work. It can create a survey to gather this information. Once enough data is collected, it can be analyzed using descriptive statistics. So what exactly is descriptive statistics, its purpose is to describe and summarize a dataset in a meaningful way. However, it's crucial to note that descriptive statistics only reflect the collected data and do not make conclusions about a larger population. In other words, knowing how some employees at one company commute doesn't allow us to fer how all workers do. Now, to describe data descriptively, we focus on four key components, measures of central tendency, measures of dispersion, frequency tables, and charts. Let's start with measures of central tendency, which include the mean, median, and more. First, the mean, the arithmetic mean is calculated by adding all observations together and dividing by the number of observations. For example, if we have the test scores of five students, we sum the scores, and divide by five to find that the mean test score is 86.6. Next is the median. When the values in a data set are arranged in ascending order, the median is the middle value. If there's an odd number of data points, it's simply the middle value. If there's an even number, the median is the average of the two middle values. An important aspect of the median is that it is resistant to extreme values or outliers. For example, regardless of how tall, the last person is in a high data set. The median will remain the same. While the mean can change significantly based on that value, the median remains unchanged regardless of the last person's height. Meaning it is not affected by outliers. In contrast, the men can change significantly based on that last person's height, making it sensitive to outliers. Now, let's discuss the mode. The mode is the value or values that occur most frequently in a dataset. For example, if 14 people commute by car, six by bike, five walk, and five take public transport, then car is the mode since it appears most often. Next, we move on to measures of dispersion, which describe how spread out the values in a data set are. Key measures of dispersion include variants. Standard deviation range and intequatle range, starting with standard deviation. It indicates the average distance between each data point and the mean. This tells us how much individual data points deviate from the average. For instance, if the average deviation from the mean is 11.5 centimeters, we can calculate the standard deviation using the formula. Sigma equals the square root of the sum of each value minus the mean. Squared, divided by n, where Sigma is the standard deviation. N is the number of individuals. X sub i is each individual's value, and x bar is the mean. It's important to note that there are two formulas for standard deviation. On divides by n, while the other divides by n minus one. The latter is used when our sample does not cover the entire population, such as in clinical studies. The latter is used when our sample does not cover the entire population, such as in clinical studies. 
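Written out, the two standard-deviation formulas described above are, in the usual notation (this is simply a restatement of the spoken formula, with sigma for the population form and s for the sample form):

    \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
    \qquad
    s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}

Here n is the number of observations, x_i is each individual value, and x-bar is the mean. The first form divides by n and describes a full population; the second divides by n - 1 and is used when the data are a sample, as in clinical studies. The variance is the square of the standard deviation.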
Now, how does standard deviation differ from variance? The standard deviation measures the average distance from the mean. While variance is simply the squared value of the standard deviation. Next, let's discuss range and intequatle range. The range is the difference between the maximum and minimum values in a data set. On the other hand, the inequartile range represents the middle 50% of the data, calculated as the difference between the first quartile, Q one and the third quartile, qu. This means that 25% of the values lie below and 25% above the inte quartile range. Before we proceed to the final points, let's briefly compare these concepts, measures of central tendency and measures of dispersion. Let's consider measuring the blood pressure of patients. Measures of central tendency provide a single value that represents the entire dataset. Helping to identify a central point around which data points tend to cluster. On the other hand, measures of dispersion, such as standard deviation, range and inteQatile range indicate how spread out the data points are. Whether they are closely grouped around the center or widely scattered. In summary, while measures of central tendency highlight the central point of the data set, measures of dispersion describe how the data is distributed around that center. Now, let's move on to tables, focusing on the most important types, frequency tables, and contingency tables. A frequency table shows how often each distinct value appears in a data set. For example, a company surveyed its employees about their commute options, car, bicycle, walk, and public transport. Here are the results from 30 employees showing their responses. We can create a frequency table to summarize this data by listing the four options in the first column, and counting their occurrences from the table. It's clear that the most common mode of transport among employees is by car. With 14 employees choosing this option. The frequency table provides a concise summary of the data. But what if we have two categorical variables instead of one? This is where a contingency table, also known as a cross tabulation comes into play. Imagine the company has two factories, one in Detroit, and another in Cleveland? If we also ask employees about their work location, we can display both variables using a contingency table. This table allows us to analyze and compare the relationship between the two categorical variables. The rows represent the categories of one variable. While the columns represent the categories of the other, each cell in the table shows the number of observations that fit into the corresponding category combination. For example, the first cell indicates how many employees commute by car and work in Detroit was reported six times. Thank you. I will see you in the next lesson of statistics. 6. Minitab: In this class, we're going to learn about hypothesis testing. I'm going to teach you hypothesis testing using MiniTab. I'm also going to teach you hypothesis testing using Microsoft Office. That is using Excel and Microsoft Office for those who are interested in going for MiniTab. Let me show you from where you can download Minitab. Minitab.com under Downloads. Here we come to the download section. You have MiniTab statistical software, and it is available for 30 days for free. I have also downloaded the trial version on my system and Dando analysis and showed you showed it to you. Remember, it is available for 30 days only. 
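Before switching to Minitab in the next lesson, here is a minimal Python sketch (not taken from the course workbook; the commute times below are invented) that computes the measures of center and dispersion discussed above, including the divide-by-n versus divide-by-(n - 1) distinction:

    import numpy as np
    from statistics import multimode

    commute_minutes = np.array([12, 15, 15, 18, 22, 25, 30, 45])  # hypothetical data

    print("Mean:", commute_minutes.mean())
    print("Median:", np.median(commute_minutes))
    print("Mode(s):", multimode(commute_minutes.tolist()))        # most frequent value(s)
    print("Range:", commute_minutes.max() - commute_minutes.min())
    print("Population std (divide by n):", commute_minutes.std(ddof=0))
    print("Sample std (divide by n - 1):", commute_minutes.std(ddof=1))

    q1, q3 = np.percentile(commute_minutes, [25, 75])
    print("Interquartile range (Q3 - Q1):", q3 - q1)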
You please ensure that you complete the entire training program within the first 30 days. When you feel the value in this, you should definitely go ahead and go by the licensed version of MiniTab, which is available over here. I just have to click on Download and download Woodstock. It starts with a free 30-day trial. And it's good enough time for you to practice all the exercises which are driven. It will ask you for some personal information so that they can be in touch with you and they can help you with some discounts. If there are any. You have a section called as Dr. MiniTab or you have a phone number. If you're calling from UK, it will be easy for you to call over there. But if you're talking from other places, talk to MiniTab is a much easier option. This is a very good statistical tool and they keep upgrading the features regularly. So personally, I feel that this investment will be worth it. But for those who cannot afford to go for the license, they can use Microsoft Office, at least some of the features, not all, but some of the features are available. So initially I will show you the entire exercise of different types of hypothesis using MiniTab. And then we will move into Microsoft Excel, stay connected and keep learning. 7. what is Descriptive Statistics: In today's session, we are going to learn about descriptive statistics. Descriptive statistics means I want to understand measures of center. Like measures of center, mean, median mode. I want to understand measures of spread. That is nothing but range, standard deviation, and variance. Let's take a simple data that I have. I have cycle time in minutes for almost a 100 data points. I'm going to take the cycle time in minutes from my day project data sheet. I'll go to MiniTab and I will paste my data where here I want to do some descriptive statistics. Stats. Click on Basic Statistics and say Display Descriptive statistics. When I do this, it gives me an option in the pop-up window, which is called as, which shows me the available data fields that I have. I have cycle time in minutes. So it is telling me that I want to analyze the variable cycle time in minutes. I'll just click on, Okay, and immediately you will find that in my output window. I can just pull this down. In my output window. It is showing me that it has done some statistical analysis for the variable cycle time in minutes. I have 100 data points over here. Number of missing values are 0. The mean is 10.064. Standard error of mean is 0.103, standard deviation is 1 to minimum value is 7.5. One is nothing but your quartile one is 9.1. Median, that is, your Q2 is 10.35, Q3 is 10.868, and the maximum value is 12.490. If I need more statistical analysis, I can go ahead and repeat this analysis. This time, I'm going to click on Statistics. And I can look at the other data points that I need. Suppose if I need the range, I don't need standard error, I need I need inter-quartile range. I want to identify what is the mood. I want to identify what is the skewness and my data. What is the kurtosis in my data? I can select all of it and say, okay, I will click on, Okay. When I do this, all the other statistical parameters that I have selected will come up in my output window. This is my output window. So it's again tells me that additional data point that I selected. So radius is nothing but your standard deviation squared. It is 0.0541. It is telling me the range that is maximum minus minimum. It is 4.95. Inter-quartile range is 1.707. There is no mode in my data. 
And number of data points at 0 because there is no more, the data is not skewed. The values very close to 0, it is 0.05, but there is kurtosis. It means my data is not appearing as a non-work go. So good, we like to see how my distribution looks. Let's do that. I click on stats, I click on Basic Statistics, and I will click on graphical summary. I'm selecting cycle time in minutes. And I'm saying I want to see 95% confidence interval. I click on, Okay, let's see the output. The summary of the cycle diamond minutes. It is showing me the mean, standard deviation, variance. All the statistics things are being displayed on the right-hand side. Mean, standard deviation, variance, skewness, kurtosis, number of data points minimum first quartile median, third quartile maximum. These data points which you see as minimum Q1, median, Q3 and maximum will be covered in the boxplot. The boxplot is framed using these data points. And when you look at the Velcro, it says that the bell is not steep curve, it is a little fatter curve, and hence the kurtosis value is a negative value. We will continue our learning more in detail in the next video. Thank you. 8. Descriptive vs Inferential Statistics: Let's examine both methods, starting with descriptive statistics. Why is descriptive statistics important? For instance, if a company wants to understand how its employees commute to work. It can create a survey to gather this information. Once enough data is collected, it can be analyzed using descriptive statistics. So what exactly is descriptive statistics, its purpose is to describe and summarize a dataset in a meaningful way. However, it's crucial to note that descriptive statistics only reflect the collected data and do not make conclusions about a larger population. In other words, knowing how some employees at one company commute doesn't allow us to fer how all workers do. Now, to describe data descriptively, we focus on four key components, measures of central tendency, measures of dispersion, frequency tables, and charts. Let's start with measures of central tendency, which include the mean, median, and more. First, the mean, the arithmetic mean is calculated by adding all observations together and dividing by the number of observations. For example, if we have the test scores of five students, we sum the scores, and divide by five to find that the mean test score is 86.6. Next is the median. When the values in a data set are arranged in ascending order, the median is the middle value. If there's an odd number of data points, it's simply the middle value. If there's an even number, the median is the average of the two middle values. An important aspect of the median is that it is resistant to extreme values or outliers. For example, regardless of how tall, the last person is in a high data set. The median will remain the same. While the mean can change significantly based on that value, the median remains unchanged regardless of the last person's height. Meaning it is not affected by outliers. In contrast, the men can change significantly based on that last person's height, making it sensitive to outliers. Now, let's discuss the mode. The mode is the value or values that occur most frequently in a dataset. For example, if 14 people commute by car, six by bike, five walk, and five take public transport, then car is the mode since it appears most often. Next, we move on to measures of dispersion, which describe how spread out the values in a data set are. Key measures of dispersion include variants. 
Standard deviation range and intequatle range, starting with standard deviation. It indicates the average distance between each data point and the mean. This tells us how much individual data points deviate from the average. For instance, if the average deviation from the mean is 11.5 centimeters, we can calculate the standard deviation using the formula. Sigma equals the square root of the sum of each value minus the mean. Squared, divided by n, where Sigma is the standard deviation. N is the number of individuals. X sub i is each individual's value, and x bar is the mean. It's important to note that there are two formulas for standard deviation. On divides by n, while the other divides by n minus one. The latter is used when our sample does not cover the entire population, such as in clinical studies. The latter is used when our sample does not cover the entire population, such as in clinical studies. Now, how does standard deviation differ from variance? The standard deviation measures the average distance from the mean. While variance is simply the squared value of the standard deviation. Next, let's discuss range and intequatle range. The range is the difference between the maximum and minimum values in a data set. On the other hand, the inequartile range represents the middle 50% of the data, calculated as the difference between the first quartile, Q one and the third quartile, qu. This means that 25% of the values lie below and 25% above the inte quartile range. Before we proceed to the final points, let's briefly compare these concepts, measures of central tendency and measures of dispersion. Let's consider measuring the blood pressure of patients. Measures of central tendency provide a single value that represents the entire dataset. Helping to identify a central point around which data points tend to cluster. On the other hand, measures of dispersion, such as standard deviation, range and inteQatile range indicate how spread out the data points are. Whether they are closely grouped around the center or widely scattered. In summary, while measures of central tendency highlight the central point of the data set, measures of dispersion describe how the data is distributed around that center. Now, let's move on to tables, focusing on the most important types, frequency tables, and contingency tables. A frequency table shows how often each distinct value appears in a data set. For example, a company surveyed its employees about their commute options, car, bicycle, walk, and public transport. Here are the results from 30 employees showing their responses. We can create a frequency table to summarize this data by listing the four options in the first column, and counting their occurrences from the table. It's clear that the most common mode of transport among employees is by car. With 14 employees choosing this option. The frequency table provides a concise summary of the data. But what if we have two categorical variables instead of one? This is where a contingency table, also known as a cross tabulation comes into play. Imagine the company has two factories, one in Detroit, and another in Cleveland? If we also ask employees about their work location, we can display both variables using a contingency table. This table allows us to analyze and compare the relationship between the two categorical variables. The rows represent the categories of one variable. 
While the columns represent the categories of the other, each cell in the table shows the number of observations that fit into the corresponding category combination. For example, the first cell indicates how many employees commute by car and work in Detroit was reported six times. Thank you. I will see you in the next lesson of statistics. 9. Concepts of Inferential Statistics Part 2: Let's dive into inferential statistics. We'll start with a brief overview of what it is. Followed by an explanation of the six key components. So what is inferential statistics? It enables us to draw conclusions about a population based on data from a sample. To clarify, the population is the entire group we're interested in. For instance, if we want to study the average height of all adults in the United States, our population includes all adults in the country. The sample on the other hand, is a smaller subset taken from that population. For example, if we select 150 adults from the US, we can use this sample to make inferences about the broader population. Now, here are the six steps involved in this process. Hypothesis. We start with a hypothesis. Which is a statement we aim to test? For example, we might want to investigate whether a drug positively impacts blood pressure in individuals with hypotension. Oh, In this case, our population consist of all individuals with high blood pressure in the US, since it's impractical to gather data from the entire population. We rely on a sample to make inferences about the population using our sample. We employ hypothesis testing. This is a method used to evaluate a claim about a population parameter based on sample data. There are various hypothesis tests available, and by the end of this video. I'll guide you on how to choose the right one. How does hypothesis testing work? We begin with a research hypothesis. Also known as the alternative hypothesis, which is what we are seeking evidence for in our study. Also called an alternative hypothesis. This is what we are trying to find evidence for. In our case, the hypothesis is that the drug affects blood pressure. However, we cannot directly test this with a classical hypothesis test. So we test the opposite hypothesis, that the drug has no effect on blood pressure. Here's the process. One, assume the no hypothesis. We assume the drug has no effect, meaning that people who take the drug and those who don't have the same average blood pressure. T, collect and analyze sample data. We take a random sample. If the drug shows a large effect in the sample, we then determine the likelihood of drawing such a sample or one that deviates even more, if the drug actually has no effect, or one that deviates even more, if the drug actually has no effect, T, evaluate the probability p value. If the probability of observing such a result under the null hypothesis is very low. We consider the possibility that the drug does have an effect. If we have enough evidence, we can reject the null hypothesis. The p value is the probability that measures the strength of the evidence against the null hypothesis. In summary, the null hypothesis states there is no difference in the population, and the hypothesis test calculates how likely it is to observe the sample results if the null hypothesis is true. We want to find evidence for our research hypothesis. The drug affects blood pressure. However, we can't directly test this, so we test the opposite hypothesis, the null hypothesis. The drug has no effect on blood pressure. Here's how it works. 
Assume the no hypothesis. Assume the drug has no effect. Meaning people who take the drug, and those who don't have the same average blood pressure, collect and analyze data. Take a random sample. If the drug shows a large effect in the sample. We determine how likely it is to get such a result, or a more extreme one. If the drug truly has no effect, calculate the p value. The p value is the probability of observing a sample as extreme as ours. Assuming the null hypothesis is true. Statistical significance. If the p value is less than a set threshold, usually 0.05. The result is statistically significant, meaning it's unlikely to have occurred by chance alone. We then have enough evidence to reject the null hypothesis. A small p value suggests the observed data is inconsistent with the null hypothesis. Leading us to reject it in favor of the alternative hypothesis. A large p value suggests the data is consistent with the null hypothesis. We do not reject it. Important points. A small p value does not prove the alternative hypothesis is true. It just indicates that such a result is unlikely if the null hypothesis is true. Similarly, a large p value does not prove the null hypothesis is true. It suggests the observed data is likely under the null hypothesis. Thank you. I will see you in the next lesson of statistics. 10. Concepts of Hypothesis testing in detail: Welcome back. Let's understand hypothesis in more detail. Hypothesis of We have an entire population that we would love to study. But there would be always constraint of time and resources to study the entire population. Hence, we take a sample from the population using different sampling techniques and pull out a sample. We study the sample and draw some inferences about the population, and that is as inferential statistics. What exactly is hypothesis? A hypothesis is an assumption that can neither be prone nor disapprove. In a research process, the hypothesis is made at the very beginning, and the goal is to either reject or not reject the hypothesis. In order to reject or fail to reject the hypothesis, data example from the experiment a survey is needed, which are then evaluated using hypothesis test. Using hypothesis, usually hypotheses are performed starting at a literal review. Based on the literal review, you can either justify why you formulated the hypothesis in this way. An example of hypothesis could be men earn more than women for the same job in Austria. The hypothesis is an assumption of an expected association. Your target is either to reject or fail to reject the null hypothesis. You can test your hypothesis based on the data. The analysis of the data is done using the hypothesis testing. Man earn more than women for the same job in Austria. You made a survey of of almost 1,000 employees working in Australia, a T test of independent sample. In this test, the hypothesis you need from the survey suitable hypothesis tests such as the T test or the correlation analysis test. We can use online tools like Data tab or Excel tools to solve this. How do I formulate a hypothesis? In order to formulate a hypothesis, a research question must first be defined. A precise formulate hypothesis about the population can then be derived from the research question. Man earn more than women for the same job in Australia. To the subject, what is the question we want to ask and what is the hypothesis? You will then provide the data to the hypothesis test and draw the conclusion. 
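To make the decision rule concrete, here is a hedged sketch of the salary example as an independent-samples t test in Python with SciPy. The salary figures are invented, and the one-sided variant at the end assumes a recent SciPy version that accepts the alternative argument.

    from scipy import stats

    salaries_men = [52000, 48000, 61000, 55000, 50000, 58000, 47000, 53000]
    salaries_women = [45000, 51000, 47000, 49000, 44000, 50000, 46000, 48000]

    # H0: the mean salaries are equal; H1 (two-sided): they differ
    t_stat, p_value = stats.ttest_ind(salaries_men, salaries_women)
    print("t =", round(t_stat, 3), " p =", round(p_value, 4))

    alpha = 0.05
    if p_value < alpha:
        print("Reject H0: the sample gives evidence of a salary difference.")
    else:
        print("Fail to reject H0: no significant difference in this sample.")

    # Directional hypothesis "men earn more": instead of halving the two-sided
    # p value by hand, recent SciPy versions accept a one-sided alternative.
    t_stat, p_one_sided = stats.ttest_ind(salaries_men, salaries_women,
                                          alternative="greater")
    print("One-sided p =", round(p_one_sided, 4))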
This is a very beautiful visual representation of how a hypothesis test is performed. Hypothesis are not simple statements. They are formulated in such a way that they can be tested with They can be tested with collected data in the course of research process. To test hypothesis, it is necessary to define exactly which variables are involved and how these variables are related. Hypothesis then are assumptions about the cause and effect relationship of the association between the variables. What is a variable in this case? Variable is nothing but a property of an object or an even that can take different values. For example, an eye color is a variable. If the property of the object, I can take different values. If you are researching a social science, your variables can be gender, income, attitudes, environmental protection, et cetera. If you're researching about the medical field, then your variables could be body weight, smoking status, heart rate, et cetera. So what exactly is the null and alternate hypothesis? There are always two hypotheses that are exactly opposite to each other and that claim to be opposite. These opposite hypotheses are called as null and alternate hypothesis and are represented by H naught and H A or H one, H zero and H one. The null hypothesis of H naught assumes that there is no difference between two or more groups with respect to the characteristics that we are trying to study. The null hypothesis are hen. The null hypothesis assumes that there is no difference between two or more groups with respect to the characteristics. Example, the salary of the men and women are not different in Austria. The alternate hypothesis is the hypothesis that we want to prove or we are collecting data to prove it. So alternate hypothesis, on the other hand, assumes that there is a difference between the two or more groups. Example, the salary of the men and women differs in Austria. The hypothesis that you want to test or what you want to dive from the theory usually states the effect. The gender has an effect on salary. This hypothesis is called as the alternate hypothesis. It's a very beautiful statement, right? There is another way of writing it, and that is a gender has an effect on salary, and hypothesis test is called as alternate hypothesis. The null hypothesis usually states that there is no effect. Gender has no effect on salary. In the hypothesis test, only null hypothesis can be tested. The goal is to find out whether null hypothesis is rejected or not. There are different types of hypothesis. What types of hypothesis are available? The most common distinction is between is differences, correlation hypothesis, it can be directional and non directional hypothesis. Differential and correlation hypothesis. Differential hypotheses are used when different groups are to be distinguished and the group of men and the group of women. Correlation hypotheses are used when they want to establish a relationship or a correlation between the variable is to be tested. The relationship between age and height. Difference hypothesis. Difference hypothesis is test where we whether there is a difference between two or more groups. The example of difference hypothesis are the group of men earn more than women. Smokers have higher risk of heart attacks than non smokers. There is a difference between Germany, Austria, and France in terms of hours work per a week. Thus, one variable is always a categorical variable like gender, smoking status or the country. 
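The correlation hypotheses introduced above ("the taller a person is, the heavier he is") can be tested the same way. The sketch below uses Pearson's r from SciPy with invented heights and weights; it is an illustration, not the course's own dataset.

    from scipy import stats

    height_cm = [158, 162, 167, 171, 175, 180, 184, 190]  # hypothetical
    weight_kg = [55, 58, 65, 68, 72, 80, 84, 92]           # hypothetical

    # H0: no correlation in the population; H1: a correlation exists
    r, p_value = stats.pearsonr(height_cm, weight_kg)
    print("Pearson r =", round(r, 3), " p =", round(p_value, 4))
    # r > 0 supports the directional claim "taller people are heavier";
    # a small p value lets us reject the null hypothesis of no correlation.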
On the other hand, the other variable is an ordinal variable or a variable of salary, percentage risk of heart attack, and hours work per week. Now, let's understand correlation hypothesis a little more in detail. A correlation hypothesis test, relationships between two variables. For example, the height and the body weight. As the height of the person increases, the body weight gets impacted. The correlation hypothesis, for example, is taller a person is, the heavier he is, the more horse power a car has, the higher its fuel consumption. The better the math grade, the higher the future salary. As you can see from the examples, correlation hypothesis often take the form of the more the higher, the lower. Thus, at least two ordinal scale variables are being examined. Directional and non directional hypothesis, hypothesis are divided into directional and non directional. That is either they are one sided or two sided hypothesis. If the hypothesis contains words like better than, worse then, the hypothesis is usually directional. It could be a positive or a negative. In the case of non directional hypothesis, one often finds out the building blocks, such as there is a difference between the formulation, but it's not stated in which direction the difference lies. For the non directional hypothesis, the only thing of interest is whether there is a difference in the value between the variables under consideration. In a directional hypothesis, what is the interest whether one group is higher or lower than the other? You have two sided hypothesis, or you can have one sided hypothesis like left sided or right sided. Non directional hypothesis, a non directional hypothesis test whether there is a difference or a relationship. It does not matter in which direction the relationship exists or the different cos. In the case of a difference hypothesis, it means that there is a difference between two groups, but it does not say whether one group has a higher value. There is a difference between the salary of men and women, but it does not say who earns more. There is a difference in the risk of heart attacks between smokers and non smokers, but it does not say who is at a higher risk. In regards to the correlation hypothesis, it means that a relationship or a correlation between two variables. But it But it is not said whether the relationship is positive or negative. There is a correlation between height and weight and there is a correlation between horse power and fuel consumption in the car. In both cases, it is not said we the correlation is positive or negative. When you talk about a directional hypothesis, we are additionally indicating the direction of the relationship or the difference. In case of the different hypothesis, statement is made, which group is higher or lower value? Men earn more than women. Smokers have a higher risk of heart attacks than non smokers. In case of a correlation hypothesis, the relationship is made as to whether a correlation is positive or negative. The taller a person is, the heavier he is. The more horse power a car has, the higher its fuel economy. One sided Directional alternate hypothesis includes only the values that differ in one direction from the values of the null hypothesis. Now, how do we interpret the p value in a directional hypothesis? Usually, statistical softwares are always help you in calculating the p value. 
Excel has also become very smart in calculating the p value, and it helps in calculating the non directional test and also helps in giving the p value for this. To obtain the p value for directional hypothesis, it must check whether the effect is on right direction, then the p value is divided by two, and whether the significance level is not sped by two, but only one side. More than this, we have a tutorial on P value. So please go and watch that in the analyzed phase of my course. If you select a directed alternate hypothesis in a software lil data type, for the calculation of hypothesis, the conversion is done automatically and you can only reads. Now, step by step instruction for testing the hypothesis. You should do a literature research, formulate the hypothesis, define the scale level, determine the significance level, determine the hypothesis test, which hypothesis test is suitable for scale levels and hypothesis style? The next tutorial is about hypothesis testing. You will learn about hypothesis testing and find out which one is better and how to read it. 11. Introduction 7Qc Tools: T. Welcome to the new class on seven quality tools. This is one of the most important concepts if you are thinking about doing small continuous improvement in your process or operations or manufacturing setup. Even if you are in the service industry, these tools will help you to keep track of quality. With that, let's get started. So the seven QC tools, what am I going to cover as part of this training program? It is the seven quality control tools. Number one, things catapult, flow chart histogram Pareto analysis, Fishburn diagram also called as Ishikawa diagram Run charts check sheets. We are not only going to cover these tools at a high level. We are going to do some examples, how to draw these things using Microsoft Excel wherever possible. We're also going to give you some sample exercises with data that can help you do these activities very easily. We're going to talk about what is the tool, how to use the tool, when to use the tool, some common mistakes that we should avoid, and a step by step guide to create the output that is required. 12. Checksheet: Let's go to the next quality tool out of the seven QC tools, that is the check sheet. Let's learn more about check sheet. Check sheets are used for systematically recording and compiling the data. From the historical sources or observations as they occur. It can be used to collect data at locations where data is actually generated over time. It can be used to capture both quantitative and qualitative data. So I've shown you a simple check sheet where you have defect types and how many times this particular defect is happening. This can be used to systematically record and compile data from historical sources or observations as they occur. It can be used to collect data at locations where data is generated at real time. This type of data can be quantitative as well as qualitative. Check sheet is one of the basic seven QC. What does the check sheet do? It is used to create easy to comprehend data and that comes with simple efficient process. With every entry, create a clear picture of facts as proposed to opinion of each team member. That is why it's one of the data driven. It standardize the agreement on definitions of each and every condition. How is a check shape used? We agree upon the definition of events or conditions that are being observed. Example. 
If we seek the root cause of severity-one defects, then the agreement is on what counts as severity one. Decide who collects the data, that is, the person who will be involved in this activity. Note down the sources from where the data is collected; the data can be a sample or the entire population, and it can be qualitative as well as quantitative. Decide on the knowledge level required for the person involved in the data collection plan. Decide on the frequency of data collection, whether the data needs to be collected hourly, daily, weekly, or on a monthly basis. Decide on the duration of the data collection, that is, how long the data should be collected to make it a meaningful outcome. Construct a check sheet that is simple to use, concise, complete, and consistent in accumulating data throughout the collection period. Please note that check sheets were created as one of the quality tools when we were in the industrial age. Currently, we are in the information age; we have so many ERP softwares, machines are capturing data because of IT, and there are various other computer-generated reports available. Seek to use a check sheet only when you are in a completely manual data capture process. It is one of the tools, but the least used tool in the last few years, unless your company has no systematic approach to capturing data. It is a very good tool if you are working with blue-collar employees and you do not have high-tech systems to capture the data. I have attached the template for the check sheet in the project and resource section; you can refer to it. Just give me a second, I will show you the check sheet on the screen. You can use the check sheet that I have given you as part of my Pareto template. You can write down the categories over here, for example defect one, defect two, or whatever the name of your defect is; please list all the defects here. Then you can mark how frequently each defect is happening and when you are seeing it. I can later use this data in conjunction with my Pareto analysis, for which I have created a separate video. You don't need a separate check sheet in today's world; you can use the one which I have given over here. Thank you. I will see you in the next class. 13. Boxplot: Today, we are going to learn about the boxplot and understand it in detail. We all would have seen boxplots in multiple instances, but let's see how to interpret them. So what exactly is a boxplot? With a boxplot, you can graphically display a lot of information about your data. The box indicates the range in which the middle 50% of your values lie. Let's understand how the box plot is divided. The beginning of the box is called Q1; it is the lower end of the box, also called the first quartile. Q3 is the upper end of the box, or the third quartile. The distance between Q3 and Q1 is called the interquartile range, which covers the middle 50% of your data. 25% of the data is below Q1, 50% of the data is inside the box, and therefore 25% of the data is above the box. You have a mean line and a median line inside the box; the median again splits the data in the box into 25% and 25%.
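As a minimal sketch of the quantities just described, the snippet below computes Q1, the median, Q3, and the interquartile range for a small hypothetical list of ages in Python with NumPy; the 1.5-times-IQR fences it prints are the whisker limits discussed next. Note that Excel, Minitab, and NumPy use slightly different quartile formulas, so the numbers can differ a little between tools.

```python
# Minimal sketch (hypothetical ages): the quantities a boxplot is built from.
import numpy as np

ages = np.array([23, 27, 29, 31, 35, 40, 42, 47, 55, 61, 63, 66, 72, 81])

q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1                      # the box: middle 50% of the data
lower_fence = q1 - 1.5 * iqr       # whiskers stop at the last point inside the fences
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```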
So let us say we display the age of the participants in a boxplot. Q1 is 31 years; it means that 25% of the participants are younger than 31 years. Q3 is 63 years; it means that 25% of the participants are older than 63. So 50% of the participants are 31 to 63 years old. The mean and the median: the median is at 42, which means half of the participants are older than 42 years and the other half are younger. The dashed line is the average line, or the mean value, which represents the average. As the mean is away from the median, it clearly says that the data is skewed. The solid line represents the median and the dotted line represents the average. The points which are further away are called outliers. The height of the whisker is at most roughly 1.5 times the interquartile range; the whisker cannot keep going endlessly. The outliers and the T-shaped whisker: if there is no outlier, the whisker ends at the maximum value. If there is an outlier, the T-shaped whisker ends at the last point within 1.5 times the interquartile range, and the points beyond it are called outliers. How do I create a boxplot? You can use an Excel sheet to create your boxplot, and you can also do it using online tools. So I can just go for charts; with that, I select the metric variable, then you have an option of a histogram, and you also have an option of a boxplot, which clearly shows that Q1 is 29, Q3 is 66, the median is 42, and the mean is 46. The maximum is 99 and the upper fence is 99; there are no outliers. Let's go and change the data. Let me make one value 126. As soon as I change the value of a person to 126 and come back, you will find that there is an outlier, and it's very evident over here that 126 is an outlier. Here, the upper fence is 92, while Q3 is still the same and Q1 is still the same, so the box size does not change, and so on. What if the value is 100? In that case, you will see that it is not an outlier, but it is still part of the whisker. I can make the graphic small, I can show the zero line, I can show the standard deviation, I can show the points, and I can make it horizontal or vertical. All these options are possible using an online statistical tool, and I can obviously download the zip file and work with it. Okay, how can I do a boxplot using an Excel sheet? I have copied the same data over here. I have different groups, so I have gone ahead and selected my age column as data. Now I go to Insert, Recommended Charts, go to All Charts, and I have the Box and Whisker chart, and I'm able to see my box-and-whisker chart. I can remove my grid lines and add the data labels, and it clearly shows my plot. Maybe I can just enlarge it to make it more visible, and I can change the color of my graph. My average is over here, my median is 42, and I can see Q1 and Q3. Now, the same graph I can also group based on the groups. I take the group and the age, click on Insert, click on Recommended Charts, go to All Charts, and do Box and Whisker. This time, I have four boxes, one for each group. I can change the color of my graph and include the data labels; when I include them and click on the comma sign, you will find that the data points have been labeled. So it's very easy to draw this graph using Excel as well as using some online tools. For the grouped chart, I selected the group column plus the age, and for the first chart, I had taken only the age.
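If you prefer code to Excel menus, here is a minimal matplotlib sketch, with hypothetical groups and ages, that produces a grouped box-and-whisker chart similar to the one built above; `showmeans=True` adds a mean marker alongside the median line.

```python
# Minimal sketch: grouped boxplots in Python, similar to the Excel
# box-and-whisker chart shown above. Groups and ages are hypothetical.
import matplotlib.pyplot as plt

groups = {
    "A": [25, 31, 34, 40, 42, 47, 55],
    "B": [22, 29, 33, 38, 44, 52, 90],    # 90 will show up as an outlier
    "C": [30, 35, 39, 41, 46, 50, 58],
    "D": [27, 32, 36, 43, 49, 57, 64],
}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()), labels=list(groups.keys()), showmeans=True)
ax.set_ylabel("Age (years)")
ax.set_title("Age by group (hypothetical data)")
plt.show()
```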
Now, let's say for group C, if I go ahead and change a value to 100, you will find that there is an outlier over there. The minimum value is ten; let's change the value to 25, and you will realize how the values change. Great. So I will see you in the next class. Thank you. 14. Understand Box Plot Part 1: In this lesson, we are going to learn more about the boxplot. A boxplot is one of the graphical techniques which helps us identify outliers. Let us understand how a boxplot gets formed; let's understand the concept first before we get into the practicals. A boxplot is called a boxplot because it looks like a box, and it has whiskers like the ones a cat has on its face. Now, just like the way a cat cannot have endless whiskers, the size of the whiskers of the box plot is decided by certain parameters. You will see some important terminologies when you are forming a boxplot: number one, what is the minimum value, what is quartile one, what is the median, what is quartile three, what is the maximum size of the whisker, and what is the maximum value among the data points. The minimum talks about the lowest point and how far the whisker can be extended. Q1 stands for the first quartile, which covers 25% of the data. Let's assume for ease that we have 100 data points; 25 percent of the data will be below this Q1 mark. Between Q1 and Q2, another twenty-five percent of your data will be present. Q2 is also called the median, or the center of your data; so if I arrange my data in ascending or descending order, the middle data point is called the median, and it is Q2. Q3, otherwise also called the upper quartile, covers the next twenty-five percent of the data after the median. So technically, by now, seventy-five percent of your data is below the third quartile: 25 percent below Q1, 50% of the data below Q2, and seventy-five percent of the data below Q3. It means twenty-five percent of my data points will be above Q3. Now, the distance between Q1 and Q3 is called the box size, and this box size is also called the interquartile range; Q3 minus Q1 is the interquartile range. As I told you at the beginning of the class, the size of the whisker depends upon the interquartile range, or IQR. From Q3, I can extend the whisker up to 1.5 times the size of the box; so Q3 plus 1.5 times the IQR will be the upper limit for my whisker on the upper side. If I want to draw the whisker on the lower side, it is the same 1.5 times the interquartile range, but I subtract this value from Q1 and extend the whisker till that value, so it sets up the lower limit. You might have data points which fall below the lower limit, and you might have data points which go beyond the maximum reach of the whisker; these data points are called outliers. The beauty of the boxplot is that it will help you identify whether there are any outliers in your dataset. Let's see how we can construct a boxplot. Practically, I don't have to worry about manually finding the 25% cut-offs; we will go to Minitab and let it do the work. So let's see this datasheet. In our previous class, we did some descriptive statistics on this, and we found the minimum, Q1, Q2, Q3, and the maximum data point. Let's try to construct a boxplot for the cycle time in minutes. So I will click on Graph.
I will go to Boxplot, choose Simple, and click OK. I'm going to select cycle time in minutes, and I'm going to say OK. Let's see the data view. If you look at this boxplot, the bottom line of the box is Q1; it is 9.16. The median is the middle line, and it need not be exactly in the center. The top of the box is Q3, which is 10.86 in this data range, and the interquartile range is 1.7. My whiskers can extend 1.5 times this IQR above the box, and 1.5 times 1.7 below it. And you can see that there are no asterisk marks in this boxplot, very clearly indicating that there are no outliers in my current dataset. Let's pick up some more data sets in our next video to understand how box plots behave. 15. Understand Box Plot Part 2: Let us continue our journey on understanding boxplots in more detail. If you go to the sheet in your project file which is called boxplot, I have collected cycle time data for five different scenarios. As you can see, in some cases I have more data points, around 45, and in some cases I have only 14 data points. So let's try to analyze this in more detail to understand how the boxplot works. I have copied this data onto Minitab: case one, case two, case three, case four, and case five. The first thing I would want to do is some basic descriptive statistics for all five cases, so I'm selecting all of them. When I see my output, I can see that in three of the cases I have 45 data points, in the fourth case I have 18 data points, and in the fifth case I have 14 data points. So the number of data points varies. If you look at my minimum value, it ranges from 1 to around 21 or 22, and the maximum value is somewhere between 40 and 100. In one scenario, I have values from 21 to 40; in another scenario, I have values from 1 to 100, which very clearly shows that even where the number of data points is the same, my range of values is wide. If you look at the range, it goes from 18.8 up to 99; case two has the widest range, 99. The same can also be observed in the standard deviation. You can see that the skewness of the data is different and the kurtosis is different. Let's first understand the box plot in detail, and in the next video, when I'm talking about the histogram, we will understand the distribution pattern using the same data set. Let's get started. I click on Graph, I click on Boxplot, and I click on Simple. What I can do is take up one case at a time to analyze my data. For case one, it shows me a box plot, and this boxplot very clearly shows that there are no outliers in my data. When I keep my cursor over here, I have 45 data points, my whiskers range from 21.6 to 40.4, and my interquartile range is 5.95. My median is 30.3, my first quartile is 26.9, and my third quartile is 32.85. Let's redo this for case two. If you now look, the box is looking very small. Here my data points are again forty-five, and my whiskers are again ranging from about 21.6 to 40.4, like my previous scenario, but I have outliers over here which are far beyond. If you remember the descriptive statistics for case two, my minimum value is 1 and my maximum value is 100. My median is similar to my previous scenario, my Q1 is also similar, not the same but similar, and Q3 is also similar. But when you look at the box plot, the box is very small, very clearly indicating that my interquartile range is 6.95.
My whiskers can only go 1.5 times the IQR, and any data point beyond the whisker will be called an outlier. I can select these outliers, and it very clearly shows that in case two, the value 100 is in row number one; in row number 37, I have a value of 90; in row number 30, I have a value of 88; and in row number 21, I have a value of 1, which is the minimum. So I have outliers on both sides. Let's understand case three. When I look at case three and put my cursor on the boxplot, I have the same 45 data points, and my whiskers go from about 21.6 to 40.4, like case one and case two. But in this scenario, I have a lot of outliers on the lower end, that is, below the bottom of the box. It is easy for us to click on each one of them and see what the values are. Let's look at case four. The beauty over here is that I have only 18 data points, but I still have an outlier. Let's do it for case five and understand that as well. I have a smaller box, I have only 14 data points, and I have an outlier on the upper end and an outlier on the lower end; here the value is 23. But seeing these plots separately makes it difficult for me to do a comparison. Can I get everything on one screen? So I go to Graph, I go to Boxplot, I choose Simple, I select all the cases together, and under Multiple Graphs I say the same scale should be used and the axes and grid lines should be shown, and I click OK. I'm getting all five case scenarios in one graph. This will make it easy for me to do the analysis. When I saw case one individually, it was showing a big spread, but when I'm doing a comparison of one next to the other, I can see that in case two I have outliers on the top and the bottom, in case three I have outliers on the bottom side, in case four I have outliers on the top side, and in case five I have outliers on both sides. The number of data points is different for each, and the boxes still get drawn; the size of the box is not determined by the number of data points. I have 45 data points but my box is very narrow, and I have 14 data points and my box is wide. So the size of the box depends on the spread, not the count. If I have 14 data points, the boxplot is going to divide my data into four parts: roughly three data points below Q1, three between Q1 and Q2, three between Q2 and Q3, and three beyond Q3. Whereas when I had 45 data points, they get distributed as roughly eleven in each part, and my median is the middle number. So the learning from this exercise is that by looking at the size of the box, you cannot determine the number of data points. But what you can definitely determine is whether your dataset has data points which are extremely high or low. The purpose of drawing a box plot is to see the distribution and identify outliers, if any. I hope the concept is clear. If you have any queries, you are free to put them up in the discussion group, and I will be happy to answer them. Thank you. 16. Pareto analysis: Hello friends. Let's continue our learning on the seven QC tools. The tool that we are going to learn today is the Pareto chart, also called Pareto analysis. This is based on the famous statistician, not statistician, let me correct myself, economist who went around the world to study the proportion of wealth with respect to the population. When he did this, Mr. Pareto found the 80-20 principle. Let's dive deep into it.
So, Pareto analysis is the principle that helps you focus on the most important matters to obtain the maximum benefit. It describes the phenomenon that a small number of high-value items contribute more to the total than a high number of low-value items. The focus is: what are those few high-value attributes that I need to focus on, instead of the many small-value items? In short, this is called identifying the vital few instead of the trivial many. Which are those red blocks, only three or four of them, whose contribution is major, instead of looking at hundreds of small things whose total contribution is minor? Even if I look at my personal expenditure, out of the total income that I make, the majority of my money goes into paying EMIs, paying the rent, and bills. Those are my vital few, instead of the trivial many, where I would be looking at bus tickets, the food I'm eating, or the small purchases that I'm making. So if I want to make good savings, I need to focus on how I can repay my EMI quicker and how I can have a rent which is within my budget. Pareto analysis is based on the famous 80-20 rule. It states that roughly 80% of the results come from 20% of the effort. Similarly, 80% of the problems or effects come from 20% of the causes; we use this for our root cause analysis. The exact percentage may vary from situation to situation. Whereas we believe it to be 80-20, even if it is 75-25, we should go ahead and pursue fixing those vital few. Sometimes we might get it as 70-30, sometimes we might even get it as 88-12; these are just some of the examples. The point is: which are those major causes which I can fix with minimum effort to get maximum results? In many cases, a few efforts are usually responsible for most of the results, and a few causes are usually responsible for most of the effects. If I relate it back to my exams, there are certain chapters in my book which carry more weightage in my final exam. If I'm thorough on those chapters, my probability of scoring 60 to 70% becomes very high. Instead of trying to read all 20 chapters in my workbook, I might focus on a few chapters to get the results. Pareto analysis is used by decision makers to identify the efforts that are most significant, in order to decide which to select first. It is used in process improvement projects to focus on the causes that contribute most to a particular problem, which helps prioritize the potential causes, factors, and key process inputs of the problem being investigated; it is part of the continuous improvement toolkit. Pareto analysis is also used when prioritizing projects, to focus on the significant projects that will bring value to the customer and to the business. Rather than doing all the projects in my project list, I would focus on those two or three major projects which can give me maximum benefit. You can also use Pareto analysis during the scoping of a project, or for prioritizing your resources, to identify the main people required for your project. We can also use Pareto analysis for visualizing data, to quickly know where the focus should be put. For example, I have a lot of defect data, like tear off, dents, and scratches. I do the analysis and get this data; if I put it in descending order of the defects, I find that tear off is the most frequent defect, followed by pinhole, then dent, and so on.
The ones which are in gray I'm not going to focus on much, because they are not contributing majorly. If I fix the tear off, I'm going to get maximum results; if I fix the first three, I'm going to get a major reduction in the defects that are happening in my process. For example, if you collect data about defect types, Pareto analysis can reveal which type of defect is most frequent, and you can focus your efforts on solving the cause that has the most effect. The benefit of Pareto analysis is that it helps you focus on what really matters. It separates the major causes of the problem from the minor ones, it allows you to measure the impact of improvement by comparing before and after, and it allows the team to reach consensus about what needs to be addressed first. The Pareto principle has been found to be true in many fields: 20% of the effort gives 80% of the results, or we can also call it 20% of the causes giving 80% of the effect. So if I'm thinking about cause-and-effect analysis, it is again 20% causes, 80% effect; if I'm looking at effort-versus-results analysis, we say put in the vital 20% of effort to get maximum results. 20% of the company's clients are responsible for 80% of its revenue, or 80% of the sales come from 20% of the clients. The Pareto principle at the office can be thought of as: 20% of the workers do 80% of the work; 20% of the time spent on a task leads to 80% of the results; 20% of the population owns 80% of the nation's wealth. Isn't it true? Even in our country, our state, our community, we find that very few people own the maximum amount of wealth. You may use 20% of the household tools 80% of the time; you may wear 20% of your clothes 80% of the time. So it's time for you to apply Pareto analysis in your personal life to clean up your wardrobe, if you believe in the concept of minimalism. 20% of car drivers cause 80% of the accidents; 80% of the customer complaints come from 20% of the customers. Just a few causes account for most of the effect on the fishbone; if I'm connecting my Pareto analysis to a fishbone diagram, you will find that there are a few causes which contribute the most. By listening to all these examples, you would have understood that Pareto is not restricted to your office or place of work; you can even apply Pareto analysis in your personal life. If I take it to Twitter or a similar social media platform, the most active 20% of Twitter users are responsible for 80% of the tweets overall. The Pareto chart is a special type of bar chart that plots the frequency of historical data, so you need to understand whether this data is as of yesterday, as of today morning, or as of last month. It works on categorical data: the x axis very clearly shows categorical data, and the y axis shows the frequency of occurrence. So Pareto analysis cannot be used for continuous data, please note. You will have categorical data with frequencies plotted in descending order, showing the major causes which take less effort to fix for maximum results. Categorical data is the lowest level of data; it results from classifying people, things, or events. I can make it simpler: anything that is described with words is categorical data. Geographical locations, weather, color, device type, blood group, bank account type (like savings or current), loan type (like FD, home, or personal loan), type of error or defect, and so on.
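Before we walk through the chart construction, here is a minimal sketch of the arithmetic behind a Pareto chart, using pandas and hypothetical defect counts; the category names echo the earlier defect example, but the numbers are invented purely for illustration.

```python
# Minimal sketch: the arithmetic behind a Pareto chart, with hypothetical
# defect counts. Sort descending, then add a cumulative-percentage column.
import pandas as pd

counts = pd.Series(
    {"Tear off": 48, "Pinhole": 21, "Dent": 14, "Scratch": 8, "Stain": 5, "Other": 4}
).sort_values(ascending=False)

pareto = counts.to_frame("frequency")
pareto["cum_pct"] = pareto["frequency"].cumsum() / pareto["frequency"].sum() * 100
print(pareto)

# The 'vital few' are the categories needed to reach roughly 80% of the
# cumulative total; focus improvement effort there.
vital_few = pareto[pareto["cum_pct"].shift(fill_value=0) < 80]
print("Vital few:", list(vital_few.index))
```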
In a Pareto chart, the vertical axis represents the frequency of the categorical data, and the horizontal axis represents the categories or labels, that is, the categorical data that causes the problems or issues. The bars are arranged in descending order from left to right: the most frequently occurring category is on the left side, and the least frequently occurring one is on the right side. You do not have to worry if you have Microsoft Excel; it will draw it for you. If you are using an older version of Excel, I will share a template in the project and resource section below. If you have too many categories, you may group the small, infrequent categories into a category called "others"; this last bar may then be a little higher than the previous ones. You may optionally put a cumulative frequency curve above the bars, giving it a secondary y axis to represent the cumulative percentage. This simply helps in interpreting the results more easily and in identifying the 80-20 connection. Pareto analysis focuses your efforts on those categories whose vertical bars account for about 80% of the results; you should look for the major causes with maximum effect, where the least effort gets the maximum results. If you look at the two Pareto patterns, A and B, which one is the best illustration of the Pareto pattern? I would suggest it is pattern A, because pattern B shows that most of the categories contribute almost equally; that is a uniform distribution, so I would not go with it. I would go with pattern A. If the resulting chart clearly illustrates a Pareto pattern, this suggests that only a few causes account for about 80% of the problem. This means that there is a Pareto effect, and you can focus your effort on tackling these few causes to get maximum results. If you had received a pattern like graph B, then Pareto analysis will not work, and we will have to use some other QC tool; if no Pareto pattern is found, we cannot say that some causes are more important than others, as I just said. Make sure that your Pareto chart contains enough data points to make it meaningful; in today's world, there is a lot of data available, so please ensure that you are capturing as much data as possible. How to construct a Pareto chart: with your team, define the problem that you are trying to solve, identify the possible causes using brainstorming or similar techniques, and decide the measurement method to be used for comparison, such as frequency, cost, or time. Then collect the data, that is, acquire the categorical data to be analyzed, and calculate the frequency of each category. Draw a horizontal line and place vertical bars to indicate the frequency of each category, and draw a vertical line on the left to place the frequency scale, in case you are drawing it on graph paper. Microsoft Excel can do a Pareto chart automatically, but if you are doing it manually, sort the categories in order of frequency of occurrence from the largest to the smallest, with the largest on the left side. You should also calculate your cumulative frequency curve and the cumulative percentage line. If you observe the Pareto effect, focus your improvement effort on those few categories whose vertical bars account for the most; these causes are likely to have the greatest impact on your process output. I have taken a sample Pareto to analyze the reasons why patients use the call bell in a hospital when admitted.
The reasons are: need toilet assistance, need food or water, repositioning of the bed, intravenous problems, pain medication, urgent call back to bed, and others. All the ones which are in gray are not frequently happening and are not as important. So if we focus on the first three or four: the first four factors take about 40% of the effort, and you are going to get about 70% of the effect. I might decide to work on just the first three, that is, 30% of the effort, to still get around 68% of the effect. Either is fine; the concept is that I need to put in less effort to get maximum results. Customer complaints in a factory: a factory team has conducted a Pareto analysis to address the rising number of complaints from the customers' perspective, presented in a way management can understand. The types of customer complaints are product complaints, document-related complaints, packaging-related complaints, and delivery-related complaints. We can see that customers are most frequently complaining about the product or a defect with the product, followed by document-related issues. The main categories may be too generic and can be divided into subcategories. So if I think about product complaints at a high level, I might break them into sub-components such as scratch, dent, pinhole, tear off, and others. You will be able to apply Pareto again on the product complaints as well: if you fix the scratch and dent related issues, the majority of the product complaints will come down. For the document complaints, we can see that missing information is the major contributor, followed by invoice errors, wrong quantity, and others. The Pareto chart can be analyzed further by dividing the main categories into subcategories, or sub-components, to find where the specific problem occurs most often. For the customer complaints in the factory, the results suggest that there are three subcategories that occur most often. Note that it is possible to merge two charts into one; I have the type of product complaints and the type of document complaints, and I can go ahead and merge them. The Pareto principle is named after the Italian economist Vilfredo Pareto, and Joseph Juran applied the Pareto principle to quality management for business production. In your analysis, consider using contextual data, metadata, and the columns that contain textual data. Databases often contain a lot of categorical data about the environment from which the data is taken; this data can be very useful in later analysis when investigating root cause concepts and ideas. The Pareto principle can help you measure the impact of improvement by comparing the before versus the after. If the blue bar was a major contributor, after the project you may find that there is a major improvement in that category; the new Pareto chart can show that there is a major reduction in the primary cause. Statistically, the Pareto principle can be described by the power law distribution, and many natural phenomena exhibit this distribution. With that, I come to the end of the concept of Pareto analysis. In the next video, I'm going to show you how I do Pareto analysis using Microsoft Excel. See you in the next class. 17. Concept hypothesis testing and statistical significance: Let's break down the concepts related to hypothesis testing and statistical significance.
One, hypothesis testing. When conducting a hypothesis test, we start with a research hypothesis, also called the alternative hypothesis. In our case, the research hypothesis is that the drug has an effect on blood pressure. However, we cannot directly test this hypothesis using a classical hypothesis test. Instead, we test the opposite hypothesis: that the drug has no effect on blood pressure. We start by assuming that, on average, people who take the drug and people who don't take the drug have the same blood pressure in the population. If we observe a large effect of the drug in a sample, we then ask: how likely is it to draw such a sample, or one even more extreme, if the drug actually has no effect? Two, the p value. The probability of getting such a sample, assuming the null hypothesis of no effect, is called the p value. The p value indicates the probability of obtaining a sample that deviates as much as our observed sample, or even more extremely, if the null hypothesis were true. If the p value is very low, typically less than 0.05, we have evidence to reject the null hypothesis in favor of the alternative hypothesis. A small p value suggests that the observed data or sample is inconsistent with the null hypothesis. Three, statistical significance. When the p value is less than a predetermined threshold, often 0.05, the result is considered statistically significant. This means that the observed result is unlikely to have occurred by chance alone, and we have enough evidence to reject the null hypothesis. The p value threshold is commonly set at 5%, or 0.05. A small p value suggests that the observed data or sample is inconsistent with the null hypothesis; conversely, a large p value suggests that the observed data is consistent with the null hypothesis, and we do not reject it. Four, errors in hypothesis testing. Remember that a small p value doesn't prove the alternative hypothesis is true; it only suggests that the observed result is unlikely under the null hypothesis. Similarly, a large p value doesn't prove the null hypothesis is true; it only suggests that the observed result is likely under the null hypothesis. Let us now understand the two types of errors: the type one error and the type two error. A type one error occurs when we mistakenly reject a true null hypothesis; in our example, this would mean concluding that the drug works when it actually doesn't. A type one error is when you reject the null hypothesis when in reality the null hypothesis is true. A type two error occurs when we fail to reject a false null hypothesis, that is, when in reality the null hypothesis is false but our decision is to not reject it. In our example, this would mean missing the fact that the drug works: the sample taken did not show much difference, and we mistakenly concluded that the drug is not working. In the next lesson, we will dive deeper into practical applications of design of experiments. Stay tuned. 18. Understand Test of Hypothesis: Hello friends. Let us continue our journey on Minitab data analysis. Today we are going to learn about hypothesis testing. You might have heard that we do hypothesis testing during the Analyze and Improve phases of our project. To understand how a hypothesis test works, let us understand a simple case scenario. I will come back to this graph again and explain it.
As you know, when we go to a court of law, the justice system can be used to explain the concept of hypothesis testing. The judge always starts with the statement that the person is assumed to be innocent until proven guilty. This is nothing but your null hypothesis, the status quo. As the court case goes on, the lawyers try to produce data and evidence, and unless there is strong data and strong evidence, the person stays in the status of being innocent. The prosecuting lawyer is always trying to say that this person is guilty and that he has data and evidence to prove it; he is trying to establish the alternate hypothesis. And the judge says, I'm going with the status quo of the null hypothesis by default. Let me explain it in an easier way. You and I are not taken to the court of law, because by default we are all innocent; that is the status quo. Who is pulled to the court of law? People who have a chance of having committed some crime; it could be anything. In the same way, what do we do hypothesis testing on in the Analyze phase of a project? I have multiple causes which might be contributing to my project Y. We do a root cause analysis and we get to know that, okay, maybe the shipment got delayed, maybe the machine is a problem, maybe the measurement system is a problem, maybe the raw material is not of good quality. We have multiple candidate reasons. Now I want to prove them using data, and that is where I use hypothesis testing. All processes have variation; we know all processes follow the bell curve, we are never exactly at the center, and there is some bit of variation in every process. Now, the data or the sample which you obtained: is it a random sample coming from the same bell curve, or is it a sample coming from an entirely different bell curve? Hypothesis testing will help you in analyzing exactly that. Whenever we set up a hypothesis test, we have two types of hypotheses, as I told you: the status quo or default hypothesis, which is your null hypothesis, and by default we assume that the null hypothesis is true, so to reject the null hypothesis we need to produce evidence. The alternate hypothesis is the claim that there is a difference, and it is the reason why the hypothesis test has actually been initiated. We will understand this with lots of examples, so stay connected. When I'm framing the null and alternate hypotheses, let's say I state that mu, which is nothing but my population average, is equal to some value. Always remember, your alternate hypothesis is mutually exclusive: if the null says mu is equal to some value, the alternate hypothesis would say mu is not equal to that value. I may also state the null hypothesis as mu being less than or equal to some value. For example, if I'm selling Domino's Pizza, I say my average delivery time is less than or equal to 30 minutes. The customer comes and tells me, no, the average delivery time is more than 30 minutes; that becomes my alternate. Sometimes the null hypothesis is that mu is greater than or equal to some value. For example, my average quality is greater than or equal to 90%; then the customer comes back and tells me, no, your average quality is less than that percentage. So always remember, the null hypothesis and the alternate hypothesis are mutually exclusive and complementary to each other. We will take up many more examples as we go further.
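As a small illustration of the pizza-delivery example, here is a hedged sketch, with invented delivery times, of how the one-sided test of H0: mu <= 30 against H1: mu > 30 could be run in Python with SciPy; the course itself uses Minitab and Excel for this, so treat the code as an aside.

```python
# Minimal sketch (hypothetical delivery times): H0: mu <= 30 minutes
# versus H1: mu > 30 minutes, as in the pizza-delivery example.
from scipy import stats

delivery_minutes = [29, 34, 31, 36, 28, 33, 35, 30, 37, 32]   # invented sample

# One-sample t test against the claimed mean of 30 minutes, one-sided.
result = stats.ttest_1samp(delivery_minutes, popmean=30, alternative="greater")
print(f"t = {result.statistic:.2f}, one-sided p = {result.pvalue:.4f}")

# If p < 0.05 we reject H0 and conclude the average delivery time exceeds
# 30 minutes; otherwise we fail to reject H0.
```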
19. Null and alternate Hypothesis concept: Let's dive into inferential statistics. We'll start with a brief overview of what it is, followed by an explanation of the six key components. So what is inferential statistics? It enables us to draw conclusions about a population based on data from a sample. To clarify, the population is the entire group we are interested in. For instance, if we want to study the average height of all adults in the United States, our population includes all adults in the country. The sample, on the other hand, is a smaller subset taken from that population. For example, if we select 150 adults from the US, we can use this sample to make inferences about the broader population. Now, here are the six steps involved in this process. Hypothesis: we start with a hypothesis, which is a statement we aim to test. For example, we might want to investigate whether a drug positively impacts blood pressure in individuals with hypertension. In this case, our population consists of all individuals with high blood pressure in the US. Since it's impractical to gather data from the entire population, we rely on a sample to make inferences about the population. Using our sample, we employ hypothesis testing, a method used to evaluate a claim about a population parameter based on sample data. There are various hypothesis tests available, and by the end of this video, I'll guide you on how to choose the right one. How does hypothesis testing work? We begin with a research hypothesis, also known as the alternative hypothesis, which is what we are seeking evidence for in our study. In our case, the hypothesis is that the drug affects blood pressure. However, we cannot directly test this with a classical hypothesis test, so we test the opposite hypothesis: that the drug has no effect on blood pressure. Here's the process. One, assume the null hypothesis: we assume the drug has no effect, meaning that people who take the drug and those who don't have the same average blood pressure. Two, collect and analyze sample data: we take a random sample. If the drug shows a large effect in the sample, we then determine the likelihood of drawing such a sample, or one that deviates even more, if the drug actually has no effect. Three, evaluate the probability, the p value: if the probability of observing such a result under the null hypothesis is very low, we consider the possibility that the drug does have an effect, and if we have enough evidence, we can reject the null hypothesis. The p value is the probability that measures the strength of the evidence against the null hypothesis. In summary, the null hypothesis states there is no difference in the population, and the hypothesis test calculates how likely it is to observe the sample results if the null hypothesis is true. To recap once more: we want to find evidence for our research hypothesis, that the drug affects blood pressure. However, we can't directly test this, so we test the opposite, the null hypothesis: the drug has no effect on blood pressure. Here's how it works. Assume the null hypothesis: assume the drug has no effect, meaning people who take the drug and those who don't have the same average blood pressure. Collect and analyze data: take a random sample. If the drug shows a large effect in the sample, we determine how likely it is to get such a result, or a more extreme one,
if the drug truly has no effect. Calculate the p value: the p value is the probability of observing a sample as extreme as ours, assuming the null hypothesis is true. Statistical significance: if the p value is less than a set threshold, usually 0.05, the result is statistically significant, meaning it's unlikely to have occurred by chance alone; we then have enough evidence to reject the null hypothesis. A small p value suggests the observed data is inconsistent with the null hypothesis, leading us to reject it in favor of the alternative hypothesis. A large p value suggests the data is consistent with the null hypothesis, and we do not reject it. Important points: a small p value does not prove the alternative hypothesis is true; it just indicates that such a result is unlikely if the null hypothesis is true. Similarly, a large p value does not prove the null hypothesis is true; it suggests the observed data is likely under the null hypothesis. Thank you. I will see you in the next lesson of statistics. 20. Statistics Understanding P value: What is the p value and how is it interpreted? That's what we will discuss in this video. Let's start with an example. We would like to investigate whether there is a difference in height between the average American man and the average American basketball player. The average man is 1.77 meters tall, so we want to know if the average basketball player is also 1.77 meters tall. Thus, we state the null hypothesis: the average height of an American basketball player is 1.77 meters. We assume that in the population of American basketball players, the average height is 1.77 meters. However, since we cannot survey the entire population, we draw a sample. Of course, this sample will not yield an exact mean of 1.77 meters; that would be very unlikely. It may be that the sample drawn, purely by chance, deviates by 3 centimeters, by 8 centimeters, by 15 centimeters, or by any other value. Since we are testing an undirected hypothesis, that is, we only want to know if there is a difference, we do not care in which direction the difference goes. Now we come to the p value. As mentioned, we assume that in the population there is a mean value of 1.77 meters. If we draw a sample, it will differ from the population by a certain value. The p value tells us how likely it is to draw a sample that deviates from the population by an equal or greater amount than the observed value. Let's take a closer look. We have a sample that is different from the population, and we are interested in how likely it is to draw a sample that deviates as much as our sample, or more, from the population. Thus, the p value indicates how likely it is to draw a sample whose mean is in this range. For example, if by chance the sample deviates by 3 centimeters from 1.77 meters, the p value tells us how likely it is to draw a sample that deviates 3 centimeters or more from the population. If by chance the sample deviates by 9 centimeters, the p value tells us how likely it is to draw a sample that deviates 9 centimeters or more from the population. Let's take an example where we get a difference of 9 centimeters, and our favorite statistics software, like Minitab, calculates a p value of 0.03, that is, 3%. This tells us that it is only 3% likely to draw a sample that is 9 centimeters or more different from the population mean of 1.77 meters. For normally distributed data,
this means the probability that the sample mean lies that far out is 1.5% in one direction and 1.5% in the other direction, totaling 3%. If this probability is very low, one can of course ask whether the sample comes from a population with a mean of 1.77 meters at all. It is just a hypothesis that the mean height of basketball players is 1.77 meters, and it is precisely this hypothesis that we want to test. Therefore, if we calculate a very small p value, this gives us evidence that the mean of the population is not 1.77 meters at all. Thus, we would reject the null hypothesis, which assumes that the mean is 1.77 meters. But at what point is the p value small enough to reject the null hypothesis? This is determined with the so-called significance level, also called the alpha level. There are two important things to note here. One, the significance level is always determined prior to the study and cannot be changed afterwards in order to obtain the desired results. Two, to ensure a certain degree of comparability, the significance level is usually set at 5% or 1%. A p value of less than 1% is considered highly significant, less than 5% is called significant, and greater than 5% is called not significant. In summary, the p value gives us an indication of whether or not we reject the null hypothesis. As a reminder, the null hypothesis assumes that there is no difference, while the alternative hypothesis assumes that there is a difference. In general, the null hypothesis is rejected if the p value is smaller than 0.05. It is always only a probability, and we can be wrong with our statement. If the null hypothesis is true in the population, that is, the mean is 1.77 meters, but we draw a sample that happens to be quite far away, it might be that the p value is smaller than 0.05 and we wrongly reject the null hypothesis; this is called a type one error. If in the population the null hypothesis is false, that is, the mean is not 1.77 meters, but we draw a sample that happens to be very close to 1.77 meters, the p value may be larger than 0.05 and we may not reject the null hypothesis; this is called a type two error. Thank you for learning with me. I will see you in the next lesson of statistics. 21. Understand Types of Errors: Let us understand some more examples of null and alternate hypotheses. Suppose my project is about schedule adherence; my null hypothesis is a fixed value, so I would say the current average schedule adherence is 70%. The alternate hypothesis would mean that it is not 70%. Suppose I'm thinking about the moisture content in a manufacturing setup, and a moisture content of at most 5% is what is acceptable to my customer. Then I can say my null hypothesis is that the moisture content is less than or equal to five percent, and the alternate hypothesis would claim that the moisture content is greater than five percent. The case where the mean is on the acceptable side is not the problem we are interested in. Let's understand it further. The question was: did a recent redesign of the small business loan approval process reduce the average cycle time for processing the loan? The answer could be no, the mean cycle time did not change.
Or the manager may say that yes, the mean cycle time is now lower. So the status quo, the null hypothesis, is that the mean cycle time is equal to 7.514, and the alternate says, no, it is less than 7.514, in minutes or days, whatever the unit of measurement we are using. So by default, your status quo is your null hypothesis, and the statement that you want to prove is your alternate hypothesis. Now, there could be errors when we make decisions, so let's go back to our court case. If in reality the defendant is not guilty and the verdict also comes that the person is not guilty, it's a good decision: the person is correctly declared innocent. If in reality the defendant is guilty and the verdict also comes that he's guilty, the decision is again a good decision. What happens when, in reality, the person is not guilty, but the verdict comes that he's guilty? An innocent person gets convicted. It's an error, and a very big error: an innocent person is given a sentence, put in jail, or given a penalty. The error can also happen on the other side, where in reality the person is guilty, but the verdict comes that he's not guilty; a guilty person is declared innocent and is set free. This is also an error, but which is the bigger error? You can write down in the comment box what you think: is an innocent person getting convicted the bigger error, or is a guilty person moving around free the bigger error? I hope you have already written your comment. The reality is that convicting an innocent person is the bigger error, and this is called a type one error, because if an innocent person is convicted, we cannot give back the time that he has lost, and he would go through a lot of emotional trauma. If a guilty person is declared innocent, we can take the case to a higher court or the Supreme Court to prove that he is guilty, so that he is convicted, declared guilty, and punished. That error is called a type two error. So if somebody asks you which error is the bigger error, it is the type one error, which is also called the alpha error, and the other one is called the beta error. Let's continue in our next class. 22. Understand Types of Errors-part2: Let us understand the types of errors once again. As we know, if the person is not guilty, that is, the person is innocent, and the verdict also says that the person is not guilty, it's a good decision. If the person is guilty and the verdict is that he's guilty, the decision is again a good decision; the convict has to be sentenced and punished. The problem happens when an innocent person is proved guilty and he suffers. The second type of problem happens when a guilty person, a criminal, is declared innocent and is set free. The first one is called a type one error; that is, an innocent person getting convicted or punished is a type one error, also called an alpha error. A guilty person, a criminal, being set free is called a type two error, or a beta error, which is also an error we want to avoid. The level of significance is set by the alpha value: how confident do you want to be in making the right decision?
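To see what "the level of significance is set by the alpha value" means in practice, here is a small simulation sketch, not from the course materials and written in Python: when the null hypothesis is actually true, roughly alpha (5%) of random samples will still be rejected purely by chance, which is exactly the type one error rate described next.

```python
# Minimal sketch: when the null hypothesis is TRUE, about alpha (5%) of
# random samples still get rejected by chance. That is the type I error rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
false_rejections = 0
n_experiments = 10_000

for _ in range(n_experiments):
    # The population mean really is 30, so H0: mu = 30 is true.
    sample = rng.normal(loc=30, scale=5, size=25)
    if stats.ttest_1samp(sample, popmean=30).pvalue < alpha:
        false_rejections += 1

print(f"Observed type I error rate ~ {false_rejections / n_experiments:.3f}")  # close to 0.05
```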
A type one error happens when the null is true, but we reject it. A type two error happens when in reality the null is false, but we fail to reject it. Now, how does this help our process? Let us understand this with an everyday process example. Let's write the actual situation on one axis and the judgment on the other. Think about a process: the null is that the process has not changed, and the alternate is that the process has changed. The judgment can be that the process has not improved, or the judgment can be that the process has improved. Now, I'm going to ask you a very important question. If a process has not changed and the judgment is that there is no change, this is the correct decision. If the process has changed and the judgment is also that the process has improved, that's also a correct decision. Now, imagine the process has not changed, but we declare that we have an improved process and an improved product and we inform the customer. Is that correct? It is an error, and it is called a type one error, because the same old product is sold to the customer as a new, improved product. Can you imagine what will happen to the reputation of the company? It will go for a toss, and hence we say this is not a good decision. Now, the other case: the process has actually changed, the process has improved, but the judgment comes out as not improved. This is also an error, I don't deny it; this is called a type two error, or a beta error. Here, what happens is that we are not communicating to the customer that the improvement has happened; we are keeping the improved product in the warehouse. This is also not correct, but the bigger error is the first one, where we have actually not made an improvement but I'm informing the customer that we have. 23. Remember-the-Jingle: When we do a test of hypothesis, there are always two hypotheses. One is the default hypothesis, which is the null hypothesis, and the second is the alternate hypothesis, which you want to prove, and that's the reason you are doing the test. The reason we do hypothesis testing is that we never have access to the whole population. So when we collect a sample, we want to understand: is the sample coming from the bell curve, the distribution, we started with, and is whatever variation we see just the natural property of the dataset? Sometimes your sample could be at the far corner of the bell curve, and that is where the confusion comes from: does this data belong to the original bell curve, or does it belong to a second, alternate bell curve? We will be doing exercises which will give you an understanding of this in an easier way. When you do a hypothesis test, apart from the test statistic results you also get the p value. We always compare the p value with the alpha value that we have set. Suppose you want to be 95% confident; then you set the alpha value at 5%. If you set the confidence level at 90%, then your alpha value is ten percent, or 0.10. The reason we use the p value is that, if you look at this bell curve, the most likely observations come from the center of the bell, and very unlikely observations come from the tails.
This p value, the shaded region, helps you tell whether the sample belongs to the original bell curve or to the alternate bell curve that you are trying to prove through the alternate hypothesis. Hence, the p value comes as a help for you. To easily remember this, remember the jingle: P low, null go. It means that if the p value is less than the alpha value, I'm going to reject the null hypothesis. P high, null fly: if the p value is more than the alpha value, we fail to reject the null hypothesis, concluding that we do not have enough statistical evidence for the alternate hypothesis. We will be doing a lot of exercises and I will be singing this jingle multiple times so that it's easy for you to remember: P low, null go; P high, null fly. Some of the participants get confused when I do the workshop; they will ask, what does "null go" mean? The other thing which I tell them, to remember it easily, is F for fly and F for fail. So if P is high, the null will fly; it means you are failing to reject the null hypothesis, the null hypothesis will stand, and the alternate hypothesis will get rejected. Remember one more thing which is often asked during interviews: the p value came out as 0.123; would you reject the null hypothesis, would you accept the null hypothesis, or would you accept the alternate hypothesis? As statisticians, we never accept any hypothesis: either we reject the null hypothesis or we fail to reject the null hypothesis. We always state it from the point of view of the null, because the default status quo is the null hypothesis. If the p is high, we do not say we accept the null hypothesis; we say we fail to reject the null hypothesis. If the p is low, we do not say we accept the alternate; we say we reject the null hypothesis, concluding that there is enough statistical evidence that the data is coming from the alternate bell curve. We will continue with a lot of exercises, and this will give you confidence about how to practice, interpret, and use inferential statistics in your analysis. 24. Test Selection: One of the most common questions which my participants ask when I'm mentoring a project is: which hypothesis test should I use? So this is a simple analysis which will help you understand which test you should be using. Just like the way when a patient goes to a doctor, the doctor does not prescribe all the tests; he just prescribes the appropriate test based on the problem that the patient is facing. If the patient says, I met with an accident, the doctor would say, I think you should get your X-ray done; he would not ask him to go for a COVID test or an RT-PCR test. If the person is coughing and suffering from fever, then an RT-PCR is suggested, and at that point of time an X-ray is not needed. In a similar way, when we do hypothesis testing, we are trying to understand the sample or compare it with the population, and we want to know which test we should be performing. If I'm testing for means, that is, the average, and I am comparing the mean of a sample with an expected value, then I'm comparing the sample with my population, and I go for my one-sample t test. I have only one sample that I'm comparing: I want to check if the average sales are equal to some amount, which is the expected value. So we were expecting the sales to be, let's say, 5 million, and my sample average comes to, say, 4.8 million.
Have I met that target or not? To answer that, I can go and do a one-sample t-test. Next, comparing the means of two different samples: suppose I have two independent teams. Say I am conducting a training online and also conducting the same training offline. I am the trainer; I have one set of students attending my online program and a different set of students attending my offline program, and I want to compare the effectiveness of the training. I have two samples, and these are two independent samples because the participants are different, so I go for a two-sample t-test. Now suppose I want to compare two related samples. People come for my training; I give them an assessment before the training program about their understanding of Lean Six Sigma, I conduct the training, and the same set of participants take the test after the training program. The participants are the same; the change that has happened is the training. I have the test results before the training and the test results after the training, and I want to check whether the training is effective. Then I go for a paired t-test. Progressing further, suppose I am testing for frequencies. I have discrete data, and I test frequencies because with discrete data I do not have averages; I take counts. When I am comparing the count of some variable in one sample to an expected distribution, just as the one-sample t-test does for averages, the equivalent for discrete data is the chi-square goodness-of-fit test: I expect a particular value by default, and I am comparing how far my data is from that expectation. This test is available in Minitab; in Excel it is not built in, so I will be creating a template and giving it to you, which will make it easy to run all three types of chi-square tests using the Excel template. If I am comparing counts of a variable between two samples, it is the chi-square test of homogeneity. If I am checking a single sample to see whether two discrete variables are independent, I do the chi-square test of independence. Finally, if I have proportion data, like good versus bad applications, or accepted versus rejected, and I am saying that 50% of the applications get accepted, or 25% of the people get placed, I have a proportion that I want to test. If I have only one sample, I go for a one-proportion test. If I want to compare the proportion of commerce graduates versus science graduates, or the proportion of finance MBAs versus marketing MBAs, I have two different samples, so I go for a two-proportion test. So to summarize: when I am testing, am I testing for averages, am I testing for frequencies (discrete data), or am I testing for proportions? Depending on that, you pick the appropriate test and work on it. We are going to practice all of it using Minitab and Excel. The dataset is available in the description section and in the project section; I invite you all to practice it and put your projects and analysis in the project section. If you have any doubts, you can put them in the discussion section and I will be happy to answer them. Happy learning. 25. Concepts of T Test in detail: What does this video teach you about the t test? This video covers everything you need to know about the t test.
By the end of this video, you will understand what a t test is, when to use it, the different types of t tests, the hypotheses and assumptions involved, how a t test is calculated, and how to interpret the results. What is a t test? Let's start with the basics. A t test is a statistical test procedure that analyzes whether there is a significant difference between the means of two groups. For instance, we might compare the blood pressure of patients who receive drug A versus drug B. Types of t tests: there are three main types, the one-sample t test, the independent-samples t test (or two-sample t test), and the paired-samples t test. What is a one-sample t test? We use a one-sample t test when we want to compare the mean of a sample with a known reference mean. For example, a chocolate bar manufacturer claims their bars weigh an average of 50 grams. We take a sample and find its mean weight; assume the sample mean is 48 grams, and we use a one-sample t test to see if it significantly differs from the claimed 50 grams. What is an independent-samples t test? The independent-samples t test compares the means of two independent groups or samples. For instance, we might compare the effectiveness of two painkillers by randomly assigning 60 people to two groups, one receiving drug A and the other drug B, and then use an independent t test to evaluate any significant difference in pain relief. What is a paired-samples t test? The paired-samples t test compares the means of two dependent groups. For example, to assess the effectiveness of a diet, we might weigh 30 people before and after the diet and use a paired-samples t test to determine whether there is a significant difference in weight before and after. Understanding the difference between dependent and independent samples is crucial in choosing the right type of t test for your analysis. Dependent samples, or paired samples, refer to cases where each observation in one sample is paired with a specific observation in the other sample; this pairing arises from the nature of the data collection, such as before-and-after measurements on the same individuals, or matched pairs in an experiment. The paired-samples t test is used to assess whether the mean difference between these paired observations is statistically significant. On the other hand, independent samples are observations drawn from two separate groups or populations that are not related or paired in any systematic way; each observation in one sample is entirely independent of every observation in the other sample. The independent-samples t test evaluates whether the means of these two independent groups differ significantly from each other. Choosing between these types of t tests depends on how the data were collected and on the relationship between the samples being compared. Using the correct t test ensures that your statistical analysis accurately reflects the nature of your research question and the structure of your data. Here is an interesting note: the paired-samples t test is very similar to the one-sample t test. We can think of the paired-samples t test as having one sample that was measured at two different times. We then calculate the difference between the paired values, giving us a single column of differences (for example, the first pair minus the second value of that pair, and so on for each pair shown on the slide). Now we want to test whether the mean of the differences we just calculated deviates from a reference value.
In this case the reference value is zero, and this is exactly what the one-sample t test does. What are the assumptions for a t test? Of course, we first need a suitable sample: in the one-sample t test, a sample and a reference value; in the independent t test, two independent samples; and in the paired t test, a paired sample. The variable for which we want to test whether there is a difference between the means must be metric. Examples of metric variables are age, body weight, and income; a person's level of education, for example, is not a metric variable. In addition, the metric variable must be normally distributed in all three test variants (how to test whether your data is normally distributed is covered separately). In the case of an independent t test, the variances in the two groups must be approximately equal; you can check whether the variances are equal using Levene's test. What are the hypotheses of the t test? Let's start with the one-sample t test. In the one-sample t test, the null hypothesis is that the sample mean is equal to the given reference value, so there is no difference, and the alternative hypothesis is that the sample mean is not equal to the given reference value. What about the independent-samples t test? In the independent t test, the null hypothesis is that the mean values in both groups are the same, so there is no difference between the two groups, and the alternative hypothesis is that the mean values in the two groups are not equal, so there is a difference. And finally, the paired-samples t test: in a paired t test, the null hypothesis is that the mean of the differences between the pairs is zero, and the alternative hypothesis is that the mean of the differences between the pairs is not zero. Now we know what the hypotheses are. Before we look at how the t test is calculated, let us look at an example of why we actually need a t test. Let's say we want to know whether there is a difference in the length of study for a bachelor's degree between men and women in Germany. Our population is therefore made up of all graduates of a bachelor's degree who have studied in Germany. However, as we cannot survey all bachelor graduates, we draw a sample that is as representative as possible. We now use the test to test the null hypothesis that there is no difference in the population. Even if there is no difference in the population, we will almost certainly still see some difference in study duration in the sample; it would be very unlikely to draw a sample where the difference is exactly zero. In simple terms, we want to know how large a difference measured in a sample must be before we can say that the study duration of men and women is significantly different, and this is exactly what the t test answers. But how do we calculate a t test? To do this, we first calculate the t value. To calculate the t value we need two things: the difference between the means, and the standard deviation of the mean, which is also known as the standard error. In the one-sample t test, we calculate the difference between the sample mean and the known reference mean; s is the standard deviation of the collected data and n is the number of cases, so s divided by the square root of n is the standard deviation of the mean, which is the standard error. A minimal sketch of this one-sample calculation is shown below.
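As a rough illustration of the one-sample formula just described (a sketch only, not the course's own material), the t value is the difference between the sample mean and the reference mean divided by the standard error s/√n. The chocolate bar weights below are made up.

```python
import math

def one_sample_t(data, reference_mean):
    """t = (sample mean - reference mean) / (s / sqrt(n))."""
    n = len(data)
    mean = sum(data) / n
    # Sample standard deviation (n - 1 in the denominator)
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    standard_error = s / math.sqrt(n)
    return (mean - reference_mean) / standard_error

# Chocolate bar example from earlier: do these weights differ from 50 g?
weights = [48.2, 49.1, 47.5, 50.3, 48.8, 47.9, 49.5, 48.0]  # made-up values
print(one_sample_t(weights, 50))
```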
In the independent-samples t test, we simply calculate the difference between the two sample means. To calculate the standard error, we need the standard deviation and the number of cases from the first and the second sample; depending on whether we can assume equal or unequal variances for our data, there are different formulas for the standard error. In a paired-samples t test, we only need to calculate the differences between the paired values and take the mean of those differences; the standard error is then the same as for a one-sample t test. What have we learned so far about the t value? No matter which t test we calculate, the t value will be larger if there is a greater difference between the means, and smaller if the difference between the means is smaller. Further, the t value becomes smaller when there is a larger dispersion around the mean: the more scattered the data, the less meaningful a given mean difference is. Now we want to use the t test to see whether we can reject the null hypothesis or not. To do this, we can use the t value in two ways: either we read the critical t value from a table, or we simply calculate the p value from the t value. We will go through both in a moment. But what is the p value? A t test always tests the null hypothesis that there is no difference. First, we assume that there is no difference in the population. When we draw a sample, this sample deviates from the null hypothesis by a certain amount. The p value tells us how likely it is that we would draw a sample that deviates from the population by the same amount or more than the sample we drew. Thus, the more the sample deviates from the null hypothesis, the smaller the p value becomes. If this probability is very, very small, we can of course ask whether the null hypothesis really holds for the population; perhaps there is a difference. But at what point can we reject the null hypothesis? This border is called the significance level, which is usually set at 5%. If there is only a 5% chance that we would draw such a sample, or one that deviates even more, then we have enough evidence to reject the null hypothesis; in simple terms, we assume that there is a difference, that is, that the alternative hypothesis is true. Now that we know what the p value is, we can finally look at how the t value is used to determine whether or not the null hypothesis is rejected. Let's start with the path through the critical t value, which you can read from a table. To do this, we first need a table of critical t values, which we can find on DataTab under Tutorials and t-distribution. Let's start with the two-tailed case; we will briefly look at the one-tailed case at the end of this video. Below we see the table. First, we need to decide what level of significance we want to use. Let's choose a significance level of 0.05, that is, 5%. Then we look in the column for 1 minus 0.05, which is 0.95. Now we need the degrees of freedom. In the one-sample t test and the paired-samples t test, the degrees of freedom are simply the number of cases minus one: if we have a sample of ten people, there are nine degrees of freedom. In the independent-samples t test, we add the number of people from both samples and subtract two, because we have two samples. Note that the degrees of freedom can be determined in a different way depending on whether we assume equal or unequal variances. So if we have a 5% significance level and nine degrees of freedom, we get a critical t value of 2.262. Now, on the one hand we have the t value calculated with the t test, and on the other hand we have the critical t value; the same lookup can also be scripted instead of read from a table, as sketched below.
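If you prefer not to read the table, the same critical value and the two-tailed p value can be obtained directly from the t distribution. Here is a sketch using scipy (assuming it is installed), reproducing the numbers quoted in this lesson.

```python
from scipy import stats

alpha, df = 0.05, 9          # 5% significance, sample of 10 -> 9 degrees of freedom

# Two-tailed critical t value: the 97.5th percentile of the t distribution
t_crit = stats.t.ppf(1 - alpha / 2, df)
print(round(t_crit, 3))      # about 2.262

# Two-tailed p value for a calculated t value of 2.5
t_value = 2.5
p_value = 2 * stats.t.sf(abs(t_value), df)
print(round(p_value, 3))     # about 0.034
```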
If our calculated t value is greater than the critical t value, we reject the null hypothesis. For example, suppose we calculate a t value of 2.5. This value is greater than 2.262, and therefore the two means are different enough that we can reject the null hypothesis. On the other hand, we can also calculate the p value for the t value we have computed. If we enter 2.5 for the t value and nine for the degrees of freedom, we get a p value of 0.034. The p value is less than 0.05, and we therefore reject the null hypothesis. As a check, if we enter the t value of 2.262, we get a p value of exactly 0.05, which is exactly the limit. If you want to calculate a t test with DataTab, you just copy your own data into the table, click on hypothesis test, and then select the variables of interest. For example, if you want to test whether gender has an effect on income, you simply click on the two variables and automatically get a t test calculated for independent samples; below, you can read the p value. If you are still unsure about the interpretation of the results, you can simply click on "interpretation in words": a two-tailed t test for independent samples, equal variances assumed, showed that the difference between female and male with respect to the dependent variable salary was not statistically significant; thus, the null hypothesis is retained. The final question now is: what is the difference between a directed and an undirected hypothesis? In the undirected case, the alternative hypothesis is simply that there is a difference, for example, that there is a difference between the salary of men and women in Germany; we do not care who earns more, we just want to know whether there is a difference or not. In a directed hypothesis, we are also interested in the direction of the difference; for example, the alternative hypothesis might be that men earn more than women, or that women earn more than men. If we look at the t distribution graphically, we can see that in the two-sided case we have a rejection range on the left and a rejection range on the right; we reject the null hypothesis if we land in either of them, and with a 5% significance level the two ranges have a probability of 2.5% each, 5% together. If we do a one-tailed t test, the null hypothesis is rejected only if we land in the single range on the side we are testing, and with a 5% significance level the whole 5% falls within that range. Thank you for learning with me. I will see you in the next lesson of statistics. 26. Understand 1 sample t test: Let us understand which hypothesis test we should use. In Minitab, you have an Assistant which can help you make that decision. If you go to Assistant > Hypothesis Tests, it will help you choose based on the number of samples that you have. If you have one sample, you might be doing a one-sample t-test, a one-sample standard deviation test, a one-sample percent defective test, or a chi-square goodness-of-fit test. If you have two samples, then you have the two-sample t-test for different samples, the paired t-test if the before and after items are the same, the two-sample standard deviation test, the two-sample percent defective test, and the chi-square test of association. If you have more than two samples, then there is one-way ANOVA, the standard deviations test, chi-square percent defective, and the chi-square test of association. We will be practicing all of it with lots of examples. So let's come to the first example. We have the AHT (average handle time) of calls in minutes, and we have taken a sample of 33 data points.
The average is seven minutes, the minimum value is four minutes, and the maximum is ten minutes. The reason we have to do a hypothesis test is that the manager of the process says his team is able to close the resolution on the call in seven minutes, and the process average is indeed seven minutes with a minimum of four minutes; but the customer says that the agents keep them on hold and that calls take more than seven minutes. Now I want to statistically validate whether that is correct or not. Whenever we set up a hypothesis test, we follow the same step-by-step approach. Step one, define the alternate hypothesis. Then define the null hypothesis, which is nothing but your status quo. Then decide the level of significance, your alpha value; if nothing is specified, we set the alpha value at five percent. We first set the alternate hypothesis. In our case, what is the customer saying? The customer says the average handle time is more than seven minutes. The status quo, the agreed SLA, is that the AHT should be less than or equal to seven minutes. As I told you, the null and the alternate hypothesis are mutually exclusive and complementary to each other. Now identify the test to be performed. How many samples do I have? I have only one sample, the AHT of the contact center, so I am going to pick a one-sample t-test. Next, I run the test and identify the p-value. If you remember the previous lesson: if the p-value is less than the alpha value, we reject the null hypothesis; if the p-value is greater than five percent, the alpha value, we fail to reject the null hypothesis. Let us do this analysis. In the project data, under test of hypothesis, I have given you the AHT of calls in minutes, and I have copied this data into Minitab. Let's do it in two ways: first using the Assistant, and then using the Stat menu. If I go to Assistant > Hypothesis Tests, what is the objective I want to achieve? I have one sample. Is it about the mean, the standard deviation, or percent defective for discrete counts? We are talking about the average handle time, so I take the one-sample t-test. For data in columns, I select the AHT column. What is my target value? Seven. The alternate hypothesis is that the mean AHT of calls in minutes is greater than seven; this is what the customer is complaining about. The alpha value is 0.05 by default, and I click OK. To see the output, click on View and show the output only. If you look at the p-value, it is 0.278. Remember: P low, null go; P high, null fly. Is this value of 0.278 greater than the alpha value of 0.05? Yes, it is. Hence, I can conclude that the mean AHT of calls is not significantly greater than the target; whatever we see as greater than the target is only by chance. There is not enough evidence to conclude that the mean is greater than seven at a five percent level of significance. The Assistant also shows the data pattern: there are no unusual data points, and because the sample size is at least 20, normality is not an issue and the test is accurate. So we conclude that the average handle time is not significantly greater than seven minutes, and I can go ahead and reject the claim made by the customer; the few calls with a high handle time could be there only by chance. As a cross-check, the same test can also be reproduced outside Minitab, as sketched below.
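For anyone who wants to reproduce this one-sample t-test in Python rather than Minitab, here is a minimal sketch using scipy. The call-time values are made up for illustration (the real 33-point dataset lives in the course project file), so the p-value will not match the 0.278 shown in Minitab.

```python
from scipy import stats

# Hypothetical AHT values in minutes (stand-ins for the course dataset)
aht = [7.2, 6.8, 7.5, 6.9, 7.1, 7.4, 6.7, 7.0, 7.3, 6.6, 7.8, 6.9, 7.2]

# H0: mean AHT = 7, H1: mean AHT > 7 (the customer's claim)
result = stats.ttest_1samp(aht, popmean=7, alternative="greater")
print(result.statistic, result.pvalue)

if result.pvalue < 0.05:
    print("Reject H0: mean AHT is significantly greater than 7 minutes")
else:
    print("Fail to reject H0: no evidence the mean AHT exceeds 7 minutes")
```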
The same test can also be run by clicking Stat > Basic Statistics > 1-Sample t, with one or more samples each in one column. I click and select the AHT column, tick Perform hypothesis test, and set the hypothesized mean to seven. Under Options I define the alternate hypothesis: the actual mean is greater than the hypothesized mean. Click OK; if I need graphs, I can select them, then click OK again. I get the output: the descriptive statistics show the mean, the standard deviation, and so on; the null hypothesis is mu equal to seven, the alternate hypothesis is mu greater than seven, and the p-value is 0.278. P high, null fly: we fail to reject the null hypothesis, concluding that the average handle time is around seven minutes. So we received our output, we saw all of this, and we concluded that the average handle time is not significantly greater than seven minutes. 27. Understand 2 sample t test example 1: Let's do one more example, this time with two teams, that is, two samples. In this example there are two teams whose performance needs to be measured. The manager of team B claims that his team is performing better than team A; the manager of team A says this claim is invalid. Let's go to our dataset. In the project file you will find columns called Team A and Team B. I copy that data and paste it into Minitab; you can paste it next to the existing data or take a new worksheet. Now let's go to Assistant > Hypothesis Tests > 2-Sample t. Based on the claim that team B is better, the alternate hypothesis is that the mean of Team A is less than the mean of Team B, and I click OK. In this example the output says that Team A is not significantly less than Team B: the means are 27.7 and 27.3, so there is no statistical difference between the two teams. Both of the examples so far have come out that way, so let's see one more. I have taken the cycle time of process A and the cycle time of process B; let's copy this data across, it is another dataset. What is my alternate hypothesis? The two teams are different. What is the null hypothesis? The two teams are the same. Because the two samples are independent, I go ahead and do my two-sample t-test: the data of each team is in its own column, the alternate is that A is different from B, the alpha value is 5%, and I click OK. Now, if you see the output this time, it says that yes, the cycle time of A is significantly different from the cycle time of B: 26.8 versus 27.6. And if I look at the distributions, the red interval of one does not overlap with the red interval of the other, so there is a difference in the cycle time of the two teams. If I do the same thing using Stat > Basic Statistics > 2-Sample t, I select the cycle time of process A and the cycle time of process B, set Options to a difference not equal to zero, and take only the boxplot for graphs; mu1 is the population mean of the cycle time of process A and mu2 of process B. You can see the standard deviations, and the p-value is essentially 0, telling us that yes, there is a significant difference between the two teams. P low, null go: here we reject the null hypothesis, concluding that there is a significant difference between A and B. A scripted version of this two-sample comparison is sketched below.
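Here is a rough Python equivalent of the two-sample comparison, as a sketch only: the cycle-time numbers are invented stand-ins for the course data, and Welch's version of the test is used (it does not assume equal variances).

```python
from scipy import stats

# Hypothetical cycle times in minutes for two processes (stand-ins for the course data)
process_a = [26.5, 27.1, 26.8, 26.4, 27.0, 26.6, 26.9, 26.7, 26.3, 27.2]
process_b = [27.4, 27.8, 27.6, 27.9, 27.3, 27.7, 27.5, 28.0, 27.6, 27.8]

# H0: mean A = mean B, H1: the means differ (two-sided Welch test, unequal variances allowed)
result = stats.ttest_ind(process_a, process_b, equal_var=False)
print(result.statistic, result.pvalue)

if result.pvalue < 0.05:
    print("Reject H0: the two process cycle times differ significantly")
else:
    print("Fail to reject H0: no significant difference detected")
```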
I have shown the same thing with the distributions: there is a larger spread on one side and a smaller spread on the other, and I can also use the graphical analysis we learned earlier to see how each team is performing. This is the graphical summary of team A: the mean is about 26 and the standard deviation is 1.5, and if I scroll down I get the same summary for team B. Now I want to overlap these graphs, so I click Graph > Histogram > With Fit, select the two columns, and place them in separate panels of the same graph with the same scales, then click OK. Can you see that the bell curves of the two are different? Let's also do an overlapping graph: Histogram, Multiple Graphs, overlaid on the same graph. Can you see the difference between the blue and the red? The kurtosis is different, the skew is different, and that matches the conclusion of my two-sample t-test, which says there is a statistically significant difference between the cycle time of team A and the cycle time of team B. Next, we will learn about the paired t-test. 28. Understand 2 sample t test example 2: Let's come to example two. There are two centers whose performance needs to be measured. The manager of center A claims that his team is performing better than center B; the manager of center B says the claim is invalid. Again, I follow my five-step process. What is the alternate hypothesis? Center A is better than B; to keep it simple, let's state it as center A is not equal to center B. What is the null hypothesis? Center A is equal to center B. Level of significance: five percent. How many samples do I have? I have two samples, the center A data and the center B data, and because I have two samples I go for a two-sample t-test. Let's go to our Excel sheet: I have the data for center A and center B, and I copy it into Minitab. Let's do the two-sample t-test: Stat > Basic Statistics > 2-Sample t. Each sample has its own column, so I select sample one as center A and sample two as center B. In Options, the hypothesized difference between A and B is 0 and the alternative is not equal. I add the individual value plot and boxplot and click OK. Let's see the output. The center A data is here and the center B data is here, and if you look at the p-value, it is high. So again I get an example where P high, null fly, meaning there is no significant difference between center A and center B. The individual value plot shows the same thing. Let's see the boxplot: it shows that the means are not significantly different; there is one value of 0 showing as an outlier, which we should look into, keeping in mind this is only a sample. Let me do the same thing using Assistant > Hypothesis Tests > 2-Sample t, with the alternate hypothesis that the mean of center A is different from the mean of center B, and click OK. Does the mean differ? The mean of center A is not significantly different from the mean of center B. If you look at the distributions, the red intervals completely overlap with each other, telling us there is not enough evidence to conclude that there is a difference. You do see a difference in the sample means, 6.8 versus 6.5.
But that could be just due to chance, and there is also the standard deviation to consider; hence the red intervals show that there is not a significant difference between center A and center B. We will continue learning from other examples in the coming videos. 29. Understand Paired t test: Let us understand one more example. This is an example of a paired t-test. In this case study, a psychologist wanted to determine whether a particular running program has an effect on resting heart rate. The heart rates of 15 randomly selected people were measured; the people were then put on the running program and measured again after one year. Are the participants the same before and after? Yes, and that is the reason this is not a two-sample t-test but a paired t-test: we test the before and after measurements of each person as pairs of observations. In the dataset I have columns called Before and After; I am not using the difference column, just the data for the 15 people, which I have put into Minitab. Because it is the same person before and after, in Assistant > Hypothesis Tests I choose the paired t-test. First, what is the alternate hypothesis? The program is expected to have an effect, so the mean of before is different from the mean of after; measurement one is before and measurement two is after. What is the null hypothesis? There is no change. The alpha value is 0.05. Click OK and look at the output. Does the mean differ? The p-value is 0.007, so the mean of before is significantly different from the mean of after. Looking at the means, they are 74.5 before and 72.3 after, and the mean difference is clearly away from zero. In the before-versus-after plot, the blue dots are after and the black dots are before: for most of the participants the resting heart rate reduced after the running program, with a few exceptions. The Assistant also reports no unusual paired differences and that the sample is sufficient to detect a difference in the means, so I can say that yes, there is a difference. Quick revision: P low, null go; as the p-value is less than the significance level, we conclude there is a significant difference between the two readings. If I do the same using Stat > Basic Statistics > Paired t, each sample in one column, Before and After, Options set to a difference not equal to zero, and only the boxplot for graphs: the null hypothesis is that the difference is 0, the alternate is that the difference is non-zero, and the p-value is low, so I reject the null hypothesis; there is a difference after adopting the program. In the boxplot of differences, the red dot marking the null value lies well away from the confidence interval of the mean difference, confirming that undergoing the program made a difference. In the next lesson we will take up more examples; a quick scripted version of this paired comparison is sketched below.
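As a cross-check outside Minitab, a paired t-test can be run like this. This is a sketch only; the heart-rate values below are invented stand-ins for the 15-person dataset in the project file.

```python
from scipy import stats

# Hypothetical resting heart rates (beats per minute) before and after the program
before = [76, 74, 78, 73, 75, 77, 72, 74, 76, 73, 75, 78, 74, 72, 77]
after  = [73, 72, 75, 72, 74, 74, 70, 73, 73, 71, 73, 75, 72, 71, 74]

# H0: mean difference = 0, H1: mean difference != 0 (paired, two-sided)
result = stats.ttest_rel(before, after)
print(result.statistic, result.pvalue)

if result.pvalue < 0.05:
    print("Reject H0: the running program changed the resting heart rate")
else:
    print("Fail to reject H0: no significant change detected")
```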
30. Understand One Sample Z test: A quick recap of the different tests we have learned: if I am looking at how different my one group is from the population, I go for a one-sample t-test. When I have two different, independent groups of samples, I go for a two-sample t-test. If it is the same group of people measured at two different points in time, I go for a paired t-test, like the heart rate example where people were measured before and after a running program. Now let's continue with more examples. Use case number five: fat percentage analysis. Scientists at a company that manufactures processed food want to assess the percentage of fat in the company's bottled sauce. The advertised percentage is 15%, and the scientists measure the percentage of fat in 20 random samples. Previous measurements give a population standard deviation of 2.6; the standard deviation of the sample is 2.2. When I know the population standard deviation, I can use a one-sample z-test, because I have one sample and a known population parameter. Again, I apply the same steps. What is the alternate hypothesis? The fat percentage is not equal to 15%. What is the null hypothesis? The fat percentage is equal to 15%. Level of significance: five percent. Because it is a one-sample test and I have the population standard deviation, I use the one-sample z-test. Let's do the analysis. I have opened the project file, where I have the sample IDs and the fat percentage data, and I copy this data into Minitab. Because we know the population standard deviation, I go to Stat > Basic Statistics > 1-Sample Z. My data is in one column, the fat percentage; the known standard deviation is 2.6; I tick Perform hypothesis test with a hypothesized mean of 15. So my null hypothesis is that the fat percentage is equal to 15, and my alternate hypothesis is that the fat percentage is not equal to 15. I pick the boxplot and histogram graphs and click OK. In the output, the null hypothesis is fat percentage equal to 15, the alternate is not equal to 15, the alpha value is 0.05, and my p-value is 0.012. As the p-value is less than the alpha value, P low, null go: I reject the null hypothesis, concluding that the fat percentage is not equal to 15, and looking at the data, the fat percentage is running above 15. I can redo the same test one-sided, this time checking whether the fat percentage is greater than the hypothesized mean, and I get an even smaller p-value of 0.006, far below my alpha value, concluding that although the hypothesized mean is 15, the sample says there is a high probability that the fat percentage in the sauce is more than 15. What advice will we give to the company? We will advise the company that it cannot keep selling the product labelled as 15% fat, because the actual fat content is more than 15%; to be safe, the label should be revised upward, say towards 18%, since some of the measured values go up towards 20%. A sketch of the underlying z calculation follows below.
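The z statistic is simple enough to compute by hand when the population sigma is known. Here is a minimal sketch: only the known sigma of 2.6 and the target of 15 come from the example, while the fat-percentage values themselves are invented placeholders, so the p-values will not match Minitab's exactly.

```python
import math
from scipy import stats

# Hypothetical fat-percentage measurements (stand-ins for the 20 samples)
fat = [16.2, 15.8, 17.1, 14.9, 16.5, 15.4, 16.8, 17.3, 15.1, 16.0,
       16.9, 15.7, 16.4, 17.0, 15.9, 16.3, 14.8, 16.6, 15.5, 16.1]

mu0, sigma = 15.0, 2.6                            # hypothesized mean, known population sigma
n = len(fat)
x_bar = sum(fat) / n

z = (x_bar - mu0) / (sigma / math.sqrt(n))        # z statistic
p_two_sided = 2 * stats.norm.sf(abs(z))           # H1: mean != 15
p_greater   = stats.norm.sf(z)                    # H1: mean > 15

print(round(z, 3), round(p_two_sided, 4), round(p_greater, 4))
```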
A consumer will be happier to receive a product that contains less fat than one that contains more, because we are all health-conscious. So let's continue to the next lesson. 31. Understand One Sample proportion test-1p-test: We will continue with hypothesis testing. Sometimes we have a proportion of an outcome rather than averages, standard deviations, or variances to measure. Take example six: a marketing analyst wants to determine whether a mail advertisement for a new product resulted in a response rate different from the national average. Normally, when you place an advertisement, the advertising company claims it will produce, say, a 6% or 10% response; here it is the same kind of scenario. The analyst took a random sample of 1,000 households that received the advertisement, and out of those 1,000 households, 87 made purchases after receiving it. The advertising company is claiming that it has made a better impact than the national average. The analyst has to perform a one-proportion test to determine whether the proportion of households that made a purchase is different from the national average of 6.5%, because the sample response rate is 8.7%. In this case, what is the alternate hypothesis? The response to the advertisement is different from the national average. The null hypothesis is that there is no difference; they are the same. The alpha value is five percent, and we will use the one-proportion test. Let's go to Minitab: Stat > Basic Statistics > 1 Proportion. I do not have raw data in a column; I have summarized data. How many events are we observing? 87 events, out of a sample of 1,000. I tick Perform hypothesis test and set the hypothesized proportion to 0.065, that is, 6.5%, with the alternative that the proportion is not equal to the hypothesized proportion, and click OK. The null hypothesis is that the proportion is equal to 6.5%; the alternate hypothesis is that the proportion is not equal to 6.5%. The p-value is 0.008. What does that mean? P low, null go: we reject the null hypothesis, concluding that the effect of the advertisement is not 6.5% but more, because the 95% confidence interval for the proportion runs from about 7% to 10%. The sample proportion is 8.7%, and the confidence interval sits entirely above 6.5%, so we can conclude that the advertisement had a significant impact and we can go ahead with this advertising company. Let's continue in our next lesson. 32. Understand Two Sample proportion test-2p-test: Let's do this next exercise using the Assistant. We checked 802 products sampled from supplier A, of which 725 are non-defective. How many are defective? Subtracting, 802 minus 725 gives 77. From supplier B, 712 products were sampled and 573 are perfect. How many are defective? 139. A scripted version of these proportion tests is sketched below.
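For reference, both the one-proportion test from lesson 31 and the two-proportion comparison being set up here can be computed with a normal approximation. This is only a sketch of the large-sample z approximation, not Minitab's exact method, so the p-values may differ slightly from Minitab's output.

```python
import math
from scipy import stats

# One-proportion test: 87 purchases out of 1,000 households vs a claimed 6.5%
x, n, p0 = 87, 1000, 0.065
p_hat = x / n
z1 = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
print("1-proportion z:", round(z1, 3), "two-sided p:", round(2 * stats.norm.sf(abs(z1)), 4))

# Two-proportion test: defectives 77/802 (supplier A) vs 139/712 (supplier B)
x1, n1, x2, n2 = 77, 802, 139, 712
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                    # pooled proportion under H0: p1 = p2
z2 = (p1 - p2) / math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
print("2-proportion z:", round(z2, 3), "two-sided p:", round(2 * stats.norm.sf(abs(z2)), 4))
```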
So let's do our two-proportion comparison using the Minitab Assistant: Assistant > Hypothesis Tests > 2-Sample % Defective, with supplier A at 77 defectives out of 802 and supplier B at 139 defectives out of 712. The alternate hypothesis is that the percent defective of supplier A is less than the percent defective of supplier B. I click OK, and I get that yes, the percent defective of supplier A is significantly less than the percent defective of supplier B. Scrolling down, the report confirms the difference: from this test you can conclude that the percent defective of supplier A is less than that of supplier B at the 5% level of significance, and you can see it clearly in the percentages as well. We will continue with the next hypothesis test. 33. Two Sample proportion test-2p-test-Example: Now let us understand the next example. Here an operations manager samples products manufactured using raw material from two suppliers to determine whether one supplier's raw material is more likely to produce a better-quality product. 802 products were sampled from supplier A and 725 were perfect, that is, non-defective; 712 products were sampled from supplier B and 573 were perfect. Because what we have is a percentage of non-defectives, we have two proportions, supplier A and supplier B. Let's go to Minitab: Stat > Basic Statistics > 2 Proportions, with summarized data; the first sample has 725 events out of 802 trials, and the second has 573 events out of 712 trials. In Options, the alternative is that there is a difference, so let's find out. The null hypothesis is that there is no difference between the proportions; the alternate hypothesis is that there is a difference. Looking at the p-value, it comes out close to zero: P low, null go, so I reject the null hypothesis; there is a difference in the performance of the two suppliers. Since I am looking at perfect, non-defective items, sample one has about 90% perfect and sample two about 80% perfect, concluding that supplier A is a better supplier than supplier B. Thank you so much; we will continue in the next lesson. 34. Using Excel = one Sample t-Test: Many times we understand the test of hypothesis, but there is a practical challenge: I do not have Minitab with me. Can I still do a test of hypothesis in an easy way, rather than going through a manual calculation with a statistical calculator? Do not worry, that is possible. I am going to show you how to do tests of hypothesis using Microsoft Excel. Go to File, then Options, and then Add-ins. In the Manage drop-down, select Excel Add-ins and click Go. Tick Analysis ToolPak and make sure the checkbox is on. Once you have that, you will find Data Analysis available on your Data tab. If I click on it, you can see what is possible: ANOVA, correlation, covariance, descriptive statistics, histogram, t-Test, z-Test, random number generation, sampling, regression, and so on. So it becomes very easy to do hypothesis testing, at least for continuous data, through Microsoft Excel as well.
I am going to take you through a step-by-step exercise now. Let's go back to the presentation and take the first problem: the descriptive statistics for the AHT of the call. The manager of the process says that his team closes the resolution on the call in seven minutes, but the customer says he is kept on hold for a long time and hence spends more than seven minutes on the call. The descriptive statistics show a maximum of ten minutes, a median of seven, and an average of 7.1. Now I want to do this analysis using Microsoft Excel. This use case is in the project data I have uploaded; click on the AHT sheet and it takes you to this data. First, let me teach you how to do descriptive statistics using Excel. On the Data tab I click Data Analysis and look for Descriptive Statistics, then click OK. My input range runs from the header down to the last value; the data is grouped by columns and the label is present in the first row. I send the output to a new workbook, tick Summary statistics and Confidence Level for Mean, and click OK. Excel does the calculation, and here is the output (I click the comma format just to make the numbers readable): mean, median, mode, standard deviation, kurtosis, skewness, range, minimum, maximum, sum, count, and the confidence level, all calculated at the click of a button, without writing any formulas. Now let us go back to our dataset and do the hypothesis test. The null hypothesis is that the AHT is equal to seven minutes; the alternate hypothesis is that the AHT is not seven minutes, that there is a difference. I set the alpha value at 5%, and the test I am going to run is a one-sample t-test. When you do a one-sample t-test using Excel, you have to follow a small trick, because Excel only offers two-sample t-tests. I insert a column next to the data and call it Dummy, filled with zeros; the mean and everything else for the dummy column is always zero. Then I click Data Analysis, scroll down, select t-Test: Two-Sample Assuming Equal Variances, and click OK. Input range 1 is the AHT column, input range 2 is the dummy column, the hypothesized mean difference is seven, labels are present in both ranges, alpha is set at five percent, and the output goes to a new workbook. Click OK, and the output appears; again I click the comma format so the numbers are readable. Because the dummy column carries no real data, I am free to delete it afterwards. Now, what do we always look for? The p-value. Do you remember the jingle? The conclusion here is P high, null fly: I fail to reject the null hypothesis, concluding that the AHT of the call is seven minutes, and I reject the alternate hypothesis because my p-value is above 0.05. I will take up more examples in the following lessons, so I am looking forward to you continuing this series. A quick scripted equivalent of the descriptive-statistics step is sketched below.
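For anyone working in Python instead of the Excel Analysis ToolPak, a comparable summary can be produced with pandas. This is a sketch: the AHT values are placeholders and the column name is my own choice, not part of the course files.

```python
import pandas as pd

# Hypothetical AHT values in minutes (stand-ins for the project dataset)
df = pd.DataFrame({"AHT_minutes": [7.2, 6.8, 7.5, 6.9, 7.1, 7.4, 6.7, 7.0, 7.3, 10.0, 4.0, 7.2]})

print(df["AHT_minutes"].describe())   # count, mean, std, min, quartiles, max
print("skewness:", df["AHT_minutes"].skew())
print("kurtosis:", df["AHT_minutes"].kurt())
```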
If you have any questions, I request you to drop them in the discussion section below, and I will be happy to answer them. Thank you. 35. Correlation analysis: Welcome to the next lesson of our Analyze phase in the DMAIC life cycle of a Lean Six Sigma project. Sometimes we get into a situation where we want to do a correlation analysis, and hence I thought I should take you deep into what correlation is, the difference between correlation and causality, how to interpret correlation when looking at a scatter plot, what significance level to set when doing hypothesis testing, Pearson's correlation, Spearman's correlation, point-biserial correlation, and how to do these calculations online using some freely available tools. So let's get started. What exactly is correlation analysis? Correlation analysis is a statistical technique that gives you information about the relationship between variables. It can be used to investigate the relationship of variables; how strong the correlation is, is determined by the correlation coefficient, represented by the letter r, which varies from minus one to plus one. Correlation analysis can thus be used to make statements about the strength and the direction of the correlation. For example, if you want to find out whether there is a correlation between the age at which a child speaks its first sentence and later success at school, you can use correlation analysis. Now, whenever we work with correlation, there is a challenge, and sometimes we get confused about what the results really mean. If the correlation analysis shows that two characteristics are related to one another, it can subsequently be checked whether one variable can be used to predict the other. If the correlation mentioned in the example is confirmed, for example, it can be checked whether school success can be predicted from the age at which the child speaks its first sentence; that is where a linear regression equation comes in, and I have a separate video explaining what linear regression is. But beware: correlation need not imply a causal relationship. Any correlation that is discovered should therefore be investigated more closely by the subject matter expert and never interpreted immediately in terms of content, even if it seems very obvious. Let's see some examples of correlation versus causality. If the correlation between the sales figures and the price is analyzed and a strong correlation is identified, it would be logical to assume that the sales figures are influenced by the price and not the other way around; the price does not change because of the sales. This assumption can, however, by no means be proven on the basis of a correlation analysis alone. Furthermore, it can happen that the correlation between variable x and variable y is actually generated by a third variable; we will cover that under partial correlation in more detail. However, depending on which variables are involved, you may be able to infer a causal direction right from the start. For example, if there is a correlation between age and salary, it is clear that age influences salary and not the other way around: salary does not influence age. Just because I have a higher salary does not mean I will become older; otherwise everyone would want to earn as little salary as possible, which is absurd. Now, how do we interpret the correlation?
With the help of correlation analysis, two statements can be made: one about the direction of the correlation, and one about the strength of the linear relationship between the two metric or ordinally scaled variables. The direction indicates whether the correlation is positive or negative, while the strength indicates whether the correlation between the variables is strong or weak. A positive correlation exists when larger values of variable x are accompanied by larger values of variable y; height and shoe size, for example, are positively correlated, and the correlation coefficient lies between 0 and 1, a positive value. A negative correlation, on the other hand, exists when larger values of variable x are accompanied by smaller values of variable y; product price and sales quantity usually have a negative correlation, so the more expensive a product is, the smaller the sales quantity, and in this case the correlation coefficient lies between minus one and zero. How do I determine the strength of the correlation? For the correlation coefficient r, the following table can act as a guide: between 0.0 and 0.1, no correlation; between 0.1 and 0.3, a little or minor correlation; between 0.3 and 0.5, medium correlation; between 0.5 and 0.7, high or strong correlation; and between 0.7 and 1, very high correlation. At the end of this module I will show you how to calculate the correlation coefficient directly in an online tool. Let's go further. One of the tools we use to analyze correlation is the scatter plot, because both x and y are variable (metric) data types. The x axis always carries the input variable and the y axis the output variable, because y is a function of x, and I can see that as age increases, salary increases. The scatter plot gives you a rough idea of whether there is a correlation, whether it is linear or non-linear, and whether there are any outliers. When we do correlation, we might also want to test the correlation for significance. If there is a correlation in the sample, it is still necessary to test whether there is enough evidence that the correlation also exists in the population. Thus the question arises: when is the correlation coefficient considered statistically significant? The significance of the correlation coefficient can be tested using a t test. As a rule, it is tested whether the correlation coefficient is significantly different from zero, that is, whether a linear dependence exists. In this case, the null hypothesis is that there is no correlation between the variables under study; in contrast, the alternate hypothesis assumes that there is a correlation. As with any other hypothesis test, the significance level is first set at 5%, that is, the alpha value is 5%, which means I want to be 95% confident in the analysis I am doing.
If the calculated p value is below 5%, the null hypothesis is rejected and the alternate hypothesis applies: we assume there is a relationship between x and y. The t test statistic used for this hypothesis test is t = r · √(n − 2) / √(1 − r²), where n is the sample size and r is the correlation determined from the sample; the corresponding p value can be calculated easily in an online correlation calculator. Directional and non-directional hypotheses: a correlation can be tested with either a non-directional or a directional hypothesis. With a non-directional hypothesis, you are only interested in whether there is a relationship between the two variables at all, for example whether there is a correlation between age and salary, without caring about the direction of the relationship. With a directional hypothesis, you are also interested in the direction of the correlation, that is, whether there is a positive or a negative correlation between the variables; your alternate hypothesis is then, for example, that age has a positive influence on salary. What you have to pay attention to is that in the directional case the null hypothesis becomes "there is no correlation or a negative one", and the alternate hypothesis states that there is a positive influence on salary. Now let's go to the next part, Pearson's correlation analysis. With Pearson's correlation analysis, you get a statement about the linear correlation between metric-scale variables. The covariance is used for the calculation: the covariance gives a positive value if there is a positive correlation between the variables and a negative value if there is a negative correlation. The covariance of x and y is calculated using the formula shown on the screen; do not worry, we do not have to calculate it manually, as the tools can do that analysis for us. However, the covariance is not standardized and can take values between plus and minus infinity, which makes it difficult to compare the strength of the relationship between variables. For this reason the correlation coefficient, also called the product-moment correlation, is used instead; it is obtained by normalizing the covariance by the spread of the two variables. The Pearson correlation coefficient can then take values from minus one to plus one and is interpreted as follows: a value of plus one means an entirely positive linear relationship, a value of minus one means an entirely negative linear relationship, and a value of zero means there is no linear relationship, the variables do not correlate with each other. A correlation of plus one looks like a perfect straight line, which is only possible in theory; a correlation of about +0.7 trends upward with most of the dots close to the regression line; and a correlation of about +0.3 is more scattered but still going in a positive direction. A sketch of the significance calculation just described is given below.
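Using the formula above, the significance of a correlation coefficient can be checked in a few lines. This is just a sketch, with r = 0.86 and n = 10 borrowed from the height-and-weight example that comes later in this lesson.

```python
import math
from scipy import stats

r, n = 0.86, 10                                    # sample correlation and sample size
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t = r * sqrt(n - 2) / sqrt(1 - r^2)

df = n - 2                                         # degrees of freedom for the correlation t test
p_two_sided = 2 * stats.t.sf(abs(t), df)
p_one_sided = stats.t.sf(t, df)                    # directional test: positive correlation

print(round(t, 3), round(p_two_sided, 4), round(p_one_sided, 4))
```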
With a correlation of about −0.7, the dots are scattered and moving downward: as the value of x increases, the value of y decreases, and most of the dots lie around the regression line. A correlation value of zero can come about in several ways: either the dots are completely scattered, or they form patterns that are not usefully linear, which means you need some other analysis to interpret the variables. Finally, the strength of the relationship can be interpreted using the same table as before: 0 to 0.1 no correlation, 0.1 to 0.3 little correlation, 0.3 to 0.5 medium correlation, 0.5 to 0.7 high correlation, and 0.7 to 1 very high correlation. To check in advance whether a linear relationship exists, scatter plots should be considered; this way the relationship between the variables can also be checked visually. Pearson's correlation is only useful and purposeful if linear relationships are present. Pearson's correlation also has certain assumptions which you should keep in mind: the variables must be normally distributed, and there must be a linear relationship between the variables. Normality can be tested either analytically or graphically using the QQ plot, which I will show you how to do, and whether the variables have a linear correlation is best checked with the scatter plot. If the conditions are not met, then Spearman's correlation can be used. I hope you are clear up to here; let's continue our learning. What do we do when the data is not normal and we still want to do a correlation analysis? In this case, we use Spearman's rank correlation. Spearman's rank correlation analysis is used to calculate the relationship between two variables that have an ordinal level of measurement. When you have continuous, normally distributed data, we use Pearson's correlation analysis; but if the data is ordinal, non-parametric, or not normal, then I can go ahead with Spearman's correlation analysis. This procedure is therefore used when the prerequisites of the parametric procedure are not met, or when there is no metric, continuous variable. In this context we often refer to it as Spearman's correlation or Spearman's rho. The question can then be treated in the same way as with Pearson's correlation coefficient. For example: is there a correlation between two variables or features, say between age and religiousness in the French population? The calculation of the rank correlation is based on the ranking of the data series: the measured values themselves are not used in the calculation but are transformed into ranks, and the test is then performed on the ranks. The rank correlation coefficient rho takes values between minus one and one. If the value is less than zero, there is a negative relationship; if the value is greater than zero, there is a positive relationship; and if the value is zero or close to zero, say between −0.1 and 0.1, we can say that there is no relationship between the variables.
As with Pearson's correlation coefficient, the strength of the correlation can be classified as follows: between 0 and 0.1, no correlation; between 0.1 and 0.3, a low correlation; between 0.3 and 0.5, a medium correlation; between 0.5 and 0.7, a high correlation; and between 0.7 and 1, a very high correlation. If the values are negative, we speak of a low negative correlation, a high negative correlation, and so on. There is another type of correlation called the point biserial correlation. The point biserial correlation is used when one of the variables is dichotomous, for example did you study or not study, and the other is a metric variable like salary. In this case, we use the point biserial correlation. The calculation of a point biserial correlation is the same as that of Pearson's correlation: one of the two values of the dichotomous variable is coded as zero and the other is coded as one. I will show you how to do the correlation analysis using Excel or other freely available tools shortly, but let's first study the case. A student wants to know if there is a correlation between the height and the weight of the participants in a statistics course. For this purpose the student drew a sample, shown below, with the heights and the weights of the people. To analyze the linear relationship by means of correlation analysis, you can calculate the correlation using Excel or other tools available online. First copy the table into the statistics calculator, then click on correlation and select the variables, and you will get the results. So let's do it online. I have come to datatab.net, an online statistical calculator. The calculations are made in your browser, so the data does not need to go to a large server and the calculation works very fast. I have the body height, the body weight and the age. I want to understand whether there is a relationship between body height and body weight. What type of correlation do I want? Let's go with Pearson's first. The hypotheses are that there is no correlation versus that there is a positive correlation, and the level of significance is set at 5%. We can test for the assumptions, and the tool immediately does the analysis: it draws the QQ plot and the histogram and shows the results. We can say that, more or less, the data is normally distributed. I can save this by clicking on Download PNG. Now let me close this tab; it has tested the assumptions. The summary in words says the result of the Pearson's correlation shows that there is a very high positive correlation between body height and body weight. The relationship is statistically significant with a positive r value: r is 0.86 and the p value is 0.001. Looking at the strength of the correlation, a value between 0.7 and 1 is a very high correlation, and here it is positive. When I go for hypothesis testing, the null hypothesis is that there is no or a negative correlation between body height and weight.
The alternate hypothesis is that there is a positive correlation between body height and weight. How many cases do we have? Ten cases. The r value is 0.86 and the p value is 0.001, which is less than 0.05. Hence we reject the null hypothesis that there is no correlation, and the alternate hypothesis applies: there is a positive correlation between body height and weight. The advantage of being on DataTab is that you also get an AI interpretation. The table summarizes the results of the analysis of body height and weight, showing the correlation coefficient r and the p value. The correlation coefficient indicates the strength and the direction of the relationship between height and weight, and the coefficient value of 0.86 suggests a very high positive correlation. This means that, generally, as body height increases, weight also tends to increase, and vice versa. Now the p value. The p value tells us whether the available data provides sufficient evidence to reject the null hypothesis. In this case a one sided hypothesis is tested, and the null hypothesis states that there is no or a negative correlation between height and weight in the population. In most cases, if the p value is less than 0.05, we consider the result statistically significant. In our case the p value is 0.001, which is obviously less than 0.05, so the null hypothesis is rejected, and the Pearson's correlation shows a statistically significant positive correlation between body height and weight, supported by an r value of 0.86 and a p value below 0.05. There is also a scatter plot which is generated automatically. I can click here to get the regression line, I can change the axes if I don't want them to start from zero, I can choose the image size, and I can click on Download PNG to download the image. Now, as I told you, we can also do the covariance calculation. When I look at body height and body weight, the covariance is 1.29, which is positive, so it confirms a positive relationship. That is how you do the calculation. Now, for the point biserial calculator, we might have a different type of data where we want to analyze whether salary has something to do with gender. In that case I would select the metric variable as salary and the nominal variable as gender, and then do the calculation. The tool codes male as zero and female as one, and the box plot suggests that the males in this sample tend to have a higher salary compared to the females. So, coming back to the case where a student wants to know if there is a correlation between height and weight, we have done that analysis. For the hypothesis, you could go for a non directional hypothesis, there is no correlation between body height and weight versus there is an association between height and weight, but I had taken a directional hypothesis in my test. We saw the p value and how to generate the output. First you get the null and alternate hypothesis: the null hypothesis states that there is no correlation between height and weight, and the alternate hypothesis states the opposite. If you click on summary in words, you get the interpretation we just saw.
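If you want to reproduce this height and weight check outside the online calculator, here is a minimal Python sketch of a one sided Pearson test; the ten height/weight pairs are illustrative placeholders, not the course data, and the alternative argument assumes SciPy 1.9 or newer.

from scipy import stats

# Illustrative sample of 10 people: height in cm, weight in kg
height = [160, 165, 168, 170, 172, 175, 178, 180, 183, 188]
weight = [55, 58, 63, 65, 70, 72, 74, 80, 84, 90]

# One-sided test: H0 = no or negative correlation, H1 = positive correlation
r, p_value = stats.pearsonr(height, weight, alternative="greater")
print(f"r = {r:.2f}, one-sided p value = {p_value:.4f}")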
So we tried out the directional, or one sided, correlation hypothesis, and Excel and other tools can help you calculate it. We tested the null hypothesis that there is no or a negative correlation between body height and weight against the alternate that there is a positive correlation, and we found a very strong positive correlation, with a p value of less than 0.01. In this case you must first check whether the correlation actually lies in the direction of the alternate hypothesis, that is, that height and weight are positively correlated, and then the two sided p value is divided by two, so that only one side of the distribution is considered. However, this tool takes care of these two steps, and the summary in words is given as we saw: there is a positive correlation between the height and weight in the sample, and this correlation is statistically significant and very high. With that, we will close our correlation analysis and I will see you in the next class. 36. Pearsons Correlation analysis concept: Let's continue our correlation journey. I'm going to cover Pearson's correlation today. Pearson's correlation analysis is an examination of the relationship between two variables, for example the correlation between a person's age and salary. Both of them are continuous variables, and hence the diagram will be a scatter plot. So as the age of the person increases, does the salary increase? Now, you need to remember that y is a function of x, so your y axis will have the outcome and your x axis will have the independent variable. More specifically, we use the Pearson's correlation coefficient to measure the linear relationship between two variables. If the relationship is not linear, then this correlation equation will not be of any help. I think you will have noticed that I have changed my look for this recording; if you liked it, just put a thumbs up in the comment section. Let's continue with the strength and the direction of correlation. With correlation analysis, we can determine how strong the relationship is and in which direction the correlation goes. We read the strength and the direction of the correlation from the Pearson's correlation coefficient r, whose value varies from minus one to plus one. The strength of the correlation can be read from the table: an absolute r value between 0 and 0.1 indicates that there is no correlation, while an absolute value between 0.7 and 1 indicates a very strong correlation. If the values are positive, the variables are positively correlated, and if the values are negative, they are negatively correlated. So let's say the r value comes out as -0.66; then we can say it is highly negatively correlated. This I have taken from a statistics textbook. Let's continue with what we mean by the direction of correlation. A positive correlation exists when large values of one variable are associated with large values of the other variable, and small values of one variable are associated with small values of the other. So, for a positive correlation, a bigger value on the x axis corresponds to a bigger value on the y axis.
And a smaller value on the x axis corresponds to a smaller value on the y axis, as you can see in these two images. An example of a positive correlation is height and shoe size: as a person's height increases, the shoe size also increases. The result is a positive correlation coefficient, r greater than zero. Now, did you notice there is a mistake in this graph? The mistake is that shoe size is the outcome and height is the independent variable, but we have deliberately mapped it the wrong way round so you can spot it. So let me put my question here: what is wrong in the above graph? Does an increase in shoe size cause an increase in the height of the person, or does an increase in the height of the person cause an increase in the shoe size? Please write your answer in the comment section below. Remember, y is a function of x. Here, x should be the height of the person and y should be the shoe size, so the mistake is that the axes are shown the wrong way around. I hope it is now clear what we are trying to say. A negative correlation is when a large value of one variable is associated with a small value of the other variable, and vice versa. So if the y axis value is big, the x axis value is small, and if the x axis value is big, the y axis value is small. This is what we call a negative correlation; the dots flow downward, unlike the previous case where the dots flowed upward. A negative correlation is found, for example, between product price and sales volume. When the price increases, the sales volume decreases, and if the price is reduced, people tend to buy more, resulting in more sales. The result is a negative correlation coefficient, r less than zero, and the stronger the correlation, the closer the value gets to minus one. Here the graph is correct: as the price increases, the volumes decrease. Now, how do we calculate Pearson's correlation coefficient? That is a very important question. The Pearson's correlation coefficient is calculated using the following equation: r equals the sum of (x_i minus x bar) times (y_i minus y bar), divided by the square root of the sum of (x_i minus x bar) squared times the sum of (y_i minus y bar) squared. Here, r is the Pearson's correlation coefficient, x_i is an individual value of one variable, for example the age of a person, x bar is the average age in the sample, y_i is an individual value of the other, outcome variable, and y bar is the average salary in the sample. So x bar and y bar are the mean values of the two variables respectively. In this equation, we can see that the respective mean value is first subtracted from each observation of each variable. In our example, we calculate the mean of age and of salary, subtract the mean from each individual age and salary, multiply the two differences for each person, and sum up the individual products. The expression in the denominator ensures that the correlation coefficient always ranges between minus one and plus one.
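If you want to see that formula working step by step, here is a minimal Python sketch that computes r directly from the definition; the age and salary numbers are made up for illustration.

import math

# Made-up sample: age in years, salary in thousands
age = [25, 30, 35, 40, 45, 50, 55]
salary = [30, 38, 45, 52, 60, 64, 72]

x_bar = sum(age) / len(age)
y_bar = sum(salary) / len(salary)

# Numerator: sum of products of deviations from the mean
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, salary))

# Denominator: square root of the product of the summed squared deviations
denominator = math.sqrt(sum((x - x_bar) ** 2 for x in age) *
                        sum((y - y_bar) ** 2 for y in salary))

r = numerator / denominator
print(f"Pearson r = {r:.3f}")  # lies between -1 and +1 by construction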
Remember, you don't have to calculate any of this manually; these functions are available in Excel and on multiple online websites. If we multiply two positive values, we get a positive value, and if we multiply two negative values, we also get a positive value, because minus times minus is plus. So all points that lie in those two regions contribute positively to the correlation coefficient: as age increases, salary increases, and as age decreases, salary decreases. If we multiply a positive value with a negative value, we get a negative value, because minus times plus is minus. All points in those regions, highlighted in the purple boxes, contribute negatively to the correlation coefficient. Therefore, if our values lie predominantly in the two green areas of the previous figures, we get a positive correlation coefficient and thus a positive correlation. If our points lie predominantly in the red areas, we get a negative correlation coefficient and thus a negative correlation. If the points are distributed over all four areas, the positive terms and the negative terms cancel each other out, and we end up with a very small correlation or no correlation at all. This is a very important part which you need to understand. Now, how do we test whether the correlation coefficient is significant? In general, the correlation coefficient is calculated from a sample. In most cases, however, we want to test a hypothesis about the population. Because we cannot study the whole population, we take a sample, and by studying the sample we want to draw an inference about the population. In the case of correlation analysis, we want to know if there is a correlation in the population, so we test whether the correlation coefficient in the sample is statistically significantly different from zero. Now, how do we do hypothesis testing for Pearson's correlation? The null hypothesis says there is no correlation, and hence the r value is not significantly different from zero; there is no relationship. The alternate hypothesis says that there is a linear correlation, that is, r is significantly different from zero. Attention: we always test whether the null hypothesis is rejected or not rejected. This is very, very important. We never accept or prove the alternate hypothesis directly; we always work to reject, or fail to reject, the null hypothesis, even though our research starts because we suspect the alternate. In our example with the salary and the age of a person, we could ask the question: is there a correlation between age and salary in the German population? To find out, we draw a sample and test whether the correlation coefficient is significantly different from zero in this sample. The null hypothesis is then: there is no correlation between salary and age in the German population. The alternate hypothesis is: there is a correlation between salary and age in the German population. When we test whether the Pearson's correlation coefficient is significantly different from zero based on the sample, we use the t test formula, where r is the correlation coefficient and n is the sample size.
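As a small illustration of that t formula, here is a sketch that converts a sample correlation into a t statistic and a p value using a t distribution with n minus 2 degrees of freedom; the r and n values are just example numbers.

import math
from scipy import stats

r = 0.45   # example sample correlation
n = 30     # example sample size

# t = r * sqrt(n - 2) / sqrt(1 - r^2)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# Two-sided p value from the t distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.2f}, p value = {p_value:.4f}")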
Again, I would say it's good to know the formula but not to get lost in it. A p value can be calculated from the test statistic t, and if the p value is smaller than the specified significance level, which is usually 5%, then the null hypothesis is rejected; otherwise it is not. In other words, if the p value is greater than 0.05, we fail to reject the null hypothesis. Now, what are the assumptions behind Pearson's correlation? Here we have to distinguish whether we just want to calculate the Pearson's correlation coefficient or whether we also want to test a hypothesis. To calculate the Pearson's correlation coefficient, we only need two metric variables. Metric variables are, for example, a person's weight, salary, electricity consumption and so on; in short, continuous variables. The Pearson's correlation coefficient then tells us how large the linear relationship is; whether there is a non linear relationship is something we cannot read from the Pearson's correlation coefficient. So if your data follows a curved pattern rather than a straight line, the Pearson coefficient may show no correlation even though a relationship exists. However, if we want to test whether the Pearson's correlation coefficient is significantly different from zero in the population, the two variables must also be normally distributed. If this assumption is not met, the calculated test statistic t and the p value cannot be interpreted reliably. In that case, Spearman's rank correlation is used instead; it means that for non normal data I am going to use Spearman's rank correlation. How do I calculate Pearson's correlation online, using Excel and other tools? I will be showing it to you shortly. 37. Point Biserial correlation: Let us now learn about point biserial correlation. I'll be covering the theory, an example, and how we can practically do this with an online calculator. Stay connected. What exactly is point biserial correlation? Have you heard about it before, or has your face turned puzzled? We mostly hear about linear regression and logistic regression, and when we learn about correlation, we think about simple correlation, positive correlation, negative correlation, and we usually only think about continuous variables on both the x axis and the y axis. So let's understand what point biserial correlation is. It is a special case of Pearson's correlation, and it examines the relationship between a dichotomous variable and a metric variable. The usual rule for correlation is that both variables should be continuous, or metric, but using point biserial correlation I can also include a dichotomous variable, a variable which can take only two values such as yes or no. Let's look at examples of dichotomous variables: gender, like male and female, and smoking status, like smoker and non smoker. Metric variables, on the other hand, are things like the weight of a person, the salary of a person, or electricity consumption. So if we have a dichotomous variable and a metric variable and we want to know if there is a relationship, we can use point biserial correlation. So let's understand the definition of it.
Point biserial correlation is a special type of correlation that examines the relationship between a dichotomous variable and a metric variable. Dichotomous variables are variables with exactly two values, and metric variables are continuous variables that can take many values, like height, weight, salary or power consumption. How exactly is the point biserial correlation calculated? It uses the same concept as Pearson's correlation, except that one of the variables is nominal with two categories. For example, let's say you are interested in investigating the relationship between the number of hours studied for a test and the result, that is, whether the person passed or failed. So I can see how many hours each person spent studying and whether it resulted in a pass or a fail. We have collected data for a sample of 20 students; 12 students passed and 8 students failed. We recorded the number of hours each student studied, and we assigned a score of one to the students who passed the test and zero to the students who failed. Now we can either calculate the Pearson's correlation of study time and test result directly, or we can use the equation for the point biserial correlation. In that equation, x bar zero is the mean study time of the people who failed, x bar one is the mean study time of the people who passed, n stands for the total number of observations, n one stands for the number of people who passed, and n two stands for the number of people who failed. Just like the Pearson's correlation coefficient r, the point biserial correlation coefficient r_pb also varies between minus one and plus one. With the help of this coefficient we can determine two things: how strong the relationship is, and in which direction the correlation goes, that is, whether it is a positive or a negative correlation. The strength of the correlation can be read from the table: if the absolute value is between 0.0 and 0.1, there is no correlation; between 0.1 and 0.3, a low correlation; between 0.3 and 0.5, a medium correlation; between 0.5 and 0.7, a high correlation; and between 0.7 and 1, a very high correlation. If the coefficient is between minus one and zero, it is a negative correlation, so a negative relationship exists between the variables. If the value is between zero and plus one, it is a positive correlation, so a positive relationship exists. And if the result is close to zero, we say there is no correlation. The correlation coefficient is usually calculated with data taken from a sample. However, we often want to test a hypothesis about the population, and because we cannot study the whole population, we use a sampling technique: we calculate the correlation coefficient of the sample data and then test whether it is significantly different from zero. The null hypothesis says that the correlation coefficient does not differ significantly from zero; there is no relationship. The alternate hypothesis says that the correlation coefficient differs significantly from zero; there is a relationship.
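For reference, here is a minimal Python sketch of that hours-studied versus pass/fail idea; the hours and pass/fail codes below are made-up illustration values, not the course's 20-student data set.

from scipy import stats

# Hours studied and test result coded as 1 = passed, 0 = failed (illustrative values)
hours = [2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12]
passed = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1]

# Point-biserial correlation is Pearson's r with one variable coded 0/1
r_pb, p_value = stats.pointbiserialr(passed, hours)
print(f"r_pb = {r_pb:.2f}, p value = {p_value:.4f}")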
When we compute the point biserial correlation, we get the same p value as when we compute an independent samples t test on the same data. So whether we test a correlation hypothesis with the point biserial correlation or a difference hypothesis with the t test, we get the same p value. What about the assumptions we have to consider for a point biserial correlation? Here we must again distinguish whether we just want to calculate the correlation coefficient or whether we also want to test a hypothesis. To calculate the correlation coefficient, we only need one metric variable and one dichotomous variable. However, if we want to test whether the correlation coefficient is significantly different from zero, the metric variable must also be normally distributed. If this is not given, the calculated test statistic and the p value cannot be interpreted reliably. We can use online calculators like DataTab to do the analysis, which I will cover now. We are on DataTab. I have populated some data with the number of hours studied and the test results, and I have coded pass and fail as one and zero. I can import my data using the import button and clear the table using the clear button, and there are settings to decide what kind of visuals you want. Now let's scroll down to correlation. Here, my nominal variable is the test result and my metric variable is hours studied. I can calculate Pearson, Spearman or Kendall; for now I'll keep it as Pearson. As soon as I selected the nominal variable as test result, the tool identified this as a point biserial correlation. The null hypothesis says there is no correlation between hours studied and test results; the alternate hypothesis says there is an association between the number of hours studied and the test results. For the point biserial correlation, fail takes the value of zero and pass takes the value of one. The point biserial correlation value r is 0.31, the degrees of freedom are 18, the t value is 1.4 and the p value is 0.179. I also have a box plot telling me that 50% of the students who passed studied between 8.5 and 19.25 hours, while the students who failed studied between 7 and 13 hours. I can download this by clicking on the Download PNG button. Now, how does the calculation work for the point biserial correlation? To calculate it, choose one metric variable and one nominal variable with two values. Before I go there, let me look at the summary in words. The point biserial correlation was run to determine the relationship between hours studied and test results. There is a positive correlation between hours studied and test results, but it is not statistically significant because the p value is greater than 0.05. I also have another data set where gender is coded as zero and one, and the question is whether there is a correlation between salary and gender. We can see very clearly that males in this sample have a noticeably higher salary compared to females, but if you look at the p value, it is close to 0.05 yet still 0.07, so we fail to reject the null hypothesis; the observed difference may simply be due to sampling error. 38. Logistic Regression: Welcome to the next lesson on logistic regression. Let's understand the theory, an example, and how we do the interpretation.
When do we use logistic regression? Let's take an example. Suppose we want to check whether an older person is more likely to suffer from cancer, whether males or females are more likely to get a disease, or whether smoking makes the disease more likely. When I want to check multiple variables that may influence whether the disease occurs, and estimate the probability of having the disease, logistic regression is the tool. So let's dive deeper. What exactly is regression? A regression analysis is a method of modeling the relationship between variables. It makes it possible to infer or predict one variable, for example whether a customer is happy or unhappy, based on one or more other variables. The variable we want to infer or predict is called the dependent variable, or criterion, and the variables we use for prediction are called independent variables, or predictors. What is the difference between linear regression and logistic regression? In linear regression, the dependent variable is a metric variable, for example salary or electricity consumption; it is a continuous variable. In logistic regression, the dependent variable is a dichotomous variable, a variable with only two values, for example whether a person buys a particular product or not, or whether a disease is present or not. How can logistic regression be used? With the help of logistic regression, we can determine what has an influence on whether a certain disease is present or not. We could study the influence of age, gender and smoking status on that disease. In this case, zero stands for not diseased and one stands for diseased, and the probability that the characteristic is present, that is, equal to one, is estimated. Our data set might look something like this, where my independent variables are age, gender and smoking status, and my dependent variable is a column made up of zeros and ones. We could now investigate what influence the independent variables have on the disease, and if there is an influence, we can predict how likely a person is to have the disease. Now, of course, the question arises: why do we need logistic regression in this case? Why does linear regression not work? Let's do a quick recap of linear regression. In linear regression, the regression equation is y equals b1 x1 plus b2 x2 plus b3 x3, and so on up to bn xn, plus a constant. We have the dependent variable y, the independent variables x1, x2, x3 up to xn, and the regression coefficients b1, b2 up to bn. However, when the dependent variable can only be zero or one, the scatter plot shows a lot of dots on the zero line and a lot of dots on the one line, with nothing in between. No matter what values the independent variables take, the outcome is always zero or one. A linear regression simply puts a straight line through the points, and we see that there is a lot of error. We can also see that a linear regression can produce predicted values anywhere between plus and minus infinity.
And hence, this formula does not work. What's the solution? The goal of logistic regression is to estimate a probability of occurrence, so the range of the prediction should be from 0 to 1. We therefore need a curve that fits between those two lines rather than a straight diagonal, a function that only produces values between zero and one. That is exactly what the logistic function does: no matter where you are on the x axis, between minus infinity and plus infinity, the result on the y axis always lies between zero and one, and that is exactly what we want. The equation of the logistic regression looks like this. The logistic function is combined with the linear regression. Let's recall the linear part one more time: z equals b1 x1 plus b2 x2 and so on, plus the constant. This expression is now inserted into the logistic function, so the probability that the dependent variable equals one is 1 divided by 1 plus e to the power of minus z. What does this look like in our example? The probability that a person is diseased is equal to 1 divided by 1 plus e to the power of minus the quantity b1 times age plus b2 times gender plus b3 times smoker plus the constant a. It is a function of age, gender and smoking status. The coefficients b1, b2 and b3 are determined so that the model best fits the given data. The method used to solve this is called the maximum likelihood method, and there are good numerical procedures to solve it efficiently. But how do you interpret the results of a logistic regression? Let's take a look at some fictitious numbers: age, gender, smoking status and disease. A 22 year old female non smoker is diseased, a 25 year old female smoker is diseased, an 18 year old male smoker is not diseased, and so on. When we put this into an online statistical calculator, go to regression, select the dependent variable and the independent variables, and choose which outcome we are predicting, diseased or not diseased, the tool performs the logistic regression for us. So, to calculate a logistic regression, we click on the regression tab, copy our data in, and the variables are shown below. Depending on the type of dependent variable, an online statistical calculator like DataTab will calculate either a logistic regression or a linear regression under the regression tab. We choose diseased as the dependent variable and age, gender and smoking status as the independent variables, and the calculator fits the logistic regression equation for us. Now let's go through all the tables slowly, starting from the top. If you do not know how to interpret the results, there is a button called summary in words, and you can copy the results into Word or Excel, including the classification table. So let's start.
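Before walking through the calculator's output, here is a minimal Python sketch of the logistic function itself; the coefficients are invented placeholders just to show the shape of the calculation, not the fitted values from the lesson.

import math

def predicted_probability(age, is_male, is_smoker):
    # z is the linear part: b1*age + b2*gender + b3*smoker + constant
    # These coefficients are made-up illustration values.
    b1, b2, b3, a = 0.05, 0.9, 1.3, -3.0
    z = b1 * age + b2 * is_male + b3 * is_smoker + a
    # The logistic function maps z from (-inf, +inf) into (0, 1)
    return 1 / (1 + math.exp(-z))

print(f"P(diseased) = {predicted_probability(55, is_male=0, is_smoker=1):.2f}")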
The first thing displayed in the results table is that a total of 36 cases, that is 36 people, have been examined, and 26 of them have been correctly estimated, which is 72.22 percent. With the help of the calculated regression model, 26 out of 36 cases have been correctly assigned, that is about 72 percent. Now let's go to the classification table below; you have an option to export it to Word or Excel. Here you can see how often the categories not diseased and diseased are observed and how often they are predicted. The observed versus predicted counts are 11, 5, 5 and 15. When in reality the person is not diseased and the model also predicts not diseased, and when in reality the person is diseased and the model predicts diseased, both predictions are correct; these are the true negatives and true positives. But we also have the concepts of false positives and false negatives. When in reality the person is not diseased but the model says diseased, that is a false positive, which is less serious because the person can go for a second opinion and will be careful. The real concern is the false negative: in reality the person is diseased, but the model fails to detect it, so these five patients would miss their treatment if they do not go for further diagnosis. In total there are 16 not diseased observations, 11 plus 5. Out of these 16, the regression model correctly classified 11 as not diseased and incorrectly classified 5 as diseased. Out of 20 diseased individuals, 15 were correctly classified as diseased and 5 were incorrectly classified as not diseased. To be noted: for deciding whether a person is diseased or not, a threshold of 50% is used. If the estimated probability is greater than 50%, the person is assigned to diseased; otherwise, to not diseased. Let's come to the chi square test; we have a detailed video on chi square. The chi square value is 8.79, the degrees of freedom are 3, and the p value is 0.032. Remember: if p is low, the null must go. Here we can read whether the model as a whole is significant or not, and the answer is yes. There are two models being compared: in one model, all the independent variables are used, and in the other, none of them are. With the help of the chi square test, we compare how good the prediction is with the independent variables against how good it is without them, and the test tells us if there is a significant difference between the two. The null hypothesis is that both models are the same. The p value is less than 0.05, which means the null hypothesis is rejected, so we assume there is a significant difference between the models, and thus the model as a whole is significant. Next comes the model summary. In this table you see, on one hand, the minus two log likelihood value and, on the other hand, different coefficient of determination, or R squared, values; you can easily export it to Word or Excel. The minus two log likelihood is 40.67 and the Cox and Snell R squared value is 0.22, and the other values are also displayed. The R squared is used to find out how well the regression model explains the dependent variable.
In linear regression, the R squared indicates the proportion of the variance that can be explained by the independent variables; the more variance that can be explained, the better the regression model. In the case of logistic regression, however, the meaning is different. There are different ways of calculating an R squared for logistic regression, and unfortunately there is no agreement yet on which is the best. The R squared according to Cox and Snell is 0.22, according to Nagelkerke it is 0.29, and so on. And now comes the most important table, the table with the model coefficients. The most important parameters here are the coefficient B, the p value and the odds ratio. We can see that the p value for gender is greater than 0.05, which means gender is not a significant contributing factor for the disease. In the first column we can read the coefficient values, roughly 0.04 for age, 0.87 for gender, 1.34 for smoker and -2.73 for the constant, and we can insert those values in place of b1, b2 and b3. When we insert the coefficients, we get the equation: probability equals 1 divided by 1 plus e to the power of minus the quantity 0.04 times age plus 0.87 times gender plus 1.34 times smoker minus 2.73. With this we can now calculate the probability that a person is diseased. Suppose we want to know how likely it is that a 55 year old female smoker is diseased. We replace age with 55, gender with zero because she is not male, and smoker with one, and then calculate. When we do this calculation, the probability comes out at 0.69, meaning there is a 69% likelihood that a 55 year old female smoker is diseased. Based on this prediction, it could now be decided whether or not to investigate further. The example is purely imaginary; in reality there could be many other factors and different independent variables, like the weight of the person and much more, to determine whether the person is diseased or not. Now let's come back to the table. In the p value column, we can read whether each coefficient is significantly different from zero. The null hypothesis being tested is that the coefficient is zero in the population. If the p value is smaller than 0.05, the corresponding coefficient has a significant influence. In our example, none of the coefficients has a significant impact, as all the p values are greater than 0.05. Now let's look at the odds ratios, which are 1.04, 2.39 and 3.81. The odds ratio of 1.04 for age means that for a one unit increase in age, the odds that the person is diseased increase by a factor of 1.04. And we can see that for smoker, the odds ratio is quite high. With that, we come to the end of the logistic regression theory; we will see you in the practical session. Thank you. 39. Logistic Regression practice: We will use an online calculator to do regression analysis, specifically logistic regression, in this video. I have uploaded a separate video on how you can do this analysis using Excel. So let's continue with the online statistical calculator. I can import my data by clicking on the import button and dropping Excel files, CSV files or DataTab files, or I can click on Browse and load my data.
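If you would rather reproduce this kind of model in code instead of the browser tool, here is a minimal sketch using statsmodels; the column names and the tiny data frame are assumptions for illustration only, not the lesson's 36-case data set, and a real analysis would need far more rows.

import pandas as pd
import statsmodels.formula.api as smf

# Tiny illustrative data set (0 = no, 1 = yes); real data would have many more rows
df = pd.DataFrame({
    "diseased": [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    "age":      [55, 30, 62, 45, 28, 35, 60, 50, 40, 48],
    "male":     [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "smoker":   [1, 1, 0, 0, 0, 1, 1, 0, 1, 1],
})

# Logistic regression fitted by maximum likelihood
model = smf.logit("diseased ~ age + male + smoker", data=df).fit()
print(model.summary())   # coefficients, p values, log likelihood, etc.

# Predicted probability for one hypothetical person
print(model.predict(pd.DataFrame({"age": [55], "male": [0], "smoker": [1]})))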
So I have already loaded my data, which you can see on the screen: whether a person is diseased or not, age, gender and smoking status. We can see that the data types have been automatically identified by the statistical calculator: age is a metric variable, while gender, smoking status and diseased are nominal. Now I click on regression and scroll down. I can do simple linear regression, multiple linear regression and logistic regression. What are my variables? At first I select age, gender and smoking status as if they were the dependent variables, but am I selecting the right thing? No. I need to ask: what is my y? My y is whether the person is diseased or not, and my independent variables are age, gender and smoking status. For the reference category of gender I take male as one, for smoking status I take smoker as one, and the model predicts whether the person is diseased or not. Now I can click on summary in words and it gives a proper write-up of the analysis. It clearly says that a logistic regression analysis has been performed to examine the influence of age, gender and smoking status on the variable diseased, predicting the value diseased. The model shows a chi square of 8.79 with three degrees of freedom, a p value of 0.032, and 36 observations. The coefficient of the variable age is B equal to 0.04, which is positive. This means that an increase in age is associated with an increase in the probability of the dependent variable being diseased. However, the p value of 0.092 indicates that this influence is not statistically significant. The odds ratio is 1.04, indicating that for a one unit increase in age, the odds of being diseased increase by a factor of 1.04. The coefficient of the variable gender female is B equal to -0.87. Because this coefficient is negative, it means that when gender takes the value female, the probability of being diseased decreases. However, the p value of 0.28 indicates that this influence is not statistically significant. The odds ratio is 0.42, meaning that for gender female the odds of being diseased change by a factor of 0.42. The coefficient of the variable smoking status non smoker is B equal to -1.32, which is negative, meaning that if the smoking status is non smoker, the probability of being diseased decreases. However, the p value of 0.089 indicates that this influence is not statistically significant. The odds ratio is 0.26, meaning that for non smokers the odds of being diseased change by a factor of 0.26. Now let me switch the reference to non smoker and the predicted category to no disease and look at the summary again. We find that the analysis changes slightly: the signs of the coefficients flip and the odds ratios change. For example, the odds ratio for age becomes 0.96, indicating that with each additional year the odds of being not diseased decrease, because now we are targeting not diseased. So you should be careful about what you are taking as the reference.
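A quick numeric sanity check on that reference flip: when you swap which outcome counts as the target, the coefficient changes sign and the odds ratio becomes its reciprocal. A tiny sketch, using the age coefficient from the lesson as the example value:

import math

b_age = 0.04                      # coefficient with "diseased" as the target
or_diseased = math.exp(b_age)     # about 1.04: odds of disease per extra year

# Flipping the target to "not diseased" flips the sign of the coefficient
b_age_flipped = -b_age
or_not_diseased = math.exp(b_age_flipped)   # about 0.96, i.e. 1 / 1.04

print(f"OR(diseased) = {or_diseased:.2f}, OR(not diseased) = {or_not_diseased:.2f}")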
If your hypothesis is that male people are more likely to be diseased, then when you take gender as male with not diseased as the target, the B value is -0.87, which suggests that the odds of a male being not diseased are lower. But if I look at diseased instead, you will find that this coefficient is now positive, and smoker is also positive. So we should be clear about which target category we want to study. Now let's come down to the results, and I even have an AI interpretation to help me. The first table summarizes the overall performance of the binary logistic regression model. The total number of cases is 36, which is the total number of observations the model was tested on; in this context, the number of individuals for whom the model attempted to predict the outcome, diseased or not diseased. The correct assignments are 26: out of 36 cases, the model predicted the outcome correctly for 26 of them. These correct predictions include both true positives, correctly identifying a diseased person, and true negatives, correctly identifying a person without the disease. In percentage terms that is 72.22%. This is the accuracy of the model: the number of correct assignments, 26, divided by the total number of cases, 36, multiplied by 100 to get the percentage. It tells us how often the model makes the right prediction. Now let's understand the classification table, where the cases are classified, and I can take help of the AI interpretation to understand it. The true negatives are 11 cases where the model correctly predicted that the person is not diseased. The false positives are 5 cases where we made a type one error. The false negatives are 5 cases where we incorrectly predicted not diseased, a type two error. The true positives are the 15 cases correctly predicted as diseased. For the correctness of the predictions: the correct prediction rate for not diseased is 68.75%, the proportion of actual not diseased cases correctly identified; the correct prediction rate for diseased, also called sensitivity, is 75%, the proportion of actual diseased cases correctly identified; and the total accuracy is 72.22%, the proportion of all predictions, diseased or not diseased, that were correct. Now let's understand the chi square test. The beauty of this statistical calculator is that it gives you an AI interpretation, so I don't have to go to ChatGPT for it. The table shows the results of the chi square test associated with the binary logistic regression model. This test is often used to assess the overall significance of the model. Here is the interpretation of each component. The chi square statistic is 8.79 in our case. It measures the difference between the observed and expected frequencies of the outcome; a higher chi square value indicates a greater discrepancy between expected and observed values, suggesting that the model's predictors have a significant relationship with the outcome. The degrees of freedom are three, representing the number of predictors in this logistic regression. The p value is the probability of observing a chi square statistic as extreme as the one observed if the null hypothesis were true.
The null hypothesis is that there is no relationship between the predictors and the outcome, that is, the model with predictors is no better than one without them. The p value is 0.032, suggesting that there is only about a 3.2% probability of observing a chi square statistic this extreme if the null hypothesis were true. Since 0.032 is below the 0.05 threshold, the result is statistically significant. Now let's look at the model summary. The minus two log likelihood is 40.67; it measures the model's fit, and the lower the value, the better the model fits the data, relative to a saturated model with a perfect fit. This number alone does not tell us much, so we compare it with other measures. The Cox and Snell R squared value is 0.22. This is a pseudo R squared measure that indicates the amount of variation in the predicted variable explained by the model; it ranges from 0 to 1, and a value of 0.22 indicates that about 22% of the variance is explained by the model. However, it is worth noting that this measure never reaches one, even for a perfect model. The Nagelkerke R squared is 0.29; it adjusts the scale so that a perfect model can reach one. Still, only about 29% of the variation is explained by this model, which suggests you would need to include more variables to explain the outcome better. Next we get the model coefficients: for each predictor, the coefficient estimate, standard error, z value, p value, odds ratio and 95% confidence interval. Let's do the interpretation. The constant is -2.73, the predicted log odds when all predictors are zero, which corresponds to an odds ratio of about 0.07, suggesting low odds of the outcome at the reference values. With every one year increase in age, the odds that the person is diseased increase by a factor of 1.04, that is, roughly a 4% increase in the odds. If the gender is male, the coefficient of 0.87 corresponds to an increase in the odds, and so on. Let's do a prediction. If the person's age is 45, the person is male and a smoker, what is the probability that the person is diseased? It is 0.81. Is that more than 50%? Yes, so the model classifies this person as diseased. But if the person is female, the probability decreases, and if the person is also a non smoker, there is a much lower probability that the person is diseased. Now we move on to the next example, where we are trying to check whether a person will purchase a product or not, and the variables are gender, age and the time they spent online. I click on regression, select gender, age and time spent online as independent variables, and purchasing behavior as my dependent variable. This time there are three possible outcomes, not two like last time: buy now, buy later, and don't buy anything. The reference category for gender is female, and let's go to the summary. The logistic regression analysis performed here examines the influence of gender male, age and time spent online on the variable purchasing behavior for the value buy now. The analysis shows that the model as a whole was significant, with 24 observations. The coefficient of the variable gender male is 1.53, which is positive. This means that when the gender is male, the probability of buy now increases. The p value is 0.201, indicating that this influence is not statistically significant.
The odds ratio is 4.63, meaning that when the gender is male, the odds of buy now increase by a factor of 4.63. The coefficient of the variable age is B equal to -0.11, which is negative. This means that an increase in age is associated with a decrease in the probability of buy now. However, the p value is 0.07, indicating that this influence is not statistically significant. The odds ratio is 0.9, indicating that with every additional year of age, the odds of buy now are multiplied by 0.9. The coefficient of the variable time spent on the online shop is B equal to -0.02, which is negative, meaning that the more time spent online, the lower the probability of buy now. The p value is 0.56, indicating that it is not statistically significant, and the odds ratio of 0.98 means every extra unit of time multiplies the odds by 0.98. There are 24 cases, of which 17 were correctly predicted, that is about 71%. Now let's go to the classification table so we can understand the type one and type two errors. There are 13 true negatives, cases correctly predicted as not going to buy. There are 3 false positives, cases incorrectly predicted as buy now when in reality they did not buy. There are 4 false negatives, people who actually bought but the model said they would not buy. And 4 cases were correctly predicted as buy now. The correct prediction rate for not buying is about 81%, the correct prediction rate for buy now is 50%, and the total accuracy is about 71%. If we look at the chi square test, we get a p value of 0.042. The null hypothesis is that there is no relationship between the predictors and the outcome predicted by the model, and a p value of 0.042 is below the conventional 0.05, so the model is statistically significant. If I go to the model summary, we can see that the R squared values are very low. Now let's do a prediction. If the person is male, 45 years old, and the time spent online is 2, there is not much probability that the person will buy. But if the person is 20 years old, the probability increases, so we can understand that younger people in this data are more willing to buy than older people. If we take an 80 year old person, the probability drops to about 0.01. So I hope you learned how to do logistic regression in this video. Thank you. 40. ROC Curve: Let's understand the ROC curve. We just completed learning about logistic regression, and one of the ways to validate the accuracy of such a model is the ROC curve. Let's understand the theory with examples. ROC stands for receiver operating characteristic. It is a graphical way of representing the performance of a binary classification model, such as a logistic regression model, across different classification thresholds. Let's understand with an example. Assume that we are performing a screening test on patients to identify whether each patient is healthy or diseased. For this classification, a blood value is measured for each patient, and a threshold is chosen: anybody below that threshold will be called healthy and anybody above the threshold will be called diseased. We have a sample of ten patients.
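To make the threshold idea concrete, here is a tiny Python sketch of classifying patients by a cut-off on the blood value; the numbers are invented, chosen only so that a threshold of 45 reproduces the rates discussed below.

# Invented blood values and true status (1 = diseased, 0 = healthy)
blood_values = [20, 28, 35, 48, 52, 42, 47, 55, 60, 66]
truth        = [0,  0,  0,  0,  0,  1,  1,  1,  1,  1]

threshold = 45
# Everyone at or above the threshold is classified as diseased
predicted = [1 if value >= threshold else 0 for value in blood_values]
print(predicted)   # 1 means "classified as diseased"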
Now, how do we decide what that threshold should be, so that we can predict whether a future patient is diseased? Let's say we have a sample of ten people with their blood values. We see that most of the diseased people have higher blood values and most of the healthy people have lower blood values. So we decide to put the threshold at 45: anybody below 45 we classify as healthy, and anybody above 45 we classify as diseased. Now we can see that there are certain issues here, so let's understand them in detail. In this case, six people have been classified as diseased; four of them are correctly classified, but two of them are actually healthy. Of the five truly diseased people, we have correctly identified four, and this proportion is called the true positive rate, also called sensitivity. On the other hand, of the four people classified as healthy, one is actually a diseased person whom we have missed, and three are correctly classified as healthy. The proportion of healthy people wrongly classified as diseased is called the false positive rate, FPR, which is one minus the specificity. With the threshold at 45, we get a true positive rate of 4 out of 5, that is 80%, and a false positive rate of 2 out of 5, that is 40%. So what exactly is the TPR, the true positive rate? The true positive rate is the true positives divided by the true positives plus the false negatives. The true positives are the diseased persons correctly classified as diseased; we correctly classified four of them. The false negatives are the diseased persons incorrectly classified as healthy; we made that mistake with one person. So the true positive rate is 4 divided by 4 plus 1, which is 4 out of 5. The reason we need to know the TPR is that it tells us what percentage of diseased people would go without being treated: here we are correctly classifying 80% of the diseased people we tested, so 20% of them might miss treatment. Now let's understand the FPR, the false positive rate. The false positives are the healthy individuals misclassified as diseased, and the true negatives are the healthy individuals correctly classified as healthy. Two healthy people have been incorrectly classified as diseased, and would be started on treatment, out of a total of five people who were actually healthy. So 2 out of 5, that is 0.4 or 40%, is the FPR. Now, how do we calculate the TPR and FPR for each threshold? Should I put the threshold at 38, or at 65, and so on? We calculate the TPR and FPR for each possible threshold. These are precisely the two values that get plotted on the ROC curve: the true positive rate is plotted on the y axis and the false positive rate is plotted on the x axis, and we can see how the two trade off against each other at 0.2, 0.4, 0.6, 0.8 and 1.
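Here is a minimal sketch of sweeping the threshold and collecting the (FPR, TPR) points that make up the ROC curve, using the same invented ten-patient values as above.

blood_values = [20, 28, 35, 48, 52, 42, 47, 55, 60, 66]
truth        = [0,  0,  0,  0,  0,  1,  1,  1,  1,  1]   # 1 = diseased

def roc_point(threshold):
    tp = sum(1 for v, t in zip(blood_values, truth) if v >= threshold and t == 1)
    fn = sum(1 for v, t in zip(blood_values, truth) if v <  threshold and t == 1)
    fp = sum(1 for v, t in zip(blood_values, truth) if v >= threshold and t == 0)
    tn = sum(1 for v, t in zip(blood_values, truth) if v <  threshold and t == 0)
    return fp / (fp + tn), tp / (tp + fn)   # (FPR, TPR)

# One ROC point per candidate threshold, from "classify everyone as diseased"
# (very low threshold) to "classify everyone as healthy" (very high threshold)
for thr in [0, 30, 40, 45, 50, 58, 80]:
    fpr, tpr = roc_point(thr)
    print(f"threshold {thr:>2}: FPR = {fpr:.1f}, TPR = {tpr:.1f}")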
Now, let's draw the complete ROC curve for our example. If we choose the threshold value to be very small, that is, push it all the way to the left, we correctly classify all five diseased individuals, but we misclassify all five healthy individuals as well. Hence, the true positive rate is five out of five, which is one, and the false positive rate is also five out of five, which is again one. For that reason, the first data point is at (1, 1). As we push the threshold up a little, we still correctly classify all five diseased individuals, but now only four of the five healthy individuals are classified as diseased. So at the next data point the true positive rate is still five out of five, that is 1, and the false positive rate is four out of five, that is 0.8. At the next threshold, we still correctly classify all five diseased people, but fewer healthy individuals are misclassified: three of the five healthy people are misclassified as diseased, so the false positive rate is 3/5, that is 0.6. That is the third data point. At the next threshold, a diseased person is misclassified as healthy for the first time. This is the point where we see a dip in the true positive rate from 1 to 0.8: the true positive rate is four out of five, that is 0.8, and the false positive rate is three out of five, that is 0.6. We can now do the same for all the other thresholds, and accordingly we draw our ROC curve. At one of these points, for example, 80 percent of the diseased individuals are correctly classified as diseased and 20 percent of the healthy individuals are incorrectly classified as diseased. Using the ROC curve, we can compare different classification methods: the higher the curve lies, the better the classification model, and therefore the larger the area under the curve, the better the model. This is precisely what is reflected by the AUC, the area-under-curve value, which is used when evaluating classification models such as logistic regression. The AUC value varies between 0 and 1; the larger the value, the better the model. What about the ROC curve and logistic regression? For example, we could build a classification model using logistic regression. Here, we could use additional variables like the blood value, age, and gender of each person and try to predict whether the person is healthy or diseased. In a logistic regression, the estimated value is the probability that a particular person is diseased. Very often, 50 percent is simply taken as the threshold to classify whether a person is diseased or not, but of course that is not always the right choice, so you cannot always take the threshold as 50 percent. Therefore, even with logistic regression, we construct the ROC curve for different threshold values and see which threshold gives us the best balance between the true positive rate and the false positive rate. So how can I get the ROC curve online?
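The same sweep over all thresholds can be automated. This is a minimal sketch assuming scikit-learn and matplotlib are installed; it reuses the same invented blood values from the previous sketch and plots the ROC curve together with its AUC.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Same illustrative data as before: blood value as the score, 1 = diseased.
blood  = [20, 25, 30, 35, 47, 50, 52, 60, 68, 75]
status = [ 0,  0,  0,  1,  0,  0,  1,  1,  1,  1]

fpr, tpr, thresholds = roc_curve(status, blood)   # one (FPR, TPR) pair per threshold
auc = roc_auc_score(status, blood)

plt.plot(fpr, tpr, marker="o", label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```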
So now let's understand how we can do this ROC calculation using data. I have populated data values for about 40 people, with different blood levels and whether each person is diseased or not. I can run the ROC analysis directly: I set the state variable to diseased, with the state value yes or no, and the test variable to blood value. Immediately we get the ROC curve, and it shows us the sensitivity and specificity at each level. Sensitivity is nothing but the true positive rate: how many of the diseased people have I classified correctly? Specificity, on the other hand, tells us how many of the healthy people have been correctly classified as healthy. In this data, 19 people are diseased and 22 are not diseased; at the lowest cutoff shown, positive if the value is greater than or equal to one, the sensitivity is one, and the table shows this across the entire range of the data. We can also load some sample data. I can also find this under the regression module. So I go to regression and set diseased as my dependent variable and blood value as my independent variable. The summary in words: a logistic regression analysis has been performed to examine whether the blood value is able to predict the value yes for the variable diseased. The logistic regression analysis shows a chi-square value of 5.23 and a p-value of 0.02. The null hypothesis is that there is no influence of the blood level on disease, and because the p-value is low, we reject the null hypothesis. The coefficient of the blood value, b, is 0.03, which is positive. It means that an increase in the blood value is associated with an increase in the probability that the dependent variable is yes. The p-value (0.03) indicates that this influence is statistically significant. The odds ratio is 1.03, indicating that a one-unit increase in the blood value increases the odds of the dependent variable being yes by a factor of 1.03. So when we build the logistic regression, we can read from the summary that the p-value tells us the blood value has a significant relationship with being diseased. The table summarizes that out of the 41 cases which were observed for building the model, in this context the number of individuals who were predicted as either diseased or healthy, 28 out of 41 were correctly classified: diseased individuals classified as diseased and healthy individuals classified as healthy. The percentage is 68.29; it is the number of correctly classified people, 28, divided by 41 and multiplied by 100 to get a percentage. It tells you how often the model makes the right prediction, whether that prediction is the presence or the absence of disease. This is called a classification table. It shows the people who are actually not diseased and were correctly predicted as not diseased, and the people who are diseased but were predicted as not diseased. These eight are my concern. Why? Because these are the people who will not go for their treatment. And five of them have been classified as diseased when, in reality, they were not suffering. We then build the ROC curve, and the AUC, the area under the curve, is 0.699; the higher the curve, the better the model. Out of 41 cases, the correct assignment has happened for 28 cases and the incorrect assignment for 13 cases, so about 68 percent of the people were correctly classified. Now, let's look at the AI interpretation. The AI interpretation very clearly explains the model fit in terms of the -2 log likelihood.
The lower the value, the better the model. Here, the value is 51.39, which measures how far the model is from a saturated model, that is, a model with a perfect fit. The number alone does not tell us much; we need to compare it with other models. Now, let's do the interpretation of the model. The table shows that we have done a binary logistic regression analysis, which looks at how the predictors influence the likelihood of a particular outcome. The components: the coefficient B represents the effect of each predictor; a positive coefficient increases the odds, or the log odds, of the outcome, and a negative coefficient decreases it. The standard error measures the standard deviation of the estimated coefficient, in other words, how precisely the model estimates the coefficient value. The z-value is the z-score, calculated as the coefficient divided by the standard error; it is used to test the null hypothesis that the coefficient is zero. The p-value indicates the probability of observing the data, or something more extreme, if the null hypothesis is true; a lower p-value suggests that the null hypothesis of no effect is less likely. Interpretation: the model predicts the baseline log odds as -1.31 when all the predictors are zero. The corresponding odds ratio is 0.27, suggesting lower odds of the outcome when all the predictors are at their reference values; the blood value coefficient then shifts those odds as the blood value increases. Now, let's do the prediction. If my blood value is 85, then there is a 75 percent probability that I am suffering from the disease. I also get to see the ROC curve, and the area under the curve is 0.699. 41. Understand the Non Normal Data: Normal or not? Let us try to understand how we work when the data is not normal. But even before getting there, let me introduce you to this gentleman. Any guesses who the gentleman is? You can type in the chat window if you know, and even if you do not know, that's perfectly fine; there are no penalty points for wrong guesses. Yes, some of you have guessed it right. He is the famous person behind our normal distribution, Mr. Carl Gauss, the great mathematician. He was the person who came up with the concept of the Gaussian distribution, or the normal distribution. So here is the brain behind the concept of the normal distribution and all the parametric tests that we use. If my data is not normal, then it can be skewed: it could be negatively skewed or positively skewed. Negatively skewed technically means having a tail on the left side; positively skewed means a tail on the right side. Either way, my data is not behaving in a normal way. My data can also be non-normal because it follows a uniform, or flat, distribution; then too it is not following the normal distribution. My data can have multiple peaks, which represents that there are multiple data groups in my dataset, and that is not normal behavior either. Because my data can have all these features, I need to treat it differently when I am doing my hypothesis testing. And why is the data not normal? It could be because of the presence of some outliers, because of the skewness of the data, or because of the kurtosis present in the data. So the reason for your data not behaving in a normal way could be one of these. Let us summarize: what did we learn?
My data is not normal if the distribution is skewed, if it is not unimodal but in fact bimodal or multimodal, if it is a heavy-tailed distribution containing outliers, or if it is a flat distribution like a uniform distribution. These are some basic reasons why data does not behave in a normal way. If it is not a normal distribution, there are several other distributions it might follow: the exponential distribution, which models the time between events; the log-normal distribution, which says that if I apply the logarithm to the data, then the data will follow a normal distribution; the Poisson distribution; the binomial distribution; and the multinomial distribution. Let us understand some examples, real-life scenarios where the non-normal distributions can be applied. Whenever I am trying to predict the number of events over a fixed time interval, I use the Poisson distribution for my analysis and hypothesis testing. Some examples of the Poisson distribution: the number of customer service calls received in a call center, the number of patients that present at a hospital emergency room on a given day, the number of requests for a particular item in an online store in a given day, the number of packages delivered by a delivery company in a given day, the number of defective items produced by a manufacturing company in a given day. If you observe, there is a common behavior here: whenever we are counting something over a particular time period, whether a given day, a given week, or a given month, we prefer to do our analysis using the Poisson distribution. Some examples of the log-normal distribution: the size of file downloads from the internet, the size of the particles in a sediment sample, the height of trees, the size of financial returns, the size of insurance claims. If I take the example of financial returns from investments, you might see that out of my portfolio of investments, some investments gave me a very good return of 100 percent, 150 percent, or 80 percent, and you will also see that some part of my portfolio resulted in a zero return or a negative return because I am in a loss. But overall my portfolio is giving me a return of 12 to 15 percent, or 15 to 20 percent. So the distribution of returns is technically not a normal distribution: you have very low returns and very high returns. But if you apply the logarithm to the data, then it behaves like a normal distribution, and overall your portfolio results in a return of some X percentage. Something similar applies to insurance claims. Let us try to understand the application of the exponential distribution: the time between arrivals of customers in a queue, the time between failures of a machine in your factory, the time between purchases in a retail store, the time between phone calls in a contact center, the time between page views on a website. Now, if you compare the Poisson distribution and the exponential distribution, there is one common element. What is the common element? We are studying something with reference to time; when you use a normal distribution, it is not with reference to time. So these are some applications. But the difference between a Poisson and an exponential distribution is that the Poisson distribution counts events within a fixed period, on a given day, in a given week, or in a given month.
Here, on the other hand, we are trying to understand the time between two events. What is the time gap between the two events? That is where the exponential distribution can help you. Next, let's understand the application of the uniform distribution, for example the heights of the students in a class, or the sizes of packages in a delivery truck: some packages are very big and some are small, and if you put them in a distribution you will find that it is flat, or uniform, because for each category of package you have approximately the same number of goods being delivered. Other examples are the distribution of test scores for a multiple-choice exam, the distribution of waiting times at a traffic light, and the distribution of the arrival times of customers at a retail store. If you look at all these examples following a uniform distribution, it is not a bell curve, because people arrive at the retail store continuously; there is no sudden peak. Then there are the real-world scenarios of heavy-tailed distributions, which means distributions where outliers are present: the size of financial losses in the insurance industry, or financial losses in general. If you ask a trader, they would see some extremely high and some extremely low numbers. Another example is the size of extreme rainfall events: we do not have extreme rainfall every year, so when it does occur it shows up as an outlier. Heavy-tailed distributions are usually driven by the presence of outliers, so if your data has outliers, you may well find that it follows a heavy-tailed distribution. We will understand in the next session what type of non-parametric test should be performed, depending upon the type of non-normal data that we are studying. The size of power consumption spikes and the size of economic fluctuations such as a stock market crash are also examples of heavy-tailed distributions. Examples of bimodal data: here you need to understand that bimodal means there are two distinct groups, or two peaks, in the data we are trying to study. The distribution of exam scores of students who studied and those who did not, the distribution of ages of individuals in a population who come from two distinct age groups, the heights of two different species, the salary distribution of employees from two different departments, or car speeds on a highway with two groups of slow and fast drivers. Here you can see that I have two groups of data which are different, and I am trying to understand their behavior, or I will go ahead and investigate them as part of my hypothesis testing or the research that I am doing. If I have more than two different groups, like three or four different groups, then it becomes a multimodal distribution. So I think by now you have an idea of the different distributions which are not normal distributions. So how do I determine if my data is not normal? The first thought that comes to mind is a normality test. But even before doing a normality test, you can use simple graphical methods to find out if your data is normal or not. You can use a histogram, and here the histogram is clearly showing multiple modes, so I can clearly see that this is not a normal distribution. If I try to put a fit line on it, I can also see that there is skewness in my data. I can also use a box plot to determine if my data is not normal.
So here you can see that I have a heavy tail on the left side, telling me that my data is skewed. I can also have outliers, which a box plot can easily highlight, so I can identify a heavy-tailed distribution using the box plot as well. I can use simple descriptive statistics, where I look at the mean, median, and mode: when these numbers are not close to each other, that also indicates that my data is not normal. I can look at the kurtosis and skewness of my data distribution and then come to a conclusion about whether my data is behaving normally or not. So I have shown you several ways of identifying whether your data follows a normal distribution or a non-normal distribution. Now I would say one more thing: do not kill yourself if your mean is 23.78, your median is 24, and your mode is 24.2 or 24. If there is only a slight deviation, we still consider the data to be normal. Skewness close to zero is an indication that my data is normal, but if my skewness is beyond minus two or plus two, it is definitely evidence of non-normality. Kurtosis is one more way of identifying whether my data follows a normal distribution. Most of the time we prefer the kurtosis number to be between 0 and 3. If your kurtosis is negative, it means you have a flat curve, or the data follows a uniform distribution, or it could be a heavy-tailed distribution; a very high kurtosis could also be an indication that your data is too perfect, and maybe you need to investigate whether someone has manipulated the data before handing it over. Another favorite is the AD test, or Anderson-Darling test, where we try to understand whether the data is normal or not. The basic null hypothesis whenever I am doing an AD test is that my data follows a normal distribution. So this is the one test where I want my p-value to be greater than 0.05, so that I fail to reject the null hypothesis, conclude that my data is normal, and fall back on my favorite parametric tests, which make the analysis easy for me. But what if, during the AD test, your data analysis shows that the p-value is significant, that it is less than 0.05, maybe 0.02? Then it concludes that my data does not follow a normal distribution, and I need to investigate what type of non-normality it has. Accordingly, I will have to pick the appropriate test and take it further. We will continue our session next Wednesday. I hope you liked it. If you have any questions, please feel free to comment in the WhatsApp group, in the Telegram channel, or in the comments section over here. If there is any topic you would like to learn as part of the Wednesday sessions, I would be happy to look into that if you put those comments in the chat box, in the WhatsApp group, or on Telegram. I really love teaching you, and I thank you for being wonderful students. Take care. 42. Kruskal Wallis test 3 or more groups nonnormal data: This tutorial is about the Kruskal-Wallis test. If you want to know what the Kruskal-Wallis test is and how it can be calculated and interpreted, you are at the right place. At the end of this video, I will show you how you can easily calculate the Kruskal-Wallis test online, and we get started right now. The Kruskal-Wallis test is a hypothesis test that is used when you want to test whether there is a difference between several independent groups. Now, you may wonder a little bit and say, hey, if there are several independent groups, I use an analysis of variance. That's right.
But if your data are not normally distributed and the assumptions for the analysis of variance are not met, the Kruskal-Wallis test is used. The Kruskal-Wallis test is the non-parametric counterpart of the single-factor analysis of variance. I will now show you what that means. There is an important difference between the two tests. The analysis of variance tests whether there is a difference in means: when we have our groups, we calculate the mean of each group and check whether all the means are equal. When we look at the Kruskal-Wallis test, on the other hand, we do not check whether the means are equal; we check whether the rank sums of all the groups are equal. What does that mean? What is a rank, and what is a rank sum? In the classical Kruskal-Wallis test, we do not use the actual measured values; instead, we sort all people by size, and then the person with the smallest value gets the new value, or rank, one. The person with the second smallest value gets rank two, the person with the third smallest value gets rank three, and so on and so forth until each person has been assigned a rank. Once we have assigned a rank to each person, we can simply add up the ranks from the first group, add up the ranks from the second group, and add up the ranks from the third group. In this case, we get a rank sum of 54 for the first group, 70 for the second group, and 47 for the third group. The big advantage is that if we do not look at the mean difference but at the rank sums, the data does not have to be normally distributed when using the Kruskal-Wallis test. Our data does not have to satisfy any distributional form, and therefore we also do not need it to be normally distributed. Examples for the Kruskal-Wallis test: of course, the same examples can be used as for the single-factor analysis of variance, but with the addition that the data need not be normally distributed. A medical example: for a pharmaceutical company, you want to test whether a drug XY has an influence on body weight. For this purpose, the drug is administered to 20 test persons, 20 test persons receive a placebo, and 20 test persons receive neither drug nor placebo. The objective is to determine whether drug XY has a statistically significant effect on body weight compared to the placebo and control groups. A social science example: do three age groups differ in terms of daily television consumption? Research question and hypothesis: the research question for the Kruskal-Wallis test may be, is there a difference in the central tendency of several independent samples? This question results in the null and alternative hypotheses. Null hypothesis: the independent samples all have the same central tendency and therefore come from the same population. Alternative hypothesis: at least one of the independent samples does not have the same central tendency as the other samples and therefore originates from a different population. Before we discuss how the Kruskal-Wallis test is calculated, and don't worry, it is really not complicated, we first take a look at the assumptions. When do we use the Kruskal-Wallis test? We use the Kruskal-Wallis test if we have a nominal or ordinal variable with more than two values and a metric variable. A nominal or ordinal variable with more than two values is, for example, the variable preferred newspaper, with the values Washington Post, New York Times, and USA Today. It could also be frequency of television viewing, with values such as daily, several times a week, rarely, and never. A metric variable is, for example, salary, well-being, or the weight of people.
What are the assumptions now? Only several independent random samples with at least ordinally scaled characteristics must be available. The variables do not have to satisfy any particular distribution. So the null hypothesis is that the independent samples all have the same central tendency and therefore come from the same population, or in other words, there is no difference in the rank sums. The alternative hypothesis is that at least one of the independent samples does not have the same central tendency as the other samples and therefore comes from a different population, or to say it in other words again, at least one group differs in its rank sum. So the next question is, how do we calculate a Kruskal-Wallis test? It is not difficult. Let's say you have measured the reaction time of three groups, group A, group B, and group C, and now you want to know if there is a difference between the groups in terms of reaction time. Let's say you have written down the measured reaction times in a table, and let's just assume that the data is not normally distributed, so you have to use the Kruskal-Wallis test. Then our null hypothesis is that there is no difference between the groups, and we are going to test that right now. First, we assign a rank to each person. This is the smallest value, so this person gets rank one; this is the second smallest value, so this person gets rank two, and we do this for all people. If the groups have no influence on reaction time, the ranks should be distributed purely randomly. In the second step, we calculate the rank sum and the mean rank for each group. For the first group, the rank sum is two plus four plus seven plus nine, which equals 22, and since we have four people in the group, the mean rank is 22/4, which equals 5.5. We do the same for the second group, where we get a rank sum of 27 and a mean rank of 6.75, and for the third group we get a rank sum of 29 and a mean rank of 7.25. Now we can calculate the expected value of the ranks: if there were no difference between the groups, each group would have an expected mean rank of 6.5. We now have almost everything we need. We interviewed 12 people, so the number of cases is 12; the expected value of the ranks is 6.5; and we have calculated the mean ranks of the individual groups. The degrees of freedom in our case are two, simply given by the number of groups minus one, which makes three minus one. Lastly, we need the variance. The variance of the ranks is given by (N squared minus 1) divided by 12, where N is again the number of people, so 12, and we get a variance of 11.92. Now we have everything we need, and with these values we can calculate our test statistic H. The test statistic corresponds to a chi-square value and is given by this formula: the group size n times the sum over the groups of (the mean rank of the group minus the expected rank) squared, all divided by the variance of the ranks. In our case, the total number of cases is 12 and we always have four people per group, so we can pull the group size out in front of the sum; 5.5 is the mean rank of group A, 6.75 is the mean rank of group B, and 7.25 is the mean rank of group C. This gives us a rounded value of 0.5, and as we just said, this value corresponds to a chi-square value. Now we can easily read off the critical chi-square value in a table of critical chi-square values; you can also find this table on the Internet. We have two degrees of freedom, and if we assume a significance level of 0.05, we get a critical chi-square value of 5.991.
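The same H statistic can be reproduced with a standard statistics library rather than by hand. This is a minimal sketch assuming SciPy is available; the reaction-time values are invented, but they are chosen so that their ranks give exactly the rank sums 22, 27, and 29 used above.

```python
from scipy import stats

group_a = [0.30, 0.32, 0.35, 0.37]   # ranks 2, 4, 7, 9   -> rank sum 22
group_b = [0.29, 0.34, 0.36, 0.41]   # ranks 1, 6, 8, 12  -> rank sum 27
group_c = [0.31, 0.33, 0.38, 0.39]   # ranks 3, 5, 10, 11 -> rank sum 29

h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")   # H = 0.50, well below the critical 5.991
```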
Of course, our value is smaller than the critical chi-square value, and so, based on our example data, the null hypothesis is retained. Now I will show you how you can easily calculate the Kruskal-Wallis test online with DataTab. Online calculation: in order to do this, you simply visit datatab.net, click on the statistics calculator, and insert your own data into the table. Then you click on the hypothesis-testing tab, where you will find many hypothesis tests, and when you select the variables you want to test, the tool suggests the appropriate test. After you have copied your data into the table, you will see reaction time and group right here at the bottom. Now we simply click on reaction time and group, and it automatically calculates an analysis of variance for us. But we don't want an analysis of variance, we want the non-parametric test, so we just click here, and the calculator automatically calculates the Kruskal-Wallis test. We again get a chi-square value of 0.5, the degrees of freedom are two, and the calculated p-value is displayed; below, you can read the interpretation. A Kruskal-Wallis test has shown that there is no significant difference between the categories. Based on the p-value, therefore, with the data used, we fail to reject the null hypothesis. Just try it out yourself, it's very easy. Stay connected, keep learning, keep growing, see you in the next lesson. 43. Design of Experiments: Hi, and welcome. In this video we'll delve into the fascinating world of design of experiments, commonly referred to as DOE. We discuss what design of experiments, or DOE, is, the process steps of a DOE project, how DOE can help you reduce the number of experiments, how to estimate the number of experiments needed, and we go through the most common types of designs. So what exactly is design of experiments? At its core, design of experiments (DOE) is a structured method used to plan, carry out, and interpret experiments. The main purpose of DOE is to find out how different input variables, called factors, affect an output variable, called the response variable. Here is a more straightforward explanation. Systematic approach: DOE is organized and methodical; it follows a step-by-step process to ensure that experiments are conducted in a logical and efficient way. Input variables, factors: these are the elements that you change in an experiment to see how they affect the outcome. For example, if you are baking a cake, factors could include the amount of sugar, the baking time, or the oven temperature. Output variable, response variable: this is what you measure in the experiment to see the effect of the changes you made to the factors. In the cake example, the response variable could be the taste or texture of the cake. The goal of DOE is to understand the relationship between these factors and the response variable, helping you determine which factors have the most significant impact and how they interact with each other. Imagine you're riding a bicycle. The smooth rotation of the wheels depends on the condition of the bearings. If the bearings are well lubricated, there is minimal frictional torque, making pedaling effortless. However, if the lubrication is inadequate or the temperature is too high, more effort is required to maintain speed due to increased friction. In such cases, DOE allows us to systematically investigate factors like the type of lubrication, such as oil or grease, and varying temperatures (low, medium, high) to precisely quantify their impact on frictional torque.
But why is this important? Design of experiments enables us to design efficient test plans that uncover these insights effectively. By carefully manipulating factors and their levels, DOE helps us pinpoint which variables significantly influence the outcome, be it in mechanical systems like bearings or in more complex scenarios involving human responses to medications. The applications of DOE are vast and diverse; whether optimizing manufacturing processes, improving product designs, or refining medical treatments, DOE serves as a powerful tool to identify critical factors and determine optimal conditions for achieving desired outcomes. It empowers researchers and engineers to make informed decisions based on empirical data rather than relying on guesswork. In our upcoming segments, we'll explore the essential steps of a DOE project, from designing experiments to analyzing results. As we proceed further in the course, we uncover the intricacies of design of experiments and discover how this methodological approach can revolutionize your approach to experimentation and research. Stay tuned for more insights and practical tips. 44. The areas of application for a DOE: Now, let us understand the areas of application for DOE. The applications of DOE are wide ranging and varied; whether it's for optimizing manufacturing processes, improving product designs, or refining medical treatments, DOE is a powerful tool for identifying key factors and determining the best conditions to achieve desired results. It helps researchers and engineers make informed decisions based on real data instead of guesswork. Steps of a DOE project: let's take a look at the process of a DOE project, which consists of planning, screening, optimization, and verification. In the first step, planning, three things are important: first, gain a clear understanding of the problem and the system; second, determine one or more response variables; third, identify factors that can significantly influence the response variable. The task of determining potential factors that influence the response variable can be very complex and time consuming; for example, a fishbone diagram can be created in a team. Now comes the second step, screening. If there are many factors that could have an influence, usually more than four to six factors, screening experiments should be carried out to reduce the number of factors. Why is this important? The number of factors to be investigated has a major influence on the number of experiments required. Note that in the design of experiments, the individual experiments are also simply called runs. In the full factorial design, which we will discuss in more detail in a moment, the number of experiments, or runs, is n = 2 to the power of k, where n is the number of runs and k is the number of factors. Here is a small overview: if we have three factors, we have to make at least eight runs; with seven factors, it is already at least 128 runs; and with ten factors, it is already at least 1,024 runs. Please note that this table applies to a DOE where each factor has only two levels, otherwise there will be even more runs. Depending on how complex an individual experiment is, it may therefore be worthwhile to select so-called screening designs for four or more factors. Later, we will discuss the fractional factorial design and the Plackett-Burman design, which can be used for screening experiments.
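As a quick illustration of these run counts, here is a short Python sketch that prints n = 2^k for a few values of k and enumerates a two-level full factorial plan. The factor names are placeholders; the third factor, speed, is invented purely for illustration.

```python
from itertools import product

# How quickly n = 2**k grows with the number of factors k.
for k in (3, 7, 10):
    print(f"{k} factors -> at least {2**k} runs")

# A two-level full factorial plan: every combination of -1 / +1 levels appears once.
factors = ["lubrication", "temperature", "speed"]   # "speed" is a made-up third factor
design = list(product([-1, +1], repeat=len(factors)))
for run, levels in enumerate(design, start=1):
    print(run, dict(zip(factors, levels)))          # 8 runs for 3 factors
```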
Once the significant factors have been identified using screening designs, and hopefully the number of factors has been reduced, further experiments can be conducted. The data obtained can then be used to create a regression model, which helps to determine the input variables in such a way that the response variable is optimized. After optimization comes the final step, verification. This involves checking once again whether the calculated optimal input variables really have the desired influence on the response variable. Depending on whether we are in the screening step or the optimization step, there are different types of designs. Thank you for your attention. In the next lesson, we will dive deeper into practical applications of design of experiments and how to interpret the results effectively. Stay tuned. 45. Types of Designs in a DOE: Types of designs in DOE. When we are in either the screening step or the optimization step, we use different types of design methods. The most well-known ones are the full factorial design, the fractional factorial design, the Plackett-Burman design, the Box-Behnken design, and the central composite design. Let's start by looking at the full factorial design and the fractional factorial design. We also need to answer why we put in all this effort: why do we use design of experiments, and why do we need statistics? The reason is that experiments take time and cost money. Therefore, we need to keep the number of runs, the individual experiments, as low as possible. However, if we do too few runs, we might miss important differences and not get accurate results. For example, let's say we want to find out which factors affect the frictional torque of a bearing; we need to carefully design our experiments to identify these factors efficiently without doing unnecessary runs. How is the number of experiments in a DOE estimated? Let's take a look at an example. We want to investigate which factors influence the frictional torque of a bearing. Let's start with one factor, lubrication. We want to know whether lubrication affects the frictional torque, that is, whether it matters if a bearing is oiled or greased. To find out, we take a random sample of ten bearings: we oil half of the bearings and grease the other half. Now we can measure the frictional torque of the five oiled bearings and the five greased bearings. But why use ten bearings? In most cases, each run costs a lot of money; perhaps we can manage with fewer runs. How many experiments do we need to find out if the lubricant has an effect on the frictional torque? Let's just start with the ten bearings. We can calculate the mean frictional torque of the oiled bearings and of the greased bearings, and then the difference between the two means. In this sample, we can see a difference between oiled and greased bearings. However, we also notice that the frictional torque of both the oiled and the greased bearings is highly variable. If we took another random sample of ten bearings, the difference might be greater, or it might even go in the opposite direction. In other words, the frictional torque of the bearings varies widely, and the wider the spread, the more difficult it is to identify a specific difference or effect. Fortunately, we can reduce the variability of the mean value by increasing the sample size: the larger the sample size, the more precise the estimation of the mean. Therefore, the smaller the effect and the wider the spread of the response variable, the larger the sample size needs to be.
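That last point, that a larger sample gives a more precise estimate of the mean, can be checked with a quick simulation. This is a minimal sketch; the true mean, the standard deviation, and the sample sizes are invented numbers used only to show how the spread of the sample mean shrinks roughly like sigma divided by the square root of n.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, true_mean = 3.0, 105.0          # hypothetical frictional torque in N*mm

for n in (5, 10, 40):
    # Draw many samples of size n and look at how much their means vary.
    sample_means = [rng.normal(true_mean, sigma, n).mean() for _ in range(10_000)]
    print(f"n = {n:2d}: spread of the estimated mean = {np.std(sample_means):.2f} "
          f"(theory: {sigma / np.sqrt(n):.2f})")
```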
But how much larger? How can you estimate the number of runs needed? As an approximation, the required number of runs n grows with the squared ratio of the spread to the effect, that is, with (sigma divided by delta) squared, where n is the number of runs, sigma is the standard deviation, and delta is the effect to be detected. For example, if we have a standard deviation of three newton millimeters and a relevant difference of five newton millimeters, we need 22 runs. If the standard deviation is two newton millimeters, we only need ten runs, and if the standard deviation is one newton millimeter, we need four runs; so we would use two runs with greased bearings and two runs with oiled bearings. But how can DOE help you reduce the number of runs? We will see it in detail in the next lesson. Thank you for your attention. In the next lesson, we will dive deeper into practical applications of design of experiments and how to interpret the results effectively. Stay tuned. 46. How to reduce the number of runs: But how can DOE help you reduce the number of runs? Let's assume that the calculation of the number of runs results in 16 experiments: eight runs with oiled bearings and eight runs with greased bearings. But what if we have a second factor? Let's say that, in addition to lubrication, we have temperature with low and high levels. Then we need another eight runs to take this factor into account. So we need 16 runs to check if the lubricant has an effect and 16 runs to check if the temperature has an effect, which gives us a total of 24 runs. Now the question arises, is it possible to achieve this with fewer runs? That brings us to the full factorial design. The question is, why should we limit ourselves to testing one factor at a time? Instead, we could devise a design that incorporates all potential combinations, such as grease and high temperature. Of course, we still need 16 runs per factor; we get this by making four runs with each of the four combinations. Then we have eight runs with oil and eight with grease, and on the other side, eight with low temperature and eight with high temperature. We now have a total of 16 runs, whereas before we had 24 runs. We now need fewer experiments and get even more information. Why more information? We now also know whether there is an interaction between temperature and lubrication. For example, oiled bearings may show a variation in frictional torque at different temperatures which is not observed with greased bearings. This information would have been lost previously. When we have three factors instead of two, the savings are even higher. If we test one of the three factors at a time, we need 32 runs. If we instead run two experiments for each combination in a full factorial design, we still only need 16 runs, and for each factor we still have eight runs per factor level; for example, for the lubrication factor we have eight runs with oil and eight runs with grease. Of course, we can also create full factorial designs with more than two levels; for example, the temperature factor could have three levels, low, medium, and high. However, as mentioned at the beginning, even with a full factorial design with two levels per factor, the number of runs required increases very quickly as the number of factors increases. Let us therefore now take a look at the fractional factorial design. The fractional factorial design is used for screening designs, that is, if you have more than approximately four to six factors. Of course, reducing the number of runs means reducing information.
In fractional factorial designs, the resolution is reduced. What is the resolution? The resolution is a measure of how well a DOE can distinguish between different effects. More precisely, the resolution indicates how much the main effects and interaction effects are confounded in a design. But what are main effects and interaction effects, and what does confounded mean? In design of experiments, the term effect refers to the impact that a certain factor, or a combination of factors, has on the response variable of an experiment. Essentially, effects measure how much the response variable changes when you change the factors. A main effect is the influence of a single factor on the response variable; for example, what influence does the lubrication of a bearing have on the frictional torque? Interaction effects occur when the effect of one factor on the response variable depends on the level of another factor; for example, the effect of the lubricant on the frictional torque could depend on the temperature. But what does that mean? Thank you for your attention. In the next lesson, we will dive deeper into practical applications of design of experiments. Stay tuned. 47. Type of Effects: But what are main effects and interaction effects, and what does confounded mean? In design of experiments, the term effect refers to the impact that a certain factor, or a combination of factors, has on the response variable of an experiment. Essentially, effects measure how much the response variable changes when you change the factors. A main effect is the influence of a single factor on the response variable; for example, what influence does the lubrication of a bearing have on the frictional torque? Interaction effects occur when the effect of one factor on the response variable depends on the level of another factor; for example, the effect of the lubricant on the frictional torque could depend on the temperature. But what does that mean? Let's say we have an average frictional torque of 102 newton millimeters for the bearings with oil and an average of 108 newton millimeters for the bearings with grease. Then we have a main effect of lubrication of six newton millimeters. But now we can break this down into high and low temperatures. At high temperature, we could get 98 for oil and 102 for grease, so the difference between oil and grease is only four newton millimeters. At low temperature, we could get 104 and 112, a difference of eight. So the lubrication effect is influenced by the temperature, and we have an interaction between lubrication and temperature. The interaction shifts the difference by two newton millimeters relative to the overall result, so we have an interaction effect of two newton millimeters. Full factorial designs take all interactions into account. In our bearing friction example, in addition to the lubricant and temperature factors, we also looked at the interaction between lubricant and temperature. However, as the number of factors increases, numerous interactions rapidly emerge. For example, if we have five factors, A, B, C, D, and E, we get interactions between two factors, between three factors, between four factors, and between all five factors. Now, of course, the question is, do we really need all the interactions, or can we reduce the resolution? This is exactly what the fractional factorial design does: in a fractional factorial design, interactions can be confounded with other interactions or with the main effects of factors. What does confounded mean?
It means that the effects of different factors, or the effects of interactions of factors, cannot be separated from each other. The extent to which the number of runs can be reduced at the expense of resolution is shown in this table. The resolution is usually indicated by Roman numerals, for example III, IV, V, and so on. Here on the diagonal we see the full factorial designs. We'll go through what resolutions III, IV, and V mean in a moment. For example, if we have six factors, we need at least 64 runs for a full factorial design. If we choose a fractional factorial design with a resolution of VI, we need 32 runs; with a resolution of IV, we need 16 runs; and with a resolution of III, we need only eight runs. But what does that mean, and how does it work? The full factorial design is always used as the starting point; let's take a look at the example with eight runs. In the next lesson, we will dive deeper into practical applications of design of experiments. Stay tuned. 48. Fractional Factorial design: Let's break down the key points about fractional factorial designs in simple terms. What are fractional factorial designs? Fractional factorial designs are an efficient way to test multiple factors simultaneously; they significantly reduce the number of experimental runs needed. Why use fractional factorial designs? Using fractional factorial designs saves both time and resources compared to full factorial designs. Additionally, they allow for the testing of interactions between factors, providing valuable insights with fewer experiments. One, resolution in fractional factorial designs. Definition: resolution refers to how much information is captured in an experimental design. In simpler terms, it tells us how many factors, like A, B, and C, we can test together and how well we can separate their effects from each other. Lower resolution, for example III: this means we can test more factors with a given number of runs, but it also means that the effects of these factors might get mixed up with interactions; for example, with resolution III, the effects of main factors can be mixed up with interactions involving two other factors. Higher resolution, for example IV: here we cannot test as many factors together, but it is clearer to see the main effect of each factor because they are less mixed up with interactions; for instance, at resolution IV, the effects of main factors are only confounded with interactions involving three factors. Two, confounding effects. Definition: when we say effects are confounded, it means we cannot tell exactly which factor is causing a certain change in the results. This happens because different combinations of factors can have identical patterns in the design, and therefore identical apparent effects on the outcome. Example: imagine testing factors A, B, and C; if we add a fourth factor, D, the results might show changes that we cannot attribute solely to D, because the effect of D might be mixed up with how A, B, and C interact with each other. Three, impact of resolution on experiment design. Explanation: choosing a resolution affects how efficient our experiment is and how clear our results are. Higher resolution gives a cleaner separation of effects but requires more runs to be confident in our results; lower resolution requires fewer runs but can make it harder to disentangle the effects of different factors. Four, practical examples. Illustration: to understand this better, think of testing different recipes for baking a cake.
If you change one ingredient, like sugar, the taste might change, but if you change both sugar and flour at once, it is harder to say which change caused which result. The design helps us balance testing many factors against understanding their separate impacts. By understanding these points, researchers can design experiments that give clear answers about how factors affect outcomes, even when testing several factors at once. We'll go through what resolutions III, IV, and V mean in a moment. For example, if we have six factors, we need at least 64 runs for a full factorial design. If we choose a fractional factorial design with a resolution of VI, we need 32 runs; with a resolution of IV, we need 16 runs; and with a resolution of III, we need only eight runs. But what does that mean, and how does it work? The full factorial design is always used as the starting point. Let's take a look at an example with eight runs. Suppose we have the factors A, B, and C. With a full factorial design, we can test whether factor A, B, or C has an effect. We can also test whether interactions between two factors have an effect and whether the interaction among all three factors has an effect. If we now want to test not just three factors with eight runs but an additional fourth factor, factor D, we must sacrifice some information from one of the interactions, for example the interaction of A, B, and C. And if we want to test a fifth factor with eight trials, let's say factor E, we would need to sacrifice another interaction, for example the interaction between B and C. However, we are not actually dropping the information; we are mixing the new factor with the interaction. This means we have confounded the factor with the interaction. What does that mean? It means we cannot determine whether an observed effect is due to factor D or to the interaction of A, B, and C. Similarly, we cannot tell whether an effect is due to factor E or to the interaction of B and C. Of course, it is much less problematic to mix one factor with an interaction of three factors than with an interaction of two factors. Now we have a good transition to the resolution. What do the resolutions III, IV, and V mean? At resolution III, main effects can be confounded with interactions of two factors; for example, factor D could be confounded with the interaction of factors A and B. Experiments with resolution III should therefore be considered critical: they can only be used if the interactions of two factors are significantly smaller than the effects of the main factors, otherwise a two-factor interaction can significantly distort the result for a factor. Experiments at resolution IV are much less critical. Here, the main effects are only confounded with interactions of three factors, and the more factors involved in an interaction, the smaller the effect is likely to be. Furthermore, in resolution IV, interactions of two factors are confounded with interactions of two other factors. Experiments at resolution V are not considered critical: main effects are only confounded with interactions of four factors, and in the same way, two-factor interactions are only confounded with interactions of three factors. But how do you confound a factor and an interaction? Let's take a look at this example. Here, we have the full factorial design of the three factors A, B, and C; these eight runs are carried out in total.
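Here is a minimal Python sketch that generates exactly these eight runs and their interaction columns. As the walkthrough that follows explains, it is these product columns (AB, AC, ABC) that can be reused to carry additional factors, which is precisely what confounds those factors with the interactions.

```python
# The eight-run full factorial in A, B, C, with the interaction columns obtained by
# multiplying the signs. In a fractional factorial design, extra factors are assigned
# to these product columns, so their effects cannot be separated from the interactions.
from itertools import product

print("  A   B   C |  AB  AC  ABC")
for a, b, c in product([-1, +1], repeat=3):
    ab, ac, abc = a * b, a * c, a * b * c
    print(f"{a:3d} {b:3d} {c:3d} | {ab:3d} {ac:3d} {abc:4d}")
```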
We still only consider factors with two levels: minus one stands for one level and one stands for the other. For our frictional torque example, the test plan would look like this for the factor temperature: minus one is the low temperature and one is the high temperature. If we now run the experiments, we obtain a value for the response variable for each run. If factor A is one or minus one, this has a certain effect on the target value, and the same applies if factor B is one or minus one. The interaction effect tells us whether there is an additional effect when factors A and B are simultaneously one or minus one, or when they go in exactly opposite directions. On one side, we have the pairings with the same sign, and on the other side, the pairings with an unequal sign. We can check whether there is a difference in the response variable between the values in the green group and the values in the red group; if there is a difference, then there is an interaction between A and B. However, if we know in advance that there is only a very small or no interaction, we can use these combinations to test a fourth factor, D. To do this, we simply multiply A and B: we always get a one if factors A and B have the same sign and a minus one if they have different signs. Of course, a problem may arise when analyzing the results. If there is a difference between the green and the red values in the response variable, we cannot determine whether this effect comes from the interaction between A and B or from factor D. If we are able to show that there can be no interaction between A and B, this is not a problem: then we can be sure that the difference is due to factor D. Similarly, we can use the interaction of A and C to measure a factor E, and the interaction of A, B, and C to measure a factor F. In this case, we therefore measure six factors with only eight runs, but we can no longer distinguish factor D from the interaction of A and B, factor E from the interaction of A and C, or factor F from the interaction of A, B, and C. In the next lesson, we will take a detailed look at the other types of designs available in DOE. Stay tuned. 49. Plackett Burman Central Composite design: Welcome. Today we are diving into different types of designs in design of experiments, or DOE. Let's start with the Plackett-Burman design. What is a Plackett-Burman design? Plackett-Burman designs are typically used with two levels and have resolution III. The main advantage of these designs is that the interaction between two factors is distributed among several other factors; for example, the interaction between factors A and B is confounded with all other factors except A and B themselves. This makes Plackett-Burman designs ideal when dealing with many factors and when only the main effects are of interest. However, these designs should be used with caution: you have to assume that two-factor interactions can be neglected, though this requirement is less strict than in classical fractional factorial designs of resolution III. Moving on, what is a Box-Behnken design? The Box-Behnken design, along with the central composite design, is used to analyze and optimize a few factors in detail and to identify non-linear dependencies. For detecting non-linear relationships, at least three levels per factor are necessary, and with a full factorial design using three levels the number of trials can increase rapidly.
For instance, with two factors at three levels each, you need nine runs, and with three factors at three levels each, it increases to 27 runs. Box-Behnken designs address this by combining a two-level factorial structure with center points (for example, three center points); with three factors, this reduces the number of runs from 27 to 15. Although this reduces the number of runs, it may identify fewer non-linear relationships. Next, let's discuss the central composite design. This design typically involves three types of test points: two-level full factorial points, which form the corners of a cube, or hypercube in multi-dimensional spaces; center points, located in the center of the space defined by the factorial points; and axial points, which lie on the axes of the factor space outside the cube. These last two types of points help estimate non-linear effects in your model. In the next lesson, we will dive deeper into practical applications of design of experiments. Stay tuned. 50. Conclusion: I would like to thank you very much for completing the program. It shows that you are highly committed on your learning journey; you want to upskill yourself, and I trust you have learned a lot. I hope all the concepts are also clear to you. I want to make sure I tell you about the other programs that I offer on Skillshare. On Skillshare, I have many other programs which are already available, and many more will come up in the future weeks and months. The programs cover topics like storytelling with data, how to use analytics, data visualization, predictive analytics without coding, and many more. Apart from this, I also work as a corporate trainer. I ensure that all my programs are highly interactive and keep all the participants very much engaged. I design books which are customized for my workshops, which also ensures that all the concepts are clearly understood by the participants. My games are designed in such a way that the concepts get learned while people play; there are a lot of games designed for my programs, and if you are interested, you are free to contact me. I have also done more than 2,000 hours of training in the past two years, during the pandemic; these are just a few of the workshops. So if your organization wants to take up any corporate training program, offline or online, or if you feel that you personally want to upskill your learning, you are free to contact me on my email ID. Stay connected with me on LinkedIn, and if you liked my training, please make sure you write a review on LinkedIn. I also run a Telegram channel where I post a lot of questions through which people can learn the concepts; they might take just a few seconds to answer. Apart from that, please make sure you leave a review on Skillshare about your training experience, and please do not forget to complete your project. I love people who are committed, and you have proved that you are one of them. Please stay connected, stay safe, and God bless you.