Transcripts
1. Data Analytics intro: Hello friends. Let's get started on
this training program on data analytics
using MiniTab. What are you going to
learn in this course? So the skills that
you will learn in this course are some
basics of statistics. We will be covering
descriptive statistics, graphical summary, distributions, histogram, box-plot, bar charts,
and pie charts. I'm going to set up a new
series on test of hypothesis, which I will be sharing as a link
in the last video. But let's first understand all the different types
of graphical analysis. Who should attend this class? Anyone who is a
student of Lean Six Sigma, who wants to get
certified as Green Belt, Black Belt, or who
wants to apply statistics and graphical
analysis in their place of work. You might
be an entrepreneur, or you might be a
student who wants to understand
statistics using MiniTab. I'm going to cover all of it. We're going to learn what mistakes commonly happen
when we are analyzing. Because when we do analysis using simple
theory-based data points, everything appears to be normal. So I'm going to show
you some traps in which our analysis will fail and how you should
avoid those traps. At the end of this program, what will you take away
from it? You will understand how to
do some basic analysis. You'll understand what
are the tools that are required during
your measure phase, like capability
calculations and so on. We will also try, if possible, to cover the tools used during the
analyze phase, like test of hypothesis. Otherwise, if the video gets too big, I will put it as
a separate series. I will also cover which graph to use when, some common mistakes we make when we perform graphical analysis
and creating graphs. And how do I derive insights and conclusion
from those graphs? This will really help you in understanding this
program really well. Let's see what Minitab is. Minitab is a statistical
software that's available in
multiple versions. If I open a new project, my Minitab screen looks
something like this. I have a navigator
on the left side. I have my output
screen on the top, I have my data sheet, which is very much
like an Excel sheet, which I can work with. I can keep adding these
sheets and have lots of data. I can do a lot of analysis
using my options. We're going to cover basic
statistics, regression. We will be covering lots of graphs using different types of data, right? So if you are interested
in knowing these things, you should definitely
enroll and watch my video. Thank you so much.
2. Project Work: Let us understand what is the project work
that we're going to do in this data analytics
program using MiniTab. As I told you, we are going
to work with MiniTab. And this is the Minitab
that I will be using. I will also be sharing
with you a datasheet, your project data sheet, where I have multiple examples, where we are doing
calculations on capability. We will try to see
distributions and you can see that there
are various tabs: example one, example
two, example three. We'll try to do some
trend analysis. We will try to see
Pareto charts. We have lots of data that
has been shared with you, which will give you a
hands-on experience on working with data, right? So let's get started.
3. Minitab: In this class, we're going to learn about hypothesis testing. I'm going to teach you hypothesis
testing using MiniTab. I'm also going to teach you hypothesis testing
using Microsoft Office, that is, using Excel. For those who are interested in
going for MiniTab, let me show you from where
you can download Minitab: Minitab.com, under Downloads. Here we come to the
download section. You have MiniTab
statistical software, and it is available
for 30 days for free. I have also downloaded the
trial version on my system, done the analysis, and
shown it to you. Remember, it is available
for 30 days only. Please ensure
that you complete the entire training program
within the first 30 days. When you feel the value in this, you should definitely
go ahead and buy the licensed
version of MiniTab, which is available over here. I just have to click on Download,
and the download will start. It starts with a
free 30-day trial. And it's good
enough time for you to practice all the
exercises which are given. It will ask you
for some personal information so that they can be in touch with you and they can help you
with some discounts, if there are any. You have a section called Talk to MiniTab, or you have
a phone number. If you're calling from the UK, it will be easy for you
to call over there. But if you're contacting them
from other places, Talk to MiniTab is a
much easier option. This is a very good
statistical tool and they keep upgrading the
features regularly. So personally, I feel that this investment
will be worth it. But for those who cannot
afford to go for the license, they can use Microsoft Office, at least some of the features, not all, but some of the
features are available. So initially I will show you the entire exercise of different types of
hypothesis using MiniTab. And then we will move
into Microsoft Excel, stay connected and
keep learning.
4. what is Descriptive Statistics: In today's session, we are going to learn about
descriptive statistics. Descriptive statistics
means I want to understand measures of center,
like mean, median, and mode. I want to understand
measures of spread. That is nothing but range, standard deviation,
and variance. Let's take some simple
data that I have: cycle time in minutes for almost 100 data points. I'm going to take
the cycle time in minutes from my
project data sheet. I'll go to MiniTab and I
will paste my data. Here I want to do some
descriptive statistics: click on Stat, then Basic Statistics, and select Display
Descriptive Statistics. When I do this, it gives me an option in the pop-up window, which shows me the available
data fields that I have. I have cycle time in minutes. So it is telling
me that I want to analyze the variable
cycle time in minutes. I'll just click on, Okay, and immediately you will find
that in my output window. I can just pull this down. In my output window. It is showing me
that it has done some statistical analysis for the variable cycle
time in minutes. I have 100 data
points over here. The number of missing values is 0. The mean is 10.064. The standard error of mean is 0.103, the standard deviation is 1,
and the minimum value is 7.5. Q1, which is nothing but your
first quartile, is 9.1. The median, that is,
your Q2 is 10.35, Q3 is 10.868, and the
maximum value is 12.490. If I need more
statistical analysis, I can go ahead and
repeat this analysis. This time, I'm going to
click on Statistics. And I can look at the other
data points that I need. Suppose I need the range; I don't need standard error; I need the
inter-quartile range. I want to identify
what is the mode. I want to identify what is
the skewness in my data, and what is the kurtosis in my data. I can select all of it and click on Okay. When I do this, all the other
statistical parameters that I have selected will come
up in my output window. This is my output window. It again tells me the additional data points
that I selected. The variance is nothing but your
standard deviation squared. It is 0.0541. It is telling me the range,
that is, maximum minus minimum. It is 4.95. The inter-quartile range is 1.707. There is no mode in my data, and the number of data points for the mode is
0 because there is no mode. The data is not skewed; the skewness value is very close to 0, it is 0.05. But
there is kurtosis, which means my data is not
appearing as a perfect normal curve. So, would we like to see
how my distribution looks? Let's do that. I click on Stat, I click on Basic Statistics, and I will click on
graphical summary. I'm selecting cycle
time in minutes. And I'm saying I want to see
95% confidence interval. I click on, Okay,
let's see the output. The summary of
cycle time in minutes. It is showing me the mean,
standard deviation, variance. All the statistics are being displayed on
the right-hand side. Mean, standard deviation,
variance, skewness, kurtosis, number of data points
minimum, first quartile, median, third quartile, and maximum. The data points which you
see as minimum Q1, median, Q3 and maximum will be
covered in the boxplot. The boxplot is framed
using these data points. And when you look at the bell curve, you see that the bell
is not a steep curve, it is a little flatter curve, and hence the kurtosis
value is a negative value. We will continue our learning more in detail in
the next video. Thank you.
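The statistics Minitab displays in this lesson can also be computed directly. Here is a minimal Python sketch using only the standard library, with a small made-up cycle-time sample (not the course's project data sheet):

```python
import statistics

# Made-up cycle-time sample (minutes); the course data has ~100 points.
cycle_time = [9.1, 10.4, 10.3, 9.8, 11.2, 10.9, 8.7, 10.1, 12.0, 9.5]

n = len(cycle_time)
mean = statistics.mean(cycle_time)
stdev = statistics.stdev(cycle_time)            # sample standard deviation
variance = stdev ** 2                           # variance = stdev squared
q1, median, q3 = statistics.quantiles(cycle_time, n=4)  # Q1, Q2 (median), Q3
iqr = q3 - q1                                   # inter-quartile range
data_range = max(cycle_time) - min(cycle_time)  # maximum minus minimum
se_mean = stdev / n ** 0.5                      # standard error of the mean

print(f"N={n}  Mean={mean:.3f}  StDev={stdev:.3f}  SE Mean={se_mean:.3f}")
print(f"Min={min(cycle_time)}  Q1={q1:.3f}  Median={median:.3f}  Q3={q3:.3f}  "
      f"Max={max(cycle_time)}  IQR={iqr:.3f}  Range={data_range:.2f}")
```

The `statistics.quantiles` call with `n=4` returns the three quartile cut points, so Q1, the median, and Q3 come out in one step, much like the Display Descriptive Statistics output.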
5. Understand Box Plot Part 1: In this lesson, we are going
to learn more about boxplot. A boxplot is one of the graphical techniques which helps us identify
outliers, right? Let us understand how
a boxplot gets formed. Let's understand
the concept first before we get into
the practicals. A boxplot is called a
boxplot because it looks like a box, and it has
whiskers like the ones a cat has on its face. Now, just like a cat cannot have endless whiskers, the size of the whisker of the box plot will be decided
by certain parameters. You will see some
important terminologies when you're forming a boxplot. Number one, what's
the minimum value? What's the quartile one? What is the median? What is the quartile three? What is the size
of the maximum whisker? And what is the maximum
value in the data? Here, the minimum talks about the minimum point to where
the whisker can be extended. Q1 stands for the first quartile, which marks 25% of the data. Let's assume for ease, we have 100 data points. Twenty-five percent of the data will
be below this Q1 mark. Between Q1 and Q2, twenty-five percent
of your data will be present. Q2 is also called the median, or the
center of your data. So if I arrange my data in
ascending or descending order, the middle data
point is called as a median and it is called as Q2. Q3, or otherwise also
called as upper quartile, talks about the
twenty-five percent of the data after the median. So technically, by
now you have covered seventy-five percent
of your data will be below your
third quartiles, 25 per cent below Q1, 50% of the data below Q2, Seventy-five percent of
the data is below Q3. So technically, out
of 100% of the data, 75% of the data is below Q3. It means twenty-five percent of my data points will be above Q3. Now the distance between
Q1 and Q3 is called the box size. And this box size is also
called as inter-quartile range. Q3 minus Q1 is called as
inter-quartile range. As I told you at the
beginning of the class, that the size of
the whisker depends upon the interquartile
range, or IQR. From Q3, I can extend this line 1.5
times the size of the box. So Q3 plus 1.5 times IQR will be the upper
limit for my whisker
on the upper side. If I want to draw the
whisker on the lower side, it is nothing but the same 1.5 times the
inter-quartile range, but I subtract this value from Q1 and extend the whisker till that value. So it sets up the lower limit. You might have data
points which are coming below the lower limit. You might have data
points which are coming beyond the
maximum size of the whisker. These data points
are called outliers. The beauty of the boxplot
is it will help you to identify if there are any
outliers in your dataset. Let's see how can I
construct a boxplot? Practically, I
don't have to worry about finding out 25%, 50%, and 75% by hand; we will go to MiniTab and let it do the work. So let's see this datasheet. In our previous class, we did some descriptive
statistics on this. And we found the data points. We found minimum Q1, Q2, Q3, and maximum data point. Let's try to construct a boxplot for the
cycle time in minutes. So I will click on Graph, go to Boxplot, select a simple boxplot,
and click on, Okay, I'm going to select
cycle time in minutes. And I'm going to say Okay; let's see the data view. If you look at this boxplot, the bottom line of the box is called
Q1. It is 9.16. The median is the middle line, and it need not be
exactly in the center. The top of the box is Q3, which is 10.86 in
this data range, and the interquartile
range is 1.7. My whisker can extend
1.5 times the IQR above the box, and it can go 1.5 times 1.7
below it. And you are seeing
that there are no asterisk marks
in this boxplot, very clearly indicating
that there are no outliers in my
current dataset. Let's pick up some
more data sets in our next video to
understand box plots further.
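The whisker rules described in this lesson can be sketched in a few lines of Python. The numbers below are made up for illustration; the last value is deliberately extreme so that one point falls beyond the upper fence:

```python
import statistics

# Made-up cycle-time sample; 21.5 is deliberately extreme.
data = [9.1, 9.8, 10.1, 10.3, 10.4, 10.9, 11.2, 12.0, 8.7, 21.5]

q1, q2, q3 = statistics.quantiles(data, n=4)  # Q2 is the median
iqr = q3 - q1                                 # box size = inter-quartile range
upper_fence = q3 + 1.5 * iqr                  # whisker cannot extend past this
lower_fence = q1 - 1.5 * iqr                  # lower whisker limit

# Anything beyond a fence would be plotted as an asterisk (an outlier).
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1:.2f}  median={q2:.2f}  Q3={q3:.2f}  IQR={iqr:.2f}")
print("outliers:", outliers)
```

Here 21.5 lands beyond the upper fence, which is exactly the situation that produces an asterisk on the Minitab boxplot.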
6. Understand Box Plot Part 2: Let us continue our journey on understanding boxplots
more in detail. If you go to the sheet
in your project file, which is called as a boxplot. I have collected data of cycle time for five
different scenarios. As you can see, in some places I have a larger
number of data points, like almost 45 data points. In some places, I have
only 14 data points. So let's try to analyze this in more detail to understand
how boxplot works. I have copied this
data onto MiniTab: case one, case two, case three, case four, and case five. So the first thing I would
want to do is some basic descriptive
statistics for all the five cases. I'm selecting all of it. And then,
when I see my output, I can see that in
three of the cases, I have 45 data points. In the fourth case, I have 18 data points. In the fifth case, I
have 14 data points. So the number of data
points varies. If you look at my minimum values, they range from 1 to 22. And the maximum values are
somewhere between 40 and 90. In one scenario I have
values from 21 to 40. In one scenario I have
values from 2 to 90, which very clearly shows that even where the number of data
points is similar, my range of values is wide. So if you look at the range, it's going from
18.8 to 99. So in case two, I have values from 1 to 100, so the
range is 99. And the same can also be
observed in the standard deviation. You can see that the
skewness of data is different and kurtosis
is different. Let's first understand
the box plot in detail. And in the next video, when I'm talking
about the histogram, we will understand the
distribution pattern using the same data set. Let's get started.
I click on graph. I can click on boxplot
and I click on simple. What I can do is take up one case at a time
to analyze my data. So case one, it shows
me a box plot and this boxplot very clearly shows that there are no
outliers in my data. When I keep my cursor over here, I have 45 data points. My whisker is ranging
from 21.6 to 40.4, and my inter-quartile
range is 5.95. My median is 30.3. My first quartile is 26.9. My third quartile is 32.85. Let's redo this
thing for case two. When I do my case two, if you now look, the box is looking very small, even though here my data
points are the same, forty-five. My whisker
is again ranging from 21.6 to 40.4, same as
my previous scenario. But I have outliers over here, which are far beyond. If you remember, the
descriptive statistics for case two, my minimum value is 1
and my maximum value is 100. My median was similar to
my previous scenario. My Q1 is also similar, not same, but similar. And Q3 is also similar. But when you look
at the box plot, the box is very small, very clearly indicating that my inter-quartile
range is 6.95. My whisker can only go 1.5 times that, and any data
point beyond that whisker will be called
an outlier. I can select these
outliers, right? And I can very clearly see, in case two, the value is 100
and it is in row number one. In row number 37, I have
a value of 90. In row number 30, I have
a value of 88. And in row number 21 I have
a value of 1, which is the minimum. So I have outliers
on both the sides. Let's understand case three. When I look at case three, I put my cursor on the boxplot. I have the same 45 data points. My whiskers go from 21.6 to 40.4, same as my
case one and case two. But in this scenario, I have a lot of outliers on the lower end, that is, at the bottom of
my box. It is easy for us to click on each one of them and
see what the values are. Now look at case four: the beauty over here is
I have only 18 data points, but still I have an outlier. Let's do it for case five and understand that as well. I have a smaller box. I have only 14 data points, and I have an outlier
on the upper end, and I have an outlier
on the lower end. Here the value is 23. But seeing these
plots separately makes it difficult for
me to do a comparison. Can I get everything
on one screen? So I go to graph,
I go to boxplot. I will choose
Multiple Y's, Simple. I'm selecting all the cases together to see
them on one graph. I'm saying the scale and
the axes should be seen, and grid lines should be seen. And I click on Okay. I'm getting all the
five cases
in one graph. This will make it easy for me to do the analysis. When
I saw case one individually, it was showing as a big box. But when I'm doing a comparison of one next to the other, I can see that in case two I have outliers on the
top and the bottom. In case three, I have
outliers on the bottom side. In case four, I have
outliers on the top side. In case five, I have
outliers on both the sides. The number of data
points is different, but the boxes still get drawn. The size of the box is not determined by the
number of data points. I have 45 data points, but my box is very narrow. And I have 14 data points
and my box is wide. If I have 14 data points, the quartiles are going to divide my
data into four parts: roughly three data points below Q1, three data points
between Q1 and Q2, three data points
between Q2 and Q3, and three data points beyond Q3. Whereas when I had
45 data points, it is getting
distributed as roughly 11, 11, 11, and 11. My median would be
the middle number. So the learning from this exercise is that by
looking at the size of the box, you cannot determine the
number of data points. But what you can definitely determined is that in
mind that dataset, do I have data points which
are extremely high or low? So the purpose of drawing
a box plot is to see the distribution and
identify outliers, if any. I hope the concept is clear. If you have any queries, you are free to put it up
in the discussion group. And I'll be happy
to answer them. Thank you.
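The comparison in this lesson can be mimicked in Python. The two samples below are made up to loosely echo case two (wide extremes) and case five (only 14 points); the helper returns the row numbers of the outliers, much like clicking the asterisks in Minitab:

```python
import statistics

def outlier_rows(values):
    """Return (row_number, value) pairs lying beyond the 1.5 * IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [(row, v) for row, v in enumerate(values, start=1) if v < lo or v > hi]

# Made-up data loosely echoing the lesson: similar "typical" values,
# but one sample has a few extreme points and the other is small.
case_like_two = [30, 29, 31, 28, 32, 30, 27, 33, 100, 1, 30, 29, 31, 90]
case_like_five = [30, 29, 31, 28, 32, 30, 27, 33, 31, 29, 30, 28, 32, 23]

print("outliers, wide case:", outlier_rows(case_like_two))
print("outliers, small case:", outlier_rows(case_like_five))
```

Note that the fences depend only on the quartiles, not on the number of data points, which is exactly the lesson's conclusion about box size versus sample size.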
7. Understand TestofHypothesis: Hello friends. Let us continue our journey
on MiniTab data analysis. Today we are going to learn
about hypothesis testing. You might have heard that we do hypothesis testing
during the analyze and improve phase
of our project. So to understand how
hypothesis test works, let us understand a
simple case scenario. I will come back to this graph again and explain
it to you. As you know, when we go
to the court of law, the justice system can be used to explain the concept
of hypothesis testing. The judge always starts with
a statement which says, the person is assumed to be
innocent until proven guilty. This is nothing but your null
hypothesis, the status quo. As the court
case goes on, the lawyers try to
produce data and evidence. And unless and until we have strong data
and strong evidence, the person remains in the
status of being innocent. So the prosecution or the
opposition lawyer is always trying to say that
this person is guilty and I have data and
evidence to prove it. He is trying to work on
the alternate hypothesis. And the judge says, I'm going with the status quo of the null hypothesis by default. Let me explain it
in an easier way. You and I are not taken to the court of law
because, by default, we are all innocent; that is the status quo. Who is pulled to
the court of law? People who have
a chance of having committed some crime. It could be anything.
In the same way, what do we try to do
hypothesis testing on when I'm doing the analyze
phase of the project? I have multiple causes which might be contributing
to my project Y. We do a root cause analysis and we get to know that, okay, maybe the shipment got delayed. Maybe the machine is a problem, maybe the measurement
system is a problem. Maybe the raw material
is of not good-quality. We have multiple reasons
which are there. Now I want to prove
it using data, and that is the place where I try to use hypothesis testing. All the processes
have variation. We know all the processes
follow the bell curve. We are never add the center. There is some bit of
variation in every process. Now the data or the
sample which you updated, is it a random sample
coming from the same Banco? Or is it a sample that's coming from an entirely
different bell curve? So hypothesis testing will help you in analyzing the same. Whenever we set up
a hypothesis test, we have two types of hypothesis, as I told you, the status quo
or the default hypothesis, which is your null hypothesis. By default, we assume that
the null hypothesis is true. So to reject the
null hypothesis, we need to produce evidences. Alternate hypothesis
is the place where there is a difference. And this is the reason why the hypothesis test has
actually been initiated, right? We will understand
with lots of examples. So stay connected. So when I'm framing up null
and alternate hypotheses, let's say I am stating that my mu,
which is nothing but my average, my population average,
is equal to some value. Always remember, your null and alternate hypotheses
are mutually exclusive. If mu is equal to some value, the alternate hypothesis would say mu is not equal
to that value. Or say mu is less than or equal to some value
as a null-hypothesis. For example, if I'm
selling Domino's Pizza, I say my average delivery time is less than or equal
to 30 minutes. The customer comes
and tells me, no, the average delivery time
is more than 30 minutes; that becomes my alternate. Sometimes the null hypothesis is that mu is greater than
equal to some value. For example, my average quality is greater than equal to 90%. Then the customer comes
back and tells me that, no, your average quality is
less than that percentage. So always remember, the
null hypothesis and alternate hypotheses
are mutually exclusive and complimentary
to each other. We will take up many more
examples as we go further.
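The "same bell curve or a different one?" question can be illustrated with a small simulation. Everything below is made up: an assumed process with mean 30 and standard deviation 2, one sample drawn from that bell curve and one from a shifted curve:

```python
import random
import statistics

random.seed(42)  # reproducible illustration

population_mean = 30.0   # assumed process average (e.g. delivery minutes)
population_sd = 2.0      # assumed process standard deviation

# Sample A comes from the same bell curve the null hypothesis assumes.
sample_same = [random.gauss(population_mean, population_sd) for _ in range(50)]
# Sample B comes from a shifted bell curve (the alternate is actually true).
sample_shifted = [random.gauss(population_mean + 3, population_sd) for _ in range(50)]

for name, sample in (("same curve", sample_same), ("shifted curve", sample_shifted)):
    m = statistics.mean(sample)
    # How many standard errors is the sample mean from the assumed mean?
    z = (m - population_mean) / (population_sd / len(sample) ** 0.5)
    print(f"{name}: sample mean = {m:.2f}, z = {z:.2f}")
```

The standardized distance z is small for the first sample and large for the second; a hypothesis test formalizes exactly this kind of evidence.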
8. Understand Types of Errors: Let us understand
some more examples of null and alternate
hypothesis. So suppose if my project
is about to shed you, my null hypothesis
is a fixed value. So I would say my
current mean of my current average
time to build to share Julie's 70% are. Current. Average of P to S is 70%. The alternate hypothesis would
mean that it is not 70%. Suppose I'm thinking about the moisture content
of a product. I'm in a manufacturing
setup and I want to measure if the moisture content
should be equal to 5%. If 5% is what is
acceptable by my customer, then I can say my
moisture content is less than or equal
to five per cent. Then the alternate
hypothesis would claim that the moisture content is
greater than five per cent. The case where the
mean is greater than, then the null hypothesis. We do not have the
interest in that problem. Let's understand it further. The question was,
did a recent change to the small business loan
approval process reduce the average cycle time
for processing the loan? The answer could be no, meaning the mean cycle time did not change. Or the manager may say that yes, the mean cycle time
is lower than 7.5. So the status quo is that the mean is
equal to 7.5 minutes. And the alternate says, no, it is less than 7.5
minutes, or days, whatever is the unit of measurement we are
measuring, right? So by default, your status
quo is your null hypothesis. And the claim or
the status that you want to prove is your
alternate hypothesis. Now, there could be some sort of errors when we make decisions. So let's go back
to our court case. The defendant is in
reality not guilty, right? Let me take up my laser pointer. By default, the reality is that the
defendant is not guilty. Verdict also comes
that the defendant, the person is not guilty. It's a good decision, right? So yes, we have made a very good decision that
the person is innocent. In reality, the
defendant is guilty. And the verdict also
comes that he's guilty. The decision is a good decision. But what happens when, in reality, the person is not guilty, but the verdict comes that he's guilty and an innocent
person gets convicted? It's an error. It's a very big error. An innocent person given a
sentence and put in jail, given a penalty,
that's an error. The error can even happen
on the other side, where in reality the
person is guilty, but the verdict comes
that he's not guilty. The guilty person is declared as innocent and he's set free. This is also an error, but which is the bigger error? You can write down in the comment
box what you think. Which error is the bigger error: is error A the bigger error, or is error B
the bigger error? Is an innocent person getting
convicted the bigger error, or is a guilty person moving on the roads free
the bigger error? I hope you have already
written the comments. So the reality is this
becomes my bigger error. And this is called
as type one error. Because if an innocent
is convicted, we cannot give back the
time that he has lost. He would go through
a lot of emotional trauma. If a guilty person is
declared as innocent, we can take him to
the higher court and the Supreme Court and try
to prove that yes, he is guilty, right? So I can still get the decision over here that the person should be convicted,
that he should be declared as guilty and
should be punished. So this error is called
as type two error. If somebody asks you which
error is the bigger error, it is the type one error, which is also
called an alpha error. And the type two error is called
a beta error. Right? Let's continue
more in our next class.
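The two error types can be summarized as a tiny decision table in Python, using the court-case framing from this lesson (the null hypothesis being "the defendant is innocent"):

```python
def classify(null_is_true, we_reject_null):
    """Map reality vs. decision onto the error types from the lesson."""
    if null_is_true and we_reject_null:
        return "Type I (alpha) error"       # innocent person convicted
    if not null_is_true and not we_reject_null:
        return "Type II (beta) error"       # guilty person set free
    return "good decision"

# Null hypothesis: "the defendant is innocent" (the status quo).
print(classify(null_is_true=True, we_reject_null=False))   # good decision
print(classify(null_is_true=True, we_reject_null=True))    # Type I (alpha) error
print(classify(null_is_true=False, we_reject_null=True))   # good decision
print(classify(null_is_true=False, we_reject_null=False))  # Type II (beta) error
```

Rejecting a true null is the alpha error; failing to reject a false null is the beta error, exactly as in the court analogy.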
9. Understand Types of Errors-part2: Let us understand the types
of arrows once again. So as we know that if the person is not guilty or the
person is innocent, and the verdict is also saying that the
person is not guilty. It's a good decision. If the person is guilty, verdict is he's guilty. The decision is again,
a good decision. The convict has to be sentenced or
punished. The problem will happen when an innocent person is proved
as guilty and he suffers. The second type of problem happens when a guilty person, a criminal,
is declared as innocent and he is set free. The first one is called
a type one error. That is, an innocent
person getting convicted or punished
is a type one error. It is also called
as an alpha error. A guilty person, a criminal, set free is called a type
two error or a beta error, which is also an error
which we want to avoid. The level of significance
is set by the alpha value. So how confident do
you want to be in making the
but we rejected. Type two error happens when
in reality the null is false, but we fail to reject it. Now how does this
help us process? So let us just understand this
every day for lunch sheet. Right? Let's understand
this in more detail. This is the actual scenario. Let's write the
actual on the top. And this myths
like the judgment. Okay, now, let's think
about the process. The process has not changed. Has not changed. No alternate will be process has changed. Now the judgment is noted. And the judgment is the
process has improved. Okay. Now I'm going to ask you a
very important question. If a process has not changed and the judgment is that
there is no change, this is the correct decision. Process has changed and the judgment is also that
the process has improved. That's also a correct decision. Now, imagine the process
has not changed, but we declared that now I
have an improved process and an improved product, and I inform the customer. Is it correct? No, it is an error. And this is called a type
one error, because the same old product is sold to the
customer as a new product. Can you understand
what will happen to the reputation of the company? The same old product is sold to the customer as a new and improved product. So what will happen to the
reputation of the company? It will go for a toss
and hence we say, this is not a good decision. Now understand here also
the process has changed. The process has improved, but the judgment comes
as not improved. This is also an error, I don't deny it. This is called a
type two error, or it is also called
a beta error. Here, what happens is that we are not communicating
to the customer that the improvement
has happened, right? So we are keeping the improved product
in the warehouse. Now this is also not correct, but the bigger error is where actually we have
not done an improvement, but I'm informing the customer that we have an improved product.
10. Remember-the-Jingle: When we do test of hypothesis, there are always two hypothesis. One is the default hypothesis, which is the null hypothesis, and second is the
alternate hypothesis which you want to prove. And that's the reason you
are doing the hypothesis. The reason we do the hypothesis test is that we never have access
to the whole population. So when we collect a sample, we want to understand: is the sample coming
from the bell curve, the distribution
we are studying, and is whatever
variation you see just the natural
property of the dataset? Sometimes your sample could be at the far corner of the bell curve. And that is the place where we get the confusion:
does this data belong to the original bell curve, or does it belong to a
second, alternate bell curve that is there? We will be doing exercises
which will give you an understanding of this
more easily. When you do a hypothesis test, you get
information like the p-value apart from the
test statistic results. We always compare the p-value with the alpha value
that we have set. Suppose you want to
be 95% confident. Then you set the p-value as 5%. And if you set the
confidence level is 90%, then your Alpha value
is ten per cent, or your p-value is 0.10. The reason we do a p-value is that if you can
see this bell curve, the most likely observations are part of the
center of the bell. Very unlikely observations
are coming from the tail. This p-value, the green region, helps you tell
whether the sample belongs to the original bell curve or whether it belongs to the
alternate bell curve, that is, the one you are trying to prove through
the alternate hypothesis. Hence, the p-value comes as a help for you. To
easily remember this, remember the jingle: P low, null go. It means if the p-value is
less than the alpha value, I'm going to reject
the null hypothesis. P high, null fly. If the p-value is more
than the alpha value, we fail to reject
the null hypothesis, concluding that we do not have enough statistical evidence that the alternate hypothesis exists. We will be doing a lot of
exercise and I will be singing this jingle multiple times so that it's easy for
you to remember: P low, null go; P high, null fly. Some of the participants, when I do the workshop,
get confused; they will say, "null
go" means what? The other thing which
I tell them, to easily remember, is F for
fly and F for fail. So if P is high, null will fly. It means you're failing to
reject the null hypothesis. Null hypothesis will exist. The alternate hypothesis
will get rejected. Remember one more thing which is mostly asked
during the interview. Suppose the p-value is 0.123. Would you reject
the null hypothesis, or would you accept
the null hypothesis? Or would you accept the
alternate hypothesis? As statisticians, we never accept any hypothesis. Either we reject
the null hypothesis or we fail to reject
the null hypothesis. We always say it from
the point of view of the null because the default status quo is your
null hypothesis. If the P is high, we do not say we accept the null
or the alternate hypothesis, and we do not say we accept
the null hypothesis; we say we fail to reject
the null hypothesis. If the P is low, we do not accept the alternate, but we say, I reject
the null hypothesis, concluding that there is enough statistical evidence that the data is coming from
the alternate bell curve. We will continue with
a lot of exercises, and this will give you confidence about how to practice, interpret, and use inferential statistics in your analysis.
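Before moving on, the jingle can be captured in a few lines of Python. This is a minimal illustration; the function name `decide` and the default alpha of 0.05 are just assumptions for the demo:

```python
# A minimal sketch of the "P low, null go; P high, null fly" rule.
# The function name and the default alpha of 0.05 are illustrative choices.

def decide(p_value, alpha=0.05):
    """Return the hypothesis-test decision for a given p-value."""
    if p_value < alpha:
        return "reject the null hypothesis"        # P low, null go
    return "fail to reject the null hypothesis"    # P high, null fly

print(decide(0.01))    # below alpha -> reject
print(decide(0.278))   # above alpha -> fail to reject
```

Note that the second branch says "fail to reject", never "accept", matching the wording above.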
11. Test Selection: One of the most common questions my participants ask when I'm mentoring a project is: which hypothesis test should I use? So this is a simple analysis which will help you understand which test you should be using. Just like the way when a
patient goes to a doctor, the doctor does not prescribe him all the tests. He just prescribes the appropriate test based on the problem that the patient is facing. If the patient says, I met with an accident, the doctor would say, I think you should get your X-ray done. He would not be asking him to go for a COVID or RT-PCR test. If the person is coughing and suffering from fever, then RT-PCR is suggested, and at that point of time the X-ray is not suggested. In a similar way, when we do hypothesis testing, we're trying to understand a sample or compare it
with the population. We want to understand what
test should we be performing? If I'm testing for means, that is, your average, then you compare the mean of a sample with the
expected value. So I'm comparing the
sample with my population. Then I go for my
one-sample t-test. I have only one sample
that I'm comparing. I want to compare if the
average performance of the, if the average sales
is equal to x amount, which is the expected value. So we were expecting
the sales to be, let's say 5 million. My average is coming to say 4.8. I have I met that are not. So then I can go and do
a one-sample t-test. Compare the mean of samples with two different proportions. So if I have two
independent T's, so let's say I'm conducting
a training online. I'm conducting a
training offline. It is the Shrina and I have a set of students who are
attending my online program. I have a different
set of students who are attending
my program of mine. I want to compare the
effectiveness of training. So I have two samples, and these are two
independent samples because the participants
are different. Then I go for two-sample t-test. If I want to compare
the same set of people at two points of time: people come for my training, I do an assessment before my training program about their understanding of Lean Six Sigma, then I conduct the training program, and the same set of participants attend the test after the training program. So the participants are the same, but the change which has happened is the training which was imparted to them. I have the test results before the training and the test results after the training, and I want to check whether the training is effective. Then I go for a paired t-test. Progressing further. Suppose I am
testing for frequency, I have discrete data
and I want to test the frequency because in discrete data I do
not have averages. I take frequencies. So when I'm comparing
the count of some variable in a sample to
the expected distribution, just like the way I
had a one-sample t-test, the equivalent of it for discrete data would be my chi-square goodness of fit. By default I have an expected value, or an expected distribution, and I'm comparing how far my data is from it; I go for a chi-square goodness of fit. This test is available in Minitab; in Excel it is not available, so I will be creating a template and giving it to you, which will make it easy for you to do all three different types of chi-square tests using the Excel template. If I have to compare the counts of a variable between two samples, it will be a chi-square homogeneity test. If I'm checking a single sample to see whether two discrete variables are independent, I do a chi-square
independence test. If I have proportion data, like good or bad applications, accepted versus rejected, and I am saying that, okay, 50% of the applications get accepted, or twenty-five per cent of the people get placed, then I have a proportion which I want to test. If I have only one sample, I go for a one-proportion test. If I want to compare the proportion of commerce graduates versus science graduates, or the proportion of finance MBA people versus marketing MBA people, I have two different samples, so I can go for a two
proportion test. So to summarize: when I'm testing, am I testing for averages, am I testing for frequencies like discrete data, or am I testing for proportions? Depending upon that, you pick up the appropriate test and work on it. We're going to
practice all of it using Minitab and using Excel. The dataset is available in the description section. I invite you all to practice it and put your analysis in the project section. If you have any doubts, you can put them in the discussion section and I'll be happy to answer. Happy learning.
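To wrap up this lesson, the selection logic above can be sketched as a rough Python lookup. The category labels and the `pick_test` helper are my own simplification of the table in this lesson, not part of the course material:

```python
# Rough lookup table for test selection, simplified from the lesson:
# what am I measuring, and how many samples do I have?
TEST_CHOICE = {
    ("mean", 1): "one-sample t-test",
    ("mean", 2): "two-sample t-test (paired t-test if same people before/after)",
    ("frequency", 1): "chi-square goodness of fit",
    ("frequency", 2): "chi-square homogeneity / independence test",
    ("proportion", 1): "one-proportion test",
    ("proportion", 2): "two-proportion test",
}

def pick_test(measuring, n_samples):
    """Suggest a hypothesis test from the data type and number of samples."""
    return TEST_CHOICE[(measuring, min(n_samples, 2))]

print(pick_test("mean", 1))         # one-sample t-test
print(pick_test("proportion", 2))   # two-proportion test
```

This is only a memory aid; the Minitab Assistant shown in the next lesson performs the same routing interactively.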
12. Understand 1 sample t test: Let us understand which hypothesis test we should use. In Minitab, you have an Assistant which can help you make that decision. So if you go to Assistant hypothesis testing, it will help you identify the test based on the number of samples that you have. Suppose you have one sample: you might be doing a one-sample t-test, one-sample standard deviation, one-sample percent defective, or chi-square goodness of fit. If you have two samples, then you have a two-sample
t-test for different samples, a paired t-test if the before and after items are the same, a two-sample standard deviation test, two-sample percent defective, and the chi-square test of association. If you have more
than two samples, then we have one-way ANOVA, the standard deviation test, chi-square percent defective, and the chi-square test of association. We will be practicing all of
it with lots of examples. So let's come to
the first example. We have the AHT (average handle time) of calls in minutes. We have taken a sample of 33 data points. The average is seven, the minimum value is four minutes, and the maximum value is ten minutes. The reason we have to do hypothesis testing is that the manager of the process says his team is able to close the resolution on the call in seven minutes, and the process average is also seven minutes, minimum four minutes. But the customer says that the agents keep them on hold and it takes more than seven minutes on the call. So now I want to statistically validate whether
it's correct or not. Whenever we are setting
up hypothesis testing, we have to follow the five-step approach. Step one, define the alternate hypothesis. Step two, define the null hypothesis, which is nothing but your status quo. Step three, set the level of significance, or your alpha value; if nothing is specified, we set the alpha value as five per cent. We first set the
alternate hypothesis. So in our case, what is the customer saying? The customer says that the average handle time is more than seven minutes. The status quo, or the agreed SLA, is that the AHT should be less than or equal to seven minutes. As I told you, the null and the alternate hypothesis will be mutually exclusive and
complementary to each other. Now, identify the
test to be performed. How many samples do I have? I have only one sample of the
AHT of the contact center. So I am going to pick
up a one-sample t-test. Okay? Now I need to compute the test statistic and identify the p-value. If you remember the previous lesson, we said if the p-value is less than the alpha value, we reject the null hypothesis; if the p-value is greater than five per cent, or the alpha value, we fail to reject the null hypothesis. Let us apply this understanding. So if you remember, we have our project data. In the project data, we have the test of hypothesis sheet. Over here, I have given you the AHT of calls in minutes. So I have copied this
data onto Minitab. Let's do it in two ways. First, I will show it to you using the Assistant; second, I will show it to you using Stat. So if I go to Assistant
hypothesis testing, what is the objective
I want to achieve? It's a one-sample t-test.
I have one sample. Is it about the mean? Is it about the standard deviation? Is it about percent defective or discrete numbers? We're talking about the average handle time, so I'm going to take the one-sample t-test. For data in columns, I have selected this column. What is my target value? My target value is seven. The alternate hypothesis is that the mean AHT of the call in minutes is
greater than seven. This is what the
customer is complaining about. The alpha value is 0.05 by
default, I click on, Okay. Let's see the output. To see the output
you're going to click on View and output only. You will see that. If you see the p-value,
the p-value is 0.278. You remember: P low, null go; P high, null fly. Is this value of 0.278 greater than the alpha value of 0.05? Yes, it is. Hence, I can conclude
that the mean AHT of calls is not significantly greater than the target. Whatever you are seeing
as greater than target, it is only by chance. So there is not enough evidence
to conclude that the mean is greater than seven with five per cent level
of significance. And it also shows me the data pattern. There are no unusual data points, and because the sample size is at least 20, normality is not an issue. The test is accurate, and it is valid to conclude that the average handle time is not
than seven minutes. I can go ahead and reject the claim given by the customer. The few calls that we see as high-quality,
high-value goals. This could be only by chance. The same test. I can also do it by clicking on test stat, basic statistics. And I'll save one sample t-test, one or more samples,
each in one column. I will flick your select ADHD. I want to perform
hypothesis testing. Hypothesized mean is seven. I go to Option and I say, what is the alternate
hypothesis I want to define. I want to define that the actual mean is greater
than the hypothesized mean. Click on Okay. If I need graph, I can put up these graphs. Click on Okay, and
click on Okay. I get this output. So the descriptive statistics, this is the mean, this is the standard
deviation and so on. Null hypothesis is mu
is equal to seven. Alternate hypothesis is
mu is greater than seven. P-value is 0.278. Concluding that null flight, we fail to reject
the null hypothesis, concluding that the
average 100 time is around seven minutes.
Let's continue. We received our output. We saw all of this, and we have concluded that
the average handle time is not significantly
greater than seven minutes.
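The same one-sample t-test can be reproduced in Python with scipy. The 33 AHT values below are simulated stand-ins for the course dataset, so the exact p-value will differ from Minitab's 0.278:

```python
# One-sample t-test: is the mean AHT greater than the 7-minute target?
# The data are simulated placeholders, not the actual course dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
aht = rng.normal(loc=7.1, scale=1.5, size=33)   # hypothetical call times

# H0: mean AHT <= 7; H1: mean AHT > 7 (the customer's claim)
t_stat, p = stats.ttest_1samp(aht, popmean=7, alternative="greater")
print(f"t = {t_stat:.2f}, p-value = {p:.3f}")
# Compare p against alpha = 0.05: P low, null go; P high, null fly
```

The `alternative="greater"` argument matches the one-sided alternate hypothesis set in the Minitab Options dialog.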
13. Understand 2 sample t test example 1: Let's do one more example
of two teams, two samples. So in this example, there are two teams whose performance needs to be measured. The manager of team B claimed that his team is a better performing team than team A. The manager of team A advocates that this claim is invalid. Let's go to our dataset. So if you go to
the project file, you will have something called team A and team B. So let me just copy that data. Okay, let me go here and paste the data on the right side. What I can also do is take a new sheet and paste the data there. Right? So let's come to Assistant, hypothesis testing, two-sample t-test. Let me delete this value. Team A is different from team B. I can also say, based on the hypothesis that team B claimed his team is better than team A, that team A is less than team B. And I click on Okay. Again, in this example,
significantly less than TB. Do you have the
values of 27.727.3? There is no
statistical difference between both the tips, right? So both the examples which
we got were like that. So let's go and see
one more example. I have taken cycle time of process one and cycle
time of process B. So let's just copy this data. This is another data set. And I go, What's my
alternate hypothesis? Both the teams are different. What is the null hypothesis? Both the teams are the same. Because these two samples are independent, I will go ahead and do my two-sample t-test. The data of each team is separate, and I'm saying A is different from B; the alpha value is 5%, and then I click on Okay. Now, if you see the
output this time, it says that yes, the cycle time of A is significantly different from the cycle time of B. Here we have 26.8 and 27.6. And if I look at the distribution, the red region of one is not overlapping with the red region of the other.
same thing using Stat, Basic Statistics, two-sample t-test: the cycle time of A and the cycle time of B; in Options, are they different? I can have my graphs; I don't want an individual value plot, I will only take the boxplot and say okay. Mu1 is the population mean of the cycle time of process A, and mu2 of the cycle time of process B. Now if you see, there is a difference. The p-value is 0, telling us that, yes, there is a significant difference between both the teams. P low, null go. So here we are rejecting the null hypothesis, telling us that there is a significant difference between A and B. Right? We can see the same thing with the distribution curves: there is a larger distribution over here and a smaller distribution there. I can also do the graphical analysis that we learned earlier and see how
each team is performing. So this is the summary of team A: the mean is 26, the standard deviation is 1.5. And if I scroll down, I get the same for team B. Now I want to overlap these graphs, so I can click on Graph and Histogram, and I'll say with fit, and select these two graphs on separate panels of the same graph, with the same Y minimum and maximum. Click on Okay. Click on Okay. Can you see that the bell curves of both of them are different? Let's do an overlapping
graph: Histogram, and in Multiple Graphs, overlay on the same graph. Can you see that between the blue and the red there is a difference? And hence, yes, the kurtosis is different, the skew is different, and that's the conclusion of my two-sample t-test: there is a statistically significant difference between the cycle time of process A and the cycle time of process B. Next, we will learn about the paired t-test in an upcoming example.
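For the two-sample comparison, scipy's `ttest_ind` does the same job as the Minitab dialog. The cycle-time arrays below are illustrative numbers chosen so the difference is clear, not the course data:

```python
# Two-sample t-test: are the mean cycle times of the two processes different?
# Illustrative data; Welch's test (equal_var=False) avoids assuming equal variances.
import numpy as np
from scipy import stats

cycle_a = np.array([25.1, 26.3, 27.0, 26.8, 25.9, 27.4, 26.2, 26.6])
cycle_b = np.array([27.9, 28.4, 27.2, 28.8, 27.6, 28.1, 27.8, 28.5])

# H0: mean A == mean B; H1: mean A != mean B
t_stat, p = stats.ttest_ind(cycle_a, cycle_b, equal_var=False)
print(f"t = {t_stat:.2f}, p-value = {p:.4f}")
# P low, null go: a small p-value means the two means differ significantly
```

Because the participants in each sample are different people, the samples are independent, which is exactly why `ttest_ind` (and not the paired test) applies here.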
14. Understand 2 sample t test example 2: Let's come to example two. There are two centers whose performance needs to be measured. The manager of center A claimed that his team is a better performing team than center B. The manager of center B advocates that the claim is invalid. Again, I will follow
my five-step process. What is the alternate hypothesis? A is better than B. Let's make it easier: center A is not equal to center B. What is the null hypothesis? Center A is equal to center B. Level of significance: five per cent. How many samples do I have? I have two samples, center A data and center B data. Because I have two samples, I need to go for
a two-sample t-test. Let's go to our Excel sheet. I have the data for center A and center B. I'm going to copy it into Minitab. I'm placing my data here. Let's do the two-sample t-test. So I go to Stat, Basic Statistics, and select two-sample t-test. Each sample has its own column, so I'm going to select them: sample one is center A, sample two is center B. The option is the hypothesized difference, which is zero, that is, no difference. So the difference between A and B is 0. And I go ahead and do it. I can have my individual
boxplot, say Okay, and Okay again. Let's see the output. So the center A data is here and the center B data is here. And if you see the p-value, the p-value is high. Again, I got an example which says P high, null fly, meaning there is no difference between center A and center B. If you see the individual values, you see the same thing. Let's see the boxplot. The boxplot says that the mean is not significantly different. We have only taken a sample, and you are seeing a value of 0, which is an outlier, so we should be considering that. The same thing, let me do it using
the Assistant: hypothesis testing, two-sample t-test. The mean of center A is different from the mean of center B, and I click Okay. So does the mean differ? The mean of center A is not significantly different from the mean of center B. Right? If you see this distribution, you can find that the red parts are completely overlapping with each other, telling us that there is not enough evidence to conclude that there is a difference. There is a difference when you see the means, 6.8 and 6.5, but that could be because of chance, and there is a standard deviation also. Hence, it is shown using the red bars, telling us that there is not a significant difference between center A and center B. We will continue learning about other examples in the coming video.
15. Understand Paired t test: Let us understand
one more example. This is an example
of paired t-test. If you look at this case study, the psychologists wanted
to determine whether a particular running program has an effect on their
resting heart rate. The heart rate of 15 randomly selected
people were measured. The people were then put on a running program and measured
again after one year. So are the participants
saying before versus after? Yes. And that is the reason it
is not two-sample t-test, but it is a paired t-test test, the before and after
measurement of each person or in
bands of observation. So if I go back to my dataset, I have something called
before and after; there is also a difference column, but I'm not taking the difference values. I've taken the data for the 15 people and put it up in Minitab. Right? Now, because it's the same set of people before and after, under hypothesis testing I'm going to take the paired t-test. The first thing is: what's the alternate hypothesis? Before and after are different. If you remember, with the running program
they want to determine if it has an effect on the resting heart rate. Measurement one is before, measurement two is after. The mean of before is different from the mean of after; that's my alternate hypothesis. What's my null hypothesis? There is no change. The alternate says the before is different from the after. The alpha value is 0.05. Let's click on Okay. Let's see the output. Does the mean differ? The p-value is 0.007. The mean of before is significantly different
from the mean of after. If you look at the mean values, they were 74.5 and 72.3; there is a difference, and if you see, the difference is more than 0. And if I look at these values of before versus after, the blue dot is after and the black dot is before. For most of the participants, the heart rate had reduced after the running program. A few of them were exceptions, but that could be just chance. There are no unusual paired differences because our sample
size is at least 20. Normality is not an issue. The sample is sufficient to detect the difference
in the mean. So I can see that, yes, there is a difference
between both of them. Wonderful. So again, a quick revision: P low, null go. As the p-value is less than the significance level, we conclude that there is a significant difference between both the readings. If I have to do the same, I click on Stat,
Basic Statistics, Paired t-test, each sample in one column: before, after. The option is that they are different. I don't want to pick the histogram; I'll only take the boxplot. The null hypothesis: the difference is 0. The alternate hypothesis is
the difference is non-zero. The p-value is low, concluding that I reject the null hypothesis, and there is a difference from adopting the program. So if you see the null value, the red dot is far away from the confidence interval of the boxplot, concluding that there is a difference from undergoing the program prescribed by this specialist, right? In the next lesson, we will take up more examples.
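The paired t-test has a direct scipy equivalent, `ttest_rel`. The 15 before/after heart rates below are hypothetical stand-ins for the case-study data, constructed so that most rates drop after the program:

```python
# Paired t-test: the same 15 people measured before and after the running program.
# Hypothetical data, not the course dataset.
import numpy as np
from scipy import stats

before = np.array([74, 78, 72, 80, 76, 73, 77, 75, 79, 71, 74, 76, 78, 72, 75])
drop   = np.array([ 2,  3,  1,  4,  2,  0,  3,  2,  4,  1,  2,  3,  2, -1,  2])
after  = before - drop

# H0: mean difference == 0; H1: mean difference != 0
t_stat, p = stats.ttest_rel(before, after)
print(f"t = {t_stat:.2f}, p-value = {p:.4f}")
# P low, null go: the before/after means differ significantly
```

`ttest_rel` works on the per-person differences, which is exactly why pairing matters: the test ignores how people differ from each other and looks only at how each person changed.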
16. Understand One Sample Z test: A quick recap of the different types of tests that we learned: if I'm looking at how different my group is from the population, I go for a one-sample t-test. When I have two different groups of samples, and these samples are independent, I go for a two-sample t-test. I go for a paired t-test if it is the same set of people, but measured at different points of time, like the example of the heart rate: the people were measured on their resting heart rate, they were put through a running program, and post the program, how was their resting heart rate, right? So those are the things that we covered. Now let's continue
with more examples. We are on use case number five, fat percentage analysis. A scientist for a company that manufactures processed food wants to assess the percentage of fat in the company's bottled sauce. The advertised percentage is 15%, and the scientist measures the percentage of fat in 20 random samples. The previously measured population standard deviation is 2.6. Now, this is the population standard deviation; the standard deviation of the sample is 2.2. When I know the population parameter, I can go ahead and use a one-sample z-test, because the number of samples I have is one and I have the known standard deviation of the population. Now, again, I'm going to apply the same approach: define the
alternate hypothesis, right? What's the alternate hypothesis? The fat percentage is not equal to 15%. What's the null hypothesis? The fat percentage is equal to 15%. Level of significance: five per cent. Because I know it's a one-sample test and I have the population standard deviation, I'm going to use a one-sample z-test. Let's do the analysis. I have opened the
project file, and I have the sample IDs and the fat percentage data over here. Let me copy this data into Minitab. I have copied the percentage of fat that the scientist measured. Because we know the population standard deviation, I can go ahead and use the one-sample z-test. My data is present in a column; it's the fat percentage. The known standard deviation is 2.6. I want to perform hypothesis testing. The hypothesized mean is 15%. So my null hypothesis is that the fat percentage is equal to 15; my alternate hypothesis is that the fat percentage is not equal to 15. I can pick the boxplot and histogram graphs and say Okay. I will show you the output. So the null hypothesis is that the fat
percentage is equal to 15. Alternate hypothesis
is fat percentage is not equal to 15. Alpha value is 0.05. My p-value is 0.012, as my p-value is less
than the alpha value, P low, null go. So I reject the null hypothesis, concluding that the fat percentage is not equal to 15. If you see over here, the fat percentage is more than 15. I can redo the same test; this time, I can go ahead and check: is my fat percentage greater than the hypothesized mean? Let's do it. And I get my p-value even more confidently: 0.006, very far from my alpha value. Concluding that, yes, the hypothesized mean under the null is 15, but the sample says that there is a high probability that the fat percentage in the sauce is more than 15. What is the advice we
will give to the company? We will advise the company that you cannot sell the product claiming that the content is 15% fat, because the actual fat is more than 15%. So, to be safe, you can change the label of the product to say that the fat percentage is 18, right? A consumer will be happy to receive a product which contains less fat than labeled, rather than a product which contains more fat, because we are all health-conscious, right? So let's continue in the next class.
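scipy has no built-in one-sample z-test, but the statistic is easy to compute by hand once the population sigma of 2.6 is known. The 20 fat-percentage readings below are invented for illustration, not the course dataset:

```python
# One-sample z-test by hand: known population sigma, H0 mean = 15%.
# The 20 readings are illustrative placeholders.
import math
from scipy import stats

fat = [16.5, 18.2, 15.9, 17.4, 16.8, 19.1, 15.2, 17.7, 16.3, 18.5,
       17.0, 16.1, 18.8, 15.6, 17.9, 16.6, 18.0, 17.2, 15.8, 18.4]

mu0, sigma, n = 15.0, 2.6, len(fat)
mean = sum(fat) / n
z = (mean - mu0) / (sigma / math.sqrt(n))     # z-statistic
p = 2 * (1 - stats.norm.cdf(abs(z)))          # two-sided p-value
print(f"mean = {mean:.2f}, z = {z:.2f}, p-value = {p:.4f}")
# P low, null go: reject H0 that the fat percentage equals 15%
```

The only difference from the t-test is the known sigma in the denominator and the normal (rather than t) distribution for the p-value.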
17. Understand One Sample proportion test-1p-test: We will continue with our hypothesis testing. Sometimes we might have proportion data, right? We do not have averages or a standard deviation or variance to measure. Let's take example six: a marketing analyst wants to determine whether the mail advertisement for the new product resulted in a response rate different
from the national average. Normally, whenever you put an advertisement in the paper, the advertising company usually says that we will be able to impact a 6% outcome, or a 10% outcome, or some number. Here it's the same type of scenario. They took a random sample of 1,000 households who received the advertisement, and out of these 1,000 households, 87 made purchases after receiving this advertisement. So this company, which is
an advertising company, is claiming that it has made a better impact than the others. The analyst has to perform the one-proportion z-test to determine whether the proportion of households that made a purchase was different from the national average of 6.5%, because this sample shows 8.7%. In this case, what is your
alternate hypothesis? The alternate hypothesis is that the response to the advertisement is different from the national average. The null hypothesis says there is no difference; they both are the same. The alpha value is five per cent, and we're going to take up the one-proportion z-test, or 1-proportion test. So let's go to Minitab. I can go to Stat, Basic Statistics, one proportion. I do not have data in my columns, but I have summarized data, right? So let me close this, cancel. So I have taken the one
sample proportion test. I have summarized data. How many events are we observing? We are observing 87 events. The sample size is 1,000. I need to perform the hypothesis test, and the hypothesized proportion is 6.5%, that is, 0.065. The alternate is that this proportion is not equal
to the hypothesized proportion. I say Okay, and Okay. Now, the null hypothesis is that the proportion is equal to 6.5 per cent; the alternate hypothesis is that the proportion is not equal to 6.5 per cent. The p-value is 0.008. What does it mean? Yes, P low, null go. So we reject the null hypothesis, concluding that the effect of the advertisement is not 6.5 per cent but more, because if you see the ninety-five per cent confidence interval, it runs from about 7% to 10%, right? You have got a proportion of 8.7%, and the 95% confidence interval of the proportion is well above 6.5; it starts from 7. So we can conclude that there is a significant impact of the advertisement, and we can go with this advertising company. Let's continue in
our next lesson.
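The one-proportion test from this example can be checked by hand in Python. Note the normal-approximation p-value here may differ slightly from Minitab's output, which can use the exact binomial method:

```python
# One-proportion z-test: 87 purchases out of 1,000 vs a national average of 6.5%.
import math
from scipy import stats

x, n, p0 = 87, 1000, 0.065
p_hat = x / n
se = math.sqrt(p0 * (1 - p0) / n)       # standard error under H0
z = (p_hat - p0) / se
p = 2 * (1 - stats.norm.cdf(abs(z)))    # two-sided p-value
print(f"p-hat = {p_hat:.3f}, z = {z:.2f}, p-value = {p:.4f}")
# P low, null go: the response rate differs from the 6.5% national average
```

The standard error uses the hypothesized proportion p0, not the sample proportion, because the test statistic is computed assuming the null hypothesis is true.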
18. Understand Two Sample proportion test-2p-test: Let's do this exercise one more time using the Assistant. So we have the 802 products by supplier A that we have checked; 725 are perfect, or non-defective. So how many are defective? If I do a subtraction, 802 minus 725 is 77. 712 products were sampled from supplier B, of which 573 are perfect. So how many are defective? 139. So let's try to do our
two-proportion test using the Minitab Assistant: hypothesis testing, two-sample percentage defective. Supplier A: 802 sampled, 77 defective; supplier B: 712 sampled, 139 defective. The alternate is that the percentage defective of supplier A is less than the percentage defective of supplier B. I will go ahead and click on Okay. And I get that, yes, the percentage defective of supplier A is significantly less than the percentage defective of supplier B. And if I scroll down, yes, it shows the difference. From the test you can conclude that the percentage
defective of supplier A is less than that of supplier B at a 5% level of significance. When you look at these percentages, you can also clearly see the difference. We will continue with the next hypothesis test in the next video.
19. Two Sample proportion test-2p-test-Example: Now let us understand
the next example. This is an example where an operations manager samples products manufactured using raw material from two suppliers, to determine whether one supplier's raw material is more likely to produce a better quality product. So 802 products were sampled from supplier A; 725 are perfect, that is, non-defective. 712 products were sampled from supplier B; 573 are perfect, that is, non-defective. So we want to perform
a two-proportion test, because what we have is proportion data: the non-defective percentage. Yes, I have got two proportions, supplier A and supplier B. Let's go to Minitab. I can go to Stat, Basic Statistics, two-proportion test. I have my summarized data: the events for the first sample are 725 out of 802, and for the second, 573 out of 712. The option we are choosing is that there is a difference, and let's find out. So the null hypothesis
between the proportion. Alternate hypothesis is there is a difference between
both the proportions. When I was looking
at the p-value, the p-value comes out to be Z, to be low null. It is concluding that I have to reject the
null hypothesis. There is a difference in the performance of
the two suppliers. Now, if I think about
because I'm talking about perfect or
non-defective, currently, sample one has 90% perfect and sample two has 80% perfect. So concluding that supplier E is a better supplier
than Supplier B. Right? So, thank you so much. We will continue in
the next lesson.
20. Using Excel = one Sample t-Test: Many times we understand tests of hypothesis, but there is a challenge: I do not have Minitab with me. Can I not do a test of hypothesis in an easy way, rather than going through a manual calculation using a statistical calculator? Do not worry, that is possible. I'm going to show you how to do a test of hypothesis using Microsoft Excel. Go to File, go to Options. When you go to Options,
go to Add-ins. When you click on Add-ins, you have an option called Excel Add-ins in the Manage dropdown. So select Excel Add-ins and click on Go. Click on Analysis ToolPak and ensure that its tick mark is on. Once you have that, you will find that in your Data tab you have Data Analysis available. Let me click on it for you to understand what's possible. In Data Analysis, I have ANOVA, correlation, covariance, descriptive statistics, histogram, t-tests, z-tests, random number generation, sampling, regression, and all those things. So it becomes very easy for you to do hypothesis testing; at least continuous-data hypothesis tests are done easily through
the presentation. Let's take the first problem: I have the descriptive statistics for the AHT of the calls. The manager of the process says that his team is able to close the resolution on the call in seven minutes, but the customer says that he's kept on hold for a long time and hence spends more than seven minutes. If I look at the descriptive statistics, it is telling me the maximum is ten minutes, the median is seven, and the average is 7.1. Now I would want to do this analysis using Microsoft Excel. So let's get started. I have this use case in the project data
which I have uploaded; click on the AHT sheet and it takes you to this place. Now, I will first teach you how to do descriptive statistics using Microsoft Excel. I'm going to click on Data Analysis under the Data tab and look for Descriptive Statistics. Click on Okay. My input range is from here to the bottom; I have selected it. My data is grouped by columns. The label is present in the first row. And I want my output to go to a new workbook. I want summary statistics and I want the confidence level for the mean. I click on OK. Excel does some calculation and gets it ready. Yes, here is my output. I apply number formatting over here
to see what's the output. So you can see you are mean, median mode, standard
deviation, kurtosis, skewness, range,
minimum, maximum, sum, count, confidence level. All these things are easily calculated by a
click of a button. I do not have to write
so many formulas. Now, let us go back
to our dataset. I want to do the
hypothesis testing. What is my null hypothesis? When the null hypothesis is that the ADHD is equal
to seven minutes. Alternate hypothesis. The ADHD is not seven minutes. There is a different alpha
value I'm setting up as 5%. And with that, I'm going to
conduct the tests that I'm going to connect is
a one-sample t-test. When you are doing
one-sample t-test using Microsoft Excel, you will have to
follow a small trick. The trick is, I'm going to
insert a column over here. And this, I'm going
to call it as dummy. Because Microsoft Excel comes with an option of
two-sample t-test. I have HD of the call in minutes and dummy where I have
written down to zeros, zeros. However, the average median, everything for 0 is always 0. Click on data analysis. I will go down and I will say two sample t-test
assuming equal variances. I'm going to select this and click on OK. My Input Range 1 is this column. My Input Range 2 is this dummy column. My hypothesized mean difference is seven minutes. Labels are present in both, and the alpha value is set as five per cent. And I'm telling it that my output needs to be in a new workbook. I click on OK; it does the calculation and gets me the output. You can see that the numbers have come. As a practice, I just click on the comma in the Format section so that the numbers are visible. I'm changing the view. Because dummy does not have any data, I am free to go ahead and delete this column. Now let's understand what
do we always look for? We look for this value, the p-value. Do you remember the formula? Let me get my formulas over here. Yes. What is the conclusion? The conclusion is: p high, I fail to reject the null hypothesis, concluding that the AHT of the call is seven minutes. I'm rejecting the alternate hypothesis because my p-value is above 0.05. I'll be taking up more examples in the following lessons, so I'm looking forward to you continuing this series. If you have any questions, I would request you to drop them in the discussion section below, and I will be happy to answer them. Thank you.
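The Excel exercise above can also be sketched in code. Here is a minimal Python version of the same flow, using only the standard library; the call-time numbers are made up for illustration, and the critical value 2.262 (two-tailed, 5% alpha, 9 degrees of freedom) is taken from a standard t-table rather than computed:

```python
import math
import statistics

# Hypothetical AHT sample in minutes (made-up data standing in for the
# "AHT of the call" column from the course workbook).
aht = [6.8, 7.2, 7.0, 7.5, 6.9, 7.1, 7.3, 6.7, 7.4, 7.1]

# Descriptive statistics, mirroring the Analysis ToolPak output
summary = {
    "count": len(aht),
    "mean": statistics.mean(aht),
    "median": statistics.median(aht),
    "stdev": statistics.stdev(aht),  # sample standard deviation
    "minimum": min(aht),
    "maximum": max(aht),
    "range": max(aht) - min(aht),
}
for name, value in summary.items():
    print(f"{name:>8}: {value}")

# One-sample t-test of H0: mean AHT = 7 minutes (no dummy column needed)
mu0 = 7.0
t = (summary["mean"] - mu0) / (summary["stdev"] / math.sqrt(summary["count"]))
T_CRIT = 2.262  # two-tailed 5% critical value for df = 9, from a t-table

if abs(t) > T_CRIT:
    print(f"t = {t:.3f}: reject H0; the AHT differs from 7 minutes")
else:
    print(f"t = {t:.3f}: fail to reject H0; the AHT may well be 7 minutes")
```

If SciPy is available, scipy.stats.ttest_1samp(aht, 7.0) returns the t statistic together with the exact p-value directly, with no dummy-column trick needed.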
21. Understand the Non Normal Data: Normal or not? Let us try to understand how we work when my data is not normal. Or even before getting there, let me introduce you to this gentleman. Any guesses? Who is the gentleman? You can type in the chat window if you know. And even if you do not know, that's perfectly fine. There are no penalty points for wrong guesses. Yes, some of you have guessed it right. He's the famous person behind our normal distribution: Mr. Carl Gauss, the great mathematician. He was the person who came up with the concept of the Gaussian distribution, or the normal distribution. So here is the brain behind the concept of the normal distribution and all the parametric tests that we are using. If my data is not normal, then it can be skewed. It could be negatively skewed or it could be
positively skewed. Negatively skewed technically means having a tail on the left side; positively skewed means a tail on the right side. It means my data is not behaving in a normal way. My data can also be non-normal because it is following a uniform distribution, or a flat distribution like this; then also it's not following the normal distribution. My data can have multiple peaks, something like this, which represents that there are multiple data groups in my dataset, and that's not normal behavior. Because my data can have all these features, I need to treat this data differently when I'm doing my hypothesis testing. And why is this data not normal? It could be because of the presence of some outliers, it could be because of the skewness of my data, or it could be because of the kurtosis that's present in the data. So the reason for your data not behaving in a normal way could be one of these. Let us summarize,
what did we learn? My data is not normal if the distribution has skewness; if it is not unimodal but in fact a bimodal or multimodal distribution; if it is a heavy-tailed distribution containing outliers; or if it is a flat distribution like a uniform distribution. These are some basic reasons why my data may not behave in a normal way. If it is not a normal distribution, then there are multiple other distributions: the exponential distribution, which models the time between events; the log-normal distribution, which says that if I apply the logarithm on the data, then my data will follow a normal distribution; the Poisson distribution; the binomial distribution; the multinomial distribution. Let us understand some examples, real-life scenarios where the non-normal distributions
can be applied. If you look at this, whenever I am trying to predict counts over a fixed time interval, I use the Poisson distribution for my analysis and hypothesis. Some examples of the Poisson distribution are the number of customer service calls received in a call center, the number of patients that present at a hospital emergency room on a given day, the number of requests for a particular item in an online store on a given day, the number of packages delivered by a delivery company on a given day, and the number of defective items produced by a manufacturing company on a given day. If you observe, there is a common behavior here. Whenever we are trying to understand counts in a particular time period, and it could be a given day, a given month, or a given week, then we prefer to do our analysis using the Poisson distribution. Some examples of
the log-normal distribution: the size of file downloads from the internet, the size of particles in a sediment sample, the heights of trees, the size of financial returns, the size of insurance claims. If I take the example of financial returns from an investment, you might see that out of my portfolio of investments, some investments gave me a very good return of 100 per cent, 150 per cent, 80 per cent. And you will also see that some investments in my portfolio resulted in a zero return or a negative return because I'm in loss. But overall, my portfolio is giving me a return of 12 to 15 per cent or 15 to 20 per cent. So your distribution is technically not a normal distribution: you have very low returns and very high returns. But if you apply the logarithm on your data, then it behaves like a normal distribution, showing that overall your portfolio will result in a return of some X percentage. The same applies to insurance claims. Let us try to understand the application of
the exponential distribution: the time between arrivals of customers in a queue, the time between failures of a machine in your factory, the time between purchases in a retail store, the time between phone calls in a contact center, the time between page views on a website. Now, if you look at the Poisson distribution and the exponential distribution, there is one common element. What is the common element? We're trying to study something with reference to time. Whenever you're dealing with a normal distribution, it's not with reference to time. Right? So these are some applications. But the difference between a Poisson and an exponential is this: in a Poisson distribution, it is on a particular day, a given day, a given week, or a given month. Here, we are trying to understand the time between two events. What is the time gap between the two events? That is where the exponential distribution can help you out. Now let's understand the application of the
uniform distribution: the heights of the students in a class, the weights of packages in a delivery truck. Some packages are very big, some packages are small; if you put them in a distribution, you will find that it's a flat distribution, or a uniform distribution, because for each category of packages you'll have approximately the same number of packages that you're delivering. The distribution of test scores for a multiple-choice exam. The distribution of waiting times at a traffic light. The distribution of the arrival times of customers at a retail store. If you see, all these examples follow a uniform distribution; it's not a bell curve, because you have people continuously arriving at the retail store. It's not that there is a sudden peak. And the real-world scenarios of
heavy-tailed distributions: it means a distribution where outliers are present. The size of financial losses in the insurance industry, or other sizes of financial loss; if you ask a trader, they would see extremely high and extremely low numbers. The size of extreme rainfall: we do not have extreme rainfalls every year, so when one happens, it shows up as an outlier. Heavy-tailed distributions are usually impacted by the presence of outliers. So if your data has outliers, then you can also see that the distribution it follows is a heavy-tailed distribution. And we will understand in the next session what type of non-parametric tests I should be performing, depending upon the type of non-normal data that we are studying. The size of power consumption, the size of economic fluctuations or a stock market crash: these are all examples of your
heavy-tailed distribution. Examples of bimodal data: here you need to understand that bimodal means there are two groups of outcomes that we're trying to study. The distribution of exam scores of students who studied and those who did not. The distribution of ages of individuals in a population who come from two distinct age groups. The heights of two different species. The salary distribution of employees from two different departments. Car speeds on a highway with two groups of slow and fast drivers. So here you can see that I am having two groups of data which are different, and I'm trying to understand their behavior as I go ahead and do my investigation as part of my hypothesis testing or the research that I'm trying to do. If I have more than two different groups, like three or four different groups, then it becomes a multimodal distribution. Right? So I think by now you
would have gotten an idea of the different distributions which are not normal distributions. So how do I determine if my data is not normal? The first thought that comes to our mind is a normality test. But even before doing a normality test, you can use simple graphical methods to find out if your data is normal or not. You can use a histogram, and here the histogram is clearly showing multiple modes, so I can clearly see that this is not a normal distribution. If I try to fit a line, then also I can see that there is skewness in my data. I can also use a box plot to determine if my data is not normal. Here you can see that I have a heavy tail on the left side, telling me that my data is skewed. I can also have outliers, which a boxplot can easily highlight, so I can identify a heavy-tailed distribution using the boxplot as well. I can use simple descriptive statistics, where I can see the numbers for the mean, median, and mode; when I see that these numbers are not overlapping or not close to each other, that also simply indicates that my data is not normal. I can look at the kurtosis and skewness of my data distribution and then come to a conclusion on whether my data is behaving normally or not. So I have shown you several ways of identifying whether your data is following a non-normal distribution or a normal distribution. Now I would say one more thing. Do not worry
if your mean is 23.78, the median is 24, and the mode is 24.2 or 24. If there is only a slight deviation, we still consider it to be normal. Right? Skewness close to zero is an indication that my data is normal. But if my skewness is beyond minus two or plus two, it is definitely a proof of non-normality. Kurtosis is also one more way of identifying if my data is following a normal distribution. Most of the time we prefer the kurtosis number to be between 0 and 3. If your kurtosis is negative, it means that it's a flat curve, or it follows a uniform distribution; or it could be a heavy-tailed distribution with high kurtosis. A very high kurtosis could also be an indication that your data is too perfect, and maybe you need to investigate whether someone has manipulated your data before
handing it over. Our favorite is the AD test, or Anderson-Darling test, where we try to understand if my data is normal or not. The basic null hypothesis whenever I'm doing an AD test is that my data follows a normal distribution. So this is the one test where I want my p-value to be greater than 0.05, so that I fail to reject the null hypothesis, concluding that my data is normal, and I can fall back on my favorite parametric tests, which make it easy for me to do the analysis. But what if, during the AD test, your data analysis shows that the p-value is significant, that it is less than 0.05, maybe 0.02? Then it concludes that my data does not follow a normal distribution, and I need to investigate what type of non-normality it has. Accordingly, I will have to pick the right test and then take it further. We will continue our session next Wednesday. I hope you liked it. If you have any questions, please feel free to comment in the WhatsApp group, in the Telegram channel, or in the comments section over here. If there is any topic which you would like to learn as part of the Wednesday sessions, I would be happy to look into it if you can put those comments in the chat box, in the WhatsApp group, or on Telegram. I really love teaching you, and I thank you for being wonderful students. Take care.
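The rules of thumb from this lesson (skewness beyond plus or minus two, kurtosis far away from the normal range) can be wired into a small screening helper. This is only a rough first check under assumed thresholds, not a substitute for an Anderson-Darling test; the sample numbers are invented for illustration:

```python
def skew_kurtosis(data):
    """Sample skewness and excess kurtosis via simple moment estimators."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3  # roughly 0 for a normal curve
    return skew, excess_kurt

def screen_non_normal(data, skew_limit=2.0, kurt_limit=2.0):
    """Rule-of-thumb screen; the limits are assumptions you can tune.
    Returns a list of reasons; an empty list means no obvious red flag."""
    skew, ekurt = skew_kurtosis(data)
    reasons = []
    if abs(skew) > skew_limit:
        reasons.append(f"skewness {skew:.2f} beyond +/-{skew_limit}")
    if ekurt > kurt_limit:
        reasons.append(f"heavy tails (excess kurtosis {ekurt:.2f})")
    if ekurt < -kurt_limit:
        reasons.append(f"flat/uniform shape (excess kurtosis {ekurt:.2f})")
    return reasons

# Made-up heavy-tailed sample: routine values plus one extreme outlier
sample = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 25.0]
print(screen_non_normal(sample))
```

If the screen raises a flag, the next step is still a formal normality test; with SciPy available, scipy.stats.anderson(sample) runs the Anderson-Darling test discussed above.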
22. Conclusion: I would like to thank you very much for completing the program. It shows that you are highly committed to your learning journey. You want to upskill yourself, and I trust you have learned a lot. I hope all your concepts are also clear. I want to tell you about the other programs that I do on Skillshare. On Skillshare, I have many other programs which are already there, and many more will come up in the future weeks and months. The programs include storytelling with data, how to use analytics, data visualization, predictive analytics without coding, and many more. Apart from this, I also work as a corporate trainer. I ensure that all my programs are highly interactive and keep all the participants very much engaged. I design books which are customized for my workshops, which also ensures that all the concepts are clearly understood by the participants. My games are designed in such a way that the concepts get learned while people play. There are a lot of games which are designed for my programs, and if you are interested, you are free to contact me. I have also done more than 2,000 hours of training in the past two years, during the pandemic. These are just a few of the workshops. So if your organization wants to take up any corporate training program, offline or online, or if you feel that you personally want to upskill yourself, you're free to contact me on my e-mail ID. Stay connected with me on LinkedIn, and if you liked my training, please ensure that you write a review on LinkedIn. I also run a Telegram channel where I put up a lot of questions from which people can learn the concepts; they might take just a few seconds to do. Apart from that, please ensure that you leave a review on Skillshare about how your training experience was. Please do not forget to complete your project. I love people when they are committed, and you have proved that you are one of them. Please stay connected. Stay safe, and God bless you.