Transcripts
1. Introduction: This class shows how statistics is used to hide the truth in order to manipulate our decisions, and how we can detect these tricks to make better decisions. Hello. I'm a cognitive science expert with over 25 years of experience in the US, India, the Middle East, and South Africa. We live in an information age. We need to use data to make every decision. But from marketers, media, and politicians to businesspeople, everyone hides the real truth behind carefully constructed statistics, so that the data they present serves their own purpose or agenda. What can we do? Well, it's pretty simple. We need to learn the statistical tricks people use to fool others so that we can catch them in the act and make better decisions. In this class, you will learn about the ten most common statistical tricks used by marketers, media, spin doctors, politicians, businesses, and other people to hide data in plain sight, so that at the end of this class you'll be able to make far better data-driven decisions in both personal and business situations with confidence. Who can take this class? Pretty much everybody, from beginners to professionals, from business leaders to students. Everybody will benefit from this class. At the end of the class, of course, there is a project, which involves investigating ten practical situations where people generally make wrong decisions due to the statistical tricks discussed in the class. The objective for the students is to first detect the applicable trick and then figure out the correct way to make the decision. I guess that's pretty much it. Looking forward to seeing you in the class, and oh yes, don't forget to put your comments and feedback at the end.
2. Sampling Bias: Sampling bias. As statistics can sometimes be a bit heavy, let's start in a lighter mood with a cartoon. The guy has created a bird feed, which is a bagel, and he says: "After sampling every bird that frequents the sidewalk outside this building, we have concluded that what birds really love is bagels." That means, in order to decide what birds like, all he has done is look at the birds on the sidewalk that were eating bagels, because bagels get thrown out of his building. That is a funny way to look at how sampling can lead to a biased conclusion. Statistics is all about collecting a small number of candidates from a large population — this small number of candidates is called a sample — and deciding about the large population based on checking the sample. If the sample is representative of the population, our conclusions are right. If the sample is biased, as it is in the picture, where there are more green dots relative to red ones in the sample compared to the population, then our conclusion based on the sample is also not valid. A lot of the time, in real-life surveys the conclusions are not valid because the samples are biased. For example, salary surveys of college graduates always show a higher salary than the salary those graduates actually get, because people inflate their salary in the responses. Sometimes, before an election, statisticians predict based on surveys that someone will win, and then someone else wins with a big landslide, like it happened in the 1936 election, because the sampling was biased. In the 2012 election, polls conducted by landline predicted Obama would lose. But Obama won, because many of Obama's supporters were actually using cell phones and therefore were not counted by those surveys. Suppose a survey says that 80% of British people like electric vehicles. The first question you should ask is: 80% of which British people? Because if the sample is biased and comes from one section of the British people, then it is representative of that section rather than the whole British population.
Now, let's take a look at how sampling bias happens. Let's start with the case of Ivy League students' average salaries. A statement published by a journal says the average salary of Ivy League MBAs five years after graduation is $300 thousand. The question is, how accurate is a statement like that? The statement can be very wrong and misleading. Why? Because they don't mention what the sample size was. They don't mention whether they had the contacts of all the graduates to select from. They also don't mention whether the list of candidates who answered was selected randomly, or whether they just took the answers of the people who cared to answer. It also does not say how many of them actually answered the questions. If they asked this question to 50 people and five answered, and they average those five and say it is $300 thousand, that's a heavily biased figure, because those five may not be representative. The next question is: did only people with high salaries answer, which is the most likely case? Was it a representative sample? As you can see, there are so many questions, and most of the surveys carefully avoid answering any one of them. So you cannot blindly trust any salary survey conclusion. People can manipulate the sample, or just be sloppy or not methodically correct while collecting the sample, and as a result, unfortunately, many of these reported results are not trustworthy.
60% of urban Indians prefer engineering education — which 60%? 80% of Indians prefer a certain thing — 80% of which Indians? For the result of a study to be worth anything, the sample must be representative and random, meaning that while collecting the sample, all the candidates should have an equal chance of being selected. Many a time, that is not the case. There is a constant battle to reduce sampling bias by making the sampling random, but some sampling bias is always there. If you see statements like this, that 60% of urban Indians prefer engineering education, you should ask: 60% of which people? It could be 60% of the people they cared to ask, not 60% of all urban Indians.
When random sampling — meaning every candidate has the same chance of being selected — is not possible, or is too time-consuming or expensive, the next best thing is stratified random sampling, where you divide the population into several groups based on the prevalence of each group. So if a population is 50% white, 30% black, and 20% Asian, you collect a certain number of samples from the Asian community, a certain number from the black community, and a certain number from the white community, in proportion to their percentages. That is stratified random sampling. The groups can be based on income, education, ethnicity, or anything else.
or anything else. But if your knowledge of
the proportion is wrong, the sampling, we'd
also be biased. For example, if a
graduate classes 30% of women is 70% men, if your sample sizes
calculated to be ten, you can randomly
select three from the female students and
seven from the students. That wouldn't be a case of
stratified random sampling. That would be somewhat usable. Remember, own polls and started the results
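To make the idea concrete, here is a minimal Python sketch of proportional stratified sampling — my own illustration, not part of the lecture. The 30% / 70% gender split is just the example above, and the per-group quotas are rounded in a deliberately simple way.

```python
import random

def stratified_sample(population, total_n, key):
    """Draw a sample whose group mix matches the population's group mix."""
    strata = {}
    for person in population:                      # split the population into strata
        strata.setdefault(person[key], []).append(person)

    sample = []
    for group, members in strata.items():
        # allocate sample slots in proportion to the group's share of the population
        quota = round(total_n * len(members) / len(population))
        sample.extend(random.sample(members, min(quota, len(members))))
    return sample

# Example: a graduate class that is 30% women and 70% men, sample size 10
population = [{"id": i, "gender": "F" if i < 30 else "M"} for i in range(100)]
print(stratified_sample(population, 10, key="gender"))   # ~3 women and ~7 men
```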
Remember, opinion polls and survey results are always somewhat biased. The bias comes from two different sources. First is the staff doing the interviews. The age, education, ethnicity, and experience of the interviewing staff heavily influence the survey result. In one case, it was found that depending on whether white or black staff were interviewing black communities, the results were quite different. The second source of bias is which people have a higher tendency of being selected as samples or candidates. Generally, people with more money, better education, better appearance, more information and alertness, more conventional behavior, and more acceptable habits get selected more often as candidates for interviews or samples. And as a result, they bias the overall poll and survey results.
3. Sample Size: Sample size. In many cases, people do not use a large enough sample size to make a conclusion. For example, a hair oil advertisement claims that studies show Hair Oil X reduces hair fall for 80% of users. But how credible are these statements? Some advertisers provide fine print that mentions a small sample size, but some don't even do that. Unfortunately, for commercial advertisements there is no regulation that states that a certain sample size and certain methodologies need to be used, like there is for pharmaceutical products. And as a result, people claim all sorts of stuff based on all sorts of sampling. Those results are mostly based on small sample sizes, and hence can be bogus, due to pure chance. Or they may show only the few cases in their sample with good results so that they can make a claim in an advertisement. Yet they are shown all the time. If you toss a coin ten times and get eight heads, you can't claim that a study shows the chance of getting heads in a coin toss is 80%. You need a large number of coin tosses to establish what the chance of getting heads is, which will be 50%. Similarly, in order to establish any claim, you need to have a large enough sample size. The sample size will depend on how variable the outcomes are and how confident you need to be in your claim.
Let's say the chance of severe cases from a viral fever is just 10% among kids, and you have selected 1,000 kids with the viral fever to test the effectiveness of a medicine. Is this a large enough sample size? You were happy that the sample size is 1,000. But in reality, 90% of those kids will get better even without the medicine. You are actually testing the medicine on just 10%, or 100 kids, who need this medicine, because they would be the severe cases. That is actually a much smaller sample size than you thought.
The actual sample size is determined by the confidence interval and standard deviation on a case-by-case basis for the population. A few factors are important in estimating the sample size needed: the confidence level, the confidence interval, the z-score (which depends on the confidence level), and the standard deviation; the sample size is calculated based on these. I will not get into a lot of detail, but it is important for us to be aware that there are certain ways to calculate the sample size. Confidence level is the measure of, if you repeat the experiment 100 times, how many times you will get a similar result; it's generally set at 95% in most cases. Confidence interval is the degree of error; it's generally set at plus or minus 5%. Z-score is a measure of how many standard deviations away from the mean you are; based on the confidence level you've selected, there is a corresponding z-score. The standard deviation is the degree of variability; if you don't know anything, you can use 0.5. Putting all these together, there is a standard calculation: sample size = (z-score squared multiplied by the standard deviation, multiplied by one minus the standard deviation) divided by the confidence interval squared. With a 95% confidence level and a 5% confidence interval, you get a sample size of about 385 for a 0.5 standard deviation. So a sample size of 400 to 500 is generally good enough for most cases.
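As a quick sanity check of that calculation, here is a small Python sketch (my own illustration, not part of the lecture) of the formula n = z² × p × (1 − p) / e², where p plays the role of the 0.5 "standard deviation" and e is the confidence interval.

```python
import math

def sample_size(z=1.96, p=0.5, e=0.05):
    """Sample size needed to estimate a proportion.

    z: z-score for the confidence level (1.96 for 95%)
    p: assumed variability / proportion (0.5 is the most conservative choice)
    e: confidence interval half-width (0.05 means +/-5%)
    """
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

print(sample_size())         # 385  -> the figure quoted above
print(sample_size(e=0.03))   # 1068 -> a tighter interval needs far more people
```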
Determining the sample size is a critical step of any statistical survey or study, because if this is done wrong, the entire study or survey is wrong. There have been some colossal cases of this. In the 1950s, in a polio vaccine study, 450 children were vaccinated and 680 were kept without the vaccine as part of the study during a polio outbreak. None of the kids contracted polio, because the rate of contraction was less than 1%. Actually, the needed sample size was between 30 thousand and 40 thousand kids, so that they could expect 300 to 400 cases of polio — that would have been an effective sample size to test the vaccine. They were way, way wrong: they selected 450 and 680, around 1,100 kids, where they needed 30 thousand to 40 thousand kids.
So estimating the sample size and using the right value of sample size is of extreme importance for any statistical study. But unfortunately, in most of the claims that we see from advertisers for different products, or even from politicians, media, and whatnot, almost all of these guys don't use an adequate sample size, and as a result, none of those claims are legitimate.
4. Small Numbers: Small numbers. Let's start with a quiz. One town has two hospitals, one small and one large. In one of the two hospitals, in one month, 60% of the babies born were boys, whereas typically it is 50%. Which one is it, the small or the large hospital? The answer is the small hospital. Because, just as with a small sample size, extreme or unusual cases — in this case 60% boys is unusual, 50% is the usual — are seen in small samples, or in this case the small hospital. While investigating cases of kidney cancer in over 3,000 US counties, it was found that the highest incidence came from small counties. Interestingly, it was also found that the lowest incidence also came from small counties. That means small, less populated counties had both the highest and the lowest incidence. That's the case of small numbers. And why does it happen? Let's look at a really small county with just ten adults. As there are just ten adults, it's highly likely that none of them has cancer, so it will show 0%. But if just one person gets cancer by any chance, it will show a 10% incidence, which is very high. That means if a county is small, chances are that it will be spared of many cases, and if it does have a few, it will show a high incidence. So small counties will show both high and low incidence. Small districts with low population will always show extreme results, whereas in larger districts the cases across the entire population level out to show more average results. One supermarket chain found out that its smaller stores showed both the lowest and the highest footfall per unit area and sales per unit area. A smaller sample is more likely to show extreme results due to randomness, and a larger sample better represents the population. That's why we hear of something called the law of large numbers in statistics. In order to make any conclusion, you can depend only on large numbers, and with large numbers things tend to average out.
Now, how do things get extreme results in the case of small numbers? Let's take another example. Let's say the students of a school are seated in 100 rooms with five students in each. Out of the 500 total students, 25 got an A+ in their exam. That means many of the rooms have 0% students with an A+, because there are only 25 A+ students and 100 rooms. But if a room has just one A+ student, it will show 20% A+ (one student divided by five, that is 20%). If a room by any chance has two A+ students, it will show 40% A+. You can see that because the rooms are small, any figure gets exaggerated. That is how small numbers exaggerate results, and that is why the law of large numbers needs large numbers to make a conclusion.
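If you want to see the small-numbers effect for yourself, here is a short simulation — my own sketch, not from the lecture — where every birth is a boy with probability 0.5; small "hospitals" report a 60%-boys month far more often than large ones.

```python
import random

def share_of_extreme_months(births_per_month, n_months=10_000, threshold=0.60):
    """Fraction of simulated months in which boys make up at least `threshold` of births."""
    extreme = 0
    for _ in range(n_months):
        boys = sum(random.random() < 0.5 for _ in range(births_per_month))
        if boys / births_per_month >= threshold:
            extreme += 1
    return extreme / n_months

# Small hospital: ~15 births a month; large hospital: ~150 births a month
print(share_of_extreme_months(15))    # roughly 0.30 - small samples hit 60% boys often
print(share_of_extreme_months(150))   # roughly 0.01 - large samples almost never do
```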
5. Average Problem: Average problem. One of the most common ways statistics is used is as a summary of lots of data, and the average is probably the most commonly used summary statistic. We like to summarize a lot of data into one figure, and that's the popularity of the average. We hear statements like: the economy will create 20 thousand jobs per month (that's an average figure), the country will get 20 inches of rainfall (another average), or salaries will increase by an average of 10% per year, and so forth. But all these average figures can be really misleading, and they may hide more information than they show. For example, some months may add 50 thousand jobs whereas some other months may add just 10 thousand jobs, so a summary of 20 thousand doesn't mean much. Similarly, some parts of the country may get flooded whereas there could be drought in some other parts, so an average rainfall of 20 inches in a country does not mean anything. Similarly, some people's salary may double whereas some may not see an increase at all, so an average salary increase of 10% is absolutely useless information as far as understanding the real increase in people's salaries is concerned. It's said: don't cross a river that is on average four feet deep. Because the river can be 20 feet deep at some places and just one or two feet deep at some other places. So if you try to cross the river thinking it'll be four feet deep on average, you may drown.
One of the dramatic cases of the problem of using averages was what scientists found during the earlier years of tracking global warming. At that point, scientists were looking mainly at the average increase in temperature and trying to correlate that with the melting of polar ice caps. They were surprised to find that even when the average temperature remained the same, more and more polar ice was melting year over year. Later on, they found that even if the average remains the same, years with high peaks in temperature cause more ice to melt. The average was not of much use. Similarly, just look at two cases where one person drinks one beer every day for 30 days — that is an average of one beer a day — whereas another person does not drink any beer for 29 days and on the 30th day drinks 30 beers. In both cases the average is one beer per day, but the impact of the drinking will be much more severe, possibly fatal, in the second case of drinking 30 beers in a day. So an average can be really misleading. Another famous case of the problem with averages was the 1997 flood of Grand Forks. The city had a 51-foot levee, higher than the 49-foot forecasted river water level, which was an average. But at some places during the flood, the water reached 53 feet even though the average remained around the forecast. At those places the water breached the levee, causing flooding of the city. Depending on averages can be misleading in many other ways.
Let's take the case of average life expectancy. For example, in the 1900s, average life expectancy at birth was around 32 years. But this 32 years doesn't really mean anything. It doesn't mean that everybody died at 32. Many died at birth, many died below the age of five, many died waging war. But those who survived all of that lived to be in their sixties and seventies, just like today. The average came out to be 32, but the figure does not really mean anything. That's why there is no average company, no average stock market fluctuation, no average war, no average epidemic, no average marketing campaign, no average book sale. In most cases the distribution can be very skewed, and an average won't mean anything. For example, on the left side you see a negative skew of the distribution, where the average is pulled to the left, and on the right side you can see a positive skew, where the average is pulled to the right because of the tail towards the right. Only in the case of the central, normal distribution is the average in the center, and so it is somewhat meaningful. But in almost all real cases, distributions of values are skewed, and as a result the average does not convey much useful meaning.
In the context of statistical summary, three types of averages are used. They are called mean, median, and mode. The mean is the simple arithmetic average. Let's take the example shown in the table to your left. Let's say in a room there are 20 people, and their salaries are shown in the columns. Their salaries vary between $20 thousand a year and $3 million a year. The mean salary is the arithmetic average, which comes out to be $248,750. Now, the median is the salary in the middle when they are all arranged in ascending or descending order; in this case it comes out to be $56,500. The mode is the salary figure that occurs the most; in this case, $60 thousand occurs twice versus everything else occurring once, so $60 thousand is the mode. As you can see, when a few outliers can change the mean, the median is a better summary. In this case, among 20 people only two have a very high salary, $1 million and $3 million; the rest of them are between $20,000 and $95,000. These two high values are skewing the mean. So if somebody says the average salary of the people in the room is $248 thousand, that does not reflect the true nature of the salaries of the people in the room. Compared to that, the median, which is $56 thousand, is a much better representation. In all cases where a few outliers can heavily skew the average, the median is a better estimate of the typical value. The mode in this case is also somewhat useful, but in many cases a value may not occur more than once, so the mode may not be usable. The median is a better estimate in many cases that are heavily influenced by high or low outliers. But in practical cases where the distribution is heavily skewed to one side or the other, none of the averages — mean, median, or mode — can be relied on by itself.
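Here is a minimal Python sketch of that mean/median/mode comparison. The salary list is hypothetical — the lecture's exact 20 figures aren't given — but it is chosen to reproduce the same mean ($248,750), median ($56,500), and mode ($60 thousand) quoted above.

```python
from statistics import mean, median, mode

# 18 hypothetical salaries between $20k and $81k, plus two outliers at $1M and $3M
salaries = [20_000, 28_000, 35_000, 40_000, 44_000, 48_000, 50_000, 52_000,
            54_000, 55_000, 58_000, 60_000, 60_000, 65_000, 70_000, 75_000,
            80_000, 81_000, 1_000_000, 3_000_000]

print(mean(salaries))    # 248750 -> dragged far up by the two outliers
print(median(salaries))  # 56500  -> a much better picture of a "typical" salary
print(mode(salaries))    # 60000  -> the only value that occurs twice
```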
Let's take the example of cancer. A lot of the time a doctor will tell a cancer patient that he has just 12 months or 18 months to live. The patient should not lose heart, because that 12 or 18 months could be a mean or median value, and it can be heavily skewed, because there could be a lot of people dying immediately or within a few months. If the patient survives the first few months, he or she can expect to live a much longer life, because even the mode can be shifted far towards the right side, towards a higher number of months or years. Look at the case of distribution C, where both the mean and median are towards the left, but the actual number of people living much longer than the mean and median is towards the right. So one can live even 60 or 70 more years after being diagnosed with cancer, and that has happened. Let us take another example. Let's say 80% of companies have $90 million of sales and just 20% of companies have $60 million of sales. If you calculate the average, you will see it comes out to be $84 million. But in this case, 80% of the companies are above average, which is kind of difficult to comprehend — it's counterintuitive. In a lot of cases, averages provide counterintuitive, and sometimes intentionally perplexing, results, and people end up making wrong decisions by using averages. So always look at the distribution and make a decision based on that. Never decide based on any average figure alone.
Average calculations sometimes lead to very confusing and misleading information. For example, the average child does not come from the average family, because the calculation of the average child and the calculation of the average family are not the same. To illustrate this, let's say there is a village with 100 families. The number of kids per family is as shown in the table: the 100 families are divided into five groups of 20 families each. The first group has 0 kids per family, the second group has one kid, the third group has two kids, the fourth group has three kids, and the fifth group has four kids per family. If the question is, how many kids does the average family have, you calculate the total number of kids by multiplying the left column by the right column, which gives 200 kids, and divide it by 100 families. That gives the average number of kids per family: 200 divided by 100, which is two. But if the question is, the average kid lives in a family with how many kids, the answer is not two. That means, although the average family has two kids, the average kid doesn't live in a two-kid family. It is confusing, but let's look at the kids. There are no kids in the first group, because it has 0 kids per family. In the second group there are a total of 20 kids, and they live in one-kid families. In the third group there are 40 kids, and they live in two-kid families. In the fourth group there are 60 kids (20 multiplied by three), and all of these 60 kids live in three-kid families. In the final group there are 80 kids, because 20 families multiplied by four kids is 80 kids, and all of these 80 kids live in four-kid families. So to calculate the average number of kids that an average kid lives with, you multiply 20 by one (because 20 kids live in one-kid families), 40 by two, 60 by three, and 80 by four, and divide by the total number of kids, which is 200. You get three. That means the average kid lives in a three-kid family, whereas the average family has just two kids.
Similarly, the average employee doesn't work in an average company; it will be skewed towards the larger companies, just as here you see the average kid is skewed towards the larger families with more kids. Similarly, the average citizen doesn't live in an average country; the average citizen will be skewed towards high-population countries like India, China, and the US. That's how averages can be really confusing. Let's look at another example to show how the average student does not go to the average college, because the average student and the average college are not calculated in the same way. Take four colleges with students as per the table. College one is small and has 1,000 students, college two has 3,000 students, college three has 10,000 students, and college four has 30,000 students. There are 44,000 students in total. The first question is: what's the average number of students per college? Four colleges, a total of 44,000 students, so the average number of students per college is 44,000 divided by four, that is 11,000. The second question is: the average student goes to a college with how many students? This calculation is very different, because in college one, 1,000 students go to a college with 1,000 students; in college two, 3,000 students go to a college with 3,000 students; in college three, 10,000 students go to a college with 10,000 students; and in college four, 30,000 students go to a college with 30,000 students. So the average student goes to a college with (30,000 squared plus 10,000 squared plus 3,000 squared plus 1,000 squared) divided by 44,000 students, which comes to about 22,954. So even though the average college has just 11,000 students, the average student goes to a college of about 23,000 students. Similarly, the average population per country is 39 million, because there are around 7.8 billion people and around 200 countries: divide 7.8 billion by 200 and you come to around 39 million people per country. But the average citizen lives in a country with many more people, almost ten times more. That's how the average country and the average citizen are not the same thing, and the average student and the average college are not the same thing. Often these figures are intentionally used by different parties, and they depict whichever average figure serves their purpose.
6. Simpson's Paradox: Simpson's paradox. Between 1963 and the 1980s, the average verbal and math SAT scores of all US students came down by about 50 and 40 points respectively. These are big drops. This led to a nationwide outrage, and a presidential panel was formed. They investigated the results and published an infamous report called "A Nation at Risk". This led to subsequent drastic measures: a complete overhaul of the education system, firing a lot of teachers, and so forth. However, later investigation found, interestingly, that when the students were broken down into individual income groups, all the groups saw an increase in SAT score between 1963 and 1980, although the group of all students put together saw a decrease. This is an example of Simpson's paradox, and it is seen in lots of different cases.
To illustrate the point and show how it happens, let's say that in the years 1963 and 1980 all the students could be divided into three main income groups: poor, middle-class, and rich. In 1963, overwhelmingly 70% of the students came from rich families, and just 20% and 10% came from middle-class and poor families. Their average scores in math were, respectively, 550 for rich, 500 for middle-class, and 400 for poor, and the overall average was 525. This mix completely changed by 1980, because between 1963 and 1980 the total number of students increased many fold, and all the increase came from poor and middle-class students. Now they constituted 40% each of all the students, which reduced the proportion of rich students to just 20%. And all three groups' math SAT scores were higher: poor students' score was 420, higher by 20 points; middle-class students' score was 520, higher by 20 points; and rich students' score was 560, also higher, by 10 points. However, the overall average in 1980 was just 488, because the total mix is now different, with more poor students and a smaller proportion of rich students. The overall average came down from 525 in 1963 to 488 in 1980, even though all the individual groups saw an increase in average math SAT score. That's what is called Simpson's paradox.
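Here is a quick Python check of those SAT numbers (my own sketch): every income group improves, yet the weighted overall average falls, because the mix of students changes.

```python
def overall_average(mix):
    """Weighted SAT average, given {group: (share_of_students, avg_score)}."""
    return sum(share * score for share, score in mix.values())

sat_1963 = {"poor": (0.10, 400), "middle": (0.20, 500), "rich": (0.70, 550)}
sat_1980 = {"poor": (0.40, 420), "middle": (0.40, 520), "rich": (0.20, 560)}

print(overall_average(sat_1963))   # 525.0 -> higher overall, though every group scored lower
print(overall_average(sat_1980))   # 488.0 -> lower overall, though every group scored higher
```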
Simpson's paradox is named after the mathematician Edward Simpson, who described it in 1951. It says that sometimes a result or trend is seen when data is segregated into different groups, but when all these groups are put together, that result or trend reverses or disappears. Let's look at the two charts to your left. In the upper chart, you see the whole data together as gray dots. If you look at it, you will see an increasing trend from left to right. When you divide these gray dots into four different colors — green, yellow, blue, and red — you see that for the green it is a decreasing trend, for the yellow it's a decreasing trend from left to right, for the blue it's again a decreasing trend, and for the red it is a slightly increasing trend, but not as dramatic as for the gray dots. Basically, all the individual groups show a different trend compared to all the dots when they are put together as one aggregated group. That is what is called Simpson's paradox. Simpson's paradox is present in many areas and it confuses decision-making.
Let's take a look at a few other examples. Again, this is a real-life example. In a university PhD program, it was found that 50% of male applicants were getting accepted while just 38% of women applicants were getting accepted. That led to a lot of debate and discussion about there being a bias against women. But further investigation showed a different trend when all the PhD programs were divided into natural science PhDs and social science PhDs. It was found that for both these types of PhD programs, more women were getting accepted than men, as a percentage of applicants. But women predominantly applied for the social science PhD programs, which had a much lower acceptance rate than the natural science programs for both male and female students, and that dragged down the aggregate-level acceptance for women. How does that happen? Let's take a look at the table. 80% of the male students were applying for natural science and 20% for social science, whereas for women, 30% were applying for natural science and overwhelmingly 70% were applying for social science PhD programs. Now for men, the acceptance level for natural science was 60%, and for women it was 80% — women had a higher acceptance level for natural science PhD programs. Similarly, for social science PhD programs, men had a 10% acceptance level and women had a 20% acceptance level, double. But as far more women applied for the social science programs, which had a much lower acceptance rate compared to natural science programs, the overall acceptance for men was 50% and for women was 38%. Again, when looked at individually it's one trend; when looked at in aggregate it's a different trend, and it can seem puzzling. So every time you see a data trend while looking at lots of data together, just remember that if there is a way to divide all the data into meaningful groups, the individual groups may show a completely different trend, even an opposite trend, compared to the whole aggregated data. So you have to be careful.
the table to your left, 800 and say the light split AC, that is 80% people
liked split ac. Similarly, 1 thousand customers who purchased Windows ISE, they checked and
seventy-five percent of people said they
liked window AC. So they concluded that
split MAC is liked more. The thing reversed when the customers were divided
into men and women. So out of 1 thousand
split AC, 900, where main customers, 100
where women customer. Out of 900, male customers, 750 said they liked split AC, so they do the 3%. And among 100 women customers, 50 Said the light split
is it that is 50%. But for window AC, out of total 1800
million customers, out of eight hundred and
seven hundred and fifty said they liked window
AC, so that is 88%. Two 100 was women customers
who bought window and C and 150 light window
and see that it's 75%. As you can see, that 83% of men liked split AAC compared to 88% of men who liked
window and see, when you look at just
the male customers, more customers liked. Window is similarly when you are looking at just the
female customers, 50% like split SE and seventy-five percent
light window SE. So again, more customers, more female customers, Mike, window AC, but more
female customers like window I see in both male customers
like window and see. But when you weren't
putting both male and female customers together, at the aggregate level, you are seeing a
different trend of more people liking,
liking split. So that's another example
Now we will look at a couple of case studies in business to see how Simpson's paradox can be identified as a potential problem and what to do about it. In a market research study, data showed that between the year 2000 and the year 2015, the average spending per household on a certain product category decreased from $120 in 2000 to $95 in 2015. Now the question is: is this a downward trend? Should you discontinue that product line based on this trend? The answer is that just by looking at this overall trend of all the data, you cannot make that decision. This is a manifestation of the Simpson's paradox problem that we have just discussed. To evaluate whether the product line should be discontinued, you have to make a decision based on potential subgroups. So you have to investigate by dividing the total group into some meaningful subgroups, which may behave differently, and check whether the same trend is being seen for each subgroup individually. The subgroups could be based on income groups — high-income, middle-income, and low-income families: is it the same trend in all these groups? They could be based on male and female customers, or on location, or anything else that could be meaningful for the data. If all these individual groups are also showing a decrease in spending, then you can make the decision. But chances are that there could be some other factor contributing to the overall group versus the individual groups showing a different trend. If that is the case, then you should be continuing with this product line. Let's take a look at another case study, which is again a common problem. House sale data between 2014 and 2019 showed that the price per square meter of real estate came down by 10% in a city. Is this a downward real estate trend? Again, this is a Simpson's paradox problem, because the real estate market can be divided, segmented, and grouped in many ways. To decide whether it's a downward trend, you have to look at the subgroups. You have to investigate by dividing the entire group into different subgroups — they could be based on locations, to see whether you are seeing the same trend in all locations, or on the types of houses, or the sizes of houses. Chances are you would see that individual groups are showing a different trend; maybe some are showing an increase in price. Or perhaps some other factor is at work — maybe one type of housing is overwhelmingly being sold in very large quantities, and as a result that is changing the trend for the whole group, whereas the individual groups could be showing a different trend, just like the way we saw for poor, middle-income, and rich students' SAT scores. Only after investigating the trend in the subgroups can you conclude whether this trend of decreasing real estate prices is real. Similarly, there are many such cases. Every time you see a group of data points, try to divide it into different groups and see if those groups are showing the same trend, because you have to watch out for the problem of Simpson's paradox.
7. Cause Effect Correlation: Cause, effect, and correlation. When two events happen together, and a positive or negative change in one event is associated with a similar change in the other event, they are said to be positively correlated. When this relationship is opposite, they are said to be negatively correlated. There are many cases where things are positively correlated. For example, there is about a 0.3 correlation — a correlation coefficient is calculated in a certain way, which we won't get into — that is, a certain level of positive correlation, between college GPA and subsequent income. There is a higher level of correlation between IQ and average job performance. There's a much higher correlation, around 0.7, between height and weight: taller people are heavier. There is an even higher correlation between the math SAT scores of two consecutive years. So these are cases where things are positively correlated. Now, there are two common problems and fallacies in decision-making related to correlation. One is called the post hoc fallacy: when a second event happens after a first event, they are assumed to be related, and it is said that the first event is causing the second, which may not be the case. Another is "cum hoc, ergo propter hoc", which is assuming that when two events happen together, one must be because of the other — which may not be the case, because correlation does not mean causation. That two events are correlated doesn't mean one is causing the other. False causality — thinking that one event is causing another when it isn't — is the mother of all statistical fallacies. Two events happen at the same time and are therefore assumed to be related, but correlation does not mean causation: if two events happen at the same time, it does not mean one is causing the other. Every morning the rooster crows and the sun rises; that does not mean the rooster causes the sunrise, it just happens at the same time. The rooster anticipates the sunrise and crows, and the sunrise happens: they are correlated, but one is not causing the other.
There's something called spurious correlation, where by chance two events are correlated but there is no relationship whatsoever. Let's take a few examples. If you plot iPhone sales in the US and deaths caused by falls down the stairs, you will see a very high level of positive correlation, because both of these increased over the years. But, as we know, there is absolutely no actual causality, no relationship, between them. Similarly, if you plot per capita consumption of high-fructose corn syrup, which has dropped, against spending on spectator sports, you will see one is decreasing while the other is increasing — just two events with a spurious correlation. Similarly, if you look at visitors to Universal Studios Orlando and sales of new cars, you'd see both coming down, but again, they're just both coming down; any relation between them is a spurious association that is completely coincidental. This happens a lot of the time, and these are extreme examples; when two events are somewhat related, one can falsely conclude, or one can manipulate others into believing, that they are related and hence that one is causing the other. You have to watch out for that.
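As an illustration of how easily two unrelated but trending series look "correlated", here is a small sketch (mine, with made-up numbers) computing a plain Pearson correlation.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two made-up yearly series that merely both drift upward over time
iphone_sales = [10, 14, 19, 25, 33, 40, 47, 55]              # arbitrary units
stair_fall_deaths = [1.1, 1.2, 1.3, 1.3, 1.5, 1.6, 1.7, 1.9]

print(pearson(iphone_sales, stair_fall_deaths))   # close to 1.0 - and yet no causation
```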
and causation. So if something is causing
something, It's for example, often you will see
people telling you that 20 people with
flu took medicine and got well after
five days compared to just ten people who got well
without taking medicines. So therefore the
medicine is infected. But in order to check whether medicine is effective or not, you have to complete
the full form table. One side, you have to look at people who took Madison versus
who didn't take medicine. And then the other side, you have to look at people who
got better after five days versus didn't get
better after five days. If total 30 people to
medicine got better. And Dan didn't get better. And if total 15 people
took medicine and ten God bitter and didn't get better than you cannot
conclude anything. Because the relation between
201010 to five is the same. But if we had our perjury
People who didn't take Madison and then got better and 20 didn't get
better than you can say, okay, medicine is helping. So similarly, you can
create different for, for table to look for additional information that you can plug them in in
order to make decisions. Sometimes the causality
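Here is a minimal sketch (mine, using the hypothetical counts above) of what completing the fourfold table does: you compare recovery rates between the treated and untreated groups instead of just counting people who got better.

```python
def recovery_rate(better, not_better):
    return better / (better + not_better)

# Case 1: 30 took the medicine (20 better / 10 not), 15 did not (10 better / 5 not)
print(recovery_rate(20, 10), recovery_rate(10, 5))    # 0.67 vs 0.67 -> no evidence it helps

# Case 2: the untreated group has 30 people and only 10 of them got better
print(recovery_rate(20, 10), recovery_rate(10, 20))   # 0.67 vs 0.33 -> the medicine looks helpful
```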
Sometimes the causality could be in reverse. If a report says that companies with more women on their boards are more profitable or larger in size, it may mean that companies with more women are more profitable, or it may also mean that more profitable companies are hiring more women — which one it is, is not really clear. Similarly, top colleges claim that their students perform better. Now, whether the students perform better because of the colleges, or the better students are going to the top colleges, is not very clear. Here the causality could be reversed. In two back-to-back events, the former is often seen to cause the latter, which may not be the case and which is often not detected. This fallacy is called the post hoc fallacy; it's akin to the correlation fallacy. There's the joke where an old man says his doctor asked him to quit smoking, and another old man says, "Oh, don't do that. A couple of my friends just quit smoking and they died." This is a funny way of putting it: the dying had nothing to do with quitting smoking, but it happened after the quitting of smoking.
event that happens with the famous case of our
balanced hologram bracelet became a global ridge
between 200712, millions of dollars were
spent on it as people, including sportsmen,
toward the product was enhancing their
athletic ability. Later on, randomized
double-blind trials found out that there was no
relationship whatsoever between wearing those bracelets and athletic abilities in
all athletes were training, people were trying to lose
weight and get better, more athletic, and there
were getting better. That had nothing to do with
wearing this wrestling. But it happened because people thought the award those breast
lead and that God better. So it was Paul's
causality, correlation. Another case of two
events happening at the same time being part
as one causing another. In 1990 to 30 American teenagers who frequently played a certain video game
committed suicide. This created a nationwide alarm, thinking that does video games
were causing the suicide. But further
investigation found that teen suicide rate
at that time was 12 per 100 thousand as trim million played
the game regularly. So 360 people where statistically prone
to committing suicide among people
who played the game. Tertius re-size was
not a big number. That kind of leads
somebody to think that the suicide was
due to the gate. Again, a false causality because both these events
happen at the same time. All kinds of companies
claim all kinds of correlation between their
broad accident improvement, but that those are not decades. The company says it's
shampoo leads to take head. However, reality could be that people would
pick her hair using the shampoo into first suggests showing the
reverse causality. Diet product claims how
it's used Lauren the wet, but people using the
product would be doing ten other things like
exercising, eating less. Those are causing the weight
loss and not bad product. Beer company claim having
beer improve her disease. But there could be
no relationship whatsoever because more
and more healthy people could be drinking beer, so they are already
having bitter hard. But those causality that
they are claiming because it could be just a
spurious insanity. Another important
factor related to correlation is that
affinity is wrongly assumed that the
correlation pharmacists beyond the data with which
it has been established. For example, with
increasing rain, corn grows taller, but beyond a certain point,
it becomes shorter. Again. Save the data is up to a
certain inches of rain, then that should end. But it, because if it floods, everything may get destroyed. Similarly, education is a positive correlation
with earning, but up to a certain point, maybe up to master's
degrees because PhDs, unless this correlation
is not 5-bit Judy. Similarly, people's
happiness miss him to increase with the
increase in money. But up to a certain point. And after that it may not change much or it may
have been decreased. That is the case then
the richest people who would have been the happiest
people in this world. But that's not the
case at the same time, if it's somebody's report, the mean, I'd be really happy. So initially with
increasing money, happiness made crazy indeed, people found that out. But after a certain level that changed in 19,
pretty significant. Again, the relationship
or correlation or causality may exist up to a certain point which
are checked beyond that, it shouldn't be increased
to five maturity.
8. Statisculation: Statisculation. Statistical figures are used by journalists, advertisers, so-called experts, salesmen, and others, often to misinform and manipulate people. This process is called statisculation. Here's an example. You will see advertisers in stores announcing a "50% + 20%" sale on clothes. Many may think it's a total 70% discount. But in reality it's first 50% off, then another 20% off the remaining 50%; so the total discount is 50% plus 20% of 50%, that is 10%, for a total of 60%. If they said "total 60% sale", that would create less impact. That's why they say 50% plus 20%. You will see this all the time. These are tricks to manipulate people. Similarly, journalists will report that when there is a one-day shutdown in a city or a state, there is such-and-such a huge economic loss. Invariably, these figures are calculated based on all possible economic activities that could potentially happen in a day, which never happens. More importantly, when one day is shut down, most of those economic activities are simply transferred to the next day; the actual impact is much less. But journalists would like to get your interest in their news. That's why they give such a high figure, so that more people read the news. News is a big business; they have no intention of just giving you information.
People often add up percentages to highly exaggerate changes. For example, they say the cost of material has gone up by 5%, labor by 7%, utilities by 13%, and transportation by 15%, so the total cost is up 40%. It's bizarre — you would think people don't do that, but people do do this. This is another case of statisculation. If you break up the cost, you will see it. For example, let's say in the past 60% of the cost was material, 30% labor, 5% utilities, and 5% transportation. Then those 5%, 7%, 13%, and 15% increases basically raise the overall cost by 6.5%, a paltry amount. But very few people will see through it and realize that the total increase in cost is just 6.5%.
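Here is a tiny Python check of that claim (my own sketch), weighting each component's increase by its share of the total cost.

```python
# (share_of_total_cost, percent_increase) for each cost component
components = {
    "material":       (0.60,  5),
    "labor":          (0.30,  7),
    "utilities":      (0.05, 13),
    "transportation": (0.05, 15),
}

naive_total = sum(pct for _, pct in components.values())
weighted_total = sum(share * pct for share, pct in components.values())

print(naive_total)               # 40  -> the scary number in the headline
print(round(weighted_total, 1))  # 6.5 -> the real overall cost increase
```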
Companies, especially consulting companies and services companies, often say that they have more than 200 years of combined experience in their management. Twenty people with ten years of experience each becomes 200 years of combined experience. That means nothing. It is just statisculation, showing an artificial, completely arbitrary figure to give the impression that the company has a lot of experience. If a company cuts employee salaries by 25% one year and increases them by 25% the next year, it can claim that it has given people their salary back, but actually the salary is still down by 6.25%. Again, that's another way to statisculate. Say the net profit of a company increased from 2% last year to 4% this year. One can claim that the profit has increased by two percentage points, but that would not sound good. The better way to claim it is that the profit rate has increased by 100%, from 2% to 4%. That gives a much bigger impact, and people do do this. Another way to statisculate is to use a different sort of calculation to arrive at a figure. For example, household income can be calculated by dividing the total income by the number of households — that is one way — or by multiplying the average per capita income by the average number of people in a household. The second figure will always come out higher, and some people do use it. So if you want to show a low household income, use the first calculation; if you want to show a high household income, use the second. Both are "legitimate". In 2021, the median household income in the US was $68 thousand. In the same year, the median per capita income was $37 thousand, and the average household size was 2.6. If you multiply 37 by 2.6, you get $96 thousand, which is much higher than the $68 thousand. If averages are used instead of medians, the household income will go up even higher. How you calculate determines the household income, and people with different agendas will do different calculations.
Let's take this example of how a consumer price index can be calculated differently to give different figures. Say milk and bread cost $2.00 and $0.50 last year, and this year both cost $1.00. Different calculations can show a different price index, and depending on your agenda you will calculate it one way or the other. When last year is used as the base year, milk has become 50% of last year's cost and bread has become 200% of last year's cost, so the average is 125% — that is, a 25% increase in prices. Again, if you use the current year as the base year, which is also a legitimate method, milk last year was 200% and bread was 50% of this year's prices, so last year's average price level was 125% while this year, the base year, is 100%. This calculation will show that last year's prices were 25% higher — that prices have come down. And if you use the geometric mean, it will show no change at all. So based on how you calculate the price index, one method will show an increase in prices, another may show a decrease in prices, and some calculation may show no change, although the underlying prices are exactly the same in all three cases.
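Here is a short Python sketch of those three calculations (my own; the dollar figures are the reconstructed illustration above — any pair of items where one price halves and the other doubles behaves the same way).

```python
last_year = {"milk": 2.00, "bread": 0.50}
this_year = {"milk": 1.00, "bread": 1.00}

ratios = [this_year[item] / last_year[item] for item in last_year]    # [0.5, 2.0]

index_last_year_base = sum(ratios) / len(ratios)                 # 1.25 -> "prices rose 25%"
index_this_year_base = sum(1 / r for r in ratios) / len(ratios)  # 1.25 -> "last year was 25% higher"
geometric_mean_index = (ratios[0] * ratios[1]) ** 0.5            # 1.0  -> "no change at all"

print(index_last_year_base, index_this_year_base, geometric_mean_index)
```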
Percentile figures are another interesting thing. When examining percentile figures — for example, ranks of students in a class of 300 — the 99th percentile is the top three students (that is, 1% of 300 students), the 98th percentile is the next three students, and so on. There is a great difference between the 99th percentile and the 90th percentile, as you can see from the normal distribution, which is generally the case for a large group. But as the ranks get closer to the middle, there is hardly any difference between, say, the 40th and 60th percentiles. As you can see, at the extreme ends the percentiles are stretched: there is a big difference between the 90th and 99th percentiles, whereas the difference between the 40th and 60th is much smaller. So a comparison may say, well, you only moved from the 40th to the 60th percentile, whereas they moved from the 90th to the 99th — and in fact the 90-to-99 move is much bigger. Similarly, a supplier may say that he has increased his price by 20% only because his cost has increased by 20%, but in the process he has nicely increased his absolute dollar margin. By proportionately increasing the price based on cost, one can nicely increase one's actual dollars. These are all different ways to statisculate.
9. Half Relevant Data: Half relevant data. When people can't prove something directly, they use half relevant data as a proxy to prove it. Watch out. It's difficult to prove the effectiveness of a toothpaste, so companies show how their products destroy germs under laboratory conditions. These two conditions are not the same: what works in the lab may not work inside our mouth. Also, often it's not clear what sort of bacteria they are killing in the lab. Are these the same type of bacteria that are there in our mouth? You'd be surprised. It's also not clear what kind of dosage they are using in the lab — is it a full tube just to kill a few bacteria? None of this information is very clear. But even then, they claim in their advertisements that their products are very effective in killing bacteria in our mouth. That's the use of half relevant data. A refrigerator company may advertise that it keeps produce 50% fresher. You see it all the time — but fresher compared to what? Another refrigerator? Storing food in a cool, dry shade? Or just leaving food out in summer heat and humidity? You'd be surprised what this 50% is calculated based on. A labor union says 85% of employees are unhappy with the management. They are using half relevant data. To make a strong case, the union will often collect all possible complaints, including very trivial ones such as "the lights are not working and need to be replaced", and so on, and add them up to say 85% of employees are complaining against the management, which is not true. 100% of a country's citizens complain about something over a period of time, but it is unjustifiable to conclude that 100% of citizens are unhappy with their government. Again, that would be using half relevant data.
Sometimes half relevant data becomes extremely prominent. The Spanish flu is an example. The 1918 H1N1 outbreak is called the Spanish flu. Interestingly, the flu neither originated in Spain nor was it more prevalent in Spain. It was only because it was covered more by Spanish newspapers — Spain was not directly engaged in World War One, while all the other countries in Europe were waging war — that, because it was so visible in Spanish newspapers, the flu came to be called Spanish. Governments in all countries are notorious users of half relevant data. Here's an example. During the 1898 Spanish-American War, the death rate in the Navy was 9 per thousand. The death rate in New York at that time was 16 per thousand. Navy recruiters used these figures to compare and claim that it was safer to be in the Navy than outside it, and misled people into joining the Navy. Obviously, the death rate of the old and sick folks of New York was being compared with that of young and strong people, who were hardly dying at all. These are not comparable; these figures are not relevant at all.
Often half relevant data influences decision-making. For example, 1952 is seen as the year of the worst polio epidemic in the US. Investigation found that due to more awareness, more cases were diagnosed and reported, and more people came forward due to federal financial aid. Moreover, that year had more children in the susceptible age group. All these factors contributed to a higher number of reported cases, but the death figures, which are a better measure, were not out of the ordinary. 1952 became infamous as the year of the polio epidemic in the US, but this was all based on half relevant data. In reality, it was just like districts that record and report more crime being seen as more crime-prone, only because they are reporting more compared to some other districts that are not reporting as much.
Politicians in all countries are the most prolific users of half relevant data. They rummage through all the data to find statistics that indirectly suit their narratives. Before coming to power, some districts had low income and some had high income. After they came to power, some districts still had high income and some had low income. Politicians will pick a district with low income from before they came to power and a district with high income from after, to show that their governance increased income, whereas the opposition will do the opposite: they will pick a district with high income from before and one with low income from after, to show the government caused income to drop. That is how people use half relevant data to make whatever point they want. It is not always possible to look through all the data. Just be aware, though, that in all data presented in the media, in advertisements, and everywhere else people are trying to influence our decisions, chances are they're using half relevant data to do it. We're all familiar with before-and-after pictures. Before-and-after pictures show how some pill helped somebody lose weight. But advertisers don't mention all the other things the person is doing to lose the weight. It's not just the pills that are causing this loss of weight, and that is a use of half relevant data. The weight loss is not fully attributable to using the product; the person could even be losing the weight without the product if they are exercising, eating less, and keeping a healthy lifestyle. That is how half relevant data is used in advertisements and elsewhere to make a point and to influence our decisions.
10. Check The Scale: Check the scale. Always carefully check the scale when looking at data presented in a chart or some kind of line diagram. One of the most common ways used to exaggerate or diminish a change in values is by manipulating the scale. Just a 10% increase in salary can be exaggerated dramatically by showing it like it is in the lower chart, or shown as stagnation like it is in the upper chart. The same data presented on different scales looks different, and just by looking at the line chart one may come to very different conclusions without getting into how much change is actually happening. Many advertisers do this all the time, playing with the scale to accentuate results, because they know half the people won't even look at the scale, and the other half will be so influenced by the dramatic visual representation — because visual representation is so powerful — that they will ignore the scale. Either way, the advertiser's purpose of accentuating the difference is served. Bars are also broken in bar charts to exaggerate changes, both up and down. Let's take this example. In 1995, the road accidents were 7,800. In 2005, the road accidents were 7,720. But when plotted in a broken bar chart that is broken at the 7,680 level, this small change of 80 road accidents gets accentuated, so people looking at it say, "Oh my God, 2005 had so many fewer accidents." But it is just the scale. That's how broken charts and scales are used to manipulate decisions.
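If you want to see the trick in action, here is a small matplotlib sketch (my own, using the accident figures above) that plots the same two bars twice: once with the axis starting at zero and once "broken" near 7,680.

```python
import matplotlib.pyplot as plt

years = ["1995", "2005"]
accidents = [7800, 7720]   # an 80-accident (~1%) difference

fig, (honest, broken) = plt.subplots(1, 2, figsize=(8, 3))

honest.bar(years, accidents)
honest.set_ylim(0, 8000)         # full scale: the two bars look almost identical
honest.set_title("Axis from zero")

broken.bar(years, accidents)
broken.set_ylim(7680, 7820)      # truncated scale: the small drop looks dramatic
broken.set_title("Axis 'broken' at 7,680")

plt.tight_layout()
plt.show()
```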
11. Comparison By Picture: Comparison by pictures. A lot of the time, people intentionally or unintentionally exaggerate or diminish data by comparing it with pictures. Crime records in the US in 1990 and 1998 were compared by pictures, and the reduction looked dramatic. This is because only the heights of the criminal figures in the picture are proportional to the values; their volumes are not. In reality, crime came down from 15 million to 9 million, but it looks far lower, because the volume of the smaller criminal figure is much smaller. It looks like the crime reduction was much bigger than it really was. Similarly, India's production of mangoes, when compared with China's by the height of a mango, looks much higher than it really is. China's was 4.5 million tons and India's is 18 million tons, but the volume of the bigger mango is far more than four times larger — it looks more like 10 to 15 times. That's how one can give the impression that one figure is much higher than another by showing two pictures. Similarly, India's motor vehicle sales increased from 3.2 million in 2014 to 4.4 million in 2019, just a 37% increase. But this is shown with two cars, the second drawn 37% taller than the first one. Just by looking at this, people get the impression that car sales increased much more in 2019. This trick is commonly used by advertisers and propagandists whose objective is to mislead viewers, because once the impression is in people's minds — the two mangoes in the previous slide, the two pictures of criminals in the first slide, or here the two pictures of cars — one kind of decides that the increase is much higher, even when it is very small.
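To quantify why scaling a picture's height misleads, here is a tiny Python sketch (mine) using the car-sales figures above: scaling a drawing's height by the ratio of the values scales its area by the square, and its apparent volume by the cube, of that ratio.

```python
sales_2014, sales_2019 = 3.2, 4.4    # millions of vehicles

height_ratio = sales_2019 / sales_2014   # what an honest bar chart would show
area_ratio = height_ratio ** 2           # what a picture scaled in both dimensions shows
volume_ratio = height_ratio ** 3         # what a solid, 3-D-looking object implies

print(round(height_ratio, 2))   # 1.38 -> the real change, about 37% more
print(round(area_ratio, 2))     # 1.89 -> the picture's area nearly doubles
print(round(volume_ratio, 2))   # 2.6  -> the "object" looks 2.6 times bigger
```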