How to Detect Statistical Deception for Better Decision Making | Saptarshi Bhattacharyya | Skillshare

How to Detect Statistical Deception for Better Decision Making

Saptarshi Bhattacharyya, Make Better Decisions

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more

Lessons in This Class

    • 1. Introduction (2:01)
    • 2. Sampling Bias (8:36)
    • 3. Sample Size (6:20)
    • 4. Small Numbers (4:13)
    • 5. Average Problem (16:35)
    • 6. Simpson's Paradox (14:28)
    • 7. Cause Effect Correlation (11:31)
    • 8. Statisculation (8:59)
    • 9. Half Relevant Data (7:30)
    • 10. Check The Scale (2:22)
    • 11. Comparison By Picture (2:32)

Community Generated

The level is determined by a majority opinion of students who have reviewed this class. The teacher's recommendation is shown until at least 5 student responses are collected.

54

Students

1

Projects

About This Class

We live in an age of information overload. We constantly need to use data to check evidence and verify the many claims made in both our personal and professional lives.

However, in most cases we have neither the time to check the raw data on which these claims are based nor access to it. Unfortunately, everyone takes advantage of this opacity and uses various statistical tricks to bend facts so that their own objectives and agendas are served.

From news articles to advertising claims, from business communications to political campaigns, data and information are constantly processed and presented in ways that are crafty and deceitful.

This class teaches the 10 statistical tricks most commonly used to hide facts while presenting data. By the end of the class, participants will be able to detect and defeat these tricks, objectively scrutinize the evidence for any claim, and make better personal and business decisions.

Meet Your Teacher

Saptarshi Bhattacharyya

Make Better Decisions

Teacher
Level: All Levels

Class Ratings

Expectations Met?
  • Exceeded!: 0%
  • Yes: 0%
  • Somewhat: 0%
  • Not really: 0%

Transcripts

1. Introduction: This class shows how statistics is used to hide the truth in order to manipulate our decisions, and how we can detect these tricks to make better decisions. Hello. I'm a cognitive science expert with over 25 years of experience in the US, India, the Middle East, and South Africa. We live in an information age. We need to use data to make every decision. But from marketers and media to politicians and businesspeople, everyone hides the real truth behind carefully constructed statistics so that the data they present serves their own purpose or agenda. What can we do? Well, it's pretty simple. We need to learn the statistical tricks people use to fool others so that we can catch them in the act and make better decisions. In this class, you will learn about the ten most common statistical tricks used by marketers, media, spin doctors, politicians, businesses, and other people to hide data in plain sight, so that at the end of this class you'll be able to make far better data-driven decisions in both personal and business situations with confidence. Who can take this class? Well, pretty much everybody, from beginners to professionals, from business leaders to students. Everybody will benefit from this class. At the end of this class, of course, there is a project, which involves investigating ten practical situations where people generally make wrong decisions due to the statistical tricks discussed in the class. The objective for the students is to first detect the applicable trick and then figure out the correct way to make the decision. I guess that's pretty much it. Looking forward to seeing you in the class and, oh yes, don't forget to put your comments and feedback at the end.

2. Sampling Bias: Sampling bias. As statistics can sometimes be a bit heavy, let's start in a lighter mood with a cartoon. So the guy has created a bird feed, which is bagels.
And he says, "After sampling every bird that frequents the sidewalk outside this building, we have concluded that what birds really love is bagels." That means, in order to decide what birds like, all he has done is look at the birds on the sidewalk, which were eating bagels because bagels are getting thrown out of his building. That is a funny way to look at how sampling can lead to a biased conclusion. Statistics is all about collecting a small number of candidates from a large population (this small number of candidates is called a sample) and deciding about the large population based on checking the sample. If the sample is representative of the population, our conclusions are right. If the sample is biased, as it is in the picture, where there are more green dots than red ones in the sample compared to the population, then our conclusion based on the sample is also not valid. A lot of the time in real-life surveys the conclusions are not valid because the samples are biased. For example, surveys of Ivy League college graduates always show a higher salary than the actual salary those graduates get, because mostly people with higher salaries respond. Sometimes before a poll, statisticians predict, based on surveys, that someone will win, and someone else wins in a big landslide, like it happened in the 1932 election, because the sampling was biased. In 2012, election polls conducted by landline predicted Obama would lose. But Obama won, because many of Obama's supporters were actually using cell phones and therefore were not counted by the pollsters' survey. If it is said that 80% of British people like electric vehicles, the first question you should ask is: 80% of which British people? Because if the sample is biased and comes from one section of the British people, then it is representative of that section rather than the British population. Now, let's take a look at how sampling bias happens. Let's start with the case of Ivy League students' average salaries.
A statement published by a journal says the average salary of Ivy League MBAs five years after graduation is $300,000. The question is, how accurate is a statement like that? Now, the statement can be very wrong and misleading. Why? Because they don't mention what the sample size was. They don't mention whether they had contact details for all the graduates to select from. They also don't mention whether the list of candidates who answered was selected randomly, or whether they just took the answers of people who cared to answer. It also does not say how many of them actually answered those questions. If they asked this question to 50 people and five answered, and they averaged those five and said it is $300,000, that's a heavily biased case, because those five may not be representative. The next question is, did only people with high salaries answer, which is most likely the case? Was it a representative sample? As you can see, there are so many questions, and most surveys carefully avoid answering any one of them. So you cannot trust any salary survey conclusion. People can manipulate the sample, or just be sloppy or not methodologically correct while collecting the sample. And as a result, unfortunately, many of these survey results are not trustworthy. 60% of urban Indians prefer engineering education: which 60%? 80% of Indians prefer a certain [inaudible]: 80% of which Indians? For the result of a study to be worth anything, the sample must be representative and random, meaning that while collecting the sample, all the candidates should have an equal chance of being selected. Many a time, that is not the case. There is a constant battle to reduce sampling bias by making the sampling random. Sampling bias is always there. If we see a statement like this, that 60% of urban Indians prefer engineering education, one should ask: 60% of which people? It would be 60% of the people they cared to ask, not 60% of all urban Indians.
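The salary-survey problem described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any real survey: the salary distribution (log-normal), the population size, and the rule that the chance of replying grows in proportion to salary are all hypothetical choices made for the example.

```python
import random
import statistics

def simulate_salary_survey(population_size=10_000, seed=0):
    """Return (true mean, mean among respondents, number of respondents)."""
    rng = random.Random(seed)
    # Hypothetical population: log-normally distributed salaries (an assumption).
    salaries = [rng.lognormvariate(11.0, 0.6) for _ in range(population_size)]
    # Assumed response behaviour: the chance of answering the survey is
    # proportional to salary, so high earners are over-represented.
    top = max(salaries)
    respondents = [s for s in salaries if rng.random() < s / top]
    return statistics.mean(salaries), statistics.mean(respondents), len(respondents)

true_mean, survey_mean, n_respondents = simulate_salary_survey()
print(f"true mean salary:        {true_mean:,.0f}")
print(f"mean among respondents:  {survey_mean:,.0f}")
print(f"respondents:             {n_respondents} of 10,000")
```

Because only people who "care to answer" are averaged, the survey figure comes out above the true mean even though nobody misreported anything.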
When random sampling, meaning every candidate has the same chance of being selected, is not possible, or is time-consuming or expensive, the next best thing is stratified random sampling, where you divide the population into several groups and sample each group based on its prevalence. So if a population is 50% white, 30% black, and 20% Asian, you collect a certain number of samples from the Asian community, a certain number from the black community, and a certain number from the white community, in proportion to their percentages. And that may involve [inaudible]. That is stratified sampling. The groups can be based on income, education, ethnicity, or anything else. But if your knowledge of the proportions is wrong, the sampling will also be biased. For example, if a graduate class is 30% women and 70% men, and your sample size is calculated to be ten, you can randomly select three from the female students and seven from the male students. That would be a case of stratified random sampling. That would be somewhat usable. Remember, all polls and survey results are somewhat biased. The bias comes from two different sources. First, it is the staff who are doing the interviews. The age, education, ethnicity, and experience of the interviewing staff heavily influence the survey result. In one case, it was found that, depending on whether white or black staff interviewed black communities, the results were quite different. The second source of bias is the people who have a higher tendency of being selected as samples or candidates. Generally, people with more money, better education, better appearance, more information and alertness, more conventional behavior, and acceptable habits get selected more often as candidates for an interview or sample. And as a result, they bias the poll and survey results.

3. Sample Size: Sample size. In many cases, people do not use a large enough sample size to make a conclusion.
For example, a hair oil advertisement claims, "Studies show that Hair Oil X reduces hair fall for 80% of users." But how credible are these statements? Some advertisers provide fine print that mentions a small sample size, but some don't even do that. Unfortunately, for commercial advertisements there is no regulation stating that a certain sample size and certain methodologies need to be used, as is the case for pharmaceutical products. As a result, people claim all sorts of things based on all sorts of sampling. Those results are mostly based on small sample sizes and hence are bogus; they can be due to pure chance. Or the advertisers may show only the few cases from their sample that had a good result, so that they can make a claim in an advertisement. Yet they are shown all the time. If you toss a coin ten times and get eight heads, you can't claim that a study shows the chance of getting heads in a coin toss is 80%. You need a large number of coin tosses to establish what the chance of getting heads is, which will be 50%. Similarly, in order to establish any claim, you need a large enough sample size. The sample size depends on how variable the outcomes are and how confident you need to be in your claim. Let's say the chance of a severe case from a viral fever is just 10% among kids. You have selected 1,000 kids with the viral fever to test the effectiveness of a medicine. Is this a large enough sample size? You are happy that the sample size is 1,000. But in reality, 90% of those kids will get better even without the medicine. You are actually testing the medicine on just 10%, or 100 kids, who need this medicine because they would be the severe cases. That is actually a smaller sample size than you thought. The actual sample size is determined by the confidence interval and the standard deviation in the population, on a case-by-case basis.
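The coin-toss point above can be checked with an exact calculation. The sketch below is an illustration only; the function name `prob_at_least` is made up for this example, and the coin is assumed to be fair.

```python
from math import comb

def prob_at_least(heads, tosses, p=0.5):
    """Exact binomial probability of getting at least `heads` heads."""
    return sum(
        comb(tosses, k) * p**k * (1 - p) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )

p_small = prob_at_least(8, 10)       # 80% heads or more in 10 tosses
p_large = prob_at_least(800, 1000)   # 80% heads or more in 1000 tosses
print(f"8+ heads in 10 tosses:     {p_small:.4f}")
print(f"800+ heads in 1000 tosses: {p_large:.3g}")
```

With ten tosses, a fair coin shows eight or more heads about 5.5% of the time, so an "80% heads" result from a small study is not rare. With a thousand tosses the same proportion is practically impossible by chance.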
A few factors are important in estimating the sample size needed. They are the confidence level, the confidence interval, the z-score (which depends on the confidence level), and the standard deviation, and the sample size is calculated based on those. I will not get into a lot of details, but it is important for us to be aware that there are certain ways to calculate the sample size. Confidence level is a measure of, if you repeat the experiment 100 times, how many times you will get a similar result. It's generally taken as 95% in most cases. Confidence interval is the degree of error. It's generally taken as plus or minus 5%. Z-score is a measure of how many standard deviations you are from the mean; based on the confidence level you've selected, there is a z-score for it. The standard deviation is the degree of variability. If you don't know anything, you can use 0.5. Putting all these together, there is a standard calculation, which is the square of the z-score, multiplied by the standard deviation, multiplied by one minus the standard deviation, divided by the confidence interval squared. With a 95% confidence level, a 5% interval, and a standard deviation of 0.5, you get a sample size of 385. So a sample size of 400 to 500 is generally good enough for most cases. Determining the sample size is a critical step of any statistical survey or study, because if this is done wrong, the entire study or survey is wrong. There were some colossal cases of this. In a 1950s polio vaccine study, 450 children were vaccinated and 680 were kept without the vaccine as part of the study during a polio outbreak. None of the kids contracted polio, because the rate of contraction was just less than 1%. Actually, the needed sample size was between 30,000 and 40,000 kids, so that they could expect to have 300 to 400 cases of polio, which would be the effective sample size to test the vaccine. They were way, way wrong.
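The sample-size formula and the polio arithmetic above can be sketched as follows. This is a minimal illustration, not a full survey-design tool. The z-score of 1.96 is the standard value for a 95% confidence level; the parameter `p` is what the transcript calls the standard deviation (0.5); the function names are made up for this example, and the 1% contraction rate is used as an upper bound because the transcript only says "less than 1%".

```python
import math

def required_sample_size(z=1.96, p=0.5, margin=0.05):
    """n = z^2 * p * (1 - p) / margin^2, rounded up to a whole person."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

def expected_cases(sample_size, rate):
    """How many cases you can expect to observe at a given contraction rate."""
    return sample_size * rate

print("sample size for 95% confidence, +/-5%:", required_sample_size())

# Polio example: expected cases at an assumed 1% rate (upper bound).
for kids in (450 + 680, 30_000, 40_000):
    print(f"{kids:>6} kids -> about {round(expected_cases(kids, 0.01))} expected cases")
```

The first line reproduces the 385 quoted in the transcript. The loop shows why roughly 1,100 children were far too few: at most about 11 cases would be expected, against 300 to 400 with 30,000 to 40,000 children.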
They selected 450 and 680, around 1,100 kids, where they needed 30,000 to 40,000 kids. So estimating the sample size and using a sample size of the right value is of extreme importance for all statistical studies. But unfortunately, in most of the claims that we see from advertisers for different products, or even from politicians, media, and whatnot, almost all of these guys don't use an adequate sample size, and as a result, none of those claims are valid.

4. Small Numbers: Small numbers. Let's start with a quiz. One town has two hospitals, one small and one large. In one of the two hospitals, in a given month, 60% of the babies born were boys, whereas typically it is 50%. Which one is it, the small or the large hospital? The answer is the small hospital. Because, just as with a small sample size, extreme or unusual cases (in this case 60% boys is unusual, 50% is the usual) are seen in small samples, or in this case the small hospital. While investigating cases of kidney cancer in over 3,000 US counties, it was found that the highest incidence came from small counties. Interestingly, it was also found that the lowest incidence also came from small counties. That means small and less populated counties had both the highest and the lowest incidence. That's the case of small numbers. And why does it happen? Let's look at a really small county with just ten adults. As it has just ten adults, it's highly likely that none of them has cancer, so it will show 0%. But if just one person gets cancer by any chance, it will show 10% incidence, which is very high. That means if a county is small, chances are that it will be spared many cases, and if it has just a few, it will show a high incidence. So small counties will show both high and low incidence. Small districts with low populations will always show extreme results.
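The hospital quiz above can be made concrete with an exact binomial calculation. The transcript does not say how many babies each hospital delivers, so the figures of 15 and 150 births a month below are hypothetical, as is the function name; each birth is assumed to be a boy with probability 50%.

```python
from math import comb

def prob_60_percent_boys(births):
    """Exact chance that at least 60% of `births` are boys at 50/50 odds."""
    min_boys = (3 * births + 4) // 5   # integer ceiling of 0.6 * births
    favourable = sum(comb(births, k) for k in range(min_boys, births + 1))
    return favourable / 2**births

small = prob_60_percent_boys(15)    # hypothetical small hospital
large = prob_60_percent_boys(150)   # hypothetical large hospital
print(f"small hospital (15 births):  {small:.3f}")
print(f"large hospital (150 births): {large:.4f}")
```

Under these assumptions the small hospital sees a "60% boys" month roughly 30% of the time, while for the large hospital it is a rare event, on the order of 1%.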
In larger districts, by contrast, cases across the entire population even out to show more average results. One supermarket chain found that its smaller stores showed both the lowest and the highest footfall per unit area and sales per unit area. A smaller sample is more likely to show extreme results due to randomness, and a larger sample better represents the population. That's why we hear about something called the law of large numbers in statistics. In order to draw any conclusion, you can depend only on large numbers, and with large numbers things tend to average out. Now, how do small numbers produce extreme results? Let's take another example. Let's say students are staying in 100 rooms with five students in each. Out of 500 total students, 25 got an A-plus in their exam. That means many of the rooms have 0% students with an A-plus, because there are only 25 A-plus students and 100 rooms. But if a room has just one A-plus student, it will show 20% A-plus: one student divided by five, that is 20%. If a room has two A-plus students by any chance, it will show 40% A-plus. You can see that because the rooms are small, any figure gets exaggerated. That is how small numbers exaggerate results, and that is why the law of large numbers needs a larger number to draw a conclusion.

5. Average Problem: Average problem. One of the most common ways statistics is used is as a summary of lots of data. The average is probably the most commonly used summary statistic. We like to summarize a lot of data into one figure, and that explains the popularity of the average. We hear statements like "the economy will create 20,000 jobs per month." That's an average figure. "The country will get 20 inches of rainfall" is another average figure, as is "an average salary increase of 10% per year," and so forth. But all these average figures can be really misleading, and they may hide more information than they show. For example, some months may add 50,000 jobs, whereas some other months may add just ten thousand jobs.
So a summary of 20,000 doesn't mean much. Similarly, some parts of the country may get flooded, whereas there could be drought in some other parts. So an average rainfall of 20 inches in a country does not mean anything. Similarly, some people's salaries may double, whereas some may not see an increase at all. So an average salary increase of 10% is absolutely useless information as far as understanding the real increase in people's salaries is concerned. It's said, don't cross a river if it's on average four feet deep. Because the river can be 20 feet deep at some places and just one or two feet deep at some other places. So an average four-foot depth is meaningless; if you try to cross the river thinking it will be four feet on average, you'll drown. One of the dramatic cases of the problem of using averages was what scientists found during the earlier years of tracking global warming. At that point, scientists were looking mainly at the average increase in temperature and trying to correlate that with the melting of the polar ice caps. They were surprised to find that even if the average temperature remained the same, more and more of the polar ice caps were melting year over year. Later on, they found that even if the average remains the same, years with high peaks in temperature cause more ice to melt. The average was not of much use. Similarly, look at two cases. One person drinks one beer every day for 30 days, so that is an average of one beer a day. Another person does not drink any beer for 29 days, and on the 30th day he drinks 30 beers. In both cases the average is one beer per day, but the impact of drinking will be much more severe, and could be fatal, in the second case of drinking 30 beers in a day. So averages can be really misleading. Another famous case of the problem with averages was the 1997 flood of Grand Forks. The city had a 51-foot levee, higher than the 49-foot forecasted river water level, which was an average.
But at some places during the flood, the water reached 53 feet, even if the average remained 49. In those places, the water breached the levee, causing a flood in the city. Depending on averages caused a flood in the city. Averages can be misleading in many other ways. Let's take the case of average life expectancy. For example, in the 1900s, the average life expectancy at birth was 32 years. But this 32 years doesn't really mean anything. It doesn't mean that everybody died at 32. Many died at birth, many died below the age of five, and many died waging war. But those who survived all of that lived into their sixties and seventies, just like today. The average came out to be 32, but this figure does not really mean anything. That's why there is no average company. There is no average stock market fluctuation. There is no average war, no average epidemic, no average marketing campaign, no average book sale. In most cases, the distribution can be very skewed, and an average may not mean anything. For example, on the left side you see a negatively skewed distribution, where the average is moved to the left, and on the right side you can see a positive skew, where the average is moved to the right because its tail is towards the right. Only in the case of the central, normal distribution is the average in the center, and so it is somewhat meaningful. But in all real cases, the distribution of values is always skewed, and as a result the average does not convey much useful meaning. In the context of statistical summary, three types of averages are used. They're called mean, median, and mode. Mean is the simple arithmetic average. Let's take the example shown in the table to your left. Let's say there are 20 people in a room. Their salaries are shown in the columns. Their salaries vary between $20,000 a year and $3 million a year. So the mean salary is the arithmetic average, which comes out to be $248,750.
Now, the median is the salary in the middle when they're all arranged in ascending or descending order. In this case, it comes out to be $56,500. Mode is the salary figure that occurs the most. Here, you see $60,000 occurs twice, versus everything else occurring once. So $60,000 is the mode. As you can see, when a few outliers can change the mean, the median is a better summary. In this case, among 20 people, only two people have a very high salary, $1 million and $3 million. The rest of them are between $20,000 and $95,000. These two high values are skewing the mean. So if somebody says the average salary of people in the room is $248,000, that does not reflect the true nature of the salaries of the people in the room. Compared to that, the median, which is $56,000, is a much better representation. In all cases where a few outliers can heavily skew the average, the median is a better estimate of the average. The mode in this case is also somewhat useful, but in many cases no value may occur more than once, so the mode may not be usable. The median is a better estimate in a lot of cases that are heavily influenced by high or low outliers. But in all practical cases where the distribution is heavily skewed to one side or the other, none of the averages, mean, median, or mode, can be used. Let's take the example of cancer. A lot of the time a doctor will tell a cancer patient that he has just 12 months to live, or 18 months to live. The patient should not lose heart, because that 12 months or 18 months could be a mean or median value. And that can be heavily skewed, because there could be a lot of people dying immediately or after a few months. If the patient survives the first few months, he or she can expect to live a much longer life, because even the mode can be shifted far towards the right side, towards a higher number of months or years.
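Going back to the salary example from earlier in this lesson, the difference between mean, median, and mode can be checked with Python's standard `statistics` module. The actual table from the video is not part of the transcript, so the 20 salaries below are hypothetical: they were chosen only so that they match the figures quoted (a range from $20,000 to $3 million, two outliers of $1 million and $3 million, a mean of $248,750, a median of $56,500, and a mode of $60,000).

```python
import statistics

# Hypothetical salaries (USD); NOT the instructor's table, only values
# constructed to reproduce the summary figures quoted in the transcript.
salaries = [
    20_000, 25_000, 30_000, 35_000, 40_000, 45_000, 48_000, 50_000,
    52_000, 55_000, 58_000, 60_000, 60_000, 65_000, 72_000, 80_000,
    85_000, 95_000, 1_000_000, 3_000_000,
]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)
people_below_mean = sum(s < mean for s in salaries)

print(f"mean:   {mean:,.0f}")
print(f"median: {median:,.0f}")
print(f"mode:   {mode:,.0f}")
print(f"people earning less than the mean: {people_below_mean} of {len(salaries)}")
```

In this made-up room, 18 of the 20 people earn less than the mean, which is why the median describes a typical salary better when a couple of outliers are present.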
Look at the case of distribution C, where both the mean and the median are towards the left, but the actual number of people living much longer than the mean and median is towards the right. So one can even live 60 or 70 years after being diagnosed with cancer, and that has happened. Let us take another example. Let's say 80% of the companies have $90 million in sales and just 20% of the companies have $60 million in sales. If you calculate the average, you will see it comes out to be $84 million. But in this case, 80% of the companies are above average, which is kind of difficult to comprehend. It's counter-intuitive. But in a lot of cases, averages provide a counter-intuitive and sometimes intentionally perplexing result, and people end up making wrong decisions by using the average. So always look at the distribution and make a decision based on that. Never decide based on an average figure alone. Average calculations sometimes lead to very confusing and misleading information. For example, the average child does not come from the average family, because the calculation of the average child and the calculation of the average family are not the same. To illustrate this, let's say there is a village with 100 families. The number of kids per family is as shown in the table next to you. The 100 families are divided into five groups of 20 families each. The first group has zero kids. The second group has one kid. The third group has two kids, the fourth group has three kids, and the fifth group has four kids. If the question is, what's the number of kids for an average family, you have to calculate the total number of kids by multiplying the left column by the right column. That will lead to 200 kids, and you divide it by 100 families. It will give the average number of kids per family, which is 200 divided by 100, that is two. But if the question is, the average kid lives in a family with how many kids, the answer is not two. That means, although the average family has two kids, the average kid doesn't live in a two-kid family.
It is confusing. But let's take a look at the kids. There are no kids in the first group, because it has zero kids. In the second group, there are 20 kids in total, and they live in one-kid families. In the third group there are 40 kids, and they live in two-kid families. In the fourth group there are 60 kids, 20 multiplied by three, and all of these 60 kids live in three-kid families. In the final group there are 80 kids, because 20 families multiplied by four kids is 80 kids, and all of these 80 kids live in four-kid families. So to calculate the number of kids in the family that an average kid lives in, you have to multiply 20 by one, 40 by two, 60 by three, and 80 by four, because 20 kids live in one-kid families, 40 kids live in two-kid families, 60 kids live in three-kid families, and 80 kids live in four-kid families. You add these up and divide by the total number of kids, which is 200, and you get three. That means the average kid lives in a three-kid family, whereas the average family has just two kids. Similarly, the average employee doesn't work in an average company; the figure is skewed towards larger companies. Here, you see the average kid is skewed towards larger families with more kids. Similarly, the average citizen doesn't live in an average country. The average citizen is skewed towards high-population countries like India, China, and the US. That's how averages can be really confusing. We'll look at another example to show how the average student does not go to the average college, because the average student and the average college are not calculated in the same way. There are four colleges, with students as per the table. College one is small and has 1,000 students. College two has 3,000 students. College three has 10,000 students, and college four has 30,000 students. There are 44,000 students in total. The first question is, what's the average number of students per college? Four colleges, 44,000 students in total. So the average number of students per college is 44,000 divided by four, that is 11,000.
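The two kinds of average used in the family example above, and in the college figures just listed, can be written as two small functions. This is a sketch for illustration; the function names are made up. The first averages over groups (families or colleges), the second averages over members (kids or students), which weights each group by its own size.

```python
def average_per_group(sizes):
    """Ordinary mean: total members divided by number of groups."""
    return sum(sizes) / len(sizes)

def average_seen_by_member(sizes):
    """Mean group size as experienced by a member: each group counted once per member."""
    return sum(size * size for size in sizes) / sum(sizes)

# Village from the transcript: five groups of 20 families with 0 to 4 kids.
family_sizes = [0] * 20 + [1] * 20 + [2] * 20 + [3] * 20 + [4] * 20
print(average_per_group(family_sizes))       # kids in the average family
print(average_seen_by_member(family_sizes))  # kids in the average kid's family

# Four colleges from the transcript.
college_sizes = [1_000, 3_000, 10_000, 30_000]
print(average_per_group(college_sizes))
print(round(average_seen_by_member(college_sizes)))
```

For the village this gives two kids per family but three kids in the family of the average kid. The same member-weighted calculation applied to the four colleges gives a figure roughly twice the per-college average of 11,000.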
The second question is, the average student goes to a college with how many students? This calculation is very different, because in college one, 1,000 students go to a college with 1,000 students. In college two, 3,000 students go to a college with 3,000 students. In college three, 10,000 students go to a college with 10,000 students. In college four, 30,000 students go to a college of 30,000 students. So the average student goes to a college with 30,000 squared plus 10,000 squared plus 3,000 squared plus 1,000 squared, divided by 44,000 students. That comes to 22,954. So even though the average college has just 11,000 students, the average student goes to a college of about 23,000 students. Similarly, the average population per country is 39 million, because there are around 7.8 billion people and around 200 countries. You divide that 7.8 billion by 200, and you come to around 39 million people per country. But the average citizen lives in a country with many more people, almost ten times more. That's how the average country and the average citizen are not the same thing. The average student and the average college are not the same thing. Often these figures are intentionally used by different people, and they present the average figure that serves their purpose.

6. Simpson's Paradox: Simpson's paradox. Between 1963 and 1980, the average verbal and math SAT scores of all US students came down by 50 and 40 points respectively. These are big drops. This led to a nationwide outrage, and a presidential panel was formed. They investigated the results and published an infamous report called A Nation at Risk. This led to subsequent drastic measures: a complete overhaul of the education system, firing a lot of teachers, and so forth.
However, later investigation found that, interestingly, when all the students were broken down into individual income groups, all the groups saw an increase in SAT score between 1963 and 1980, although the group of all students put together saw a decrease. This is an example of Simpson's paradox, and it is seen in lots of different cases. To illustrate how it happens, let's say that in 1963 and 1980 all the students could be divided into three main income groups: poor, middle-class, and rich. In 1963, an overwhelming 70% of the students came from rich families, and just 20% and 10% came from middle-class and poor families. Their average scores in math were, respectively, 550 for rich, 500 for middle-class, and 400 for poor, and the overall average was 525. This mix completely changed in 1980, because between 1963 and 1980 the total number of students increased many fold, and all the increase came from poor and middle-class students. Now they constituted 40% each of all the students, which reduced the proportion of rich students to just 20%. Now, for all three groups, the math SAT score was higher. The poor students' score was 420, which was higher by 20 points. The middle-class students' score was 520, which was higher by 20 points. And the rich students' score was 560, which was also higher, by ten points. However, the overall average in 1980 was just 488, because the total mix was now different, with a larger proportion of poor students and a smaller proportion of rich students. The overall average came down from 525 in 1963 to 488 in 1980, even though all the individual groups saw an increase in their average math SAT score. That is what is called Simpson's paradox. Simpson's paradox is named after a mathematician, Edward Simpson, who described it in 1951. It says that sometimes a result or trend is seen when data is segregated into different groups, but when all these groups are put together, that result or trend is reversed or disappears. Let's look at the two charts to your left.
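The SAT arithmetic in the example above can be verified with a weighted average. The group shares and scores below are the illustrative figures from the transcript; the function and variable names are made up for this sketch.

```python
def overall_average(share_percent, scores):
    """Weighted average score, with group shares given in whole percent."""
    return sum(share_percent[g] * scores[g] for g in share_percent) / 100

share_1963 = {"rich": 70, "middle": 20, "poor": 10}
score_1963 = {"rich": 550, "middle": 500, "poor": 400}
share_1980 = {"rich": 20, "middle": 40, "poor": 40}
score_1980 = {"rich": 560, "middle": 520, "poor": 420}

avg_1963 = overall_average(share_1963, score_1963)
avg_1980 = overall_average(share_1980, score_1980)
every_group_improved = all(score_1980[g] > score_1963[g] for g in score_1963)

print(avg_1963, avg_1980, every_group_improved)  # 525.0 488.0 True
```

Every group's score goes up, yet the overall average falls, purely because the mix of students shifted towards the lower-scoring groups.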
In the upper chart, you see the whole data together as gray dots. If you look at it, you see an increasing trend from left to right. When you divide these gray dots into four different colors, green, yellow, blue, and red, you see that for the green it is a decreasing trend. For the yellow, it is a decreasing trend from left to right. For the blue, it is again a decreasing trend. For the red, it is a slightly increasing trend, but not as dramatic as for the gray dots. Basically, all the individual groups are showing different trends compared to all the dots put together in aggregate. That is what is called Simpson's paradox. Simpson's paradox is present in many areas, and it confuses decision-making. Let's take a look at a few other examples. Again, this is a real-life example. In a university's PhD programs, it was found that 50% of male applicants were getting accepted but just 38% of women applicants were getting accepted. That led to a lot of debate and discussion about a bias against women. But further investigation showed a different trend when all the PhD programs were divided into natural science PhDs and social science PhDs. It was found that in both types of PhD program, more women were getting accepted than men, as a percentage of applicants. But women predominantly applied for social science PhD programs, which had a much lower acceptance rate than natural science PhD programs for both male and female students. That created the apparent bias in the aggregate acceptance rate for women. How does that happen? Let's take a look at the table. 80% of the male students were applying for natural science and 20% for social science, whereas 30% of the women were applying for natural science and overwhelmingly 70% were applying for social science PhD programs. Now for men, the acceptance rate for natural science was 60%, and for women it was 80%.
Women had a higher acceptance rate for natural science PhD programs. Similarly, for social science PhD programs, men had a 10% acceptance rate and women had 20%, double. But as more women applied for social science PhD programs, which had a much lower acceptance rate compared to natural science PhD programs, the overall acceptance rate for men was 50% and for women was 38%. Again, when looked at individually, it is one trend. When looked at in aggregate, it is a different trend. So every time you see a trend while looking at lots of data together, just remember that if there is a way to divide the data into meaningful groups, the individual groups may show a completely different trend, even an opposite trend, compared to the whole aggregated data. You have to be careful. It is seen in business research all the time. A company sold split ACs and window ACs. They checked the feedback of 1,000 customers of each product. Of the 1,000 customers who purchased a split AC, as shown in the table to your left, 800 said they liked the split AC, that is, 80%. Similarly, of the 1,000 customers who purchased a window AC, 75% said they liked the window AC. So they concluded that the split AC is liked more. The result reversed when the customers were divided into men and women. Out of the 1,000 split AC customers, 900 were male and 100 were female. Out of the 900 male customers, 750 said they liked the split AC, that is, 83%. And among the 100 women customers, 50 said they liked the split AC, that is, 50%. For the window AC, out of the total 1,000 customers, 800 were male, and 700 of those 800 said they liked the window AC, that is, 88%. 200 were women customers who bought a window AC, and 150 liked it, that is, 75%.
As you can see, 83% of men liked the split AC compared to 88% of men who liked the window AC. When you look at just the male customers, more customers liked the window AC. Similarly, when you look at just the female customers, 50% liked the split AC and 75% liked the window AC. So again, more female customers liked the window AC. More female customers like the window AC and more male customers like the window AC. But when you put male and female customers together, at the aggregate level you see a different trend, with more people liking the split AC. So that is another example of Simpson's paradox. Now we will look at a couple of case studies in business to see how Simpson's paradox can be identified as a potential problem and what to do about it. In a market research study, data showed that between 2000 and 2015, the average spending per household on a certain product category decreased from $120 in 2000 to $95 in 2015. Now the question is: is this a downward trend? Should you discontinue that product line based on this trend? The answer is that just by looking at the overall trend of all the data, you cannot make that decision. This is an instance of the Simpson's paradox problem that we have discussed. To evaluate whether the product line can be discontinued, you have to make the decision based on potential subgroups. So you have to investigate by dividing the total group into meaningful subgroups that may behave differently, and check whether the same trend is seen in each subgroup individually. The subgroups could be based on income: high-income, middle-income, and low-income families. Is it the same trend in all these groups? They could be based on male and female customers, or on location, or on anything else that is meaningful for the data. If all these individual groups are also showing a decrease in spending, then you can make a decision.
But chances are that some other factor is making the overall group and the individual groups show different trends. If that is the case, then you should be continuing with this product line. Let's take a look at another case study, which is again a common problem. House sale data between 2014 and 2019 showed that the price per square meter of real estate came down by 10% in a city. Is this a downward real estate trend? Again, this is a Simpson's paradox problem, because the real estate market can be divided, segmented, and grouped in many ways. To decide whether it is a downward trend, you have to look at subgroups. You have to investigate by dividing the entire group into subgroups, which could be based on location, to see whether you are seeing the same trend in all locations, or on the types of houses, or on the sizes of houses. Chances are you would see that individual groups are showing different trends. Maybe some are showing an increase in price, but some other factor is at work: maybe one type of housing is being sold in overwhelmingly large quantity, and as a result it is changing the trend for the whole group, whereas individual groups could be showing a different trend, the way we saw for poor, middle-income, and rich students in the SAT scores. Only after investigating the trend in the subgroups can you conclude whether this trend of decreasing real estate prices holds. Similarly, there are many such cases. Every time you see a group of data points, try to divide it into different groups and see if those groups are showing the same trend, because you have to watch out for the problem of Simpson's paradox.

7. Cause Effect Correlation: Cause, effect, and correlation. When two events happen together, and a positive or negative change in one event is associated with a similar change in the other event, they are said to be positively correlated.
When this relationship is opposite, they are said to be negatively correlated. There are many cases where things are positively correlated. For example, there is a 0.3 correlation between college GPA and subsequent income. A correlation coefficient is calculated in a certain way; we won't get into that. But there is a certain level of positive correlation between college GPA and subsequent income. There is a higher level of correlation between IQ and average job performance. There is a much higher correlation, 0.7, between height and weight: taller people are heavier. There is an even higher correlation between the math SAT scores of two consecutive years. So these are cases where things are positively correlated. Now, there are two common fallacies in decision-making related to correlation. One is called the post hoc fallacy: when a second event happens after a first event, they are said to be correlated, and the first event is said to be causing the second, which may not be the case. The other is cum hoc ergo propter hoc: when two events happen together, one must be because of the other, which may not be the case, because correlation does not mean causation. That two events are correlated doesn't mean one is causing the other. False causality, or falsely thinking that one event is causing another, is the mother of all statistical fallacies. When two events happen at the same time, they are related, but correlation does not mean causation. If two events happen at the same time, it does not mean one is causing the other. Every morning the rooster crows and the sun rises. That does not mean the rooster causes the sunrise; they just happen at the same time. The rooster anticipates the sunrise and crows, and the sun rises. They are correlated, but one is not causing the other. There is also something called spurious correlation, because by chance two events could be correlated when there is no relationship whatsoever. Let's take a few examples.
If you plot iPhone sales in the US and deaths caused by falls down the stairs, you would see a very high level of positive correlation, because both of these increased over the years. But as we know, there is absolutely no actual causality and no relationship between them. Similarly, if you plot per capita consumption of high fructose corn syrup, which has dropped, against spending on spectator sports, you would see that one is increasing whereas the other is decreasing. These are just two events; it is a spurious correlation. Similarly, if you look at visitors to Universal Studios in Orlando and sales of new cars, you would see both coming down, but again, they are just both coming down; there isn't any relationship between them. A spurious association can be completely coincidental. These are extreme examples, but a lot of the time, if two events are somewhat related, one can falsely conclude, or manipulate others into believing, that they are related and hence that one is causing the other. You have to watch out for that. The two-by-two table, or fourfold table, is a statistical table that can help in making decisions. It is a robust way to see how strong the correlation and causation are. For example, often you will hear people tell you that 20 people with flu took a medicine and got well after five days, compared to just ten people who got well without taking the medicine, so therefore the medicine is effective. But in order to check whether the medicine is effective or not, you have to complete the fourfold table. On one side, you have to look at people who took the medicine versus people who didn't. On the other side, you have to look at people who got better after five days versus people who didn't. If a total of 30 people took the medicine, and 20 got better and 10 didn't, and a total of 15 people didn't take the medicine, and 10 got better and 5 didn't, then you cannot conclude anything.
That is because the ratio of 20 to 10 is the same as the ratio of 10 to 5. But if 30 people didn't take the medicine, and 10 got better and 20 didn't get better, then you can say, okay, the medicine is helping. Similarly, you can create different fourfold tables to look for the additional information that you can plug in, in order to make decisions. Sometimes the causality could be in reverse. Suppose a report says that companies with more women on their board are more profitable or larger in size. It may mean that companies with more women are more profitable, or it may mean that more profitable companies are hiring more women. Which one it is is not really clear. Similarly, top colleges claim that their students perform better. Whether the students perform better because of the colleges, or better students are going to the top colleges, is not very clear. In these cases the causality could be reversed. In two back-to-back events, the former is often seen as causing the latter, which is often not the case. This fallacy is called the post hoc fallacy. It is a correlation-related fallacy. There is this joke: an old man says his doctor asked him to quit smoking, and another old man says, "Oh, don't do that. A couple of my friends just quit smoking and they died." This is a funny way of putting it: the dying had nothing to do with quitting smoking, but it happened after they quit smoking. In a lot of real cases, one event happens completely unrelated to, and not driven by, anything else, but people guess that it was because of some other event that happened before it. Take the famous case of the Power Balance hologram bracelet, which became a global rage between 2007 and 2012. Millions of dollars were spent on it, as people, including sportsmen, thought the product was enhancing their athletic ability.
Later on, randomized double-blind trials found that there was no relationship whatsoever between wearing those bracelets and athletic ability. Athletes were training, people were trying to lose weight and become more athletic, and they were getting better. That had nothing to do with wearing the bracelet. But people thought that they wore the bracelet and so they got better. So it was false causality from correlation, another case of two events happening at the same time being taken as one causing the other. In 1992, 30 American teenagers who frequently played a certain video game committed suicide. This created a nationwide alarm, with people thinking that the video game was causing the suicides. But further investigation found that the teen suicide rate at that time was 12 per 100,000, and 3 million teenagers played the game regularly. So 360 people were statistically prone to committing suicide among the people who played the game. Thirty suicides was not a big number, but it led people to think that the suicides were due to the game. Again, this is false causality, because both these events happened at the same time. All kinds of companies claim all kinds of correlation between their products and some improvement, but that is often not the case. A company says its shampoo leads to thick hair. However, the reality could be that people with thick hair were using the shampoo in the first place, suggesting reverse causality. A diet product claims that its use lowers weight, but people using the product would be doing ten other things, like exercising and eating less. Those are causing the weight loss, not the product. A beer company claims that having beer helps with heart disease. But there could be no relationship whatsoever, because more healthy people could be drinking beer, so they already have better hearts. The causality they are claiming could be just spurious causality.
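Two of the checks described in this lesson, completing the fourfold table and comparing a count against the background rate, can be sketched in a few lines of Python. This is a minimal illustration using the lesson's numbers; the function names are assumptions made for the example, and it compares rates only, without any significance testing.

```python
def recovery_rate(recovered, not_recovered):
    """Share of a group that got better."""
    return recovered / (recovered + not_recovered)


def compare_fourfold(treated, untreated):
    """Return the recovery rates of the treated and untreated groups.

    Each argument is a (got_better, did_not_get_better) pair,
    i.e. one row of the fourfold table.
    """
    return recovery_rate(*treated), recovery_rate(*untreated)


def expected_cases(population, rate_per_100k):
    """Number of cases expected from the background rate alone."""
    return population * rate_per_100k / 100_000


# First scenario from the lesson: 20 better / 10 not with medicine,
# 10 better / 5 not without it. The two rates come out equal.
same = compare_fourfold(treated=(20, 10), untreated=(10, 5))

# Second scenario: 20 better / 10 not with medicine,
# 10 better / 20 not without it. Now the rates differ.
different = compare_fourfold(treated=(20, 10), untreated=(10, 20))

# Video game example: 3 million players, 12 suicides per 100,000 teenagers.
baseline = expected_cases(3_000_000, 12)

print("same-rate scenario:", same)
print("different-rate scenario:", different)
print("cases expected from the background rate:", baseline)
```

The point of the fourfold table is that the "20 got better with medicine" figure means nothing until the other three cells are filled in. Likewise, the 30 cases in the video game example only become interpretable next to the number expected from the background rate.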
Another important factor related to correlation is that often it is wrongly assumed that the correlation persists beyond the data with which it has been established. For example, with increasing rain, corn grows taller, but beyond a certain point it becomes shorter again. If the data covers up to a certain number of inches of rain, the conclusion should end there, because if it floods, everything may get destroyed. Similarly, education has a positive correlation with earnings, but only up to a certain point, maybe up to master's degrees, because PhDs earn less. This correlation does not hold ad infinitum. Similarly, people's happiness is seen to increase with an increase in money, but only up to a certain point. After that it may not change much, or it may even decrease. If happiness kept rising with money, the richest people would be the happiest people in this world, but that is not the case. At the same time, if somebody is poor, more money would make them really happy. So initially, with increasing money, happiness may increase, and indeed people have found that. But after a certain level, the change is not very significant. Again, the relationship, correlation, or causality may exist up to a certain point, which should be checked. Beyond that, it should not be extended ad infinitum.

8. Statisculation: Statisculation. Statistical figures are used by journalists, advertisers, so-called experts, salesmen, and others, often to misinform and manipulate people. This process is called statisculation. Here is an example. You would see advertisers in stores promoting a 50% plus 20% sale on clothes. Many may think it is a total 70% discount. But in reality it is first 50% off, and then another 20% off the remaining 50%. So the total discount is 50% plus 20% of 50%, which is 10%, so 60% in total. If they say a total 60% sale, that will create less impact. That is why they say 50% plus 20%. You will see this all the time. These are ways to manipulate people.
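The stacked discount arithmetic can be written out as a small Python sketch. The function name is an assumption for illustration; it assumes each discount applies to whatever price is left after the previous one, as described in the example.

```python
def combined_discount(*discounts):
    """Total discount when each discount applies to the price
    remaining after the previous discount."""
    remaining = 1.0
    for discount in discounts:
        remaining *= 1.0 - discount
    return 1.0 - remaining


total = combined_discount(0.50, 0.20)
print(f"50% + 20% off is a {total:.0%} total discount, not 70%")
```

The same function works for any chain of discounts, for example `combined_discount(0.30, 0.30, 0.10)`.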
Similarly, journalists will report that a one-day shutdown in a city or in a state caused such a huge economic loss. These figures are always calculated based on all possible economic activities that can potentially happen in a day, which never happens. More importantly, in a one-day shutdown, most of those economic activities will just be transferred to the next day. The actual impact is much less. But journalists would like to get your interest in their news. That is why they give such a high figure, so that more people read the news. News is big business; they have no intention of giving you information. People often add up percentages to highly exaggerate changes. For example, they say the cost of material has gone up by 5%, labor by 7%, utilities by 13%, and transportation by 15%, so the total cost has increased by 40%. This is bizarre if you just think about it. You would say that surely people don't do that, but people do do this. This is a case of statisculation. If you break up the cost, you will see it. For example, let's say 60% of the cost was material, 30% labor, 5% utilities, and 5% transportation. Those 5%, 7%, 13%, and 15% increases raise the overall cost by 6.5%, a paltry amount. But very few people will see through it and see that the total increase in cost is just 6.5%. Companies, especially consulting or services companies, often say that their management has more than 200 years of combined experience. Strangely, ten people with 20 years of experience each become 200 years of combined experience. That means nothing. It is just statisculation: showing a completely arbitrary figure to give the impression that a company has a lot of experience. If a company cuts employee salaries by 25% one year and increases them by 25% the next year, it can claim that it has given people back their salary, but the salary is still down by 6.25%. Again, that is another way to statisculate.
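Both the cost example and the salary example come down to simple arithmetic, sketched below in Python. The cost shares and increases are the ones quoted in the lesson; the function and variable names are assumptions made for this illustration.

```python
# Cost shares and price increases from the lesson's example (shares sum to 1).
cost_share = {"material": 0.60, "labor": 0.30, "utilities": 0.05, "transportation": 0.05}
increase = {"material": 0.05, "labor": 0.07, "utilities": 0.13, "transportation": 0.15}


def overall_increase(shares, increases):
    """Weight each component's increase by its share of total cost."""
    return sum(shares[name] * increases[name] for name in shares)


def net_change(*changes):
    """Net effect of successive percentage changes applied one after another."""
    factor = 1.0
    for change in changes:
        factor *= 1.0 + change
    return factor - 1.0


naive_sum = sum(increase.values())                  # adding up percentages: misleading
weighted = overall_increase(cost_share, increase)   # the actual overall increase
salary = net_change(-0.25, 0.25)                    # 25% cut, then 25% raise

print(f"naive sum of increases: {naive_sum:.0%}")
print(f"weighted overall increase: {weighted:.1%}")
print(f"salary after cut and raise: {salary:.2%}")
```

The weighted figure is the one that matters for total cost, and the salary result shows why a cut followed by an equal percentage raise does not restore the original amount: the raise is applied to a smaller base.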
The net profit of a company increased from 2% last year to 4% this year. One can claim that the profit has increased by two percentage points, but that would not sound good. The better way to claim it is that the profit rate has increased by 100%, from 2% to 4%. That will make a much bigger impact. And people do do that. Another way to statisculate is to use a different sort of calculation to come to a figure. For example, household income can be calculated by dividing the total income by the number of households. That is one way to calculate it. Another is multiplying the average per capita income by the average number of people in a household. The second calculation will give a higher figure, and some people do use it. So if you want to show a low household income, use the first calculation; if you want to show a high household income, use the second calculation. Both are legitimate. In 2021, the median household income in the US was $68 thousand. In the same year, the median per capita income was $37 thousand, and the average household size was 2.6. If you multiply 37 by 2.6, you get $96 thousand, which is much higher than the $68 thousand. If the average is used instead of the median income, the household income will go up even higher. How you calculate will determine the household income. People who have different agendas will do different calculations. Let's take this example of how a consumer price index can be calculated differently to give different figures. Milk and bread cost $0.20 and $0.05 last year. This year both cost $0.10. Different calculations can show different price indexes, and depending on your agenda you will calculate it one way or another. When last year is used as the base year, milk has become 50% of last year's cost and bread has become 200% of last year's cost. So the average is 125%, that is, a 25% increase. Again, if you use the current year as the base year, which is also legitimate, milk last year was 200% and bread was 50%.
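The price index arithmetic in this example can be sketched in Python. The prices are an assumption chosen to match the percentages quoted (milk halves and bread doubles, with both ending at the same price), and the function names are illustrative. The sketch also computes a geometric mean of the price relatives as a third way of averaging.

```python
import math

# Assumed prices consistent with the lesson: milk halves, bread doubles.
last_year = {"milk": 0.20, "bread": 0.05}
this_year = {"milk": 0.10, "bread": 0.10}


def price_relatives(base, other):
    """Each item's price in `other` as a multiple of its price in `base`."""
    return [other[item] / base[item] for item in base]


def arithmetic_index(base, other):
    """Simple average of the price relatives (base period = 1.0)."""
    relatives = price_relatives(base, other)
    return sum(relatives) / len(relatives)


def geometric_index(base, other):
    """Geometric mean of the price relatives (base period = 1.0)."""
    relatives = price_relatives(base, other)
    return math.prod(relatives) ** (1 / len(relatives))


this_vs_last = arithmetic_index(last_year, this_year)   # last year as base
last_vs_this = arithmetic_index(this_year, last_year)   # this year as base
geometric = geometric_index(last_year, this_year)

print(f"this year, with last year as base: {this_vs_last:.0%}")
print(f"last year, with this year as base: {last_vs_this:.0%}")
print(f"geometric mean of relatives: {geometric:.0%}")
```

With the arithmetic average, whichever year is chosen as the base makes the other year look more expensive. The geometric mean does not depend on the choice of base year in this way.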
So last year's average price was 125% when this year is the base year at 100%. This calculation will show that last year's prices were 25% higher; it will show that prices have come down. If you use the geometric mean, it will show no change. Depending on how you calculate the price index, one calculation may show an increase in price, another may show a decrease in price, and another may show no change, although the underlying prices are exactly the same in all three cases. Percentile figures are another interesting thing to examine. Take, for example, the ranks of students in a class of 300. The 99th percentile is the top three students, that is, 1% of 300 students. The 98th percentile is the next three students, and so on. There is a great difference between the 99th percentile and the 90th percentile, as you can see from this normal distribution, which is generally the case for a large set of figures. But as the ranks get closer in the middle, there is hardly any difference between the 40th and 60th percentiles. As you can see, at the extreme sides the percentiles are stretched. There is a big difference between the 99th and the 90th, whereas the difference between the 40th and the 60th is much smaller. A supplier says that he has increased his price by 20% because his cost has increased by 20%, and in the process he has nicely increased his absolute dollars. With percentiles, one may say, well, we will change your percentile from 40 to 60, whereas they would not offer a change from 90 to 99, because the 90 to 99 gap is much bigger. Similarly, by proportionately increasing price based on cost, one can nicely increase their actual dollars. These are all different ways to statisculate.

9. Half Relevant Data: Half relevant data. When people can't prove something directly, they use half relevant data as a proxy to prove it. Watch out. It is difficult to prove the effectiveness of a toothpaste, so companies show how their products destroy germs in laboratory conditions.
These two conditions are not the same. What works in the lab may not work inside our mouth. Also, often it is not clear what sort of bacteria they are killing in the lab. Are these the same type of bacteria that are there in our mouth? You would be surprised. It is also not clear what dosage they are using in the lab. Is it a full tube just to kill a few bacteria? None of this information is very clear. But even then they claim in their advertisements that their products are very effective in killing bacteria in our mouth. That is the use of half relevant data. A refrigerator company may advertise that it keeps produce 50% more fresh. You see it all the time, but more fresh compared to what? Another refrigerator? Storing food in a cool and dry shade? Or just leaving food out in summer heat and humidity? You will be surprised what this 50% is calculated against. A labor union says 85% of employees are unhappy with the management. They are using half relevant data. To make a strong case, the labor union often collects all possible complaints, including very trivial ones, such as a light that is not working and needs to be replaced, and so on, and adds them up to say 85% of employees are complaining against management, which is not true. 100% of a country's citizens complain about something over a period of time. But it is unjustifiable to conclude that 100% of citizens are unhappy with their government. Again, that would be using half relevant data. Sometimes half relevant data becomes extremely prominent. Spanish flu is an example. The 1918 H1N1 outbreak is called Spanish flu. Interestingly, the flu neither originated in Spain nor was more prevalent in Spain. It was only covered more by Spanish newspapers, because Spain was not directly engaged in World War One while other countries in Europe were waging war. Because it was in the Spanish newspapers, the flu is called Spanish flu. Governments in all countries are notorious users of half relevant data.
Here is an example. During the 1898 Spanish-American War, the death rate in the Navy was 9 per thousand. The death rate in New York at that time was 16 per thousand. Navy recruiters used these figures to claim it was safer to be in the Navy than to be out of it, and misled people into joining the Navy. Obviously, the death rate of old and sick folks in New York was being compared with that of young and strong people dying in the Navy. These are not comparable; these are not relevant at all. Often half relevant data influences decision-making. For example, 1952 is seen as the year of the worst polio epidemic in the US. Investigation found that due to more awareness, more cases were diagnosed and reported, and more people came forward due to federal financial aid. Moreover, that year had more children in the susceptible age group. All these factors contributed to a higher number of cases, but the death figures, which are a better measure, were not out of the ordinary. 1952 became infamous as the year of the polio epidemic in the US, but this was all based on half relevant data. In reality, it was just like any other year. Similarly, districts that record and report more crime are seen as more crime-prone only because they are reporting more compared to other districts that are not reporting. Politicians in all countries are the most prolific users of half relevant data. They rummage through all the data to find statistics that suit their agenda. Before they came to power, some districts had low income and some had high income. After they came to power, some districts also had high income and some had low income. Politicians will pick a district with low income before they came to power and a district with high income after, to show that their governance increased income. The opposition will do the opposite: they will pick a district with high income before and low income after, to show that the government caused income to drop. That is how people use half relevant data to make all their points.
But it is not always possible to look through all the data. Just be aware that with data presented in media, advertisements, and everywhere else where people are trying to influence our decisions, chances are they are using half relevant data to do that. We are all familiar with before-and-after pictures. Before-and-after pictures show how some pills helped somebody lose weight. But advertisers don't mention all the other things the person is doing to lose the weight. It is not just the pills that are causing the loss of weight, and that is a use of half relevant data. Losing weight is not completely irrelevant to their using the product, but they could even be losing weight without using the product if they are exercising, eating less, and having a healthy lifestyle. That is how half relevant data is used in advertisements to make a point and influence our decisions.

10. Check The Scale: Check the scale. Always carefully check the scale when looking at data presented in a chart or some kind of line diagram. One of the most common ways used to exaggerate or diminish a change in values is manipulating the scale. Just a ten percent increase in salary can be exaggerated dramatically by showing it as in the lower chart, or shown as stagnation as in the upper chart. The same data presented on a different scale looks different. Just by looking at the line chart, one may come to very different conclusions without getting into how much change is actually happening. Many advertisers do this all the time, playing with the scale to accentuate results, because they know half the people would be in a hurry and miss the scale, and the other half would be so influenced by the dramatic visual representation that they will ignore the scale. Visual representation is that powerful. The advertisers' purpose of accentuating the difference would be served.
Bars are broken in a bar chart to exaggerate a change, both up and down. Let's take this example. In 1995, there were 7,800 road accidents. In 2005, there were 7,720 road accidents. But when plotted in a broken bar chart that is broken at the 7,680 level, this small change of 80 accidents gets accentuated. So people looking at it say, "Oh my God, 2005 had so many fewer accidents." But it is just a small change. That is how broken charts and scales are used to manipulate decisions.

11. Comparison By Picture: Comparison by pictures. When comparing by picture, a lot of the time people intentionally or unintentionally exaggerate or diminish data. Crime records in the US in 1990 and 1998 were compared by picture, and the reduction looked dramatic. This is because even if the heights of the criminals in the picture are proportional to the figures, the volumes are not. In reality, crime came down from 15 million to 9 million. But it looks a lot lower, because the volume of the smaller criminal is much smaller. It looks like the crime reduction was far greater. Similarly, India's production of mango, when compared with China's by the height of the mango, looks much higher than it really is. China's was 4.5 million tons and India's was 18 million tons, but the volume of the bigger mango is much more than four times; it is more like 10 to 15 times. That is how one can give the impression that one figure is much higher by showing two pictures. Similarly, India's motor vehicle sales increased from 3.2 million in 2014 to 4.4 million in 2019, just a 37% increase. But this is shown by two cars, the second being 37% taller than the first. Just by looking at this, people would get the impression that car sales increased by much more. This trick is commonly used by advertisers and propagandists whose objective is to mislead viewers. It works through the impression left in people's minds by the two pictures of mangoes in the previous slide or the two pictures of criminals in the first slide.
And here are two pictures of cars. One then decides that the increase is much higher, even if it is very small.
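The two visual distortions from the last two lessons, the broken bar chart and the comparison by picture, can be quantified with a short Python sketch. It uses the lesson's figures; the function names are assumptions for illustration, and the picture calculation assumes the image is scaled uniformly in every dimension, so the exact multiple a viewer perceives depends on how the picture was actually drawn.

```python
def drawn_bar_ratio(a, b, axis_start):
    """Ratio of the two bar heights as drawn when the axis
    starts at axis_start instead of zero."""
    return (a - axis_start) / (b - axis_start)


def picture_ratios(a, b):
    """Return the true ratio a/b, plus the area and volume ratios
    a viewer sees if the picture is scaled by that ratio in every dimension."""
    ratio = a / b
    return ratio, ratio ** 2, ratio ** 3


# Road accidents: 7,800 in 1995 and 7,720 in 2005, bars broken at 7,680.
true_bar = 7800 / 7720
drawn_bar = drawn_bar_ratio(7800, 7720, 7680)

# Mango production: 18 million tons versus 4.5 million tons.
height, area, volume = picture_ratios(18, 4.5)

print(f"accidents, true ratio: {true_bar:.3f}, ratio as drawn: {drawn_bar:.1f}")
print(f"mango, height ratio: {height:.0f}, area ratio: {area:.0f}, volume ratio: {volume:.0f}")
```

In both cases the data is unchanged and only the drawing differs: the broken axis turns a difference of about one percent into bars that differ by a factor of three, and scaling a picture by height inflates the apparent size far beyond the underlying ratio.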