Transcripts
1. Introduction: Hi, my name is Gavin Lew, and I'm a managing director at Bold Insight, a global UX consultancy based in Chicago. I've been doing UX for about 25 years now, and for the last 10 years I've been teaching graduate-level classes at various universities. What I'd like to do in this series is bring the science back into UX. In this class, I want to answer the question: how many participants are enough? It's a sample size question, and it's really important, because participants cost time and money, and they're really the key piece of any research project that you do. So what I'd like to do is give you the perspectives on what people are saying, from rules of thumb all the way down to the science and rationale behind them, so you can justify the sample size that you choose, whether it's 5, 7, 21, or a lot more. Throughout this series, we're going to cover a number of topics that I teach during my classes. I hope you enjoy it.
2. Let's Define Research: So, to answer "how many participants are enough?", let's first start by defining user experience research, because it's really important for us to understand that different types of studies require different types of outputs, but also different types of participants and different numbers of participants. The first thing I want to talk about is that market research and user experience research are very complementary. When I think about how many participants are enough, it really depends on the type of study. So what I want to do is make sure all of us are on the same page as to the type of research being performed. Let's start at the highest level: the difference between user experience research and market research. They're very complementary, but they are different. In market research, you're trying to ask questions such as: what features do people want? What do they aspire to? What are their needs? And often, how much will they pay? Now, those questions are very important in building a product, but what you take from there is that you actually have to build something, and that's where UX comes in. That's where UX research comes in: you're taking research from a prototype, a design, and you're informing that design, making it better. So in some ways I like to think of it as: market research is very attitudinal in nature. It talks about the elements of the business case. But UX research takes it to the actual build. In some sense it's ensuring the brand delivery, it's ensuring the product is good, and therefore it's behaviors that matter. It's not what people aspire to; it's what they actually do with it when they actually touch it. So it's not about the aesthetics. It's not about opinions or acceptance. When we're talking about UX, it's about behavior, and those sample sizes are different from those that are aspirational. Fundamentally, when we think about UX, it's really two things.
It's measurement: how do we capture that experience? And how do we change it? Because the information that we gather will actually go into informing the design, shaping it into the design that you really want, and ultimately the behavior that you want. So what we're looking for is: how many participants do you need? In this class, what I'd like you to do as an exercise is consider two things: device A and device B. Imagine device A is your existing product, a next-gen product, whereas device B might be a competitive product, or maybe an older version of that product line. Essentially, the question is: is device A better than device B, and how many participants should you test to give you sufficient justification in the difference, so that you can actually act on it, change the design, make a business decision? That's what we're looking for when we think about the right sample size and how we capture experiences. What I'd like to do when I say "justify" is to bring the science back into user experience. What that means is, I'd like to tell you what I really wish I had learned in graduate school. Ten years after graduate school, I started to see all of these pieces come into play in terms of sample size: justifying it, how many is enough. I'd like to bring that into this course so you can get a sense of what I really wish someone had taught me in graduate school.
3. Why is the Number of Participants Important?: In this next section, I want to talk about formative testing for user experience research. That's the typical thing people picture when they think of a usability test. But before I get to that, I want everyone to think about why participants, and knowing how many are enough, are important. It's a really important question, because, to be honest, participants are expensive. They cost money to recruit, or they take your time to recruit. When they come in, you have to pay them compensation. It takes time to actually collect their data. Imagine if you're running six participants versus 100 participants, versus 1,000; that takes time, and it also takes time to analyze and synthesize all of that data. So when we think about sample size, it's often thought of as a cost, but it's also time: time to analyze, and a large time commitment. So feeling really comfortable, feeling very confident, in justifying the right number of participants, sufficient to get you an answer you can believe, is the key. Our goal here is to learn: what is the minimum number of participants that will make us feel confident that what we found, we can act on? Okay, so what you're going to learn in this section is: how many? Well, it depends. And what I'm going to do is tell you it depends on the question you're trying to answer, the type of data you need, and the type of research. So what I'd like to introduce is a decision tree. In the next couple of sections, you will go through the elements of the decision tree, which should give you a pretty good idea of how I break down the question into the type of study, which ultimately will give you the number of participants. The first part of this tree, when thinking about how many participants, is the first question: is it formative or summative?
What I like to think about for formative is, as I mentioned, the usability test. The picture I'm showing you here is of somebody who, you can tell, is working on a recipe in the kitchen. They're building a dish. Okay, that's formative: they're trying to refine the recipe itself. Now, that is different from summative, where it's not testing the recipe; it's actually having all the guests over for dinner. Now is where the rubber meets the road. It's time to actually perform. Whereas with formative we can learn, summative is the real deal. So in this example, think of formative as very useful for early stages. It's useful for quality; it tends to be diagnostic. What you're seeing in front of you is a wireframe. It doesn't have all the bells and whistles, it doesn't have all the graphic treatment, because we're testing the foundation, because we can make it better and improve upon it. That's formative: we can inform the design. A typical question somebody might ask: we want to improve the website so that users can more easily find what they're looking for. That's a classic formative study. Now, a summative study is slightly different. It's a little bit later in the development cycle. It could be verification; you could be benchmarking against other products. It's a little bit more quantitative. A question, like I showed you in the picture here, might be about a drug label: we want to be sure that the drug label will not increase the number of dispensing errors or lead to that type of confusion. It's a slightly different question, a little bit later in the life cycle, and that's a summative study. Those are the kinds of areas I need you to think about as we move into how many participants for which test.
4. Formative User Testing: In this section, I'd like to talk about formative usability testing. Usability testing for user experience research tends to be the very classic, bread-and-butter type of research we use to inform and iterate, to make designs better. It's very much important to the user-centered design methodology, and it's something most of us learn how to do when we first start our careers. When we think about formative testing, we don't run hundreds, or even 20 or 30 or 50 participants. We tend to run a very small number. The rationale around that is the concept of the magic five, and I'll get into it. What I want to do is talk about what participants you need, how many, and give you a little bit more of the science behind what goes on. In general, formative usability testing requires fewer participants than summative, and the reason is that you want to find, in some ways, the minimum number of participants that maximizes the number of problems found. So, ideally, you really want to look for the severe problems, fix them, test again, and continue this iterative design process to get to the truth, to improve your design. By doing that, it's often better to use a smaller sample size, which is where we get this concept of the magic five. It's been said you only need five participants to run these usability tests, and in some ways it started with Jakob Nielsen; he called it the magic five. The rationale here is that the perfect number is five participants; you don't need any more. That's what Jakob says. On the other hand, you've got others, whether it's Jared Spool or other researchers, who have looked into the question of whether five is enough. And this is where I want to leave you: I want you to think about all of what went into this, so that when you make your decision, you can justify it and you can live with the results. You can create your study, design it, run it, analyze it, and make things work.
Knowing whether you chose five or seven or 15 or 21, I want you to be able to justify why you chose it, why you chose the magic five. You can see a graph right here. The y-axis shows the percentage of usability problems found, and the x-axis shows how many people you actually test. The logic here is that if you test nobody, you're going to get zero insights. If you test one participant, you'll notice on the line that you get maybe 25-30% of the usability problems out there. So there it is: one participant finds about a third of the problems. But usually a third is not enough; you want to do a little bit more. If you test a second participant, that second participant will find some of the same problems the first did, and find new ones. When you go to a third participant, they'll continue to add to the number of problems, and they'll continue to find the problems the others did. So with three participants, you get really close to that 60-70% range of the number of problems that exist in that device or interface. Three people can find about 60% of them. As you get more and more participants, you find fewer and fewer novel problems, new problems. That's what they call diminishing returns. The concept is that after the fifth participant, you're starting to waste time and money, because you're getting the same findings over and over again. For those who have sat in a usability test, and I urge people to do that: if you're sitting in a test and you've gone through five or six participants in a single day, you will see this concept of diminishing returns. I've had people who are vice presidents sit in the back of the room, and they look at me and go: it's day one, and six out of seven people miserably failed that task, and they all said roughly the same thing. What's your bet that tomorrow, if we ran the exact same task on seven new participants, we'd see six of them fail again? That's the concept of diminishing returns.
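The diminishing-returns curve described above can be reproduced with the formula 1 − (1 − p)^n, where p is the chance that a single participant encounters any given problem. As a sketch, here it is in Python with p = 0.31, the value commonly associated with Nielsen's classic curve (that p is an assumption for illustration, not a number from this course):

```python
# Proportion of usability problems found after n participants, assuming
# each participant independently hits a given problem with probability p.
# p = 0.31 is the value commonly associated with Nielsen's curve.
def proportion_found(n: int, p: float = 0.31) -> float:
    return 1 - (1 - p) ** n

for n in range(6):
    print(f"{n} participants -> {proportion_found(n):.0%} of problems")
```

One participant lands around 30%, three around two-thirds, and five in the mid-80s, which is the shape the graph in this lesson shows.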
You keep adding more participants, but you're not getting more insight; we're still seeing the same problems. That's really the logic that was used by Jakob Nielsen, and if you take this logic to the extreme, with 15 participants you're going to find 100% of the key, big-hitter usability problems. That's why you don't necessarily need more than 15. Now, as I mentioned, there are other schools of thought, and other people have tried to investigate whether five or 15 are enough, and they came to the conclusion: no, it's not enough. You need much more than five to find what you're looking for. So what do you do? You've got one school of thought that says five, another school of thought that says much more. To break down the science of this, I'll mention diminishing returns again. The biggest bang for your buck is in running five: test five, see the results, make a change, test another five, and keep doing that. That's the best ROI you're going to get, based on the user-centered design process. But let's break down how Jakob Nielsen actually came up with this graph. I'm showing you the graph again, and believe it or not, it's not based on real data. It's not based on actual participants. It's a statistical approximation, a binomial probability, that basically asks two questions. The first is: what is the likelihood that you're going to find a usability problem? It's based on how many of the problems you actually want to find. Do you want to find a lot of them, say 99%? Or do you want to find most of them, say 85%? That's the first piece. The second question is: how hard are the problems to find? You can imagine there are problems that smack you right in the face. Those are the big problems, and that's what Jakob Nielsen typically thinks about. When he wrote his book, Usability Engineering, he was thinking about the big problems, the showstoppers. And if you want to find most of the showstoppers, he thought five participants was enough.
What I want to do is actually give you the science behind it, the binomial. What you see here, and I give all the credit in the world to Jeff Sauro, who runs the site measuringusability.com (I urge you to go to it; he has great articles and calculators), is one I pulled from his site: a sample size calculator for discovering problems in a usability test or user interface. Basically, there are only a couple of values you put in. You put in how many of the problems you want to find; in this example, 85%, so let's find most of the problems. The next question is: how hard is it for the problem to be found? That's a percentage, and in this case I put in 0.3, or roughly three out of 10 people are going to hit this problem. So I'm not setting it to where it's, oh my gosh, it's really hard to find, only one out of every 100 people; it's one out of every three people who are going to encounter this problem, and I want to find most of those problems. When you put those two variables into Sauro's calculator for sample size, you see that 5.3 people are needed to discover most (85%) of the problems, given that each problem happens to one out of every three people. That's the magic number five. It's that simple. So what if you wanted to find a different set of problems? Say I only wanted to find things that happen really often, so I change the first value: I want to find 99% of the problems, and the problems happen to one out of every two people. You can imagine: you might take a test and not encounter the problem, but if another person takes it, they uncover the problem, because it happens frequently. If you want to find 99% of the problems that one out of every two people will hit, you need 6.6, or seven, people. You see the numbers change. Or what if you wanted to say: you know what, this is going to be released to 40 million people.
I need to find lots of the little problems as well as the big ones. So let's make it harder to find a problem. I want to find 99% of the problems, but not just the ones that smack you in the face; maybe ones that one out of every five people will hit. So, a problem that 20% of your users are going to have. Now, that might not sound like a big problem, but imagine 40 million people are going to use it: one out of five of those 40 million are going to have this problem. It's a pretty big problem, even if it doesn't smack you in the face. In that case, you need 21 participants. So basically, you can construct your own graph. You can say: you know what, I want to find the main problems, and I'm only looking for the showstoppers; that might be five or seven. But if I really want to dig a little bit deeper, find the problems where four people run right through it but the fifth person has issues, now we need more participants. That's where you can build your confidence to understand how many participants you should have, whether it's five, or, you know what, I feel more comfortable with 21, because what I'm looking for here is 99% of the problems, including ones that only 20% of the time you're going to find. But then there are other considerations. Once you come up with your number, you justify it: 5, 7, whatever number. There are also other reasons to have more people in your sample. One example is that you have different users. If you have two user groups, you might consider doubling your sample. So if you decided on seven, now you might run 14. Or maybe you want to do a hybrid where, you know, there's a lot of overlap. You're testing a patient and a nurse: they are different, but there's a bit of overlap in their tasks. Maybe you test four and four, and I have eight. So instead of just getting five, you get eight.
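The calculator logic just described can be approximated by inverting that same binomial: solve 1 − (1 − p)^n ≥ goal for n, which gives n = ln(1 − goal) / ln(1 − p). Here's a sketch; the exact tool on Sauro's site may differ in details, but this reproduces the numbers in the lesson:

```python
import math

def discovery_sample_size(goal: float, p: float) -> float:
    """Participants needed so that a problem hitting a fraction p of
    users is seen at least once with probability `goal`:
    n = ln(1 - goal) / ln(1 - p)."""
    return math.log(1 - goal) / math.log(1 - p)

# Find 85% of the problems that hit 3 in 10 users: ~5.3, the magic five
print(round(discovery_sample_size(0.85, 0.30), 1))   # -> 5.3
# Find 99% of the problems that hit 1 in 5 users: 21 participants
print(math.ceil(discovery_sample_size(0.99, 0.20)))  # -> 21
```

Raising the goal or lowering p (rarer problems) pushes n up, which is exactly the trade-off the lesson walks through.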
Okay, so that's one consideration. Another one might be, as I'm showing you here with the KidsHealth site: you've got a parent section, a kids section, and a teen section. Now we have three user groups. You may want to consider: if I started with five as my target, or seven, I multiply it by three, and now I'm recruiting 15 participants, five in each group. That's how you build and justify the number of participants you need. The other reason you might want more participants is complexity. In the example here, you're looking at a complex modeling application. You might have a lot of tasks, or really complex ones, and one user in your session can't do all 20 tasks. So maybe I'm going to have a sample where half of my group does 10 of the tasks and the other half does the other 10 tasks. In this case, if I picked seven participants, I need 14, so I can get coverage across all the tasks at the sample size I decided to justify. So those are the ways you set yourself on a number: you justify it, get the right confidence, and then you consider, do I have enough participants because of user groups or complexity? But let's also be honest: there's a smell check. I have had situations where I look at somebody, a chief marketing officer who comes from a market research background, not a user experience research background, and he says: hey, I'm not going to believe it if you only have five participants. He's saying, I need 50 participants. And believe me, if I sat that person down and had him watch 50 participants, by participant 25, 26, 27, he would feel the diminishing returns. But ultimately that person isn't going to sit through it, and you have to convince that stakeholder to make a change. In that case, it's a smell check: maybe five aren't enough, even though you could justify it, and you choose 20 instead, because someone will act on that many participants. It provides comfort and confidence that the people you happened to find weren't outliers. That's the smell check.
So as you think about a formative test, this is really about the question of: I'm doing a usability test; do I use the magic number five, or do I need a little bit more for various reasons, whether it's the difficulty of the problems I'm looking for, or I just need more to satisfy the smell check? As you think about those types of things, you're going to walk through it and be able to justify it. So what I've just described, if I take you back to the initial course exercise, is a formative study: the usability test where you only evaluate one thing, device A. It's not device A versus device B; it's only device A. And to the question, how many participants do I need to test if I'm just testing device A: I hope through this exercise you'll be able to justify your recommendation. Going through the decision tree, it's going to be based on: is it formative? Okay, it's formative. What kind of problems do I want to see? How many people? What's the complexity? How many user groups? And ultimately, what's my smell check? That's how you define a formative sample size.
5. Summative Research: In the next couple of sections, I want to talk about higher sample sizes, more summative research. Like I mentioned before, it's the time when it's not formative: we're not refining the recipe, we're actually ready for the formal dinner. We're going to launch this product. Before I get into it, I want to give everyone an example of this concept of summative research, or shall we say, precision. So, in the course exercise again, you think of device A versus device B. We're still talking about one device, and we're trying to understand how many participants you need. When I think about this, we have to ask ourselves how many participants I need, knowing the sample size is going to be a little bit higher. The first question I'm going to ask is: if you're planning the dinner party, if you're planning the launch of the product, do you really want to know the answer definitively, or do you just want direction? I like to ask that question, and it really comes up because oftentimes people want to do the research and they want an answer. But how much do they really want to know? Is it definitive, or would they just like to know direction? Because if it's directional, it's like putting your toe in the water to see how cold it is. In that case, I would recommend using the same logic that I've used with formative: 5, 7, 21 participants, the smell check. It's the notion of the directional question, putting your toe in the water to see if it's going to be a problem or we're going to be okay. That really works with the formative logic, using that rationale. But if someone looks in and says, you know what, no, I really need to know definitively, now we're getting into a world where someone might say: this decision is going to affect 40 million people; I need a little bit more, with a higher degree of precision. That's when you need to know the truth, and that's when higher sample sizes happen.
Okay, so what I want to do is get you to think about this notion of a definitive answer. This is my term, and I'm using it again; I told you, these are the things I wish people had told me in grad school. I'm going to call it precision. You have a target, and you want to know if what you found in the study, the score that you found, is generalizable to what would happen in the real population. We're not going to test everybody in the world, so I'm going to take a small sample, test it, and get a result. Is that result generalizable? To do that, I want to think about the difference between A and B. You've got device A and device B, and we're getting data on A and B. Are they different? But let's think about precision. The easiest way I can do that for you is to give you this notion of: what is a score? A single score. In this world, we like to have a mean. Everyone may take this test, but we always like to think of the average. Think of an approval rating: lots of people go into a presidential approval rating, and you get one number. But is that number precise? Is it really generalizable to the population? One of the things we realize is that if we take all the data from a study, whether it's 20 participants or hundreds or thousands, not everyone scores the same. So you can put those scores on a curve, as I show here. What's interesting about this is that while you get an average, the rest of the curve describes the variation: the people that are a little bit higher than that, and the people that are a little bit lower. What's really nice about this concept of a normal curve in statistics is that if I take all this data, I get an average, and I can start to understand the characteristics of that data set. Now I'm going to give you an example, just to give everyone a sense of what it means to collect data that feels like a number or a score.
So here's a simple question. In a moment, I'm going to show you a picture of people, and all you need to answer is a single question: is a man wearing a red shirt? I'm going to show you the picture, and all you have to do is decide yes or no. You're going to see a number go by, timing you: 1, 2, 3. All I need you to do mentally is say: is the man wearing a red shirt? You think yes or no, and you note the time it took you to get to that answer. Okay, have you got it? So that's a yes-or-no question. There is a correct and an incorrect; it's a measure of accuracy, or effectiveness. Let's say if we take all the participants we watched over a given month, maybe it's 77% correct. You also have time, self-reported: you tell me, it took me three seconds and I said yes; some people say, it took me four seconds and I said no. You have two data points from that simple question. And you can imagine collecting lots of other measures, on satisfaction, other variables, but in that simple question of "is a man wearing a red shirt?", we collected two. Okay, now I'm going to give you a slightly harder task. This is a hospital setting. The physician is concerned about the patient's MCH, as it is now higher than normal. The question is simply: on January 27, 2001, was the patient's MCH value higher than normal? You decide yes or no, and you note the time it took. Go. So, how many people forgot the question? These are some of the challenges we have. I've blown up the table for you, and you can look at it: I slide over to the January column, then I look for January 27, 2001. I go all the way down to the MCH value, and the MCH value is 30.8, and it's white. I look at my legend: white is normal. So was the patient's MCH value higher than normal? The answer's no. Okay, and I've blown this up so you can take a look at it, but that's the thought process.
You find the day, go across and look, then you check the legend, and probably you go back and forth. That's how people come up with the answer. One of the challenges: some people forget the question. You have measures of accuracy and time, but what do you do if someone forgot the answer, which can happen? Do you take the time of someone who forgot the answer? What I do is ask a simple question: if someone got the answer right, yes, they could have guessed, but for those who got the answer right, time is a measurement of them going down the column, looking across to MCH, checking the legend, going back to confirm. It's a measure of their efficiency. If someone was incorrect, you have no idea what they were doing, so their time doesn't really reflect anything, because you don't know what they did. Whereas if they got the correct answer, you're more likely to believe the time reflects the efficiency of the process they used. So remember, you're collecting two measures: accuracy and time. Accuracy is really important: if a lot of people fail, that might be more important than the fact that the ones who succeeded were really fast. Okay, so with those two measures, the question comes up: I've got a number; how many participants do I need to feel confident that I can generalize that number, so that it's real, that it's precise? In the next section, we'll actually get into sample size.
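The scoring rule just described, keep accuracy for every trial but only treat time as an efficiency measure on correct answers, can be sketched in a few lines. The trial data here is made up for illustration; it is not from the course:

```python
# Hypothetical trial results: whether the answer was correct, and duration.
trials = [
    {"correct": True,  "seconds": 3.0},
    {"correct": False, "seconds": 9.4},  # incorrect: time tells us nothing
    {"correct": True,  "seconds": 4.0},
    {"correct": True,  "seconds": 3.5},
]

# Accuracy uses every trial; time uses only the correct ones.
accuracy = sum(t["correct"] for t in trials) / len(trials)
correct_times = [t["seconds"] for t in trials if t["correct"]]
mean_time = sum(correct_times) / len(correct_times)

print(f"accuracy = {accuracy:.0%}, mean time (correct only) = {mean_time}s")
```

Here the incorrect trial still counts against accuracy (3 of 4, or 75%), but its 9.4 seconds is excluded from the efficiency average.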
6. How Many Participants Should I Test?: So I talked about this concept of doing a study, getting a number, and feeling that that number is precise, that it's generalizable. How many people do you need if you want to do that? When you think of getting a number and collecting participants, a lot of people think of statistics, and when people think of stats, they think of this: a picture in USA Today, or some factoid. It's some number that's put into a really nice graphic, and it's polished. That's what people think of when we think of stats. What I want you to think about is: what is a sample size that will make you feel comfortable that we can generalize the finding to the population? So imagine a world, and you can see this picture, of a lot of Ps and Fs. Ps are passes and Fs are fails. For every person, you get a P or an F, and that's the population. Based on this picture, believe it or not, there are 60% Ps and 40% Fs. That's reality at the population level. How do I know that what I grab represents the population? I stick my hand in, pull a small sample out, and look: will I find six passes for every four fails? That's this concept of sample size. How many do I need? If I just grab a fingertip amount, or a full handful, how many do I need to capture to feel good about the number? Okay. In order to feel confident, there's a concept I want you to think of, and my analogy is based on darts. It's the likelihood that you found results by chance, by accident, just because you accidentally grabbed the wrong amount. That's confidence, and I'd like you to think about it as darts. What I've done is show a picture of all of the possible times you could do these grabs, these studies. I'd like you to think of the darts as the chance you actually found something by chance alone. There are 20 darts in the picture, and one of the darts fell outside this confidence area; the box represents, say, 95% confidence.
If it's 95% confidence, that means if I were to do this study 20 times, only one time out of 20 would I accidentally find the result by chance. You see that dart sitting outside? That's the concept. If I want 99% confidence, that means 100 darts, and only one found by accident. That's confidence. The other thing you need to determine sample size, if you want to be precise, is your level of acceptable sampling error, a level of precision. Some people think of that as plus or minus 5%. You can imagine the presidential approval rating: plus or minus 3%, plus or minus 5%. It's that concept, and you see it a lot in those factoids. The last thing you need is the number of people in the population you're looking at. Are we talking about the whole world, the US population, or a subset? You need to know how many. Okay, so here's an example. There are lots of sample size calculators out there, and you only need three things. You need to know the confidence level: is it 95%, 19 out of 20 darts, or 99%? You need to know the sampling error: plus or minus three, or plus or minus five, so we're going to get a result plus or minus some number. And you need to know the population. That will tell you the number of people you need. So here's an example: a presidential election poll. Let's say I want 95% confidence, so the likelihood I found this by chance alone is 5%, one in 20. My confidence level is 95. The score that I receive is plus or minus three: pretty accurate, pretty precise. And the population is the voting population of the US, about 100 million. You put those numbers in, and the answer is you need 1,067 participants with those parameters. So the question becomes: is that a lot of people? Is 1,067 a lot? Well, what happens if I broaden the error? Instead of getting a score, say a 70% approval rating, plus or minus three, what if I made it a little broader, plus or minus five? Does that make the sample size go up or down? Do I need more or fewer people? What if I made it plus or minus seven?
Okay, so what I've done here is taken the same parameters, everything the same, except I increased the plus or minus 3 to plus or minus 5. And the answer goes from 1,067 participants to 384. And if you go to plus or minus 7, it drops to 196. Okay, that's the sample size you need to be confident that the score you have is generalizable to the population, given those parameters. Now let me do one more thing. Let's change the population. Instead of the whole voting population, let's say it's a specific type of user, and make it only 1,000,000 people. Do we need more or fewer people? This is a challenge, because humans are not really good at large numbers. Okay, so what I've done is run the numbers at a 95% confidence level, an acceptable sampling error of plus or minus 3, plus or minus 5, and plus or minus 7, and a population of 1,000,000. And what I found is that if you run those numbers, they don't really change: it's 1,066, 384, and 196, almost exactly alike. Okay, so this is kind of a strange concept, so I put it in a table. With 100 million versus 1 million, the numbers don't change. How does it change if I go to 100,000, or the population is 10,000, or 1,000, or 100? What you're going to find is that if you want to generalize a score, high sample sizes are needed unless the population is very small. As you can see here, if the population is 10,000, you still need about 960 people, which isn't that much different from the roughly 1,000 you need at 100,000. It's only with very small populations that the required sample size drops. So an example: if there were 100 people taking this class, and I wanted to give everyone a test on my material, and I wanted to know that the score I got reflected all 100 people, what if I only took five people?
But if I took 92 people, how confident are you that I probably didn't find the score by chance, plus or minus 3%? I would need 92. If I wanted plus or minus 5, only 80; and plus or minus 7, 66. It kind of makes sense when the population is small. But when the population gets really big, a million, 10 million, 100 million, it doesn't matter: you need about 1,067, 384, or 196. And that's a concept that holds true. I've talked to data scientists, and this is what they do: given a population, you compute the number of people you need so the result you find can be generalized. What this also tells you is that 2,000 people is too much. That's kind of the key concept here: you can find the right number of people to recruit, pay, and analyze, knowing you can have confidence that the result generalizes and you didn't find it by chance. And that's the concept. So going back to the decision tree: if it's a subset of users and you know you need a definitive answer, it's precision testing. If you're only testing one device, here's the table, and that's a lot of participants, from 384 up past 1,000. And the answer is yes, that's what you need if you want a single score to reflect the population. Luckily, in user experience (UX) studies we usually don't need 384, we don't need 1,000 people, and that's what I'll get into in the next section.
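The numbers quoted above (roughly 1,067 / 384 / 196 for huge populations, 92 / 80 / 66 for a class of 100) come from a standard survey formula. As a sketch of how the sample size calculators mentioned here might compute them (this is the classic Cochran formula with a finite population correction, a standard-statistics assumption on my part rather than anything shown in the course), the whole thing fits in a few lines:

```python
import math

def sample_size(confidence_z, margin, population, p=0.5):
    """Cochran's formula with finite population correction.

    confidence_z: z value for the confidence level (1.96 for 95%, 2.576 for 99%)
    margin:       acceptable error, e.g. 0.03 for plus-or-minus 3%
    population:   size of the group you want to generalize to
    p:            assumed proportion; 0.5 is the most conservative choice
    """
    n0 = (confidence_z ** 2) * p * (1 - p) / (margin ** 2)  # infinite population
    n = n0 / (1 + (n0 - 1) / population)                    # finite correction
    return math.ceil(n)

# 95% confidence, ~100 million voters
for margin in (0.03, 0.05, 0.07):
    print(margin, sample_size(1.96, margin, 100_000_000))
# roughly 1,067 / 384 / 196 (rounding up vs. off shifts these by one)

# A class of 100 people
for margin in (0.03, 0.05, 0.07):
    print(margin, sample_size(1.96, margin, 100))
# roughly 92 / 80 / 66-67, matching the lecture's small-population point
```

Note how the finite population correction is what makes the class-of-100 numbers small while leaving the 1-million and 100-million cases almost identical, which is exactly the "strange concept" in the table.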
7. Is A Better Than B? Hypothesis Testing: So this next section is about sample size, but for A versus B, which is hypothesis testing. This is where UX really can get into statistics and understand sample size for comparing A versus B. Going back to our original exercise: you have device A, a next-gen product, and device B, a competitor or maybe an older version, shall we say. The question is: how many people do I need to test to know whether A is really better than B? And I want you to justify the answer. So we talked about this notion of a definitive answer, and I talked about precision of a single score. Well, that's device A. Let's say device A got an 85% success rate. Is it really 85? You need a lot of people to know that that precise number is real, plus or minus some percentage, and you typically see that in surveys. But when we think about A versus B, we move away from precision of a number to precision of the difference. That's a slightly different perspective, and I want to give you an analogy. Let's say device A gets 85% and device B gets 60%. It's not whether the 85 or the 60 is precise; it's whether the difference is real. If you were to rerun the study, maybe it's not 85 and 60, maybe it's 90 and 40, but there still is a difference, and that difference is real: A is better than B. That notion, that concept, reduces sample size. You don't need the precision of a single score, which is why the sample sizes you see in UX are not in the hundreds, let alone the thousands. So what I want to ask is not "is a score generalizable?" but "is the difference generalizable? Is A always going to be better than B?" Okay, so in this case you've got two notions, you've got two means, and the hypothesis tests use concepts of statistics. In this case it's called inferential stats. It's comparing A versus B, maybe with a t-test or an ANOVA,
tests you may have heard of in statistics. And the sample size you need is derived from a concept called power. Power means I have a good chance of detecting a difference, of finding it statistically significant, if the difference is really there. If there really is a difference, I'll find it. How many people do I need for that? That's power. So in the example I have here, you can see two packages of a product: the old one is really big, and the new one is condensed and compact, half the size. A question could be: is A better than B? Is new better than old? Or another example, let's go to basketball. Let's roll back a couple of years to 2017. The Warriors are going against the Cavs. Who's going to win? In any given game, one might win, another might lose. But if you were to watch this over time, if you were to run this test 100 times, how many times would Golden State win versus the Cavs? So think of the game as a study. In this case the visitors scored 54 and the home team scored 49. It's not the exact number each one scored; it's that one versus the other won or lost. Which is more generalizable? And so the question is, let's say the Warriors won, and go back to my dart example from the last section. If you have a dart, and the likelihood that you accidentally found this by chance is 5%, that's like running the study 20 times and only once has it happened by accident. In general, the better team wins. In general, if A is better than B, A wins. And that's the concept. Okay, so it's these two populations. Again, if you look at my green Ps and red Fs, let's say one group has a whole bunch of failures and the other has a whole bunch of successes. If I stick my hand into one, do the study, stick my hand into the other, and look at the difference, do I use my fingertips, a full grab, or pour out the whole bucket? How many people do I need? That's the purpose of this section.
It's the confidence that you grabbed the right amount. So in hypothesis testing, the power calculation we use for sample size needs, just like precision testing, a confidence level, 95 or 99, and an acceptable sampling error, but with one wrinkle, and that's called effect size. Effect size is important because it tells you how big the difference is. And thinking of darts, we still use 19 of 20 darts, the concept of an alpha of .05. What's interesting is I was always taught: hey, you pick an alpha of .05, it's one out of 20 darts by chance, and if you get a significant result you can publish, you can get your dissertation. But in medicine they might want a higher bar; they may want only a 1% chance that you found it by chance. Whereas a business may not need publishable results; they might say, you know what, there's a 10% chance I found it by chance alone, and that's acceptable. That's how a business may look at it. So you know those factors, you have your darts. What is effect size? Effect size is how big the difference, the discrepancy, is. Okay, so here's my example. We've got the faces of two people, and you decide which one is more attractive. What I'm showing you here is a small effect size; on my next slide, who's more attractive between these two? That's a big effect size, a distinction that leaps out at you. Okay, another example: if I hid a quarter on a football field and you have five minutes to find it, and if you find it in five minutes I'll give you a million dollars, how many people do you want to help you in the search? That's your sample size. Versus if I hide a hockey puck on the football field and you have five minutes: you might not need as many people, because the effect size is bigger, and when the effect size is bigger you don't need as many people. Okay, so here's what I wish I had learned in grad school, because this is a table people literally showed me. It is a power calculation table, and I'll describe it.
There are three sections: .01, .05, .10. These are your darts. The .01 is for medical, .05 is to publish, and .10 could be used for business. Along the rows you've got the type of test, and some types of stats need more people than others. Let's just pick the first one, mean differences; let's call that a t-test, okay? Then there are three other columns: small, medium, and large. That's effect size. So you pick your darts, it's .05. You pick your test, it's the mean difference, the first row, and I've boxed it. If you have a small effect size, and we're talking the quarter on the football field, a very small difference, you need 393 people. If it's medium, you need 64, and if it's large, 26. Let me cut to the chase on grad school. When I asked how many people I should run for my dissertation, my professors said, "Run a power calculation," and it's the table you're looking at right now. And I said, well, how do I know whether it's small, medium, or large? Because, believe me, running 26 participants in my study versus 393, I'd be testing all summer for one study, and I'd need to fund them all. When I asked, they said, "Oh, just do a power calculation, and when you do your studies you'll know whether it's small, medium, or large." Like, what do you need to know?! In the end, I would ask my graduate program director and the experimental psychologists: hey, how many people should I run? They said 16 or 20. I looked at this table, and I couldn't find 16, I couldn't find 20. In fact, I found 26, I found 64, I found 393. And I just sat there and said, I don't understand. All the studies I'm reading, and I'm trying to be like them to get my dissertation, only ran 16 to 20 people, but none of the power table numbers matched. What was going on? The reality is this slide, and that is: usability testing typically doesn't need statistics. Your objectives will define it. Sample size is driven by many factors, and they'll tell you to run a power analysis.
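For the curious, the 393 / 64 / 26 column of that table can be reproduced with a standard power-analysis formula. This sketch uses the common normal approximation for a two-sample t-test at alpha = .05 two-tailed with 80% power; those settings are my assumption about how the classic table (Cohen's) was built, not something stated in the lecture:

```python
import math

# Standard normal quantiles (from any z table)
Z_ALPHA_05_TWO_TAILED = 1.96  # 19-of-20 darts
Z_POWER_80 = 0.8416           # 80% chance of detecting a real difference

def n_per_group(effect_size, z_alpha=Z_ALPHA_05_TWO_TAILED, z_power=Z_POWER_80):
    """Approximate participants per group for a two-sample t-test.

    effect_size is Cohen's d: 0.2 small, 0.5 medium, 0.8 large.
    Normal approximation; exact t-based answers run one or two higher.
    """
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(label, n_per_group(d))
# small 393, medium 63, large 25 -- within a participant or two of the
# 393 / 64 / 26 in the power table
```

Notice how the effect size sits in the denominator and gets squared: halving the effect you want to detect roughly quadruples the sample, which is the quarter-versus-hockey-puck story in numbers.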
But let's talk about what happens; let's anticipate results. If you run a study and you don't find a difference, maybe A really isn't better than B. Or maybe you didn't have enough participants, so they'll say, "Oh, Gavin, you didn't have the power; run more." Okay, but what if I ran 393 and I could have found it with 20? I lose my summer. I lose lots of time. So here's the dirty secret, and that's the last point I have here: what happens if you find significance with 20 people? The answer is, it's real. And that's why all the studies published ahead of me had 16 to 20 people. If I do a study and run 16 or 20 and find that A is significantly better than B, it's real, it's generalizable, except for the dart, and the dart is one out of 20, a 5% chance I found it by accident. Whether I ran 20 or 393, the result would be significant. That's what they were trying to tell me: be pragmatic. Run the study with a smaller sample, and if you find significance, use that as your baseline. And that's how a lot of statistics came out in the social sciences, and it's moved on into human-computer interaction, into psychology, into user research and user experience. So if you're looking at a difference, you really don't need hundreds of participants. Test with 16 or 20, and if you find the result, it's real; you can be confident it isn't chance.
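One way to convince yourself of this "dirty secret" is a simulation (a sketch of mine, not from the lecture): run many A-versus-B studies where A and B are actually identical, and count how often a "significant" result pops out anyway. The false alarm rate sits near the alpha you chose whether each study has 20 people per group or 393; the small sample only costs you power, not honesty. The pass rates and seed below are illustrative assumptions.

```python
import math
import random

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value for comparing two pass rates (pooled z-test)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # all passes or all fails: no evidence of a difference
    z = (p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def false_alarm_rate(n_per_group, trials, rng, true_rate=0.6, alpha=0.05):
    """A/A studies: both 'devices' really have the same 60% success rate."""
    hits = 0
    for _ in range(trials):
        x1 = sum(rng.random() < true_rate for _ in range(n_per_group))
        x2 = sum(rng.random() < true_rate for _ in range(n_per_group))
        if two_prop_p_value(x1, n_per_group, x2, n_per_group) < alpha:
            hits += 1
    return hits / trials

rng = random.Random(7)
for n in (20, 393):
    print(n, false_alarm_rate(n, 2000, rng))
# both hover around 0.05 (small-n rates can run a bit low because
# pass/fail counts are coarse)
```

That is the statistical content of "except for the dart": the one-in-twenty accident rate is a property of the alpha you picked, not of the sample size.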
8. Testing Exercise: Now in this last section, I want you to feel a difference between A and B. I want you to get a sense of why a smaller sample size can be enough. So it's really a demonstration. Let's go back to our task of A versus B. In this case I'm going to give you a task, and it's a hospital setting. You're assessing different prompts for interacting with an interface. Let's say it's an order interface, where the physician puts in all of these orders for a patient. Now, let's say in this example we're calling this design A. The patient has declined further treatment. So the physician asks you to go into the system, look at all the orders they put in, and check, check, check, check, check, check them all, and press the cancel button. I'm going to show you a prompt, and all you need to do is decide yes or no. Now let's get started. So you've gone into the system. You check, check, check, check, check. You press the cancel button, and this is the prompt. The prompt says: "Do not cancel all patient orders. Yes / No?" And the answer should be no. Okay. So as you go through this, a number of you may have taken maybe four seconds, maybe six seconds; some may have taken three. That's kind of an interesting distribution. As I do this with all my classes, I typically see about three to five seconds. Now let's do device B. Same hospital setting, same task. You're assessing different prompts. The patient has declined further treatment. The physician asks you to go into the system and cancel all the orders. You check, check, check, check, check. You select the orders and press cancel. You decide yes or no, and you note the time. Let's begin. I don't even need to go to seven seconds in the work that I do. The prompt is: "Cancel all patient orders. Yes / No?" What I find if I take this group: most people get that answer in one to three seconds, versus the three to six seconds for A. We've got two different distributions.
That's a decent effect size. And I can tell you, whether I give task B first or task A first, I switch the order, I do it in Germany, I do it in Asia, I do it in the U.S. in my classes: the prompt that says "Cancel all patient orders. Yes / No?" is always better than the "Do not cancel" version. And that's a feeling of "is the result generalizable?" It wasn't whether it's two seconds versus six, or 1.3 versus seven. You feel the difference, and it generalizes: it's always different, it's always better. Okay? And so in this course exercise, what I hope you have is this: if you've got device A and device B, you can justify how many participants you need to test. And in this case, maybe it's only 16 or 20, because if I find the answer, and you've already felt it, it's generalizable, except for chance. And that's the dart. Okay, so when you go back to the full decision tree, you should have a sense of sample size. For a formative test, it's: what kind of problems are we looking to find, and how hard will they be to find? Then you start to look at the magic five. You think about the number of user groups and how complex the design is; it's a smell check. If you want a summative study, you've gone from tasting the recipe to actually making the full dinner. You need to know whether it's directional, where I just want to put my toe in the water, and in that case it's going to be very similar to the formative smell check. But if you need to know definitively, then you ask yourself: is it a question of precision, where I need to get that score just right? Then you use the table we've seen that computes 384 participants, or 196, or 1,067. Or is it a hypothesis test, where I'm not looking at just device A's score, but A versus B? In that case you do have a power calculation, but I'm also urging you to test with 16 or 20 and see if you have a significant difference, if the effect size seems pretty strong.
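If you collect the timings from an exercise like this, you can put a number on how big the difference "feels" by computing Cohen's d, the effect size the power table is indexed by. Here is a small sketch with made-up response times in the spirit of the demo (design A around 3-6 seconds, design B around 1-3 seconds; the data are hypothetical, not the course's actual measurements):

```python
import math
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized mean difference: (mean_a - mean_b) / pooled SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Hypothetical response times in seconds, echoing the class demo
times_a = [4, 5, 3, 6, 4, 5]  # "Do not cancel..." prompt
times_b = [2, 1, 3, 2, 1, 2]  # "Cancel all patient orders" prompt

d = cohens_d(times_a, times_b)
print(f"Cohen's d = {d:.2f}")  # well above 0.8, i.e. a large effect
```

A d this far past the "large" threshold of 0.8 is exactly the situation where the power table says a 20-something sample is plenty, which is why the difference is something you can feel in a classroom.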
9. Conclusion: During this course, I gave an exercise and essentially asked the question: is device A better than device B? How many participants would you have in your study, and please justify it. Now, I gave some examples, because device A versus B could be anything from a phone, or it could be, as I show here, a piece of paper. A could be an old piece of paper that's twice as long, in terms of instructions and content, as B. That could be the A versus B challenge. So what I'd love for everyone to do is go to the project gallery and post online your answers to this question. I'd love for you to describe your device A versus device B challenge. Now, it might not be a device; it may be paper, or a tool, or some type of interface, but I'd love for you to describe it. If you can use pictures, that would be great, because that always gives a nice perspective. But tell me how many participants you suggest using, and justify it. Obviously, it would be great if you used the flow chart, and you knew it was summative, you needed a definitive answer, and then you wanted this notion of hypothesis testing. But if you were really only doing formative work, only device A, and you're trying to make it better, I'd still like to see how many participants you choose and how you justified it, using some of the techniques that you may have learned in the course. So that's the ask for the project gallery assignment: post it online, and I would love to see yours. So I hope you enjoyed this lecture and this Skillshare class. What I wanted to do was give you a little bit of the science that goes behind some of the things that we do in UX research. If you have any questions, feel free to contact me at boldinsight.com, and I can answer your questions. I also hope to put together a few more of these topics so you can learn a little bit more of the practical side of UX research. Thank you very much.