Transcripts
1. Promo: Welcome to the Experimentation and A/B Testing for Product Managers course, where you will learn the fundamentals of experimentation that every product manager should know. My name is Rommil and I have 15 years of experience in experimentation. I've held leadership roles at Loblaw Digital, Ritual, 500px, theScore, as well as Autodesk and Bell Canada. I have an engineering degree from McGill and an MBA. I've established three high-tempo experimentation programs, as well as saved 30% in customer acquisition costs. I've led growth that has resulted in the acquisition of startups. And I've helped launch and expand an e-commerce division globally, all of this on the back of experimentation. After completing this course, you will learn why experimentation is an essential skill for all product managers that want to build impactful products. You will also learn how to design effective experiments, how to create an experimentation strategy, and how to analyze results. I will teach you important frameworks related to experimentation strategy, statistical concepts, and experimentation vocabulary. I will also teach you what a good experiment design is, how to speed up experiments, and how to document and communicate experiments. The ideal student for this course is an aspiring or current PM, or an entrepreneur looking to leverage the power of experimentation to make evidence-based decisions and deliver results. You should have some comfort with math to get the most out of this course and have a genuine interest in experimentation. Feel free to look at the course description, and I look forward to teaching you soon.
2. Intro: Hello, and welcome to a product manager's guide to experimentation: A/B testing strategy and analyzing results. In this course, you will learn why product managers must experiment and when they should run experiments; ten important terms in experimentation; how to run an effective experiment; how to develop an experimentation roadmap; how to build a team that supports experimentation; how to nurture an experimentation culture; how to effectively communicate the learnings from experimentation; and how to document experiments. Then we will go into an intermediate to advanced topic, analyzing results, where we will cover frequentist, Bayesian, and sequential statistics at a very high level. And then I will share some interesting resources for further reading. Before we dive right in, a little bit about myself. My name is Rommil Santiago. I've been in the digital space for about 20 years. I've held leadership roles in marketing, product, and growth at a variety of companies, including those in the startup, telecom, engineering, e-commerce, and other spaces. I hold an engineering degree and an MBA. I am also the editor of ExperimentNation.com. For those who are interested in learning more about experimentation, I welcome you to visit.
3. Why PMs must Experiment: So why do product managers have to experiment? I like to explain this with a story. Imagine one day you are picked up and dropped into a deep, dark forest. And someone says to you, hey, you over there: where this red arrow is, there is a building. And that building is burning, and in that burning building are children. And those children are holding kittens, and your job is to save them. So your first instinct might be to run as fast as you can towards that arrow, but I guarantee you that you will trip, you'll run into things, you'll hurt your face, and you'll probably be very tired at the end of it. At the end of the day, you may succeed in saving the children and the kittens, but I assure you it won't be in the most efficient way. Now imagine a scenario where you have a little bit of light. You're able to see where you're going, you're able to avoid obstacles, and you're able to avoid hurting yourself. It's a much easier path, right? You may still hurt yourself. You may still fall down. But it will definitely be an easier journey with less risk, less injury, and probably more success. It should come as no surprise that in this scenario the light is a metaphor for experimentation. Experimentation lights the way; it reduces uncertainty and shows you the best path to take.
4. Learn, Earn, and Avoid Getting Burned: The way that experimentation does this is that it allows product managers to do three things. It allows them to learn, to earn, and to avoid getting burned. I'll go into each one of these. The way experimentation helps product managers learn is by helping them reduce uncertainty and understand their problem space. Experimentation helps uncover causality between two variables, as well as test theories that they may have. Obviously, product managers can leverage experimentation to help them earn. What I mean by that is they can use experimentation to help them generate revenue, optimize the performance of a feature, or simply measure the impact of a feature they're rolling out. To illustrate the earning potential of experimentation, on the right is a graph of the recent performance of companies that have embraced experimentation, including Facebook, Google, Etsy, and Microsoft, as compared to other companies who have not. As you can see, those that experiment performed better, at least on the stock market. Now of course, this is just correlation and not causation, but it's still an interesting thing to see. Now, one of the most interesting benefits of experimentation, at least to product managers, is that experimentation helps product managers avoid getting, quote-unquote, burned. Having experimental data helps them avoid making choices that would lead to a loss in terms of revenue or performance. It gives them quantitative evidence they can point to when challenged on why they made a decision. And finally, experimentation can help product managers sway challenging stakeholders by testing their ideas and proving them wrong sometimes, or in some cases, proving them right. Either way, decisions should be based on data, not opinion.
5. When should you run Experiments?: So as a product manager, when should you run experiments? Of course, the ideal answer is that you should be running experiments all the time. But unfortunately, real life is different. There are definitely some companies out there that have the mindset and the capability to test everything, such as bug fixes, marketing campaigns, et cetera. Companies like these test every time they have a question, every time they launch anything, and any time they have a conflict, because that's the beauty of experiments: they can settle arguments, and they can show evidence that proves or disproves someone's opinion. That's why many companies run them any time they want to release anything. But of course, that's an ideal world. Most of us, in fact almost everyone, has some sort of limitation, be it in terms of resources or time. So we have to be more selective about when we run experiments. This is because experiments can be expensive. They often involve engineering, design, and product time; they are major investments. So if you're ever in a situation where you're looking to run an experiment, and while I strongly suggest you make a best effort to run one, sometimes you have to consider whether this is the most important experiment you can run right now, and whether this experiment will get you any important learnings. Sometimes you're releasing a bug fix and you don't expect any change in any metrics, so you might have to reconsider running the experiment altogether. But as a general rule, you should run experiments when you have any uncertainty or questions, prioritizing any questions that are in your way of making progress. Consider this: on the y-axis we have uncertainty, and on the x-axis we have time. The red line shows the path without experimentation. Essentially, as time goes on, your uncertainty decreases gradually, and at launch it drops drastically. If you look at the blue line, the path with experimentation, you see the uncertainty drops earlier on, and at launch you have very few uncertainties. This is because with early experimentation, you're able to understand the space more and make necessary adjustments, reducing risk, which is illustrated by the difference between the red line and the blue line. Risk is what you don't know about the market, your users, your product, and how it will perform in the wild. So one of the major benefits of experimentation is minimizing that risk.
6. How does Experimentation compare to other sources of data?: One of the most common questions I hear is, how is experimentation different from other sources of data, such as research, analytics, and user interviews? Let's look at each of these separately. Research helps product managers identify problems, while analytics unveils past behaviors and shows what people have done on the site. Surveys and interviews hint at why users have done certain things. But only experimentation demonstrates causality; it demonstrates statistically that one variable impacts another. Without experimentation, all you have is correlation. In some fields of science, there's a concept called the hierarchy of evidence, which attempts to rank different sources of information based on their quality and their risk of bias. Now, while all types of research are susceptible to biases depending on how data is collected, in general, if run properly, randomized controlled trials, a.k.a. experiments, have the highest quality and the least risk of bias. This is not to say we should ignore other sources of data, because they're quite valuable. Rather, this is to illustrate the power of experimentation to provide insights and learnings with less bias.
7. 10 Important Experimentation terms: Now let's cover ten important terms in experimentation. While this section may seem dry, it will give you the vocabulary to speak about experimentation. What is an experiment? An experiment is a scientific procedure that helps you make a discovery, test a hypothesis, or demonstrate a known fact. Some examples are using an experiment to see if a new feature will increase sales, to understand whether lifestyle images will increase orders, or to prove that popups increase bounce rates. Another term that you'll hear is the term variant. Variants are also called branches or treatments. In a nutshell, these are the different versions of what you want to test or compare. In the case of an A/B test, that is an experiment that has two variants; an A/B/C test has three variants, and so on and so forth. Factors are measurable variables you can change or control. Every factor can have an impact on the outcome of an experiment. Some examples of factors are the size of a button, a new feature, or the day of the week. It's just that some factors are easier to control than others in terms of your experiment. An independent variable is a special kind of factor that you intend to investigate, one that you believe will impact the value of the variable or KPI of interest, called the dependent variable. How much you learn from an experiment depends on the granularity of your independent variable. For instance, if you decide that your factor is the color of a button, you will learn the impact of the color of that button on, say, click-through rate. However, if your independent variable is, say, an entire homepage design, then you will learn the impact of the entire design on click-through rate, not necessarily the impact of any individual piece of the homepage on click-through rate. We recently talked about what an experiment is; however, a controlled experiment is an A/B test where you keep all factors across the variants constant except for the independent variable. The variant for which the independent variable is set as the baseline to compare against is called the control. Any variants that aren't the control are often called challengers. Without a doubt, you've heard the term A/B split. But what exactly does it mean? Well, A and B refer to the variants of a test. If there are three variants, it's often called an A/B/C test. It should be noted that people in one variant do not experience what others see in other variants. An important note is that traffic does not necessarily have to be split evenly between the variants. What that means is an A/B split doesn't have to be 50-50; it could be 60-40, 30-70, or 90-10. Nor does an entire audience have to be included in an experiment. You can select, say, 15% of your audience and split within that 15%. A test with multiple variants is not to be confused with a multivariate test. A multivariate test is a controlled experiment that tries different combinations of independent variables to understand their combined impact on a dependent variable. The easiest way to remember this is that a multi-variant test tests a single independent variable, while a multivariate test tests multiple independent variables at once. The term promote simply means exposing a single variant to all of your qualified users, without any challengers. You should note that you don't have to promote the best performing variant.
Situations where you may not want to promote the best performing variant could include commitments already made to clients or to senior leadership. There are also cases where a particular feature unblocks other work. However, in an ideal world, your decisions are based on data and you would promote the best performing variant. And finally, inconclusive. This term is most common with what we call frequentist experiments, where an inconclusive experiment is one in which you have collected enough observations and the data supports neither the control nor the variant. Inconclusive experiments are not necessarily a bad thing, as long as you are learning.
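To make the A/B split idea more concrete, here is a minimal sketch of how users might be deterministically bucketed into variants, including only 15% of the audience and splitting that portion 60-40. The hashing approach, function name, and split values are illustrative assumptions, not a description of how any particular experimentation platform works.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   audience_share: float = 0.15,
                   split: dict = None) -> str:
    """Deterministically bucket a user for an experiment.

    Returns "not_in_experiment", "control", or a challenger name.
    The 15% audience share and 60/40 split are illustrative values.
    """
    split = split or {"control": 0.6, "challenger": 0.4}
    # Hash user + experiment name so the same user always lands in the
    # same bucket, but buckets are independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # roughly uniform value in [0, 1)

    if bucket >= audience_share:
        return "not_in_experiment"

    # Rescale the in-audience portion to [0, 1) and walk the split.
    position = bucket / audience_share
    cumulative = 0.0
    for variant, share in split.items():
        cumulative += share
        if position < cumulative:
            return variant
    return "control"  # fallback for floating point edge cases

print(assign_variant("user-123", "new-checkout-flow"))
```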
8. The Experimentation Cycle: Experimentation is a cyclical process, as we can see from this diagram. It all starts with observing your industry, your product, trends, and the market, trying to find problems or opportunities. Then you ask questions: how can you solve your problem? How can you take advantage of an opportunity? From there, you craft hypotheses that can help you answer these questions. Then you prioritize these hypotheses to ensure that you work on the most impactful experiments first. Then you design and set up your experiment. When the test is done, you analyze the results. Then you document your findings and share them with others. Finally, you roll out your feature or product based on the information that your experiment helped you generate. At this point, you're back to observing, observing how your work impacted the environment, all with the goal of iterating on your work or just moving on to the next experiment on your priority list. Now, let's look at each one of these in a bit more detail.
9. Observe: Let's start with observe. This is where you'll want to look at your product and try to figure out what can be improved. The first thing you'll want to do is perform a situational analysis, where you start with, one, understanding your goals; two, mapping out your conversion funnel, essentially understanding how users flow through your product; three, putting metrics against your funnel, which means putting traffic numbers and conversion rates at all the major points so you understand the performance; and four, after looking at the entire funnel, picking the metrics that you need to fix or fuel. A metric that has to be fixed is an area where users drop off or have a lower conversion rate than desired. You should only fix areas where communication is suboptimal, the UX has friction, or other functional issues can be found. You should avoid fixing areas where you have to drastically change your users' desires. For instance, in the case of a grocery site, it's hard to convince a vegan to start eating meat. All that to say that not all metrics are fixable. On the other hand, some metrics are far easier to fuel. What I mean by that is that there are areas on the site that are performing really well but have not been maximized. The theory here is that it is easier to pour gasoline on a fire than to start one, i.e., make a good thing become better.
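As a small illustration of putting metrics against your funnel, here is a sketch that computes step-to-step and overall conversion rates from step counts. The funnel steps and counts are hypothetical placeholders; in practice these numbers come from your analytics.

```python
# Hypothetical step counts for an e-commerce funnel; replace with your
# own analytics data.
funnel = [
    ("visited site", 100_000),
    ("viewed product", 42_000),
    ("added to cart", 9_500),
    ("started checkout", 6_200),
    ("completed order", 3_100),
]

previous_count = None
for step, count in funnel:
    if previous_count is None:
        print(f"{step:>18}: {count:>7}")
    else:
        step_rate = count / previous_count      # conversion from the prior step
        overall_rate = count / funnel[0][1]     # conversion from the top of funnel
        print(f"{step:>18}: {count:>7}  "
              f"step conv {step_rate:6.1%}  overall {overall_rate:6.1%}")
    previous_count = count
```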
10. Question: At this step, you have to think of all the questions that must be answered in order to improve your conversion rate. For instance, if you know that you need to increase orders and you find that users are dropping off at checkout, your question could be: am I asking for too much information at checkout? Think of as many questions as you can. It is preferable to ask all stakeholders to participate here, where ideally they'd be provided with the same conversion funnel metrics ahead of time and then given the opportunity to share their best ideas with you. Once you're happy with your list of questions, it's time to move on to crafting hypotheses.
11. Create hypotheses: While all the steps covered here are important, the most important step is forming a hypothesis. A hypothesis outlines what you're testing, to whom, and what you think will happen and why. It is a powerful tool that helps clarify your thinking and communicates to others what you are trying to learn. While there are many formats out there, I prefer the following: for the audience of interest, if we do the variant experience as compared to the control experience, we expect something to happen to the primary KPI because of a reason. The reason piece is very important, as it ties back to your original question. If you find your hypothesis doesn't align with what you're trying to learn, you should rethink your test. An example of a good hypothesis could be: for new users, if we ask for their location during onboarding, as compared to not asking for it, we expect 30-day retention to go up because we will be able to deliver more personalized content to them. An important note when selecting your audience is to try to avoid biases as much as you can. Biases can skew your data one way or another, causing you to make a poor decision. Imagine trying to design a new food product and you happen to only select lactose intolerant test subjects; your test results would cause you to avoid the dairy category altogether. While this is a very obvious example, the same could happen for factors you can't foresee. To minimize biases, you must randomize your test subjects. Luckily, most, if not all, online experimentation platforms can do this for you. However, doing this offline is harder. If you have to select test subjects yourself, you should know that not all selection methods are made equal. Some have inherent biases which you should avoid. Here are some examples. Snowball sampling is when you allow test subjects you recruit to recruit other test subjects. Since people like to hang out with others with similar values, this could introduce unwanted attitudinal and other biases. Convenience sampling is when you select subjects because they are easy to recruit. Common examples of this are when you ask your friends or people near you to be part of your experiment; it is much better to ask, for example, every third person you see if they'd like to be part of your experiment. Purposeful sampling is when you hand-select test subjects that you feel represent your target population. Finally, survivor bias occurs when you only look at test subjects that have made it through some prior filter that you may not have considered. The classic example of this is when, during a war, engineers were tasked with improving the armor on planes. They looked at all the planes that came back from battle and where they took damage. However, the fact that the planes came back meant that the damage they took was actually survivable, and the engineers should have focused on areas of the plane that were not damaged, because clearly, those that took damage in those areas did not return. It can be challenging to decide what to include and what not to include in a test. How do you balance between learning and practicality? This is my advice: pick the granularity that matches the level of learning you are seeking. If you're looking to understand the effect of color, test color. If you're looking to understand the effect of a feature, test the effect of that feature, simple or complex. You should be able to summarize your changes in a single sentence.
For instance, you're looking to understand the effect of a new form design on sign-ups, or you're looking to understand the effect of a different CTA button design on sign-ups. If you aren't able to summarize your changes simply, you are compromising learning. However, say you aren't even able to decide what to test together in the first place, let alone summarize anything. My advice here is to seek heat, that is, look for signs of promise. There are times when you have no idea where to start and you just have the KPI you want to improve in mind. Say you're trying to figure out what the best ingredients are for a cake to make it taste better. The issue here is that you don't know which variables play nicely with each other and which fight with each other, and you don't have all year to run an endless number of experiments testing each variable separately. For situations like this, you can try what is called fractional factorial design. It's a fancy name, but what it essentially means is trying different combinations of variables at different values, or levels, out of the gate and looking for which combinations correlate with improvement to your KPI of interest. At this step, you're only looking for hope rather than trying to prove anything statistically. It's a best practice to have control runs here and there, i.e., set combinations where you aren't changing variables. This way you can see if there's any drift in the measurements. For example, let's go back to making a cake. Say your oven is wonky and is cooking hotter and hotter throughout the day. Running a control run here and there could catch if your temperature is drifting. Once you find the combinations that work well together, make those combinations your variants. Ideally, you can look at these combinations and summarize them simply. If you aren't able to leverage fractional factorial design, you can also look at historical data, looking at the impact of different factors on your KPI through regression analysis. Either way, sometimes finding combinations with promise is a great way to figure out what to test.
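To make fractional factorial design a little more tangible, here is a rough sketch of a half-fraction design for four two-level cake factors, with a couple of control runs mixed in to catch drift. The factor names, the defining relation, and the taste scores are all made-up assumptions; for a real study you would lean on a statistician or a dedicated design-of-experiments library.

```python
from itertools import product

# Hypothetical cake-recipe factors, each at a "low" (-1) and "high" (+1) level.
factors = ["butter", "sugar", "vanilla", "bake_temp"]

# Half-fraction 2^(4-1) design: enumerate the first three factors fully and
# derive the fourth from their product (defining relation D = A*B*C).
runs = []
for a, b, c in product((-1, +1), repeat=3):
    runs.append({"butter": a, "sugar": b, "vanilla": c, "bake_temp": a * b * c})

# Sprinkle in control runs (all factors at baseline 0) to catch drift,
# e.g. an oven that keeps getting hotter throughout the day.
for i in (2, 6):
    runs.insert(i, {f: 0 for f in factors})

for i, run in enumerate(runs):
    print(i, run)

# After measuring your KPI (e.g. a taste score) for each run, a rough screen
# for promising factors is the difference in average KPI between the high
# and low levels of each factor. The scores below are placeholders.
scores = [6.1, 7.4, 5.9, 6.0, 8.2, 6.6, 6.0, 7.9, 8.8, 7.1]
for f in factors:
    high = [s for s, r in zip(scores, runs) if r[f] == +1]
    low = [s for s, r in zip(scores, runs) if r[f] == -1]
    effect = sum(high) / len(high) - sum(low) / len(low)
    print(f"{f:>9}: rough effect {effect:+.2f}")
```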
12. Prioritize your hypotheses: There are many ways to prioritize hypotheses. Some of the popular ones, like ICE scoring, give weight to different dimensions like level of effort and confidence, and generate a single score to compare experiments. Whatever method you use, the end goal is to prioritize the hypotheses that will have the most impact for your business. On the following slides, I will walk through why prioritization of hypotheses is so important and how I suggest experimenters prioritize their hypotheses. Because time is limited, you want to give your product team the greatest chance for success, and you can do this by front-loading the experiments that reduce the most uncertainty. If you run those later on in the process, you risk not giving your team enough turnaround time. Put another way, you can plot your hypotheses along two axes: impact versus risk, where high impact means something that is of high importance to your business, and where risk is how much it would cost you if things go wrong. As you can see, high-impact and expensive risks should be prioritized first. These are critical hypotheses that are important for your business to explore and represent a large loss if things don't go the right way; you often know the least about this area. Next up would be high impact and cheap risk. This is usually for hypotheses that focus on maximizing your investment, i.e., maximizing performance rather than avoiding getting something really wrong; you often know quite a bit about this area. Next up is low impact and cheap risk. These are typically hypotheses you have when you're starting out in experimentation, where it's in your interest to run safe experiments just to get used to the process. And finally, we have expensive risks with very low impact to your business. These are hypotheses that should be rethought and re-prioritized.
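For a concrete feel of score-based prioritization, here is a tiny ICE-style sketch. The hypotheses and ratings are placeholders, and the score is computed here as the product of impact, confidence, and ease, though some teams average the three instead.

```python
# Minimal ICE scoring sketch: Impact, Confidence, and Ease each rated 1-10.
# The hypotheses and ratings below are made-up placeholders.
hypotheses = [
    {"name": "Simplify checkout form", "impact": 8, "confidence": 6, "ease": 5},
    {"name": "New homepage hero image", "impact": 4, "confidence": 5, "ease": 9},
    {"name": "Ask for location at onboarding", "impact": 7, "confidence": 4, "ease": 6},
]

for h in hypotheses:
    h["ice"] = h["impact"] * h["confidence"] * h["ease"]

# Highest score first: these are the hypotheses to front-load.
for h in sorted(hypotheses, key=lambda h: h["ice"], reverse=True):
    print(f'{h["ice"]:>4}  {h["name"]}')
```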
13. Experiment: Now that you've selected your hypothesis, it's time to experiment. You start with designing your variant experiences. Here are some things to consider. One, do your best to only create experiences you'd be okay promoting; otherwise, what's the point? Two, the goal is to learn cheaply and quickly, so when designing your experiences, always think MVP. Think about how small you can make your experience and still answer all your questions. Three, do your best to represent real-life conditions. In other words, try to make things as close to reality as possible. For example, if you want to change a logo, you wouldn't just change it in one single location; realistically, you'd change it everywhere. Now, that's a bit of an extreme example, but you probably get my point. Four, finally, variations should be trackable, meaning you should be able to figure out in the data which experiences users saw and what actions they took. If you aren't able to split out the analytics or data, then it will be impossible to analyze your results later. You can find inspiration from all sorts of sources, such as past experiments, user interviews, market research, and competitors. A quick note: while your competitors are a great source of ideas for experiments, never assume that just because they are doing something, it means that it is working for them. For all you know, they are running experiments themselves. Another piece of advice is, when generating ideas, get input from all your stakeholders. This will help with buy-in for contentious experiments in the future. Also, avoid brainstorming sessions; encourage contributors to look at the data and propose their best ideas. Nothing kills a creative process more than having one very dominant voice in a brainstorming session. The following are some of the most common test types. Redirection tests are those that split traffic between two separate pages. The user generally won't notice that they've been redirected. These are one of the most straightforward types of tests, and experimentation platforms typically support them natively. Feature flag tests hide or show functionality based on a variable set either client-side or server-side. These usually require dev support. Visual editor tests are those that are created using some form of a WYSIWYG editor. These tests work by intercepting the loading of a page, manipulating the DOM, and then showing the contents of the page. These can slow down the page and cause instability issues in the case of complex changes. Multivariate tests, as we've covered before, test different combinations of factors by randomly combining them. These usually require a lot of setup up front and a lot of time to run; you're usually better served with simpler A/B tests, which are more focused. Finally, roll-out experiments are those which hold out a percentage of users from an experience so you can measure the impact of a feature against a baseline. Sometimes experiment results can lead to having the rollout paused, but they are a great way to ensure that you're putting out impactful work. Next is defining your overall evaluation criteria, or OEC. This is where you define which metrics you will use to evaluate an experiment. These metrics typically represent each of your major stakeholders' needs and act as leading indicators of business health. Let's go over how you should ideally define your OEC. One, pick your primary KPI. This is the metric your experiment is trying to impact. Two, define your lead metrics for business health.
These are typically metrics that help you understand if your experiment was good or bad for the business. You typically want to use a predictive metric rather than a lagging one like sales. Three, diagnostic metrics help you determine why a test result occurred like it did, for better or for worse. And finally, four, guardrail metrics are those that you must not negatively impact. If these metrics cross certain thresholds, the experiment should be stopped immediately. An example of this in e-commerce is, say, if sales were to drop by more than 10%, then you'd want to end the test right away. These metrics ideally should be agreed upon by all stakeholders, along with the next steps based on different scenarios. Now it's time to define your test parameters. These are the thresholds and decision criteria for your test, the ones that will determine whether your experiment generated a learning or not. The definitions of these parameters depend on the statistical approach that you take; for example, you could take a frequentist, Bayesian, or sequential approach, et cetera. In a later section, we will focus mainly on the frequentist approach as it pertains to binary metrics. Binary metrics are those that can either be true or false. Next, you have to develop and instrument your experiment. This is where your experiment comes to life. As you build out your test experiences, ensure that you QA the branches as well, and that analytics are being collected properly. There's nothing worse than waiting two weeks just to find out that your data is incomplete or dirty. Now that you've done all that, it's time to launch. Remember, before you launch, give all your stakeholders a heads-up.
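As one possible way to operationalize a guardrail, here is a small sketch that flags when a variant's guardrail metric drops too far relative to the control. The 10% relative-drop threshold and the revenue-per-visitor numbers are assumptions echoing the e-commerce example above; use whatever thresholds your stakeholders have agreed to.

```python
def check_guardrail(control_value: float, variant_value: float,
                    max_relative_drop: float = 0.10) -> bool:
    """Return True if the variant breaches the guardrail.

    The 10% relative-drop threshold mirrors the e-commerce example above;
    pick thresholds your stakeholders have agreed to up front.
    """
    relative_change = (variant_value - control_value) / control_value
    return relative_change < -max_relative_drop

# Hypothetical revenue-per-visitor readings for control and variant.
if check_guardrail(control_value=2.40, variant_value=2.05):
    print("Guardrail breached: stop the experiment and investigate.")
```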
14. Analyze, Document and Share, and Rollout: Now that your experiment has concluded, it's time to analyze your results. Then you should document your experiment and share your learnings. After that, you roll out your winning variant if it makes sense; remember that you don't always have to roll out an experience. We'll cover analysis and documentation in more detail in later sections.
15. The Experimentation Roadmap: In this section, we'll cover how to develop an experimentation roadmap. Just like a typical product roadmap, there is value in developing an experimentation roadmap. In short, an experimentation roadmap is one that details the experiments that will be run in the upcoming months, usually over a quarter. There are some obvious benefits to doing this. Firstly, you're able to maximize and coordinate resources. Because things are planned, or at least scheduled, in advance, you're able to free up the right resources at the right time, thus reducing double booking or downtime. Secondly, roadmaps help avoid test collisions. Knowing what you will run can help you avoid running a test that will collide with another PM's experiment, which could affect your results. Thirdly, they also ensure coverage. A common trap that experimenters fall into is to put too much focus on one area. Having a roadmap helps you ensure that all the important areas get attention. Fourthly, potentially the most important benefit is business visibility. Everyone at a company wants to know what product will build next. Having an experimentation roadmap helps with transparency and answers questions for stakeholders. An important note as you plan out your roadmap, however, is to leave slack for iterations and rollouts. Since we cannot predict the future, you never know if you will want or need to run iterations on tests or roll out winners. Not having slack in your roadmap usually implies you don't care about learning from results, so remember to account for it.
16. The Experimentation Roadmap - Continued: Here are ten steps on how to build an experimentation roadmap. One, especially if you're new to experimentation, start with experiments that are low effort and low impact. By doing this, you'll be able to learn the ropes and, frankly, make mistakes in a safer environment. Getting a few small tests out, and hopefully some early wins, will build your confidence and refine your workflow. Two, if you have the luxury of time (I know, I know), but in case you do, consider starting to experiment higher up in the funnel. The reasoning here is that lower-funnel experiments typically take longer, so by improving conversion at the higher stages of a funnel, in theory you will increase the traffic lower down the funnel, in turn helping those tests run faster. Three, front-load experiments that address questions with the highest risk first, the ones that will benefit from having more time to address. Hard, important questions take time to figure out. Four, make experiments as cheap as possible by always thinking MVP. In some cases, you might be able to answer the same question through a smaller test elsewhere. Five, shorten experiment times where possible. This will help you learn faster, and learning faster leads to success faster. That's easier to say and a bit trickier to do, so let's dive into this for a moment. Here are three common ways to shorten experiment time, in order of my preference. Firstly, lower your standards. I always get looks when I say this, but it's true. We'll get into stats later in this course, but lowering your statistical standards, like the required statistical power and confidence level, is a very valid way to shorten your tests, as long as you are truly okay with lowering your standards. For instance, sometimes a change isn't very critical to the business, so you'd be okay with, say, 90% confidence rather than 95%. Secondly, test extreme changes. Bigger changes produce bigger results that are easier to detect. The example I love to share to illustrate this is detecting whether there's a car in your driveway. How many observations would you need? One, maybe two. But if you wanted to detect whether there was an ant on your driveway, you'd need plenty more observations, which would take longer. Lastly, micro-conversions. Sometimes the area you are testing just doesn't get a lot of traffic. Moving up the funnel, along the path of conversion, to the next opportunity to experiment is sometimes a good place to start. These areas typically get more traffic and have higher conversion rates, which typically results in faster tests. Now, mind you, improving conversion rates higher up the funnel may not result in proportional gains lower down the funnel, but you do what you have to do to get started. This chart hopefully illustrates the notion. On the y-axis is your baseline conversion rate, or in other words, the base conversion rate you are trying to improve. The values aren't important here, but rather the direction of magnitude, where going up means the conversion rate is larger. On the x-axis is the size of the change; more extreme changes are on the right. As you can see, the shortest tests are for large conversion rates and large changes, while the longest tests are for small conversion rates and small changes. If I had to choose, I always suggest going for big changes first before finding larger conversion rates, as the size of the change has a bigger impact than the current conversion rate.
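To see how baseline conversion rate and size of change drive test length, here is a sketch that sweeps both and prints the required sample size per variant at 80% power and a 95% desired confidence level. Using statsmodels for the calculation is my own assumption; the course itself doesn't prescribe a tool.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

solver = NormalIndPower()

# Required samples per variant for a two-sided test at 80% power and a
# 95% desired confidence level (alpha = 0.05).
for baseline in (0.02, 0.10, 0.30):          # baseline conversion rates
    for relative_lift in (0.05, 0.20, 0.50): # size of the change
        target = baseline * (1 + relative_lift)
        effect = proportion_effectsize(target, baseline)
        n = solver.solve_power(effect_size=effect, alpha=0.05,
                               power=0.80, alternative="two-sided")
        print(f"baseline {baseline:4.0%}  lift {relative_lift:4.0%}  "
              f"~{int(round(n)):>7} users per variant")
```

Running this reproduces the shape of the chart: bigger changes and larger baseline conversion rates need far fewer samples, with the size of the change having the stronger effect.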
Now let's get back to the experimentation roadmap. Six, it should come as no surprise to product managers, but having a backlog of experiments that support decision-making is important. The trick is to time them so that they answer the right questions at the right time. Seven, don't always think iteratively. You have to have a mix of moonshots and iterative tests; test bad ideas sometimes, at least bad ones you'd be okay promoting. Eight, line up your experiments so you can run one after the other. Nine, monitor interactions. We'll go into this later in the course, but the idea here is to monitor how experiments interact with each other. Sometimes, or rather often, you will run experiments in parallel. There's a whole debate on whether this is okay or not; again, more on this later. And finally, ten, iterate. Don't always test and then move on to the next test. Sometimes you need to dig in more, which emphasizes the need to add slack into your roadmap to account for iterations.
17. The Experiment-led product team: It's one thing to know what to test and how to structure a test. It's another thing to execute on it. To do experimentation right, it takes a village, and you need the right team. But what is the right team? The team that you have, and will need, to support experimentation depends on the type of product you're building or working on, as this impacts the complexity of setting up an experiment. However, in general, you will need the following skill sets on your team. Firstly, you will need a data scientist or an analyst, anyone with statistics in their wheelhouse. This is critical. You'll need someone who will be able to interpret the data and ensure that it is collected and analyzed in the right ways so that your findings are valid. There's nothing more dangerous than building on invalid insights. Depending on your product, you'll likely need a user researcher. Experiments should align with solving user problems. User research is a great way to uncover these problems and is a rich source of ideas for experimentation as well. A UX designer: isolating your independent variable can be tricky to do in a way that makes sense in a customer journey. A UX designer will help you create something that makes sense while still collecting your data. Engineering: this is obvious; someone has to build everything. Analytics: you'll need a resource to actually collect and instrument the data properly. QA: last but definitely not least, and I can't emphasize this enough, you need a QA resource. Because test experiences are typically, say, MVP quality, you absolutely need QA to run regression tests and check that nothing is broken and experiences are actually delivering as expected. Bring these resources in early in planning so you don't miss deadlines.
18. Nurturing an Experimentation culture: Now that we've covered the team, we have to touch on culture. Without the proper mindset, you won't be able to extract the full value from experimentation, and it will feel like a slog rather than an exciting source of evidence for making decisions. The following are ten tips on how to nurture an experimentation culture. One, right off the bat, you need vocal and public C-suite support. Nothing encourages people to experiment more than having leadership behind it. Furthermore, the C-suite should be encouraged to require experimental data when making decisions. Of course, this may be really hard to do, so you may have to take a bottom-up approach where you convince people lower down the chain to support it and sell it upwards. But if you want experimentation to have any legs, you or someone will have to put the time in to sell leadership. Two, listen to your stakeholders, understand their needs, and frame experimentation in ways that address those needs. Unless something helps folks achieve their goals, it will likely be abandoned. Three, share examples and best practices. This is straightforward, but it isn't done as much as you'd think. People learn from examples. Showing potential experimenters how to run proper experiments will go a long way. Furthermore, showing successes and learnings can inspire others to experiment as well. Four, set volume goals, i.e., the number of experiments launched over a period of time, to get the ball rolling, but quickly move on to business outcome goals once things are going. Five, hire the right people. Always seek talent that believes in the value of experimentation. Not only will they support experimentation, but more importantly, they will likely believe in making decisions based on data and evidence, which is something we should all strive for. Six, create workflows that not only can scale, i.e., can be automated in some way, but also fit into existing workflows. At least in the early days, you have to reduce those hurdles. Seven, avoid talking about winning. This is the wrong approach, because the point of experimentation is to learn. Ideally, every experiment generates a learning, so celebrate and share when you learn something. Eight, as touched on previously, automate tasks as much as possible. As your experimentation velocity increases, you'll be happy you did this. Nine, help teams define success and hone in on proper hypotheses and decision criteria early on. Many people run tests without real goals, which can lead to frustration. Ten, finally, communicate learnings widely. When experimentation is seen as a source of important information, buy-in quickly follows.
19. Communication: This brings us to communication. What should you share with your team or organization? While this really depends on the org, I would advise you to share the following. Share the problem you are trying to solve and your hypothesis to give others context. Share your results in a factual and unbiased way; avoid interpretation, at least at this stage. The level of stats you share depends on the statistical literacy of your audience. Once you have done that, share your interpretation of your results and what this could mean for the business or product. Next, emphasize the learning over the exact values; values can vary slightly if you rerun an experiment. Finally, share your contact info and links to your documentation so others can learn more. How should you communicate this information? A former CEO of mine said that in order to get your point across, you have to repeat yourself at least nine times. So in short, share your learnings in every way you can. At the start, you will quickly learn which channels are the most effective. Here are my suggestions: share them via email, newsletters, lunch and learns, infographics, business reviews, Slack, product retros, dashboards, and if you're feeling adventurous, try video.
20. Documentation: This brings us to everyone's favorite topic, documentation. No one loves documentation, I don't think, but I've been wrong before. For those who don't believe in documentation, I wanted to share the benefits of solid documentation for experimentation. The most obvious reason is to avoid rerunning experiments. When you have a lot of experimenters or bad communication, it's not that uncommon for tests to be rerun after a few months or years by accident. Obviously, another benefit is that others can learn from your work; documentation helps facilitate communication. A less obvious benefit is that it enables meta-analysis, where you can look at a group of experiments and pull together larger insights. Documentation is good to reference when making decisions. Documentation also helps improve the quality of experiments, as you can learn from others' challenges. From a political perspective, documentation demonstrates the value of your work. Sometimes you can't deliver on an outcome, but having documentation on your experimentation shows you've been learning, which in and of itself is progress. Finally, documentation organizes your thoughts. If you document as you develop an experiment, as you should, it helps you refine your thinking and helps you catch gaps. So finally, what should you document? This is a bit of a laundry list, but I'd suggest documenting the following. Title: give it a name you can find later. Owner: who to contact about the experiment. Flight dates: when the experiment was live; sometimes your experiment impacted something outside of your world that you may need to track down. Area tested: where did you test and what did you test? This is helpful when pulling larger learnings together. Business problem: what was the question you were trying to answer? Your hypothesis, your overall evaluation criteria, your decision criteria, the description of variants, information about the setup, and any sign-offs if needed. Test results: include statistics here for others to evaluate. Learnings and insights: what does the data mean to you and the business? And finally, next steps: what you did or will do because of these results.
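If you want to make this laundry list operational, here is one illustrative way to capture it as a structured record. The field names and the use of a Python dataclass are my own assumptions; a wiki template or spreadsheet with the same fields works just as well.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExperimentRecord:
    """One possible way to capture the documentation fields listed above."""
    title: str
    owner: str
    flight_dates: str
    area_tested: str
    business_problem: str
    hypothesis: str
    overall_evaluation_criteria: List[str]
    decision_criteria: str
    variants: List[str]
    setup_notes: str = ""
    sign_offs: List[str] = field(default_factory=list)
    results: str = ""
    learnings: str = ""
    next_steps: str = ""
```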
21. Introduction to Power and Desired Confidence Levels: This is an introduction on how to analyze results. This is an intermediate to advanced topic, depending on your comfort with math and familiarity with statistical concepts. While you don't need to become a statistics expert, it helps to get a feel for the terms so that you can ask the right questions when looking at results. There are several schools of thought when it comes to statistics. Let's start with what is called frequentist statistics. This is usually the kind of statistics you learned in school. But before we begin, some disclaimers: I am not a statistics professor, and I will loosely explain statistical concepts. While most third-party experimentation platforms report statistical results, you should always seek statistics support to validate results. Finally, if you're not interested in statistics, you are at least encouraged to skim through the material. The goal of experimentation, at least as it pertains to product management, is to understand if the independent variable has a meaningful impact on a dependent variable. To determine this, three criteria must be satisfied: one, you collected enough observations to detect the change you care about; two, compared to the control, the probability of the variant's observed result occurring is very unlikely; and three, there is no chance that the control's and the variant's observed results are equivalent. In statistics terms, that translates to: one, each variant has hit the required sample size; two, your confidence level is greater than 95%, classically; and three, your confidence interval of the difference of means does not cross zero. But before we dive deeper, there are a few more concepts that we need to cover. We have to talk about experimental errors, which frequentist statistics famously controls for. There are two kinds of errors, type one and type two. Type one errors are also called false positives, while type two errors are false negatives. I love this illustration; I feel it captures the concepts very nicely. What frequentist statistics does is report on observed values while guaranteeing that type one and type two error rates never exceed limits. The next important concept of frequentist statistics is that an experiment's observed value will converge towards one true value. What this means is that if you let an experiment run long enough, i.e., collect enough samples, the measured result across all the samples converges towards a single value. Early on during an experiment, the overall observed value of a variant will be all over the place, as you can see on the left. You will have to wait until things settle down, or converge. In other words, you have to wait until you have collected enough samples before you analyze results. This is the concept of minimum sample size: it's how many samples you must collect until you can trust your observation and still trust that your type one and type two errors are within thresholds. Evaluating results before you hit your minimum sample size is called peeking and leads to invalid results, a big no-no. As mentioned, the minimum sample size is the minimum number of samples that each variant must reach so you can generate valid results. It's okay to collect more, but it's never okay to collect less, not even by a single sample. How is it calculated? The minimum sample size is a function of your false positive and false negative error rate thresholds, i.e., how many false positives and false negatives you are okay with, and the smallest change you're interested in,
the MDE. False negative thresholds are often represented by what is called power, while false positive thresholds are represented by what is called the desired confidence level. We'll get into how to set power and the desired confidence level, as well as the MDE, in the next section.
22. Power and P-Value: In this section, we'll take a closer look at power and the desired confidence level. As mentioned before, power represents our false negative level. Technically speaking, something called beta is the actual percentage of false negatives we will accept. Classically, beta is set to 20%. In other words, we are okay with having false negatives 20% of the time we run this experiment. Power, being one minus beta, would then be 80%. Really, there is no reason that power should be 80%; it's just a value that someone chose and everyone decided was okay. So if you're running an experiment and you really don't want any false negatives, or very few, feel free to lower beta, or rather increase power. Just know that the higher the power, the larger the sample size you need. Similarly, something called alpha is our acceptable limit for false positives. Classically, this is set to 5%. Again, there is no particular reason that everyone chooses 5%, so if you need to be sure you have fewer than 5% false positives, you can select a smaller alpha. Just to keep things interesting, the desired confidence level is one minus alpha, or 95%. Note that as you increase your desired confidence level, your required sample size also goes up. The minimum detectable effect, or MDE, might be the second hardest concept folks have with frequentist statistics. In a nutshell, it is the smallest change you care about to your KPI of interest, also known as your dependent variable. Note that I said care about. What I mean by this is that this is the smallest change that you will actually do something about. Let's say you're running an e-commerce site and want to test a new layout for your product page; you will not typically care about a one-penny change in revenue per customer. A change that small wouldn't even be worth the development time, so you should set your MDE appropriately. Similar to how power and the desired confidence level impact the required sample size, the smaller your MDE, the more samples you need. The easiest way to remember this is, say you are trying to detect a car in your driveway. That would be a big change to catch, and it wouldn't take many observations or samples to detect a change like that. But if you were trying to detect whether there was an ant in your driveway, this would require a lot more observations. So back at the beginning of this section, I mentioned that you need three things to determine if your independent variable had a meaningful impact on your dependent variable. You need enough observations to detect a change; this means each variant hits its sample size. The second criterion was that, knowing what we know about the control, the probability of your variant's observed value had to be very unlikely. The probability of your variant's observed value occurring, given what we know about the control, is known as the p-value. Just to make things more complicated, one minus the p-value is known as the confidence level. So in frequentist statistics, an unlikely result is when the probability of your variant's observed value, also known as the p-value, is less than your alpha, which again is classically set to 5%. Or, put another way, when the confidence level of your variant is greater than your desired confidence level. At the end of the day, if you see that your confidence level is greater than 95%, on most days this is good enough.
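Putting the p-value and confidence level into practice, here is a sketch of a two-proportion z-test on hypothetical conversion counts, assuming both variants have already hit their minimum sample size. The numbers and the use of statsmodels are illustrative assumptions.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and sample sizes for control (A) and
# variant (B); both have reached the pre-computed minimum sample size.
conversions = [1_210, 1_325]
samples = [24_000, 24_000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=samples)
confidence_level = 1 - p_value

print(f"p-value: {p_value:.4f}  confidence level: {confidence_level:.1%}")
if p_value < 0.05:  # alpha, classically 5%
    print("The variant's result is unlikely under the control: statistically significant.")
else:
    print("Inconclusive at the 95% desired confidence level.")
```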
23. Practical significance: Now we have checked that we hit sample and that our result was unlikely, i.e., our confidence level was higher than our desired confidence level, classically set at 95%. The last thing we have to check is whether the observed values of our control and variant have any chance of being equivalent. This is by far the hardest concept to grasp in frequentist statistics for A/B split tests, so you may want to rewatch this section a few times. If we were to run our experiment many, many times, think infinity, the observed value of our KPI for a given scenario would orbit around one true value, a value we do not know, because we couldn't possibly run our experiment over and over again. You can see this illustrated on the left, where the frequency of the observed value is highest around the one true value. In other words, the observed value from our experiment for our variant is most probably different from what the one true value for our variant is. Taking this concept further, if we were to find the difference of the observed values of our control and our variant, that too would orbit a single value, the one true value of the difference. Next, see the red line to the left: that is our observed value. We already know this is probably not the one true value. But what we can do is estimate, plus or minus a value, a range that would likely capture the one true value most of the time. This interval is called the confidence interval. If this range captures the one true value for our variant 95% of the time, then this would be called the 95% confidence interval. So, as mentioned before, we can plot the frequency of each of the differences between the variant and the control, and we would see that the plot would be centered around the difference's one true value. This plot would be called the difference of means. Similar to the 95% confidence interval, the difference of means, too, could have a 95% confidence interval. Unsurprisingly, this is called the 95% confidence interval of the difference of means. This interval captures the true difference between the means 95% of the time. But why is this important? If the values of the control and the variant were identical, then the difference between them would be zero, i.e., the difference between the means would be zero. Since we do not know the true value of the difference for our experiment, and since the 95% confidence interval of the difference of means represents a range that contains the one true value of the difference 95% of the time, if this confidence interval contains zero, then there is a chance that the control and the variant are identical. In other words, if we find that the confidence interval of the difference of means contains zero, then we cannot say that the variant and the control are different. Put another way, if our confidence interval of the difference of means did not contain zero, then we could say that the control and the variant are not the same. Now let's put this all together. When you hit sample and your confidence level is higher than your desired confidence level, that means you have what is called statistical significance. But that is not enough. If you also see that your confidence interval of the difference of means does not contain zero, then your variants are different with statistical significance, i.e., you have something that is practically significant. It is only when you have practical significance that you can say that your independent variable has an impact on your dependent variable.
Note that these are not easy concepts to grasp in one sitting. It may take you a few views of this, and some extra reading, to really grasp them. But hopefully you feel just a little bit more comfortable with the terms, so that you're able to ask the right questions.
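For completeness, here is a sketch of the confidence interval of the difference of means for the same hypothetical counts used in the previous sketch, using a simple normal approximation. Checking whether the interval contains zero mirrors the check described above; a statistician may prefer a more exact method.

```python
import math

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, z=1.96):
    """95% confidence interval of the difference in conversion rates
    (variant minus control), using a normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff - z * se, diff + z * se

low, high = diff_confidence_interval(1_210, 24_000, 1_325, 24_000)
print(f"confidence interval of the difference: [{low:+.4f}, {high:+.4f}]")
if low > 0 or high < 0:
    print("Interval excludes zero: the variants differ, per the section above.")
else:
    print("Interval contains zero: we cannot say the variants differ.")
```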
24. Bonferroni Correction: As you can imagine, one of the biggest issues with frequentist statistics is the minimum sample size. Most product managers don't have a lot of time to make decisions. Sequential statistics is a variation of classic frequentist statistics in which the alpha values are dynamic, based on early data. The theory is that if we observe a very large difference early in the experiment, there's a good chance that there is a true difference. So what this approach does is allow you to stop experiments earlier in cases where the difference is very evident. Generally, sequential statistics requires dedicated calculators to figure out significance. To close out frequentist statistics, there's a problem called multiple comparisons that analysts have to account for. In short, if you have more than two variants, or if you want to analyze more than one KPI, which is very common when you are trying to understand a problem space, you have to account for the increased false positive rate. In short, the more often you compare variants or analyze metrics, the greater the chance you will have a false positive. To correct for this, you would use a correction like the Dunn-Bonferroni, more commonly called the Bonferroni correction. You simply divide your alpha by the number of comparisons you will make. Let's do a quick example. You are running an A/B/C test where A is the control. You would like an effective false positive rate of 5%, or in other words, a desired confidence level of 95%. You'll be comparing the performance of B against A as well as C against A. These are two comparisons; thus, you should divide your desired false positive rate by two, resulting in a per-comparison alpha of 2.5% and a desired confidence level of 97.5%. Similarly, if you are running an A/B test but intend to analyze three metrics, including your primary KPI, you must divide your desired false positive rate by three.
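The arithmetic of the Bonferroni correction is simple enough to show directly; the small helper below just divides alpha by the number of comparisons, reproducing the A/B/C example above.

```python
def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Per-comparison alpha after a Bonferroni correction."""
    return alpha / num_comparisons

# A/B/C test: comparing B vs A and C vs A means two comparisons.
corrected = bonferroni_alpha(0.05, 2)
print(f"per-comparison alpha: {corrected:.3f}  "
      f"desired confidence level: {1 - corrected:.1%}")  # 0.025 -> 97.5%
```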
25. Introduction to Bayesian: Now let's talk about Bayesian statistics. Bayesian statistics is a very popular approach, and one that I'm personally fond of. Let's compare Bayesian to frequentist. Firstly, in Bayesian statistics there is no one true value of the KPI of interest, but rather a range of expected values. Based on historical data, called a prior, and data collected from the experiment, a prediction of these ranges is made, which is called the posterior. There is no minimum sample size required for Bayesian because it does not control for error. What it does, however, is control for risk, i.e., the potential loss if you were to promote a losing variant. Because it does not control for errors, you don't have to worry about multiple comparison problems. Similar to frequentist statistics, Bayesian has a concept called a credible interval, which can capture 95% of the range of expected values. Bayesian benefits from being easier to understand than frequentist or sequential approaches. The one downfall is that Bayesian requires simulations to calculate values, thus often requiring that calculations are made on a server. Decision-making in Bayesian statistics is simpler than in frequentist. While there are strict decision rules available, decisions in Bayesian are as simple as deciding whether the probability of a variant winning is greater than what you feel is acceptable, whether the risk of promoting a variant is less than what you feel is acceptable, or whether the expected lift of the variant is greater than what you feel is acceptable. It could be said that decision-making in Bayesian is like gambling, where you bet only if you are okay with the odds.
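To show what these Bayesian decision metrics can look like, here is a sketch that samples from Beta posteriors for two variants and reports the probability the variant wins, the expected lift, and the risk (expected loss) of promoting it. The flat Beta(1, 1) prior and the conversion counts are assumptions; real programs often use informed priors and dedicated tooling.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical results: conversions and sample sizes for control (A) and
# variant (B). A flat Beta(1, 1) prior is assumed for both variants.
conv_a, n_a = 1_210, 24_000
conv_b, n_b = 1_325, 24_000

draws = 200_000
posterior_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=draws)
posterior_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=draws)

prob_b_wins = (posterior_b > posterior_a).mean()
expected_lift = ((posterior_b - posterior_a) / posterior_a).mean()
# Risk: expected loss in conversion rate if we promote B but A is truly better.
risk_of_promoting_b = np.maximum(posterior_a - posterior_b, 0).mean()

print(f"P(B beats A): {prob_b_wins:.1%}")
print(f"expected lift: {expected_lift:+.1%}")
print(f"risk of promoting B: {risk_of_promoting_b:.5f}")
```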
26. Interactions: To close out the analysis of results, we'll cover a few final topics that apply no matter what statistical approach you take. Let's start with sample ratio mismatch. Like many things in life, sometimes things don't go as planned. The same applies to experiments. Sometimes, while we wanted to deliver 50% of the traffic to the control and the other 50% to the variant, the traffic gets delivered differently. There are many reasons this could happen, including service outages, bad code, perhaps an experimentation platform error, et cetera. To determine if this has happened or is happening, we perform a sample ratio mismatch check that calculates the probability of observing the split in traffic we got compared to what we wanted it to be. If this probability is less than 1%, we say that there was a problem in the delivery and we should investigate. Here's a tip: it is good practice to perform a sample ratio mismatch check shortly after an experiment is launched to catch any delivery issues early, as well as when the experiment has concluded. If the SRM check fails, you have invalid results. Next, experiment interactions. Interactions occur when experiments that are live at the same time and are exposed to common audiences impact each other's dependent variables, by either suppressing or magnifying results, making analysis challenging. There are many schools of thought on how to address this, ranging from running experiments in sequence, to siloing experiments so they are exposed to mutually exclusive audiences, to simply letting them run on top of each other. The most practical approach to dealing with experiment interactions is that, when it is believed that two experiments have a high probability of interacting, you should compare the performance of the overlapping audience against that of each of the intersecting experiments. If the percentage change is roughly the same, there is no issue. Otherwise, there's an interaction, indicating the tests should be rerun separately or that an additional factor should be considered.
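A sample ratio mismatch check is commonly implemented as a chi-square test of the observed split against the intended one; here is a sketch along those lines, flagging an SRM when the probability falls below the 1% threshold mentioned above. The counts are hypothetical.

```python
from scipy.stats import chisquare

# Intended 50/50 split, observed counts per variant (hypothetical).
observed = [24_843, 25_912]
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.01:
    print(f"Possible sample ratio mismatch (p = {p_value:.4f}): investigate delivery.")
else:
    print(f"No SRM detected (p = {p_value:.4f}).")
```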
27. Other resources: For those who are interested in learning more about experimentation, here's a list of resources that I like to use. And with that, I'd like to thank you for taking this course. If you have any questions, feel free to reach out to me at Rommil at ExperimentNation.com.