Transcripts
1. Introduction - What is this class about?: Welcome to Tips for Regression Modelling. In this course, I'm planning to share with you some tips and tricks that I've learned from 20 years of using regression analysis at a professional level. I hope to share with you some useful practical skills that apply to many different types of regression methods. The idea is to transfer a set of useful practical skills that may take a long time to learn yourself. All these tips have been used in real-life applications and should be considered when you're using regression modelling. What is the aim of this course? Ultimately, to give you the skills to build better and more sophisticated regression models. Poor regression models give you bad results and may do more harm than good. However, building and estimating regression models is not just about reading a textbook. Some skills can only be built up with experience, and here I wish to share with you some of those experiences and teach you what textbooks or introductory courses may not necessarily cover. Many of the tips I will give you apply to a wide array of regression techniques, but I will focus primarily on ordinary least squares, as this is the primary regression method used in the real world. What is the target audience? The target audience is anyone using regression analysis, so that's pretty simple. However, whilst I will keep my explanations as simple as possible, you are expected to have some basic understanding of how regression works and what ordinary least squares is. If you don't know what that is, I recommend watching my introduction to linear regression first. You should have a good grounding in regression before starting here. I will also be using equations, but I'll keep them very simple and only highlight key concepts. I won't be using many subscripts, and neither will I mention the error term. What topics are covered? We'll look at a variety of topics, and I intend to update this course on a regular basis.
As of writing this, the full list of topics hasn't been decided yet. So far, I look at how to achieve flexible functional form in regression models via polynomials. This allows your regression models to be much more flexible and to fit nonlinear relationships. I will also look at how interaction effects work. These allow you to model complicated effects between different groups of data. I'll also look at how time can be used in regression analysis, specifically how the concept of dynamics can be used in your model. I look at missing values and how to deal with them in a regression model. Missing data doesn't mean you can't use it in a regression. You just need to treat it in a special way. Over time, I will add more topics and update this list. Structure: each topic consists of two parts. The first part of each video will consist of a basic theoretical overview of the issues at hand and how they might be addressed. The second part of each video is a practical demonstration using data and a statistics program. This part is intended to demonstrate the concepts from the first part. That does mean that some of the videos are a bit long. The first part is generally 10 minutes, and the second part is also 10 minutes. I could have split them up, but to avoid confusion in the topics, I thought it better to keep them together. But you don't have to watch each video in one go. Stata: I'll be using the statistics software Stata in this course to demonstrate key concepts. If you have Stata, great. If not, that's not a problem. You can use other statistics programs to achieve the same effect. I won't be explaining the Stata code I'm using in great detail. I have another course on how to use Stata itself, and that also teaches Stata code. My aim here is to demonstrate the tips and tricks I've outlined and to help you interpret the regression results.
2. Understanding Regression Modelling - Don't rush it: Before we dive into the individual sessions that will improve your regression modelling, it is worth taking a step back and talking about some general themes and ideas that everyone who uses regression should consider. The technical details of regression analysis can be learned quicker than the art of regression analysis. It is important to take a step back and think about what it is that you're trying to do. All regression models are simply statistical machinations that need some sort of interpretation. If you're working in social science, political science, business science or economic science, that interpretation can often be very complex, as it will relate to, say, human or firm behaviour. Not every data analysis is a medical trial, where outcomes and effects are usually quite easy to understand: drug A is better, no better or worse than drug B at doing something. Regression analysis, therefore, often requires the use of theory, frameworks, philosophy, common sense and rational argumentation, and sometimes even irrational argumentation, as long as it's based on underlying rational arguments. It's all quite difficult to learn, and often it is time and experience that drive the accumulation of these skills. Even me talking about them like this doesn't transfer a lot of specific skills to you. Unfortunately, you will likely need to plot your own path through this rather nebulous area of statistics. However, I would like to mention a quick, high-level, fundamental difference in approaches to regression modelling. Some regression work will be predictive, while other regression work will be explanatory. Predictive modelling is often about data mining techniques and trying to best predict the underlying data structures.
In regression terminology, we are often interested in the predicted Y values, also called Y-hat. These kinds of modelling techniques often use iterative approaches that mine data for the best possible set of explanatory variables that predict the given data. There's nothing wrong with that. This kind of analysis is often used very successfully, for example to classify pictures. Another type of analysis is explanatory modelling. Explanatory modelling is often used by social scientists to test causal models and to understand relationships in more detail. In this case, we're interested in the coefficients of a regression model, i.e. the betas. Regression models of this nature often have a strong theoretical component, as theory is used to justify causal inference. The theory-data relationship can vary by field, but explanatory modelling generally requires more theory than predictive modelling. You might think: hang on a minute, a perfectly predictive model should also be a perfectly explanatory model, right? Aren't they the same? The short answer is no. Without getting bogged down in the philosophical details, prediction and explanation are separate, distinct goals. Predictive and explanatory modelling therefore emphasise different aspects of regression modelling. For example, predictive modelling is often worried about overfitting. That is why this kind of analysis frequently splits data sets into a training data set and a hold-out or testing data set. Explanatory approaches rarely do that. Also, certain issues are of lesser importance in predictive modelling. Multicollinearity, for example, is less emphasised than in explanatory modelling, since this mainly affects the accuracy of the beta estimates. My aim in this course is primarily to outline good modelling practice for explanatory modelling. This is the type of modelling that is most often used by social scientists and others who are trying to understand relationships across variables.
Of course, explanatory modelling can also make predictions. If, for example, we understand the relationship between smoking and health, then we can predict what will happen to someone's health if they stop smoking. So what is the basic set-up of a regression in an explanatory model? In the majority of cases we're interested in explaining one dependent variable, called Y, by one or more explanatory variables, often denoted by X. So a typical set-up would look something like this: Y is equal to some constant plus X1, X2, X3, etcetera. The constant is simply the value of Y when all the Xs are set to zero. This may or may not make sense. It often doesn't, so don't worry about it too much. The Xs all have parameters. Depending on the type of regression, these parameters may be complex to interpret, but in a basic linear regression, the parameters denote linear slope coefficients. So if a1 is equal to two, then a one-unit change in X1 leads to a plus-two-unit change in Y. Explanatory modellers are often very involved in the selection of Xs. In many cases, however, explanatory modellers don't know the exact Xs they want to enter into the regression, so there is some trial and error. Of course, a regression tells us what matters more and what matters less. But explanatory modellers don't only rely on trial and error. For example, it is important to be aware of prior literature in the field. We don't want to reinvent the wheel, so it is important to always examine what people did before you. Why should your model be different to theirs? How does it improve on theirs? This is an important part of regression modelling. Explanatory modellers also use theory to guide them. Theories are strong assumptions we make about the causal underlying processes that drive relationships. The operationalisation of theories into statistical models creates a disparity between the ability to explain phenomena at the conceptual level and the ability to generate predictions at the measurable level.
Predicting data accurately does not imply that we can use such a model to validate underlying theoretical constructs. That is why regression results often start with theory and also finish with theory for explanatory modellers. Finally, the data itself often gives explanatory modellers clues as to what Xs might be important. Variables are also often grouped together, and themes might run through the relevant data. High-quality data sets with appropriate manuals make it much easier to figure out what kind of Xs we might need in our model. So what should you do before you run a regression? Well, planning is absolutely crucial. You can't spend enough time thinking about what it is you're trying to do. Quick and dirty regressions are a recipe for disaster. Ask yourself: what do you want to test? What relationships do you want to examine, and why? Even better, try to explain it to at least one other person before you start. Their reaction will tell you a lot about whether your thinking is along the right path. And trust me, it really doesn't have to be another data scientist to get useful feedback. You should also think about whether your data actually supports this kind of analysis. Data sets have improved exponentially in the last 30 to 40 years. People just didn't collect the same quality of data in the past. Still, can you achieve what you want to achieve with the data at hand? If not, are there more limited ways that you can achieve what you want? There's no point thinking about a dynamic relationship over time if your data doesn't have a longitudinal component, for example. Also, ask yourself: do you need special regression methods? Are these methods complex? Do they have limitations? Whilst most regression models look similar, the underlying mechanics can differ and lead to complications. You should be aware of such challenges. For example, probit coefficients need a transformation. That's not a big problem.
However, nonparametric survival modelling doesn't allow for a lot of explanatory variables, so that might be an issue. Next, ask yourself if someone has done this work before. If so, how will your analysis be better? Is it a different country? A new time period? Are you looking at new variables and relationships? Acknowledge and learn from those before you, and improve on what's been done. Also, think about the software that you want to use. Excel, for example, has more limitations than other, more expensive statistical software. Will that give you limitations in your analysis? Do you need to think about purchasing new or additional software? And finally, think about who this is for. Is it just for yourself? Well, then you could probably lower the bar a little bit. Is it for your manager? Is it for the public? Will important decisions be based on these results? If so, you ought to spend more time getting the modelling exactly right to make sure that you've covered your bases. Here's an example from my own field. I work mainly in one field, and there are still a lot of different scientific questions to be answered in it. But any data set I open in my field, and any regression I estimate, should probably contain a good chunk of these variables, which have been there since day one. That is because these variables have been found to be important again and again in thousands of studies in my field. Not using them would need careful explaining. If I didn't have a good story, people would doubt my results. Ask yourself: is it not the same for your field? Scientists often perform a literature review to make sure they have captured the important modelling innovations in their field. Have you done the same? Do your homework. It will pay for itself. Once you're ready to execute your modelling, make sure you do it in steps. Always start with a basic overview of the data and produce descriptive statistics for key variables. Visualise some of them.
Often a picture can reveal a lot of important statistical information. Learn more about data visualisation techniques to aid this. Make sure that you use a regression model that fits the data structure and is efficient. If you have a continuous dependent variable, it would make sense to use ordinary least squares to model that variable and not a logit model. A logit model reduces information to a one or a zero, and that would be very inefficient. Once you start modelling, start slow, and don't work your way backwards. Don't include everything all at once and see what sticks. Add key variables or variable groups one at a time and chart how your estimates change. Stop at each step and try to explain how and why your coefficients have changed. Often there's a story there. That story is part of your results. It is common, for example, to present multiple regression models in one table so that readers can see the change in coefficients for themselves and agree or disagree with your interpretation. The final regression model is not one selected on statistical indicators alone, but one that we select ourselves based on theory, common sense and statistical indicators. If there are insignificant variables, well, that might be a finding in itself. It is common to iterate analysis. Although predictive modelling does this more, explanatory modelling also iterates. What happens when you run the model by gender? What happens if you run the model for certain regions in the country? What happens if you remove one key variable? These kinds of iterative steps are often called sensitivity analysis and can play an important role in validating your final results. But don't go overboard. Iterative steps for explanatory models will still have fundamental theory behind them. So having said all of these things, what kind of things should you look out for when building regression models? Well, it turns out there are a few common issues that affect most regressions.
Missing values can be a big problem and lead to potential bias and efficiency loss in your regression. Knowing how to deal with them is important. Multicollinearity can lead to severe efficiency loss and make your beta estimates untrustworthy. Again, knowing how to spot it and deal with it will improve your modelling. Time can be a useful tool to build better and more powerful explanatory models. Many processes in life have dynamic effects. Something that happened yesterday will affect choices made today. Functional form is important. Many people think variables can only be entered linearly. That is not true. You can build wonderfully complex relationships in regression analysis via polynomials and interaction effects. Both of these will allow you to build better models without actually requiring more data or more variables. There are, of course, many more issues, but these tend to be some of the most common issues in regression modelling. So now let's go and explore these issues in more detail and learn how to build better regression models.
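As a rough illustration of the advice in this part to add key variables one at a time and watch how the estimates change, here is a minimal Python sketch. It is not taken from the course, which uses Stata. The data, the variable names x1 and x2 and the helper fit are all made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def fit(columns, y):
    """OLS coefficients for y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Hypothetical, noise-free data: y depends on both x1 and x2,
# and x2 is positively correlated with x1.
x1 = np.array([1.0, 2, 3, 4, 5, 6])
x2 = np.array([1.0, 1, 2, 3, 3, 5])
y = 1.0 + 2.0 * x1 + 3.0 * x2

step1 = fit([x1], y)        # step 1: only x1 in the model
step2 = fit([x1, x2], y)    # step 2: add x2 and re-estimate

# How much did the x1 coefficient move once x2 was added?
change_in_x1 = step2[1] - step1[1]

print("x1 coefficient, step 1:", step1[1])
print("x1 coefficient, step 2:", step2[1])
print("change:", change_in_x1)
```

In this made-up example the x1 coefficient falls once x2 enters, because in step 1 it was also picking up part of the effect of x2. Explaining that kind of movement at each step is the story the text refers to.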
3. Functional Form - How to model non-linear relationships: Linear regression models are linear in parameters. But that doesn't mean that linear regression models need to have flat linear slopes, be they upward or downward sloping. This is actually really important. Many processes in real life have curves or peaks, or ups and downs. We often call modelling such processes functional form, and it's easy to assume that linear regression such as ordinary least squares is not able to fit complicated functional form processes. That is completely wrong, and it's dangerous. Misjudging functional form in regression models can lead to severe bias in your estimates. So let me show you how you can get control of functional form in your regression models. Before we continue, let me show you a common example of a nonlinear functional form process. Age is often a variable that results in some sort of nonlinear relationship with many other variables. That's because individuals change their behaviour over their life cycle. So, for example, people often consume more leisure when they're young and old, and less leisure when in middle age. This results in a U-shaped relationship. Alternatively, earnings or employment are often inverted U-shaped: young people earn less than middle-aged people, who in turn earn more than older people. It's obvious that a straight line would not approximate such relationships very well. So let's take a step back and cast our minds back to line equations, the kind of stuff we learned at school. For example, a basic linear equation is often written in the form y = a + bx. All this says is that y is a function of x. The parameter b defines the slope of this relationship, and the parameter a is a constant, also called an intercept. In other words, when x is equal to zero, y is equal to a. Here's an example of such a linear function. Notice how, when the parameter b changes, the slope of the line changes accordingly.
Another term for such a function is a polynomial function. Polynomial functions are functions of a single independent variable, in this case x, in which that variable can appear more than once, raised to any integer power. In this case, we have a polynomial of degree one, because x is raised to the power of one. Of course, x to the power of one is simply x, so we often don't write the power number for a linear function. So that should all be pretty simple so far. How does all of this relate to regression? Well, this is exactly what regression models return to you. The coefficients on the variables are the slope parameters in the y = a + bx function. A bigger regression coefficient equals a steeper slope. A key difference between the linear function shown earlier and multiple regression models is that for multiple regression models, we can do this in many dimensions. Each variable has its own axis with y. So a regression equation of the form y = a0 + a1x1 + a2x2 + a3x3 has three slopes, one for each of the variables, and it has one constant, i.e. the value of y when all the Xs are set to zero. The key thing to understand is that this is a multidimensional model, but each of the Xs lives in its own dimension. We can look at a cross-section of this hyperplane by focusing in on only one of the Xs and the constant. Say, for example, we're really interested in the variable x3. Well, we could plot the regression relationship between y and x3 by simply graphing the function y = a0 + a3x3. This function represents only a slice of a complicated multidimensional plane, but it allows us to focus on what we're interested in and understand it better. I should add a little note here. The constant is only valid if all the other Xs are set to zero. Often this may not make sense, so the constant may take another value in a multivariate model.
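To make the idea of coefficients as slopes and of a slice through the regression plane a little more concrete, here is a small Python sketch. It is only an illustration under stated assumptions: the data are invented and noise-free, the names x1, x2, x3 and slice_x3 are hypothetical, and NumPy is assumed to be available. It is not the Stata workflow used in the course.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2 + 0.5*x3
x1 = np.array([0.0, 1, 2, 3, 4, 5])
x2 = np.array([1.0, 0, 2, 1, 3, 0])
x3 = np.array([2.0, 5, 1, 4, 0, 3])
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 0.5 * x3

# Design matrix: a column of ones for the constant, then the three Xs
X = np.column_stack([np.ones_like(x1), x1, x2, x3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a0, a1, a2, a3 = coef   # one constant and three slopes

def slice_x3(x3_value):
    """Cross-section of the fitted plane along x3, with x1 and x2 held at zero."""
    return a0 + a3 * x3_value

print("constant and slopes:", coef)
print("slice at x3 = 2:", slice_x3(2.0))
```

Holding x1 and x2 at zero is the simplest choice for the slice. Holding them at other values, such as their means, would shift the line up or down without changing its slope a3.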
Normally, the statistical software you use should sort out the value of the constant for you. Now let's step things up. The next polynomial after degree one is degree two. You'll recognise this as the quadratic equation that you had to solve in school with that horrible-looking formula. In this case, y = a + bx + cx^2, where c is the quadratic coefficient, b is the linear coefficient and a is the constant. If the quadratic coefficient c equals zero, then the entire thing collapses back into the linear function we saw earlier. Different values of b and c give different types of quadratic relationship. For example, a positive b and a negative c will result in an inverted U-shaped, concave curve, whilst a negative b and a positive c will result in a U-shaped, convex curve. Here's an example of that. Notice how the curve changes shape as the parameters b and c change from positive to negative and from negative to positive. So how is this related to linear regression? Well, the key thing to understand here is that the two main parameters, the quadratic and the linear coefficient, are attached to different versions of the explanatory variable x. One is attached to the variable x squared, and the other is attached to the normal version of x. And whilst we cannot enter the same variable twice into a regression, nothing prevents us from entering a transformed version of the same variable into the regression. So, for example, say that we have a regression along the lines of y = a0 + a1x1 + a2x2 + a3x3. This regression has three different explanatory variables. Each has a linear gradient because each only has one parameter. We can introduce a nonlinear relationship between y and one of the explanatory variables by adding a squared term to the regression. So say we're interested in x3 and think that the relationship between y and x3 might be U-shaped.
Then we can add a fourth explanatory variable to the regression, which is x3 squared. So we would end up with a regression model along the lines of y = a0 + a1x1 + a2x2 + a3x3 plus, importantly, a4 times x3 squared. The coefficients a3 and a4 and the constant will now mimic the previously seen quadratic function. The cool thing is that, because we're estimating the regression parameters, we can detect whether the relationship between x3 and y is linear by testing whether the parameter a4 is significantly different from zero or not. So there's an in-built test in this regression model that will default to a linear relationship should the quadratic relationship not work out. We can, of course, continue this concept and keep adding further polynomial terms to our regression. For example, if we suspect that x3 might have a cubic relationship with y, we could estimate the following regression, where y is equal to x1 plus x2 plus x3 plus x3 squared plus x3 cubed. In this case, we need to examine the coefficients a3, a4 and a5 and the constant to plot the relationship between y and x3. Figuring out what's going on can be hard from looking at the coefficients alone. At such high polynomials, it often makes sense to graph these coefficients. This can be done relatively simply by plotting the relevant polynomial function across the values of x and y. Most software packages will do this for you. Again, there is an in-built test here. If the values of a3, a4 or a5 are not significant, then the functional form reduces to a lower polynomial. So given that regression models can test their own functional form by including higher-degree polynomials, why don't all regression models include n-th order polynomials on their variables? The answer is part data and part theory. Firstly, polynomials tend to induce collinearity.
Transformations of x are often correlated with the original x variable, and this gets picked up in the regression. Collinearity, in turn, leads to noisy standard errors, and if you include too many polynomials of an explanatory variable, then at some point your model becomes a big mess with very high standard errors. There are ways around this by demeaning variables, but even this is not foolproof. Secondly, regression models are designed to approximate data relationships but also to relate these to theory or underlying frameworks. Often data will have little quirks that cause some data regions to behave strangely. Such regions might be picked up by higher-order polynomials and force your functional form into a particular data shape. However, it may be that these are just data artefacts that do not relate to underlying driving factors, including theoretical models. In short, not many processes in real life have more than one turning point. Linear relationships are very common. Quadratics are common. Cubics are less common. Quartics or above are almost unheard of. So often complex functional forms may simply not be supported by the underlying theory. So let's go have a look at all of this in practice. I'm going to load a small data set that contains information on hourly earnings, education and age. We'll have a look at how hourly wages evolve over the age profile. I'll be using Stata, on which I have another course, but it doesn't really matter what software we're using. The methods here are all the same. Let's go take a look. Okay, here we are in Stata, with the data already loaded. First, let's have a look at the summary statistics of the data. That's the summarize command here. And we see that we've got the variables age, hourly pay and schooling. Age ranges from 16 to 70 years. Hourly pay ranges from 0.25 to 288, and this could be dollars, pounds, euros, whatever really.
Education, in this case schooling, is measured between one and 38 years, and the average amount of schooling is 13 years. So now let's go ahead and build a simple regression model that enters age and school as continuous variables that are linear with y. We can do that pretty easily by simply typing regress, then hourly pay, our dependent variable, as a function of age and school, both measured as continuous variables. So here are the results. We have an R-squared of around 0.12. The results tell us that age and school are both positively related with hourly pay. Each year of age increases hourly pay by 0.17, and each additional year of school increases hourly pay by approximately 1.15. And finally, when both age and schooling are set to zero, the constant tells us that hourly pay is estimated at minus 8.9. In Stata, and in many other programs, we can visualise this relationship. Stata makes it particularly easy through the margins and marginsplot commands. In this case, let's plot the predicted values of hourly pay for when school is set to zero across the age range of, let's say, 0 to 70. Let's go ahead and look at that. And there we are. This graph shows the positive relationship that age has with hourly pay, assuming school is set to zero. This relationship is currently linear. And remember, we're predicting out of the data range here. Our data only has ages from 16 to 70, and we're predicting all the way to age zero. So in this case, hourly pay at age zero doesn't make a lot of sense. But that's not a problem; we're just going to ignore that part of the model. Now let's go ahead and add a quadratic age variable into our model. To do that, we need to generate a new variable that is age squared and enter it into the regression. Thankfully, Stata can do this automatically. But in other software packages you may need to manually generate the new variable and insert it as an additional explanatory variable.
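For software that does not build the squared term for you, the manual route might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the data are simulated rather than the course data set, the helper ols and the variable names are hypothetical, the standard errors are the classical homoskedastic ones, and NumPy is assumed to be available.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, classical standard errors and t-statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)     # homoskedastic covariance matrix
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

# Simulated data (not the course data): pay rises with age, peaks at 50, then falls.
rng = np.random.default_rng(0)
age = rng.uniform(16, 70, size=500)
school = rng.integers(8, 19, size=500).astype(float)
pay = -20.0 + 1.0 * age - 0.01 * age**2 + 1.0 * school + rng.normal(0, 2.0, size=500)

age_sq = age**2                               # the manually generated variable
X = np.column_stack([np.ones_like(age), age, age_sq, school])
beta, se, t = ols(X, pay)

b_lin, b_sq = beta[1], beta[2]
# In-built test: is the squared term distinguishable from zero?
print("t-statistic on age squared:", t[2])

# Shape and turning point of the fitted quadratic in age
shape = "inverted U" if b_sq < 0 else "U"
turning_point = -b_lin / (2.0 * b_sq)
print(shape, "with turning point at age", turning_point)
```

The turning point comes from setting the derivative of a + b*age + c*age^2 to zero, which gives age = -b / (2c). Whether the curve is an inverted U or a U depends on the sign of c, while b moves the turning point left or right.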
So in Stata we can do this like so: we regress hourly pay against age fully interacted with age itself, and the additional variable school. Okay, so in this regression, we now have two age variables, age and age squared. We also have the additional schooling variable as an extra explanatory variable. Our R-squared jumped up a little bit to 0.14, and all the coefficients in our model are statistically significant. Let's focus on the age coefficients. The linear age term has a positive coefficient of 0.89. The squared age term has a negative coefficient of minus 0.8. Both these coefficients are the parameters from a quadratic equation. The constant acts as an intercept, but only when the other variables are set to zero. Zero schooling is probably not a good example, so we're going to use the average schooling of 13 from now on instead. The positive and negative signs of our parameters tell us that the shape of our quadratic equation will be an inverted U shape. So let's go take a look. But let me first adjust our prediction to only our relevant data range, which is ages 16 all the way to 70, and we'll set schooling at 13, which is the average amount of schooling. Let's go estimate that. And there we are. Look at that. We've now fitted a quadratic relationship between hourly pay and age, whilst still controlling for other factors such as school. We see that peak pay occurs at approximately 50 years old. Of course, there's no reason to stop here. Let's keep going and see what happens to our model and our graph. Now let's use a cubic form by adding a further cubic term into our regression. In Stata we can do that simply by adding a third interaction variable, in this case age fully interacted with age, which is in turn fully interacted with age. Okay, and here are the results. Our R-squared didn't increase by a lot.
We also see that the cubic age term is not statistically significant in our regression model. So this may be a good indication that we should stick with the quadratic form. But let's go take a look at how all of this looks on a graph anyway. From the coefficients alone, we see that the linear term is positive, the quadratic term is negative and the cubic term is positive again. So I'm expecting to see an inverted U-shaped curve that curves back up into a positive. Let's take a look. So here is the estimated relationship with a cubic polynomial. It looks pretty similar to the previous quadratic relationship. Peak hourly pay still occurs at age 50, but arguably there's a little bit less decline now at the higher ages. Lastly, let's also look at a quartic function. We can do that by simply adding another age polynomial to our explanatory variables. And here we are. That's regress hourly pay against age interacted with age, interacted with age, interacted with age. So this is going to turn into a polynomial of degree four. Let's estimate that, and here's our model. We see again that our R-squared didn't change a lot, but we do see that all four age terms, age, age squared, age cubed and age to the power of four, are now statistically significant. So this suggests that the cubic approximation wasn't quite right, but neither may the quadratic approximation be right. The signs suggest that our curve is first going to go down, then up, then down and then up again. So let's see what actually happens in our limited data range between the ages of 16 and 70. Okay, so here's a visualisation of the regression relationship we estimated with a polynomial of degree four. We see that things are a little bit more complicated now between age and hourly earnings. First, hourly earnings rise rapidly until around the age of 40, after which they level off and peak at age 50. Then hourly earnings decline, but after 60 they start to rise again.
So this is interesting, as it might suggest that those left working after the age of 60 or so might be a special kind of people who maintain higher wages whilst others start to retire. So there we are. Everything we've accomplished was accomplished with linear regression using ordinary least squares. These kinds of regressions are sometimes called polynomial regressions, but really it's just ordinary least squares with a more complicated explanatory variable set. So you might ask, why don't we do this all the time? Well, apart from such models not always being supported by a theoretical framework, we often also don't need so much detail. If age, for example, is not our primary focus, then simply approximating, let's say, the up-and-down relationship with a quadratic fit is probably good enough. Also, every time we add another polynomial, we increase our standard errors. Take a look at the variance inflation factors from the first and fourth models. This is an indicator of multicollinearity. So here's the first model, and here are the variance inflation factors, and we see they are all relatively low. Here's the final model, and here are the variance inflation factors of that model, and we can see that these numbers do shoot up. So these numbers give an indication of collinearity. Higher numbers mean standard errors get bigger and bigger. In general, numbers below 50 or maybe 100 or so tend to be acceptable. In our first model, the numbers are around one, which is extremely good, and in the final model the numbers are just huge. This means there's massive multicollinearity going on in that model. And that also means that the standard errors are much higher. We can see that by comparing the standard error of the linear age term from both models. So in the final model, the standard error on age was 0.85, and in the first model it was 0.7. That's a huge difference.
So by adding more and more age terms, we massively increased our standard errors, and if we keep adding higher-order age polynomials, very soon we won't be able to see anything anymore, as our standard errors would just explode. So that is a big disadvantage. Nonetheless, hopefully you can see how using polynomial terms in your regression model can lead to significant improvements in your modelling. A more advanced version of this replaces polynomials with fractional polynomials. So, for example, we don't only need to raise variables to integer powers, such as to the power one, to the power two or to the power three. We can also raise terms to fractional powers, such as something to the power 0.5 or another non-integer power, and this creates even more complex shapes. As you work more with regression analysis, you'll come to realize that functional form is an incredibly important component in model building, and using polynomials to approximate nonlinear data relationships is a vital trick to know.
4. Interaction Effects - How to use and interpret interaction terms: Like polynomial variables, interacted variables are an important component of regression model building. Interacted variables, often called interaction effects, give you an important measure of control over your functional form. They allow you to fine-tune the behavior of certain groups in a regression by giving them different trends or effects compared to other groups. This can greatly expand your understanding of the underlying data relationships. For example, there could be significant behavioral differences between certain groups of people or even companies. Gender is a common variable to be included as an interaction effect, since often men and women are observed to experience certain events differently. A problem with interaction effects is that they're not always well explained, and they can be hard to interpret when first used. But they can be incredibly powerful when used properly. So what is an interaction effect? An interaction effect in a regression model is when we multiply two or more variables with each other and then include this as a new variable in the regression equation. In fact, the polynomials we learned about earlier are interaction effects. They are the same variable multiplied with itself, but we can also multiply variables with other variables, and that is what we mostly refer to when we use the term interaction effects. So in a regression equation, an interaction effect would look something a little bit like this. The dependent variable y is a function of the constant and the explanatory variables x1, x2 and x3, and then finally a fourth explanatory variable, that is x2 and x3 multiplied with each other and entered as one final variable. Note that the final interaction variable is a new variable. The parameter a4 is the coefficient on the interaction term, and that's what we'll focus on in this session.
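Written out, the spoken equation above would look roughly as follows. The labels a_0 for the constant and \varepsilon for the error term are my own notation; the video only names the parameters a2, a3 and a4 explicitly.

```latex
y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 \,(x_2 \times x_3) + \varepsilon

% Effect of a one-unit change in x_2, holding everything else fixed:
\frac{\partial y}{\partial x_2} = a_2 + a_4 x_3
```

The second line shows why a4 cannot be read on its own: the effect of x2 is a2 plus a4 times whatever value x3 takes, so a4 is a deviation from a2 rather than a standalone effect.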
However, interpreting this parameter is not straightforward because it depends on how x2 and x3 are measured. The interpretation of a4 is also a function of the coefficients a2 and a3. In layman's terms, this is because regression analysis already controls for the effects of x2 and x3, so any effect of the interaction term must be above and beyond these two effects. That's why the parameter a4 often shows the deviation away from the other two variables rather than some sort of standalone effect. It'll make more sense in just a moment when we look at some examples. So let's have a look at an example; that's usually the best way to go about learning how interaction effects work. Imagine that we have a simple regression relationship between the dependent variable y and x of one. There's also a constant of one in this relationship. A regression model would return values that look something like this: y is equal to a constant of one plus one times x. The variable x can represent any continuous variable, maybe age or income or education or temperature. It doesn't really matter. Now let's go ahead and plot this on a graph. Visualizing this kind of setup will help us understand what an interaction effect is. So here's the regression equation visualized. The relationship between these two variables is positive one, just like the coefficient from earlier. Y increases by one for every one unit of x, and the line intercepts the y-axis at one. So far, this should be pretty simple. Next, let's include a dummy variable in our model. Let's set the coefficient on this dummy variable to also be one. So the new equation would look something like this: y is equal to one plus one x plus one d. So this says that the dummy group is always one unit above the non-dummy group at any point on the x spectrum. We can represent this graphically by two parallel lines, with a difference of one on our graph. Both groups have a relationship with y of one.
So for each x increase, they both increase in y by one. However, one group sits on a higher level, as it has an additional intercept of one, and therefore intersects the y-axis at a value of two when x is zero. So this is a more complex model that includes a level effect, but it should still be relatively easy to understand. Now let's make it even more complicated. Let's include an interaction between x and d. In other words, we're going to multiply the x variable with the dummy variable and include it as a separate term. The new variable, let's call it x1d1, has a coefficient of 0.5, and our new regression becomes: y equals one plus one x1 plus one d1 plus 0.5 times x1d1. So what does all of this mean? Well, this interaction effect measures the deviation in the gradient for the dummy group. So, in other words, the gradient between x and y for the dummy group is steeper. By how much steeper? Well, by an additional 0.5. So again we can visualize this. The original relationship between x and y is represented by the blue line and has a slope of one and an intercept of one. However, people in the dummy group have a diverging relationship between x and y. Their gradient is 0.5 more, and this is represented by the green dotted slope. But the green dotted slope is not the full effect. The green dotted line is only the regression-controlled relationship of the interaction term, so we need to add this back into the original relationship. Therefore, the relationship for the dummy group now becomes as follows: one plus 0.5 equals 1.5. So the dummy group's gradient, or slope, is 1.5. And also remember the dummy variable itself had a coefficient of one. This means that those in the dummy group are always one higher, and this is the level shift from earlier, denoted by the dotted red line. So we need to add that in also, and finally we end up with the red solid line.
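The arithmetic of this worked example can be checked with a few lines of Python. This is only a sketch of the made-up equation used above (constant 1, slope 1 on x, level shift 1 for the dummy, interaction 0.5); the function names are my own.

```python
def predict(x, d):
    """Predicted y from the example: y = 1 + 1*x + 1*d + 0.5*(x*d)."""
    constant, slope_x, shift_d, slope_xd = 1.0, 1.0, 1.0, 0.5
    return constant + slope_x * x + shift_d * d + slope_xd * (x * d)

def slope(d):
    """Change in predicted y for a one-unit increase in x within group d."""
    return predict(1, d) - predict(0, d)

for group, label in [(0, "non-dummy group"), (1, "dummy group")]:
    intercept = predict(0, group)
    print(f"{label}: intercept = {intercept}, slope = {slope(group)}")
```

The non-dummy group comes out with an intercept of 1 and a slope of 1, and the dummy group with an intercept of 2 and a slope of 1.5, which corresponds to the blue line and the red solid line described above.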
That red solid line is the relationship between x and y for the dummy group, so that is an interaction effect. Using it in a regression allowed us to build a more complicated model that distinguishes two different groups not only in levels but also in slopes, so interaction effects are a great way to build more complex behavioral models. You might have already inadvertently used them in your own regression analysis. If you've ever run a regression for one group of observations and then the same regression for another group of observations, you've basically interacted every variable in that model with a group dummy. There are multiple types of interactions you can achieve in a regression model. These will be a function of how your explanatory variables are measured. Variables are generally measured in two ways: they can be continuous or categorical. The categorical version could consist of just a dummy variable, or it could consist of many categories. A continuous-by-continuous interaction is something we've already explored with polynomials. They can be very useful in allowing us to model nonlinear slopes. Another type of continuous-by-continuous interaction involves two different continuous variables which measure different things. These types of interactions are relatively rare to see in the real world. They're quite hard to fully interpret and visualize. However, they can be used in the right circumstances. A little tip on visualizing these effects: you can use a contour plot to visualize the complicated three-dimensional elements of this interaction. I won't cover this here, but if you ever need to use this, just Google contour plot and interaction effect. Next, the categorical-by-continuous interactions. These are probably the most common type of interaction, as they allow different groups to have different trends. Often that is what we're interested in. Another type of interaction is a categorical-by-categorical interaction. These are also commonly used, but relate to group
mean differences as opposed to trend differences. A common application is with time dummies. When you have a before time and an after time and two groups, this interaction will result in something called the difference-in-differences estimator, which is often used in evaluation methodologies. Beyond this are higher-order, more complex interactions that multiply more than two variables together. These can be very hard to deal with, but again, they are used by professionals to model very complex data patterns. So now let's go to Stata and take a look at some examples of this. As always, it doesn't matter whether you use Stata or not; any software can do this. But you may need to generate the new variables yourself and also estimate predictions and plot those predictions. In Stata, you don't need to do that, as it has a lot of automated features. Okay, so here we are in Stata. Let's load the auto training data that contains some information on car prices and car characteristics. It's only a small data set with 74 observations, so it's probably not suited to complicated interaction effects. But it doesn't matter too much. We'll ignore standard errors and statistical significance in what we do. So let's go ahead and load this. Okay, now that the data is loaded, let's go straight into it and estimate a baseline regression so that we have some sort of reference point. I'm going to regress car prices against a few explanatory variables, including mileage, whether the car is domestic or foreign, and its repair status. Let's go ahead and do that: regress price against mpg, foreign and the repair status. Both of the last two variables are indicator variables, and here are the results. We see that miles per gallon has a negative relationship with car prices. Higher mileage results in lower prices. Foreign cars are more expensive than domestic cars, and cars with certain repair statuses are also more expensive.
We don't have any information on what the repair categories actually are, so we'll just ignore them for now. Let's focus on mpg, and let's first visualize the linear effect that we estimated. We can do that in Stata by calling the margins command and then using marginsplot. So let's go ahead and do that. Okay, and here we see the relationship between our dependent variable, car prices, and the explanatory variable, miles per gallon. The slope on this blue line is minus 299. It's the same coefficient from our regression model, so this is all pretty easy so far. Now let's go ahead and introduce an interaction effect. We'll begin with a simple question: does the relationship between car prices and mileage differ by whether a car is foreign or domestic? We can answer this question by interacting the variable mpg with the variable foreign, and in Stata we can do this pretty easily by using the full factorial interaction operator, the double hash (##). In other statistical software packages, you may need to code this up yourself beforehand. This will now enter mpg, foreign and mpg times foreign as three separate variables in our regression. So let's go ahead and do that: regress price against the continuous variable miles per gallon, fully interacted with the indicator variable foreign, and we'll also include the repair records as additional controls. Let's estimate that, and there we go. The coefficients here tell us the following. Each unit increase in mpg leads to a $371 decrease in car prices for the base group, domestic cars. Foreign cars are $2670 cheaper than non-foreign cars, and the relationship that miles per gallon has with car prices is different for foreign cars. We can see that on the coefficient of the interaction effect: the slope is 161 more. Remember, the original slope is minus 371, so we add 161 to this, and the foreign slope is now minus 210. In other words, for each unit increase in mpg, the car price of foreign cars decreases not by $371 but by $210.
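The slope calculation just described can be written compactly. This only restates the numbers quoted above; writing foreign as a 0/1 indicator in the formula is my notation.

```latex
\frac{\partial\, \text{price}}{\partial\, \text{mpg}} = -371 + 161 \times \text{foreign}

\text{domestic (foreign} = 0\text{):} \quad -371
\qquad
\text{foreign (foreign} = 1\text{):} \quad -371 + 161 = -210
```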
Again, let's go ahead and visualize this using the margins and marginsplot commands. There we are. We now see the two slopes, one for domestic cars and one for foreign cars. The foreign car slope is shallower, suggesting that changes in mpg result in smaller changes to car prices when compared to domestic cars. So this is the power of an interaction effect. It allows us to give different groups in our data different relationships with y. Perfect. Of course, we don't have to make the slope linear. We can use our previously learned knowledge of polynomials to give both of these groups different quadratic trends, for example. Let me show an example of that. So here's a regression where we have a quadratic trend, which is indicated by c.mpg fully interacted with c.mpg, and now we're simply adding a third interaction with the indicator variable foreign. We'll also keep controlling for the repair record as a separate variable. Let's have a look at that. Okay, the regression coefficients now become a bit difficult to interpret in a standalone form because there are quite a few of them. But visualizing this model quickly reveals what's going on. Let's take a look. The graph now shows the different quadratic fits that domestic and foreign cars have with our dependent variable, car prices. Do remember all of this is taking place in a multidimensional space. All of this is happening in a regression where we're controlling for other factors. So this is a really powerful way to get very flexible modelling into a basic ordinary least squares regression. Next, let's take a look at categorical-by-categorical interactions. This data set doesn't really support this very well because it doesn't have a lot of categorical variables. I'm going to first recode the continuous mpg variable into a new categorical variable with three categories: low, medium and high. Let me go ahead and do that. The new variable is called mpg2, and now that is done.
I'm going to interact mpg2 with the variable foreign. So let's regress price against the indicator variable mpg2 and interact that with the indicator variable foreign, and we'll also control for the repair record. Let's estimate that. Okay, and here are the results. We have multiple categories in one of our interacted variables, so that means there are multiple interaction effects going on here. Let's try reading one of them. Cars in mpg category two cost $1200 less than those in category one. Cars which are foreign cost $2400 more than non-foreign cars, and foreign cars that are in mpg category two cost $1944 less compared to foreign cars in mpg category one. The same kind of interpretation takes place for mpg category three. We can visualize this again in Stata by using the margins commands. Hopefully, this will make more sense. So here we are. This graph tells us how the relationship between mpg categories and car price evolves. Low mpg categories seem to have a higher price, and higher mpg categories seem to have a lower price. However, the foreign interaction allows these results to differ by foreign status. We see that foreign cars in a low mpg category have much higher prices than non-foreign cars. So again, the interaction effect allows us to model complicated effects in our regression, in this case with categorical variables. Finally, we can also interact a continuous variable with another continuous variable. So let's take a look at that. In this last example, I'm going to interact mileage with car weight. Let's see what happens. So regress price against the continuous variable miles per gallon, fully interacted with the continuous variable weight. And finally, I'm also controlling for the foreign status and the repair record. We see that the coefficient on mpg is positive, and so is the coefficient on weight.
But the interaction effect is negative. This tells us that cars with a higher weight have lower miles per gallon slopes. Again, the best way to understand this is to visualize it. It's a little bit harder with continuous variables, since we can't visualize the entire spectrum graphically, so normally we choose a few values of one of the two continuous variables. In this case, let's plot the entire miles per gallon relationship with car prices for a few weight values. We can do that with the margins command, like so. Let's execute that, and here are the results for the continuous-by-continuous interaction. What we see is that lighter cars have a positive slope for mpg with car prices, whilst heavier cars have a negative slope. So this is interesting, as it shows that the relationship between mpg and car prices really depends on the car's weight. So again, this is a powerful addition to the model that reveals potentially crucial information, and that concludes this overview of interaction effects in regression models. They really are a great way to reveal complex data patterns, and they help you build better regression models. You should consider using them when your data can support this kind of analysis.
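The sign pattern in the continuous-by-continuous example can be summarised in one line of algebra. The coefficient labels b_0 to b_3 are my own; the video reports only their signs (positive on mpg and on weight, negative on the interaction), not their values, and the other controls are left out of the formula for brevity.

```latex
\text{price} = b_0 + b_1\,\text{mpg} + b_2\,\text{weight} + b_3\,(\text{mpg} \times \text{weight}) + \dots

\frac{\partial\, \text{price}}{\partial\, \text{mpg}} = b_1 + b_3\,\text{weight},
\qquad b_1 > 0,\; b_3 < 0
```

With these signs the mpg slope is positive for light cars and turns negative once weight exceeds b_1 / |b_3|, which matches the pattern of lines seen in the plot.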
5. Using Time - Exploring dynamics: Let's see how we can use time information to our advantage. Time is an important aspect of data analytics that can be easily forgotten or used improperly. But of course, time is everywhere, and people and firms make important decisions based on time. Including time in your regression modelling can make a big difference to the kind of results you get and whether these results are meaningful. So let's start at the beginning and discuss where time information can come from. Many data sets are cross-sectional in nature. That means that data is gathered at one point in time. Actually, this is not quite correct. It means that each observation has only one data entry at some particular time point. Surveys, for example, are not always carried out in one day. Big national surveys can take many months to complete. Observations enter the survey at different time points, but importantly only once. So the key thing is that respondents are only interviewed once with no follow-up. Panel data, or longitudinal data, on the other hand, follows up observations with a second, third, fourth, fifth, et cetera, interview. Panel data is generally very similar to cross-sectional data, except with the addition that observations are traced over multiple time periods. Often they're some of the best data sets that scientists can obtain. But they're also very expensive, making them rarer than cross-sectional data. Because panel data includes repeated observations on the same individuals, firms, countries, regions, whatever really, often they need to be stored in a particular way for computer programs to use them properly. We call this particular way the long form. Usually unformatted data sets will come in something called the wide form, and you'll probably need to do some recoding and reformatting of the data to actually access the time elements properly. So what are long and wide form data?
Wide-form data is presented with each different data variable in a separate column. When wide data contains time, then each variable is repeated over time, so we have variable one, variable two, variable three, variable four, where the number indicates the time element. Long data is presented with one column containing all the values and another column listing the time of that particular value. When long data contains time, then it is normal for the column to contain all the time values for one specific variable. So here are two examples of wide and long data. In the wide example, we have three observations and we're measuring a variable called var1 and a variable called var2. We also have an identifier called id. Both var1 and var2 are measured in two time periods, time period one and two. And this is represented by the final suffix in the variables. So var1 becomes var11 and var12, and var2 becomes var21 and var22. When converting this data to long form, the time aspect of the variables will disappear, and we'll be left with only var1 and var2. However, now the id variable has been expanded, and we observe the same individuals multiple times in our data. We also have a new variable that identifies the time component. That is how wide formats are converted to long formats. Most statistical software packages will do this for you, but you need to make sure that you understand the data transformation process. So once the data is in long format and your software package is aware of that, you can start including the time elements of the data in your regression models. Before we go into that, it is important to be aware of the additional notation that is required in any regression equation that you might write. Regression models with time often require an additional subscript, subscript t, which indicates time.
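The wide-to-long conversion described above can be sketched in plain Python. The column names id, var11, var12, var21 and var22 follow the example; the values and the helper name wide_to_long are made up for illustration, and real statistical packages have their own reshape tools for this.

```python
def wide_to_long(rows, stubs, periods, id_key="id"):
    """Turn one row per individual into one row per individual and period.

    A wide column such as "var12" is read as stub "var1" plus period 2.
    """
    long_rows = []
    for row in rows:
        for period in periods:
            record = {id_key: row[id_key], "time": period}
            for stub in stubs:
                record[stub] = row[f"{stub}{period}"]
            long_rows.append(record)
    return long_rows

# Hypothetical wide data: three individuals, two variables, two time periods.
wide_data = [
    {"id": 1, "var11": 10, "var12": 11, "var21": 5, "var22": 6},
    {"id": 2, "var11": 20, "var12": 21, "var21": 7, "var22": 8},
    {"id": 3, "var11": 30, "var12": 31, "var21": 9, "var22": 9},
]

long_data = wide_to_long(wide_data, stubs=["var1", "var2"], periods=[1, 2])
for record in long_data:
    print(record)
```

Three wide rows become six long rows: the time suffix has moved out of the column names into its own time column, and each id now appears once per period.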
So, for example, a regression of y against three explanatory variables, where the variables are measured repeatedly over time, would look something like this: y_t equals a constant plus x1_t plus x2_t plus x3_t. In this particular equation, we haven't actually made specific use of the time elements; we're simply signalling that they exist in this data. Also note that I'm omitting the standard subscript i at the bottom. I don't want to make this all about equations. I want to keep things relatively simple. Any manipulations that include time elements will often be signified by changes to the t subscript. So, for example, a very common way to use time is to include lags. We'll examine this in more detail in a moment, but we can denote a lag in our regression equation like so: y_t equals a constant plus x1_t-1 plus x2_t-1 plus x3_t-1. So, in this case, we're regressing the current value of y against three explanatory variables that are lagged by one time period. So t minus one represents data from the past. Conversely, t plus one would indicate that we're regressing today's values on data from tomorrow. So now let's take a look at how we can incorporate time elements of data into a basic regression model. The first thing to say is that there are entire books and courses available on this stuff. Time can be used in many complicated ways, and I will focus only on some standard ways that will allow you to gain more control over a basic regression model, such as ordinary least squares. I won't focus on more advanced modelling techniques, which require a lot of time to explain. So the first way to include time is simply to use it as a variable in the regression. Many processes in life are related to time. For example, income will increase over time due to inflation or GDP growth. A time variable can help us account for that, so we can amend our previous regression setup by including time as a variable.
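For reference, the two spoken equations above, the contemporaneous version and the one-period lag version, could be written as follows. The coefficient labels a_0 to a_3 and the error term \varepsilon_t are my notation, and the individual subscript i is left out, as in the video.

```latex
% Contemporaneous values only
y_t = a_0 + a_1 x_{1,t} + a_2 x_{2,t} + a_3 x_{3,t} + \varepsilon_t

% All explanatory variables lagged by one period
y_t = a_0 + a_1 x_{1,t-1} + a_2 x_{2,t-1} + a_3 x_{3,t-1} + \varepsilon_t
```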
So y_t is now a function of a constant, x1_t, x2_t, x3_t and then finally t1. The new variable t1 is a measure of time, and it has no t subscript since it is t itself. The parameter a4 now measures the time trend, or how y evolves conditional on the other x's. If t is measured as a continuous variable, then this is referred to as a linear time trend: for every one unit increase in time, y changes by a4 units. Time trends don't need to be linear, though. You can treat this variable just like a normal variable and include more complicated functional forms via higher-order polynomials. Remember the quadratic and cubic relationships; I have a separate video on that. Alternatively, if you don't want to treat time continuously, you could treat it as a categorical variable and simply enter a lot of time dummies. In that case, t gets split into many dummy variables, where each represents, say, a year, and the parameters on the dummy variables tell you how y evolves in each time period in reference to a base time period. Lastly, don't forget that you can also interact time with explanatory variables. It might be the case that men and women are on different time trajectories, and an interaction effect would allow you to model that. So if you wanted to make sure that the explanatory variable x3 has a different trend in the regression model, then your new model would look something like this: y_t is a function of a constant plus x1_t plus x2_t plus x3_t, plus the time variable, and then plus the time variable multiplied by x3. In this case, the parameter a5 measures the trend deviation for group x3 away from the base trend that is given by a4. So that's the first way to deal with time: enter it as a standalone variable, which allows you to model trends in y over time. And even if you're not interested in the trends per se, modelling them properly may be important, as they could affect other coefficients in your regression model.
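The trend model and its interacted version described above can be written out as below. The parameter names a_4 and a_5 come from the video; a_0 and \varepsilon_t are my labels, and the time variable is written simply as t.

```latex
% Linear time trend
y_t = a_0 + a_1 x_{1,t} + a_2 x_{2,t} + a_3 x_{3,t} + a_4\, t + \varepsilon_t

% Time trend that is allowed to differ with x_3
y_t = a_0 + a_1 x_{1,t} + a_2 x_{2,t} + a_3 x_{3,t} + a_4\, t + a_5\,(t \times x_{3,t}) + \varepsilon_t

% Implied trend in the second model
\frac{\partial y_t}{\partial t} = a_4 + a_5\, x_{3,t}
```

If x3 is a dummy, the base group has trend a4 and the other group has trend a4 + a5, which is the trend deviation described above.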
The next way that you can use time in a regression model is by transforming the explanatory variables. A so-called static regression model is one that regresses y against x using contemporaneous values only. A common alternative specification is the so-called finite distributed lag model. This model allows y to be related to lagged values of x. Often multiple lags are used in addition to the contemporaneous value. So a finite distributed lag model of order two would look something like this: y_t is a function of a constant plus the explanatory variables x1_t and x2_t, and then x3_t, x3_t-1 and x3_t-2. So in this case, we have two contemporaneous explanatory variables, x1 and x2, and the explanatory variable x3 is included three times: once as a contemporaneous variable, once with a lag of one and finally once with a lag of two. This can be a very powerful model because it tells us how things that happened yesterday affect today's outcome. This allows modellers to explore the concept of dynamics over time. Many processes in life take time to achieve their full effect. Not everything happens instantaneously, and this is what this model allows us to investigate. The nice thing here is that we estimate the independent effect of x3 on y today for each time period: today, yesterday and, let's say, two periods ago. And we can get the total effect of x3 on y by adding up all the coefficients across all the x3 variables. So the total effect of all three time periods of x3 on y in this case would be a3 plus a4 plus a5. So that's pretty nifty. Distributed lag models can be very powerful in uncovering complicated dynamic effects that would be impossible to observe with cross-sectional data. You can also include positive lags in your model, in other words leads or future values. There's nothing that prevents this.
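If your software does not build lags and leads for you, the idea can be sketched in plain Python. Everything here is illustrative: the tiny panel, the helper name shift and the coefficient values are assumptions, not output from the course. The helper looks up the value from the required time period for the same individual and returns None when that period is missing, which is one reason observations are lost once lags are added to a panel with gaps.

```python
def shift(panel, variable, periods):
    """Map (id, time) to the value of `variable` `periods` time units earlier.

    periods > 0 gives a lag and periods < 0 gives a lead. The result is None
    when the required time period does not exist for that individual.
    """
    lookup = {(row["id"], row["time"]): row[variable] for row in panel}
    return {
        (unit, time): lookup.get((unit, time - periods))
        for (unit, time) in lookup
    }

# Hypothetical long-form panel; individual 2 has no observation at time 2.
panel = [
    {"id": 1, "time": 1, "x3": 4.0},
    {"id": 1, "time": 2, "x3": 5.0},
    {"id": 1, "time": 3, "x3": 7.0},
    {"id": 2, "time": 1, "x3": 1.0},
    {"id": 2, "time": 3, "x3": 2.0},
]

lag1 = shift(panel, "x3", 1)    # x3 at t-1
lag2 = shift(panel, "x3", 2)    # x3 at t-2
lead1 = shift(panel, "x3", -1)  # x3 at t+1

# Only rows with both lags available can enter a model of order two.
usable = [key for key in lag1 if lag1[key] is not None and lag2[key] is not None]
print("rows usable with two lags:", usable)

# Total effect of x3: sum of the coefficients on x3 at t, t-1 and t-2.
a3, a4, a5 = 0.25, 0.5, 0.125  # made-up coefficient values
total_effect = a3 + a4 + a5
print("total effect of x3 on y:", total_effect)
```

In this toy panel only one of the five rows has both lags available, so a model of order two would be estimated on a single observation.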
This might seem a bit strange at first because we often assume that x causes y. If x is measured in the future, how can it affect y today? But remember that regression is about correlation, not causation. Moreover, people, firms and other actors can look into the future and change behavior today based on future events. If somebody expects to lose their job tomorrow, they'll probably adjust their behavior today. So it is not unheard of to include future leads in a model. The operation is exactly the same as for a finite distributed lag model, except that you include leads instead of lags. However, the interpretation requires a bit more careful thought. Such models are less common in the real world, but they can be useful to uncover even more complicated dynamic effects. So keep them in mind. Lastly, another way to include time in a regression model is to transform the entire regression equation, including the dependent variable y. One common transformation is the difference transformation. So rather than regressing the level of y against the level of x, this setup regresses the change in y against the change in x. In equation form, such models are often written with a delta, where the delta signifies that we're looking at changes as opposed to levels. This kind of setup asks a slightly different question, but it's often used by data analysts to remove unobserved time-invariant effects. It's a little complicated to explain, but the differencing in this equation lets a part of the error term be wiped out, and this can be a very powerful and appealing transformation. It comes at the cost of one time period of data but has the benefit of allowing users to be slightly more causal in the interpretation of the results. But if you have only two time periods, it does mean that 50% of your data is thrown away to achieve this transformation. So that can be quite a large cost.
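The way differencing removes a time-invariant effect can be shown in three lines. This is a standard textbook sketch with a single explanatory variable rather than anything shown in the video; the individual subscript i and the symbol u_i for the unobserved time-invariant part of the error are my additions.

```latex
% Level equation today and one period earlier
y_{i,t}   = a_0 + a_1 x_{i,t}   + u_i + \varepsilon_{i,t}
y_{i,t-1} = a_0 + a_1 x_{i,t-1} + u_i + \varepsilon_{i,t-1}

% Subtracting the second line from the first: a_0 and u_i cancel
\Delta y_{i,t} = a_1\, \Delta x_{i,t} + \Delta \varepsilon_{i,t}
```

Because u_i does not change over time, it drops out of the differenced equation. The first period of each individual has no earlier value to subtract, which is the one time period of data the transformation costs.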
You can also include lags or even level effects in this model, but this raises the complexity even more. I suggest using this model carefully, but keeping it in mind. Significant changes between the level equation and the difference equation often mean that there are significant hidden processes going on that need further examination and thought. So now let's head over to my favorite statistical software, Stata, to show you some examples. As always, it doesn't matter what software you use. The methods are all the same. Here we are in Stata, and I'll first load a training data set that has a longitudinal component. Let's go ahead and do that. Okay, and that's now loaded. Next, let's run describe to see what the data is all about. Describe tells us that this data set looks at the labor market outcomes of a group of women aged 14 to 26 in 1968. We have a variety of indicators that include personal characteristics and also labor market outcomes such as employment, job type and wages. Let's assume that we're interested in how hourly wages are related to various explanatory variables. In this little analysis, the first step is to check what kind of data structure we have. So let's go ahead and run browse to see the raw data. And here we can see the underlying raw data. We see that it's already formatted in long form. We have multiple observations per individual in the id column, and the year variable identifies in what time period the data was gathered for each individual. Great, so we won't have to do any complicated reformatting from wide to long. So let's close this and tell Stata that this data is in long form, and in Stata we do that with the xtset command. We've now told Stata that our data is in long form. We've told it that the time variable is the variable year, which ranges from 1968 to 1988 but has some gaps, and that the panel variable is the variable idcode. Let's tabulate the time variable to see what we have. So here we can see that our
data set consists of around 28,000 observations, and each year has approximately 1,000 to 2,000 observations. Note that this is a common feature of panel data. People drop in and out of waves all the time. We don't have a perfectly balanced data set; we have a lot of gaps, both in time, for example the year 1974 is missing, and in repeated measures. That brings its own problems, some of which you'll see later. So because we have an unbalanced data set, it's worth checking this out a little bit further. In Stata we can do that with the xtdescribe command, and xtdescribe tells us that our data set consists of around 4711 individuals. We also have 15 time periods, and we see that around 50% of the individuals in our data are observed for only five years or less. So most women only have a short observation span in our data set, and we'll see that this will limit our ability to run complicated regression models with long lags. We also see information on the 10 most common data patterns, but their frequency is relatively small, so I wouldn't say that there's anything strange going on here. For example, it is common for many people to start a survey and then to drop out and never come back. And this is the most common pattern, but it constitutes less than 3% of our data. So now let's go ahead and run a regression. Let's build a basic model that doesn't take time into account at all. Let's run a regression of log hourly wages on whether somebody lives in the South, their education grade and their age. The regression would look something like this: regress log wages against the variables south, grade and age, and we need to cluster our standard errors here because of the repeated nature of our data. If we don't do that, the standard errors will be wrong. This estimator is called the pooled ordinary least squares estimator. Let's go ahead and run that regression.
We can see that we've got 28,000 observations in that regression, an R-squared of around 25%, and all variables are statistically significant in explaining current log hourly wages. So let's focus on living in the South for a moment. Women who live in the South appear to have lower wages, by circa 14%. So let's go ahead and start including time in our analysis. We could include a time trend in our regression model by inserting the variable year as an explanatory variable. Let's run a new regression of log wages on south, grade and age, and include year as an explanatory variable. The results suggest that time is not an important explanatory factor here. This is not actually too surprising, because the dependent variable has already been GNP detrended by the analysts. We can visualize the coefficient on time by executing the margins and marginsplot commands in Stata. Let's go ahead and do that. And here we can see from the resulting plot that there is a shallow upward slope of log hourly wages over time, but this is not statistically significant because we have rather large confidence intervals around our line. Had the data not been detrended, this would probably be much more statistically significant. However, in this case, time was included as a linear parameter. There's no reason to keep it linear. We can also include time as a nonlinear parameter by using higher-order polynomials, or we can split time into a categorical variable. Let's do the latter and visualize it. In Stata that's very easy to do. We can simply add the i. prefix to year, and Stata will turn it into a categorical variable automatically. In other statistical programs, you may have to generate the dummy variables first before doing this. And here we see a much more complicated time pattern emerge, one that suggests that there is annual variation in log hourly wages, but also one that suggests that there might still be some kind of business or economic cycle taking place here.
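A minimal sketch of the two ways of including time described above, again assuming the nlswork example data with xtset already declared. The at() grid for the plot is an illustrative choice, not something specified in the session.

```stata
* Time as a linear trend
regress ln_wage south grade age year, vce(cluster idcode)

* Plot predicted log wages over time (year is stored as 68, 69, ..., 88)
margins, at(year=(68(4)88))
marginsplot

* Time as a categorical variable: the i. prefix creates the year dummies
regress ln_wage south grade age i.year, vce(cluster idcode)
margins year
marginsplot
```

With i.year, Stata omits one year as the base category, so each year coefficient is read relative to that base year.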
There also seems to be some cyclicality in our data every few years. But now let's continue and look at a distributed lag model. This model includes lags of some or all of the explanatory variables. In Stata this is very easy to do. In other programs, you may have to create the relevant lagged variables first. So let's run a model where we lag the variable south. We'll include the current time period, the previous time period and also the time period before that. In Stata we can do that very easily by including the L. prefix before the variable. This will tell Stata to lag that variable, and in this case we're going to ask Stata to lag the variable from zero all the way to two. And here are the results. The results are interesting. They tell us that living in the South today doesn't affect hourly wages significantly, but living there last year does, and living there two years ago also affects today's wages, but somewhat less so. So it looks like living in the South has a dynamic relationship with current hourly wages. Any effect is not apparent immediately but takes time to manifest itself. Specifically, the peak negative effect of being in the South comes after one year. We can compute the total effect across all years by simply adding up the coefficients, and in Stata we can do that using the lincom command. So let's have a look. And here we see that the total effect of living in the South on current hourly wages is approximately minus 17%, but most of that is experienced after one year. So that is how we can use time to investigate the dynamics of data in a regression. This is simply not possible to do with cross-sectional data. Do note, however, that because our data is quite unbalanced, the sample size of our regression dropped from 28,000 observations to only 3,000 observations. That is a huge drop, and we might worry here about some selection issues. Finally, let me show you another way to use time.
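The distributed lag model and the total effect calculation described above could be sketched like this. It assumes the nlswork example data with xtset idcode year already declared, and the control variables are an assumption, since the session does not spell out the full variable list.

```stata
* Distributed lag model: south today, one year ago and two years ago
* L(0/2). expands to south, L1.south and L2.south
regress ln_wage L(0/2).south grade age, vce(cluster idcode)

* Total effect of living in the South: the sum of the three coefficients
lincom south + L1.south + L2.south
```

Because the lag operator needs the two preceding years to be present for each woman, gaps in the panel remove many observations, which is why the estimation sample shrinks so sharply.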
I previously mentioned that a difference estimator is a common type of estimator. We can achieve that in Stata by adding the D. prefix to our regression variables, like so: regress D. wages against D. south, D. grade, etcetera, etcetera. This will turn our regression into a difference regression, and we can see that this regression kicks out half of the observations and also some variables. The variables grade and year are not time-varying in differences: the difference of grade for each person is zero, and the difference of time for each person is one. Because these values don't change at all, either within people or across people, they're perfectly collinear. They're therefore excluded from the regression model. In this particular case, the R-squared is very poor, so there isn't much to learn from this model. The results tell us that changes in hourly wages are negatively related to changes in living in the South, although this is not statistically significant. Likewise, changes in age are positively related to changes in wages, but again, this is not statistically significant. And this often happens with difference models. R-squareds plummet and variables become insignificant. So my recommendation when using these kinds of models is that you should really know why you're doing it. They are relatively complicated. However, the goal here was to demonstrate some examples rather than teach you the intricate details of how each of these estimators works. So that concludes this session on how you can use time in regression analysis. There are many more ways to make time work in regression models, but hopefully this will have served as a useful, brief introduction.
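A short sketch of the difference regression described above, assuming the nlswork example data with xtset idcode year already declared. Clustering the standard errors is an assumption carried over from the earlier models.

```stata
* First-difference regression: D. takes the change from the previous year
* D.grade is always zero and D.year is always one, so Stata omits both
regress D.ln_wage D.south D.grade D.age D.year, vce(cluster idcode)
```

Stata reports the omitted variables in the output, which is a quick way to confirm which regressors have no variation once differenced.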
6. Handling Missing Data - Seeing the Unseen: missing data is everywhere. It's an unfortunate fact of life that most real-world datasets have missing values. Individuals refuse to state their income on household surveys. Ages are not reported. Firms do not state their profits. Political beliefs are not disclosed. Countries have missing GDP values. It's everywhere. We'll talk about the types of missing data more in a moment. But it's important to understand that in a regression world, missing values can appear in two places. Missing values can be found in the dependent variable Y, or missing values can be found in one or more of the explanatory variables X. In both cases, missing values will lead to a reduction in the overall regression sample size. A regression can only be estimated on complete data, so missing values anywhere, either in Y or in X, will lead to those respective rows being excluded from the analysis. The end result is that your regression will have a smaller sample size than you might have originally anticipated. Say, for example, you have a dataset with 20,000 observations, and you run a regression of a dependent variable Y on various explanatory variables X. Depending on how many missing values you have and how they're distributed across your variables, you may end up with a final regression sample of, say, only 2,000 observations. That is a big difference, and you really need to make sure you catch this kind of behavior and that you become aware of it. A massive data reduction from your sample to your regression may not necessarily be a bad thing, although it often is, but you need to be able to explain why it might not be a bad thing. So the question is: is missing data a bad thing? Yes, it is. The general assumption is always that missing data in any dataset is likely to be a problem. While some data may be missing randomly due to coding errors, or may be missing because it shouldn't exist, in many cases data is not missing at random.
That is quite an important concept to understand. Individuals who refuse to give their earnings in a survey usually do so for a reason. For example, the very wealthy are usually not willing to report their precise income to the nearest million. The very poor, on the other hand, may feel ashamed. Alternatively, someone's continued participation in a longitudinal survey will be subject to their motivation and mood to stay in that survey. This leads to patterns in missing data. Some of these patterns may be observed, and some may not be. How you deal with that aspect of your data is crucial to what you can then interpret from your regression estimates. So before we look at how to deal with missing data, let's take a look at the different kinds of missing data that exist. Structurally missing data is data that is missing for a logical reason. In other words, data is missing because it shouldn't exist in the first place. A survey that asks adults about their children may have missing information, for example. Why? Well, because not all adults necessarily have children. There's no way someone can answer a question about a child if they don't have children. So in such a situation, missing values are actually fine. An analysis that involved child age as a control variable would automatically be reduced to only those adults who are parents. The regression estimates then apply only to that part of the population with children, but given that this is the population of interest anyway, everything's fine. So if you deem that some or a lot of your missing data is structurally missing, the best course of action is to simply leave it alone. The next kind of missing data I'll refer to is missing completely at random, or MCAR. This type of missing data is completely unrelated to anything, either observable characteristics in our data or unobservable characteristics not in our data. If we rolled a die to predict these missing values, on average we would get the correct answer.
This makes this kind of data very easy to deal with. We can test for it using any reasonable method, such as guessing, common sense, regression analysis, etcetera. Little's MCAR test provides an easy way to test for such missing data. Should you ever be so lucky that Little's test suggests MCAR, the solution is the same as before: simply ignore it. Because only random bits of data are missing, your regression estimates will, on average, remain the same. So nothing changes in your analysis. So far, so easy. I'm sure you can imagine that this type of data is not very common. Life is never that easy. So what is next? Next is MAR, which stands for missing at random. Rather confusingly, this data is not actually missing at random. Rather, we can predict its missingness pattern by using other data. This other data may or may not be observable to us. The important concept is that it is theoretically possible to predict these missing values accurately if we have the right information. So, for example, going back to our little dataset of parents and their children: if some parents refused to answer the child question and this was related to their type of income, then we could use income to predict the child status. So, for example, in this particular case, parents who have an income of 10 tend to answer yes, and parents who have an income of 30 tend to answer no. So we could infer that our missing value is actually a yes, because we can predict it from the income variable, which is 10. The bad news is that, unlike MCAR, MAR can't be tested statistically. You might say: but wait, you just said that we can predict it using observable characteristics. Yes, but that prediction assumes that your prediction model is correct, and unfortunately we have no way of determining that other than theory. So dealing with MAR often means additional assumptions that require reasonableness and common sense. Moreover, you can't ignore this problem.
You need to do something about it. Excluding the missing values and continuing with your regression would be the wrong thing to do. You'll need some kind of imputation framework or method to deal with missing data if your missing data is MAR. The last missing data type is MNAR, or missing not at random. Any data that is not MCAR or MAR will be MNAR. In short, it means that the chance of seeing missing data in a variable Y doesn't just depend on the characteristics of other variables X, whether they be observed or not, but also on Y itself. So that's a bit of a headache. We need to have accurate information on Y to predict Y accurately. I'm sure you can see how this can lead to a few problems. Imagine someone who's depressed. Not answering a question on depression might be a function of his or her depression, so there's no way we could predict an accurate response to such a missing value problem. We are missing crucial information. And unfortunately, because of this, empirical strategies to deal with MNAR are very difficult. Imputation, whether simple or complicated, usually won't work. Often the answer is that you'll need to find more data or conduct sensitivity analysis. Sensitivity analysis is a kind of what-if scenario where you make slight alterations to your data structure or assumptions and then see how the numbers fall after your regression. You then report a range of estimates rather than only one set of estimates. Just like MAR, there is no way to test for MNAR. So if someone accuses your data of being MNAR, you'll need to argue your way out of it using theory. Now that we've discussed what kinds of missing data there are, the next question is: what do we do about it? We've already talked about a few options for some of the different types of missing data. Data that is structurally missing or MCAR requires nothing from you, just an acknowledgement and a persuasive argument for why the data should be treated as such.
For MCAR, there is also a test. For structurally missing data, you need to use your logic. For MNAR missing data, there's not much you can do, really. You'll need to use logic, theory and argumentation, and hopefully more data, to make a persuasive argument. Well, easy, you say: that only leaves MAR missing data to deal with. Unfortunately, MAR missing data is also the most common missing data type. If you think your missing data is of this kind, you'll probably need to do something about it. I say probably because if you only have a very small number of missing observations in a very big sample, then you could possibly get away with doing nothing. The exact ratios will depend on the relevant context, but numbers in the region of 5 to 10% missing might be acceptable, with a big emphasis on might. If you think that you do need to do something about it, then the literature has developed a lot of ways to deal with this. Unfortunately, it's not possible for me to cover everything in detail. My aim here is only to give you a few simple tricks and demonstrate these using Stata. So the main approaches to dealing with missing data can be divided into two methods: deletion of data or imputation of data. Let's start with deletion. The first approach is to remove the entire row of data for any missing value. That's actually the default position of many software packages. If you run a regression in SPSS or Stata, the software will automatically remove rows with missing values, that is, sample observations, from the analysis. This means your regression model will run, but it will have a lower sample size than your total sample. If your missing data count is low, then this might be the preferred way of doing things, since it's easy and you can argue that the missing observations wouldn't have affected your overall results anyway. The other solution is to delete the entire column of data that has some missing data.
That's a different way of saying any variable that has missing values is going to be dropped from the analysis. The advantage of this approach is that your regression model keeps its full set of observations for whatever variables are then left in your regression model. Conversely, the big disadvantage is that you might be dropping important explanatory variables from your model, which in turn may seriously affect the results. Generally, this will be the worst solution. Deleting an entire variable to deal with some missing data is normally not a good idea, so stay away from this option. On the imputation side of things, we first need to ascertain what kind of data we have. Let's continue on the right-hand side of this chart and work our way down. If we have time data, your data may have a trending component, say GDP, and we'll need to think about how to fill the gaps. This could be by seasonal adjustment, if a particular season is missing, or by interpolation methods. Interpolation is just a fancy way of saying that we'll take the previous value to predict the missing value. Alternatively, one could take values close to the missing value, say the average of the previous and next observations, to predict the actual missing value. If the time data has no trending component, then we can use things like mean, median or mode imputation to predict missing values. Alternatively, we can look at our distribution of non-missing values and just pick a value at random. If our data is cross-sectional and the missing values are categorical, we can create a new category specifically for the missing values. This is often the preferred way of doing things, since you're not guessing anything. You're simply giving the missing observations a specific missing code and including that in the regression model. The regression model will then estimate parameters for those who are missing.
And it requires no assumptions on your part, except for maybe when interpreting the regression coefficients after you're done. Note that this method can also be used when you have continuous variables with missing data, because you can always convert a continuous variable into a categorical variable. For example, think of income and income bands. My recommendation: if possible, use this method. It's simple and very effective. Alternatively, you can specify some kind of logit regression model to predict the missing value categories. However, you'll then need to think of a model that predicts the missing values. Another way is to use imputation methods such as multiple imputation, which estimates several models, each with different imputed categories, and then averages these results for you. This approach is similar to the logit regression in that it requires a model to predict the missing values, and if the model is wrong, then the imputations are off. However, an advantage of multiple imputation is that it guesses multiple values, so you don't rely only on a single imputed value. If your data is cross-sectional and your missing values are part of a continuous variable, you can use mean, median, mode or random imputation. Alternatively, you can predict the missing values from a linear regression model or again from multiple imputation models. As always, you will need to think of a model that predicts the behaviour of the missing observations. What I've shown here is just the tip of the iceberg. If your missing data problem is relatively benign or affects variables that don't matter too much in your regression model, then any of the mentioned solutions will likely suffice. If your missing data problem is significant and is part of your dependent variable or part of a very important explanatory variable, then you may need to delve deeper into this topic and explore other modeling concepts in more detail. Right. As you can see, missing values are a big topic.
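The two gap-filling ideas mentioned for trending time data, taking the previous value or averaging the neighbouring values, can be sketched in Stata as follows. The small yearly series is made up purely for illustration, and the variable names gdp, gdp_cf and gdp_ip are hypothetical.

```stata
* Hypothetical yearly series with two gaps (values are made up)
clear
input year gdp
2000 100
2001 .
2002 104
2003 106
2004 .
2005 110
end
sort year

* Option 1: carry the previous value forward into the gap
generate gdp_cf = gdp
replace gdp_cf = gdp_cf[_n-1] if missing(gdp_cf)

* Option 2: linear interpolation between the neighbouring observations
ipolate gdp year, generate(gdp_ip)

list
```

For a single missing year, linear interpolation is the same as averaging the previous and next observations, whereas carrying forward ignores the trend.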
Let's head off to Stata and try out some of these techniques. If you don't have Stata, don't worry, it will be very similar in other software packages. I won't have enough time to show you multiple imputation methods in Stata, but I'll show you some of the basic techniques we described earlier, including mean and regression-based imputation. So let's go take a look. Here we are in Stata with the auto training data already loaded. Before we start, I'm going to set a random number seed that allows you to replicate these results: set seed 1234. Without this seed, your results may differ from mine. The auto training data already has some missing observations in the variable rep78, but we're going to add a few more missing observations for demonstration purposes. So first, I'm going to set some of the miles per gallon observations to missing. I'm going to use a random number function to randomly set 5% of the miles per gallon values to missing: replace mpg equals missing if uniform is greater than 0.95. Okay. This now means that both the variables rep78 and mpg have missing values. I'm also going to give a third variable missing observations. This time, I'll make the missing observations rely on a process. I'm going to run a regression of the variable length on the variables trunk and turn, so regress length against trunk and turn. This now gives me a statistical model to predict length observations, and I can use the predict command to predict values of length for every observation in our dataset. So predict yhat. And for a few of these observations, in this case those with high predicted values, I'm going to set them to missing in the original length variable. So replace length equals missing if yhat is greater than 210. And finally, I'm going to drop yhat to make sure it doesn't cause us any problems later. Okay, so we now have a custom dataset that has missing values in three of its variables.
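The setup just described might be typed roughly as follows. This is a sketch: it uses runiform(), the current name of Stata's uniform random number function, and the exact observations set to missing depend on the Stata version and seed, so counts may differ from those quoted in the session.

```stata
* Load the auto training data and fix the seed for replicability
sysuse auto, clear
set seed 1234

* Randomly set roughly 5% of mpg values to missing (random process)
replace mpg = . if runiform() > 0.95

* Make length missing through a process: model it, then blank out
* the observations with high predicted values
regress length trunk turn
predict yhat
replace length = . if yhat > 210
drop yhat
```

The mpg gaps are random by construction, while the length gaps depend on trunk and turn, which gives one variable of each kind to work with.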
Next, I'm also going to install a user-written command to perform Little's MCAR test. Remember to always get permission to install any software, including add-on commands in Stata. So let's install that. Okay, that is now done. With that installed, we can begin our analysis. So let's first get an overview of all the missing data in this dataset. The misstable command can be used to examine missing data. It has a few subcommands, of which we'll use the summarize and the patterns subcommands. Let's run misstable summarize. The misstable summarize command shows us which variables contain missing values. mpg contains six missing values, rep78 has five and length has nine missing values. However, there may be some overlap in these missing values across observations. 20 missing values across three different variables does not necessarily imply that a regression using these variables will have 20 fewer observations. So let's analyze the pattern of missing values in this data, and we can do that with the command misstable patterns. I'm going to use the frequency option here to display frequencies rather than percentages: misstable patterns, frequency. And we now see the distribution of our missing data across the three variables that have missing data. The total sample size of the data is 74 observations. In terms of patterns, a one indicates that the value of the variable is not missing and a zero indicates that the value is missing. Therefore, the pattern 111 means no missing values, and 56 of our observations in this data have that pattern. Any regression involving those three variables is therefore expected to result in a sample size of 56 observations. But note that we see other patterns of missing data. For example, there are eight observations that have complete information for rep78 and miles per gallon but contain missing values for length.
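A sketch of the inspection commands described above, assuming the modified auto data is in memory. The session does not name the user-written command, so the package name mcartest from the SSC archive is an assumption.

```stata
* Assumption: the user-written Little's MCAR test is the SSC package mcartest
ssc install mcartest

* Count missing values per variable
misstable summarize

* Show which combinations of variables are missing together,
* as frequencies rather than percentages (1 = observed, 0 = missing)
misstable patterns, frequency
```

The frequency of the all-ones pattern is the number of complete cases, which is the sample size a regression on those variables will end up with.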
So now that we've identified that we have missing values in our data, it would be good to think about why this data might be missing. Often, if a variable is some kind of choice-based variable, the underlying reason is most likely to be MAR, missing at random. In other words, the missing data will be a function of other things. Alternatively, we can also think about structurally missing data. For example, the miles per gallon missing values might relate to electric cars. If that was the case, we should identify this missing data as such and leave it alone, because mpg simply doesn't apply to electric cars. That's quite important. Now is the time to think about theory and apply common sense to the data before continuing. So now let's assume that we've done that. How can we deal with all this missing data? The first option is to do nothing and let the regression command take care of it. As highlighted previously, most regression commands will automatically apply row deletion to missing data. So let's go ahead and run a regression of price against the variables length, mpg, rep78 and weight. The first three explanatory variables have missing values, and the first two of these are continuous variables, whilst the third is a categorical variable. I'll store each of the regression estimates for a final comparison at the end. So let's go ahead and run our first model: regress price against length, mpg, rep78 and weight. Let's execute that. And here is our regression model. We immediately notice that this model only has 56 observations, whilst our underlying data sample has 74 observations, so 18 observations were excluded from this analysis. So this is an example of row deletion of missing values: any observation that has a missing value in any of the variables in our model is completely removed, including its values that were not missing. We could leave things here. Often this will be a question of the relative proportion of missing values or whether some of the missing values are structurally missing.
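The do-nothing option described above amounts to the following sketch, assuming the modified auto data is in memory. The stored estimate name m1 is hypothetical and only serves the later comparison.

```stata
* Row (listwise) deletion happens automatically: any observation with a
* missing value in price, length, mpg, rep78 or weight is dropped
regress price length mpg rep78 weight
estimates store m1

* Compare the estimation sample with the full dataset
count if e(sample)
count
```

Checking e(sample) against the total count is a quick habit for catching an unexpectedly large loss of observations.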
In this case, I would argue that 18 observations out of 74 is probably a little bit too much to leave it like this. So the next step would be to perform an MCAR test to check whether our missing values are missing completely at random. If they are, we can stop at the first regression, so let's go ahead and do that. We'll include all the explanatory variables in the test command. The test command is mcartest, and we simply specify our list of explanatory variables. Little's MCAR test tells us that the explanatory variables in our regression are not missing completely at random. The p-value is below 0.05, so we reject the assumption of MCAR. Bad luck. But let me just show you that this test does actually work. Remember that the miles per gallon missing values were generated randomly, so if we reduce our analysis to miles per gallon and weight only, hopefully we'll find that the missing values are actually random. So let's go ahead and do that: mcartest miles per gallon and weight. Phew, it worked. The test tells us that the miles per gallon missing values are indeed MCAR. That gives us an option. We could estimate our regression only with the variables that are MCAR, miles per gallon and weight. So let's go ahead and do that: run a regression of price against mpg and weight and store the estimates. Here are the results. Note that two of the variables have now been dropped. This is the same as column deletion. We deleted the variables that have missing observations from our regression analysis, except those that are MCAR. Also notice that we still have missing values. Our sample size went to 68, but that still means that our regression is missing six observations. So in this scenario, we actually performed both a row deletion of missing values and a column deletion of missing values. The next step might be to assume that miles per gallon is not MCAR. We could go to the extreme and just do a column deletion across the board.
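A sketch of the testing steps described above, assuming the modified auto data is in memory and that the user-written mcartest command from SSC is the one used in the session. The stored estimate name m2 is hypothetical.

```stata
* Little's MCAR test on all explanatory variables of the first regression
mcartest length mpg rep78 weight

* The same test restricted to mpg and weight, where the gaps were random
mcartest mpg weight

* Keep only the variables whose missingness looks MCAR
regress price mpg weight
estimates store m2
```

A small p-value rejects the hypothesis that the data are missing completely at random, while a large one only means the test found no evidence against it.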
Doing column deletion across the board means excluding all variables that have any missing values from my regression. A regression model of that would look like this: regress price against weight. That's the only variable that doesn't have missing values. And here are the results. Our sample size jumped back to 74, so that's good. But our regression is very empty now. We threw away a lot of potentially important explanatory variables. There is no way that this approach is satisfactory in this kind of context. We need those variables back to explain the variation in price. So let's go to the alternative to deletion. Let's do some kind of imputation. The easiest way to impute missing values is to make them a separate category. That obviously only works for categorical data. rep78 is a categorical variable, and it has missing values, but it is still an important variable. First, we can recode rep78 so that all missing values now take a new value, say 99. So recode rep78, missing values equal to 99. Now let's go ahead and tabulate the variable rep78. And we see that we now have a categorical variable with no missing data but with a new category here. So let's put all of this in a regression. Stata will automatically add the new category as a dummy variable to the regression, thanks to the i. prefix: regress price against length, mpg, i.rep78 and weight. The results indicate that our sample size has gone up slightly. Compared to my first regression, we gained four observations. We also see a new variable in our regression, rep78 99. This new category tells us that observations with missing car repair status have car prices that are, on average, $3,270 higher than those that have a repair status of one. Great. This is a really useful way to deal with missing data. We didn't have to do any guessing of what the values are. We just included them, and Stata even told us a little bit about them in our regression.
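The full column deletion and the missing-category approach described above could be sketched as follows, assuming the modified auto data is in memory. The stored estimate names m3 and m4 are hypothetical.

```stata
* Column deletion across the board: only weight has no missing values
regress price weight
estimates store m3

* Missing-category imputation: give missing rep78 its own code
recode rep78 (. = 99)
tabulate rep78

* i.rep78 enters each category, including 99, as a dummy variable;
* the lowest category (repair status 1) is the base
regress price length mpg i.rep78 weight
estimates store m4
```

The coefficient on the 99 category is the average price difference between cars with unknown repair status and cars in the base category, holding the other variables fixed.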
My advice: where possible, use this approach. It's simple, clean and avoids complicated modelling that needs defending. But we also have two continuous variables with missing values. And if we don't want to recode them into categorical variables, we'll need to find some other way to impute missing values. One easy way is to use a statistic such as the mean, mode or median, and just give every missing observation that statistic from the sample. Let's start with mean imputation. Let's summarize miles per gallon. Next, let's replace mpg with the mean of mpg if the values of miles per gallon are missing. Go ahead. And finally, summarize mpg again, and we can see that the observation count has now increased from 68 to 74. The mean hasn't changed at all, since we used mean imputation. However, the standard deviation has changed. Next, let's rerun our regression. Here we are: the regression sample count has increased by five to 65 observations. If we want to do something more sophisticated but still relatively easy, we can run a regression explaining the variable that has missing values. For example, we can run a regression of length on trunk and turn and predict the length observations from that. Let's go ahead and do it: regress length against trunk and turn, and then predict the predicted values. If we now summarize our predicted values of length for the sample with missing values, we get the following. We see that we've predicted the values of length for nine observations, and because we used the right model, the predicted values range from 212 to 242. This is exactly what we specified at the beginning of this exercise, where we dropped length values whose predictions were greater than 210. Perfect. The final step is to replace the length values with the predicted values for those values of length that are missing. So replace length equals yhat if length is missing. Excellent. And now we can rerun our regression once more, and there we are. We now have a regression that is complete with 74 observations.
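A sketch of the mean imputation and regression imputation steps described above, assuming the modified auto data is in memory with rep78 already recoded so that missing values are 99. The stored estimate names m5 and m6 are hypothetical.

```stata
* Mean imputation for mpg: summarize leaves the mean behind in r(mean)
summarize mpg
replace mpg = r(mean) if missing(mpg)
summarize mpg

regress price length mpg i.rep78 weight
estimates store m5

* Regression imputation for length: predict it from trunk and turn
regress length trunk turn
predict yhat
summarize yhat if missing(length)
replace length = yhat if missing(length)

regress price length mpg i.rep78 weight
estimates store m6
```

Mean imputation keeps the mean fixed but shrinks the standard deviation, because every imputed observation sits exactly at the centre of the distribution.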
Missing values for miles per gallon were mean imputed, missing values for length were regression imputed, and missing values for rep78 were category imputed. Let's go ahead and compare all our models. And here we can see the comparison. The first few models relied on deletion methods and the final models relied on imputation methods. You can see that the various approaches lead to various sample sizes and various coefficients. You'll need to decide for yourself which model you prefer and which methods might fit your analysis. If your missing data problem is very complex, you'll most likely need more complicated methods than what was demonstrated here. However, the methods here are useful to know and should take you a fair way with basic data that has not too many missing values. And that concludes this session on missing values. There's a lot more to this topic, but these tips should give you a good start.
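The side-by-side comparison mentioned above can be produced with estimates table. This sketch assumes each model was saved with estimates store under the hypothetical names m1 to m6; the session does not give the names actually used.

```stata
* Compare coefficients, sample sizes and R-squared across stored models
* (m1-m3: deletion approaches, m4-m6: imputation approaches)
estimates table m1 m2 m3 m4 m5 m6, b(%9.2f) stats(N r2)
```

Reading across a row shows how much each coefficient moves as the treatment of missing data changes, which is the same idea as the sensitivity analysis discussed earlier.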
7. Categorical Explanatory Variables - How to code and interpret them: a lot of data in the real world does not come in the form of continuous variables. Not everything can be measured between negative infinity and positive infinity. Many variables or data types come in a categorical form. That is because they measure qualitative facts such as names, gender, types of transportation, country lists, political parties, IT skills, etcetera. We need to be able to accommodate such data in our regression modelling. Note that we're only talking about categorical data on the right-hand side here. We're not going to discuss what to do if the dependent variable is categorical. If you want information on that, go take a look at my nonlinear regression course. So here's an example of how categorical data might look in its raw form in a dataset. This little dataset contains six observations, which are denoted by the ID variable. It also contains four categorical variables. The first categorical variable, var1, is a string variable that measures gender. The cells only contain the values M or F. This kind of variable will first need to be recoded to a numerical variable, a dummy variable. The second variable also contains string data, but this time it has more than two categories. It has four categories: north, south, east and west. Again, this kind of data will need to be recoded, this time into multiple dummy variables. The next variable is a numeric variable that has three categories: one, two and three. Treating this as a continuous variable would be the wrong approach, since its three categories are likely to represent some sort of qualitative fact. Depending on the software, this may need to be recoded into multiple dummy variables. The last variable, var4, is a numerical dummy variable coded as 0/1. These are the kind of variables that we can enter into a regression. Their coefficient is an intercept differential between the two categories, zero and one.
We'll explore that more later. So what should we look out for when explanatory variables are coded as categorical data with two or more categories? Well, there are two important things. The first is coding. To use categorical data, you'll need to generate multiple dummy variables, and you'll need to make sure that these dummy variables are coded properly. It sounds simple, but it's the easiest way to make a mistake. Some software packages will do this automatically for you. In others, you'll need to do it manually. The second thing is the reference category, also called the base category. When you have more than two categories, one category will need to be left out of the model to avoid perfect collinearity. The base category anchors all your category coefficients, and changes to it will lead to a different interpretation of your results. So you need to be careful when choosing a reference category. So let's begin with coding. Why is careful coding important when using categorical variables? Well, the reason is that regression models can only take two types of variables: continuous variables or dummy variables. A dummy variable, also called a binary variable or dichotomous indicator, is often measured as zero or one. Depending on the statistics package you use, you can also use other numbers here, such as one and two, or zero and 99, or whatever. The key thing is that this variable only contains two values. If your entire data consists of continuous variables or binary indicators, then you're done. There isn't anything to do on the coding front. However, if your data contains categorical data that consists of more than two categories, then you'll need to code this into several new dummy variables. How should this be done? Well, here's a visual example. In this example, we have a raw data set that contains an ID variable and a categorical variable. This categorical variable has three categories. Our data is measured as string data and takes the values A, B and C, so we'll need to generate three new binary numerical variables, the dummy variables.
The first dummy variable, dummy one, will take the value one for every value of A and zero for every other value. This is important: zero is now everything else. So if a value is B or C, it will be set to zero in the first dummy variable. This is where mistakes can happen. If you set zero to something other than everything else, like just C or just B, then you're very likely to experience severe problems once you enter these variables into the regression. The second dummy variable, dummy two, takes the value one for every B value and again zero for everything else. Finally, the third dummy variable, dummy three, takes the value one for every value of C and zero for everything else. So that's exactly what needs to be done. Nothing else. If you get this wrong, your results will become meaningless or your regression will collapse. So it's quite important to understand how dummy variables are created from categorical variables. What's the next step? The next step is to include all of the dummy variables except for one in the regression. The reason for this is perfect multicollinearity. Collinearity is generally not a good thing, but perfect multicollinearity means the mathematics underpinning the regression won't work, and any software package will encounter an error. They all have various ways of dealing with that. Importantly, the dummy category that you leave out acts as the reference category. That means all the results are interpreted relative to it. This is very important to realize, as changes to the reference category can change your coefficients and interpretation quite drastically. However, equally important, changing the reference category doesn't change the underlying model. Your R-squared and predictions will be exactly the same. So how does all of this look? Well, imagine we had a regression that has two continuous variables, and we want to include a categorical variable with three categories in our regression.
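As a minimal sketch of the coding rule described above, here is how the three dummy variables could be built by hand. It uses Python rather than any statistics package, and the raw values and the helper name make_dummies are made up for illustration.

```python
def make_dummies(values):
    """Return one 0/1 dummy list per category found in `values`."""
    categories = sorted(set(values))
    # Each dummy is 1 for its own category and 0 for everything else.
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


raw = ["A", "B", "C", "A", "C", "B"]  # hypothetical raw string data
dummies = make_dummies(raw)

# Every observation belongs to exactly one category, so the dummies
# sum to 1 in every row. This is why one of them must be left out of
# a regression that also contains a constant.
row_sums = [sum(d[i] for d in dummies.values()) for i in range(len(raw))]

# Leave out the first category ("A") so it acts as the reference category.
reference = "A"
included = {cat: d for cat, d in dummies.items() if cat != reference}
```

The row sums are a quick way to catch the coding mistake mentioned above: if zero was not set to "everything else", some rows will not sum to one.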
Well, then we'll need to add two dummy variables to our regression model. So in such an example, y would be equal to a constant, x1 and x2, which are the continuous variables, and two dummy variables, D2 and D3. The dummy variable D1 is coded but left out of our regression to act as the reference category. So how do we interpret coefficients from such a model? Well, there are two things to remember here. First, everything is relative to the reference category. Second, dummy variables act as intercept shifts. This means everyone in that particular dummy category is on a higher or lower level of y than the reference category. Now let's take our regression equation from earlier and put some coefficient numbers onto it. Let's continue to assume that the dummy variables measure observations in categories A, B and C, just like our raw data from earlier. Let's assume that the regression of this model churns out the following coefficients: the constant is equal to one, the coefficients on the continuous variables x1 and x2 are also equal to one, the coefficient on D2 is two, and the coefficient on D3 is negative two. So what does this all mean? It means that the relationship between x1 and y, and between x2 and y, is one. This applies to everyone in our data. Increases in either x1 or x2 lead to higher values of y for everybody. But remember, there are three types of observations in our data: observations A, B and C, which correspond to the dummy variables D1, D2 and D3. So this means that if we set the dummy categories D2 and D3 to zero, the intercept is one. Because we only have three types of observations, if D2 and D3 are set to zero, then logically D1 must be true. So, therefore, observations in category A have an intercept of one. Observations in category B, or D2, have an intercept that is two above that of the reference category. The reference category refers to those in category A, whose intercept was one.
The observations in category B therefore have an intercept of three. Likewise, observations in category C, or D3, have an intercept that is two below the reference category. Therefore, they have an intercept of minus one. So hopefully all of this made sense. But let me demonstrate this to you visually, just to really make sure. Here's a visual representation of the earlier regression model. The y-axis is our dependent variable y, and the x-axis represents one of the two continuous variables, x1 or x2. It doesn't really matter which one; we're not that interested in the continuous variables here. But importantly, the x variable has a relationship with y, in this case positive one. A regression with two categorical dummy variables for three underlying categories leads to a model that has three different intercepts. The relationship between y and x for values in the reference category is given by the dashed blue line. It shows the slope and the intercept when the dummy variables D2 and D3 are set to zero. The dummy variables D2 and D3 shift the intercept up or down. Observations in category D2 are two points higher than the reference category, and this is denoted by the bold red line. Observations in category D3 are two lower than the reference category, and this is denoted by the bold green line. So hopefully you can see that categorical variables are very easy to interpret, as long as you coded the dummy variables correctly and included all but one of the dummy variables in your regression. Then the coefficients you see are simply intercept shifts up or down in relation to the reference category. You can, of course, change the reference category without any detriment to your model. Underlying predictions and R-squared will remain the same. However, it does mean that the coefficients presented to you will change.
This is because the base category has changed, and therefore the relative nature of the other categories, and with it the dummy variables, will also change. Let me show this to you with a visual example. Here we have exactly the same data and regression from earlier. However, we have now changed the reference category to dummy variable D2. This is represented by the red dashed line, and you can see that both the lines for dummy variable D1 and dummy variable D3 are now below the dashed red line. So observations in dummy category one now experience a negative effect of minus two compared to the reference category D2. Likewise, observations in dummy category three have a negative effect of minus four compared to D2, so the coefficients in this regression model would come out as minus two and minus four. This could be confusing if you don't fully understand the role played by the reference category. But as you can see on this diagram, nothing has actually changed. Everything in our model and all our predictions remain exactly the same. So which reference category should you choose? It depends. Ultimately, it makes little difference to the underlying model. But there are a few tips. Most software packages have some automated function that excludes the first value of your categories. This is often the default setting. Also, in real life, many first values in categorical variables will act as some kind of reference point. So often it makes sense just to stick to that and take the first value as the reference category. However, there is some advice in the statistics literature that suggests you should take the value with the highest observation count. This is because the reference category acts as an anchor point for all the other estimates, and if this category has a small cell size count, then it might cause the coefficients on the other dummy variables to become a little more unreliable.
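The intercept arithmetic from the A, B, C example can be checked with a few lines of Python. This is only a sketch of the spoken example: the coefficient values come from the example above, while the function names rebase and predict are made up here.

```python
# Coefficients from the spoken example: constant 1, D2 = 2, D3 = -2,
# with category A (D1) left out as the reference category.
constant = 1.0
shifts = {"A": 0.0, "B": 2.0, "C": -2.0}

# Intercept for each category = constant + that category's shift.
intercepts = {cat: constant + s for cat, s in shifts.items()}


def rebase(intercepts, new_reference):
    """Re-express the same intercepts relative to a different reference."""
    new_constant = intercepts[new_reference]
    new_shifts = {cat: value - new_constant for cat, value in intercepts.items()}
    return new_constant, new_shifts


new_constant, new_shifts = rebase(intercepts, "B")


def predict(constant, shifts, category, x1, x2, slope1=1.0, slope2=1.0):
    """Prediction with slopes of 1 on both continuous variables, as in the example."""
    return constant + shifts[category] + slope1 * x1 + slope2 * x2
```

With category B as the reference, the constant becomes three and the shifts for A and C become minus two and minus four, while predict returns the same value for any observation under either set of coefficients.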
So do consider that. Another consideration is what kind of coefficients you want to present. Often categorical data will have some kind of ranking, like Likert scales or school grades or something similar. This will often result in a noticeable pattern of coefficients that travel in one direction. So you should ask yourself: do you want to present positive or negative estimates? For example, looking at the relationship between education and wages, do you really want to set university as the reference category? It's the highest education one can achieve. All the other education categories will probably have negative coefficients, and that will confuse lay readers. They may think that education is a bad thing because they see negative coefficients. So it's quite important to think about how you want to present your coefficients. Finally, you can also group several dummy variables together to create a bigger reference category. Imagine a Likert scale that includes strongly disagree, disagree, neutral, agree and strongly agree. That would result in five dummy variables. You could leave two of them out instead of one. For example, you could leave out the categories strongly disagree and disagree to have one general disagree reference category. This can often be useful if you have a small cell size count or if you want to make more aggregate statements about the data at hand. One more important thing to remember when dealing with categorical variables is that you should always perform group tests where possible. Some dummy variables might be statistically insignificant. Maybe all the dummy variables are statistically insignificant. However, the way statistics works is that a group of variables might be separately insignificant but, together as a group, jointly statistically significant. So you should not exclude individual dummy variables that belong to a group. You should either remove the entire group or keep the entire group.
Don't mix and match, because, as we saw earlier, this will also affect your reference category. Now let's head over to Stata so that I can show you some examples. It doesn't matter what software you use. The methods are all the same, but I should add that Stata has some excellent automatic capabilities to deal with categorical variables that save you a lot of coding work. Okay, here we are in Stata, and I've already loaded the auto training data set. This is a training data set that includes 74 observations about car prices. It has some useful properties we can explore. In particular, one of the variables in this data set is a categorical variable. This variable is rep78, which stands for repair record. Let's go take a look at it by tabulating it: tab rep78. And here we can see that we have five numerical categories ranging from one to five. Some of the categories have many observations, like category three. Other categories, such as category one, have only two observations. That could be a problem later, so we'll remember that. The next step before any regression analysis is to create five new dummy variables, one for each category of data. We can do that easily in Stata by using the tabulate command with the generate option. In other statistical software, you may need to use more muscle power and code everything yourself. Let's go ahead and do that: tabulate rep78 and invoke the generate option. Okay, we can now see that we've created five new dummy variables called dummy1 to dummy5. Let's take a quick look at them: summarize dummy*. We've now summarized the five new dummy variables. Here we can see that the five new dummy variables are coded correctly because their distributions are correct. The mean numbers here match the percent numbers up here. And now we can move on and include these dummy variables in a regression. So let's regress car prices against a continuous variable, mileage or mpg, and the categorical variable rep78.
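The check described above, that the mean of each dummy matches the percentage in the tabulation, can be sketched in Python. The category counts below are hypothetical and were chosen only to resemble the tabulation described (two cars in category one, most cars in category three). They are not read from the data set.

```python
from collections import Counter

# Hypothetical repair-record codes, chosen only to resemble the tabulation
# described above (few cars in category 1, most in category 3).
rep = [1] * 2 + [2] * 8 + [3] * 30 + [4] * 18 + [5] * 11

counts = Counter(rep)
shares = {cat: counts[cat] / len(rep) for cat in sorted(counts)}

# One 0/1 dummy per category, then the mean of each dummy.
dummy_means = {
    cat: sum(1 if r == cat else 0 for r in rep) / len(rep) for cat in sorted(counts)
}

# A correctly coded dummy has a mean equal to its category's share.
coded_correctly = all(abs(dummy_means[c] - shares[c]) < 1e-12 for c in shares)
```

If a dummy's mean does not match its category's share, zero was probably not set to "everything else" when the dummy was coded.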
To do that, we're going to need to enter four of the five dummy variables in the regression, and in Stata we can do that like so: regress price against mpg, and we're going to select the last four dummy categories, dummy2 to dummy5. Execute that, and there we go. We now have our first regression with a categorical variable. The dummy variables two, three, four and five are included, and the first dummy variable, dummy1, is excluded and acts as the reference category. So the results say that we have an R-squared of 0.25, and the continuous variable mpg is statistically significant. It appears to have a negative relationship with y: higher miles per gallon lead to lower car prices. The four dummy variables are statistically insignificant, and at this point it is worth running a group test to figure out whether the entire group is statistically significant or insignificant. In Stata we can do that with the test command. So let's go ahead and do that: test dummy2, dummy3, dummy4 and dummy5. And here we can see that the group of these variables is not statistically significant. So, in other words, this group of variables is not helping us explain car prices at all. They don't add to the model, and we could therefore potentially exclude the entire group of repair categories. So that's a shame. But for example purposes, we're going to ignore this for now. Let's take a look at the coefficients, and we see the following: those with a repair record of two have car prices that are on average $877 more than those with a record of one. Likewise, those with a record of three have car prices that are higher by $1425 compared to group one, etcetera, etcetera. We also actually see a gradient here taking place in each of the categories: as the categories increase, the car price goes up and up and up. So this might suggest there is some sort of ordering taking place in this variable. Next, let's go ahead and visualize these coefficients in Stata.
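To give a feel for what a group test computes, here is a rough Python sketch of the joint F statistic. It makes a simplifying assumption that is not in the video: the model contains only a constant and the category dummies, with no mpg. That way the fitted values are just group means. The data values are made up, and no p-value is computed because that needs the F distribution.

```python
def joint_f_statistic(groups):
    """F statistic for H0: all category dummies have a zero coefficient.

    Assumes a simplified model with a constant and category dummies only,
    so fitted values are group means (unrestricted) or the grand mean
    (restricted, with the whole group of dummies removed).
    """
    all_y = [y for ys in groups.values() for y in ys]
    n, k = len(all_y), len(groups)
    grand_mean = sum(all_y) / n

    # Restricted model: constant only.
    rss_restricted = sum((y - grand_mean) ** 2 for y in all_y)

    # Unrestricted model: constant plus k - 1 dummies.
    rss_unrestricted = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        rss_unrestricted += sum((y - group_mean) ** 2 for y in ys)

    q = k - 1  # number of dummies tested jointly
    return ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / (n - k))


# Hypothetical outcome values for three categories.
groups = {"A": [1.0, 2.0, 3.0], "B": [2.0, 3.0, 4.0], "C": [3.0, 4.0, 5.0]}
f_stat = joint_f_statistic(groups)
```

The statistic compares how much worse the model fits once the whole group of dummies is removed, which is why the group is judged together rather than dummy by dummy.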
I can do that with the margins and marginsplot commands. However, before that I want to use Stata's automatic expansion capability to rerun the regression, because that gives me a little bit more control over the post-estimation results. It's really easy to do in Stata. All I do is use the i. prefix. That tells Stata that the variable rep78 is an indicator or categorical variable. Stata will automatically expand it into dummy variables and leave out the first category. So here we go. This is exactly the same regression and the same set of results as earlier, with the same R-squared and coefficients. However, this time I avoided any coding responsibility by using the i. prefix. So this can be a really handy way of doing things, especially when you're doing a lot of specification testing. Next, I'm going to call the margins and marginsplot commands to graph these estimates. I will ask for the dummy variables to be drawn over all values of mpg. But I will suppress the standard errors here because we know none of this is statistically significant anyway. Let's have a look at that. So there we go. We now see exactly what the inclusion of the categorical variable does to a model. It produces five different levels of the relationship between mpg and car prices. The blue line at the bottom is repair record one. This is the excluded reference category, and each of the other lines represents one of the four dummy variables we included in our regression model. The vertical distance between the blue line and each of the other lines is the coefficient on the corresponding dummy variable. So hopefully that should all be fairly clear and easy to understand. Let's close this and go change our reference category to something else. Let's change it to category five. What do you think will happen to the results? Well, they should all turn negative, so let's have a look to see if that works.
In Stata I can change the reference category on the fly by again invoking the i. prefix, but this time specifying the base to be five. And I can do that with the code ib5. in front of rep78. So let's execute that. We've now changed our reference category to the last category, category five, and indeed we see that all our coefficients are now negative relative to category five. Those in category one now have car prices that are $3131 lower. Notice how the R-squared is still exactly the same. Nothing has changed. Only the coefficients on the dummy variables have changed. The predictions from this model are the same as from the previous model, and we can visualize this to see that. So let's run the margins and marginsplot commands again. And here we see that this graph is exactly the same graph as before, except the coefficients now relate to the highest category. In other words, the teal line rather than the blue line at the bottom. All the other lines are below the reference line, and the distances between the teal line and each of the other lines are the coefficients on our dummy variables in the regression model. So let's close this and go ahead and change the reference category to the one with the largest sample size. In this case, this is category three. Let's take a look at that in Stata. I'm going to use a specialized option, ib(freq)., which will automatically select for me the category with the largest sample size. And here are the results. Again, the underlying regression model is exactly the same as before. The only thing that's changed is the reference category, and therefore some of our coefficients are now negative and some of them are positive, because now the reference category is the third group, which sits in the middle. Finally, what happens if we leave out more than one dummy variable?
Well, this means the reference category is now a mix of those two categories that we left out. So say, for example, we want to leave out categories one and two, maybe because their cell sizes are so low that we want to combine them. We can do this by including only three of the five original dummy variables. So we could run a new regression, that is, regress price against miles per gallon and dummy variables three, four and five. Notice now that the R-squared has actually changed slightly. This regression model is different from the previous model. It won't give the same predictions. That is because we have one less parameter. The results show that those in category three have car prices that are $723 higher than the average car price in categories one and two combined. So that interpretation is slightly more difficult and a little bit trickier, but it may make sense in certain circumstances. However, I advise you to always be careful when combining reference groups. And that concludes this video on how to deal with categorical variables in a regression.
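To see why changing the reference category leaves the model alone while combining categories does not, here is a small Python sketch. It assumes a simplified model with a constant and category dummies only (no mpg), in which case the constant is the mean of the reference group and each dummy coefficient is a difference in means. The prices are made up and are not the auto data.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical prices by repair category (not the auto data).
prices = {
    1: [4000.0, 5000.0],
    2: [5000.0, 6000.0, 7000.0],
    3: [6000.0, 7000.0, 8000.0, 7000.0],
}
group_means = {cat: mean(p) for cat, p in prices.items()}

# Reference = category 1: the constant is its mean, and each dummy
# coefficient is the gap between a category mean and that constant.
coef_base1 = {cat: m - group_means[1] for cat, m in group_means.items() if cat != 1}

# Reference = category 3: same fitted means, different coefficients.
coef_base3 = {cat: m - group_means[3] for cat, m in group_means.items() if cat != 3}

# Leave out dummies 1 and 2: the reference is now the pooled group,
# so the constant is the mean over all observations in categories 1 and 2.
pooled_mean = mean(prices[1] + prices[2])
coef_cat3_vs_pooled = group_means[3] - pooled_mean
```

In the first two cases every category keeps its own fitted mean, so predictions are identical. In the pooled case categories one and two share a single fitted value, so it is a different model with one less parameter.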
8. Dealing with Collinearity - Exclude or Transform?: multicollinearity, often just termed collinearity, is a phenomenon that emerges when different explanatory variables in a regression measure the same thing. It is generally an undesired property to have in a regression because it leads to higher standard errors, more noise and poor coefficient estimates. Two types of collinearity matter in the regression world. The first is perfect collinearity, which results when you enter the same explanatory variable twice into a regression. Both explanatory variables are perfectly correlated, and this results in the mathematics of the regression model breaking down. Most statistical programs will either abort the process or remove one of the offending variables automatically. The other type of collinearity is imperfect. This is when two variables mostly measure the same thing, but not exactly: weight and height, for example. In this case, a regression will run, but the output may be very inefficient. Depending on the strength of the collinearity, standard errors may substantially increase. This will make it hard to see what's going on in a regression, and obviously that is not a desirable event. What makes it worse is that the standard regression diagnostics of many statistical programs don't test for this by default, and it can be hard to recognize this from model estimates alone. Thankfully, there are a few things that can be done to test for collinearity and, importantly, avoid it. But first, let's have a closer look at the impact of collinearity on standard errors. Here's a video, and in this video I have set up an underlying data set using simulation techniques. The data set consists of five x variables and one continuous y, or dependent, variable. The relationship between each of the x variables and the y variable is shown in the title. Each variable has a parameter coefficient of exactly one.
On the left-hand graph, you can see the raw data points between x1 and x2, two of our five explanatory variables. And in this graph, I'm going to slowly increase the correlation between these two explanatory variables. I will only do this for x1 and x2. The other explanatory variables will be left completely alone. On the right-hand graph, you can see the results from my regression of y against all of the explanatory variables x, with the estimated coefficients and confidence intervals represented. In an ideal world, we want each estimate to be exactly one and the confidence interval to be as small as possible. So our starting estimates look quite good. Everything is estimated to be approximately one with a relatively tight confidence interval. Now let's induce ever more collinearity by increasing the correlation between x1 and x2 until they almost measure the same thing. For reference, I've also included a little computation of by how much the standard errors have actually increased for both the x1 and x2 variables. So let's have a go. We see that as the correlation increases, the standard error and confidence interval on x1 and x2 slowly increase. However, the increase accelerates towards the end. The noise around the estimates for x1 and x2 shoots into the sky. So this is the effect of collinearity: at very high levels of collinearity between explanatory variables, standard errors literally explode. So what kind of warning signs are there when you have collinearity in your regression? Well, we've seen the first one already: high standard errors. Do some of your variables have extremely high standard errors that seem unusual in your regression? That is a good indication of collinearity. Are coefficients not significant, even though they should be? These are classic clues that something's not quite right. Another warning sign is theory.
If you're not data mining, you should always have some basic understanding of why explanatory variables are included in the regression model, and that should alert you to the potential of including the same kind of measurement twice. Anything that measures the same concept repeatedly is likely to cause collinearity problems. Another indicator is in stepwise regression. If you load variables consecutively into a regression, then sudden changes to coefficients and standard errors are a clue that something is not quite right. Descriptive statistics are another indicator. It is common practice to examine pairwise relationships between explanatory variables before running a regression. This allows you to spot potential collinearity problems before a regression. Of course, a pairwise correlation doesn't mean there's going to be a multivariate correlation. But most of the time this is a fairly good indicator. You should get into the habit of looking at the correlation between explanatory variables before a regression. And finally, there are post-regression test statistics, such as the variance inflation factor, that can alert you to collinearity. So there are quite a few warning signs out there, if you know what to look for. What are common causes of collinearity? Well, first of all, the underlying cause is always the same: something is measured twice, or nearly twice, in a regression. But there are a few reasons why this might happen. The most common reason is that it is an accident. The underlying collinearity is unknown and, for whatever reason, two or more collinear variables were entered into a regression. There can be several reasons for that. Inexperience can play a role, but a lack of data descriptives or pre-regression correlation checks between regressors are also common causes. That is why it matters that we think about the regression model we're building and that we take care to examine the data beforehand.
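The accelerating blow-up of standard errors seen in the simulation, and the variance inflation factor mentioned among the warning signs, can be tied together with a short Python sketch. It assumes the simplest case: only two regressors are correlated with each other, at correlation r, and all other regressors are uncorrelated with both. Under that assumption the variance inflation factor is 1 / (1 - r squared). The function name is made up.

```python
import math


def vif_from_correlation(r):
    """Variance inflation factor for one of two regressors correlated at r.

    Assumes the remaining regressors are uncorrelated with both, so the
    R-squared from regressing one on the others is simply r squared.
    """
    return 1.0 / (1.0 - r ** 2)


for r in (0.0, 0.5, 0.9, 0.99, 0.999):
    vif = vif_from_correlation(r)
    # The standard error is inflated by the square root of the VIF.
    print(f"r = {r:<5}  VIF = {vif:8.1f}  SE inflation = {math.sqrt(vif):5.1f}x")
```

Because the denominator shrinks towards zero as r approaches one, the inflation is mild at moderate correlations and grows very quickly near the end, which matches the pattern described in the video.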
Another common reason is the inclusion of higher-order polynomials. Higher-order polynomials are when we enter the same variable multiple times but add different powers to each. In practice, x, x squared and x cubed measure the same underlying thing. So the use of polynomials in a regression can cause collinearity. Finally, maybe a user has decided to deliberately enter two highly collinear variables. One would need a pretty good reason for doing that, as highly collinear variables essentially measure the same thing. So why would anybody want to measure the same thing twice in a regression? Well, it could be part of a research question that looks to determine to what extent other factors, above and beyond the collinearity between the two variables, play a role in explaining y. That would need some careful thinking about. But everything is possible in a regression. What should you do when you encounter collinearity in a regression model? The first approach is the do-nothing approach. Do the variables that are causing the collinearity matter? Are they of interest, or are they just background controls? Is the collinearity high? Are the variables significant? Are the signs correct? Depending on the answers to these questions, it may be possible to simply ignore the problem and leave it alone. Another approach is to exclude one or more of the offending variables. This can be done via theoretical methods, where you have some reason to exclude a variable, or via arbitrary methods, where you simply pick one or more of the offending variables and exclude them from your model. Another alternative is some kind of selection method, such as stepwise regression. Selection methods such as stepwise regression are generally not preferred by economists, but in other sciences they're seen to be less of a problem. Here you outsource decision making to the computer, which will fit various models and exclude the worst offending variables, depending on a particular rule set that you might provide.
Another way to deal with collinearity is to use something like principal component analysis on the group of collinear variables and include their principal components in a regression model instead of the collinear variables. Principal component analysis can find the linear combinations of variables that produce large variances without much loss of information. Thus, the set of correlated variables can be reduced to a new, minimal number of variables, which are uncorrelated with each other but contain much the same information as the original variables. And finally, another transformation trick is de-meaning, sometimes called centering, your variables. This is especially handy for polynomial-induced collinearity. Okay, so having explored why collinearity is bad, what the causes are and how we might deal with it, let's explore this using some data in a statistical program. As always, I'll be using Stata, but all the concepts I'm showing you will work anywhere else. Here we are in Stata, and I've already loaded the auto training data set and set the random number seed. This will allow you to replicate the exact same results. This is because I'm going to use random numbers later to help create collinear variables in this training data. Without a seed, these numbers would be different every time and for everybody. You can check out more information on random numbers and simulation methods in my Stata course. Now the next step is to examine the pairwise correlations of some key variables that we want to use, specifically the explanatory variables in the regression. Our aim is to run a regression of price as the dependent variable, with mileage, length and weight as explanatory variables. So let's go ahead and do a pairwise correlation of these explanatory variables. In Stata, we can do that with the pwcorr command, which stands for pairwise correlation.
The pairwise correlation tells us that length and weight appear to be quite correlated. That is to be expected: longer cars are heavier cars. The Pearson correlation coefficient is around 0.95. Now let's run a regression of price on these variables and check the variance inflation factor, which is a collinearity statistic. I'm also going to ask Stata to report standardized coefficients. This will come in handy later. Okay, so our regression tells us that miles per gallon is not statistically significant and that longer cars have lower prices. At the same time, heavier cars appear to have higher car prices. We're going to focus on length in this series of examples. So let's note its standardized coefficient, also called the beta coefficient, of minus 0.79. The variance inflation factor statistics are not bad, actually, even though weight and length measure very similar things and have a high pairwise correlation. Lower values here are better, but a rule of thumb is that values below 50 or so are generally OK-ish in a regression, and here all our values are 10 or below, so we're actually fine from a collinearity perspective. Next, let's introduce some additional variables to our regression that have a very high degree of collinearity. To do that, I'm going to use random numbers. Specifically, I'm going to create two new length variables that are equal to length plus a little bit of random variation. So let's go ahead and do that and also check out the new pairwise correlations of these new variables. Okay, look at that. We have some pretty high correlation statistics: 0.99, 0.998 and 0.998. These are very high correlation statistics. You wouldn't normally introduce such variables into a regression because they're virtually identical. But let's do it anyway and see what happens. And here is the result of this little experiment. The variance inflation factor has exploded for the three collinear variables. The standard errors in the regression have also increased substantially.
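The experiment just described, creating near-copies of length with a little random variation after setting a seed, can be imitated in Python. This is only a sketch under stated assumptions: the length values, the seed and the size of the noise are made up and are not the ones used in Stata.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(1234)  # fixing the seed makes the random draws reproducible

# Hypothetical car lengths; the real data would come from the data set.
length = [float(v) for v in range(140, 234)]

# Two near-copies: the original plus a little random variation.
length2 = [v + random.gauss(0, 1) for v in length]
length3 = [v + random.gauss(0, 1) for v in length]

r12 = pearson(length, length2)
r23 = pearson(length2, length3)
```

Because the added noise is tiny compared with the spread of the lengths, the pairwise correlations come out very close to one. Rerunning with the same seed reproduces exactly the same near-copies.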
The standard errors are around 10 times bigger than previously for the relevant length variables. So that is a problem. What do we do about it? Can we leave it like this? Well, if the length variables are not important to the analysis, we might be able to leave it like that. But a better solution would be to simply remove two of the offending variables from the regression. The variance inflation factor statistics give you a pretty good indication of which ones we should kick out, probably length and length2. However, if you don't want to do it yourself, you could do it via a stepwise regression method. Stepwise regression is a selection method that automatically removes variables depending on their effect on the model. In this example, I'm going to have Stata perform a backward selection from the full model we had earlier. And I'm going to ask Stata to look at the length variables specifically and keep removing them one by one unless they are significant at the 10% level. Let's go see what happens. Okay, so the stepwise selection on this regression model has excluded the variables length3 and length2 in sequential steps from the model, and we are left with only the variable length. So that was our selection, and we eliminated the offending collinear variables. Another potential method to deal with the problem is to run a principal component analysis on the group of collinear variables, extract the first principal component and then simply use that in a regression. Let's go ahead and do that: run a principal component analysis and predict the first principal component. And here we can see that the first component is really dominant. It accounts for 99.9% of the total variance across all three variables. So let's take that, include it in our regression and rerun the regression. And there we go. We get quite a large negative number on the principal component variable.
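As a rough illustration of why the first principal component is so dominant here, the following Python sketch builds three near-identical length variables from synthetic data and measures the share of total variance carried by the first component. It is only a sketch under stated assumptions: the data are simulated rather than taken from the car dataset, power iteration stands in for whatever routine Stata uses internally, and the helper names `standardize`, `correlation_matrix` and `leading_eigenpair` are made up for this example.

```python
import random

def standardize(x):
    """Rescale a list to mean 0 and sample standard deviation 1."""
    n = len(x)
    mean = sum(x) / n
    sd = (sum((v - mean) ** 2 for v in x) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in x]

def correlation_matrix(columns):
    """Correlation matrix computed from standardized columns."""
    z = [standardize(c) for c in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def leading_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    k = len(matrix)
    vec = [1.0 / k ** 0.5] * k
    for _ in range(iterations):
        prod = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = sum(p * p for p in prod) ** 0.5
        vec = [p / norm for p in prod]
    prod = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
    value = sum(v * p for v, p in zip(vec, prod))  # Rayleigh quotient
    return value, vec

random.seed(12345)  # assumed seed, for repeatability only
n = 200             # assumed sample size

# Synthetic length variable plus two noisy copies (assumed distributions).
length = [random.gauss(188, 22) for _ in range(n)]
length2 = [v + random.gauss(0, 1) for v in length]
length3 = [v + random.gauss(0, 1) for v in length]

corr = correlation_matrix([length, length2, length3])
eigenvalue, loadings = leading_eigenpair(corr)

# The eigenvalues of a correlation matrix sum to the number of variables,
# so the first component's share of total variance is eigenvalue / 3.
share = eigenvalue / len(corr)

# Score of each observation on the first component: a weighted sum of the
# standardized variables. This is the single variable that would replace
# length, length2 and length3 in the regression.
z = [standardize(c) for c in (length, length2, length3)]
pc1 = [sum(w * col[i] for w, col in zip(loadings, z)) for i in range(n)]

print("share of variance on first component:", round(share, 4))
print("loadings:", [round(w, 3) for w in loadings])
```

With three variables this close to each other, the loadings come out almost equal, so the first component behaves like an average of the standardized length variables.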
By itself, this coefficient doesn't make a lot of sense, but it refers to some underlying latent component driving all of the length variables: length, length2 and length3. However, the standardized coefficient is virtually identical to the standardized coefficient from the first regression. Remember, that was minus 0.79. Now we get a result of minus 0.75. So ultimately, we ended up with pretty much the same result as the first regression. So with stepwise selection and principal component analysis, we gave the computer more choice in deciding which collinear variables should be excluded. Finally, let me show you how demeaning or centering a variable can help in a collinear situation. One common cause of collinearity is higher-order polynomials. Let me show you. I'm going to run a regression of price against miles per gallon, weight, length and length squared, so I'm going to include the variable length twice: once as a linear term and once as a quadratic term. Let's go see what happens. Here are the results. We have a length term and a length squared term. There's a negative coefficient on the first and a positive coefficient on the second. I'm going to ignore the significance here. In this case, the results suggest that length might have a U-shaped relationship with price. Let's go ahead and visualize this so that we get a slightly better idea of what these coefficients mean. Stata is really good at this. We can use the margins and marginsplot commands to help us do that. And here we can see that there's a potential nonlinear relationship between price and length. Remember the shape of this curve for later. Let's close this, and now let's check out the variance inflation factor to get an idea of the collinearity in that particular regression model. Here we see that we have relatively high values on our variance inflation factor statistics. Both length and length squared have increased collinearity significantly. This will result in high standard errors.
That goes for both of these variables, so that's not great. But we can combat this problem by demeaning the original length variable and using that in our regression instead. So let's go ahead and do that. This code over here calculates the mean of the length variable and subtracts it from the original length variable to compute a demeaned length variable. And then we're going to use that in our regression instead of length. Let's go see what happens. There we go. What happened? Note that our variance inflation factors dropped significantly. That's a great result. However, remember that by demeaning the variable, we have changed the reference point, so demeaned length is now distributed between minus 42 and plus 42. Let's go ahead and plot these coefficients from the model on a graph, and there we are. We computed exactly the same relationship as before, but we minimized the influence of any potential collinearity by demeaning our variables. So transforming potentially collinear variables is another way to deal with collinearity. Do note that there are limits here. Entering a polynomial of order nine, for example, even on demeaned variables, is likely to cause collinearity problems. And that ends this session on how to deal with collinear variables.
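The demeaning trick can be sketched in a few lines of Python. This is a simplified illustration, not the Stata code from the video: it uses a synthetic length variable (the mean, spread, sample size and seed are assumptions) and looks only at the linear and squared terms, leaving out miles per gallon and weight. With just two regressors, the variance inflation factor of each is 1 / (1 - r^2), where r is their correlation. The names `pearson` and `vif_pair` are made up for this example.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_pair(x1, x2):
    """VIF shared by two regressors: 1 / (1 - r^2), with r their correlation."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

random.seed(12345)  # assumed seed, for repeatability only

# Synthetic length variable (assumed distribution, not the real car data).
length = [random.gauss(188, 22) for _ in range(200)]

# Raw polynomial terms: length and length squared.
length_sq = [v ** 2 for v in length]

# Demeaned version: subtract the mean first, then square.
mean_length = sum(length) / len(length)
dm_length = [v - mean_length for v in length]
dm_length_sq = [v ** 2 for v in dm_length]

vif_raw = vif_pair(length, length_sq)
vif_demeaned = vif_pair(dm_length, dm_length_sq)

print("VIF with raw length and length^2:     ", round(vif_raw, 1))
print("VIF with demeaned length and its square:", round(vif_demeaned, 2))
```

When length sits far from zero, length and length squared rise almost in lockstep, so their correlation is close to 1. After centering, the squared term is large at both ends of the range and small in the middle, so it is nearly uncorrelated with the linear term. The fitted curve does not change, because a quadratic in length and a quadratic in demeaned length (with an intercept) describe the same family of curves.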