Data Analysis - What is Linear Regression?

Franz Buscha

Lessons in This Class

- 1.
  
  Introduction
  
  4:12
- 2.
  
  What is Regression Analysis?
  
  2:45
- 3.
  
  What is Linear Regression?
  
  1:48
- 4.
  
  Why is Regression Analysis Useful?
  
  1:37
- 5.
  
  What Types of Regression Analysis Exist?
  
  2:33
- 6.
  
  Explaining Regression
  
  3:40
- 7.
  
  Lines of Best Fit
  
  7:58
- 8.
  
  Causality vs Correlation
  
  1:54
- 9.
  
  What is Ordinary Least Squares?
  
  1:04
- 10.
  
  Ordinary Least Squares Visual 1
  
  4:15
- 11.
  
  Ordinary Least Squares Visual 2
  
  7:43
- 12.
  
  Sum of Squares
  
  3:07
- 13.
  
  Best Linear Unbiased Estimator
  
  4:43
- 14.
  
  The Gauss-Markov Assumptions
  
  0:41
- 15.
  
  Homoskedasticity
  
  2:13
- 16.
  
  No Perfect Collinearity
  
  2:35
- 17.
  
  Linear in Parameters
  
  2:43
- 18.
  
  Zero Conditional Mean
  
  2:14
- 19.
  
  How to Test and Correct Endogeneity
  
  0:52
- 20.
  
  The Gauss-Markov Assumptions Recap
  
  1:56
- 21.
  
  Applied Examples
  
  21:32
- 22.
  
  Final Thoughts and Tips
  
  3:54

343

Students

Projects

About This Class

An easy introduction to Regression in Data Analysis

Learning and applying new methods and techniques can often be a daunting experience.

This class is designed to be compact and easy to understand, and it focuses on the basic principles of regression in data analysis.

This class will focus on understanding and applying linear regression in data analysis.

This class will explain what regression is and how Ordinary Least Squares (OLS) works. It will do this without any equations or mathematics. The focus of this class is on the application and interpretation of regression in data analysis. The learning in this class is underpinned by many animated graphics that demonstrate particular concepts.

No prior knowledge is necessary and this class is for anyone who would like to engage with quantitative analysis.

The main learning outcomes are:

To learn and understand the basic intuition behind linear regression
To be at ease with regression terminology
To be able to comfortably interpret and analyze regression output
To learn tips and tricks

Specific topics that will be covered are:

What kinds of regression analyses exist
Correlation versus causation
Parametric and non-parametric methods
The least squares method
R-squared
Betas, standard errors
T-statistics, p-values and confidence intervals
Best Linear Unbiased Estimator
The Gauss-Markov assumptions
Bias versus efficiency
Homoskedasticity
Collinearity
Functional form
Zero conditional mean
Regression in logs
Practical model building
Understanding regression output
Presenting regression output

The computer software Stata will be used to demonstrate practical examples.

Meet Your Teacher

Franz Buscha

Teacher

Level: Beginner

Hands-on Class Project

If you have access to Stata, use the associated code and dataset to replicate the relevant examples of this class.

If you have access to Excel, use the associated dataset to replicate as many examples of the lesson as you can.

If you have access to R or SPSS, use the associated dataset to replicate as many examples of the lesson as you can.

As an associated project, please try to answer the following question: You have been tasked by your manager to properly price cars for sale and purchase. Your company buys cars and sells cars. You have little experience of the car market and need to build a model that determines what value you should set for each car. Build a model that identifies the determinants of car prices and post what your most expensive and cheapest cars would be (including their prices).

This project mimics a real-life application of regression analysis and there is no right or wrong answer. Leave your car prices and model characteristics for others to see.

Also, leave questions and comments in the discussion section.

Transcripts

1. Introduction: Welcome. Data analysis can be hard. There are so many different methods and so many different ways to analyze and interpret data that it can make learning very difficult. In this class, I want to give you an easy and fast outline of one of the most popular methods in data analysis: linear regression. The key to this class is that there are no equations, no maths, and no tricky bits of theoretical knowledge. I want to give you an intuitive and graphical explanation of what linear regression is, and then show you a range of practical data analysis examples. No matter what your current professional knowledge status, you can feel confident about knowing the ins and outs of linear regression after this class. What is linear regression? Linear regression is the most popular regression method used in the world. Of the linear regression techniques available, ordinary least squares, often abbreviated to OLS, is the most common. And I'm going to focus on ordinary least squares because it's by far the most used regression method for data analysis in the world. Ordinary least squares is a technique that examines the relationship between one continuous variable and one or more continuous and/or categorical variables. This technique is used in many disciplines including economics, sociology, psychology, geography, and even history. It's used all over the world. It's also often used in business for quantitative analysis, and it underpins many government reports that perform some kind of policy evaluation. Anybody who wants to have a good understanding of data analysis will need to understand linear regression. What are the main learning outcomes? To learn and understand the basic intuition behind linear regression methods in data analysis. To learn the associated terminology and underpinnings. To learn how to comfortably interpret and analyze regression output. Finally, to learn some extra tips and tricks that will help you in data analysis. Who is this course for?
This course is aimed at those who are starting off their career in data analysis. That could be practitioners, somebody in government, somebody in policy, somebody in business, or even students. What prerequisites are needed? There is no maths and you don't need to worry about any equations to get the most out of this course. Curiosity is all that is needed. Some Stata knowledge may be handy for the practical application of this course, but it's not required. Stata is a statistical software program that allows users to estimate many different quantitative methods. I'm going to use it to demonstrate the ordinary least squares examples. Also, a keen interest in understanding how data might be related to each other is a useful prerequisite. Often, data analysis is all about measuring quantitative variables against each other. If you want to know how y is related to x, then this is the right place for you. Using Stata: in this course I'll be using Stata to demonstrate some examples. Stata is approachable statistical software. There are many courses on how you can use Stata, should you be interested. In this course, I will not teach you the ins and outs of Stata, but I will focus on the interpretation of output. There are many other statistical software packages, such as R or SPSS, that can do exactly the same. However, if you are interested in Stata and you want to replicate some of the examples from this course, I have attached the relevant code files to this course. I'm going to be using something called the auto training dataset, which comes inbuilt with Stata, for practical examples. This data is a training set that contains a variety of useful variables and relationships, and it is great for teaching purposes. You can also download it as part of this course. Let's proceed to the next section and learn more about regression methods.

2. What is Regression Analysis?: What is regression analysis?
Regression analysis is a statistical technique that attempts to explore the relationship between one dependent variable and one or more independent variables. Alternative terms for the dependent variable are the outcome variable, the response variable, or the endogenous variable. The dependent variable is normally denoted by the symbol y. Alternative terms for independent variables are predictor, explanatory, or exogenous variables. Explanatory variables are normally denoted by the symbol x. It is common to write regression models in the form y equals x1 plus x2 plus x3, et cetera. The last term will be an error term. This is often denoted by e. This captures everything that is missing. However, there are many different practices for writing regression models in mathematical form, so we will avoid all of that in this course. Variables can take many different forms in regression analysis. They can be continuous. In other words, data can be measured anywhere on the number line, to many decimal points, for example minus 2.3, 0.5, or 100.3. Data can also be in integer format, such as 1, 2, 3, 4, 5, etc. Data can also be in binary format, such as 0 or 1. Often these denote binary responses such as yes and no. Sometimes data are ordinal. Ordinal data is categorical data that is ranked, such as Likert scales. Finally, data can also be nominal. This is categorical data that is unranked, for example, modes of transport. Importantly, data must always be in numeric format. Mathematics and computer software can do very little with string-type data. String-type data is data that contains letters and other non-numeric characters, like exclamation marks. Data can also be transformed, and this is a common feature of regression models. For example, taking the log of y and making this the new dependent variable is a very common technique in regression analysis. By doing so, the interpretation of the entire model will be changed.
And clearly, this needs to be carefully considered when using or analyzing such models.

3. What is Linear Regression?: What is linear regression? Regression analysis is a catch-all term for every type of regression method. Often regression methods are split into linear and non-linear regression methods. There are many methods in both of these two camps. In this course, we'll focus only on linear methods, specifically the ordinary least squares method, which is the most popular linear method. Linear regression assumes that variable parameters relate to the dependent variable in a linear way. Variable parameters are what we try to estimate with regression models and data to find the relationship between x and y. We often call parameters coefficients. For example, a parameter or coefficient of one means that for every unit change in x, the dependent variable y changes by one. Without getting too technical, linear regression assumes that dependent variables are measured as continuous variables. Explanatory variables can be measured in any way. When the dependent variable is non-continuous, the correct regression method is often non-linear. However, there are instances where linear methods can be used when the dependent variable is not continuous. When there's only one explanatory variable in the model, in other words only one x variable, we call this simple regression. When there are multiple explanatory variables, we call this multiple regression. Most regressions are of the multiple kind, as in practice we usually want to test or evaluate many variables against the dependent variable y.

4. Why is Regression Analysis Useful?: Why is regression analysis useful? Regression analysis is useful when quantitative evidence is needed to answer a particular question. Quantitative analysis, by definition, requires the analysis of numbers. The opposite of this is qualitative analysis, which analyzes non-numeric data such as words, stories, meanings, or concepts.
Regression analysis is useful because it allows for the testing of hypotheses. For example, do men really earn more than women? Is unemployment in the economy related to inflation? Or how much more ice cream is bought on sunny days? These kinds of questions can be answered with statistics, and you'll often hear the term "statistically significant at the 5% level" in such analysis. However, regression also allows for predictions. Because regression models estimate parameters or coefficients, these parameters can then be used to compute new statistics. This can be done within a data sample and even outside of that sample. For example, after a regression of various explanatory factors on wages, we can use the estimated parameters to compute the expected wage of a very particular type of person, whether they are in the sample or not. This prediction is a great strength of regression methods and it allows businesses, researchers, and policymakers to compute various effects.

5. What Types of Regression Analysis Exist?: What types of regression analysis exist? There are many, too many to count. In fact, many advanced regression methods will be customized for the relevant research question and the data. However, there are some core methods that you should be aware of. These methods are primarily a function of the nature of the data and the nature of the dependent variable. The most common method is ordinary least squares. This method requires the dependent variable to be continuous and is often applied to cross-sectional data. Cross-sectional data is data that doesn't have repeated time elements within it. Ordinary least squares also serves as the basis for many advanced methods such as weighted least squares. Next are three non-linear methods. These methods are non-linear because the dependent variable is not continuous anymore. Logit and probit models are useful for binary dependent variables.
Ordered logit and ordered probit models are useful when there are multiple ordered categories in the dependent variable. And multinomial logit models are useful when there are nominal, unordered categories in the dependent variable. If you're wondering what logit and probit models are, these are simply two common ways to achieve a non-linear relationship between the variables. Whilst there are some mathematical differences between logit and probit models, in reality these often make little difference to the results. Also note that multinomial probit models exist, but they are not frequently used, which is why I'm not listing them here. Next are panel models, both linear and non-linear. There are many methods in each category, but the common feature is that they all work with data that is collected repeatedly over time. This could be short household panels or long high-frequency trading time series. Next are count data models, which are similar to logit and probit models but use slightly different transformations to account for the count properties of the data. Examples of counts are things like the number of doctor visits or the number of t-shirts sold. Finally, Cox proportional hazard models are often used when the dependent variable is time. A common example of a time-dependent variable is the survival time of cancer patients, and this method is often used in the health sciences.

6. Explaining Regression: Explaining regression. Now that we have some basic understanding of the concepts behind regression analysis and also what types of regressions there are, let's explore how it actually works. If you're an academic student, regression is often learned through a variety of equations, often matrix-type equations that have a lot of x's and y's and e's and u's in them. They serve their purpose, but you don't actually need to understand them to learn how regression works. Using visual aids can achieve the same effect.
And this is something we'll focus on in this course. Simple linear regression is often explained through correlation. Let's follow that approach and then slowly keep building things up later. Correlation, sometimes called association or dependence, is the relationship between two things. In statistics, these things are often variables; let's call them x and y for now. Note that both variables x and y are connected to an identifier. Without this identifier, none of this will work. The identifier is often represented by the symbol i. We can imagine it to be something like individual people or firms or countries or anything else that can connect the two variables of interest. In this little table over here, there are three identifiers, and each identifier has one value of y and one value of x. Let's go ahead and visualize a larger version of this table on a graph. I'm going to plot 100 data points on a scatter plot where the y-axis represents the variable y and the x-axis represents the variable x. This visual representation is slowly starting to tell us something. In this case, we get a fairly good idea that there seems to be a positive relationship between y and x. In other words, as x increases, so does y. However, there's also some noise in the data, and there seems to be some clumping in the values of y and x around 0. The relationship between the two variables can also change. For example, the relationship could become weaker or even negative. Here we see an example of how data can change its relationship with each other. The correlation between y and x becomes weaker, going all the way to no correlation and then becoming negative. We end up with a relationship that is almost the opposite of what we started with. Visually, it is quite easy to distinguish between extreme types of relationships. However, it can be more difficult to visually identify differences between only minor relationship changes. Take a look at this example.
Here is some data that is correlated in different ways. It is easy to tell a plus one correlation apart from a minus one correlation. However, this task becomes more difficult for smaller correlation changes. At first glance, it would probably be quite hard to identify any difference between the first two graphs. Even though the correlation is different, one has to look quite closely to identify that the relationship between y and x has flattened off a little bit in the second graph. This becomes especially tricky if there are lots of data. If we had a million data points, then all we would see, for example, is a giant blob of blue. And that is why we often want to summarize the relationship between y and x via some kind of data reduction process.

7. Lines of Best Fit: Lines of best fit, what are they and how do they work? A key thing to understand before jumping into the concept of how to produce lines of best fit is that there are two methods that we can use. These are parametric and non-parametric methods. Parametric methods are methods that apply some kind of parameter, or many parameters, to data. Often the parameters will be in the form of an equation such as y equals one. The parameter in this case is one. This method is the method that is used in regression analysis and in ordinary least squares. It has the advantage of simplicity and of working with high-dimensional data. The disadvantage is that it requires stronger assumptions about the data. When these assumptions are not met, your analysis might be completely wrong, and often you may not even know about it. Non-parametric methods let the data speak for themselves. The advantage is that you need to make fewer assumptions about the initial relationships in the data. A big disadvantage is that this method is not very transposable. In other words, you cannot easily tell other people about it.
In addition, it becomes extremely hard to operate this kind of method in multi-dimensional environments. We often use non-parametric methods to explore basic relationships between y and x, and parametric methods to explore more complicated relationships between y and x1 and x2 and x3, etc. Let's have a look to see what I mean by all of this. Let's start with a scatterplot of some new data. In this case, let's plot data from the Stata auto dataset and try to figure out how the price of cars is related to the miles per gallon of petrol consumed by individual cars. The initial scatterplot here tells us there is some kind of relationship between a car's price and its miles per gallon. It looks negative, in other words, downward-sloping. Now, let's try to estimate what kind of relationship this is exactly. We'll begin with a non-parametric regression method. There are many non-parametric methods. Let's pick one called local polynomial regression. Local polynomial regression is a form of moving regression. The user defines a bandwidth, or lets the computer choose one, and a regression is then estimated within that bandwidth. The band then moves continuously across the x-axis step by step and repeats this analysis. The individual steps are then all stitched together to reveal what is essentially a moving average plot of the data. Let's see how this works in practice. The non-parametric method shown here slowly moves across the data space and continuously updates the relationship between y and x. We see that the relationship between y and x starts negatively, but it ends up being slightly more horizontal. In other words, the relationship between y and x here does not appear to be entirely linear. A big advantage of this method is that it lets the data speak for itself and doesn't rely on specific functions or even theory to fit the data. One disadvantage of this method is that the relationship still needs some kind of input.
In this case, it requires the size of the bandwidth. If we change the bandwidth to something smaller, the relationship will look different. Here's an example of that. Another disadvantage of this method is that it is difficult to transfer this relationship to other users. How can we explain this squiggly line to somebody else? So we often choose a parametric relationship. A parametric relationship is one that can be defined by some kind of equation. For example, a linear line fit through the data will have a gradient, and that gradient will be the parameter defining the relationship between y and x. Let's plot a linear function through the data and see how this looks. Here we see a linear line being fitted through the data. In this case, the line fit is based on minimizing the overall distance between the fitted line and all the available data points. This concept is known as least squares, and we'll explore this in more detail in the next section. It underpins the ordinary least squares regression methodology. The fitted line in this case has a particular slope of minus 238. In other words, for every one unit increase in miles per gallon, the average car price appears to drop by $238. Great. However, parametric lines of best fit don't always need to be linear. We can also add a quadratic line of best fit. In this case, we require two parameters to define the relationship between y and x. Here's an example of that. In this case, the relationship between y and x is parametrized by one parameter pulling y down as x increases and another parameter pulling y back up as x increases. In this case, the parameters are approximately minus 1200 for every increase in x and plus 20 for every increase in x squared. Don't worry about the x squared right now; we'll explore this later. But the important concept is that the functional form of parametric lines of best fit can be made to be very flexible as long as enough parameters are available.
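
The linear and quadratic lines of best fit described above can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the data are synthetic and made up for the example (they are not the Stata auto dataset), so the estimated numbers will not match the minus 238 quoted in the class, and the variable names are hypothetical.

```python
import numpy as np

# Synthetic, made-up data: a stand-in for miles per gallon (x) and price (y).
# These are NOT the values from the Stata auto dataset.
rng = np.random.default_rng(0)
mpg = rng.uniform(12, 40, size=74)
price = 12000 - 250 * mpg + rng.normal(0, 1500, size=74)

# Linear line of best fit: one slope parameter plus an intercept.
slope, intercept = np.polyfit(mpg, price, deg=1)

# Quadratic line of best fit: parameters on x squared and x, plus an intercept.
b_sq, b_lin, b_const = np.polyfit(mpg, price, deg=2)


def sum_of_squared_residuals(predicted):
    """Total squared distance between the observed prices and a fitted line."""
    return float(np.sum((price - predicted) ** 2))


ssr_linear = sum_of_squared_residuals(intercept + slope * mpg)
ssr_quadratic = sum_of_squared_residuals(b_const + b_lin * mpg + b_sq * mpg**2)

print(f"linear slope: {slope:.1f}")
print(f"quadratic terms: {b_lin:.1f} (x), {b_sq:.2f} (x squared)")
print(f"sum of squared residuals: linear {ssr_linear:.0f}, quadratic {ssr_quadratic:.0f}")
```

Because the quadratic line has one more parameter than the linear line, its sum of squared residuals can never be larger on the same data, which is one way to see why more parameters make the fit more flexible.
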
How does all of this relate to regression? Well, this is regression. Specifically, this is simple regression, where y is regressed against one variable x. How about multiple linear regression? Multiple linear regression is an extension of simple linear regression, and it adds more variables to the mathematical framework. One easy way to visualize this is by adding further dimensions to the scatter plot, where each extra dimension represents an additional variable. Let's say, for example, that we wanted to explore the impact of MPG on car price, but controlling for a car's weight. Heavier cars are likely to have poorer MPG, and this may affect the price. Visually, we can represent this by a three-dimensional scatterplot that plots price against MPG against weight. It could look a little bit like so. Moreover, by rotating the scatterplot, we can look at the relationship that each explanatory variable has with y, and even examine how the explanatory variables are correlated with each other. Finally, what multiple regression analysis does is, instead of estimating a line of best fit through the data, it fits a plane of best fit through the data. This can be hard to visualize on a screen, but here's a crude attempt of mine. The left graphs show the actual data points on a 3D scatter plot, while the right graphs show the estimated relationship between these data points. This relationship is represented by a 3D plane. If more variables are added to the framework, the plane of best fit becomes a hyperplane of best fit. This is why we sometimes hear people talking about multi-dimensionality when referring to regression analysis.

8. Causality vs Correlation: Causality versus correlation. Hopefully, the previous examples will have given you a good intuitive grasp of what regression analysis tries to do. There are lots of statistics and maths in each type of analysis, but the underlying concept will always remain the same.
Regression analysis tries to tell users how data is related to each other in a way that is easier to understand than looking at the raw data points. However, it is important to be keenly aware of the concept of causality versus correlation. Every regression method is a statistical method that correlates data. That's it. A computer or a mathematical equation cannot identify what is causal. Causality is always interpreted by the end user, and some models allow better claims of causality than others. Evidence obtained from regression analysis about a strong and statistically significant relationship between two variables may be attributed to causality through a compelling theoretical framework and common sense. This can take a lot of practice and almost becomes an art form. Sometimes the data helps. For example, if yesterday's events are used to explain today's actions, the time element in the analysis can be used to make better causal inference. However, in other settings, such as cross-sectional survey settings, it can become much harder to attribute causality. Are people happy because they're healthy? Or are people healthy because they're happy? These are tough questions to answer and require theoretical and philosophical reasoning in addition to statistics. So you should always be careful when dealing with regression analysis.

9. What is Ordinary Least Squares?: What is ordinary least squares? Ordinary least squares is a regression method that is based on the concept of least squares. Least squares is a statistical method that fits a line or plane or hyperplane of best fit by minimizing the sum of squared residuals between the line of best fit and the actual data points. We square the so-called residuals because the sum of them is exactly 0 when they're not squared: negative and positive residuals above and below the line of best fit cancel each other out. Squaring solves this problem. Many other ways of fitting the line of best fit exist.
One example is to fit a line by the method of least absolute deviations, where instead of squared residuals, the absolute value of them is taken. In other words, negatives are turned positive. However, least squares is by far the most popular method across all the sciences.

10. Ordinary Least Squares Visual 1: Let's explore ordinary least squares visually to understand it better. Imagine a small dataset with a few data points, a little bit like this one. Ordinary least squares will fit a line through these data points. This line may be linear, but it can also be non-linear. Let's go with a linear example. The red line represents the line of best fit estimated by ordinary least squares mechanics. In this case, the line of best fit can be represented by a single slope parameter called beta. We often use the Greek letter beta to denote the slope of a regression line. This slope informs us of the estimated relationship between y and x. In this case, y is the price of a car and x is the mileage in miles per gallon. The slope is negative, which means that as the miles per gallon increases, the price of cars decreases. However, note that our slope does not hit any of the actual data points. That is because we are estimating an average relationship between all the available data points. The actual data points are often called observed data points, in other words, y observed. The predicted value of y at any given value of x is then given by the line of best fit. These are called predicted data points, or y predicted. The difference between the observed value and the predicted value is called the residual value. This is what ordinary least squares tries to minimize. You can see here that there are three data points and therefore three different residuals. The sum of all three squared residuals is the smallest value that we can achieve in this case. If we change the line of best fit, for example by moving the line of best fit down, the total sum of the squared residuals will increase.
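
The same idea can be checked with a few lines of Python. This is a minimal sketch, assuming three hypothetical cars whose values are invented for illustration: it fits the least-squares line, confirms that the raw residuals cancel out, and shows that moving or tilting the line increases the sum of squared residuals.

```python
import numpy as np

# Three hypothetical cars: miles per gallon (x) and price (y). Values are made up.
x = np.array([15.0, 22.0, 30.0])
y = np.array([9000.0, 5500.0, 4500.0])

# Closed-form least-squares slope and intercept for a single x variable.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()


def residuals(a, b):
    """Observed y minus predicted y for the line y = a + b * x."""
    return y - (a + b * x)


def sum_sq(a, b):
    """Sum of squared residuals for the line y = a + b * x."""
    return float(np.sum(residuals(a, b) ** 2))


best = sum_sq(intercept, slope)
moved_down = sum_sq(intercept - 500.0, slope)  # same slope, line shifted down
tilted = sum_sq(intercept, slope + 50.0)       # same intercept, different slope

print("raw residuals sum to:", round(float(residuals(intercept, slope).sum()), 6))
print("sum of squared residuals (best, moved down, tilted):", best, moved_down, tilted)
```
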
This is a graphical explanation of what ordinary least squares tries to do. It finds the regression slope and the intercept that lead to the smallest possible sum of squared residuals. Let's have another look at this with more data. In this example, we're going to use the full auto training data to see what happens to the root mean squared error when we apply different regression slopes to the data. In the left panel, we observe the regression slope going through the data. We'll start with a positive slope of plus 100. On the right panel, we see the size of the individual residuals. The residuals are squared and then square rooted to ensure only positive values remain. The lowest value a residual can have therefore is 0. Higher residual values mean that the relevant data point is far from the actual regression line. The average of all these residuals is called the root mean squared error of the residuals, and this is depicted by the red line. It tells us how far, on average, the data points are from the regression line. Now let's look and see what happens when we change the slope. We can see that as we slowly change the slope of the regression line from positive values to negative values, the average error between the line and the data points decreases. The residuals on average are trending down as we decrease the slope. This keeps happening until, after a certain slope value, the average of the residuals starts to increase again. At a slope of around minus 230, the average error from our line is minimized, and therefore that is our line of best fit. Of course, this graph is a simplified version of what happens. Regression models can have many more variables and therefore many more parameters, and we would need many more dimensions to display such models graphically. Now let's take a look at how ordinary least squares models are often presented by a computer.

11. Ordinary Least Squares Visual 2: Here's an example of how Stata presents regression output.
Other computer programs may present this differently, but the essence of the information displayed will be similar among all programs. Often, part of the regression output displayed will be diagnostic information that provides high-level information about the overall regression model. In Stata, this is usually the top part of the output. The lower part of the output table normally presents the estimated coefficients for the relevant variables. There are many pieces of information in this table. However, generally three pieces matter most. The first is the actual parameter estimates. In other words, the estimated slopes or coefficients of lines or planes of best fit through the relevant data. In Stata, this column is called Coef., which is short for coefficient. Each explanatory variable has a relationship with the dependent variable, in this case, price. Each explanatory variable is also conditional on the others. In other words, the effect of miles per gallon, conditional on controlling for weight, is as follows: for each increase of one unit of miles per gallon, price drops by $49. The effect of weight is as follows: conditional on miles per gallon, an increase of one unit of weight leads to a price increase of $1.7. The final variable is a constant. The constant is the value that the dependent variable, in this case price, takes when everything in the model is set to 0. In other words, at a weight of 0 and at 0 miles per gallon, a car should cost around $1946 according to this model. Constants sometimes make sense, and sometimes they don't. In this case, it doesn't make a lot of sense, because cars would never have a weight of 0 or get 0 miles per gallon. Some people say that constants should be removed from models, especially when they don't make sense. I think that is wrong. You just need to take care when interpreting constants. Often, constants should not be interpreted but left in the model.
The next most important piece of information comes from the column called Std. Err., which is short for standard error. The standard error is a statistic that reveals with what degree of accuracy the slope coefficient is estimated. If the standard error is low relative to the coefficient, then we can be more certain that the estimated coefficient is close to the true population parameter. If the standard error is high, we can be less certain and have more noise around our estimate. The standard error is important because it allows us to determine to what extent the estimated coefficients from the regression model are statistically significant. The four remaining columns in the results output are all further computations of the standard error, and they are simply different ways to identify significance. The t-statistic, the p-value, and the lower and upper confidence intervals are essentially all the same thing and are based purely on recalculations of the standard error. We'll look at what they mean in a moment. Finally, the third piece of information that matters most is something called R-squared. This information is given in the diagnostic part of the output table and can be found here. R-squared is a common indicator of goodness of fit for ordinary least squares regression models. It is bounded between 0 and 1, and higher values indicate that the model better fits the data. However, many professional users will caution against an over-interpretation of R-squared statistics. The numbers are relative to the discipline. If you are working with behavioral data, such as people and their choices, then R-squared values of 0.2 or 0.3 are very common and usually indicate good fitting models. If you're working with time series data, such as macroeconomic GDP measures, then R-squared values of 0.8 or 0.9 are very common and indicate good fitting models. Finally, let's talk a little bit more about how the estimated coefficients are related to statistical significance. Let's begin with the t-statistic.
This statistic is an indicator of statistical significance, and normally we are looking for a value of 1.96 or above when we're using a reasonably sized sample. A reasonably sized sample means around 100 or more observations. The t-statistic is easily computed by dividing the estimated coefficient value by the estimated standard error value. Note that when the coefficient is negative, Stata will produce a negative t-statistic. The sign on the t-statistic should, however, be ignored. Next to that is something called the p-value. This is short for probability value and indicates the probability of obtaining the observed results of a test, assuming that the null hypothesis is correct. The null hypothesis in regression tables is normally that a specific result is not different from 0. In other words, small p-values mean that there is stronger evidence in favor of the alternative hypothesis, the alternative hypothesis being that the coefficient is the actual estimated coefficient. In layman's terms, a number of 0.05 or below indicates statistical significance at the 95% level, numbers below 0.01 indicate significance at the 99% level, and so forth. Next are the confidence intervals. There is an upper and a lower confidence interval. Upper and lower confidence intervals are computed by adding or subtracting 1.96 times the standard error from the estimated coefficient. In other words, the confidence interval is usually two standard errors away from the coefficient estimate. Confidence intervals are really useful because they allow you to quickly eyeball statistical tests. Any number outside the confidence interval range will be statistically significantly different from the coefficient estimate. In this example, MPG is not statistically significantly different from 0 because 0 is within the confidence interval range. However, MPG is different from minus 500 because this number is outside the confidence interval range.
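The arithmetic behind the t-statistic and the confidence interval is simple enough to check by hand. The short Python sketch below is illustrative only: the coefficient of minus 49 is the MPG estimate quoted earlier, but the standard error of 86 is an assumed value, since the lesson does not state it, and the function names are hypothetical.

```python
coef = -49.0      # MPG coefficient quoted in the lesson
std_err = 86.0    # assumed for illustration; not stated in the lesson


def t_statistic(coef, std_err):
    """Coefficient divided by its standard error."""
    return coef / std_err


def confidence_interval(coef, std_err, multiplier=1.96):
    """Lower and upper bounds: coefficient minus/plus multiplier * std. error."""
    margin = multiplier * std_err
    return coef - margin, coef + margin


def is_inside(value, interval):
    """True when value lies within the interval (bounds included)."""
    low, high = interval
    return low <= value <= high


t_stat = t_statistic(coef, std_err)
ci = confidence_interval(coef, std_err)

print(f"t-statistic: {t_stat:.2f}")
print(f"95% confidence interval: {ci[0]:.1f} to {ci[1]:.1f}")
print("0 inside interval:", is_inside(0.0, ci))
print("-500 inside interval:", is_inside(-500.0, ci))
```

Under these assumed numbers, 0 falls inside the interval and minus 500 falls outside it, which matches the reading of the table described above. Statistical software typically uses a critical value from the t distribution rather than exactly 1.96, so its reported intervals can differ slightly from this rule of thumb.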
This can be a really useful way to quickly perform statistical testing, and all it involves is multiplying the standard error by approximately two. Now let's take a look at the sum of squares in a bit more detail.
12. Sum of Squares: The previous regression table also provided information on the explained sum of squares, the residual sum of squares, and the total sum of squares. These values indicate how much variation is explained by the fitted model, how much variation is left unexplained by the model, and how much total variation there is in the data. By comparing the proportion of the explained sum of squares to the total sum of squares, we can produce something called the coefficient of determination, often called R-squared. The R-squared value is a widely used measure of fit for ordinary least squares models. The value indicates how well the model fits the data. Values of one mean a perfect fit. Values of 0 mean a terrible fit. However, the basic R-squared can only increase as more explanatory variables are added to the model. In other words, models with hundreds of random covariates can saturate the data and produce artificially high goodness-of-fit statistics. This is why we often also report the adjusted R-squared, which imposes a penalty for more variables being added to models. If additional variables add little explanatory power, they will reduce the adjusted R-squared value. This statistic tries to strike a balance between rewarding good model building and overloading models with unnecessary variables. However, it should be noted that R-squared can be easily abused and should be treated with caution. High R-squared values do not necessarily imply that one model is more valid than another. Let's take a look at this example. In this demonstration, I'm going to change the noise level around the line of best fit. The true relationship between y and x is one, and this is what is estimated by the line of best fit.
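A rough way to try this kind of noise demonstration yourself is sketched below in Python. It is a simulation under stated assumptions, not the code behind the animation: the sample size, the noise levels, the seed, and the function name `r_squared_and_slope` are all made up for illustration. The data are generated with a true slope of one, R-squared is computed as the explained sum of squares divided by the total sum of squares, and the noise level is then raised.

```python
import random


def r_squared_and_slope(noise_sd, n=2000, seed=1):
    """Simulate y = x + noise, fit a line, return (R-squared, estimated slope)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, noise_sd) for x in xs]  # true slope is 1

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    predicted = [intercept + slope * x for x in xs]
    explained_ss = sum((p - mean_y) ** 2 for p in predicted)
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    return explained_ss / total_ss, slope


r2_quiet, slope_quiet = r_squared_and_slope(noise_sd=0.1)
r2_noisy, slope_noisy = r_squared_and_slope(noise_sd=5.0)

print(f"low noise:  R-squared {r2_quiet:.3f}, slope {slope_quiet:.3f}")
print(f"high noise: R-squared {r2_noisy:.3f}, slope {slope_noisy:.3f}")
```

In this sketch the R-squared should fall sharply when the noise is increased, while the estimated slope stays close to one in both runs.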
The original data has very little noise and the regression line hits almost every data point, resulting in an R-squared of one. Now let's go ahead and change this noise level around the true regression line. We can see now that the R-squared changes quickly as we increase the noise around the data. The R-squared quickly drops in value, suggesting that the model fits the data worse and worse. However, the model actually remains the same. What is changing is only the noise around the data. Noisy data result in a lower R-squared value, and the layman observer might claim this to be a poor model. But as you can see, the relationship between y and x hasn't changed at all, and the model continues to recover the correct coefficient value. Both models in this case have the same validity, even though they have different R-squared values. And that is why I want you to always be careful when interpreting R-squared. The R-squared example leads us to our next discussion point.
13. Best Linear Unbiased Estimator: Ordinary least squares is said to be the best linear unbiased estimator if certain conditions are true. Having an understanding of these conditions is important, as some matter more than others. These conditions are often called the Gauss-Markov assumptions and refer to four particular assumptions that need to be made about the data. If these assumptions are met, then the ordinary least squares estimator is said to be unbiased. In other words, the results produced by the estimator will on average be correct. If the Gauss-Markov assumptions are met, the OLS estimator will also be the best estimator. Best is another word for efficiency in statistics. This simply means that the ordinary least squares estimator will produce the most accurate results with the least amount of noise. Let's explore these two concepts a little further before we discuss the actual assumptions. Efficiency refers to the width of the sampling distribution.
When an estimator is said to be most efficient, its sampling distribution is narrower than that of any other estimator. We can visualize this in an easy way by assuming we have two different estimators and an infinite amount of data. From this infinite amount of data, let's go ahead and select a small sample, then try to estimate a particular coefficient for a variable. We're going to use an inefficient estimator and an efficient estimator. We're going to set the true value of the coefficient to one. The first time we estimate the coefficient using both estimators, we return a value of around minus six for the inefficient estimator and minus two for the efficient estimator. Now let's go ahead and repeat this process. The second time, our estimates are closer. The inefficient estimator predicts a value of around minus one and the efficient estimator of around 0. Both are still some way off the true value, but the efficient estimator seems to be getting closer. Now let's go ahead and repeat this process quickly, hundreds of times, and see what happens. Both estimators, on average, get the correct value of one. However, the inefficient estimator is on average further away with its predictions than the efficient estimator. This is the concept of efficiency. And whilst we normally don't have an infinite amount of data, this concept is often visible in the standard errors of real-life results. Inefficient estimators tend to have high standard errors, resulting in more uncertainty around the true estimated value. Next, let's explore the concept of unbiasedness. When an estimator is said to be unbiased, this means that the mean of the sampling distribution of the coefficient estimates will approximate the true population coefficient. We can visualize this in an easy way by, again, assuming that we have two different estimators and an infinite amount of data. We'll select a small sample of this data and try to estimate a particular coefficient.
The true value of this coefficient is set to one, and this is denoted by the dotted red line. We use a biased and an unbiased estimator to estimate the same coefficient. The first pass produces an estimate of around 0 for the biased estimator and 1.5 for the unbiased estimator. Now, let's do it again. On the second pass, the biased estimator performs better, with a result of three compared to the unbiased estimator with a result of five. But let's continue and repeat this process many times. As we repeat the process, we see that on average the unbiased estimator starts to predict a value of one, whilst the biased estimator predicts a value of minus one. That can obviously be a big problem. For example, the objective might be to perform a policy evaluation, and a biased estimator estimates the policy to have a negative effect, whilst in reality it might actually have positive effects. Bias is a serious problem in econometrics, and ordinary least squares requires some pretty strict assumptions for estimates to be unbiased. It is important, then, to have some understanding of the assumptions behind ordinary least squares.
14. The Gauss-Markov Assumptions: The Gauss-Markov assumptions are the underlying assumptions that make ordinary least squares the most efficient and unbiased estimator. Generally, four major conditions are needed to achieve this result. These are the homoskedasticity assumption, the no perfect collinearity assumption, the linear in parameters assumption, and the zero conditional mean assumption, sometimes called the exogeneity assumption. Roughly speaking, the first two relate to efficiency, whilst the last two relate to bias. Let's explain each in turn and try to determine which matters most.
15. Homoskedasticity: The homoskedasticity assumption states that the variance of the residuals remains stable across the spectrum of an independent variable.
In other words, the errors produced by a variable remain roughly constant whenever we look at a small part of that variable. Failure of this assumption leads to biased standard errors, and this means we cannot rely on hypothesis testing. However, many modern statistical packages can easily test and correct for this assumption. It is very common, for example, to use something called robust standard errors, which increase the inefficiency of the estimates slightly but make them immune to the failure of this assumption. Let's go ahead and look at an example. In this video, there are two graphs. The left graph shows the relationship between the explanatory variable x and the dependent variable y. The overall relationship never changes, but the variance across x will. In the right graph, we see the residuals or errors across x. It shows the distance of the actual data points to the line of best fit. The left graph also shows the slope estimate and the standard error from a normal ordinary least squares regression and a robust ordinary least squares regression. Now let's go ahead and run this example and examine what happens when we introduce a changing variance across x. We see that as we increase the variance across x, the actual regression coefficient never changes. However, the standard errors increase as we increase the variance across x. Moreover, the robust standard errors increase by a little bit more. All this means is that the failure of the homoskedasticity assumption leads to less precise estimates. In the real world, with modern datasets, a failure of this assumption often has little overall effect on the actual results, and most practitioners do not focus on this assumption a lot.
16. No Perfect Collinearity: The no perfect collinearity assumption states that an explanatory variable cannot be an exact linear combination of another explanatory variable. If this is the case, ordinary least squares simply cannot be estimated.
This is rarely a problem in real life, as you would never enter the same variable twice into a regression. However, when there is partial correlation between two variables, in other words when they measure the same thing to some extent, we term this multicollinearity, and this can have some effect on our estimates. Specifically, it will increase the noise and therefore the standard errors of our estimates. This phenomenon is generally easy to test for and also easy to deal with, by either excluding variables or transforming them. Let's look at an example. In this example, I generated a dataset that has five different explanatory variables. These range from x1 to x5. Each x variable has a coefficient of one. The graph on the right presents the estimates from an ordinary least squares regression and the associated 95% confidence interval around these estimates. We can see that ordinary least squares estimates a value of approximately one for each of the five variables. On the left graph, we see the correlation between x1 and x2. Currently, there is no correlation between the two variables, which is why the data points are scattered randomly. Let's go ahead and see what happens when we start to introduce a correlation between x1 and x2 and slowly force x1 and x2 to measure the same thing. At first, not much happens, but then, as the correlation between the two variables increases, the standard errors and therefore the confidence intervals of both x1 and x2 start to increase. This happens until they explode towards the end. This is the effect of collinearity. High collinearity between variables leads to very noisy estimates. But as you see, the noise explosion only happens towards the very end, and in most real scenarios the effects of collinearity are hardly noticeable.
17. Linear in Parameters: The next assumption is that the model is linear in parameters. This assumption means that the relationship between y and the x's in the ordinary least squares model is linear.
In other words, the coefficient estimates take single values and can only be added or subtracted. They cannot be exponentiated, divided or multiplied. In general, this assumption makes ordinary least squares regression models easier to interpret. Note that this only applies to the actual coefficients. Variables can be transformed in any way, including nonlinear ways. We often call this functional form, and we can vary the functional form as we please in ordinary least squares regression. For example, it is common to add higher-order polynomials of variables to a regression equation. A commonly used example is age and age squared, where both variables are entered separately. This has the effect of introducing a curve into the line of best fit. Variables can also be interacted with each other, and we call these interaction effects. This means that lines of best fit can take on very complicated functional forms. Let's go ahead and look at an example. In this example, there are two graphs. The left-hand side shows the data plot of the auto data, where the price of cars is plotted against MPG. The right-hand graph shows the residuals, or how far the individual data points are from the line of best fit. The average distance is represented by the red horizontal line. The initial relationship plotted through the data is linear. But it should be fairly obvious that this relationship is probably not a good fit. So let's introduce a quadratic term into this relationship and slowly increase the coefficient on the quadratic term from 0. Here's what happens. The line of best fit starts to curve up, but this curve results in a better fit. We can see the residuals coming down, especially for higher values of MPG. Model fit improves. At some point, we push the curve too far by continuously increasing the quadratic coefficient, and then model fit becomes worse again. This example highlights the power of functional form.
The model is still linear in parameters because the two estimated coefficients are only added or subtracted. But the square manipulation of x leads to a complicated nonlinear functional form that improves the model fit.
18. Zero Conditional Mean: The zero conditional mean assumption, often called the exogeneity assumption, is one of the most important assumptions in ordinary least squares. The assumption states that there is no correlation between an explanatory variable x and the error term. Failure of this assumption leads to bias in the coefficient estimate. This assumption can often fail in real life, and because it involves the error term, which by definition is not observable, it can never be tested. A good rule of thumb is that whenever a variable is a choice, especially an individual choice, then it's likely to be driven by factors that are unobserved, and hence a relationship with the error term might exist. Let's have a look at an example. In this example, I've set up a simulated dataset that again contains five explanatory variables. Each variable has a coefficient of one in relation to y, the dependent variable. On the right-hand graph, we can see the individual ordinary least squares estimates and the associated confidence interval for each of the five variables. The correct results are shown by the vertical red line. On the left graph, we see the correlation that the variable x1 has with the error term. Note that in reality we can never observe this, as the error term will always be hidden from us. Only in this simulated example can we see the error term. The original correlation between x1 and the error term is set to around 0. Now let's go ahead and increase the correlation between x1 and the error term and see what happens. We observe that the ordinary least squares estimate for x1 slowly deviates to the right, away from its true value. The more we increase the correlation between x1 and the error term, the higher the bias in our result becomes.
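The effect of a correlation between an explanatory variable and the error term can also be reproduced with a small simulation. The Python sketch below is a simplified, hypothetical version with a single explanatory variable rather than five. The true coefficient is one, and the `strength` parameter, the sample size, the seed, and the function names are all assumptions made for illustration.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_slope(strength, n=20000, seed=7):
    """Estimate the slope when x contains `strength` times the error term."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        error = rng.gauss(0.0, 1.0)                  # hidden from us in real data
        x = rng.gauss(0.0, 1.0) + strength * error   # x is correlated with error
        y = 1.0 * x + error                          # true coefficient is 1
        xs.append(x)
        ys.append(y)
    return ols_slope(xs, ys)


slope_exogenous = simulate_slope(strength=0.0)
slope_endogenous = simulate_slope(strength=1.0)

print(f"no correlation with the error term: {slope_exogenous:.3f}")
print(f"correlated with the error term:     {slope_endogenous:.3f}")
```

In this setup the estimate should sit near the true value of one when there is no correlation with the error term, and near 1.5 when `strength` is set to one, so the estimate is pushed away from its true value even though the sample is large.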
This can be a real problem in applied work. When we have such a problem, we often call it endogeneity.
19. How to Test and Correct Endogeneity: It is not possible to test for something that cannot be seen. That is why good ordinary least squares models are strongly underpinned by theoretical frameworks, prior literature, and rational argumentation. This assumption is also why many scientists argue against data mining with ordinary least squares models. Data mining approaches increase the likelihood of the exogeneity condition failing and results becoming biased. In the real world, the way to deal with endogeneity is often by more data, better and more thoughtful model building, different functional forms, and sometimes simply accepting that the models may have some bias.
20. The Gauss-Markov Assumptions Recap: Let's recap the Gauss-Markov assumptions. The linear in parameters assumption is a condition that requires all betas to be additive. It means, in layman's terms, that the dependent variable should be continuous. But it does not mean that the relationship between y and x must be linear. More complicated functional forms can be worked into ordinary least squares regression models. Violation of the zero conditional mean assumption, often called the exogeneity assumption, can lead to biased estimates. This is a very important assumption. It is not possible to test for it statistically. Identifying or defending against it must be done on theoretical grounds. There is no easy solution if this assumption is violated. Options are to include missing variables in the regression model, to attempt alternative identification techniques, or to resort to simulation-type methods that try to identify the size and direction of any potential bias. The no perfect collinearity assumption must be met or ordinary least squares regression won't work. However, weaker collinearity between variables will result in increased standard errors.
Fortunately, standard errors only explode at extreme correlations. This can be tested for and corrected by either dropping variables or transforming them. Violation of the homoskedasticity assumption leads to incorrect standard errors. It is easy to test for using an appropriate statistical test and easy to correct for with robust standard errors, which are included in almost all statistical software packages.
21. Applied Examples: Let's explore some of the concepts we've been discussing in a more applied environment. We are now in Stata, which is a statistical software package commonly used to analyze quantitative datasets. It's similar to other packages such as SPSS or SAS. I won't explain how to operate Stata or the code that I'm executing to obtain the results. You can learn more about that in Stata-specific courses. I've already opened up a training dataset called auto. Let's go ahead and examine it a little bit closer before we start running regressions. A common mistake is to start analyzing data too quickly, before fully understanding what's actually inside the data. Modern datasets can be very complex, and more often than not the time spent on data preparation and manipulation will outweigh the time spent on actual regression analysis. Let's describe the data to see what we have. The output returned by describe will produce some high-level information about the data, such as where it is located, how many observations there are, and how many variables are included. In this case, our data contains 74 observations and 12 variables. It's not very big. It also has a title that tells us that this data is related to cars from 1978. Below that is information about the variables. One of them is a string variable that contains the names of the car types, and the rest are all numeric variables. Let's pretend that we are really interested in explaining the determinants of car price. We can already start building a picture in our head.
What variables might be important in explaining the price of a car? Weight and mileage seem like important variables. A car's turning circle is probably less important to most people who buy cars. Next, let's explore some summary statistics of the data so that we get some idea of how the variables are measured and distributed. Price appears to be measured in dollars, and the least expensive car costs around $3,000, whilst the most expensive car costs around $16,000. Such prices seem reasonable for 1978. We also see that the variable rep78 has some missing observations. It only has 69 instead of 74. Most variables also appear to be continuously measured. However, it looks like the variable foreign is measured as a binary variable. Let's go ahead and confirm this quickly. By tabulating foreign, we see that indeed foreign is measured as a binary variable. Around 29% of cars are foreign. So let's go ahead and estimate some ordinary least squares regression models. Rather than immediately going into a full-blown model with many variables and interaction terms, let's build it up slowly and interpret the output and diagnostics along each step. The variable foreign lends itself to a nice simple question: are foreign cars more expensive than domestic cars? We could answer this question by quickly computing the mean for both subsets of the data and simply comparing the means. However, we can also achieve the same thing in a regression framework. Let me show you. This code regresses the explanatory variable foreign against the dependent variable price. The regression results in this table are pretty easy to interpret. But before we do that, let's quickly look at some diagnostics. The regression includes 74 observations. So that's good. There are no missing observations. The F statistic is not significant. Here we are looking for a p-value below 0.05. Values above 0.05 imply that our total model does not explain how price varies.
In other words, all the variables in our ordinary least squares regression together do not explain how price varies. Likewise, the R-squared is extremely low. A value of 0.0024 means that we are explaining almost nothing in terms of price variation with the variable foreign. Now let's go and look at the results. We have one variable called foreign. However, this is a binary variable, not a continuous variable. Such variables have the following interpretation: if the value of the variable is flipped from 0 to one, in other words if a car changes from being a domestic car to a foreign car, by how much will the car price increase? The answer here appears to be $312. However, we also observe that the standard error around this estimate is quite large. The standard error is $754. That means the associated t-statistic is below 1.96 and the p-value is above 0.05. This means this variable is not statistically significant at the 95% level. We get an idea of the uncertainty by looking at the confidence interval. This ranges between minus $1200 and plus $1800. The true value is somewhere in there, but because the confidence interval crosses 0, we cannot claim statistical significance compared to the value 0. Finally, remember that the effect of a variable is conditional on other controls. In this case, there are no other variables in the model, but there is a constant. The constant is the value of price if everything else is set to 0. In other words, if a car is domestic and its value of foreign is set to 0, it will cost around $6,000. A foreign car is $312 more expensive and would cost around $6,300. We can also visualize this. Here we see the estimated effect of foreign cars on price. Domestic cars are cheaper on average, and foreign cars are more expensive by $312. But the confidence intervals of both values are so large that they are not statistically different. Great. Let's go ahead and increase the number of variables in our model.
We could throw all our variables in and simply see what sticks. This is what a data mining approach would generally do. Stata has various data mining abilities, including stepwise regression, which will automatically eliminate variables that are not statistically significant. However, there are some conceptual problems with this approach. One of the most important problems is that it prevents users from thinking about the problem at hand and doesn't allow them to understand how the data analysis is related to underlying theory or their research hypotheses. For this demonstration, let's go ahead and slowly add one variable after another to our regression model. We will not remove foreign, even though it is insignificant, because the addition of other variables may change its effect. Let's go ahead and add miles per gallon to our model. We see now, interestingly, that some immediate, significant changes have occurred. Our R-squared has jumped drastically to 0.28. The adjusted R-squared is a little bit lower at 0.26, but this is still much, much higher than before. Our new variable MPG is statistically very significant, with a small standard error and a high t-statistic. Each increase of one unit of MPG, in other words cars getting more fuel efficient, will decrease the car price by $294. However, we also see that the effect of foreign cars has increased dramatically to plus $1700. The standard error has come down a bit, from previously 754 to now 700. The variable is now statistically significantly different from 0. What a big difference one variable can make to a model. Importantly, we can explain this change. It turns out that foreign cars have significantly higher miles per gallon numbers than domestic cars. And once this factor is controlled for, the actual price of foreign cars is higher than for domestic cars. This is because the effect of MPG on price is negative. Because foreign cars have higher MPG, their price was lower.
Now that this effect is being controlled for and therefore taken out of price, the actual effect of a car being foreign is that it causes a price rise. This is a perfect example of the exogeneity assumption I was talking about in the previous session. We omitted an important variable from the regression model, and the explanatory variable we did include was correlated with that important variable in the error term. Therefore, the previous result was biased. However, because we have now moved the offending variable MPG from the error term into the regression model, we are controlling for it and have hopefully produced a less biased estimate. This really shows the importance of careful model building. Let's go ahead and introduce a third variable to our model: weight. Weight is likely to be an important variable because heavy cars need more raw materials, but also because heavier cars are likely to affect the MPG number, and we know that this in turn affects the foreign estimate. So let's go ahead and add it to our regression model. Look at that. Now our R-squared jumps up again by a large margin, and our estimated effects have changed again. Let's explain it one more time from the top. The new variable weight is statistically significantly different from 0 due to a small standard error, a high t-statistic and a small p-value. The effect is positive. In other words, each additional pound of weight on the car increases the price by $3.46. The effect of MPG is now positive instead of negative. The inclusion of weight reversed the sign of this estimate. Higher MPG cars now lead to higher prices, although this effect is not statistically significant. This makes sense. After all, cars with higher miles per gallon are more fuel efficient and save money. This may require better technology, and therefore such cars may cost more. However, the previous effect was masked by the fact that heavier cars have worse mileage.
Now that this is controlled for, the effect of MPG has become less biased. Moreover, because there's a knock-on effect of MPG on foreign status, we now see the effect of foreign cars jump to $3,600 with a lower standard error of 680. This is another important example of regression bias: important explanatory variables were left in the error term. Let's assume for a moment that we are now finished with our model building and that we are happy with the specification that we have. The next step is usually to perform some diagnostic statistics, especially in relation to the Gauss-Markov assumptions discussed in the previous session. Unfortunately, the exogeneity assumption cannot be tested and can only be inferred by adding other variables to the model, as just shown, or by resorting to theory. We can, however, test the homoscedasticity assumption. Let's go ahead and do that. Here Stata has performed a test for homoscedasticity. The results show that the null hypothesis of constant variance is rejected in favor of the alternative hypothesis of heteroscedasticity, in other words, varying variance. We can also explore this visually by examining the residuals. Here we've plotted the residuals versus the fitted values. This residual-versus-fitted plot shows how the residuals are distributed around the plane of best fit. Values close to 0 mean a good fit. We can clearly see on this plot that when we move from low fitted values to higher fitted values of price, the variance of the residuals around 0 increases. This is clear evidence of changing variance and needs to be dealt with. We can either use robust standard errors or specify a different functional form that tries to remove this changing variance. Improving model fit is often a better first option. In this case, the problem might be caused by the fact that, like many price variables, car price has a long tail. Often we transform such variables with logs. So let's go ahead and do that.
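The idea behind such a constant-variance test can be sketched in a few lines. The session does not say which test Stata ran, so the choice here is an assumption: the sketch uses the studentized (Koenker) form of the Breusch-Pagan statistic, which is the sample size times the R-squared from regressing the squared residuals on the explanatory variable. The data are made up: one series whose errors fan out as x grows and one whose errors keep the same size. The function names are hypothetical.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of a one-regressor OLS fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def breusch_pagan_lm(x, y):
    """n * R^2 from regressing squared OLS residuals on x (Koenker's form)."""
    b0, b1 = simple_ols(x, y)
    u2 = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]
    a0, a1 = simple_ols(x, u2)
    mean_u2 = sum(u2) / len(u2)
    ss_tot = sum((v - mean_u2) ** 2 for v in u2)
    ss_res = sum((v - a0 - a1 * xi) ** 2 for xi, v in zip(x, u2))
    return len(x) * (1 - ss_res / ss_tot)


# Hypothetical data: deterministic "errors" that alternate in sign.
x = list(range(1, 21))
sign = [(-1) ** i for i in x]
y_fan = [2 + 3 * xi + s * xi for xi, s in zip(x, sign)]   # error size grows with x
y_flat = [2 + 3 * xi + 5 * s for xi, s in zip(x, sign)]   # error size is constant

lm_fan = breusch_pagan_lm(x, y_fan)
lm_flat = breusch_pagan_lm(x, y_flat)

# With one regressor, compare against the 5% chi-square(1) critical value.
CRITICAL_5PCT = 3.84
print(lm_fan > CRITICAL_5PCT, lm_flat > CRITICAL_5PCT)
```

A statistic above the critical value leads to rejecting constant variance, which corresponds to the fanning pattern in a residual-versus-fitted plot. A statistic near zero gives no evidence against it.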
Now, let's run a new regression with the dependent variable as log price instead of price. Let's see what happens. At first glance, it looks like everything has changed. The coefficients are completely different. However, because we have now transformed the dependent variable, all explanatory variables relate to the log price and not the price. This means their interpretation is slightly different. Now, a one-unit increase in weight increases the log price of a car by 0.0004. This can be a rather inconvenient way to interpret a model's estimates, so we often retransform the coefficients to make them easier to understand. When a regression model has no log transformation, either for the dependent variable or the explanatory variable, we call this a level-level model. The interpretation is straightforward. When an explanatory variable is logged, the interpretation of the coefficient changes: a one percent increase in x causes a beta divided by 100 unit change in y. When the model has a log dependent variable, the interpretation changes to: a one-unit change in x causes a beta times 100 percent change in y. When the model is a log-log model, the interpretation is that a 1% change in x causes a beta percent change in y. So in this case, a one-unit change in weight causes a 0.0004 times 100, which equals 0.04%, increase in price. Likewise, foreign cars now cost around 53% more in terms of price. Now let's go back and test the homoscedasticity assumption again. The test statistic reveals that we can no longer reject the null hypothesis of homoscedasticity. We can also visualize this again using the residual-versus-fitted plot. Here we can see that as we move along fitted price values, the spread of the residuals around the horizontal 0 line is much more even. This is visual evidence that our model now has homoscedastic errors and that we can accept that particular assumption. Next, let's go check for collinearity.
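The retransformation arithmetic for a log dependent variable can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, the 0.0004 coefficient is the weight estimate quoted above, and the 0.5 coefficient is an invented value used only to show what happens when a coefficient is large. The "beta times 100" rule is an approximation. The exact percentage change is 100 times (exp(beta) minus 1), and the two agree closely only when beta is small.

```python
import math


def pct_change_approx(beta):
    """Rule of thumb for a log-level model: beta * 100 percent per unit of x."""
    return 100 * beta


def pct_change_exact(beta):
    """Exact percentage change in y per unit of x: 100 * (exp(beta) - 1)."""
    return 100 * math.expm1(beta)


# Weight coefficient quoted in the session: approximation and exact value agree.
small = 0.0004
print(pct_change_approx(small), pct_change_exact(small))

# Hypothetical large coefficient: the rule of thumb understates the effect.
large = 0.5
print(pct_change_approx(large), pct_change_exact(large))
```

For the weight coefficient both versions give roughly 0.04 percent per pound. For the hypothetical 0.5 the rule of thumb gives 50 percent, while the exact figure is close to 65 percent, so for large coefficients, such as those on dummy variables, the exact version is the safer one to report.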
This variance inflation factor test highlights to what extent each variable inflates the variance of the model. High values, above say 50 or so, on particular variables indicate that these variables are collinear with other variables. Here, there is no evidence of high collinearity in our model because all variables have very low variance inflation factor values. Finally, we can also introduce more complicated functional forms. Parameters must be linear, but variables can be transformed and offer more complex forms than just linear relationships. For example, we can include a weight squared variable in the regression to allow a quadratic relationship to exist between log price and weight. In this new regression, Stata has included weight and the weight squared variable. It's important that the interacted variables are analyzed together. Whilst the weight variable is not statistically significant, the squared weight variable is statistically significant, and a joint test should be done on both to see whether the pair is significant or not. Let's assume for a moment they are jointly significant. The interpretation of the output becomes a little bit more complicated, but interaction effects can also be visualized, and Stata can do that for us. Here we can see that the relationship between weight and log price predicted by our ordinary least squares model is not actually linear, but appears to be quadratic in nature. In other words, there is a curve going through the relationship between price and weight. As weight increases, the log price increases more and more. Great. Now let's assume we are done with model building. Regression models are often not presented as they are shown by statistical programs. There's simply too much information in the tables presented by statistical programs, most of which is redundant or not useful to lay readers.
It is also common to include multiple regression models in a table so that readers can follow the progress of the coefficients as additional variables are included or removed from models. Here's an example of how regression tables often look in reports. This is a classical regression output table that contains the coefficients to three decimal places and standard errors to three decimal places. Asterisks are included to easily identify statistically significant effects. The diagnostics only include the observation count and the R-squared statistic. This table easily allows readers to read across and examine how the effect of the variable foreign on price, for example, changes as we change our model specification. This kind of approach is important because it is transparent: it shows the ingredients of how this particular statistical meal was made. Readers can judge for themselves whether they agree with your particular conclusion or not. This concludes this practical session on ordinary least squares regression.

22. Final Thoughts and Tips: Final thoughts and some tips. Hopefully, you've enjoyed this introduction to linear regression analysis. I have some tips you may want to consider when applying regression analysis to data. Practice. As with many things in life, it is practice and frequent application that lead to an ever greater understanding of the issue at hand. The same is true for regression analysis. All the theory in the world will not overcome a lack of engagement or application. I always recommend that people should just get stuck in and start exploring data. Think carefully about your original objective. Are you trying to simply understand correlations in your data? Or are you trying to determine cause and effect? The first can be done by simply playing around with the data and the regression model. The second will need much more deliberate thought about theoretical underpinnings and rational argumentation. Why might X cause Y?
And what could the transmission mechanism be? What else might influence such transmission? Estimate multiple models with small variations. Results are more convincing when different models consistently show the same kind of outcome. Does the inclusion of a particular variable change everything, or do your coefficients remain robust? Showing a pathway towards your final preferred specification is a very important part of modern regression analysis. Data quality and sample size matter as much as model building. Big innovations in data quality and size have happened since the 1980s. Not every model needs to be a complicated thing. Quality data can add significant credibility to any results, and you should not shy away from claiming that this data is the best data available to answer this particular research question. High-quality datasets often require complicated data manipulation. A lot of regression mistakes do not emerge from bad model building, but from poor data coding. Do not underestimate the amount of time that should be spent on data cleaning and preparing the data for regression analysis. Ordinary least squares is still the most commonly used regression method in the world. It would be wrong to dismiss it as a simplistic method. Playing around with functional form through interaction effects can lead to complicated ordinary least squares models that closely resemble reality. Do not be afraid to explore more complicated models that use quadratic terms and other interaction terms. Understand the role of diagnostics in regression analysis. Do not get hung up about textbook diagnostics, but do query whether the regression assumptions about the data hold. Are there assumptions that might be too strong for the data at hand? Finally, have a healthy dose of skepticism when someone is claiming a causal relationship. Regression coefficients often contain some kind of bias. At the same time, don't be a naysayer and reject everything.
Like many things in life, regression analysis is an extra tool that should be used in conjunction with other evidence, such as prior results, theoretical frameworks, and also qualitative evidence. There is a fine line between art and statistics in regression analysis.
