Transcripts
1. Introduction: Welcome. Data analysis can be hard. There are so many
different methods and so many different ways to analyze and interpret data that learning can be
very difficult. In this class, I
want to give you an easy and fast
outline of one of the most popular methods in data analysis,
linear regression. The key to this class
is that there are no equations, no maths, no tricky bits of
theoretical knowledge. I want to give you an intuitive
and graphical explanation of what linear regression is, and then show you a range of practical data
analysis examples. No matter what your current professional knowledge status, you can feel confident
about knowing the ins and outs of
linear regression after this class. What is linear regression? Linear regression is the most popular regression
method used in the world. Of the linear regression
techniques available, ordinary least squares, often abbreviated to
OLS, is the most common. And I'm going to focus on ordinary least squares
because it's by far the most used
regression method for data analysis in the world. Ordinary least squares
is a technique that examines the
relationship between one continuous variable
and one or more continuous and/or
categorical variables. And this technique is used in many disciplines
including economics, sociology, psychology, drug,
fear, and even history. It's used all over the world. And it's also often used in business for
quantitative analysis. And it underpins many government reports that perform some
kind of policy evaluation. Anybody who wants to have
a good understanding of data analysis will need to
understand linear regression. What are the main
learning outcomes? To learn and understand
the basic intuition behind linear regression
methods in data analysis. To learn the associated
terminology and underpinnings. To learn how to comfortably
interpret and analyze output. Finally, to learn
some extra tips and tricks that will help
you in data analysis. Who's this course
for? This course is aimed at those who
are starting off their career in data analysis. That could be practitioners,
somebody in government, somebody in policy, somebody in business, or even students. What prerequisites are needed? There is no maths and
you don't need to worry about any equations to get
the most out of this course. Curiosity is all that is needed. Some Stata
knowledge may be handy for the practical
application of this course, but it's not required. Stata is a statistical
software program that allows users to estimate many
different quantitative methods. I'm going to use
it to demonstrate the ordinary least
squares examples. Also, a keen interest in
understanding how data might be related to each other
is a useful prerequisite. Often, data analysis
is all about measuring quantitative
variables against each other. If you want to know
how y is related to x, then this is the right place
for you. Using Stata. In this course I'll be using Stata to demonstrate some examples. Stata is an approachable
statistical software. There are many courses on
how you can use Stata, should you be interested.
In this course, I will not teach you
the ins and outs of Stata, but I will focus on the
interpretation of output. There are many other statistical software packages such as R or SPSS that can
do exactly the same. However, if you are
interested in Stata and you want to replicate some of the
examples from this course, I have attached the relevant
code files to this course. I'm going to be using
something called the auto training dataset
that comes inbuilt with Stata for
practical examples. This data is a training
set that contains a variety of useful
variables and relationships, and it is great for
teaching purposes.
teaching purposes. You can also download it
as part of this course. Let's proceed to the
next section and learn more about
regression methods.
2. What is Regression Analysis?: What is regression analysis? Regression analysis is
a statistical technique that attempts to explore the relationship between
one dependent variable and one or more
independent variables. Alternative terms for the dependent variable are
the outcome variable, the response variable, or
the endogenous variable. The dependent variable is normally
denoted by the symbol y. Alternative terms for
independent variables are predictor, explanatory,
or exogenous variables. Explanatory variables
are normally denoted by the symbol x. It is common to write
regression models in the form y equals X1 plus X2
plus X3, et cetera. The last term will
be an error term. This is often denoted by e. This captures everything
that is missing. However, there are many
different practices for writing regression
models in mathematical form, so we will avoid all of
that in this course. Variables can take many different forms in
regression analysis. They can be continuous. In other words, data can be measured anywhere
on the number line, to many decimal points, for example minus 2.305 or 100.3. Data can also be in integer
format such as 1, 2, 3, 4, 5, etc. Data can also be in binary
format such as 0 or 1. Often these denote binary
responses such as yes and no. Sometimes data are ordinal. Ordinal data is categorical
data that is ranked, such as Likert scales. Finally, data can
also be nominal. Nominal data is categorical
data that is unranked, for example, modes of transport. Importantly, data must
always be in numeric format. Mathematics and
computer software can do very little with
string type data. String type data is data that
contains letters and other non-numeric characters,
like exclamation marks. Data can also be transformed and this is a common feature
of regression models. For example, taking the
log of y and making this the new
dependent variable is a very common technique
in regression analysis. By doing so, the interpretation of the entire model
will be changed. And clearly, this needs to
be carefully considered when using or
analyzing such models.
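
To make the points about numeric formats and transformations concrete, here is a minimal Python sketch. It is not part of the course's Stata files; the rows, the variable names (`mode`, `income`) and the numbers are made up for illustration. It turns a nominal string variable into 0/1 columns and takes the log of a continuous variable.

```python
import math

# Hypothetical survey rows: "mode" is nominal string data, "income" is continuous.
rows = [
    {"mode": "bus", "income": 20000.0},
    {"mode": "car", "income": 45000.0},
    {"mode": "bike", "income": 30000.0},
    {"mode": "car", "income": 60000.0},
]

# Nominal categories are unranked, so each one gets its own 0/1 column
# rather than a single 1, 2, 3 code that would imply an order.
categories = sorted({row["mode"] for row in rows})


def encode(row):
    """Return a purely numeric version of one row."""
    dummies = {f"mode_{c}": int(row["mode"] == c) for c in categories}
    # Taking the log of a variable changes how the whole model is interpreted.
    return {**dummies, "log_income": math.log(row["income"])}


encoded = [encode(row) for row in rows]
for record in encoded:
    print(record)
```

In an actual regression, one of the 0/1 columns is normally left out and serves as the reference category.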
3. What is Linear Regression?: What is linear regression? Regression analysis
is a catch-all term for every type
of regression method. Often regression
methods are split into linear and non-linear
regression methods. There are many methods in
both of these two camps. In this course, we'll focus
only on linear methods, specifically the
ordinary least squares method,
which is the most
popular linear method. Linear regression assumes that variable parameters relate to the dependent variable
in a linear way. Variable parameters are
what we try to estimate
with regression models and data to find the relationship
between x and y. We often call parameters
coefficients. For example, a parameter
or coefficient of one means that for
every unit change in X, Y, the dependent
variable, changes by one. Without getting too technical, linear regression assumes that dependent variables are measured
as continuous variables. Explanatory variables can
be measured in any way. When the dependent variable
is non-continuous, the correct regression
method is often non-linear. However, there are instances
where linear methods can be used when the dependent variable
is not continuous. When there's only one explanatory
variable in the model, in other words, when there
is only one x variable, we call this simple regression. When there are multiple
explanatory variables, we call this
multiple regression. Most regressions are of the multiple kind,
as in practice, we usually want to
test or evaluate many variables against
the dependent variable y.
4. Why is Regression Analysis Useful?: Why is regression
analysis useful? Regression analysis
is useful when quantitative evidence is needed to answer a particular question. Quantitative analysis,
by definition, requires the
analysis of numbers. The opposite of this is
qualitative analysis, which analyzes non-numeric
data such as words, stories, meanings, or concepts. Regression analysis
is useful because it allows for the testing
of hypotheses. For example, do men really
earn more than women? Is unemployment in the
economy related to inflation? Or how much more ice cream
is bought on sunny days? These kinds of questions can be answered with
statistics, and you'll often hear the term "this is
statistically significant at the 5% level"
in such analysis. However, regression also
allows for predictions, because regression models estimate parameters
or coefficients. These parameters can then be used to compute new statistics. This can be done
within a data sample and even outside of that sample. For example, after
a regression of various explanatory
factors on wages, we can use the estimated
parameters to compute the expected wage of a very
particular type of person, whether they are in
the sample or not. This prediction is
a great strength of regression methods and
it allows businesses, researchers, and policymakers
to compute various effects.
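
The prediction idea can be sketched in a few lines of Python. The coefficient values and variable names below are hypothetical, chosen only to illustrate the mechanics; they do not come from any regression shown in this course.

```python
# Hypothetical estimated parameters from a wage regression (made-up numbers).
coefficients = {
    "constant": 8.0,
    "years_education": 1.5,
    "years_experience": 0.4,
}


def predict_wage(years_education, years_experience):
    """Expected hourly wage implied by the estimated parameters."""
    return (
        coefficients["constant"]
        + coefficients["years_education"] * years_education
        + coefficients["years_experience"] * years_experience
    )


# A type of person who could plausibly be in the sample.
typical_person = predict_wage(years_education=12, years_experience=5)
# A type of person who may not appear in the sample at all.
unusual_person = predict_wage(years_education=20, years_experience=30)

print(typical_person, unusual_person)
```

The same parameters serve both cases; the further a person is from the data used for estimation, the more cautiously the predicted value should be read.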
5. What Types of Regression Analysis Exist?: What types of regression
analysis exist? There are many,
too many to count. In fact, many advanced
regression methods will be customized for the relevant research
question and the data. However, there are
some core methods that you should be aware of. These methods are
primarily a function of the nature of the data and the nature of the
dependent variable. The most common method
is ordinary least squares. This method requires the
dependent variable to be continuous and is often applied
to cross-sectional data. Cross-sectional
data is data that doesn't have repeated
time elements within it. Ordinary least squares also
serves as the basis for many advanced methods such
as weighted least squares. Next are three
non-linear methods. These methods are
non-linear because the dependent variable is
not continuous anymore. Logit and probit models are useful for binary
dependent variables. Ordered logit and ordered
probit models are useful when there are multiple
ordered categories in the dependent variable. And multinomial logit models are useful when there are nominal, unordered categories in
the dependent variable. If you're wondering what
logit and probit models are, these are simply
two common ways to achieve a nonlinear relationship
between the variables. Whilst there are some
mathematical differences between logit and probit
models, in reality
these often make little
difference to the results. Also note that multinomial
probit models also exist, but they are not
frequently used, which is why I'm not
listing them here. Next are panel models, both linear and
non-linear models. There are many methods
in each category, but the common feature is
that they all work with data that is collected
repeatedly over time. This could be short
household panels or long high-frequency
trading time series. Next are count data models, which are similar to
logit and probit models, but use slightly different
transformations to account for the count properties of the data. Examples of counts are
things like the number of doctor visits or the
number of t-shirts sold. Finally, Cox proportional
hazard models are often used when the
dependent variable is time. A common example of
a time dependent variable is survival
time of cancer patients. And this method is often
used in the health sciences.
6. Explaining Regression: Explaining regression. Now that we have some
basic understanding of the concepts behind regression analysis and also what types of
regressions there are, let's explore how
it actually works. If you're an academic student, regression is often learned through a variety of equations, often matrix type equations
that have a lot of x's and y's and e's
and u's in them. They serve their purpose, but you don't actually
need to understand them to learn how
regression works. Using visual aids can
achieve the same effect. And this is something we'll
focus on in this course. Simple linear regression is often explained
through correlation. Let's follow that
approach and then slowly keep building
things up later. Correlation, sometimes called
association or dependence, is the relationship
between two things. In statistics, these things
are often variables, let's call them x and y for now. Note that both variables x and y are connected to an identifier. Without this identifier,
none of this will work. The identifier is often
represented by the symbol i. And we can imagine it
to be something like individual people or firms or countries or anything else that can connect the two
variables of interest. In this little table over here, there are three identifiers, and each identifier has one
value of y and one value of x. Let's go ahead and visualize a larger version of this
table on a graph. I'm going to plot 100 data
points on a scatter plot where the y-axis represents
the variable y and the x-axis represents
the variable x. This visual representation is slowly starting to
tell us something. In this case, we seem to
get a fairly good idea that there seems to be a positive
relationship between y and x. In other words, as x
increases, so does y. However, there's also
some noise in the data. And there seems to
be some clumping in the values of y and x around 0. The relationship between the two variables
can also change. For example, the relationship could become weaker
or even negative. Here we see an example of how data can change its
relationship with each other. The correlation between
y and x becomes weaker, going all the way to no correlation and then
becoming negative. We end up with a
relationship that is almost the opposite of
what we started with. Visually, it is quite easy to distinguish between extreme
types of relationships. However, it can be more
difficult to visually identify differences between only minor relationship changes. Take a look at this example. Here is some data that is
correlated in different ways. It is easy to tell a plus one correlation apart from a minus
one correlation. However, this task becomes more difficult for smaller
correlation changes. At first glance, it
would probably be quite hard to identify any difference between the first two graphs. Even though the
correlation is different, one has to look quite closely to identify that the
relationship between y and x has flattened off a little bit in
the second graph. This becomes especially tricky
if there are lots of data. If we had a million data points, then all we would
see, for example, is a giant blob of blue. And that is why we often want to summarize the
relationship between y and x via some kind of
data reduction process.
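
One such single-number summary is the correlation coefficient. The following is a small Python sketch of the standard Pearson correlation, applied to made-up x and y values; it is offered as an illustration of the idea, not as the course's own material.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation: co-movement of x and y scaled to lie between -1 and +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


# Hypothetical data for three different relationships.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect_positive = pearson_correlation(x, [2.0, 4.0, 6.0, 8.0, 10.0])
perfect_negative = pearson_correlation(x, [10.0, 8.0, 6.0, 4.0, 2.0])
noisy_positive = pearson_correlation(x, [2.0, 1.0, 4.0, 3.0, 5.0])

print(perfect_positive, perfect_negative, noisy_positive)
```

Two scatter plots that look almost identical by eye can still be told apart by comparing these numbers.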
7. Lines of Best Fit: Lines of best fit, what are they and
how do they work? A key thing to understand before jumping into
the concept of how to produce lines of best fit is that there are two
methods that we can use. These are parametric and
non-parametric methods. Parametric methods are methods
that apply some kind of parameter or many
parameters to data. Parametric methods are
methods that apply some kind of parameter or many
parameters to data. Often the parameters
will be in the form of an equation such as
y equals to one. The parameter in
this case is one. This method is the
method that is used in regression analysis and
in ordinary least squares. And it has the advantage of simplicity and working with
high dimensional data. The disadvantage is that it requires stronger
assumptions about the data. When these assumptions
are not met, your analysis might
be completely wrong, and often you may not
even know about it. Non-parametric methods let the
data speak for themselves. The advantage is that
you need to make fewer assumptions about the initial relationships
in the data. A big disadvantage is that this method is not
very transferable. In other words, you cannot easily tell other
people about it. In addition, it becomes
extremely hard to operate this kind of method in
multi-dimensional environments. We often use
non-parametric methods to explore basic relationships
between y and x, and parametric
methods to explore more complicated
relationships between y and x1 and x2 and x3, etc. Let's have a look to see
what I mean by all of this. Let's start with a
scatterplot of some new data. In this case, let's
plot data from the Stata auto
dataset and try to figure out how the price
of cars is related to the miles per gallon of petrol consumed by
individual cars. The initial scatterplot here tells us there is some kind of relationship between
the car's price and its miles per gallon. It looks negative, in other words,
downward-sloping. Now, let's try to estimate what kind of relationship
this is exactly. We'll begin with a non-parametric
regression method. There are many
non-parametric methods. Let's pick one called local
polynomial regression. Local polynomial regression is a form of moving regression. The user defines a bandwidth or lets the computer choose one, and a regression is then
estimated within that bandwidth. The band then moves
continuously across the x-axis step-by-step and
repeats this analysis. The individual steps are then
all stitched together to reveal what is essentially a moving average
plot of the data. Let's see how this
works in practice. The non-parametric method
shown here slowly moves across the data space and continuously updates the relationship
between y and x. We see that the
relationship between y and x starts negatively, but it ends up being
slightly more horizontal. In other words, the
relationship between y and x here does not appear
to be entirely linear. A big advantage of
this method is that it lets the data speak
for itself and doesn't rely on
specific functions or even theory to fit the data. One disadvantage of
this method is that the relationship still
needs some kind of input. In this case, it requires
the size of the bandwidth. If we change the bandwidth
to something smaller, the relationship will
look different. Here's an example of that. Another disadvantage of
this method is that it is difficult to transfer this
relationship to other users. How can we explain this
squiggly line to somebody else? We often choose a
parametric relationship. A parametric relationship
is one that can be defined by some
kind of equation. For example, a linear line fit through the data
will have a gradient. And that gradient will be the parameter defining the
relationship between y and x. Let's plot a linear
function through the data and see how this looks. Here we see a linear line
being fitted through the data. In this case, the line fit
is based on minimizing the overall distance between the fitted line and all
the available data points. This concept is known
as least-squares, and we'll explore this in more detail in the next session. It underpins the ordinary least squares
regression methodology. The fitted line in this case has a particular slope of minus 238. In other words, for every one unit increase
in miles per gallon, the average car price
appears to drop by $238. Great. However, parametric lines of best fit don't always
need to be linear. We can also add a quadratic
line of best fit. In this case, we recover two parameters that define the relationship
between y and x. Here's an example of that. In this case, the relationship between y and x is parametrized by one parameter pulling
y down as x increases, and another
parameter pulling y back up as x increases. In this case, the parameters are approximately minus 1200 for every increase in x and plus 20 for every increase
in x squared. Don't worry about the
x squared right now, we'll explore this later. But the important concept is that the functional form of
parametric lines of best fit can be made to be very flexible as long as enough
parameters are available. How does all of this
relate to regression? Well, this is regression.
Specifically, this is simple regression where Y is regressed against
one variable x. How about multiple
linear regression? Multiple linear regression is an extension of simple
linear regression, and it adds more variables to
the mathematical framework. One easy way to
visualize this is by adding further dimensions
to the scatter plot, where each extra dimension represents an
additional variable. Let's say, for example, that we wanted to explore the
impact of MPG on car price, but controlling for
a car's weight. Heavier cars are likely
to have poorer MPG, and this may affect the price. Visually, we can
represent this by a three-dimensional
scatterplot that plots price against MPG
against weight. It could look a
little bit like so. Moreover, by rotating
the scatterplot, we can look at the
relationship that each explanatory
variable has with y, and even examine how the explanatory variables are
correlated with each other. Finally, what multiple
regression analysis does is, instead of estimating a line of best fit
through the data, it fits a plane of best
fit through the data. This can be hard to
visualize on a screen, but here's a crude
attempt of mine. The left graphs show the actual data points
on a 3D scatter plot. While the right graphs show the estimated relationship
between these data points, this relationship is
represented by a 3D plane. If more variables are
added to the framework, the plane of best-fit becomes
a hyperplane of best-fit. This is why we sometimes
hear people talking about multi-dimensionality when referring to
regression analysis.
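
The moving-regression idea described in this section can be sketched in Python as follows. This is a simplified version that fits an unweighted straight line inside each window, whereas local polynomial regression as implemented in statistical software typically also applies kernel weights. The (mpg, price) pairs are hypothetical and are not the Stata auto data.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def moving_regression(xs, ys, grid, bandwidth):
    """Fit a separate line inside each window and keep its value at the window centre."""
    fitted = []
    for centre in grid:
        window = [(x, y) for x, y in zip(xs, ys) if abs(x - centre) <= bandwidth]
        if len({x for x, _ in window}) < 2:
            fitted.append(None)  # too few distinct points to fit a line here
            continue
        intercept, slope = fit_line([x for x, _ in window], [y for _, y in window])
        fitted.append(intercept + slope * centre)
    return fitted


# Hypothetical (mpg, price) pairs: steep at low mpg, flatter at high mpg.
mpg = [12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34]
price = [12000, 10500, 9000, 7600, 6400, 5600, 5100, 4800, 4600, 4500, 4400, 4350]

grid = [14, 20, 26, 32]
wide = moving_regression(mpg, price, grid, bandwidth=4)
narrow = moving_regression(mpg, price, grid, bandwidth=2)
print(wide)
print(narrow)
```

Changing `bandwidth` changes the fitted values, which is the input-dependence mentioned above. By contrast, a single call to `fit_line` on all the data returns just two numbers that are easy to pass on to someone else.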
8. Causality vs Correlation: Causality versus correlation. Hopefully, the previous
examples will have given you a good intuitive grasp of what regression
analysis tries to do. There are lots of statistics and maths in each type of analysis, but the underlying concept
will always remain the same. Regression analysis
tries to tell users how data is
related to each other in a way that is easier to understand than looking
at the raw data points. However, it is important
to be keenly aware of the concept of causality
versus correlation. Every regression method is a statistical method
that correlates data. That's it. A computer or a
mathematical equation cannot identify what is causal. Causality is always
interpreted by the end-user. And some models allow better claims of
causality than others. Evidence obtained from
regression analysis about a strong and statistically significant relationship between two variables may be attributed
to causality through a compelling theoretical
framework and common sense. This can take a lot of practice and almost
becomes an art form. Sometimes the data helps. For example, if
yesterday's events are used to explain
today's action, the time element in
the analysis can be used to make better
causal inference. However, in other settings such as cross-sectional
survey settings, it can become much harder
to attribute causality. Are people happy because
they're healthy? Or are people healthy
because they're happy? These are tough questions
to answer and require theoretical and
philosophical reasoning in addition to statistics. So you should always be careful when dealing with
regression analysis.
9. What is Ordinary Least Squares?: What is ordinary least squares? Ordinary least squares is
a regression method that is based on the
concept of least squares. Least squares is a
statistical method that fits a line or plane or hyperplane of best fit by
minimizing the sum of squared residuals
between the line of best fit and the
actual data points. We square the so-called
residuals because the sum of them is exactly 0 when
they're not squared. That is because negative
and positive residuals above and below the line of best fit cancel each other out. Squaring solves this problem. Many other ways of fitting
the line of best fit exist. One example is to fit a line by the method of least
absolute deviations, where instead of
squared residuals, the absolute value
of them is taken. In other words, negatives
are turned positive. However, least squares is by
far the most popular method across all the sciences.
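
The cancelling-out of raw residuals can be checked numerically. Below is a minimal Python sketch with made-up data: it fits a least-squares line using the standard closed-form solution for one explanatory variable, then compares the sum of the raw, squared and absolute residuals.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals_for(intercept, slope, xs, ys):
    """Observed y minus predicted y for every data point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 6.0]

intercept, slope = fit_line(xs, ys)
residuals = residuals_for(intercept, slope, xs, ys)

sum_raw = sum(residuals)                        # positives and negatives cancel
sum_squared = sum(r ** 2 for r in residuals)    # what least squares minimizes
sum_absolute = sum(abs(r) for r in residuals)   # what least absolute deviations looks at

# Moving the fitted line up by half a unit makes the squared sum larger.
shifted = residuals_for(intercept + 0.5, slope, xs, ys)
sum_squared_shifted = sum(r ** 2 for r in shifted)

print(sum_raw, sum_squared, sum_absolute, sum_squared_shifted)
```

The raw sum is zero up to rounding error, so it cannot rank candidate lines, while the squared sum can.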
10. Ordinary Least Squares Visual 1: Let's explore ordinary least
squares visually to understand it better. Imagine a small dataset
with a few data points, a little bit like this one. Ordinary least squares will fit a line through
these data points. This line may be linear, but it can also be nonlinear. Let's go with a linear example. The red line
represents the line of best fit estimated by
ordinary least squares mechanics. In this case, the line
of best fit can be represented by a single
slope parameter called beta. We often use the
Greek letter beta to denote the slope
of a regression line. This slope informs us of the estimated relationship
between y and x. In this case, y is the price of a car and x is the mileage
in miles per gallon. The slope is negative, which means that as the
miles per gallon increases, the price of cars decreases. However, note that our line does not hit any of the
actual data points. That is because
we are estimating an average relationship between all the available data points. The actual data points are often called observed
data points. In other words, y observed. The predicted value of y at any given value of x is then given by the
line of best fit. These are called predicted
data points or y predicted. The difference between
the observed value and the predicted value is
called the residual value. This is what ordinary least
squares tries to minimize. You can see here that there are three data points and therefore three
different residuals. The sum of all three squared residuals is the smallest value
that we can achieve in this case. If we change the line of
best fit, for example, by moving the line
of best fit down, the total sum of the
squared residuals will increase. This is a graphical explanation of what ordinary least
squares tries to do. It finds a regression slope
and the intercept that lead to the
minimum sum of squared residuals. Let's have another look
at this with more data. In this example,
we're going to use the full auto training
data to see what happens to the root mean squared error when we apply different regression
slopes to the data. In the left panel, we observe the regression
slope going through the data. We'll start with a positive
slope of plus 100. On the right panel, we see the size of the
individual residuals. The residuals are
squared and then square rooted to ensure only
positive values are shown. The lowest value a residual
can have therefore is 0. Higher residual values mean that the relevant data point is far from the
regression line. The average of all
these residuals is called the root mean squared
error of the residuals. And this is depicted
by the red line. It tells us how far, on average, the data points
are from the regression line. Now let's look and see what happens when we
change the slope. We can see that as we
slowly change the slope of the regression line from positive values to
negative values, the average error between the line and the data
points decreases. The residuals on average are trending down as we
decrease the slope. This keeps happening until,
after a certain slope value, the average of the residuals
starts to increase again. At a slope of
around minus 230, the average error from our
line of best fit is minimized. And therefore, that is
our line of best fit. Of course, this graph is a simplified version
of what happens. Regression models can have many more variables and
therefore many more parameters. And we would need
many more dimensions to display such
models graphically. Now let's take a look at
how ordinary least squares models are often
presented by computer.
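
The slope-scanning exercise can be imitated in Python. The sketch below uses hypothetical (mpg, price) pairs rather than the auto training data, so the best slope it finds will not be the course's value of around minus 230. It also makes two assumptions of its own: every candidate line is forced through the point of means, and the root mean squared error is computed in the usual way as the square root of the average squared residual.

```python
import math

# Hypothetical (mpg, price) pairs, not the Stata auto data.
mpg = [12, 15, 18, 21, 24, 27, 30, 33]
price = [11000, 9500, 9800, 7200, 6100, 6500, 4900, 4200]

mean_x = sum(mpg) / len(mpg)
mean_y = sum(price) / len(price)


def rmse_for_slope(slope):
    """Root mean squared error of a line with this slope through the point of means."""
    intercept = mean_y - slope * mean_x
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(mpg, price)]
    return math.sqrt(sum(squared) / len(squared))


# Start at a slope of +100 and step down in units of 10.
candidate_slopes = range(100, -701, -10)
rmse_by_slope = {slope: rmse_for_slope(slope) for slope in candidate_slopes}
best_slope = min(rmse_by_slope, key=rmse_by_slope.get)

print(best_slope, round(rmse_by_slope[best_slope], 1))
```

The error falls as the slope moves from positive towards the best value and rises again beyond it, which is the U-shape described above. Statistical software does not scan slopes like this; for ordinary least squares the minimum is computed directly.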
11. Ordinary Least Squares Visual 2: Here's an example of how Stata presents
regression output. Other computer programs may
present this differently, but the essence of the
information displayed will be similar
among all programs. Often part of the regression
output displayed will be diagnostic information
that provides high-level information about the overall
regression model. In Stata, this is usually the top
part of the output. The lower part of the
output table normally presents the estimated
coefficients for the relevant variables. There are many pieces of
information in this table. However, generally three
pieces matter most. The first is the actual
parameter estimates. In other words, the estimated
slopes or coefficients of lines or planes of best fit
through the relevant data. In Stata, this is called Coef., which is short for coefficient. Each explanatory variable has a relationship with the
dependent variable, in this case, price. Each explanatory variable is also conditional on each other. In other words, the effect
of miles per gallon, conditional on
controlling for weight, is as follows: for each increase by one unit of miles per gallon, price drops by $49. The effect of weight
is as follows. Conditional on miles per gallon, an increase of one unit of weight leads to a price
increase of $1.70. The final variable
is a constant. Constants are the value that
the dependent variable, in this case price, takes when everything in
the model is set to 0. In other words, at a weight of 0 and at 0 miles per gallon, a car should cost around $1946 according to this model. Constants sometimes make sense, and sometimes they don't. In this case, it doesn't make a lot of sense
because cars would never have a weight of 0 or
achieve 0 miles per gallon. Some people say that constants should be
removed from models, especially when they don't make sense. I think that is wrong. You just need to take care
when interpreting constants. Often, constants should not be interpreted but
left in the model. The next most important
piece of information comes from the column
called Std. Err., which is short for
standard error. The standard error is a statistic that reveals with what degree of accuracy the slope coefficient
is estimated. If the standard error is low
relative to the coefficient, then we can be more certain that the estimated coefficient is close to the true
population parameter. If the standard error is high, we can be less certain and have more noise around our estimate. The standard error is
important because it allows us to determine
to what extent the estimated coefficients from the regression model are
statistically significant. The four remaining columns
in the results output are all further computations
of the standard error, and are simply different ways to identify the significance. The t-statistic, the p-value, and the lower and upper
confidence intervals are essentially all the same thing and are based purely on recalculations
of the standard error. We'll look at what
they mean in a moment. Finally, the third piece
of information that matters most is something
called R-squared. This information is given in the diagnostic part of the output table and
can be found here. R-squared is a common
indicator of goodness of fit for ordinary least squares
regression models. It is bounded between 0 and 1, and higher values indicate that the model
better fits the data. However, many professional
users will caution against an over-interpretation
of R-squared statistics. Numbers are relative
to the discipline. If you are working with
behavioral data such as people and their
choices, then R-squared values of 0.2 or 0.3 are very common and usually indicate
good fitting models. If you're working with
time series data, such as macroeconomic
GDP measures, then R-squared of 0.8 or 0.9 are very common and
indicate good fitting models. Finally, let's talk a
little bit more about how the estimated coefficients are related to statistical
significance. Let's begin with t statistic. This statistic is an indicator of
statistical significance, and normally we are
looking for a value of 1.96 or above one. We're using a reasonably
sized sample. Reasonably sized samples means around 100 or more
observations in the among. The t-statistic is easily
computed by dividing the estimated coefficient value by the estimated
standard error value. Note that when the
coefficient is negative, Stata will produce
a negative t-statistic. The sign on the t-statistic
should, however, be ignored. Next to that is something
called the p-value. This is short for probability value. It
indicates the probability of obtaining results at least as extreme as the observed
results of a test, assuming that the null
hypothesis is correct. The null hypothesis
in regression tables is normally that
a specific result is not different from 0. In other words, small
p-values mean that there is stronger evidence in favor of
the alternative hypothesis, the alternative hypothesis
being that the coefficient is different from 0.
In layman's terms, a number of
0.05 or below indicates statistical significance
at the 95% level, numbers below 0.01 indicate significance at the 99%
level, and so forth. Next are the
confidence intervals. There is an upper and
lower confidence interval. Upper and lower
confidence intervals are computed by adding
or subtracting 1.96 times the standard error from the
estimated coefficient. In other words, the confidence
interval is usually two standard errors away from
the coefficient estimate. Confidence intervals are really useful because they
allow you to quickly eyeball statistical tests. Any number outside the
confidence interval range will be statistically
significantly different from the
coefficient estimate. In this example, MPG is not statistically
significantly different from 0 because 0 is within the
confidence interval range. However, mpg is
different from minus 500 because this number is outside the confidence
interval range. This can be a really
useful way to quickly perform
statistical testing. And all it involves is
multiplying the standard error by approximately
two. Now let's take a look at the sum of squares in a bit more detail.
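As a rough sketch of how these columns hang together, the short Python snippet below recomputes the t-statistic, a p-value and a confidence interval from nothing more than a coefficient and its standard error. The values 312 and 754 are illustrative, the function name is made up for this example, and the p-value uses a normal approximation, which is an assumption that is only reasonable in larger samples.

```python
import math

def significance_summary(coef, se, z=1.96):
    """Recompute the t-statistic, an approximate p-value and a 95% interval."""
    t_stat = coef / se
    # Two-sided p-value from the standard normal distribution
    # (an approximation; statistical packages use the t-distribution).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    lower = coef - z * se
    upper = coef + z * se
    return t_stat, p_value, lower, upper

# Illustrative numbers only: a coefficient of 312 with a standard error of 754.
t_stat, p_value, lower, upper = significance_summary(312.0, 754.0)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}, CI = [{lower:.0f}, {upper:.0f}]")
print("significantly different from 0:", not (lower <= 0 <= upper))
```

All three measures tell the same story here: the t-statistic is well below 1.96, the p-value is well above 0.05, and the interval contains 0.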
12. Sum of Squares: The previous regression
table also provided information on the explained sum of squares, the residual sum of squares, and the total sum of squares. These values indicate
how much variation is explained by
the fitted model, how much variation
is unexplained by the model, and how much total variation
there is in the data. By comparing the proportion
of explained sum of squares to the total
sum of squares, we can produce something
called the coefficient of determination, often
called R-squared. R-squared. The R-squared value is a widely used measure of fit for ordinarily squares models. The value indicates how well
the model fits the data. Values of one mean,
a perfect fit. Values of 0 mean a terrible fit. However, the basic
R-squared can only increase as more
explanatory variables are added to the model. In other words, models with hundreds of random
covariates can saturate the data and produce artificially high goodness
of fit statistics. This is why we often
also report the adjusted R-squared,
which imposes penalties for more variables being
added to models. If additional variables are not statistically significant, they will tend to reduce the
adjusted R-squared value. This statistic tries to strike a balance
between rewarding good model-building
and overloading models with
unnecessary variables. However, it should be
noted that R-squared can be easily abused and should
be treated with caution. High R-squared values do not
necessarily imply that one model is more
valid than another. Let's take a look
at this example. In this demonstration,
I'm going to change the noise level around
the line of best fit. The true relationship
between y and x is one. And this is what is estimated
by the line of best-fit. The original data has
very little noise and the regression line hits
almost every data point, resulting in an
R-squared of one. Now let's go ahead and change this noise level around
the true regression line. We can see now the
R-squared changes quickly as we increase the
noise around the data. The R-squared quickly
drops in value, suggesting that the model fits
the data worse and worse. However, the model
actually remains the same. What is changing is only
the noise around the data. Noisy data result in a
lower R-squared value, and a lay observer might claim this to
be a poor model. But as you can see, the relationship between y
and x hasn't changed at all, and the model continues to recover the correct
coefficient value. Both models in this case
have the same validity, even though they have
different R-squared values. And that is why I want you to always be careful
when interpreting R-squared. The R-squared example leads us to our next discussion point.
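The noise demonstration can be roughly reproduced with a few lines of Python. This is only a sketch on simulated data: the sample size, the noise levels and the function names are assumptions made for illustration. It splits the total variation into explained and residual parts and shows that R-squared falls as noise rises while the estimated slope stays near its true value of one.

```python
import random

def fit_line(x, y):
    """Simple OLS with one regressor: returns intercept and slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sums_of_squares(x, y):
    """Return slope, explained SS, residual SS, total SS and R-squared."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    fitted = [intercept + slope * xi for xi in x]
    ess = sum((fi - mean_y) ** 2 for fi in fitted)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return slope, ess, rss, tss, ess / tss

rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(500)]
results = {}
for noise_sd in (0.1, 5.0):
    y = [xi + rng.gauss(0, noise_sd) for xi in x]  # true slope is 1
    results[noise_sd] = sums_of_squares(x, y)
    slope, ess, rss, tss, r2 = results[noise_sd]
    print(f"noise sd {noise_sd}: slope = {slope:.2f}, R-squared = {r2:.2f}")
```

The explained and residual sums of squares always add up to the total, so R-squared is simply the explained share of that total.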
13. Best Linear Unbiased Estimator: Best linear unbiased estimator. Ordinary least squares is said to be the best linear
unbiased estimator if certain conditions are true. Having an understanding
of these conditions is important as some matter
more than others. These conditions are often called the Gauss-Markov
assumptions and refer to four particular assumptions that
need to be made about the data. If these assumptions are met, then the ordinary least squares estimator is said
to be unbiased. In other words, the
results produced by the estimator will on
average be correct. If the Gauss-Markov
assumptions are met, the OLS estimator
will also be the best estimator. Best is another word for
efficiency in statistics. This simply means that the ordinary least
squares estimator will produce the most
accurate results with the least amount of noise. Let's explore these two
concepts a little further before we discuss the
actual assumptions. Efficiency refers to the width of the sampling distribution. When an estimator is said
to be most efficient, its sampling distribution is narrower than that of
any other estimator. We can visualize
this in an easy way by assuming we have two
different estimators and an infinite amount of data. From this infinite
amount of data, let's go ahead and
select a small sample, then try to estimate a particular coefficient
for a variable. We're going to use an
inefficient estimator and an efficient estimator. We're going to set
the true value of the coefficient to one. The first time we estimate the coefficients using
both estimators, we return a value of
around minus six for the inefficient estimator and minus two for the
efficient estimator. Now let's go ahead and
repeat this process. The second time our
estimates are closer. The inefficient estimator
predicts a value of around minus one and the efficient
estimator of around 0. Both are still some
way off the true value, but the efficient estimator
seems to be getting closer. Now let's go ahead and
repeat this process quickly, hundreds of times and
see what happens. Both estimators on average get the
correct value of one. However, the inefficient
estimator is on average further away with
its predictions than the efficient estimator. This is the concept
of efficiency. And while we normally don't have an infinite amount of data, this concept is often visible in the standard errors
of real-life results. Inefficient estimators tend to have high standard errors, resulting in more uncertainty around the true estimated value. Next, let's explore the
concept of unbiasedness. When an estimator is
said to be unbiased, this means that the mean
of the sampling distribution of the coefficient estimates will approximate the true
population coefficient. We can visualize this in
an easy way by, again, assuming that we have two different estimators
and an infinite amount of data. We'll select
a small sample of this data and try to estimate
a particular coefficient. The true value of this
coefficient is set to one, and this is denoted by
the dotted red line. We use a biased and an unbiased estimator to
estimate the same coefficient. The first pass
produces an estimate of around 0 for the
biased estimator and 1.5 for the unbiased estimator. Now, let's do it again. On the second pass, the biased estimator performs better, with a result of three compared to the
unbiased estimator with a result of five. But let's continue and
repeat this process many times. As we repeat the process, we see that on average, the unbiased estimator starts
to predict a value of one, whilst the biased estimator predicts a value of minus one. That can obviously
be a big problem. For example, the objective might be to perform a
policy evaluation, and a biased estimator estimates the policy to
have a negative effect, whilst in reality it might actually have
a positive effect. Bias is a serious
problem in econometrics, and ordinary least squares requires some pretty strict assumptions for estimates to be unbiased. It is important then to have some understanding of the assumptions behind
ordinary least squares.
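The repeated-sampling idea behind efficiency can be sketched in Python. This is a simulation under assumed settings (30 observations, a noise level of 2, 2,000 repetitions), and the "endpoint" estimator is a hypothetical, deliberately wasteful alternative invented for this illustration. Both estimators are unbiased for the true coefficient of one, but their spread differs.

```python
import random
import statistics

def ols_slope(x, y):
    """Ordinary least squares slope using every observation."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def endpoint_slope(x, y):
    """A wasteful estimator: uses only the first and the last observation."""
    return (y[-1] - y[0]) / (x[-1] - x[0])

rng = random.Random(1)
x = [i / 29 * 10 for i in range(30)]  # fixed design, 30 points from 0 to 10
efficient, inefficient = [], []
for _ in range(2000):
    y = [xi + rng.gauss(0, 2) for xi in x]  # true coefficient is 1
    efficient.append(ols_slope(x, y))
    inefficient.append(endpoint_slope(x, y))

for name, draws in (("OLS", efficient), ("endpoints", inefficient)):
    print(f"{name}: mean = {statistics.mean(draws):.2f}, "
          f"spread (sd) = {statistics.stdev(draws):.2f}")
```

Both means should land close to one, while the endpoint estimator shows a visibly wider spread, which is what a larger standard error reflects in a single real-life sample.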
14. The Gauss-Markov Assumptions: Gauss-Markov assumptions. The Gauss-Markov assumptions are the underlying
assumptions that make ordinary least squares the most efficient and
unbiased estimator. Generally, four major conditions are needed to achieve
this result. These are the
homoscedasticity assumption, the no perfect
collinearity assumption, the linear in
parameters assumption, and the zero conditional mean, sometimes called
exogeneity, assumption. Roughly speaking, the first
two relate to efficiency, whilst the last
two relate to bias. Let's explain each in turn and try to determine
which matters most.
15. Homoskedasticity: The homoscedasticity assumption. This assumption states that
the variance of the residuals remains stable across the spectrum of
an independent variable. In other words, the errors
produced by a variable remain roughly constant
whenever we look at a small part of that variable. Failure of this assumption
leads to biased standard errors, and this means we cannot
rely on hypothesis testing. However, many modern
statistical packages can easily test and correct
for this assumption. It is very common, for example, to use something called
robust standard errors, which increase the inefficiency of the estimates slightly but make them immune to the
failure of this assumption. Let's go ahead and
look at an example. In this video, there
are two graphs. The left graph shows the
relationship between the explanatory variable x
and the dependent variable y. The overall relationship
never changes, but the variance across x will. In the right graph, we see the
residuals, or errors, across x. It shows the distance of the actual data points
to the line of best-fit. The left graph also shows the slope estimate and
the standard error from a normal ordinary least
squares regression and a robust ordinary
least squares regression. Now let's go ahead and run
this example and examine what happens when we introduce a
changing variance across x. We see that as we increase
the variance across x, the actual regression
coefficient never changes. However, the standard errors increase as we increase
the variance across X. Moreover, the robust
standard errors increase by a little bit more. All this means is that the failure of the
homoscedasticity assumption, it leads to less
precise estimates. The real-world with
modern datasets, a failure of this
assumption often has little overall effect
on the actual results, and most practitioners do not focus on this
assumption a lot.
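For anyone curious what robust standard errors do under the hood, here is a minimal Python sketch for a regression with a single explanatory variable. The data are simulated with an error spread that grows with x, which is an assumption chosen to break homoscedasticity, and the n/(n-2) scaling is one common small-sample correction rather than the only possible choice.

```python
import math
import random

def slope_with_standard_errors(x, y):
    """Return the OLS slope with its conventional and robust standard error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dev = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in dev)
    slope = sum(d * (yi - mean_y) for d, yi in zip(dev, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Conventional formula: assumes one common error variance for all x.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    se_conventional = math.sqrt(sigma2 / sxx)
    # Robust formula: lets each observation carry its own squared residual.
    meat = sum((d * e) ** 2 for d, e in zip(dev, resid))
    se_robust = math.sqrt(n / (n - 2) * meat / sxx ** 2)
    return slope, se_conventional, se_robust

rng = random.Random(2)
x = [rng.uniform(0, 10) for _ in range(2000)]
# Error spread grows with x, so the constant-variance assumption fails.
y = [2.0 * xi + rng.gauss(0, 0.1 * xi ** 2) for xi in x]
slope, se_conv, se_rob = slope_with_standard_errors(x, y)
print(f"slope = {slope:.2f}, conventional SE = {se_conv:.3f}, robust SE = {se_rob:.3f}")
```

The slope estimate is the same under both approaches. Only the standard error changes, and in this setup the robust version comes out larger than the conventional one.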
16. No Perfect Collinearity: No perfect collinearity. This assumption states that an explanatory variable cannot be an exact linear combination of another explanatory variable. If this is the case, ordinary least squares simply
cannot be estimated. This is rarely a
problem in real life, as you would never
enter the same variable twice
into a regression. However, when there is partial correlation
between two variables, in other words, when they measure the same thing to some extent, we term this
multicollinearity. And this can have some
effect on our estimates. Specifically, it will increase the noise and therefore the standard errors
of our estimates. This phenomenon is
generally easy to test for and also
easy to deal with, by either excluding variables
or transforming them. Let's look at an example. In this example, I
generated a dataset that has five different
explanatory variables. These range from x1 to x5. Each X variable has a
coefficient of one. The graph on the right presents the estimates from an ordinary
least squares regression and the associated 95%
confidence interval around these estimates. We can see that ordinary least
squares estimates a value of approximately one for
each of the five variables. On the left graph, we see the correlation
between x1 and x2. Currently, there is no correlation between
both variables, which is why the data
points are scattered randomly. Let's go ahead and see what happens when we
start to introduce a correlation between x1 and x2 and slowly force x1 and x2 to measure the same thing. At first, not much happens, but then as the correlation between the two
variables increases, the standard errors and therefore confidence
intervals of both x1 and x2 start to increase. This happens until they
explode towards the end. This is the effect
of collinearity. High collinearity
between variables leads to very noisy estimates. But as you see, the noise explosion only
happens towards the very end. And in most real scenarios, the effects of collinearity
are hardly noticeable.
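The late "explosion" can be seen in a tiny Python sketch. It assumes the simplest case of just two explanatory variables, where the variance inflation factor of each is one divided by one minus their squared correlation. The function name and the correlation values are chosen purely for illustration.

```python
def variance_inflation(correlation):
    """Variance inflation factor for one of two regressors with this correlation."""
    return 1.0 / (1.0 - correlation ** 2)

for r in (0.0, 0.5, 0.9, 0.99, 0.999):
    vif = variance_inflation(r)
    print(f"correlation {r}: variance x{vif:.1f}, standard error x{vif ** 0.5:.1f}")
```

Even at a correlation of 0.9 the standard error only a little more than doubles. It takes correlations very close to one before the inflation becomes dramatic.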
17. Linear in Parameters: The next assumption is that the model is linear
in parameters. This assumption means that
the relationship between y and x in the ordinary least
squares model is linear in the coefficients. In other words, the
coefficient estimates take single values and can only
be added or subtracted; they cannot be exponentiated,
divided or multiplied. In general, this
assumption makes ordinary least squares regression
models easier to interpret. Note this only applies to
the actual coefficients. Variables can be transformed in any way, including
nonlinear ways. We often call this functional
form and we can vary the functional form as we please in ordinary least
squares regression. For example, it is common to add higher-order polynomials
of variables to a regression equation. A commonly used example
is age and age squared, where both variables
are entered separately. This has the effect
of introducing a curve into the
line of best-fit. Variables can also be
interacted with each other, and we call these
interaction effects. This means that lines
of best fit can take on very complicated
functional forms. Let's go ahead and
look at an example. In this example, there
are two graphs. The left-hand side
shows the data plot of the auto data where the price of cars is plotted against MPG. The right-hand graph shows
the residuals or how far the individual data points are from the line of best fit. The average distance is represented by the
red horizontal line. The initial relationship plotted through the data is linear. But it should be
fairly obvious that this relationship is
probably not a good fit. So let's introduce a quadratic
into this relationship and slowly increase
the coefficient on the quadratic term from 0. Here's what happens. The line of best-fit
starts to curve up, and this curve results
in a better fit. And we can see the
residuals coming down, especially for higher
values of MPG. Model fit improves. At some point, we overfit the model by
continuously increasing the quadratic coefficient and then model fit
becomes worse again. This example highlights the
power of functional form. The model is still linear
in parameters because the two estimated coefficients are only added or subtracted. But the square
manipulation of x leads to a complicated nonlinear
functional form that improves the model fit.
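Here is a small Python sketch of the same idea on simulated data (not the auto data, and with coefficients invented for illustration). The squared term is just another column in the regression, so the model stays linear in parameters, yet the fitted line is allowed to curve. The tiny solver is a bare-bones helper written for this example, not a replacement for a statistics package.

```python
import random

def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef

def ols(columns, y):
    """OLS via the normal equations; `columns` is a list of regressor columns."""
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    coef = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(coef, columns)) for i in range(len(y))]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return coef, rss

rng = random.Random(3)
x = [rng.uniform(0, 10) for _ in range(300)]
y = [10 - 2 * xi + 0.3 * xi ** 2 + rng.gauss(0, 1) for xi in x]  # curved truth
ones = [1.0] * len(x)
x_squared = [xi ** 2 for xi in x]

coef_linear, rss_linear = ols([ones, x], y)
coef_quadratic, rss_quadratic = ols([ones, x, x_squared], y)
print("linear fit RSS:   ", round(rss_linear, 1))
print("quadratic fit RSS:", round(rss_quadratic, 1))
print("quadratic coefficients:", [round(b, 2) for b in coef_quadratic])
```

The residual sum of squares drops once the squared column is added, even though the coefficients are still only multiplied by columns and added together.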
18. Zero Conditional Mean: Zero conditional mean, often called the
exogeneity assumption. This assumption is one of the most important assumptions
in ordinary least squares. The assumption states that
there is no correlation between an explanatory
variable X and the error term. Failure of this assumption leads to bias in the
coefficient estimate. This assumption can
often fail in real life. And because it involves
the error term, which by definition
is not observable, it can never be tested. A good rule of thumb is that whenever a variable is a choice, especially an individual choice, then it's likely to be driven by factors that are unobserved. And hence a relationship with
the error term might exist. Let's have a look at an example. In this example, I've set up a simulated dataset that again contains five
explanatory variables. Each variable has
a coefficient of one in relation to y,
the dependent variable. On the right-hand graph, we can see the individual
ordinary least squares estimates and associated
confidence interval for each of the five variables. The correct results are shown
by the vertical red line. On the left graph, we see the correlation that the variable x1 has
with the error term. Note, in reality, we
can never observe this as the error term will
always be hidden from us. Only in this simulated example, can we see the error term. The original correlation
between X1 and the error term is
set to around 0. Now let's go ahead and
increase the correlation between X1 and the error
term and see what happens. We observe that the ordinary least
squares estimate for x1 slowly deviates to the right
away from its true value. The more we increase
the correlation between X1 and the error term, the higher the bias in
our result becomes. This can be a real
problem in applied work. When we have such a problem, we often call it endogeneity.
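A stripped-down version of this simulation can be written in Python. It is a sketch with one explanatory variable instead of five, and the `link` parameter, sample size and seed are assumptions made for illustration. The error term is visible only because the data are simulated.

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope for a single explanatory variable."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def simulate_slope(link, n=5000, seed=4):
    """Estimate a true coefficient of 1 when x shares `link` of the error term."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        error = rng.gauss(0, 1)              # hidden from us in real data
        x = rng.gauss(0, 1) + link * error   # link = 0 means x is exogenous
        xs.append(x)
        ys.append(1.0 * x + error)
    return ols_slope(xs, ys)

estimates = {link: simulate_slope(link) for link in (0.0, 0.5, 1.0)}
for link, slope in estimates.items():
    print(f"link to error term {link}: estimated coefficient = {slope:.2f}")
```

With no link, the estimate sits close to the true value of one. As the link grows, the estimate drifts upward, and no amount of extra data from the same process would pull it back.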
19. How to Test and Correct Endogeneity: How to test and correct
for endogeneity. It is not possible to test for something that cannot be seen. That is why good
ordinary least squares models are strongly underpinned
by theoretical frameworks, prior literature, and
rational argumentation. This assumption is also
why many scientists argue against data mining with ordinary least squares models. Data mining approaches
increase the likelihood of the exogeneity condition failing and results becoming biased. In the real world, the way to deal with endogeneity
is often by more data, better and more thoughtful
model-building, different functional forms, and also sometimes simply accepting that the models
may have some bias.
20. The Gauss-Markov Assumptions Recap: Let's recap the
Gauss-Markov assumptions. The linear in parameters
assumption is a condition that requires
all betas to be additive. It means in layman's terms that the dependent variable
should be continuous. But it does not mean
that the relationship between Y and X must be linear. More complicated
functional forms can be worked into ordinary least
squares regression models. Violation of the zero
conditional mean assumption, often called the
exogeneity assumption, can lead to biased estimates. This is a very
important assumption. It is not possible
to test for it statistically. Identifying or defending against it must be done on
theoretical grounds. There is no easy solution if
this assumption is violated. Options are to include missing variables in
the regression model, to attempt alternative
identification techniques, or to resort to simulation
type methods that try to identify the size and direction
of any potential bias. The no perfect collinearity
assumption must be met or ordinary least squares
regression won't work. However, weaker
collinearity between variables will result in
increased standard errors. Fortunately, standard
errors only explode at extreme correlations. And this can be tested
for and corrected by either dropping variables
or transforming them. Violation of the
homoscedasticity assumption leads to incorrect
standard errors. It is easy to test for using an appropriate statistical
test and easy to correct for with robust standard
errors that are included in almost all statistical
software packages.
21. Applied Examples: Let's explore some of
these concepts we've been discussing in a more
applied environment. We are now in Stata, which is a statistical
software package commonly used to analyze
quantitative datasets. It's similar to other
packages such as SPSS or SAS. I won't explain how
to operate Stata or the code that I'm executing
to obtain the results. You can learn more about
Stata in specific Stata courses. I've already opened up a training
dataset called auto. Let's go ahead and
examine it a little bit closer before we start
running regressions. A common mistake is to start
analyzing data too quickly before fully
understanding what's actually inside the data. Modern data sets can
be very complex. And more often, the time
spent on data preparation and manipulation will
outweigh the time spent on actual
regression analysis. Let's describe the data
to see what we have. The output returned by
describe will produce some high-level information
about the data, such as where it is located, how many observations and how many variables are included. In this case, our data contains 74 observations and 12
variables. It's not very big. It also has a title
that tells us that this data is related
to cars from 1978. Below that is information
about the variables. One of them is a
string variable that contains the names
of the car types, and the rest are all
in numeric variables. Let's pretend that we
are really interested in explaining the
determinants of car price. We can already start building
a picture in our head. What variables
might be important in explaining the
price of a car? Weight and mileage seem
like important variables. A car's turning
circle is probably less important to most
people who buy cars. Next, let's explore some
summary statistics of the data so that we get some idea of how the
variables are measured and distributed. Price appears to be
measured in dollars, and the least expensive car
costs around $3,000, whilst the most
expensive car costs around $16,000. Such prices seem
reasonable for 1978. We also see that
the variable rep78 has some missing
observations. It only has 69 instead of 74. Most variables also appear
to be continuously measured. However, it looks like the variable foreign is measured
as a binary variable. Let's go ahead and
confirm this quickly. By tabulating foreign, we see that indeed foreign
is measured as a binary variable. Around
29% of cars are foreign. So let's go ahead and estimate some ordinary least
squares regression models. Rather than
immediately going into a full-blown model with many variables and
interaction terms. Let's build it up
slowly and interpret the output and diagnostics
along each step. The variable foreign
lends itself to a nice simple question: are foreign cars more expensive
than domestic cars? We could answer this question
by quickly computing the mean for both subsets of the data and simply
comparing the means. However, we can also achieve the same thing in a
regression framework. Let me show you. This code regresses the
explanatory variable foreign against the
dependent variable price. The regression results of this table are pretty
easy to interpret. But before we do that, let's quickly look
at some diagnostics. The regression includes
74 observations. So that's good. There are no missing
observations. The F-statistic is
not significant. Here we are looking
for values below 0.05. Values above 0.05 imply
that our total model, in other words
all the variables in our ordinary least
squares regression, does not explain how price varies. Likewise, the R-squared
is extremely low. A value of 0.0024 means
that we are explaining almost nothing in terms of price variation with
the variable foreign. Now let's go and
look at the results. We have one variable
called foreign. However, this is
a binary variable, not a continuous variable. Such variables have the
following interpretation. If the value of the variable
is flipped from 0 to 1, in other words, if a car changes from being a domestic car
to a foreign car, by how much will the
car price increase? The answer here
appears to be $312. However, we also observe that the standard error around
this estimate is quite large. The standard error is $754. That means the associated
t-statistic is below 1.96 and the p-value is above 0.05. This means this variable is not statistically significant
at the 95% level. We get an idea of the uncertainty by looking
at the confidence interval. This ranges between
minus $1,200 and plus $1,800. The true value is
somewhere in there, but because the confidence
interval crosses 0, we cannot claim
statistical significance compared to the value 0. Finally, remember that
the effect of a variable is conditional on
other controls. In this case, there are no
other variables in the model, but there is a constant. And the constant is
the value of price if everything else is set to 0. In other words, if a car is domestic and its value
of foreign is set to 0, it will cost around $6,000. A foreign car is $312 more expensive and would
cost around $6,300. We can also visualize this. Here we see the estimated effect of foreign cars on price. Domestic cars are
cheaper on average, and foreign cars more
expensive by $312. But the confidence interval of both values is so large that they are not
statistically different. Great. Let's go ahead and increase the number of variables
in our model. We could throw all our variables in and simply see what sticks. This is what a data mining
approach would generally do. Stata has various data
mining abilities, including stepwise
regression that will automatically eliminate
variables that are not statistically
significant. However, there are some conceptual problems
with this approach. One of the most important
problems is that it prevents users
from thinking about the problem at hand and doesn't allow them to understand how their data analysis is related to underlying theory or their
research hypotheses. For this demonstration,
let's go ahead and slowly add one variable after another variable to
our regression model. We will not remove foreign even
though it is insignificant, because the addition of other variables may
change its effect. Let's go ahead and add miles
per gallon to our model. We see now interestingly
that some immediate, significant changes
have occurred. Our R-squared has jumped
drastically to 0.28. The adjusted R-squared is a
little bit lower at 0.26, but this is still much, much higher than before. Our new variable MPG is statistically very
significant, with a small standard error and a high t-statistic. Each increase of
one unit in mpg, in other words, cars
getting more fuel efficient, will decrease
the car price by $294. However, we also see that
the effect of foreign cars has increased dramatically
to plus $1,700. The standard error has
come down a bit, from 754 previously to 700 now. The variable is now statistically significantly
different from 0. What a big difference one
variable makes to a model. Importantly, we can
explain this change. It turns out that
foreign cars have significantly higher miles per gallon numbers than
domestic cars. And once this factor
is controlled for, the actual price of foreign cars is higher
than for domestic cars. This is because the effect
of mpg is negative on price. Because foreign cars
have higher MPG, their price was lower. Now that this effect is
being controlled for and therefore taken out of price, the actual effect of a car being foreign is that it
causes a price rise. This is a perfect example of the exogeneity assumption I was talking about in
the previous session. We omitted an important variable from the regression model, and the explanatory
variable we did include was correlated with that important variable
in the error term. Therefore, the previous
result was biased. However, because
we have now moved the offending variable MPG from the error term into
the regression model, we are controlling for it and have hopefully produced a less biased estimate. This really shows the importance of careful model building. Let's go ahead and introduce a third variable to our model: weight. Weight is likely to be
an important variable because heavy cars need
more raw materials, but also because heavier cars are likely to affect
the MPG number. And we know that this in turn affects the
foreign estimate. So let's go ahead and add
it to our regression model. Look at that. Now, R-squared jumps up
again by a large margin. And also our estimated
effects have changed again. Let's explain it one
more time from the top. The new variable weight is
statistically significantly different from 0 due to
small standard error, high t-statistic
and small p-value, the effect is positive. In other words, each
additional pound of weight on the car increases
the price by $3.46. The effect of mpg is now
positive instead of negative. The inclusion of weight reversed the sign
of this estimate. Higher MPG cars now
lead to higher prices, although this effect is not
statistically significant. This makes sense. After all, higher
miles per gallon, cars are more fuel
efficient and save money. This may require
better technology, and therefore, such
cars may cost more. However, the previous
effect was masked by the fact that the heavier
cars have worse mileage. Now that this is controlled
for, the effect of MPG has become less biased. Moreover, because
there's a knock-on effect of MPG on foreign status, we see now the effect
of foreign cars jump to $3,600 with a lower
standard error of 680. This is another important
example of regression bias. Important explanatory
variables were left in the error term. Let's assume for a moment
that we are now finished with our model-building
and that we are happy with the
specification that we have. The next step is usually to perform some
diagnostic statistics, especially in relation to the Gauss-Markov assumptions discussed in the
previous session. Unfortunately, the
exogeneity assumption cannot be tested and
can only be inferred by adding other variables
to the model as just shown or by
resorting to theory. We can, however, test the
homoscedasticity assumption. Let's go ahead and do that. Here Stata performed a
test for homoscedasticity. The results show that the null hypothesis of
a constant variance is rejected in favor of the alternative hypothesis
of heteroscedasticity. In other words,
varying variance. We can also explore this visually by examining
the residuals. Here we've plotted the residuals versus the fitted values. This residual versus fitted plot shows how the residuals
are distributed around the plane of best fit. Values close
to 0 mean a good fit. We can clearly see on this
plot that when we move from low fitted values to
higher fitted values of price, the variance of the residuals
around 0 increases. This is clear evidence of changing variance and
needs to be dealt with. We can either use robust
standard errors or specify a different functional form that tries to remove this
changing variance. Improving model fit is often
a better first option. And in this case,
the problem might be caused by the fact that
like many price variables, car price has a long tail. Often we transform such
variables with logs. So let's go ahead and do that. Now, let's run a
new regression with the dependent variable as
log price instead of price. Let's see what happens. At first glance, it looks
like everything has changed. The coefficients are
completely different. However, because we have now transformed the
dependent variable, all explanatory
variables relate to the log price and not the price. This means their interpretation
is slightly different. Now, a one-unit increase
in weight increases the log price of
a car by 0.0004. This can be a rather
inconvenient way to interpret model estimates, so we often re-transform the coefficients to make
them easier to understand. When a regression model
has no log transformation, either for the
dependent variable or the explanatory variable, we call this a
level-level model. The interpretation
is straightforward: a one-unit change in x causes a beta unit change in y. When an explanatory
variable is logged, the interpretation of
the coefficient changes: a one percent increase in x causes a beta divided by
100 unit change in y. When the model has a
log dependent variable, the interpretation changes
to a one-unit change in x causes a beta times
100 percent change in y. When the model is
a log-log model, the interpretation is
that a 1% change in x causes a beta
percent change in y. So in this case, a one-unit change in
weight causes a 0.0004 times 100, which equals a
0.04%, increase in price. Likewise, foreign cars now cost around 53% more in
terms of price. Now let's go back and test the homoscedasticity
assumption again. The test
statistic reveals that we can no longer reject
the null hypothesis of homoscedasticity. We can also visualize this again using the residual
versus fitted plot. Here we can see that as we move along fitted price values, the spread of the
residuals around the horizontal 0 line
is much more even. This is visual evidence
that our model now has homoscedastic errors and that we can accept that particular
assumption. Next, let's check
for collinearity. The variance inflation
factor test highlights to what extent each variable inflates the variance
of the model. High values, above say 50 or so, on particular variables
indicate that these variables are collinear with other variables. Here, there is no evidence of high collinearity in our model because all variables have very low variance
inflation factor values. Finally, we can also introduce more complicated
functional forms. The model must be linear in its parameters, but variables can be
transformed to offer more complex forms than
just linear relationships. For example, we can
include a weight squared variable
in the regression to allow a quadratic
relationship to exist between
log price and weight. This new regression
includes weight and the weight
squared variable. It's important that the interacted variables
are analyzed together. Whilst the weight variable is not statistically
significant, the squared weight variable
is statistically significant. A joint test
should be done on both to see whether the pair
is significant or not. Let's assume for a moment
they are jointly significant. The interpretation of the output becomes a little bit
more complicated. But interaction effects can also be visualized, and Stata can do that for us. Here we can see that
the relationship from our predicted ordinary least squares model between weight and log price is not
actually linear, but appears to be
quadratic in nature. In other words, there
is a curve going through the relationship
between price and weight. As weight increases, the log price increases
more and more and more. Great. Now let's assume we are
done model-building. Regression models are
often not presented as they are shown by
statistical programs. There's simply too
much information in the tables presented by
statistical programs, most of which is redundant or not useful to lay readers. It is also common to include multiple regression models in
a table so that readers can follow the progress of
the coefficients as additional variables are
included or removed from models. Here's an example of how regression tables
often look in reports. This is a classical regression output
table that contains the coefficients and standard errors to
three decimal places. Asterisks are included to easily identify statistically
significant effects. The diagnostics only include the observation count
and R-squared statistic. This table easily allows
readers to read across and examine how the effect of the
variable Foreign on price, for example, changes as we change our model
specification. This kind of approach
is important as it is a transparent
approach that shows the ingredients of how this particular
statistical meal was made. Readers can judge for
themselves if they agree with your particular
conclusion or not. This concludes this
practical session on ordinary least
squares regression.
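As a recap of the coefficient interpretation rules from this session, here is a minimal Python sketch. It is only an illustration: the function names `interpret` and `exact_percent_change` are made up for this example and are not part of any statistical package, and the only number taken from the session is the weight coefficient of 0.0004. The times-100 rule for a log dependent variable is an approximation that works well for small coefficients. The exact figure comes from exponentiating the coefficient, which the second function shows.

```python
import math


def interpret(beta, model):
    """Return (effect, description) for a regression coefficient.

    `model` names the form as "<y>-<x>", for example "log-level"
    means a logged dependent variable and an unlogged explanatory
    variable. These labels are assumptions made for this sketch.
    """
    if model == "level-level":
        return beta, "unit change in y per one-unit change in x"
    if model == "level-log":
        return beta / 100, "unit change in y per 1% change in x"
    if model == "log-level":
        return beta * 100, "percent change in y per one-unit change in x"
    if model == "log-log":
        return beta, "percent change in y per 1% change in x"
    raise ValueError(f"unknown model form: {model}")


def exact_percent_change(beta):
    """Exact percent change in y per one-unit change in x (log-level model)."""
    return (math.exp(beta) - 1) * 100


# Weight coefficient from the log price regression in this session.
effect, description = interpret(0.0004, "log-level")
print(round(effect, 4), description)

# Compare the approximation with the exact transformation.
print(round(exact_percent_change(0.0004), 6))
```

For a small coefficient like 0.0004 the approximate and exact figures are almost identical. For larger coefficients the two drift apart, so it can be worth checking the exact figure before reporting a percentage.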
22. Final Thoughts and Tips: Final thoughts and some tips. Hopefully, you've enjoyed
this introduction to linear regression analysis. I have some tips you may want to consider when applying
regression analysis to data. Practice. As with many things in life, it is practice and
frequent application that leads to an ever
greater understanding of the issue at hand. The same is true for
regression analysis. All the theory in
the world will not overcome lack of
engagement or application. I always recommend that
people should just get stuck in and start
exploring data. Think carefully about
your original objective. Are you trying to simply understand correlations
in your data? Or are you trying to
determine cause and effect? The first can be done
through simply playing around with the data and
the regression model. The second will need much
more deliberate thought about theoretical underpinnings and rational argumentation. Why might X cause Y? And what could the
transmission mechanism be? What else might influence
such transmission? Estimate multiple models
with small variations. Results are more convincing when different models
consistently show the same kind of outcome. Does the inclusion of
a particular variable change everything, or do your
coefficients remain robust? Showing a pathway towards your final preferred
specification is a very important part of
modern regression analysis. Data quality and sample size matter as much as
model-building. Big innovations in
data quality and size have happened
since the 1980s. Not every model needs to
be a complicated thing. Quality data can add significant credibility
to any results, and you should not shy away from claiming that this data is the best data available to answer this particular
research question. High-quality datasets often require complicated
data manipulation. A lot of regression mistakes do not emerge from bad
model-building, but from poor data coding. Do not underestimate the amount of time that should be spent on data cleaning and preparing the data for
regression analysis. Ordinary least squares is still the most commonly used
regression method in the world. It would be wrong to dismiss
it as a simplistic method. Playing around with
functional form through interaction
effects can lead to complicated ordinary least
squares models that closely resemble reality. Do not be afraid to explore more complicated models that use quadratic terms and
other interaction terms. Understand the role of diagnostics in
regression analysis. Do not get hung up about
textbook diagnostics, but do query whether the regression assumptions
about the data hold. Are there assumptions
that might be too strong for the data at hand? Finally, have a healthy dose of skepticism when someone is claiming a causal relationship. Regression coefficients often
contain some kind of bias. At the same time, don't be a naysayer and reject everything. Like many things in life, regression analysis
is an extra tool that should be used in
conjunction with other evidence, such as prior results, theoretical frameworks, and
also qualitative evidence. There is a fine line between art and statistics in
regression analysis.