Transcripts
1. Introduction: Welcome. Data analysis can be hard. So many different methods and so many different ways to analyze and interpret data can make learning very difficult. In this class, I want to give you an easy and fast outline of an important method in data analysis: non-linear regression. The key to this class is that there are no equations, no math, no tricky bits of theoretical knowledge. I want to give you an intuitive, graphical explanation of what non-linear regression is and show you a range of practical examples. No matter what your current professional knowledge is, you can feel confident about knowing the ins and outs of non-linear regression after this particular class.

What is non-linear regression? Non-linear regression is a popular regression method that is often used when trying to model choices or other types of discrete behavior. Many non-linear regression methods are available; probit and logit regression are the most common. Both methods are almost identical, and I'm going to focus on these two because they are the most used methods to analyze discrete data. They also form the base for more complicated non-linear methods. Probit and logit regression are techniques that examine the relationship between a binary variable and one or many continuous or categorical variables. These techniques are used in many different sciences and are often used for the quantitative analysis of choices and discrete outcomes. Anyone wishing to delve deeper into the world of regression statistics should have a good foundational understanding of probit and logit modelling.

The main learning outcomes are to learn and understand the basic intuition behind non-linear regression methods in data analysis, along with the associated terminology and underpinnings; to learn how to comfortably interpret and analyse non-linear regression output; and finally to pick up some extra tips and tricks that will help you in future analysis.

Who is this class for? This class is aimed at those who are starting off their careers in data analysis. That could be practitioners, people working in government, policy, and business, and indeed students. This is an important addition to basic regression skills. The focus on non-linear modelling is a slightly more advanced concept, but it is a concept that is used very often in the real world.

What prerequisites are needed? There is no math, and you don't need to know any math to follow and get the most out of this class. You just need curiosity. Some Stata knowledge may come in handy for the practical application of this class, but it's not required. Stata is a statistical software program that allows users to estimate many different types of regression models, and we'll use this program to demonstrate some logit and probit examples. A keen interest in understanding how data might be related to each other also helps: often, data analysis is all about measuring how quantitative variables relate to each other. So if you want to know how y is related to x, then this is the right place for you.

Using Stata. I'm going to be using Stata to demonstrate logit and probit regression examples. Stata is a purchasable statistical software package, and you can find out more at www.stata.com, including many classes on how you can use Stata, should you be interested. In this class I will not teach Stata; I will focus on the interpretation of the output. Note that the output will look very similar to that of other statistical software packages such as R or SPSS. If you do by chance use Stata and you're interested in replicating the examples from this class, I have attached the relevant files to this class. The two files are Stata syntax files that contain code which allows you to replicate what I'll be showing you on screen. I'm going to be using the NLSW training dataset that comes built into Stata for the practical examples. This is a training dataset that contains a variety of useful variables and relationships on labor market outcomes. So let's proceed to the next section and learn more about non-linear regression methods.
2. What is Non-Linear Regression analysis?: What is non-linear
regression analysis? Just like linear regression analysis, non-linear regression analysis is a statistical technique that examines the relationship between one dependent variable, y, and one or more independent variables, x. Alternative terms used for the dependent variable are outcome, response, or endogenous variable. Alternative terms used for the independent variables are predictor, explanatory, or exogenous variables. Like linear regression models, non-linear regression models are often written in the form y equals x1 plus x2 plus x3, and so on, where the last term is an error term, often denoted by e, that captures everything that is missing. We'll avoid writing too many equations in this course, so we'll leave the expression like this.

Variables can take many forms in non-linear regression analysis. They can be continuous; in other words, the data can be measured anywhere on a number line, to many decimal points. They can be in integer format, such as 12 or 3. Data can also be in binary format, such as 0 or 1. Sometimes data are ordinal: ordinal data are categorical data that are ranked, such as Likert scales. Finally, data can also be nominal. This is categorical data that is unranked, for example, different modes of transport.

The key difference with linear regression is that for non-linear regression models, the dependent variable is often not continuous. Non-linear regression is primarily used when the dependent variable y is measured as an integer, binary, ordinal, or even nominal variable. This obviously applies to a lot of variables in real life, and this is one of the reasons why non-linear regression methods are so common.
3. How does Non-Linear Regression work?: How does non-linear
regression work? Non-linear regression assumes that the estimated parameters relate to the dependent variable in a non-linear way. These parameters, or coefficients, are what regression analysis estimates. For example, take y equals one times x. In the linear world, this means that for every one unit change in x, y will increase by one unit. However, in a non-linear world, we can't be sure what the change in y is. The change in y depends on the specific value of x. It could be more than one, or it could be less than one. The exact value will depend on the type of non-linear transformation used. This unfortunately makes interpreting non-linear regression models much harder. The raw coefficients often have no reasonable interpretation. That is why it is important to understand how the coefficients from non-linear regression models can be transformed into something useful. Often, this is done using a marginal effects computation.
4. Why is Non-Linear Regression analysis useful?: Why is non-linear
regression analysis useful? Like linear regression, non-linear regression is used to answer questions that require quantitative evidence. Like linear regression, it allows us to examine the effect of an explanatory variable on a dependent variable, controlling for other factors. It is used for hypothesis testing and for predictions, very much like linear regression. However, non-linear regression has a significant advantage with certain data types. Specifically, it helps us avoid out-of-bounds predictions. For example, if a dependent variable is measured as a binary variable, in other words 0 or 1, linear regression can predict probabilities of greater than one or less than zero. But how can we have a less than 0 per cent chance of doing something? Alternatively, dependent variables like time require positive predictions only. If someone is given a drug, how much longer will they live? Well, at minimum, it must be 0 or more, right? So predictions from such models should not be below zero. Non-linear transformations ensure that we don't predict nonsense from our regression models.
5. Types of Non-Linear Regression models: What types of nonlinear
regression models exist? Quite a lot, actually. Whilst linear regression models, such as ordinary least squares, remain the most commonly used regression method, it turns out that many popular regression methods are actually non-linear. The most famous examples of non-linear regressions are probably logit and probit regression models. These are regression models for binary dependent variables; the dependent variable is often measured as 0 or 1. Common examples include voting decisions, being unemployed, educational attainment, choosing to do something, and so on. Logit and probit models use non-linear transformations to ensure that model predictions stay within the 0-1 boundary. Both models are very similar, but they use slightly different non-linear transformations.

To analyze dependent variables that have ordered categories, such as Likert scales, we often use ordered logit and probit models. These are very similar to logit and probit models and use similar non-linear transformations. The additional trick that these models use is to include cut points in their modelling, which estimate where decisions are cut so that predictions into the different categories can be made.

Another class of non-linear models are multinomial logit models. These are often used when a dependent variable consists of unordered or nominal categories. A famous example is which mode of transport people take: the bus, the car, or the train. Note that multinomial probit models do exist, but they are not frequently used.

However, non-linear models do not only work on categorical choice models. Some data types require that predictions are bounded between 0 and positive infinity; in other words, the model should not predict negative values. Examples include count regression models and time regression models. Both require transformations so that the predictions from these models are not negative. The Poisson and negative binomial regression models are common examples for count data, whilst the Cox proportional hazards model is a common example when time is the dependent variable in a regression.
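For those following along in Stata, each of these model families has a standard estimation command. Here is a minimal sketch; the variable names y, x1, and x2 are just placeholders rather than variables from our dataset:

    * Binary outcomes
    logit  y x1 x2
    probit y x1 x2

    * Ordered categorical outcomes (e.g. Likert scales)
    ologit  y x1 x2
    oprobit y x1 x2

    * Unordered (nominal) outcomes
    mlogit y x1 x2

    * Counts bounded at zero
    poisson y x1 x2
    nbreg   y x1 x2

    * Time-to-event outcomes (the data must first be declared with stset)
    stcox x1 x2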
6. Maximum Likelihood: Maximum likelihood. Whilst ordinary least squares is estimated by solving the least squares equations, most non-linear models are estimated using maximum likelihood. Maximum likelihood is a numerical method that estimates the values of the parameters that have the greatest likelihood of generating the observed sample of data. Maximum likelihood is often estimated iteratively, which means the computer performs many calculations to narrow down the best possible parameters. I'm not going to explain this technique in a lot of detail, but here are some basic tips that should be observed when dealing with maximum likelihood estimation. Maximum likelihood should be used when samples are larger than 100 observations; 500 or more observations is best. More parameters require more observations: a rule of thumb is that at least ten additional observations per extra parameter seems reasonable. However, this does not imply that the minimum of 100 observations is not needed. Maximum likelihood estimation is more prone to collinearity problems; much more data are needed if explanatory variables are highly collinear with each other. Moreover, little variation in the dependent variable, in other words too many outcomes at either one or zero, can also lead to poor estimation. Finally, some regression models with complex maximum likelihood functions require more data. Probit and logit models are the least complex; models like multinomial logit models are very complex.
7. The Linear Probability Model: Linear probability model. Let's have a look and explore why non-linear
regression might come in handy by examining the linear probability model. The linear probability model is a standard ordinary least squares regression applied to a model where the dependent variable y is binary. But before we continue, please note the following. The linear probability model is often used to demonstrate the point that it is a bad idea to run a linear regression through categorical data. However, often the results from the linear probability model will be very similar to the final marginal effects from a logit or probit model. I will demonstrate this later. But for now, be warned that whilst we often state that the linear probability model is wrong, the truth is probably more complex. It can be surprisingly useful when used with the right amount of knowledge. Also, be aware that if you ever do decide to use the linear probability model, you need to use robust standard errors, as the linear probability model causes heteroscedasticity.
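In Stata, for example, that simply means running ordinary least squares with robust standard errors. A minimal sketch, with y and x as placeholder variable names:

    * Linear probability model: OLS on a binary outcome,
    * with heteroscedasticity-robust standard errors
    regress y x, vce(robust)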
Imagine for a moment that we have a very simple dataset that contains only two variables, y and x. We're interested in the relationship between y and x. Imagine that y is measured as a binary variable, either 0 or 1, and x is measured as a continuous variable. Before we go further, let's see how this would look on a graph. It would look something like this. Each continuous x observation is associated with either a 0 or a 1 y observation. A scatterplot is probably not the best way to visualize this kind of data. But bear with me: because the sample size is not enormous, we can just about make out that observations with higher values of x are more likely to have a value of y that equals one, whilst observations with lower values of x appear more likely to have a y value of 0. This tells us that there seems to be a positive relationship between x and y: increases in x lead to a higher chance of y being one. So far, so good.

But of course, doing this visually has its limits. We don't know what the exact relationship between y and x is. We could plot the relationship between y and x using a nonparametric fit. This method clearly tells us there is a positive relationship between y and x. Initially, the relationship is non-existent. Then, at a certain value of x, the relationship becomes positive. After a certain higher value of x, the relationship flattens off again and becomes non-existent. Great. However, we've already discussed the problems with nonparametric methods in a previous course. We want to be able to parameterize the relationship between y and x so that we can compare it to other data or give this information to somebody else. How can we do that?

One way is to use ordinary least squares and run a simple linear regression through the data. That would result in something that looks like this. The linear fit clearly establishes a positive relationship between y and x. The estimated slope coefficient of this regression is approximately 0.23. In other words, for every one unit increase in x, the probability of y being one increases by 23 percentage points. Great. Next, let's plot the estimated predicted values of y from our simple regression model. There seems to be a problem with our model. The predictions from our linear regression model result in three observations having a predicted y value above 1 and one observation having a predicted y value below 0. This is the problem of the linear probability model: its linear nature, by definition, predicts values outside our bounds. That doesn't make sense. Such results are nonsensical. It is not possible to have a probability of voting for party A of 120%. Unfortunately, no matter what the relationship between y and x is, any linear relationship will at some point predict y values that go out of bounds. In this example here, I drew a slightly shallower regression slope through this data, but you can still see that at some point it will go out of bounds. There is no escaping this problem with linear regression. Something will always be a little bit wrong. Clearly, we need a better kind of model.
8. The Logit and Probit Transformation: The logit and probit
transformation. The answer is to use a non-linear model. Specifically, in this case, we need to use some kind of transformation that makes the linear relationship between y and x non-linear. The two most commonly used transformations for our previous problem are the logit and probit transformations. Both transformations ensure that the relationship between y and x remains bounded within 0 and 1. In other words, there can be no out-of-bounds predictions from these regression models. The mathematics behind these transformations can look a little bit complex, so let's explore both transformations visually. Here is the estimated relationship between y and x from a logit and a probit fit. You can see that both are very similar in how they relate y and x together. In general, both have a very similar shape and offer the same kind of predictions. There's often very little reason to prefer one over the other, and both are frequently used in applied work. Both models predict y values that are now bounded between 0 and 1. Take a look: the predicted values of y from both the logit and probit regressions stay within the 0-1 bound of y. Fantastic. It looks like we solved our problem. Linear probability is out and non-linear models are in.
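If you wanted to reproduce this kind of picture in Stata, the pattern is to fit both models and plot their predicted probabilities. A minimal sketch, again with placeholder variable names y and x:

    * Fit both models and store the predicted probabilities
    logit y x
    predict p_logit, pr

    probit y x
    predict p_probit, pr

    * Overlay the raw data and the two bounded fits
    twoway (scatter y x) (line p_logit x, sort) (line p_probit x, sort)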
9. Latent Variables: Latent variables. Non-linear models are generally more difficult to interpret than linear models. Let me explain why. Many non-linear models, like logit and probit models, assume that there is a linear process underlying each dependent variable. What does that mean? Well, imagine your decision to eat: to eat, or not to eat. How do you decide? Logit and probit models assume that underneath your decision to eat or not to eat is a continuous and infinite hunger scale. If you're not hungry, you don't eat. If you're a little bit hungry, you still don't eat. If you're a little bit more hungry, you still don't eat. But at some point your hunger becomes too much and you decide to eat. This is how logit and probit models work. They assume that every choice decision is the realization of people passing some invisible cut point on a hidden continuous process. We call such a process a latent process. We often denote such a process with a variable called y star. In our equations, y star will be a function of many factors. For example, if y star is hunger, it might be a function of exercise. If exercise is measured as x, then the relationship between exercise and hunger might have a positive coefficient of one. However, y star is always hidden from us. We don't see it. We can never observe this process. To make things more difficult, this is what logit and probit coefficients relate to: they are coefficients that relate to y star. This means that probit and logit coefficients have no natural interpretation. They simply don't make sense. A one-unit increase in x will lead to a one-unit increase in unseen hunger? That doesn't make sense. What do we observe? We observe the realization of y star, often called y. In other words, did somebody eat or not? To figure out how x is related to the realization of the choice, we need to transform the coefficients from non-linear models such as logit and probit regression into something useful. This is often done using marginal effects.
10. What are Marginal Effects?: What are marginal effects? Marginal effects, or slope coefficients, are sometimes also called partial effects. In linear regression, the estimated coefficients are the marginal effects. That is because they have a constant slope that doesn't change: every one unit increase in x leads to a beta change in y. However, in non-linear regression, such as probit or logit regression, the slopes constantly vary. There is no single marginal effect. This is why we must compute marginal effects at particular points. Two types of computations are most popular: effects computed at the mean of x, and the average of all effects computed along every point of x. These are the most common marginal effects in practice, but users can also choose any other point that makes sense to them.

Let me demonstrate this visually. Here we are back with one of our non-linear fits of y against x; in this case, the fit is a probit fit. Each data point has a predicted value of y. Along this fit, we observe that as x increases, so does the probability of y being one. We also note that the relationship between x and y is not linear. To understand the effect of x on y, we compute marginal effects: marginal effects are the slopes at respective points of x. As you can see, the slope changes constantly. At low values of x, the relationship between y and x is almost flat. At average values of x, the relationship is strongly positive. At high values of x, the relationship is flat again. We need to choose some value of x at which to compute our marginal effects. The mean of x is usually a good value. In this particular case, the slope coefficient is approximately 0.30. This means that the effect of x on y is as follows: a one-unit change in x causes a 30 percentage point increase in the probability of y being one. Just remember, this relationship does not hold across all values of x. At higher values of x, further increases in x lead to much smaller increases in the probability of y being one.
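In Stata, both of these computations come from the margins command after the model is fitted. A minimal sketch with placeholder variables y and x:

    probit y x

    * Marginal effect of x evaluated at the mean of x
    margins, dydx(x) atmeans

    * Average marginal effect: the slope at every observation, then averaged
    margins, dydx(x)

    * A marginal effect at any other point that makes sense to you, e.g. x = 2
    margins, dydx(x) at(x=2)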
11. Dummy Explanatory Variables: Dummy explanatory
variables. So far, we've established that the coefficients coming out of a non-linear model require a bit of extra work to make sense of. However, we've only looked at a single continuous variable. To be precise, we looked at a model along the lines of y equals beta times x plus an error term, where x is a variable that is measured continuously. What about if we include an additional dummy variable in our model? In other words, we want to estimate a model along the lines of y equals beta times x plus beta times a dummy variable plus an error term. Dummy variables are binary variables that often take the numbers 0 or 1, a bit like our dependent variable y. In linear regression, coefficients on dummy variables are sometimes called intercept shift coefficients, because they change the intercept. In other words, they move the entire relationship between x and y upwards or downwards. However, in non-linear models, their effect is not constant. They still shift the non-linear relationship between y and x up or down, but the size of the shift is not constant.

Let me show you this graphically. In this example, we continue to fit a non-linear fit to our observed data. Y is measured as a binary variable and x is measured continuously. However, the actual model underneath this regression also includes a dummy variable. Dummy variables act as an intercept shift. Observations with a dummy value of one, say these represent men, have a higher probability of observing a y value of one for any given value of x. However, as can be clearly seen here, the size of this effect varies depending on where we are. At low values of x, the effect of the dummy variable is almost negligible. At medium values of x, the difference between the two curves is high. And finally, at high values of x, the effect of the dummy variable decreases again. This all makes sense. It is because we continue to bound our relationship between y and x between 0 and 1 via the non-linear, in this case logistic, transformation. Therefore, any stepwise effect from a dummy variable must also be non-linear to continue to ensure that we don't go out of bounds with our predictions.
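One hedged way to see this in Stata is to ask for the effect of the dummy at several values of the continuous variable. A sketch with placeholder names y, x, and d, where d is the 0/1 dummy entered as a factor variable:

    logit y x i.d

    * Effect of the dummy on the probability of y = 1
    * at low, average, and high values of x
    margins, dydx(d) at(x=(-2 0 2))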
12. Multiple Non-Linear Regression: Multiple non-linear regression. Finally, what about when we have a regression model with multiple continuous explanatory variables? How does that work? Let's take our previous model with a dummy variable and simply add in another continuous explanatory variable; let's call it x2. This gives us a model along the lines of y equals beta times x1 plus beta times x2 plus beta times the dummy variable. The key thing to understand about multiple non-linear regression is that the effect of each beta will vary not just according to the value of the x it belongs to, but also according to the values of the other x's. In other words, the effect of each beta will depend on the value of every x, not just the variable in question. In practice, we often measure the slope of each coefficient at the mean value of all of the x's. This can be hard to comprehend, so again, let me show you a visualization of a logit model with two continuous variables and one dummy variable.

Here is a visualization of the aforementioned logit regression model. Our data consist of one dependent variable that takes only the values 0 and 1; that is y. On the left-hand graph, that data is distributed on the ceiling and floor of the three-dimensional image. Our data also consist of two continuous explanatory variables, x1 and x2. Both have a positive relationship with y, but it's pretty hard to figure that out from our scatterplot. On the right graph, we've plotted the predicted values from a logit regression. Whereas a linear regression model, such as ordinary least squares, attempts to fit a linear plane of best fit through these data, logit regression fits a non-linear plane of best fit through these data. However, the logit plane of best fit is not non-linear in relation to only one x variable: the slope of the plane changes according to both x variables. Specifically, the values of both x's will determine the relationship between x1 and y, and also between x2 and y. All of this can be quite a tricky concept to grasp, and if we add more explanatory variables, all of this moves into higher dimensions. Finally, the effect of the dummy variable is also visualized here. We have two planes of best fit in this graph: one plane is for all the values of 0 for the dummy variable, and the other plane is for all the values of one for the dummy variable. I think it's obvious to see how difficult it can be to make sense of such models. It's basically impossible.
13. Goodness-of-Fit: Goodness of fit. Now that we have a
reasonable understanding of how non-linear regressions, such as logit and probit regression models, work, let's talk about how to measure whether such regression models fit the data well. The traditional R-squared value from ordinary least squares does not exist for non-linear models. There is no sum of squares computation coming from these kinds of models, which means we cannot compute how much variance is explained and unexplained. Other ways to measure fit are needed.

Many software packages compute something called a pseudo R-squared. This attempts to mimic a goodness-of-fit diagnostic by first estimating a so-called null model. The null model is a model with no explanatory variables and only a constant. A second model with the full covariates is then estimated, and a comparison of the log-likelihood functions is made. The ratio of how much better the full model is, is then provided as a pseudo R-squared. It can be a useful statistic, but it should never be considered to be similar to the traditional R-squared. There is some danger here.

Another way to compute the goodness of fit is to look at something called a classification table. A classification table assigns predicted values from the model to either 0 or 1. Values that are predicted to be 1 and are actually 1 are classified as correct. Likewise, values that are predicted to be 0 and are actually 0 are also classified correctly. Any other values are classified as incorrect. The proportion of correctly classified values then serves as an indicator of how well the model fits the data.

Here's an example of a classification table from Stata. There is quite a lot of output going on here, so let me explain what's happening. At the top we see a classification table for a logistic regression model. We have a total of 100 observations. Of these, 63 observations are classified as 1 and 37 observations are classified as 0. Of the 63 observations that are classified as 1, 45 are actually one values in the raw data and 18 have 0 values. Likewise, for those with a prediction of 0, 11 are actually ones in the data and 26 are zeros in the raw data. So a total of 71 out of 100 observations are predicted correctly, and we can see at the bottom that 71% of observations are correctly classified. A higher value indicates a better fitting logit or probit model. Generally, values above 80 or 90 are excellent, values in the 70s are good, values in the 60s are okay, and values in the 50s indicate a poor fitting model. Remember that simply by rolling the dice, we could expect to classify 50% of the values correctly, so 50 per cent should be seen as the baseline here. There are quite a few other statistics in this table, but all are just variations on a theme.

However, there's one last item to note. The classification depends on a cut value. By default, many programs use 0.5. In other words, values above 0.5 are predicted as one and values below 0.5 are predicted as 0. This is arbitrary. A value of 0.5 seems to make logical sense, but the cut point value can be changed, and this will result in completely different model fits. Here's an example of that. In this video, I'm demonstrating the impact on the goodness-of-fit statistic of changing the classification cut point. The graph shows the raw data points of a regression of a binary y variable against a continuous x variable. A logit model is estimated and the predicted values are plotted. Red values are classified as 0 and green values are classified as one. Gray values, slightly enlarged for better visual effect, denote incorrectly classified values. The initial cut point for classifying values is set at 0.5. Now, let's go ahead and change this. We can see that as we move the cut point value between 0 and 1, the proportion of correctly classified data points changes dramatically. In other words, this measure of goodness of fit is subject to what we think is the right cut point for classifying data points. This could never happen in a normal linear regression model. My personal advice is to stick with 0.5 unless there are very specific reasons not to do so. One reason might be very skewed data, for example, if a binary dependent variable has a very high or low proportion of ones.
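In Stata, the classification table is a postestimation command, and the cut point is an option on it. A minimal sketch with placeholder variables y and x:

    logit y x

    * Classification table using the default cut point of 0.5
    estat classification

    * The same table with a different, user-chosen cut point
    estat classification, cutoff(0.7)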
14. A note about Logit Coefficients: A note about logit coefficients. Probit coefficients do not have a natural interpretation, as they relate to the underlying latent score of the dependent variable, which by definition is always unseen and hidden. However, logit coefficients do have a natural interpretation, thanks to a quirk of mathematics. For logit models, the estimated coefficients can be interpreted as follows: a one unit increase in x causes a beta increase in the log odds of y being one. This natural interpretation has some meaning, but the log odds portion can still be a bit awkward. To overcome this, we can exponentiate the coefficients from logit models. This allows logit coefficients to be interpreted as odds ratios. Odds ratios are still complex to interpret, but they do mean that users are able to avoid the marginal effects computation. We can interpret an exponentiated logit coefficient as follows: for a one unit change in x, the odds are expected to change by a factor of beta, holding everything else constant. Odds ratios have a base of one when the odds are similar. Therefore, if the beta is above one, we can say that the odds are beta times larger; if the beta is below one, we can say the odds are beta times smaller. However, remember that whilst odds have some meaning, they do not reveal the magnitude of the change in the probability of the outcome. Only marginal effects can do that.
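In Stata, the exponentiated coefficients are available directly. A minimal sketch with placeholder variables y and x:

    * Raw logit coefficients, reported as log odds
    logit y x

    * The same model reported as odds ratios
    logit y x, or

    * logistic fits the same model and reports odds ratios by default
    logistic y x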
15. Tips for Logit and Probit Regression: Tips for logit and
probit regression. Whilst the data requirements for non-linear models tend to be higher than for linear models, it should be noted that probit and logit regression models are very robust to even small samples and scaling variation. In other words, whilst models like multinomial logit models require a lot of data, logit and probit regression can be done with a much smaller sample size. There's often very little reason to choose between logit or probit models. Both result in very similar predictions and similar marginal effects. However, one reason why some people gravitate naturally towards logit models is the extra flexibility of the odds interpretation of their coefficients. Raw logit coefficients are generally about 1.7 times larger than raw probit coefficients for the same model; however, the marginal effects will be very similar. It is generally good practice to report marginal effects at the mean of all other variables, or the average marginal effects. It would be strange not to report these when using such models. However, sometimes the marginal effects computation can be intensive. There are two ways to overcome this. Raw coefficients from logit and probit models still allow users to interpret the sign, relative size, and significance. Or one could resort to a linear probability model. Let me explain why.
16. Back to the Linear Probability Model?: Back to the linear
probability model. We started this course with a clear example of why a linear probability model is generally a bad idea. However, it turns out that there is a silver lining. Linear probability models often produce the same marginal effects as the marginal effects from logit and probit regression. If most of the variables in the regression model have normally behaved data, the marginal effects computation will often produce the same slope estimates as the slope estimates from a standard linear regression. In other words, it is possible to genuinely use a linear probability model to compute marginal effects for regressions with binary dependent variables. This can be really useful for situations where computational time needs to be reduced. Alternatively, it can be useful for complicated non-linear regression models, such as panel data logit models, where the mathematical complexities make the marginal effect calculation extremely difficult.

Here's an example of what I mean. Here, I'm using Stata to estimate a logistic regression between y and x, and the logit coefficient comes out at around 1.26. The average marginal effect computation produces a result of circa 0.24. In other words, the average marginal effect is that a one unit increase in x leads to a 24 percentage point increase in the probability of y being one. Now, let's take a look at an ordinary least squares regression using the same model. This model estimates a coefficient of 0.23. In other words, a one unit change in x leads to a 23 percentage point increase in the probability of y being one. This is almost identical to the logit model and highlights the potential usefulness of a linear probability model.
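The commands behind that comparison are just the two estimators side by side. A minimal sketch with placeholder variables y and x:

    * Logit coefficient, followed by its average marginal effect
    logit y x
    margins, dydx(x)

    * Linear probability model: the OLS slope is already a marginal effect
    regress y x, vce(robust)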
17. Stata - Applied Logit and Probit Examples: Let's explore some of
these concepts we've been discussing in an applied environment. We are now in Stata, which is a statistical software package commonly used to analyze quantitative datasets. It is similar to other packages such as SPSS or SAS. I won't explain how to operate Stata or the code that I'm executing to obtain these results; you can learn more about Stata in specific Stata courses. I've already opened up a training dataset called the National Longitudinal Survey of Women, 1988. Let's examine it a bit more closely before we start running regressions, starting with a description of the data.
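If you are replicating this at home, the dataset ships with Stata, and the first few commands look something like this (a minimal sketch):

    * Load the built-in NLSW 1988 training dataset
    sysuse nlsw88, clear

    * High-level description of the dataset and its variables
    describe

    * Basic summary statistics for every variable
    summarize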
The output returned by describe provides high-level information about the data, such as where it is located, how many observations and variables are included, and its size. In this case, our data contain 2,246 observations and 17 variables. That's a fair sample size, but modern datasets tend to be a lot bigger. Below this is information on the variables. All variables are measured as numeric variables, while some are measured at different precisions. There are no string variables in this data. The variables relate to the labor market outcomes of a sample of women aged 35 to 45 in 1988. We have information on their ages, wages, occupation, education, and more. Good. Now let's do a quick summary. Summarize provides us with some basic statistics for each variable, such as the observation count, the mean, the standard deviation, and the minimum and maximum values. Scanning through the data reveals that most values look normal for what we would expect. The average age is 39 years and 64% of the sample are married. Wages look fine, although we know that the variable union has some observations missing.
Now, let's pretend we're really interested in explaining the determinants of union membership. We can already start building a picture in our heads of what variables might be important in explaining the choice of being a union member. Wages and education are likely to be important factors, and maybe age too. In fact, a lot of the variables here might be important factors in determining someone's decision to be a union member. To keep things easy, let's only include a small number of variables to start with. Let's pick age, wage, married, and college graduate as our variables.

The variable union looks like it is measured as a binary variable. Let's confirm this with a tabulation. Indeed, the variable is measured as a binary variable, and 24.5 per cent of our sample are members of a union. Next, let's plot the variable union against the first variable on the list, age. This is a good example of why a graphical analysis of binary data can be difficult: we can't really see anything here, other than that, for each year of age, there are union members and non-union members. We could draw a local polynomial smoother through this plot to get a better understanding of what the relationship between age and being a union member looks like. It doesn't look like there is a particularly strong relationship between age and union membership.
For demonstration purposes, let's now estimate a parametric relationship using a logit model. We'll only use age as an explanatory variable for now.
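The model itself is a single command (a sketch, assuming the nlsw88 variable names):

    * Logit regression of union membership on age only
    logit union age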
Stata's logit regression output looks very similar to that of a standard ordinary least squares regression output. Diagnostic information is presented at the top and results are presented below that. At the very top of the results, we see the maximum likelihood process taking place: Stata computes several models with different parameters, estimates a log-likelihood each time, and then converges on the set of parameters that offers the best log-likelihood. Because logit and probit models are so well developed, it doesn't take many iterations to achieve a final set of results. The final log-likelihood is presented here. Next, we have information on the observation count and a likelihood ratio chi-square statistic. This statistic is similar to an F-test for linear models and tells us whether the model explains something or not. In this case, the answer is that it does not, since the p-value of the chi-square statistic is way above 0.05. Next is the pseudo R-squared, which further confirms that this is a terrible fit. Whilst one should never translate this as being analogous to the linear R-squared statistic, a value of 0.0001 is extremely bad.

In the results section, we see why: the coefficient on age is very small and the standard error is high. The associated z statistic is analogous to the t-statistic in linear regression; values above 1.96 imply statistical significance for reasonably sized samples. The p-value also has the same meaning as for linear models: values of 0.05 or below are statistically significant at the 95% level. Both the z statistic and the p-value show that the variable age is very statistically insignificant. To further illustrate this, we can compute the predicted probabilities of union membership from this model and plot them on our graph.
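A sketch of the commands used for this plot, assuming the variable names from the model above:

    * Predicted probability of union membership from the age-only logit
    predict p_union, pr

    * Raw data overlaid with the predicted probabilities
    twoway (scatter union age) (scatter p_union age, mcolor(red))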
The blue dots represent the raw data points and the red dots represent the predicted probabilities of union membership. The result is that there's virtually no relationship between age and union membership. It is hard to see, but the predicted relationship here is still non-linear; it's just that the non-linear part in this range of the data is so flat that we can hardly see it. If we predicted this relationship into higher ranges of age, we could see the logit transformation. Here it is: using an age range of minus 1,000 to plus 1,000 reveals a non-linear relationship between age and union membership from this particular logit model. Obviously, this doesn't make a lot of sense. We are predicting far out of bounds, and ages below 0 are not possible.
Let's go back to our logit model and add in some more variables. We know that age is not statistically significant, but unless there's a problem with sample size, my advice is generally not to exclude statistically insignificant variables. The reason is that controlling for additional variables might make earlier variables statistically significant. Let's take a look. We'll add wage, married, and college graduate as further explanatory variables to our model.
significant. This means our variables
to explain something. Pseudo R-squared is 0.023, which is much
better than before. However, it still
seems like low value. It is worth exploring this further with a
classification table. The moment. First, looking at the results, we see that two variables are statistically significant
at the 95% level, wage and college graduate. One variable, married is
statistically significant. At the 10% level. The currently
presented coefficients are difficult to interpret, but we can infer size,
sign, and significance. Wages are positively related to the probability of
being a union member. Being a college graduate is
also positively related. Being married is negatively related to being a union member. Both college graduate
and married. A dummy explanatory variables. So we can infer that
the effect of being a college graduate is stronger than the effect
of being married. This is because the
absolute coefficient of college graduate is
around 20% larger than the coefficient of married. To make a sense of
the coefficients in a more meaningful way, we would normally compute
marginal effects. This can be done easily
and states and by default, state to compute the
average marginal effects. In other words, all
the slopes across every value of x
and then averages. These here are the results. States are computed the
average marginal effects with respect to all variables. The effect of age
is insignificant, but the interpretation of
the estimate is as follows. On average, a one unit
increase in age increases the probability of
union membership by 0.1 percentage point. Wage is also a
continuous variable. The interpretation
is, on average, a one unit increase. In hourly wage increases
the probability of union membership by
1.2 percentage points. Married and college graduate,
or dummy variables. So they can be interpreted
as, on average, being married decreases
the probability of union membership by
3.9 percentage points. On average. Being a college graduate
increases the probability of union membership by
4.6 percentage points. Great. We can also compute specific
module effects to answer questions about how
specific people might be affected
by change in x. For example, the effect
of being married on union membership is minus
five percentage point. For women who are aged 40 with a college background and
a wage of $30 per hour. Next, let's explore goodness
of fit a little bit closer. The pseudo R-squared
value was 0.0231. By calling a
classification table, we can obtain more information. The classification table
file logit regression, shows that we classified 75%
of observations correctly. And that seems like a
pretty good number. But it is important to examine the classification
table in more detail. Whilst our model
did a good job of predicting 0 values
that are actually 0, it is a very bad job at
predicting any positive values. Only 20 observations are
predicted to be union members. We know from our
summary statistics around 450 observations. Actually union members,
what's the proportion of correctly classified
values is relatively okay. A further inspection of the classification
table tells us that our model does a bad job at
predicting positive values. It clearly needs more work. Next, let's compare the
output from the logit model. The results from a probit
and linear probability model comparing the raw coefficients
won't be very useful. Let's compute the marginal
effects for each model. The linear probability model produces marginal
effects by default. For logit and probit regression. We need to ask STATA
to compute them, will store these
estimates and then compare them in a table like so. The results table indicates that all three models produce
very similar results. The marginal effects
are almost identical. For example, being
married results in a full percentage
point decrease in the probability of
being a union member. From the linear
probability model. A three-point nine
percentage decrease from the logit model, and they fall percentage decrease some of
the probit model. Finally, before we finish, let me show you the concept of Lake variables with
a probit model. This can be a hard
concept to understand, so I prefer to demonstrate
this with simulated data. Let's clear everything
in our data. Let's invoke the set command
that tell Stata to do something 1000 times when we invoke random
number commands. Finally, let's set a seed so we can reproduce our results. I'm now going to
generate a new variable out of thin air using
status random number function's going to generate a new variable called x that
is normally distributed. Let's do a summary to
explore what I've done. I've generated a new dataset
that has one variable x. This variable is
normally distributed. It has a mean of 0 and a
standard deviation of one. Kernel density plot shows the normal distribution
of this variable. Next, let's generate
another variable called e that is also
normally distributed. This variable will mimic an
error term in a regression. Now, let's generate a third
variable called y star. We generated y star equals to two times x plus one times E. So there is a positive
relationship between Y star and X of slope two. However, let's now pretend that y star is a latent and
unobserved process. We don't actually see why star. What we see is why the
realization of y star. Y is one. If y star is greater
than 00, if it is less. If we tabulate why we see that
51% of observations are 1, 9% of observations are 0. Now, let's want to probably the regression
of y against x. Look at that. The Probit coefficient
is approximately two. This coefficient relates to the underlying relationship
between Y star and X. This is what we mean when we
talk about latent variables. How logit and
probit coefficient, or the coefficient of
underlying latent processes. If we change the value of two to four in our
Weinstein generation, the probit model will predict
a coefficient of four. Hopefully. This little simulators
example made the concept of latent variables more
real and easier to grasp.