Stata: Survival Analysis | Franz Buscha | Skillshare

Stata: Survival Analysis

Franz Buscha

Lessons in This Class

8 Lessons (55m)
    • 1. Intro skillshare survival
    • 2. Survival 1 - What is Survival Analysis?
    • 3. Survival 2 - Setting up Survival Data
    • 4. Survival 3 - Descriptive Statistics in Survival Data
    • 5. Survival 4 - Non-parametric Survival Analysis
    • 6. Survival 5 - Cox Proportional Hazard's Model
    • 7. Survival 6 - Diagnostics for Cox Models
    • 8. Survival 7 - Parametric Survival Analysis

About This Class

This class is a Stata module that explores how to analyse, and model, survival data using the statistics software Stata.

Survival analysis - also called duration analysis or event data analysis - examines the expected duration of time until one or more events happen, such as the death of a patient, the time until someone leaves unemployment, or failure in a mechanical system.

The key concept of survival analysis is that time is the variable of interest. In regression terminology we say that time is the dependent variable.

If you are working with, or analyzing, data that puts time as the dependent variable, then you will require a special class of modelling techniques.

In this class I will highlight some of the most important concepts of survival data analysis including:

  • What is survival data
  • What is non-parametric survival analysis
  • What is the Cox proportional hazards model
  • What is parametric survival analysis

You are expected to have a basic understanding of Statistics and Stata to get the most out of this course.

Meet Your Teacher

Franz Buscha

1. Intro skillshare survival: Hello and welcome to this short course on Stata survival analysis. This course is a Stata module that aims to provide viewers with a quick overview of how to analyze survival data in Stata. To gain the most from this course, you should have a basic understanding of statistics, specifically regression analysis, and you should have a basic understanding of Stata. If you don't have the first, I recommend that you watch my Easy Statistics course, which focuses on the concept of regression analysis. If you don't have the second, I recommend that you watch my Learn Data Analytics with Stata course, which gives you a great primer on how to use Stata. Please note that Stata is proprietary software. It is not free. But don't worry, even if you don't have access to Stata, you'll still learn something from this course. Survival analysis, also called duration analysis or event data analysis, examines the expected duration of time until one or more events happen, for example the death of a patient, the time until someone leaves unemployment, or failure in a mechanical system. The key concept of survival analysis is that time is the variable of interest. In regression terminology, we say that time is the dependent variable. If you're working with or analyzing data that puts time as the dependent variable, then you'll require a special class of modelling techniques. In this course, I'll highlight some of the most important concepts of survival data analysis, including: What is survival data? What is non-parametric survival analysis? What is the Cox proportional hazards model? And what is parametric survival analysis? These models often form the basis of survival data analysis, and if you work with data where the primary variable of interest is in time form, then you should know about these models. This course is not intended to be an in-depth mathematical exploration of survival analysis modelling.
There are no equations in this course. Each session will provide a quick overview of the statistical issue at hand and then move on to Stata, where I will demonstrate how to do things and how to interpret things. The outcome of the course is to give a clear and basic overview of how to handle survival data. Because this is an advanced topic, all Stata instruction is via code, and you should have some basic understanding of Stata coding. So let's go ahead and jump into the analysis of survival data.

2. Survival 1 - What is Survival Analysis?: Let's go ahead and explore the concept of survival analysis. Sometimes this is also called event history or duration analysis. Survival analysis is used when the dependent variable in our regression is time, or t. Time can be measured in many ways. It can be measured in seconds, minutes, hours, days, or even in unknown units. The key thing to be aware of is that we do not usually use ordinary least squares estimation when time is the dependent variable. There are many reasons for this, but a key reason is that ordinary least squares assumes that time is distributed normally, but time is not normally distributed, and it cannot take negative numbers. So we need to apply a different methodology to ensure that we don't get any silly predictions like negative time. Another important point is that survival analysis often contains graphics. A lot of analysis is done through visual-based approaches. This doesn't have to be the case, but generally there's always a strong visual element to survival analysis. Here's an example of survival analysis. This study starts with 100 people. After a certain amount of time, only 40 people were left in the study. After more time, only one person remained. In this case, the research questions that we might be interested in might be: What is the shape of this function? How do we estimate this function?
How many people survive after a particular time point? And what is the probability of surviving past a certain time? We'll try to answer all these questions using a variety of different methodologies in the following videos. What kind of methodologies exist in survival analysis? Well, it turns out there are three main methodologies in survival analysis. There is non-parametric analysis, there's semi-parametric analysis, and finally there's parametric analysis. Each imposes different amounts of structure on the data, and we'll explore these in more detail in separate videos later. But in layman's terms, the difference between these three can be explained as follows. Non-parametric analysis lets the data speak for itself. It makes very few assumptions. What you see is what you get. On this graph, you can see the survival functions for two types of patients, and this is pretty much just a reflection of the raw data. We might think this is a good approach, but if it's not there, we don't see it, and that can be an issue, for example if we want to predict outside our sample range or our sample is too small. And we will quickly end up with tiny samples if we want to investigate our results over many different variables. For example, if we have a large dataset but we want to do a survival analysis by different types of gender and age and ethnicity across regions, we might very quickly end up with samples that can't support non-parametric analysis. So that can be a problem. One solution to this is to mix non-parametric analysis with parametric analysis, and this results in something called semi-parametric analysis. In this case, we often let the raw data compute one core survivor or hazard function, where all variables that we're interested in are set to 0. Variables that we're interested in, such as patient type or gender, will then simply move this underlying base function proportionately up or down.
In other words, the effect of variables, as we like to think of them in our regression terminology, is that they do not change the underlying non-parametric function themselves, but simply move the baseline function up or down. All variables are therefore proportionally related to each other. This is shown in the second graph, where both types of patients follow the same kind of survival pattern but on different levels. The final type of analysis is parametric analysis. This specifies the kind of shape the survivor or hazard functions can take, from various distributions. This kind of model imposes a lot of form on the data and therefore also on the results. The effect of the variables is similar to the semi-parametric models in that they move the underlying survivor function up or down, but this time as a function of the underlying predefined distribution. All variables are distributionally related to each other. So which should you choose? Well, if you aren't certain of what to do or you're just exploring the data, then it's usually best to start with non-parametric modelling and then proceed to semi-parametric modelling. But if you know exactly what you want, or you have a lot of covariates that you want to analyze, and you understand the assumptions that different distributions impose on the parametric method, then parametric methodologies can offer significant efficiency advantages. And that concludes this quick theoretical overview of survival analysis.

3. Survival 2 - Setting up Survival Data: Before we start executing survival data models in Stata, we need to declare the data to be survival-time data. And before we can do that, we need some basic understanding of how survival data is generally set up. In its simplest form, we'll normally follow a series of observations over a specified period of time called the observation period. Within this observation period, observations may fail or they may survive.
During the observation period, there's an onset of risk that some event will happen. Often this risk coincides with the start of the observation period, although that doesn't have to be the case. In the example shown here, observation one fails after a certain amount of time, whereas observation two survives until the end of the study. More complicated data patterns can of course also exist. For example, an observation may survive until the end of the observation period but fail at a later date. A patient may die, for example, from the effect of some medical drug after the medical study has ended. We tend to call this right censoring. Left censoring is the opposite: an observation failed before the observation period had even started. Sometimes there's also interval censoring, when observations jump in and out of the study, perhaps due to difficulties in tracing the relevant individual. And again, more complicated data patterns within this can also exist. However, we'll mainly focus on well-behaved survival data in what we'll be doing in these videos. But do understand that Stata can accommodate a wide variety of very complicated survival data patterns. Survival data is declared using the stset command. The syntax of this requires that users specify the time variable in question. However, often survival data setup is more complex, and this is where the options of stset become important. We'll explore some of the more commonly used options in this session. So let's head over to Stata and take a look. Here we are in Stata. I've already loaded the associated heart data file, which contains data on the survival times of heart transplant patients. Instead of doing a summary, let's go ahead and look at the raw data for the first ten observations. We can do that by executing list in 1/10. And here's our data. We see that we've got the following variables. We have an ID code. We have an indicator of whether a patient died.
We have the survival time measured in days, and we have an indicator of whether they received a transplant or not. We also have a variable that measures how long they had to wait. And finally, we have the patient's age. The next step is to call the stset command. Before we do that, let's jump to its help file quickly by typing help stset. Here's the help entry for stset. We can immediately see that stset applies to two different types of survival data: single-record-per-subject survival data and multiple-record-per-subject survival data. In the first case, we only need to specify stset and the relevant time variable, in this case stime for us. However, it's not uncommon for survival data to contain the same individuals multiple times. So if we have multiple-record-per-subject survival data, we'll need to specify the stset command with the time variable, and we'll also need to specify the id variable and what constitutes failure. I'll demonstrate this later. There are also options for the single- and multiple-record cases. Again, I will demonstrate some of these later, and most of them are fairly self-explanatory. I recommend that you have a good look at this help file before embarking upon survival data analysis. Let's close this and continue. Our survival variable is called stime. In this case, we only have single-record-per-subject data, so we can simply specify stset and then the survival variable, which is stime. There we are. We've now told Stata that this data is survival data. stset, when executed, gives us some basic information on our survival data. In this case, we can see that there are 103 observations and all of them fail. That is because we didn't tell Stata what the failure variable is, so it assumes that everybody failed. In this case, that would mean everybody ultimately died. That is not correct, so we'll need to fix that. To do that, we can use the failure option.
The failure option specifies what the failure variable is. In this case, the variable is died. We can do that as follows: stset stime, and then a comma to invoke the options, then failure, and in round brackets we specify the variable died, with failure being when died equals one, written died==1. This now tells Stata that it should treat an observation as a failure when the variable died is equal to one. stset now records that 75 individuals died during our study. In total, we have almost 32 thousand days of analysis time. Risk starts at 0 and the first entry into the study is at 0. The longest period of observation is 1,799 days. Note that stset also creates additional underscore variables that can be used to further explore the data. Sometimes survival data may contain multiple records per ID. For example, we may be exploring people in households, or people are observed multiple times within a study. For example, there might be interval censoring, where subjects drop in and out. We can accommodate such data by using the id option: stset stime, specify the failure, and then invoke the id option and in round brackets specify the id variable, which by coincidence in this case is also called id. Execute that. In this particular dataset, there are no multiple entries, so the information returned by stset is simply the same as before. Finally, if the risk in the study does not materialize at time 0, we can also specify a custom time using the origin option. We can do that as follows: specify stset and stime, specify the failure, specify the id variable in case we have multiple entries, and then specify the origin option, and in round brackets specify that time starts at ten. So in this case, a death would only be considered a failure if the person died more than ten days into the study. Let's execute that. stset now reports that certain observations exit the study before the onset of risk, and 92 subjects remain. 62 of those 92 subjects fail.
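The four stset calls narrated in this session could be typed roughly as below. This is only a sketch: it assumes the heart transplant data are already in memory, and the variable names stime, died and id are inferred from the narration rather than confirmed against the course file.

```stata
* Sketch only: assumes the course's heart transplant data are in memory.
* Variable names (stime, died, id) are inferred from the narration.

* 1. Time variable only: Stata assumes every observation fails
stset stime

* 2. Declare the failure indicator: died==1 marks a death
stset stime, failure(died==1)

* 3. Add the subject identifier (needed for multiple-record data)
stset stime, failure(died==1) id(id)

* 4. Move the onset of risk from time 0 to time 10
stset stime, failure(died==1) id(id) origin(time 10)
```

Each call replaces the previous declaration, so after the last line the data are set with the onset of risk at day 10.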
However, in this case, this would be the wrong way to set up our survival data, since risk actually began at time 0. There are of course many more options associated with stset. Feel free to explore these in your own time. But the basics of setting up survival data are covered by what was presented in this session.

4. Survival 3 - Descriptive Statistics in Survival Data: In this session, we're going to learn how to explore and produce initial summary statistics from survival data. It is important to explore your data once the data has been declared to be survival-time data. Exploring your data and producing summary statistics will uncover any initial setup errors and also provide you with important statistical information for later. As a rule of thumb, I recommend the following procedure. Firstly, carefully read the output provided by stset. Secondly, explore the underscore variables that are created by stset. And finally, use specific survival summary commands such as stdescribe, stsum and stptime to explore the data. Doing all of this will provide you with a good overview of the data. We use a few commands in this video, but all of them are designed to provide basic descriptive information about the survival data in front of you. We'll use summarize to explore the underscore variables. We'll also use the stdescribe command, which provides a brief description of the data. stsum also provides a brief description of the data. However, stsum can be used with the by option to produce summary statistics over multiple variables. That is extremely useful. stptime allows us to compute person-time and incidence rates. So let's head over to Stata and take a look at each of these in turn. Here we are in Stata, and I've already loaded the heart transplant data. Let's go ahead and set the data to survival-time data by using the stset command. Okay, that is now done. We've already examined the output produced by stset in the previous session.
So the next step is to go ahead and explore the underscore variables that were created by stset. The easiest way to do that is to simply do a summary on all of them. So let's go ahead and type summarize, then underscore and star (_*), with the star as the wildcard for all variables that begin with an underscore. And here are the four underscore variables that were created by stset. _st shows that our study population is 103 observations and all values are equal to one. In other words, there are no exclusions in this dataset. _d shows that 73% of observations in our dataset ultimately died. The rest survived and have a value of 0 by the end of the observation period. _t shows the time for each observation until exit from the study. The average length is 310 days, the standard deviation is 427 days, the minimum length is one day and the maximum length is 1,799 days. _t0 shows that the start date for everyone in this analysis is set at 0. So far, so easy. The next command we'll use is the stdescribe command. So let's go ahead and run that and see what happens: stdescribe. Execute. stdescribe provides statistics that are similar to the underscore statistics but have a little bit more detail. The report presented here includes the number of subjects and per-subject summary statistics related to the number of records, entry and exit times, gaps in the data, time at risk, and the number of failures. So for example, here we can see that we have 103 subjects and 103 records. This tells us that this is a single-record-per-subject dataset: we don't observe the same individual multiple times in our dataset. The first entry time is recorded at 0 for everyone. The mean exit time is recorded at 310 days. The minimum exit time is one day, the median is 90 days and the maximum is 1,799 days. Also, there are no gaps in our data.
And the total time at risk is approximately 32,000 days. 75 failures occurred, which corresponds to 73% of our data. Great. So that provided us with a little bit more information. Next, let's go ahead and take a look at stsum: stsum, execute. stsum provides more information on the survival time in our data. It presents the total time at risk, 32 thousand days, and the number of subjects, 103. The division of the number of subjects that died by the total time at risk is called the incidence rate, in this case 0.002. In other words, the incidence of deaths per subject-day was 0.2%. The median survival time for all subjects was 100 days, the 25th percentile was 36 days, and the 75th percentile was 979 days. stsum is particularly useful with the by option, which allows us to compute these statistics over other variables. An obvious question would be how survival time differs by whether a patient had a heart transplant. To explore this, we can type the following: stsum, and then invoke the by option and insert the dummy variable transplant. We now observe that there appears to be a difference in the median survival times by whether somebody had a transplant. The median survival time is 285 days for those with a heart transplant versus 21 days for those without a heart transplant. We also see that the incidence rate is much higher for those without a heart transplant. All of this is of course to be expected. We can also use the command stptime to compute and tabulate person-time and incidence rates. This command also allows you to use the by option, but where it differs from stsum is that it allows you to specify custom incidence rates. For example, if you work in epidemiology, incidence rates are often presented per 1,000 person-years. We can compute such scaled incidence rates by doing the following: stptime, by transplant, and then use the option per(1000).
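The descriptive workflow walked through in this session can be sketched as follows. It assumes the same heart transplant data are in memory and that the variables are named stime, died and transplant, which is inferred from the narration.

```stata
* Sketch only: assumes the course's heart transplant data are in memory.
* Variable names (stime, died, transplant) are inferred from the narration.
stset stime, failure(died==1)

* Summarize the variables created by stset: _st, _d, _t and _t0
summarize _*

* Records per subject, entry and exit times, gaps, time at risk, failures
stdescribe

* Time at risk, incidence rate and survival-time percentiles
stsum

* The same statistics, split by transplant status
stsum, by(transplant)

* Person-time and incidence rates scaled per 1,000 units of analysis time
* (analysis time is in days here, so the scaled rate is per 1,000 person-days)
stptime, by(transplant) per(1000)
```

Because per() simply multiplies the rate by the stated number of analysis-time units, the interpretation of the scaled rate depends on the unit in which the time variable was recorded.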
And here we can see that the person-time reported by stptime matches what was produced by stsum previously. However, the incidence rate is now reported per 1,000 units of person-time, as opposed to the previous rate per single unit of person-time that was produced by stsum. So if you wish to customize your incidence rates, remember to use stptime. There are other summary and descriptive commands that can be used to explore survival data, but the ones presented here should form the backbone of any initial data exploration. And that concludes this section on descriptive statistics in survival data.

5. Survival 4 - Non-parametric Survival Analysis: In this session, we're going to learn about non-parametric survival analysis. Non-parametric survival analysis lets the data speak for itself. In other words, there are no assumptions about the shape of the underlying data or the survivor function. This kind of analysis is mostly graphical, which is why we'll explore graph-based commands in this session. It is important to remember that non-parametric analysis does not control for other covariates. In other words, it's not a regression analysis. Sample size permitting, we can split our analysis over different categorical variables, but we can't do that for continuous variables. Here's an example of a survivor function that is often presented in this kind of analysis. The survivor function reports the probability of surviving beyond a certain time point t. In this case, we have two different types of people. Some received a drug and others received no drug. In this analysis, the survivor function shows that the probability of surviving to any point in time t is consistently higher for those who received the drug. The primary command in Stata to produce non-parametric survival statistics is the sts command. This command has a few subcommands, of which one is the graph subcommand. This is the subcommand we'll use in this session. The graph subcommand has options that allow you to customize your graphs.
And we'll explore a few of the most popular options. So let's head over to Stata and explore this a bit further. Here we are in Stata, and I've already loaded the heart transplant data and set it up as survival data. Before we go ahead and plot any survival functions, let's have a look at the relevant help entry. To do that, we can type help sts graph. This brings up the help entry for the sts graph command. This command will allow us to plot the survivor, the hazard, or the cumulative hazard function. To do that, we simply select one of the options below. We can also select the by option to produce several graphs over a variable list. And finally, the ci option is another useful option that displays confidence intervals on your graphs. Let's go ahead and close this and start with a Kaplan-Meier survival function, which is the default graph. So let's go ahead and execute sts graph while not specifying any options, since the default option of sts graph is already the Kaplan-Meier survival function. Execute that. And there we are. This graph shows that after 100 days of analysis time, approximately 20% of the original sample is still alive. There appears to be a rapid decline in survival early in the study, which then flattens out, leaving approximately 20% of survivors at the end of the study. That's good to know. But what we really want to know is whether a heart transplant treatment changed the function in this graph significantly. So we can go ahead and produce two survival functions, one for each transplant category. To do that, we can invoke the by option: sts graph, comma, and then by(transplant). This graph now shows us two different survival functions. For individuals who received the transplant, the probability of surviving to any point in time t is higher than for those who did not receive a heart transplant. We can also use an additional option called risktable to populate our graph with a risk table that shows the number of survivors at certain points in time.
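The Kaplan-Meier graphs described so far could be produced with something like the following sketch. It assumes the heart transplant data are in memory, with variable names stime, died and transplant inferred from the narration.

```stata
* Sketch only: assumes the course's heart transplant data are in memory.
* Variable names (stime, died, transplant) are inferred from the narration.
stset stime, failure(died==1)

* Kaplan-Meier survivor function for the whole sample (the default graph)
sts graph

* One survivor function per transplant category
sts graph, by(transplant)

* The same graph with a table of numbers at risk underneath
sts graph, by(transplant) risktable
```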
Let me demonstrate: sts graph, by(transplant), and then add the option risktable. The graph now has a table at the bottom, which shows that at 500 days after the study began, only one person of 34 who didn't get a heart transplant is alive. To determine whether these functions are statistically significantly different, it can be useful to add confidence intervals. Let's go ahead and do that. We can add confidence intervals using the ci option: sts graph, by(transplant) and ci. And there we are. The confidence intervals clearly suggest that both functions are statistically different. So it looks like a heart transplant does affect your survival rate. Great. Let's close this. To plot the cumulative hazard function, we can use the cumhaz option like so: sts graph, again by(transplant), and then specify the cumulative hazard function. This graph shows the Nelson-Aalen cumulative hazard estimates for both transplant groups. Interpreting the actual numbers on this graph is not straightforward, I'm afraid. The best way to think about these numbers is to think of a video game that allows multiple lives. So in this case, a transplant patient would be expected to die around 1.2 times, whereas a non-transplant patient could be expected to die around 2.5 times after 100 days. So that's one way to interpret this. The primary use of the cumulative hazard function is actually in helping us understand what the hazard function would look like, i.e. a function that tracks the changes of this cumulative shape. So in this case, we would observe that the hazard, the risk of dying, is very high initially for those without the transplant, because the cumulative risk increases very rapidly. The cumulative risk for those without a transplant then slows, suggesting that the underlying daily risk of dying diminishes. We can ask Stata to obtain estimates of the underlying hazard function. The underlying hazard function tells us the risk of death at any point in time.
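As a rough sketch, the confidence-interval and cumulative hazard graphs just described, together with the smoothed hazard graph that this part of the session turns to, could be requested as follows. The same assumptions apply: the heart transplant data are in memory and the variable names stime, died and transplant are inferred from the narration.

```stata
* Sketch only: assumes the course's heart transplant data are in memory.
* Variable names (stime, died, transplant) are inferred from the narration.
stset stime, failure(died==1)

* Kaplan-Meier survivor functions with confidence intervals
sts graph, by(transplant) ci

* Nelson-Aalen cumulative hazard estimates by transplant status
sts graph, by(transplant) cumhaz

* Kernel-smoothed hazard estimates by transplant status
sts graph, by(transplant) hazard
```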
But computing this isn't necessarily straightforward and it often causes boundary problems. Let me show you how to do this and how it looks. Let's close the cumulative hazard estimates and invoke sts graph, by(transplant), and then the hazard option. The hazard rate shows the number of failures, in this case deaths, that we can expect per day for each of these two groups. We can expect more deaths per day in the early part of our study for non-transplant patients. After a while, the hazard functions begin to converge. However, notice that because computing these functions requires complicated kernel smoothing techniques over a moving window of data, we will always miss the beginning and the end points of these functions, because there's a lack of data there. And that can make these functions look a little bit funny in comparison to the two previous graphs. Ultimately, I suggest that the combination of all three graphs is used to fully explore and interpret the data you have. And that concludes this basic introduction to non-parametric analysis of survival data.

6. Survival 5 - Cox Proportional Hazard's Model: Let's learn about semi-parametric survival analysis, often simply referred to as Cox regression. The Cox proportional hazards regression model is one of the most commonly used methods in survival analysis. A key disadvantage of the previous non-parametric analysis is that we were unable to control for covariates. Whilst we can produce separate non-parametric plots by different categorical values, it isn't possible to control for the influence of other variables like it is in a traditional regression setup. Parametric analysis can do this, but puts more rigid constraints on how time is treated. Cox regression offers a halfway house between both types of methodologies, by first estimating a baseline hazard function non-parametrically, like we saw previously, and then assuming that all covariates shift this function proportionally up or down. The commands that we're going to use are the stcox command and the stcurve command. The stcox command does not need a dependent variable specified. It assumes that survival time is already the dependent variable. stcurve is a post-Cox-regression command that allows users to plot the survivor, hazard, or cumulative hazard functions from the regression. So let's go ahead and take a look at how this works in Stata. Here we are in Stata, with the heart transplant data already loaded and stset. Before we begin, let me show you what I mean with the statement that a Cox model estimates a baseline hazard. To do that, we can simply run an empty Cox model with no covariates. We can achieve this with the estimate option in stcox. So let's go ahead and type stcox, no covariates, comma, estimate. This has now produced an empty regression model that has no coefficients. However, underneath this empty model lies a baseline hazard. And we can now call upon the stcurve command to plot either the survivor, hazard, or cumulative hazard function from this hidden baseline hazard. So let's go ahead and plot the estimated survival function from this estimation: stcurve, survival. And there we are. This shows the percentage of people who survive to a certain time point t. This function looks very similar to a Kaplan-Meier survival function, and indeed in this case they're identical. So let me close this and show you what the Kaplan-Meier survival function looks like. We did that previously. You simply type sts graph. And here are our Kaplan-Meier survival estimates. This graph is identical to the previous graph. So what I'm trying to show you here is the non-parametric part of the stcox regression. This function here will then be proportionally moved up or down once we include covariates in our model. So let's go ahead and include some covariates now.
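The empty Cox model and the comparison with the Kaplan-Meier graph can be sketched as below, again assuming the heart transplant data are in memory with variables named stime and died (names inferred from the narration).

```stata
* Sketch only: assumes the course's heart transplant data are in memory.
* Variable names (stime, died) are inferred from the narration.
stset stime, failure(died==1)

* Cox model with no covariates; the estimate option fits the null model
stcox, estimate

* Survivor function implied by the baseline hazard of the null model
stcurve, survival

* Kaplan-Meier survivor function for comparison
sts graph
```

Each graph command replaces the previous graph window by default, so the two plots are compared by running them one after the other, as in the session.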
We were previously interested in whether heart transplant patients live longer. However, to do this properly, we might want to control for age. Age is likely to be an important factor in determining survival length. So let's control for it using a Cox regression. Let's type stcox. We're interested in the transplant dummy and we want to control for age. Execute that. And there we are. Here are our results from the Cox regression. What these results say is that both transplant and age are statistically significant in determining survival length. Very good. The coefficient on transplant is 0.16. And that tells us that a subject who had a heart transplant experiences only 16% of the hazard that a non-transplant patient would experience. The coefficient on age tells us that for each additional year of age, the hazard of dying increases by 1.06 times. In other words, 6%. These numbers are useful. But visualizing survival time is often an even better way to understand what the model actually says.

To plot specific functions, we can use the at option in the stcurve command to plot separate curves for different groups. For example, the survival function for transplant and non-transplant patients can be obtained by typing the following code: stcurve, survival, at1 when transplant is 0 and at2 when transplant equals 1. Execute that. And there we have it. This graph shows that the probability of survival is significantly higher for those with heart transplants, controlling for any age differences. To obtain a hazard function, we can replace the survival option with the hazard option. Let's go ahead and do that: stcurve, hazard, at1 transplant equals 0, at2 transplant equals 1. And here are the two hazard functions. The graph tells us that the risk of dying on a daily basis is higher for those without a transplant, but diminishes rapidly until both hazards are equal at around 500 days.
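The regression and the two plots just described might look roughly like this in a do-file. The variable names transplant and age are assumptions taken from how they are spoken in the lesson; the names in the actual dataset may differ.

```stata
* Assumption: data already loaded and stset.
* Hypothetical variable names: transplant (0/1 dummy), age (in years).

* Cox regression of survival time on transplant status, controlling for age
stcox transplant age

* Survivor curves for non-transplant (0) and transplant (1) patients
stcurve, survival at1(transplant=0) at2(transplant=1)

* Hazard curves for the same two groups
stcurve, hazard at1(transplant=0) at2(transplant=1)
```

By default stcox reports hazard ratios rather than raw coefficients, which is why a value of 1.06 on age reads as a 6% higher hazard per additional year. Covariates not named in an at option are held at their mean values by stcurve.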
After this, there is another pickup in the risk of dying on a daily basis for non-transplant patients. The cumulative hazard function can also be called upon with the cumulative hazard option. Let's go ahead and do that: stcurve, cumulative hazard, execute. And here's the cumulative hazard function. Finally, we can use as many at options as we like and include multiple variables. For example, to explore how 18-year-olds with transplants fare against 65-year-olds with transplants, we can use the following code: stcurve, survival, at1 transplant equals 1 and age equals 18, and compare that to at2 transplant equals 1 and age equals 65. We can see that the survival rate of younger people is significantly better than the survival rate of older people. Notice how this final graph really shows that in the Cox regression, all the survival functions have similar shapes. They're simply shifted up or down in relation to the baseline hazard. And this is the parametric part of Cox regression. And that concludes this brief introduction to Cox regression in survival data.

7. Survival 6 - Diagnostics for Cox Models: In this session, we'll explore diagnostic options for the Cox proportional hazards model. Specifically, we'll look at commands that test the proportional hazards assumption. The Cox model assumes that the hazard ratio is constant over time. In other words, there is a baseline hazard and variables simply shift this baseline up or down. It is important to evaluate this constraint when presenting Cox regression models. If this assumption fails, alternative modelling choices may need to be considered. These could include either different specifications or different methodologies such as non-parametric or parametric techniques. Here's a quick graphical reminder of what the underlying concept is that we need to test. The Cox model computes a baseline hazard function. In other words, everyone has a similar survivor or hazard function.
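In symbols, the assumption under test is usually written as below. The notation is standard textbook notation rather than anything shown in the lesson: h_0(t) is the baseline hazard, x the covariates and \beta the coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp(x\beta),
\qquad
\frac{h(t \mid x_1)}{h(t \mid x_0)} = \exp\bigl((x_1 - x_0)\beta\bigr)
```

The ratio on the right does not involve t, which is exactly what "the hazard ratio is constant over time" means.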
Different covariates will proportionally adjust this hazard function up or down. This assumption may be too restrictive. For example, in the heart transplant data, why do patients who had a transplant need to follow the same underlying hazard, or rather the same shape, as those who did not have a transplant? Maybe transplant patients face a higher risk later in life. We're going to introduce two graphical ways to evaluate the proportional hazards assumption and one statistical way. The first is via the command stphplot, which stands for survival time proportional hazards plot. stphplot plots something often referred to as a log-log plot. Another graphical method of evaluating the proportional hazards assumption is to plot the Kaplan-Meier survival curves and compare these with the Cox predicted curves for the same variable. This plot is produced with the command stcoxkm. Both commands are intended to test discrete variables. Another way of checking the proportional hazards assumption, which allows for continuous variables in the Cox model, is an analysis of residuals. This is done using the estat phtest command. So let's go ahead to Stata and examine all three in action.

Here we are in Stata with the heart transplant data already loaded. And I've just estimated a Cox regression of survival time using transplant status and age as controls. We can plot a log-log plot of proportional hazards by typing the following: stphplot, by transplant, and adjust for age. Note that this command requires the by option to be used with a categorical variable from the regression specification. In this case, we use the transplant variable. Because we estimated our Cox model with age, it makes sense to adjust the graph to the average value of age. And that can be done with the adjust option. Let's go ahead and execute. The underlying statistical nature behind this graph is not very layman friendly.
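The three diagnostics introduced in this session can be sketched together as follows. As before, this assumes the data are already stset, and the variable names transplant and age are hypothetical stand-ins for whatever the dataset actually calls them.

```stata
* Assumption: data already loaded and stset.
* Hypothetical variable names: transplant (0/1 dummy), age (in years).

* Fit the Cox model whose assumption we want to check
stcox transplant age

* 1. Log-log plot by transplant status, adjusted to the average value of age
stphplot, by(transplant) adjust(age)

* 2. Kaplan-Meier observed curves against Cox predicted curves
stcoxkm, by(transplant)

* 3. Test based on Schoenfeld residuals, per covariate and globally
estat phtest, detail
```

The two graphical commands take a categorical variable in by(), while estat phtest covers every regressor in the model, including continuous ones such as age.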
Under the proportional hazards assumption, however, the plotted curves for each category should be parallel. In this case, both curves look reasonably parallel, although that can of course be subjective. The proportional hazards assumption thus appears to hold for the transplant variable. Another way of exploring the proportional hazards assumption is to compare the predicted survival curves of the Cox regression with the observed survival curves of a Kaplan-Meier estimation. And that can be done with the stcoxkm command like so: stcoxkm, by transplant. And again, we need to specify a categorical variable from our regression here. And here's our graph. When the predicted and observed curves are close together, the proportional hazards assumption has not been violated. It looks like there is further evidence here that the proportional hazards assumption holds for our Cox regression model. And finally, we can also use the estat phtest command to test all regressors in our Cox regression. And we can do that as follows. First let me rerun our regression and then type estat phtest and invoke the option detail. By invoking the detail option, we see a test for each covariate and also a global test. This test analyses something called the Schoenfeld residuals. And in this case we see that both variables and the global test have an insignificant test statistic. All p-values are above 0.05, providing further evidence that the proportional hazards assumption holds for our model that includes the covariates transplant and age. And this concludes this session on testing the proportional hazards assumption from a Cox regression model.

8. Survival 7 - Parametric Survival Analysis: In this final session, we're going to explore parametric proportional hazards models. Cox regression models make no assumptions about what the underlying hazard function might look like. They let the data determine it.
Often this is an advisable way to go about survival analysis and is one of the reasons why Cox regression is so enormously popular. However, sometimes we may want to constrain the underlying hazard function by parameterizing it. This is done by choosing a relevant distribution and fitting that to the data. The advantage of these models is that they give more efficient estimates and allow users to make easier out-of-sample predictions for areas of time for which there might be no data. Finally, sometimes theory predicts a particular setup and we may wish to constrain our data to such a theory. For example, the likelihood of being hit by a car is probably the same minute by minute. And in that case, we may want to impose a constant, memoryless hazard function on our data.

Here's an example of two survival functions for both non-parametric and parametric methodologies. The first graph shows two survivor functions of a Cox proportional hazards regression. The underlying baseline hazard is estimated non-parametrically. And it's then shifted proportionally up or down depending on the effect of the variables. The second graph shows a parametric regression using an exponential distribution. Both graphs are similar, but you can clearly see that the second graph imposes a specific function on the data. In other words, the baseline hazard for the second graph can be described by an equation, and the same cannot be done for the first graph. Also note that in parametric models, the effect of covariates on time is still proportionally related to the underlying hazard function. In other words, they continue to follow the same shape. Here's the same data, but this time the hazard functions are plotted. Notice how the Cox regression allows the daily hazard of dying to vary day by day. Also notice how the exponential regression imposes a constant hazard rate across all analysis time. Depending on what you desire, this may be a good thing or it may be a bad thing.
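To make "can be described by an equation" concrete, the exponential proportional hazards model can be written as below. The symbols are standard textbook notation and not taken from the slides: \lambda is the constant baseline hazard, x the covariates and \beta their coefficients.

```latex
h(t \mid x) = \lambda\,\exp(x\beta), \qquad
H(t \mid x) = \lambda\,t\,\exp(x\beta), \qquad
S(t \mid x) = \exp\bigl(-H(t \mid x)\bigr)
```

The hazard h does not depend on t, so the cumulative hazard H is a straight line in t and the survivor function S decays exponentially.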
The command we're going to look at in this session is the streg command. The streg command is very similar to stcox from the previous session. However, to work, it needs a distribution selected as an option. Various distributions exist, but in this session I'll only demonstrate the exponential and Weibull distributions. To choose between different distributions, we often resort to information criteria. These can be called after the regression using the estat ic command in Stata. And let me show you how this works.

Here we are in Stata with the heart transplant data loaded and already set up as survival data. To estimate a parametric proportional hazards model, we're going to use the streg command. Let's go ahead and estimate the effect of having a heart transplant on survival time, controlling for age. We can type streg i.transplant and age. We'll need to choose an appropriate distribution. Let's be naive and go with an exponential distribution for now and see what happens. So add the option distribution and then exponential. Let's execute that. The output presented is very similar to the output of a Cox regression. There are 103 subjects in our study and 75 of them failed. The chi-square statistic is very significant, indicating that our model predicts the data well. Both regressors are statistically significant. But notice how we've now also estimated a constant. The interpretation of the coefficients is the same as before. Subjects who had a heart transplant have a daily hazard that is 8% of that of those who didn't have a transplant. And an additional year of age increases the daily hazard by 9%. As before, we can use the stcurve command to plot the survival, hazard and cumulative hazard functions at different covariate values. Let's look at all three by transplant status. So stcurve, survival, for those who had a transplant and those who did not have a transplant. The survival function is rapidly downward-sloping.
That is especially true for those who did not have a transplant. Circa 50% of subjects who had a transplant were still alive after 500 days; for those without, it's almost 0%. Now let's look at the hazard function by transplant status. To do that, we can execute stcurve and use the hazard option, again at1 transplant equals 0 and at2 transplant equals 1. The hazard functions of both groups are constant. That should worry us a bit, as one would theorize that the risk of dying after a heart transplant diminishes with time. So this parameterization is probably the wrong one. But let's continue and look at the cumulative hazard function: stcurve, cumulative hazard, for those with transplants and those without. And there we are. Because the hazard function, which we saw earlier, is constant, both of these functions are linear. Again, we can probably infer that this doesn't sit well with what we know.

One way to help us select which parametric model fits the data best is to compare the information criteria of each model. We can obtain the Bayesian and Akaike information criteria of each model by typing estat ic. Execute. And the AIC is 377 and the BIC is 385. Models with lower numbers are better-fitting models and are preferred. So now let's go ahead and change the distribution to a Weibull distribution and see how this model fits: streg, transplant and age, and we're going to use the Weibull distribution. Execute that. And now again, let's display the information criteria. And we see that the AIC and BIC numbers are now significantly lower than previously. This suggests that the Weibull distribution fits the data better than the exponential distribution. And we could now test every possible distribution available in the streg command to find which fits the data best. However, I caution against such a pure data mining approach. You should generally have a reasonable idea of what kind of distribution you want to use.
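The whole exponential-versus-Weibull comparison from this session can be sketched as one do-file fragment. It assumes the data are already stset, and transplant and age are hypothetical variable names based on how they are spoken in the lesson.

```stata
* Assumption: data already loaded and stset.
* Hypothetical variable names: transplant (0/1 dummy), age (in years).

* Exponential model: imposes a constant baseline hazard
streg i.transplant age, distribution(exponential)

* Survivor, hazard and cumulative hazard curves by transplant status
stcurve, survival at1(transplant=0) at2(transplant=1)
stcurve, hazard   at1(transplant=0) at2(transplant=1)
stcurve, cumhaz   at1(transplant=0) at2(transplant=1)

* AIC and BIC for the exponential model
estat ic

* Weibull model: lets the baseline hazard rise or fall over time
streg i.transplant age, distribution(weibull)

* AIC and BIC for the Weibull model; lower values indicate better fit
estat ic
```

Comparing the two estat ic tables is only meaningful because both models are fitted to the same data with the same covariates.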
If you don't have such an idea, then it's probably advisable to simply stay with the Cox model and let the data determine the underlying hazard function. And that concludes this basic overview of survival analysis in Stata.