Learn Data Analytics with Stata | Franz Buscha | Skillshare

Learn Data Analytics with Stata

Franz Buscha

Lessons in This Class

72 Lessons (9h 44m)
    • 1. Starting 1 - Stata's Interface

    • 2. Starting 2 - Using Help

    • 3. Starting 3 - Stata's Command Syntax

    • 4. Starting 4 - .do and .ado Files

    • 5. Starting 5 - Creating and Viewing Logs

    • 6. Starting 6 - Loading and Importing Data

    • 7. Exploring Data 1 - Viewing/Editing Raw Data

    • 8. Exploring Data 2 - Describing and Summarizing Data

    • 9. Exploring Data 3 - Tabulating and Tables

    • 10. Exploring Data 4 - Missing Data

    • 11. Exploring Data 5 - Numeric Distribution Analysis

    • 12. Exploring Data 6 - Using Weights

    • 13. Manipulating 1 - Recoding a Variable

    • 14. Manipulating 2 - Creating and Replacing Variables

    • 15. Manipulating 3 - Naming and Labelling Variables

    • 16. Manipulating 4 - Advanced Variable Creation

    • 17. Manipulating 5 - Creating Indicator Variables

    • 18. Manipulating 6 - Dropping and Keeping Data

    • 19. Manipulating 7 - Saving Data

    • 20. Manipulating 8 - Converting String Data

    • 21. Manipulating 9 - Combining Datasets

    • 22. Manipulating 10 - Macro's and Loop's

    • 23. Manipulating 11 - Accessing Stored Information

    • 24. Manipulating 12 - Multiple Loops

    • 25. Manipulating 13 - Date Variables

    • 26. Manipulating 14 - Subscripting over Groups

    • 27. Visual 1 - Introduction to Graph command

    • 28. Visual 2 - Bar Graphs and Dot Charts

    • 29. Visual 3 - Distribution Plots

    • 30. Visual 4 - Pie Charts

    • 31. Visual 5 - Scatterplot and Lines of Best Fit

    • 32. Visual 6 - Drawing Custom Functions

    • 33. Visual 7 - Contour Plots

    • 34. Visual 8 - Jitter in Scatterplots

    • 35. Visual 9 - Combining Graphs

    • 36. Visual 10 - Sunflower Plots

    • 37. Visual 11 - Sizing Graphs

    • 38. Visual 12 - Graphing by Groups

    • 39. Visual 13 - Changing Colours

    • 40. Basic tests 1 - Association Between Two Categorical Variables

    • 41. Basic tests 2 - Testing Means

    • 42. Basic tests 3 - Pearson's and Tetrachoric Correlation

    • 43. Basic tests 4 - Analysis of Variance (ANOVA)

    • 44. Linear Regression 1 - Basic Ordinary Least Squares

    • 45. Linear Regression 2 - Factor Explanatory Variables in OLS

    • 46. Linear Regression 3 - Diagnostics

    • 47. Linear Regression 4 - Log Dependent Variable and Interactions

    • 48. Linear Regression 5 - Hypothesis Testing

    • 49. Linear Regression 6 - Presenting Regression Results in Tables

    • 50. Linear Regression 7 - Standardized Estimates

    • 51. Linear regression 8 - Graphing Estimates

    • 52. Linear regression 9 - Oaxaca Decomposition

    • 53. Linear regression 10 - Mixed Models

    • 54. Choice Models 1 - Logit and Probit Regression

    • 55. Choice Models 2 - Logit and Probit Goodness-of-Fit and Marginal Effects

    • 56. Choice Models 3 - Ordered and Multinomial Logit Regression

    • 57. Choice models 4 - Fractional Dependent Variable Models

    • 58. Simulation 1 - Drawing pseudorandom numbers

    • 59. Simulation 2 - Data Generating Process

    • 60. Simulation 3 - Violating Estimator Assumptions

    • 61. Simulation 4 - Monte Carlo Simulation

    • 62. Matrix 1 - Matrix Operations

    • 63. Matrix 2 - Matrix Functions

    • 64. Matrix 3 - Matrix Subscripting

    • 65. Matrix 4 - Matrix Operations with Data

    • 66. Power 1 - Sample Size

    • 67. Power 2 - Power and Effect Size

    • 68. Power 3 - Simple Regression

    • 69. Instrument 1 - Instrumental Variable Regression

    • 70. Instrument 2 - Multiple Endogenous Variables

    • 71. Instrument 3 - Non-linear Instrumental Variable Regression

    • 72. Instrument 4 - Heckman Selection Models

About This Class

An extensive introduction to Data Analytics with Stata

Learning and applying new statistical techniques can be a daunting experience.

This is especially true once one engages with “real life” data sets that do not allow for easy “click-and-go” analysis, but require a deeper level of understanding of programme coding, data manipulation, output interpretation, output formatting and selecting the right kind of analytical methodology.

In this class you will receive a comprehensive introduction to Stata and its various uses in modern data analysis. You will learn to understand the various options that Stata gives you in manipulating, exploring, visualizing and modelling complex types of data. By the end of the class you will feel confident in your ability to engage with Stata and handle complex data analytics. The focus of this class will consistently be on creating a “good practice” and emphasising the practical application – and interpretation – of commonly used statistical techniques without resorting to deep statistical theory or equations.  

This class will focus on providing an overview of data analysis with Stata.

No prior engagement with Stata is needed.

Some basic quantitative/statistical knowledge will be required; this is not an introduction to statistics course but rather the application and interpretation of such using Stata.   

Topics covered will include:

  1. Getting started with Stata
  2. Viewing and exploring data
  3. Manipulating data
  4. Visualising data
  5. Correlation and ANOVA
  6. Regression including diagnostics (Ordinary Least Squares)
  7. Regression model building
  8. Hypothesis testing
  9. Binary outcome models (Logit and Probit)
  10. Categorical choice models (Ordered Logit and Multinomial Logit)
  11. Simulation techniques

Meet Your Teacher

Franz Buscha



1. Starting 1 - Stata's Interface: In this session we'll have a look at Stata's interface. Please be mindful that I'm using a Mac version of Stata, and there are some slight visual differences to the PC version. The main difference is how some windows are displayed and how they can be customized. So let's go to Stata and take a first look. Welcome to Stata. The first time you open Stata, you'll probably see something like this. This is Stata's primary program window. It may be subdivided into up to five parts, depending on which version you're using, PC or Mac. Here in the middle, we have the Results window. This window displays all output. It is very important because it presents all your results. However, your physical interaction with this window will probably be limited. You can interact with the blue text that you see, but it's quite rare. You can highlight text by left-clicking and then right-click to see specific options for what you selected. Most of these will be copy options. You can also right-click anywhere without highlighting text to set preferences such as colors or font size. At the bottom, we have the Command window. This is how you interact with Stata. Any command that I type here and then enter will be displayed and executed in the Results window. Here's an example. In this case, the command was unrecognized and Stata returned an error message. That's fine. To the right of the Results window, we have the Variables window. This shows us all the variables in our data set. We can right-click to interact with selected variables here, although this is generally not the preferred way to manage variables. Currently, there are few options because we don't have any data loaded. Below the Variables window, in the bottom right-hand corner, is the Properties window. By default, this is locked, as you don't want to accidentally change variable properties. This can have severe consequences.
We can unlock it by pressing on the lock, and we lock it again like so. I find that I rarely use this particular window. We can change the size of both of these windows by dragging them left and right. Finally, if you're using a PC version, then on the left-hand side of the screen you might have a Review window that shows you all your previous commands. Again, the interaction with that window is generally low, but you can double-click on previous commands to repeat them. If you're using a Mac, like here, then that window won't be there. You can get to it by changing the window setting up here. And here's the history of all the commands that we wrote, and we can return to the Variables window by clicking on this button over here. At the top are the menu options. The menus include File, Edit, Data, Graphics, Statistics, Window and Help. The majority of the menus are pretty self-explanatory and don't need much explanation at this point. One menu I would like to point out is the Window menu. Let's click on that. The Window menu allows you to bring back any Stata windows that you accidentally closed, and it also allows you to access Stata's specialized windows. Here's an example of me closing a window and bringing it back again. Let's say we close this, and now we can bring back the Variables window by going to the Window menu and clicking on Variables, and there it is again. Stata also has a toolbar of commonly used features up here. A lot of these refer to more specialized windows, such as the Viewer, the Data Editor, the Do-file Editor and the Graph Editor. It's actually in the specialized windows that we'll do most of our interaction with Stata. Lastly, let's have a quick look at Stata's preferences. We can access Stata's settings by clicking on the Edit menu and then the Preferences option if we're using Windows. If you're using a Mac, you'll need to go to the Stata menu up here and then click on Preferences.
Here we see three different types of preferences: general preferences, graph preferences and language preferences. Let's click on general preferences. Under general preferences, we have different tabs where we can set and change different preferences. There are general preferences, preferences for syntax highlighting, window preferences, Internet preferences and save preferences. On the Internet preferences tab, if you are behind a firewall, you can specify a proxy server that allows Stata to connect to the Internet through your firewall. You may need to contact your IT administrator to set this up. On the window preferences tab, we can edit the preferences of the various Stata windows that we see. So, for example, here are the preferences for the Results window, and we can choose a different color scheme. Let's click on Standard. That will now apply the standard color scheme. Let's revert back to the dark color scheme. Most of these options are fairly self-explanatory. However, one notable mention I would like to make is the scrollback buffer size. The scrollback buffer is the space in the Results window. If you have a lot of output, you'll probably fill your screen with results, and some of your results will disappear from the top. In most programs, you can scroll back up to find that output. The same happens with Stata, but by default you can only do it for around 200 kilobytes of output. This means at some point you might lose access to some of your prior results. I therefore recommend setting this to its maximum value of 2000, which will increase your ability to see prior results in your Results window by 10 times. I recommend going through some of these options in your own time, as Stata has a lot of customizability. Finally, let me go back to Preferences one more time and talk a little about the graph preferences.
You can change the default color scheme of your graphs. Most of the time you probably won't want to touch these, but here under Scheme you can choose different graph schemes that change how the default graphs are generated in Stata. We'll talk about this more in the graph sessions. And finally, for those people who really want to customize Stata and modify everything, including what's under the bonnet and in the engine of Stata, Stata has something called the set command. We can type set in our Command window and execute that. This will give us access to a wide range of settings that define Stata's default behavior. These settings include memory settings, output settings, program settings, random-number settings and many more. I'll leave it to you to explore this by yourself. Usually, the set command is used by experienced Stata users, but knowing it's there is half the battle. And that concludes this quick overview of Stata.

2. Starting 2 - Using Help: Let's learn how we can use Stata's help options effectively. Stata has some excellent help features to guide you, no matter what your experience level. I still use help often, and I've got 20 years of experience with Stata. In this video, we're going to explore the Help menu, Stata's help command, the user guide, and some of Stata's online resources. The command we're going to look at is called help, and it aims to help you with whatever you type after it. It's often the first command I teach in Stata. So let's move to Stata and take a look at this. Here we are in Stata. Let's go to the menu at the top. On the right-hand side, we can see the Help menu. We can click on it to see various options. The first link will send us to the manual. This is much improved in recent versions of Stata. We'll take a look at that in a moment.
The other links will bring up entries from Stata's help Viewer, and these will give you some advice on how to start with Stata or some basic content on certain commands or processes. Let's have a look here and click on Contents. We see here, for example, there's a link on getting started with Stata. We can now read some of these entries to help us get started with Stata. However, what is actually happening here is that you're visually calling the help command every single time. Take a look up here at this command bar. It says help contents_start. So what Stata is actually doing is executing the help command with the words contents_start after it, and this gives you a flavor of how help works. So let's go use help in the main Stata window. Let's close this, go to the Command window and simply type help and then anything. Let's type some gibberish and execute that. Help will now look through Stata's help files and online resources to see what it can find in relation to your query. In this case, because we simply typed gibberish, it doesn't return a lot. But let's go back a step. We know one command: it's help. So let's try help help. Close this now. We can now see that Stata opens a Viewer window with the help entry of the command help. We can now read and learn about the help command in Stata. Generally, each help entry on a command will have an overview of the syntax, some option descriptions and some basic examples for you to try. There may also be links to PDF documentation. Especially useful is often the first blue hyperlink at the top. This will send you straight to the manual, so let's click on it and see what comes up. You'll see that Stata takes us straight to the PDF viewer, which opens the PDF manual at the help entry. This manual contains an expanded version of the Viewer entry we saw earlier in Stata. Note that the manual consists of a series of books, which we can see here on the left-hand side.
Each of these books is essentially a separate PDF file, and there are bookmarks in these which allow you to explore them further. However, the complete manuals are well over 10,000 pages long. So I recommend that if you do read any of the PDF manuals, start with the Getting Started book. Most people will not randomly read the manuals, but they will come here when they target very specific commands that they want to know more about. So now let's head back to Stata. One great aspect of Stata is its online resources. Going back to the Help menu, we can see that there's a Resources option here. Let's click on that. This will now open a new Viewer window, which provides us with a list of various resources that are available to us. We've quickly covered the first three options: searching, using help and using the PDF documentation. But Stata provides many more resources down here. A lot of them are online resources such as video tutorials or frequently asked questions. One particular resource that I like a lot is the Stata Journal. Here's the Stata Journal. The Stata Journal is a peer-reviewed academic journal that is dedicated to Stata only and regularly presents new estimators, new commands and new ways to do data analysis in Stata. All these estimators and new commands can be downloaded and used for free. As you can see, Stata offers a wide variety of help options, and as you grow more confident with Stata, you'll actually start to use these help options more and more to unlock some of the most powerful and specific features of Stata.

3. Starting 3 - Stata's Command Syntax: Let's take a quick look at Stata's syntax, in other words, Stata's command language. Stata's command syntax governs the structure of how you successfully interact with Stata beyond its graphical user interface. In other words, anything that isn't point and click. For professionals, this is the main way to interact with Stata.
And even when you are pointing and clicking your way through Stata's menus to do things, Stata's graphical menus mostly just translate your point-and-click requests into code and then submit that to the Command window. In other words, the vast majority of things that happen in Stata happen because of the command syntax. The majority of Stata commands take the following arguments. A by prefix is often available to loop the command over the groups defined by specific variables. Some kind of command is called. A variable list, or some name list, is defined after the command. An expression is provided, for example, in generating commands, something equals something. Then an if condition can be added if we want to focus on only a subset of the data, or an in range can be specified if we want to look at only a certain range of the data. Many commands also take weights. We can add weights to give different observations different weights in our estimation procedures. And finally, most commands have various options that change or modify the behavior of, or the results that you see from, the relevant command. Square brackets in the command syntax indicate optional components, and the comma denotes the break between the main command and its options. The syntax options for commands are often explored using the Help menu, and you should get into the habit of always exploring new commands with the help command first and reading the relevant help entry. So let's head over to Stata and have a look at an example help file. Here we are in Stata, and I've already loaded the help file for the command regress, which is a commonly used estimation procedure. At the top of the help file, we can see the relevant syntax for that command. The command itself is in bold, which means we can call a linear regression by typing regress. In many cases, we don't need to type out all the characters in the command. So, for example, in this case, regress will also work with simply reg.
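As a rough sketch of the syntax pattern just described (by prefix, command, variable list, if and in qualifiers, and options after the comma), the lines below apply it to regress using the auto training data set. The variables chosen (price, mpg, weight, foreign) and the particular conditions are illustrative assumptions, not taken from the lesson.

```stata
* General pattern (square brackets mark optional parts):
*   [by varlist:] command [varlist] [=exp] [if] [in] [weight] [, options]

* Load the auto training data set that ships with Stata
sysuse auto, clear

* command + varlist: one dependent variable, then optional independent variables
regress price mpg weight

* Abbreviated command name plus an if condition (domestic cars only)
reg price mpg weight if foreign == 0

* in range: use only observations 1 to 50
regress price mpg weight in 1/50

* Option after the comma: fit the model without a constant term
regress price mpg weight, noconstant

* by prefix: repeat the command for each value of foreign
bysort foreign: regress price mpg weight
```

Each line keeps the same dependent and independent variables, so the only thing that changes is the optional component being demonstrated.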
Shortening a command name in this way is called a shorthand or shortcut. Commands that are used more frequently can have very short shortcuts. Help, for example, will work by only typing h. Generate, another frequently used command, can work by only typing g. Next, you'll need to specify one dependent variable for regress to work. Note that you can only specify one dependent variable, and because it's not in square brackets, you must specify exactly one dependent variable for regress to work. Everything else after that is optional. So, for example, we can add optional independent variables, and we can add as many as we like. We could also include if or in conditions, or include weights. All of these are optional because they're surrounded by square brackets. Finally, you may want to specify additional options after the comma. The comma separates every Stata main command from its options. The option list for regress is given below. For example, we can add a noconstant option, we can change how standard errors are computed, or we can change the reporting of the regression results. Overall, once you get used to Stata's command syntax, it's very easy and straightforward to use. As good practice, you should get into the habit of learning the specific syntax of the most commonly used commands. In this course, we'll focus on creating such good practice by mainly using the command syntax to interact with Stata.

4. Starting 4 - .do and .ado Files: Let's examine what do-files and ado-files are in Stata. Do-files are an incredibly important component of using Stata properly. In fact, all our work from here on will be carried out in a do-file. Do-files are text files that contain Stata code. Most Stata users use do-files in their work. They're important for good data practice, as they allow you to store your work in code form. Do-files can be created and edited using any regular text editor, and they should be saved with the file name extension .do.
However, Stata has a built-in text editor to help you create such files. The advantage of this editor is that it color-codes your code for better visual identification, and it allows you to send your commands straight to Stata to be executed. Ado-files are similar to do-files but contain more complicated, general-purpose programming code. This allows you to use ado-files on a wider variety of data sets. In other words, they're not customized for only one type of data set or analysis. Every Stata command that we'll be using is actually an ado-file. If for any reason we accidentally delete some of these ado-files, then the Stata program will still work, but the commands that you deleted may not. This concept allows Stata to be modular. So, for example, it is very easy to drop new commands or capabilities into Stata by simply downloading new ado-files into the Stata folder. Often this is done from dedicated Internet sites like the Stata archives. In this session we're going to look at two commands specifically. doedit calls the Do-file Editor and allows us to write, store and execute code. Often doedit will be one of the first commands that we type when we first open Stata. The second command is do, which executes do-files. So let's head over to Stata and take a look. Here we are in Stata. We can open the Do-file Editor either by clicking on the Do-file Editor button up here, or by typing the command doedit into the Command window. Let's do it the command way: type doedit and execute that. That now opens the Do-file Editor, and it looks like a basic text editor at first glance. For convenience, I often place the Do-file Editor next to Stata rather than on top of Stata. This way I can see both windows at once, and it helps me enormously with interactivity. So to do that on a Mac, we can just hold this.
Move this to the right-hand side of the screen, and then we can adjust our two screens appropriately, like so. In the Do-file Editor, you see the usual options menu at the top. In this case, we can open and save do-files, we can print do-files, we can search within our do-files, or we can set the zoom level. On the right-hand side is a button that we'll use frequently. This button sends any highlighted text to the Command window to be executed in Stata. A convenient keyboard shortcut to know about, which means less clicking, is the execute shortcut. This can be triggered by pressing Control+D on your keyboard if you're using Windows, or Shift+Command+D if you're using a Mac. If you have nothing selected, then this shortcut will execute the entire do-file. If you have something selected, then this shortcut will execute the selected text. Note that Stata will execute any code in your do-file from the beginning of the line to the end of the line, even if you only selected a portion of that line. So let me show you an example. Let's type some random gibberish in our do-file. Here's the first bit and here's the second bit. Now let's select only a portion of the first line, let's say these three characters. If we now execute this, Stata will send the entire line to the Command window, where it is executed, and the result we get back is that this command is unrecognized, which is fine in this case. Next, let me give you a quick example of a real do-file. So let's delete this, and let's start with a clear. Let's open some data. Let's summarize the data, and let's maybe generate a variable in our data. So now we're doing something specific: we're clearing something, we're loading something, we're looking at some statistics, and we're manipulating our data. We can now highlight all of this code, send it to the Command window in one batch and have it executed by Stata.
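A minimal sketch of the four-step do-file described above might look like the following. The transcript does not say which data set or variable was used on screen, so the auto training data and the variable name price_k are assumptions made for illustration.

```stata
* Clear: start from an empty memory
clear

* Load: open the auto training data set that ships with Stata
sysuse auto

* Look at some statistics: summarize all variables
summarize

* Manipulate: create a new variable (price_k is a hypothetical name)
generate price_k = price / 1000
```

Highlighting all of these lines in the Do-file Editor and running them sends the whole batch to Stata in one go, and the lines starting with an asterisk are skipped as comments.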
So this is the concept of a do-file. In real life, we would obviously generate a lot more code. But the great thing about this Do-file Editor is that we can now save our do-file if we're finished with our work and come back to it at a later stage. So let's go ahead and save this. We can simply click on Save, choose the relevant directory and give it a name. Let's give it the name test, and we've now saved our work on this data analysis. If we now want to, we can actually execute the entire saved do-file with the do command, and we can even do that within a do-file. So, for example, to re-execute the file that we just saved with all of its code, we simply type do and then specify the name, which was test in this case. Select all of that, execute it, and you'll see that Stata has now re-executed all the code in the do-file that we saved. So that is the concept of the do-file. Two quick little tips about do-files. Stata evaluates the code line by line, and it stops if it notes an error. So if there is an error in your do-file, then Stata will break at the error, and it will not complete the rest of the code after the error. Also, you can insert comments by using an asterisk in your do-file. So, for example, if we want to make it clear in our do-file that this particular code refers to opening a data set, we can write a star and then enter some text here, such as open data. This line is now color-coded green, and it won't be executed even if we highlight it. So let me show you an example. This is now not acted upon. Do-files are an excellent way to store your work, and all serious Stata users use them. They will quickly become your main way of interacting with Stata, and this is the reason why the remainder of this course focuses so much on coding in do-files.

5. Starting 5 - Creating and Viewing Logs: Log files allow you to record your entire Stata session into something called a log file. By default, Stata does not start a log file for you.
You'll need to manually tell Stata to start a log file. So in this session we're going to explore how to start a log file, how to view a log file both within Stata and outside Stata, and also how to append to or replace a log file. We're going to introduce the command log, which starts, stops and amends logs. We'll also have a look at the view command so that we can view our logs, and we're going to have a look at the translate command, which allows you to convert Stata's default log format into something that other text editors like Word can read. So let's head over to Stata and explore how to log our Stata sessions. Here we are in Stata with an empty sheet. Imagine we're about to start a new data analysis session and that we want to record all our work just in case the computer crashes. Well, we can start a log by using the log command. Before we do that, let's have a look at its help entry: help log. Here we see the help entry for log. It turns out log actually has a variety of subcommands that we'll need to use. The most important ones are log using, which will open and start a new log file, and log close, which will close a log file. Also important to note are the options append and replace. Append allows us to append the current log to a previous log file, and replace allows us to overwrite an existing log file. Always be careful with the replace option. It will overwrite anything you might have there, and Stata won't ask you, "Are you sure?" So let's close this and let's start a log file. We can do that by calling the log command with the using subcommand and simply specifying the name. In this case we'll call it log1. Let's execute that, and we can now see that Stata has started a log. Let's load the auto data set and do a quick summary for demonstration purposes. So sysuse auto, then summarize, and generate a variable. Now let's assume we've finished our work. We can close the log file by calling the log close subcommand.
Now we can see our log file is closed. We can view this log file in two ways. The first is to use Stata's built-in Viewer, which can be accessed either through the File menu or by typing the command view. In this case, let's use the view command. We simply specify view and then the full name of our log file. In this case, it was called log1, and Stata saves the file as a type called SMCL, S-M-C-L. That's what this little extension is over here. Let's do that, and we can now view our previously recorded work. If we want to continue writing into this log file, we can append a new log to the current log file. We simply start a new log using the old log, log1, and then append to it, and there we go. Anything we now type will be appended to the bottom of the log1 file. Let's close this. If we want to replace our log file, that is, delete everything and start again, we can use the replace option. That's pretty simple: we type log using log1 and then comma replace. Replace will delete everything you have. We've now replaced our previous log file. That's all good, so let me close that again. And finally, if you want to view a log file in a different text editor, you first have to translate Stata's SMCL file format, which stands for Stata Markup and Control Language, into another format. A useful one is often plain text, and we can do that by using the translate command. The easiest way to do it is probably to simply translate the log file into another file using the format .log. So translate log1.smcl log1.log, replace will then translate your Stata log file into a more generic log file that can be opened using many different editors, such as Word or another text editor. I strongly advise that you get into the habit of starting each Stata session with a log file. It is good practice to store everything you do in your session.
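The whole logging workflow from this lesson can be sketched as one short do-file. The log name log1 comes from the lesson. The commands run inside each log (describe, summarize price mpg) and the opening capture log close line are assumptions added for illustration.

```stata
* Close any log that is still open (capture suppresses the error if none is)
capture log close

* Start a new log; Stata writes log1.smcl to the working directory.
* Without an option, this fails if log1.smcl already exists.
log using log1

sysuse auto, clear
summarize
log close

* Read the log in Stata's Viewer
view log1.smcl

* Reopen the same log and add new output to the bottom of it
log using log1, append
describe
log close

* Start over: replace overwrites log1.smcl without asking
log using log1, replace
summarize price mpg
log close

* Convert the SMCL log to plain text so other editors can open it
translate log1.smcl log1.log, replace
```

Because the first log using has no option, running this sketch a second time in the same directory stops at that line. Adding replace or append there is the usual way around it.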
Once executed, you can forget about the log file, as Stata will also close the log file when you terminate Stata. So it's a useful little thing to do. Start every session with a log file and then forget about it. And then, should something ever happen, you always have a record of what you've done.

6. Starting 6 - Loading and Importing Data: Let's take a look at how to load data and import data into Stata. The basics of loading data are fairly straightforward, but Stata offers some powerful import features that we'll take a closer look at. This session will focus on several Stata commands. The command clear wipes all data in the current Stata session. Be careful. There's no back button in Stata, and executing this command will remove all data in your current session. The command use loads Stata datasets, and it's fairly easy to use. The command sysuse is a sibling of use, and it's actually a command we'll be using a lot in this course, but you probably won't use it a lot in real life. Sysuse is short for system use and allows you to load the training datasets in Stata. We'll also look at the input command, which allows us to type data straight into Stata. And finally, we'll look at the import command, which allows us to import non-Stata datasets into Stata. So let's head over to Stata and load some data. Here we are in Stata, and our code begins with a clear. This is quite a common start to any do-file. What it does is wipe all data from Stata's memory. Clear by itself removes data and value labels from memory. If you want to clear other things, like Mata code or stored results, you may need to use one of clear's subcommands. If you really want to get rid of everything, you'll need to type clear all. But let's just go with a normal clear for now. Clear. So any data we currently had open will now have been removed. Now we can load another dataset. Let's load the dataset called auto.
This dataset is actually a Stata teaching dataset that we'll use often in this course. To load data we need to use the use command. It's pretty simple to use. All we need to do is specify the file name and the path name in quotation marks after the use command to tell Stata where the dataset is. In this case, our dataset is stored in Stata's default installation directory, so we can load it by simply typing use, then where we installed Stata to, and then auto.dta, which is the dataset we want to load. And there we are. We've now loaded the dataset into Stata and we see variables appear in the variables window. Do note that once we've modified a dataset, Stata won't let us open another dataset without first clearing our memory. We'll need to execute the command clear again, or add clear as an option in the use command, to open new datasets after we've modified them. We haven't changed anything in our current dataset, so we can open new datasets without clearing our current data. Interestingly, use doesn't just open local data that is stored on your hard drive. You can point use towards network paths or even internet paths and open data straight from the internet. Here's an example of that: clear, and then we're going to www.stata-press.com, where there's a dataset called auto, which is the same one that we have in our installation directory. So there we are. We've now opened the auto training data straight from the internet. Finally, we can also open the auto training data using the command sysuse, which is very similar to use, but you don't have to specify any directory path. This is because the auto data is a training dataset. To get an overview of all training datasets built into Stata, we can type sysuse dir. And here we see all the available training datasets that come built into Stata. To load any one of them, we can type sysuse and then specify the name.
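A minimal sketch of the three ways of loading the auto data described above. The local installation path and the release folder in the web address (r18) are assumptions: both depend on your operating system and Stata version, and the transcript does not give the exact paths used in the video.

```stata
* Wipe any data currently in memory
clear

* 1. From a local path (hypothetical Windows installation directory)
use "C:\Program Files\Stata18\ado\base\a\auto.dta"

* 2. Straight from the internet (release folder r18 is an assumption);
*    the clear option drops whatever data is currently loaded
use "https://www.stata-press.com/data/r18/auto.dta", clear

* 3. As a built-in training dataset, no path needed
sysuse dir            // list the training datasets shipped with Stata
sysuse auto, clear
```

Adding clear as an option, as in the second and third calls, is the usual way to avoid the "no; dataset in memory has changed" error after data has been modified.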
That is the third way that we've opened the auto training data. We can also enter data directly into Stata using the input command. To do that, we need to specify the input command, after which we need to specify the variable names and then enter each data value into columns below that. Here's an example of a small dataset that I'm loading from our do-file into Stata. We'll start with a clear. We'll then call the input command and specify two variables, and those two variables have data in columns, where variable one contains the cells 1, 2, 3, 4, 5 and variable two contains 0.1, 0.2, etcetera. Importantly, this operation needs an end command. This tells Stata when we want to stop inputting data directly into it. Let's execute that. Okay, and now we can do a list to see what data we have, and we see we've now inputted data directly into Stata through our do-file. Input is not a frequently used command, and I don't recommend getting into the habit of using it. But if you ever need to enter small datasets into Stata, then this can be a useful option to know about. Now, let's imagine that we have some Excel data that we really want to get into Stata and analyze, this data for example. Well, one option is relatively simple. We can simply copy and paste the data straight into Stata. So let's select all our data, copy it in Excel, then move over to Stata. And before we do anything else, let's just clear our current data to make sure that we have an empty dataset. We can then click the button up here to access the data editor. The data editor will be explored in more detail in a later session, but for now, all you need to know is that this is where you can view and edit raw data. Well, we currently have no data, so this is entirely empty. But here we can now paste our Excel data right into Stata, simply by right clicking, clicking on Paste, and choosing to treat the first row as variable names. And we have now imported our Excel data straight into Stata.
Having simply copied and pasted it across, we can now close the data editor, and we see that we now have variables in our variables window. Note that this operation was entirely graphically driven, and Stata did not convert our operation into command language. We therefore have no way of repeating this process unless we do exactly the same point-and-click steps again. It is therefore recommended that you use the copy-and-paste method sparingly and, in more formal situations, use the import command that comes with Stata. We can access the import command via the File menu and clicking on Import, like so, or alternatively we can simply use the import command in code. Let's use the import command in code, and let's look at its help file first: help import. Here we see that import actually consists of many different subcommands. Each subcommand relates to a different data type and has different options associated with it. Stata is capable of reading many different data formats. For example, it will read Excel, delimited, ODBC, space, tab and comma separated formats, SAS formats, etcetera. Because our data is in Excel, let's use the Excel subcommand, and that's import excel. This command is relatively simple to use. We specify the command and then the file name and path. One notable option to mention is the firstrow option, which treats the first row as variable names. Often you want that enabled. Another useful option is the sheet option, which allows you to load specific sheets from your Excel file. So let's close this and go ahead and use the import excel command. To use it, we type import excel, which is the command, and we specify the file, which is located there. We're going to open Sheet1, we're going to treat the first row as variable names, and we're also using a clear here because we currently have data open.
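A minimal sketch of the import excel call described above. The file name mydata.xlsx and the sheet name Sheet1 are placeholders, since the transcript does not give the actual path used in the video.

```stata
* "mydata.xlsx" and "Sheet1" are hypothetical; substitute your own file and sheet
import excel using "mydata.xlsx", sheet("Sheet1") firstrow clear

* Check what was read in
list
```

Because this is written as code rather than done by pointing and clicking, the import can be repeated exactly by rerunning the do-file.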
We've now loaded our Excel data straight into Stata. We can do a list to see what we have, and this is exactly what we saw in Excel just a few moments ago. Hopefully, you can see that Stata is very user friendly when it comes to loading and importing data. Recent versions of Stata have put a lot more emphasis on importing different data types, and it is rare these days to come across data that Stata can't load.

7. Exploring Data 1 - Viewing/Editing Raw Data: Let's learn how to view and edit raw data in Stata. Unlike some other statistical programs such as Excel or SPSS, Stata doesn't offer a default view of the raw data. Therefore, if we wish to examine or edit raw data cells, we'll need to open the data editor. So let's go ahead and learn how to do that. We'll look at two commands in this session, the browse command and the edit command, which are actually the same thing, except that the edit command allows you to edit data, whilst the browse command only allows you to look at the data. Let's head over to Stata and explore how we can view the underlying auto training data. Here we are in Stata, and I've already loaded the auto training data. If we want to examine the underlying raw data, we can use the browse command. It's simple to use: simply type browse and execute. And there we are. Let's move this over to the left to see better. This opens a new window, which is the data editor. You'll recognize the traditional spreadsheet layout, with observations going down the rows and variables along the columns. If we have a very large dataset with thousands of variables and millions of observations, we can also filter observations and variables by specifying a variable list or if and in conditions in the browse command. For example, if we only wanted to see the raw data for the variables make and price, we can specify browse make price, and here we are. This will show us only the raw data for make and price. We can also examine only a select group of observations by using an in condition.
So, for example, to examine only the first 10 observations in our data we can type browse in 1/10, and that will display the first 10 observations. This can be useful if you've made changes to many variables, or if you want to double check that everything's okay with your raw data but you don't have time to examine thousands and thousands of rows. Note that at this point we can't actually change anything. If we wanted to make changes to the data, we need to call the edit command, which removes the modification lock. So let's go ahead and do that. Let's type edit and execute, and we can now go ahead and make changes to the underlying raw data. For example, I can double click on this cell and edit the underlying cell entry. Or I can click on this cell, press delete and then enter, and I've now deleted the data in that cell. We can also insert new variables by right clicking and then selecting the insert variable option. So, for example, to insert a variable between price and miles per gallon, we'll select a cell in this column, right click and choose to insert a variable. Let's call the variable test, and let's make every observation equal to one. Okay, we've now included a new variable called test that is a column of ones, and when we're all done, we can close this again to make sure we don't accidentally change or delete any further raw data. And that is how you view and edit data in Stata. I should note that at a professional level, it is quite rare to manually edit the underlying raw data. Generally, we will do all this in code in our do-file so that we have a record of what's going on. But sometimes it can be helpful to make one or two manual edits. Just remember to be careful.

8. Exploring Data 2 - Describing and Summarizing Data: In this video we'll explore how to perform basic descriptive statistics. Understanding the data you have is crucial to avoiding mistakes in any analysis.
There are many ways to explore data in Stata, but in this session we're going to look at the descriptive commands that Stata users use most often when they first open up new data. We're going to explore the describe and the summarize commands. We'll also take a quick look at the command codebook. Describe offers a basic description of your data and allows you to explore the nature of your variables. Summarize offers basic statistics such as means and observation counts, allowing you to start building an initial picture of how your data might be shaped. Both are frequently used. Finally, we'll have a look at the command codebook, which is a useful little command that tries to automatically generate a codebook from the data at hand. So let's head over to Stata and explore these concepts using the auto training data. Here we are in Stata with the auto training data already loaded. When we first load a new dataset, describe will most likely be the first command that we use. It doesn't have too many important options, so we'll avoid looking at its help file. Let's use it and see what happens. So let's execute describe. There we are. The output returned by describe will first produce some high-level information about the data, such as where it is located, how many observations are in the dataset and how many variables are included. In this case our data contains 74 observations and 12 variables and is called 1978 Automobile Data. Below this is a table of variables. Here Stata will produce a description of each variable containing its name, storage type, display format, value label and variable label. Let's have a quick look at each of them. The variable names are pretty obvious, but if there was a variable that we weren't quite sure about, then the variable label could provide some additional information. The storage type indicates whether a variable is numeric or string.
String variables are indicated by the prefix str, and the number after that indicates the maximum string length contained in that variable. In this case make is a string variable with a maximum length of 18 characters. Numeric variables can be stored with varying degrees of accuracy, and these are represented as byte, int, long, float or double. Higher storage types allow for bigger and more accurate numbers to be stored but also require more storage space. The display format represents the format of variables. These characters are not easy to read. You should keep an eye out for any t's, as these represent time or date formats, and they can make your data look a bit funny if you aren't aware of them. The column value label displays what value labels are attached to each variable. Stata stores value labels as separate entries, and in this case the label origin is attached to the variable foreign. No other value labels are attached to variables, although they may exist in the data. Next, let's have a look at the command summarize. This command is often used right after describe. The command provides initial summary statistics of our data and is used to explore initial data patterns. Again, it doesn't have too many options or complicated syntax, so we won't look at its help file. However, I should note that the detail option can be a very useful option to use. It provides additional detail on the distribution statistics of variables, should you need it. So let's call summarize: summarize and execute. Summarize has now produced statistics on the observation count, mean, standard deviation, minimum value and maximum value for each variable. Note that, in this case, the first variable returns empty values. That is because the first variable, make, is a string variable, which cannot return numeric statistics. Thankfully, we know this because we used describe earlier. Summarize is useful because it allows us to spot key characteristics of data quickly.
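The first look at a new dataset described above can be sketched in a few lines, here using the built-in auto training data.

```stata
sysuse auto, clear

* Dataset-level information plus name, storage type, display format and labels
describe

* Observation count, mean, standard deviation, minimum and maximum
summarize

* Extra distributional detail (percentiles, skewness, kurtosis) for one variable
summarize price, detail
```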
For example, we notice that there are some missing observations for the variable rep78. It also looks like the variable foreign might be a binary variable. The distribution of price may be right skewed, because it has a large maximum value whilst its mean is relatively low compared to the maximum value, as is the standard deviation. And repair record may be a categorical variable with five categories. These are useful properties to know for when we explore the data further or start data manipulation. Summarize can also be used with Stata's if and in conditions, and it can be limited to only certain variables by specifying a variable list. For example, we could summarize the price of foreign or domestic cars by executing summarize price if foreign == 1. That tells us that foreign cars have an average price of $6,384 in this dataset. To do the same for cars that have a foreign status of zero, i.e. domestic cars, we can type summarize price if foreign == 0. And that will give us the summary statistics for domestic cars, which appear to have a lower average price. Finally, let's take a look at the command codebook. Codebook is essentially a mix of the previous two commands. It produces descriptive information and summary statistics for each variable. It basically tries to create a codebook from the data. Codebooks are often large Word documents or PDFs that have lots of detail for each variable and come attached to large datasets. This allows users to better understand such datasets. The way it is intended to be used is that its output should be saved to a log file, and that this log file is then printed or stored somewhere else for future reference. So let's take a look. Let's execute codebook. Wow, that's a lot of information. There are a lot of detailed statistics for each variable. Whether this is useful is something you must decide. I cannot emphasize enough how useful the commands describe and summarize are.
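A short sketch of the conditional summaries and the codebook workflow described above. The log file name codebook_auto is made up for illustration, and the example assumes no other log is currently open.

```stata
sysuse auto, clear

* Average price of foreign cars, then of domestic cars
summarize price if foreign == 1
summarize price if foreign == 0

* Save the codebook output to a log file for future reference
log using codebook_auto, replace   // hypothetical log name
codebook
log close
```

Note the double equals sign: in Stata a single = assigns a value, whereas == tests for equality inside an if condition.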
You should always invest a good amount of time in exploring and understanding your data, as it will save you from trouble further down the line. And, as you can see, doing this is very easy in Stata.

9. Exploring Data 3 - Tabulating and Tables: Tabulating variables is an important part of initial data exploration. We often use tabulate to explore categorical variables, whilst we use summarize for continuous variables. In this session we'll explore how to use tabulate to create one-way and two-way tabulations. We'll also examine how we can create custom tables in Stata. The commands we'll be introducing in this session are the tabulate command, which tabulates variables, and we'll also have a brief look at the table and tabstat commands, which can be used to create customized tables. So let's head over to Stata and explore the auto dataset a little bit further with tabulate. Here we are, with the auto data already preloaded. Let's say we suspect that some of our variables are categorical, in particular the variables rep78 and foreign. These are characteristics that we might want to inspect more closely. If we suspect that a variable is categorical, then we can tabulate its values by using the tabulate command. It turns out there are actually different versions of tabulate, so let's have a quick look at its help file. Let's type help tabulate. Stata recognizes three different flavors of tabulate. We can create one-way tabulations with tabulate. We can also create two-way tabulations. And finally, we can create summary statistics with tabulate using the summarize option. Let's click on oneway. Here we see the syntax of tabulate oneway. It's simple to use. We call tabulate and then specify one variable. There are a few options down here, of which the most useful ones are probably missing and nolabel. The nolabel option removes any labels and allows you to see the actual category values. The missing option reports missing values as a separate category and can be useful for counting missing values.
Let's go back, then click on tabulate twoway. To call a two-way table, we simply enter two variables after tabulate. This version of tabulate has lots of extra options, some of which will be explored in further sessions. Useful options to note for now are the row, col and cell options, which provide percentages of various frequencies. Let's go back and look at the third version, the summarize version: tabulate, summarize. This version of tabulate can have one or two variables but requires the summarize option with a further variable. It will then produce statistics for that further variable over the first one or two variables. So let's close this and explore tabulate in action. To tabulate the variable rep78, we simply type tab rep78, and here we have tabulated rep78. We can see that rep78 is indeed a categorical variable, with five categories ranging from 1 to 5. Just under half the observations appear to be in category three. We call this type of tabulation a one-way tabulation, since we've only tabulated one variable. As mentioned previously, one-way tabulations have a few options, of which the most important ones are the missing option and the nolabel option. The nolabel option can be useful to see the actual number values behind any value labels. Let me demonstrate with foreign. Let's tabulate foreign. We can see that this variable is a binary variable with Foreign and Domestic cars. However, we don't know what value corresponds to what label. By invoking the nolabel option, we can see the number behind the label: tabulate foreign, nolabel. We now observe that domestic cars are coded as zero and foreign cars are coded as one. Now let's explore the missing option. Let's do tabulate rep78, missing. We now see that we have an extra category at the bottom. This category is the missing value category. In this case, the variable rep78 has five missing values.
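The one-way tabulations walked through above can be sketched as:

```stata
sysuse auto, clear

* Frequencies of each repair-record category
tabulate rep78

* Show value labels, then the numeric codes behind them
tabulate foreign
tabulate foreign, nolabel

* Report missing values as their own category
tabulate rep78, missing
```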
To create a two-way tabulation, we simply specify tabulate with two variables, like so: tabulate rep78 foreign. We now observe a two-way table of frequencies between repair record and foreign status of cars. It looks like foreign cars have slightly higher repair records. To explore this even further, we can call additional options that are not available in the one-way tabulate command. We can add the column and row percentages by typing tabulate rep78 foreign and then calling upon the col and row options. And we now see a two-way tabulation that not only has the frequencies, but also the column and row percentages. We see, for example, that more than 80% of foreign cars have a repair record of four or more, whilst only about 23% of non-foreign cars have this. Two-way tabulations are great for evaluating initial relationships between categorical variables. You'll find that this is a very frequently used command. Finally, we can also include summary statistics of a third variable in our two-way tabulation. To do that, we can use the option summarize. Here's an example with price: tabulate rep78 foreign, summarize and then in round brackets the variable price. This presents us with a two-way table where each cell contains the mean of price, the standard deviation of price and the frequency. We can see, for example, that domestic cars with a repair record of one have an average price of $4,564 with a standard deviation of 522. To create more fancy tables, such as three-way tabulations or even custom tables, we need to use the table or tabstat commands. However, the code for both commands can be relatively complex. So my advice is that you initially consider using the graphical menu to specify the table and then copy and paste the code to the do-file. However, here is an example in code. In this particular case, we're going to specify a three-way table of rep78, headroom and foreign and simply ask for the frequencies of each of these. And there we are.
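The two-way tabulations above can be sketched as follows. The table command is left out on purpose, because its syntax changed substantially in Stata 17 and the transcript does not show which version was used.

```stata
sysuse auto, clear

* Two-way table of frequencies
tabulate rep78 foreign

* Add column and row percentages to each cell
tabulate rep78 foreign, column row

* Mean, standard deviation and frequency of price within each cell
tabulate rep78 foreign, summarize(price)
```

Column percentages answer "what share of foreign cars has each repair record", while row percentages answer "what share of cars with a given repair record is foreign", so it is worth checking which one a quoted figure refers to.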
The actual output isn't particularly useful, but hopefully you get the idea of the capability of the table command. Table is quite a complex command to use, and you may want to use tabstat instead, which is slightly easier but still complicated. Tabstat can also be accessed via the menu options. But here's another example in code, with various statistics for selected variables. In this particular case, we're creating a custom table around the variables mpg, rep78, headroom and trunk. We're asking for statistics on the mean and the count over the variable foreign. Let's see what comes out, and there we are. We now have a table with some key statistics for selected variables. I recommend that you explore table and tabstat further in your own time. The tabulate command will be one of your most used commands when using Stata. Use it often. Make sure you understand your categorical variables well before proceeding with data manipulation. The table and tabstat commands are powerful but quite complex to use, and I recommend building up some experience first.

10. Exploring Data 4 - Missing Data: In this video, we'll explore the concept of missing values and how Stata treats these. Ensuring that you identify, track and properly code missing values is crucial to good data analysis. Many mistakes occur because users did not properly manage their missing data. And in the real world, missing data is everywhere. I can think of virtually no real-world dataset that is 100% complete. Beware that Stata treats missing values in a particular way. Missing values are coded as full stops in Stata, and a key concept to be aware of is that Stata evaluates missing values as positive infinity. In other words, any cell that contains a missing value, in this case denoted by a full stop character, will be treated as positive infinity. Stata must assign missing values some kind of value on the number line.
Otherwise, any variable that has even one missing value would become a string variable, which wouldn't be very useful. This means we must always be careful when using logical operators such as greater than, or greater than or equal to, because these operators automatically include infinity in their evaluation. You should get into the habit of using the operator not equal to missing often. The Stata commands we're looking at in this session are the misstable command, which allows you to get an overview of which variables have missing observations. We'll also look at mvencode and mvdecode, which encode or decode missing values. Both commands allow you to quickly recode missing values in your data. And finally we'll have a quick look at the tabulate missing option, which is another way to identify missing values. So let's head over to Stata and explore missing values in the auto data. Here we are in Stata using the auto training data. This dataset does not have many missing observations, so it's not a great example, but the repair variable does have some missing observations. To get an overview of missing data in our dataset, we can run the misstable command. This command has quite a few subcommands, so let's type help misstable to explore this a little bit further. Help misstable, and here we can see that misstable requires one of four subcommands to work. The summarize subcommand is probably the most used, and this gives you a quick overview of what variables have missing values in your dataset. The other three subcommands allow you to explore more complex missing patterns in your data. I recommend exploring these further with larger and more complex datasets in your own time. So let's close this and use the summarize subcommand: misstable summarize. And this now provides us with a summary of all variables that have missing observations. In our data, we observe that only one variable has missing values, in this case rep78.
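A small sketch of the habit recommended above, using the auto data. The count comparisons are an illustration added here rather than something shown at this point in the video; missing() is a standard Stata function that is equivalent to the not-equal-to-missing test for ordinary missing values.

```stata
sysuse auto, clear

* Overview of which variables contain missing values
misstable summarize

* Naive condition: missing values count as larger than any number,
* so they are silently included
count if rep78 >= 5

* Safer conditions that exclude missing values explicitly
count if rep78 >= 5 & rep78 != .
count if rep78 >= 5 & !missing(rep78)
```

If the first count is larger than the other two, the difference is exactly the number of missing values that slipped through the naive condition.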
This variable has five missing values and 69 values that are not missing, in other words, less than infinity. The column in the middle, which says observations greater than full stop, or greater than infinity, is for extended missing values. But don't worry about this too much. It's very rare to meet these in real life. If we had more variables with missing values, we could also use the subcommand patterns to see if there are specific patterns of missingness. Let me show you a quick example anyway: misstable patterns, execute, and there we are. We see that only one variable, in this case rep78, has missing values, and 7% of its values are missing. Had we had more variables, we could see patterns emerge here that would tell us how these missing values are distributed among the various variables. So now that we have an overview of where the missing data is, let me show you why we need to be careful. Let's tabulate rep78, and let's also tabulate it with the option missing. This option will display missing values as a separate category in our tabulation. So tab rep78. And here we see our five categories of the repair record. Now let's tabulate this again with the option missing. This option adds a further category at the bottom of the table, the missing category. We can now observe that rep78 indeed has five missing values and that Stata evaluates them as being greater than any of the numbered categories. Let's now assume that we forgot about this and summarize the price of cars for all cars with a repair record of five or more. If we did this naively, we might type something like this: summarize price if rep78 >= 5. As you can see, the problem with this expression is that the summary statistic returns 16 observations back to us, whilst only 11 observations have a repair record of five or more. This is because Stata evaluates missing values as infinity and falsely included them in our summary of price.
The correct way to type this would have been summarize price if rep78 >= 5 & rep78 != . And there we are. We now have 11 observations in our summary statistic. I admit this is not an elegant solution, but that's the way it is, and you should always remember this. Finally, if we wanted to recode all our missing data to something else, we can call upon the mvencode or mvdecode commands. For example, we can turn all missing values into numeric values by typing mvencode _all, mv(-99), which turns all missing values into the numeric value minus 99. Let's tabulate rep78 to check. And here we see that we now have a new category, minus 99, that has five observations. Mvdecode reverses this process. So mvdecode _all, mv(-99), then tabulate rep78, and this has now turned all values of minus 99 into missing values. Mvencode and mvdecode are really useful when opening large, complex datasets. Often these datasets will have missing values hard coded to a specific number, such as minus 99 or minus 9 or minus 8. These commands allow you to quickly recode all of these to something else, should you want to. And this concludes this overview of missing values in Stata. Take care with them, as they can be very dangerous if left unchecked.

11. Exploring Data 5 - Numeric Distribution Analysis: In this session, we're going to explore distributional statistics. When we see a continuous variable, such as price, income or earnings, we often want to understand more about how these variables are distributed. Do they have a bell-shaped distribution or do they have skew? Do they have short or long tails? Often such questions are explored graphically, but we can also do this numerically. That's what we'll focus on here.
We'll have a look at the inspect command, which produces a rough histogram and is designed to give you a little more distributional information on a variable. A similar command is the stem command, which produces a stem-and-leaf plot. We'll also use the summarize command, which we've used previously, but here we'll invoke the detail option, which presents additional distribution statistics for a variable. Finally, we'll use the skewness and kurtosis test, called sktest, to formally test whether a variable is normally distributed. So let's head over to Stata and practice these commands on the auto training data. Here we are in Stata with the auto training dataset loaded, and let's assume that there's a continuous variable that we're very interested in, price for example. Let's summarize price. Here we can see that price has a mean of around $6,000 and a standard deviation of around $3,000, with a relatively high maximum value. If we wanted to explore this variable in more detail, we could first inspect it using the inspect command. So let's type inspect and then price. This reveals to us that all values in the variable price are positive. That's good, as a price of zero or a negative price might be worrying for something like a car. We see that there are 74 unique values, which clearly indicates that this is a continuous variable, as each observation has a unique and distinct value. We also see a very rough histogram displayed here, which tells us that a lot of the data is bunched up on the left-hand side and there's a long tail going off to the right-hand side. This causes us to suspect that we have a non-normally distributed variable here. A similar command to inspect is the stem command. This command produces a stem-and-leaf plot. Stem-and-leaf plots are a compact way to present considerable information about a batch of data. Let me show you with the variable price: stem price.
And here is a stem-and-leaf plot of the variable price. The expression to the left of the vertical bar is called the stem, and the digits to the right are called the leaves. All the stems that begin with the same digit are grouped together, with the corresponding leaves written beside each other, and together a stem and a leaf reconstruct an observation in the data. For example, we can see that we have two data points that have prices of $8,000. I can't say this is a very often used command, but it's there if you need it. It won't work very well with large data sets. We can also use the summarize command to obtain more detailed statistics by invoking the detail option. Let's go ahead and do that: summarize price, detail. There's a lot of information here, so let's take a few moments to make sure we understand it. The percentiles tell us what the values at each percentile are; in this case the 50th percentile, which is the same as the median, is around 5,000. These percentiles give us some idea of how the variable is distributed, and Stata tells us the values for a wide range of percentiles. The columns smallest and largest denote the smallest and largest four observations in this variable. The observation count, mean and standard deviation are not new, so we can skip these. But here on the bottom right, we find additional information on the variance, skewness and kurtosis. Of these, skewness and kurtosis are the two statistics that we're interested in. Skewness is a measure of the lack of symmetry of a distribution. If the distribution is symmetric, the coefficient of skewness will be zero. If the coefficient is negative, the median is usually greater than the mean and the distribution is said to be skewed to the left. If the coefficient is positive, the median is usually less than the mean and the distribution is said to be skewed to the right. Kurtosis is a measure of the peakedness or tailedness of a distribution.
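The detail option described above can be sketched as follows. This is only a sketch: it relies on summarize leaving its results in r() (with detail these include r(p50), r(skewness) and r(kurtosis)), and it adds the sktest command that the session overview announced, so the formal test can be compared with the descriptive numbers.

```stata
* Load the auto training data
sysuse auto, clear

* Percentiles, four smallest and largest values, variance,
* skewness and kurtosis
summarize price, detail

* summarize stores its results in r(); print the ones discussed here
display "median   = " r(p50)
display "skewness = " r(skewness)
display "kurtosis = " r(kurtosis)

* Skewness and kurtosis test for normality (separate and joint)
sktest price
```

Printing the stored results is optional; it simply shows that the skewness and kurtosis figures can be reused in later commands without retyping them.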
The smaller the coefficient of kurtosis, the flatter the distribution and the thinner its tails. The normal distribution has a coefficient of kurtosis of three, and that provides a convenient benchmark. In our example, we see that price is right-skewed and its kurtosis is above three, suggesting thicker tails than a normal distribution. We can ask Stata to formally test whether a variable deviates from the normal distribution by calling the sktest command. Let's do that with price: sktest price. sktest performs a skewness and kurtosis test for normality. It tests both skewness and kurtosis, independently and jointly. P-values below 0.05 indicate that we reject the normality assumption at the 5% significance level. We see that for price, we reject the hypothesis that it is normally distributed at the 5% level, quite significantly so. So there we are. These commands are easy to use and will help you to quickly better understand the nature of any continuous variables in your data.

12. Exploring Data 6 - Using Weights: Let's take a look at weights in Stata. If you work with survey data, you'll most likely come across sampling weights at some point. Weights are variables that try to give certain observations more or less influence, often to make your data look more like population-level data. For example, ethnic minorities or very large firms may be undersampled in surveys, so giving them more importance in your analysis makes up for that. The great thing about Stata is that the majority of Stata commands allow you to specify weights. In Stata, weights are always specified by being set equal to something in square brackets, just before the comma that denotes options. Stata takes four different types of weight. You can specify frequency weights, sampling weights (sometimes called probability weights) and analytical weights. Finally, there's also something called importance weights, although these weights and how they work will depend on each Stata command. It's very rare to use these.
If you don't specify which type of weight to use, Stata will try to take a best guess and let you know what it assumes, so that's quite handy. Now let's head over to Stata and explore how we can implement weights into everyday statistics. Here we are in Stata, and this time I've loaded up a data set called census. This is a training data set that has information on various population statistics for each of the 50 US states. Let's have a look and describe the data: describe. Describing the data reveals to us that this is a 1980 census data set with information by state. It contains information about populations, ages, and the number of deaths, marriages and divorces. So let's go ahead and run a summary on this data: summarize. Execute. Let's take a look at the median age variable. Here we see that the average median age in the US is 29.5 years. However, this statistic ranks each state equally. Some states are bigger than others, and we may want to adjust for that; giving each state equal weight would probably not be a good comparison. So we can give each state a different weight by using a population weight. In this case, that variable is called pop, and the type of weight is a frequency weight, since it contains the frequency of people per state. Let's take a quick look at the variable pop: tabulate pop. And here we see the actual population count for each of the US states. Now let's imagine that we don't know this, and we'll let Stata try to figure this out. We can ask Stata to produce a weighted summary of median age by typing the command summarize medage and then, in square brackets, specifying the weight option equal to the variable pop. The average computed median age has now changed from 29.5 to 30.1 years. We also see a weight column that summarizes the total weights used: 225 million.
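The weighting steps in this lesson can be sketched as below. This is a hedged sketch using the bundled census data: medage, pop, marriage and death are variables in that data set, while the regressor list in the regress line is an assumption, since the lesson does not spell it out clearly. The last three commands use the frequency-weight form that the lesson turns to next.

```stata
* Load the 1980 census training data (one row per US state)
sysuse census, clear
describe

* Unweighted: every state counts equally
summarize medage

* Generic weight: Stata picks a weight type and reports its assumption
summarize medage [weight=pop]

* Explicit frequency weights: each state counts once per inhabitant
summarize medage [fweight=pop]

* Weights go in square brackets before any options in other commands too
* (the regressor list here is an illustrative assumption)
regress marriage medage [fweight=pop]

* Weighted scatter plot: marker size reflects each state's population
scatter death medage [fweight=pop]
```

Comparing the three summarize outputs side by side shows how the reported observation count and standard deviation depend on the weight type, even when the weighted mean is the same.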
In this case, note that Stata assumes analytical weights, which is actually wrong here. So let's go ahead and change this to frequency weights by adding the letter f to weight. So now we type summarize medage and, in square brackets, fweight equals pop. Stata now correctly inflates the observation count to 225 million observations. The mean value remains the same, but the standard deviation changes ever so slightly. We can apply weights in many settings, including descriptive statistics, multivariate statistics and even graphs. For example, we can include weights in a regression analysis. Here's a regression example without weights: we regress marriage against median age, execute that, and then repeat the same thing with weights. In this case, the frequency weight equals population. Notice how the coefficients and standard error on one of the variables have changed significantly. We can also include weights in graphs. For example, if we wanted to correlate the number of deaths with the median age and weight this by the population in each state, we can type scatter death against median age and then again, in square brackets, include a weight. This now produces a scatter plot with weighted markers that indicate the relative strength of each observation. As you can see, weighting is very easy to accomplish in Stata and can be added to the vast majority of Stata commands.

13. Manipulating 1 - Recoding a Variable: Let's explore how to manipulate data in Stata. Data manipulation and data management is a key skill that all users should concentrate on. Proficiency in data manipulation reduces output mistakes and also makes other operations, such as data visualization, much easier. In this video we'll explore how we can recode existing numerical variables. Specifically, we'll introduce the recode command, which can help us recode existing variables very quickly. So let's head over to Stata, open the auto data set and take a look at the variable rep78.
Here we are in Stata with the auto training data already loaded. Let's take a closer look at the variable rep78. Let's tabulate it. Here we see that this variable consists of five categories, but two of the categories, category one and category two, are pretty sparsely populated with data. So let's say, for example, that we wanted to recode this variable so that categories one and two are merged together. How do we do that? Well, this is where the recode command can help us out. It's very simple to use, so let's have a look at its help file: help recode. Looking at recode's help file highlights that it has a very simple rule-based syntax where we can enter as many variables as we like, although the norm is usually only one at a time. We can then specify as many rules as we like in round brackets, although the brackets are not absolutely necessary. Finally, the generate option will write our recoding to a new variable rather than replacing the existing variable, which is the default behaviour. The rules can be found here. We can choose from five different sets of rules: some number equals some other number; a set of numbers becomes another number; a range of numbers becomes another number; nonmissing numbers become another number; or missing numbers become another number. So let me demonstrate this for you. Let me close this, and then let's type recode rep78, and then we're going to change the values one and two to become the value one. Execute that. Now let's tab rep78 and have a look at it again, and we now see that we've recoded the values one and two to become the value one. So category one now has an observation count of 10 and contains the values of the previous first and second categories together. Perfect. However, we may decide that we're not very happy with the number gap that is now present. We would like to recode all the other values. Before we do that, note that the original rep78 variable is now irretrievably changed.
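The recode rules described above can be sketched as follows. This is a minimal sketch rather than the exact code shown on screen: the new variable names rep78_merged and mpg_cat are hypothetical, and the mpg cut points illustrate the range rule from the help file and are an assumption.

```stata
* Load the auto training data
sysuse auto, clear
tabulate rep78

* Safer variant: write the recoding to a new variable and
* leave rep78 untouched (rep78_merged is a hypothetical name)
recode rep78 (1 2 = 1), generate(rep78_merged)
tabulate rep78_merged

* In-place variant, as in the lesson: this overwrites rep78
recode rep78 (1 2 = 1)
tabulate rep78

* Range rule: a/b covers all values from a to b
* (cut points are illustrative; mpg_cat is a hypothetical name)
recode mpg (10/19 = 1) (20/29 = 2) (30/99 = 3), generate(mpg_cat)
tabulate mpg_cat
```

Using generate() costs one extra variable but keeps the original values available, which matters because an in-place recode cannot be undone without reloading the data.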
There is no back button in Stata; the only way to get back to our original rep78 variable is to reload the data. So let's go ahead and change all the other values in rep78. I've already typed out the code here, and this time I have put our rules in round brackets for better visual identification. So we're going to recode the values one and two to one. Now, we've already done this, so not much will change here. However, the value three is going to be recoded to the new value two, four is going to go to three, and five is going to go to four. Let's execute that. And then let's tab rep78 again. And there we are: we've now recoded all the other values and we've eliminated the original number gap, so that's all looking pretty good. Finally, we can also use recode to quickly change a range of values. For example, let's say we want to categorize mpg. This variable looks almost continuous because it has many distinct values. Let's say we want to categorize it into three separate categories. Well, we could type the following: recode mpg, and then change the values 10 to 19 to become the value one, 20 to 29 to become the value two, and 30 to 99 to become the value three. Let's execute that and tabulate mpg again. And we can now see that we've quickly recoded a continuous variable, mpg, to become a categorical variable with only three categories. As you can see, recoding variables in Stata is easy and fast. The rules-based approach of recode allows you to perform multiple recoding tasks in one command line. That's really useful to know.

14. Manipulating 2 - Creating and Replacing Variables: In the previous session, we explored how to recode an existing variable. In this session, we're going to focus on how we can generate new variables. We'll look at how we can generate new variables by a computational combination of other variables; say, for example, price divided by weight.
We'll also look at how we can create new variables based on function transformations, such as taking the logarithm of price. We're going to introduce the generate and replace commands. Both commands are identical, except that one generates a new variable, while the other changes an existing variable. So let's go to Stata and explore this further. Here we are in Stata, with the auto training data already loaded, and before we start coding, it might be worth taking a quick look at the help entry for generate. So let's type help generate. Execute that. And here we can see the help file entry for generate. generate looks like it has a complicated syntax, but it's actually very easy to use. We generate a new variable that is equal to some expression and then add any relevant if or in conditions. Notice how the replace command just below it has the same syntax structure. So let me show you the simplest version of this. Let's type generate and then the name of a new variable, call it constant, equals one. Execute that. So what we've done here is generated a new variable called constant that is equal to one for every observation in the data set. The new variable can be found at the bottom of our variable list. Now say, for a moment, we are interested in generating a new variable that divides price by weight, perhaps to get some indication of what each pound of car actually costs. Well, we can do that by specifying the following code: generate, and then the name of the new variable, which in this case we'll call newprice, equals, and here we're simply dividing one variable by another, price divided by weight. So we now have another new variable called newprice at the bottom of our variable list. We can also generate a price-squared variable, which is simply price multiplied by price. So to get that we would type generate, and then let's call it pricesquare, equals price times price. Notice here that we multiply a variable with itself, and that's absolutely fine.
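The three generate examples so far can be sketched as a short do-file. The variable names follow the narration, but their exact spelling (newprice, pricesquare) is an assumption, since only the spoken names are available.

```stata
* Load the auto training data
sysuse auto, clear

* A constant: equal to 1 for every observation
generate constant = 1

* Combination of two variables: dollars per pound of car
generate newprice = price / weight

* A variable multiplied by itself
generate pricesquare = price * price

* New variables are appended to the end of the variable list
describe constant newprice pricesquare
```

Each generate line follows the same pattern: a new name on the left of the equals sign and an expression built from existing variables or numbers on the right.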
Finally, we can also add a number to price by typing price plus something. So, for example, generate price1000 equals price plus 1000. This will add 1,000 to price and store the result in the variable price1000. Great. So let's take a quick look at the summary statistics of all these new variables. Let's summarize all of that. And here we can see that the variable constant has a mean of one and a standard deviation of zero because, of course, it takes the value one for every single observation in our data. newprice is now an indicator of the cost per pound of car. We see that some cars cost more and others cost less per unit of weight. pricesquare has very large values because we've multiplied high values with themselves. And finally, price1000 is simply price with 1,000 added. So we see that the mean, minimum and maximum increased by 1,000 compared to price, but of course the standard deviation stays exactly the same. So what's next? Well, we can use some of Stata's functions. There are many, and we can access these by typing help functions. Let me just move this over to the left here: help functions. And here we can see all of the functions that are available in Stata. So, for example, let's click on mathematical functions over here, and here we can see that we can take the exponential, take floors, integers, et cetera. Let's use the log. Let me close this and move this over here for better visibility, and let's generate the log of price. Well, we can do that by calling the relevant function in our generate command. So, for example, generate logprice equals log and then, in round brackets, the variable price. And we can now see that we've generated the new variable logprice at the bottom of our variable list. Finally, all the above actions can be used with the replace command, which makes changes to current variables.
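A short sketch of the steps above, again assuming the auto data and the spoken variable names (price1000, logprice). The final replace line is a small illustration of the closing remark that the same expressions work with replace; it reuses the category merge from the recode lesson.

```stata
* Load the auto training data
sysuse auto, clear

* Adding a constant shifts the mean, min and max
* but leaves the standard deviation unchanged
generate price1000 = price + 1000
summarize price price1000

* Function transformation: log() is the natural logarithm in Stata
generate logprice = log(price)
summarize logprice

* replace uses the same syntax but changes an existing variable
replace rep78 = 1 if rep78 == 2
tabulate rep78
```

Note the double equals sign in the if condition: a single equals sign assigns a value, while == tests for equality.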
So, for example, if we wanted to replace all the values in the variable price with the values in the variable newprice, we could execute replace price equals newprice, like so, and we've now made permanent changes to the variable price. Another example is where we can replicate some of the code from the previous session, where we merged categories in rep78. So, for example, to merge category two in rep78 into category one in rep78, we type replace rep78 equals one, then specify an if condition that says if rep78 equals equals two. Let's execute that and tabulate rep78. And here we see that we've now merged the previous five categories into four categories, with categories one and two merged together. So this is very similar to what we did in the previous video. The disadvantage here is that we need to write a new line of code for each replacement. Often the previous recode command offers a much faster way to do this, so we can ask ourselves, why would we ever want to use replace? Well, as we've seen above with generate, the advantage of replace is that it allows us to call on a wide variety of statistical and mathematical functions that we can use to transform existing variables. recode cannot do this. It can only recode existing values, and that is the advantage of using replace to change existing variables.

15. Manipulating 3 - Naming and Labelling Variables: In this video, we'll examine how to rename and relabel variables. We're going to investigate how to rename single variables, rename groups of variables, change variable labels and create new value labels. One should not underestimate the importance of accurate naming and labelling. This can be a very important part of data manipulation, especially when working with large data sets. To achieve this, we're going to look at the rename command and the label command. The rename command allows you to rename variables, while the label command allows you to change labels.
Let's hop over to Stata and explore this further. Here we are in Stata with the auto training data already loaded. Before we continue with rename and label, I'd like to point out that this is one of the few occasions where I sometimes recommend a graphical approach as opposed to the written code approach. Stata has something called the Variables Manager, which can be accessed via the Data menu or by simply typing varmanage. So let's have a look at that. And here's the Variables Manager. You can see that this is a graphical user interface, and you can simply highlight and click on the variables that you want to manage, and here you can change the labels and also rename variables. This menu can sometimes be a simpler and faster way of editing your variable and value labels, but let me continue and teach you the hard way in this session. So let's close this and let's look at our data. Here we have all the variables and the variable names for our auto data. Let's assume for a moment that we don't like the variable name rep78 and that we want to change it. rename allows us to do this in a simple way. Let's explore its help file very quickly. We can type help rename, and here's rename's help file. The syntax for rename is pretty simple: just rename the variable from the old name to the new one. However, there is also a group renaming ability, so let's click on that. Here we can see that rename has some powerful group renaming abilities. I will demonstrate some of these to you, but I recommend that you study this help file more closely if you're interested in changing many variable names in Stata. So let's close this and let's rename the variable rep78. Let's call it repair. To do that, we simply call upon the rename command, the old variable rep78, and then the new variable name repair, and we can now see in our variable list that the variable rep78 has changed its name to repair. So that's pretty easy to understand.
And when you want to change a single variable name, there isn't much more to it other than to remember that Stata differentiates between upper-case and lower-case letters. As a side note, we can actually change the variable names to upper and lower case en masse by using the upper and lower options with rename, like so. In this case, rename star comma upper will rename every single variable in our data set to upper case, and to bring it back to lower case we type rename star comma lower. So that could be useful to know. Now let's assume we want to change all variables in one go. Let's say we want to make it absolutely clear that this entire data set, and all its variables, refer to the period of 1978. Well, we can do this by adding a suffix of 1978 to all the variable names. rename allows you to do this in the following way: rename star, which indicates all variables, to star, again all variables, but now adding the numeric characters 1978 to all the variable names. We can see that our variable list has now changed. All the names now end with 1978. If we wanted to remove the suffix 1978, we can reverse this process and type the following: rename all variables that end in 1978 back to their names without it. And now we've removed the suffix 1978 from all variables. There are many more possibilities that group rename offers, but you get the flavour of how easy it is to rename many variables in Stata. Next, we might want to change the variable label of repair. That's this little bit over here next to the variable name. We can do this by calling on the label command. Specifically, to change the variable label, we would type label and then the subcommand variable, the variable repair and here, in quotation marks, the new text, and we've now changed the variable label to repair categories. Next, let's take a closer look at repair. Let's tabulate it. Here we see five categories. However, none of the categories have labels.
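The renaming and labelling steps can be sketched as below. This is a hedged sketch on the auto data. The value-label part at the end anticipates the two-step process the lesson covers next; only the texts for values 1 and 2 are given in the narration, so values 3 to 5 are deliberately left unlabelled here.

```stata
* Load the auto training data
sysuse auto, clear

* Rename a single variable: old name, then new name
rename rep78 repair

* Change the case of every variable name, then change it back
rename *, upper
rename *, lower

* Add a suffix to every variable name, then remove it again
rename * *1978
rename *1978 *

* Change the variable label (the text shown next to the name)
label variable repair "Repair categories"

* Value labels are a two-step process: define the label, then attach it
label dir
label define repair 1 "No repairs" 2 "Some repairs"
label values repair repair
tabulate repair
```

The value label and the variable share the name repair here only because the lesson chose it that way; Stata keeps the two in separate namespaces, so any valid label name would work.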
What would we do if we wanted to label each of these values? Well, we can do this by creating a new label and storing this in Stata's label memory. After we've done that, we can attach it to the variable. So note that this is a two-step process. By default, labels are not attached to any variables, and we can get an overview of the current labels in our label storage by typing label dir. And there we are. We see that we currently only have one label in our data set, and that's called origin. So let's go ahead and create a new value label, and we can do this by first defining a new label, and we do that by calling the label command with the subcommand define. We then give the label a name. In this case, we're calling it repair, but we can call it whatever we want. It doesn't have to be the same as the name of the variable that we're going to attach it to. After that, we give each value a specific label. In this case, one will be no repairs, two some repairs, et cetera. Let's execute that. Okay, so nothing much happened. The only thing that happened here is that we've now stored our value label in Stata's memory. The next step is to take that label and attach it to the values of repair, and the way we do that is by calling upon the label command with the subcommand values. We're going to attach the value label repair to the variable repair. So let's do that. And we can now tabulate the variable repair, and here we see the value labels attached to the underlying values. So that's all good. And that's how easy it is in Stata to change variable names and create new value labels.

16. Manipulating 4 - Advanced Variable Creation: In this session we'll look at the extended generate command, which is a more powerful version of the generate command. Stata's generate and replace commands are, without doubt, some of the most used commands in Stata. But what happens if you want to create complex variables and both commands can't help us or require a lot of work?
For example, say we wanted to create deciles from a continuous variable. Using generate and replace might mean we would have to type 10 lines of code, and we would need to use the summarize command to figure out where the cut points are. Well, this is where egen comes in. egen stands for extensions to generate and is designed to simplify more complex data manipulations. So let's head over to Stata and let me show you. Here we are in Stata with the auto data already open. Let's take a look at egen's help file: help egen. The syntax for egen is identical to generate, except that all egen commands require a special function to run. A list of functions is given below, and you are advised to take a careful look at some of these. They might seem complex at first, but they will make your data processing much faster. In this session, I will demonstrate three functions to you: the standard deviation function, the cut function and the row mean function. The standard deviation function is pretty easy and allows you to compute standard deviations. The cut function looks a little bit more complex; specifically, we'll use it with the group option. This allows us to cut variables into equal-frequency groups. Very useful if you want to create quartiles, quintiles, deciles, et cetera. I will also demonstrate one of the row functions that egen has, specifically rowmean, which allows you to perform mean calculations across a row instead of a column. So let's close this and let me show you some examples. The first example is an example with egen's sd function. It's pretty simple: egen, a new variable, which in this case we'll call sdprice, equals sd of price, and this will now have generated a new variable which has the standard deviation of price as its values. Here we can see that the standard deviation of price has been moved into that variable, and every single observation in that variable now takes the value 2949. Next, let me show you the cut function.
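The three egen functions named in this overview can be sketched as follows. This is a minimal sketch on the auto data; the variable names sdprice, decile and rowmeans follow the narration, but their exact spelling is an assumption.

```stata
* Load the auto training data
sysuse auto, clear

* sd(): the same standard deviation stored in every observation
egen sdprice = sd(price)

* cut() with group(): ten groups of roughly equal frequency
egen decile = cut(price), group(10)
tabulate decile

* rowmean(): mean across variables within each observation (row)
egen rowmeans = rowmean(price mpg weight)

* Inspect the first few rows to see the row-wise result
list price mpg weight rowmeans in 1/5
```

Each line keeps the generate-style pattern of a new name, an equals sign and an expression; the only difference is that the expression must be one of egen's own functions.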
Let's say we want to create a new variable of price that recodes price from its continuous state into 10 equally sized deciles. In other words, I want to create a categorical variable with 10 categories that each have equal frequencies. This would take quite a long time to do with generate. But with egen we can simply type the following: egen, we'll call the new variable decile, equals the cut of price. And importantly, here we're invoking the group option, and we're going to ask Stata to cut it into 10 groups. Let's go ahead and do that. And now we can tabulate the new variable decile, and we see that we have 10 approximately equally sized categories of price. Finally, as a last example, imagine that we want to perform a mean computation by row. In other words, for each observation in our data, that is, for each car, we want to compute the average value of, say, price, mpg and weight. The standard generate command will only compute averages by columns. So to compute the row mean we can use egen's rowmean function. So, for example, egen rowmeans equals the function rowmean, and in this case we're going to ask it to compute the means across the variables price, mpg and weight. And now let's browse the data so you can really see what happened. What we've done is we've taken the average of this, this and this value, and we've put it into a new variable called rowmeans. So that concludes this very fast overview of egen. It's a relatively advanced data command, but one that is very useful to know about in case you get stuck with generate or replace.

17. Manipulating 5 - Creating Indicator Variables: Let's examine how to create indicator variables in Stata.
These can also be referred to as qualitative variables. Such variables are used for many purposes and can often be found in regressions, where they identify some kind of category of data, for example up or down, high or low, country A or B, et cetera. Because they are used so often, Stata offers several ways to create indicator variables. So let's go to Stata and take a look. Here we are in Stata with the auto data open, and let's say we want to generate a binary indicator from the variable price. Say that we want the value one to reflect any price over $6,000, and we use the value zero for anything else. We can achieve this by using the generate and replace commands over two separate lines of code and specifying the if condition in both of them. So, for example, we can start with generate a new variable, let's call it highprice, equals one, and then if price is greater than 6000. Let's execute that, and then, as our second command, we use replace to replace the variable highprice with the value zero if the value of price is equal to or less than 6000. Execute that. And now we can tabulate our new indicator variable called highprice. But there's another way to do this, and that is with the recode command. recode has a generate option, which can be used to generate a new variable instead of changing the existing variable. So a little trick in this case is to use recode on the variable price and then, as the first condition, go from the minimum value all the way to 6000 and make that equal to zero. And then, as the second condition, everything else will go to the value one, and we're using the generate option to put all of that into a new variable called highprice2. Execute that. Now we can tabulate highprice2, and we see it's exactly the same as the previously generated variable highprice. Notice that I'm using a specific rule here, which is called the else rule.
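The two routes to the same binary indicator can be sketched as below, assuming the auto data. The names highprice and highprice2 follow the narration, with the spelling as an assumption.

```stata
* Load the auto training data
sysuse auto, clear

* Route 1: generate and replace, each with its own if condition
generate highprice = 1 if price > 6000
replace highprice = 0 if price <= 6000
tabulate highprice

* Route 2: recode with a range rule, the else rule and generate()
recode price (min/6000 = 0) (else = 1), generate(highprice2)

* Cross-tabulate to confirm both indicators agree
tabulate highprice highprice2
```

In the cross-tabulation, all observations should fall on the diagonal if the two indicators are identical.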
This makes things very easy because everything that is not in the first rule will go to something we specify, in this case one. Next, let's say we wanted to generate many binary variables from a categorical variable. rep78 is a categorical variable with five categories. Imagine now for a moment that we wanted to split this variable out into five separate binary variables. We could use the above code five times and achieve what we want, but we could also use the generate option in tabulate. All we need to do is specify a new variable name, and Stata will automatically create as many dummy variables as there are categories, so that's useful to know about. Let's tab rep78, then specify the generate option and specify a new variable called repdummy. And we now see that we have five new binary variables in our data set, called repdummy1 and going all the way to repdummy5. Let's have a summary of these just to see how they look. And there we are. We see that we've got five variables that have minimum and maximum values of zero and one, and different means. Do be careful with this command that you don't accidentally use this code on a continuous variable, because otherwise you might end up with hundreds, if not thousands, of dummy variables in your data. Next, let's have a look at two advanced Stata functions, the autocode and the inlist function. Both require a bit of care, but can make your coding and data manipulation much faster. The autocode function allows you to generate evenly spaced categories across a number range, for example to create 10 categories that are evenly spaced across the values, let's say, 3,000 to 16,000 for the variable price. We can type the following: generate a new variable, call it pricecategory, and then invoke the autocode function, where we first specify the variable that we want to look up, then the number of categories, and then a value range going from A to B.
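The tabulate, autocode and inlist approaches in this lesson can be sketched as follows. This is a sketch on the auto data; repdummy, pricecategory and repair follow the spoken names, with the spelling as an assumption. The inlist lines show the shortcut the lesson explains right after autocode.

```stata
* Load the auto training data
sysuse auto, clear

* One dummy per category: creates repdummy1 ... repdummy5
tabulate rep78, generate(repdummy)
summarize repdummy1-repdummy5

* autocode(x, n, a, b): n equal-width intervals between a and b;
* each observation gets the upper bound of its interval
generate pricecategory = autocode(price, 10, 3000, 16000)
tabulate pricecategory

* inlist(x, v1, v2, ...): true if x equals any of the listed values
generate repair = 0 if inlist(rep78, 1, 2)
replace repair = 1 if inlist(rep78, 3, 4, 5)
tabulate repair
```

Observations where rep78 is missing match neither inlist condition, so repair stays missing for them rather than being forced into one of the two groups.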
In this case, 3,000 to 16,000 will cover the entire range of price. Let's execute that. And then let's tabulate what we just created. And we've now generated a new categorical variable that has 10 categories. Each category is equally spaced between 3,000 and 16,000. Note that this is not the same as creating deciles or categories with equal frequencies. We'll look at that in another session. And finally, there is the inlist function. The inlist function can be a useful shortcut that can save you from coding or typing out too many or conditions. For example, to generate a binary variable from the rep78 variable where categories one and two are coded to zero and categories three, four and five are coded to one, we could type the following lengthy code: generate a new variable called repair equals zero if rep78 equals equals one or rep78 equals equals two. And then here, at the bottom, we're replacing the variable repair equals one if rep78 is one of these categories: three, four or five. But notice that each time we need to spell out the variable name and insert an or condition. Well, we can do all of this much faster by invoking the inlist function. inlist allows us to specify a variable and then a sequence of values that will be evaluated. If the evaluation is positive, then something will happen. If the evaluation is negative, then nothing. So, for example, in this case we'll generate the repair indicator equal to zero if inlist of rep78 matches the values one and two, and then we'll replace it with one if inlist of rep78 matches three, four or five, and that will generate a new binary variable of repair. So this can be a much faster way than typing out too many or conditions. Great. So as you can see, Stata makes it easy for you to quickly generate indicator variables.

18. Manipulating 6 - Dropping and Keeping Data: Keeping and dropping data is a key activity of any data manipulation process, and Stata makes this procedure relatively simple.
In this session we'll explore how we can drop or keep certain observations and/or variables. Note, there is no back button in Stata, and executing these commands will lead to changes to your data. So make sure you work with a copy. We're going to introduce the keep and drop commands. Both of them either keep or drop a set of observations or variables. So let's head over to Stata and explore this further. Here we are in Stata, with the auto data already loaded. Let's say we wanted to remove the variable price from our data. Well, we can do that by invoking the drop command. It's relatively simple: we specify drop and then the variable name price. The variable price has now been dropped from our dataset. That's pretty simple. We can, of course, specify multiple variables after the drop command. Alternatively, we could use the keep command and specify what variables we want to retain. Note that we can, of course, enter multiple variables. But we can also use wildcards. For example, by typing keep star e star, we will keep all variables that contain the letter e. And there we are, we've now kept every single variable that contains the letter e. We can also drop or keep observations by specifying the if condition. For example, let's take the variable foreign. Let's drop all foreign car observations from our data. We can do that by typing drop and then if the variable foreign equals equals one. Let's tabulate foreign to see, and we can see that our dataset now only contains domestic cars. In this case, 22 foreign cars have been removed from our dataset. And finally, we can also drop or keep a list of variables by specifying a range, like so. If we wanted to drop all variables that range from rep78 to weight in our variable list, we can simply type drop rep78 dash weight, and this now drops four variables that are ranked in our variable list between rep78 and weight.
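The drop and keep variants covered so far can be collected into one short sketch. It assumes a fresh copy of the auto data loaded with sysuse, and it runs the steps in a slightly different order from the demonstration so that each one still has something to act on.

```stata
* Work on an in-memory copy of the shipped data; nothing on disk is changed
sysuse auto, clear

* Drop a single variable
drop price

* Drop observations that satisfy a condition (the foreign cars)
drop if foreign == 1
tabulate foreign

* Drop a range of adjacent variables, from rep78 through weight
drop rep78-weight

* Keep only variables whose names contain the letter e
keep *e*
describe
```

There is no undo for any of these commands, so reloading the data is the only way back to the original variables and observations.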
This can be useful if you want to drop a whole bunch of variables that are ranked next to each other, and it saves us from typing out every single variable name. Drop and keep are useful commands that are used often in Stata. But remember to take care with them. They allow you to make your dataset smaller, remove excess variables, save space and increase computational speed, but there is no back button.
19. Manipulating 7 - Saving Data: Once we're done with data manipulation, we'll probably want to save our data. So let's explore how to do that. However, before we dive into Stata, it is very important to recognize that saving data can be dangerous. Stata will overwrite your previous dataset if you specify the relevant option, which generally you will. It will not ask you, are you sure? And there is no back button in Stata. Once your previous data is replaced, that's it. The original source is gone. So always work with a copy. I cannot emphasize this enough, and this should always be part of your good data practice. The following commands matter for saving data in Stata: the cd command, which is short for change directory and allows you to change your directory, and the save command, which will save your data. There's also an export command that allows you to save your data into different formats. So let's head over to Stata and explore these commands. And here we are in Stata with the auto data. Imagine we've just finished a data manipulation exercise and transformed our auto dataset to our liking. We want to save a copy of it before we move on to the data analysis part of our project. So before we do that and invoke the save command, it may be useful to know where the file will be written to. If we don't specify a path name, Stata will write everything to our default directory, which would have been set up during the initial installation process. And we can view this directory by typing cd.
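A minimal sketch of checking the working directory and saving a copy there is shown below. The use of Stata's temporary directory, c(tmpdir), is only a stand-in assumption for whatever project folder you would actually use, and the file name auto2 is illustrative.

```stata
sysuse auto, clear

* Remember where we started so we can return afterwards
local start "`c(pwd)'"

* Display the current working directory
pwd

* Change directory; c(tmpdir) is a placeholder for your own quoted path
cd "`c(tmpdir)'"

* Save a copy; replace silently overwrites any existing auto2.dta here
save auto2, replace

* Return to the original directory
cd "`start'"
```

Run the lines together in one go, since the local macro start only exists while that block is executing.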
So let's check our current directory, and that's our current directory. If you want to change our default directory, we can simply give Stata a new directory after the cd command, like so: cd and then the new directory path. And make sure you put the directory path in quotation marks. And we've now changed our default directory. We can now use the save command to save our data into this directory. The save command is pretty simple to use, so we won't explore its help file. We simply type save, give the dataset a name, and in this case I'm also calling on the option replace, which will automatically overwrite any datasets that have the same name in that path. And there we are. We've now saved our dataset, called auto2, to the relevant directory. If we want to save our file somewhere else and we don't want to change our default directory, we can insert the relevant path straight into the save command, like so: save, then the relevant path, and then we're going to call it auto2, with the option replace. And there we are. We've now saved our dataset to that specific directory and path. Finally, if we want to save our data to another file format, we can use the export command. Let's take a quick look at its help file: help export. Here is export's help file. We see that it requires various subcommands to work. These subcommands specify which file format you want to store your data in. We can specify an Excel format, a delimited format, an Open Database Connectivity format, SAS formats, and finally a dBase format. So let's go ahead with Excel for now. Let's close this. All we need to do is use the excel subcommand and type export excel, the relevant path name, and again I'm calling on the option replace. So there we are. We've just saved our data to an Excel file. As you can see, saving data in Stata is very simple. But remember to always take care, especially with the replace option.
20.
Manipulating 8 - Converting String Data: In this video, we're going to explore string variables and learn how to convert them to numeric variables. For Stata, a string is any data character that is not a number or a missing value denoted by a dot. String data is common, and examples might include geographic information, general information or date information. Stata has a wide range of options to help you manipulate string information. The Stata commands that we'll look at in this session are the destring and tostring commands, which convert string variables to numeric variables and back. They're particularly useful for variables that contain lots of numbers. encode and decode also convert string variables to numeric and back, but are more useful when the variables contain little numeric information. So let's head over to Stata and explore string data. So here we are in Stata, and I've already loaded the first string dataset. Let's type list and examine what we have first. Here in this dataset, we only have 10 observations and some variables. All of them have numbers, so that seems good at first glance. But if we do a summary of this data, we're not going to get a lot of information. So obviously there's some kind of problem here. A summarize on our data reveals no information. The reason for this can be discerned with the describe command, which tells us that all our variables are coded as string variables. Here we see that all of them have the storage type str, for string. So that's a little bit odd, since our data clearly only contains numbers. But actually this is a common occurrence that can happen in data conversion or data import scenarios. So even though our variables only have numbers, for some reason they're coded as string data in Stata.
So obviously, we need to convert this to numeric data. Before we convert our variables to numeric variables, let's first learn how we can condition on a string. To condition on a string, we need to put the condition elements in quotation marks. So, for example, if we wanted to list all the id observations that have values of 111, we would type list if id equals equals and then, in quotation marks, put 111. Let's execute that. We now see only values of id that take the string 111. So that's how we can condition on string data. Next, let me insert a non-numeric character into one of our data cells. This will cause an error later on, which we'll want to fix. We can do that by replacing income with ABC in observation seven. OK, we'll come back to this later. Now let's continue and convert our variables to numeric variables. To do this, we're going to use the destring command. It's pretty easy, but we need to specify in the options whether we want to generate new variables or replace existing variables. So let's use the replace option and simply replace the existing variables. To do that, we will type destring star, to denote all variables, and then use the option replace. Execute that, and then let's do a summary of all our data. And there we are. We now see that we've successfully converted most of our variables to numeric variables, and Stata is able to compute statistics on these. However, one variable, income, has a problem. The non-numeric character we inserted earlier led to a problem with destring, and destring did not complete its operation. destring will automatically abort the destring process if it encounters non-numeric characters. So to overcome this, we can force destring to ignore this and continue anyway with the force option. So let's go ahead and do that. destring the variable income. We're going to replace it, but now we're using the force option so that it ignores non-numeric characters. Execute that. Let's do another summary of all our data.
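The destring steps can be reproduced on a tiny made-up dataset. The three rows below are invented purely for illustration and stand in for the course's string dataset, which is not available here.

```stata
* Hypothetical stand-in data: numbers stored as strings, one bad cell
clear
input str3 id str5 income
"111" "25000"
"112" "31000"
"113" "ABC"
end

* Conditioning on a string requires quotation marks
list if id == "111"

* Without force, income is left as a string because of "ABC"
destring *, replace
describe

* force converts the non-numeric entry to a missing value
destring income, replace force
summarize
```

The force option is convenient but silent about what it discards, so it is worth counting the missing values it creates.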
And here we see that we now have statistics on the variable income, but note that we only have nine observations. That is because the non-numeric character was converted to a missing value. If we want to reverse this process, we can call upon the tostring command. It works in exactly the same way. So tostring all our variables, again with the replace option, execute that, and now everything has been converted back to string variables. Next, let me demonstrate the encode and decode commands, which are more useful if your data cells contain a lot of non-numeric characters. We'll need to open another dataset for that, so let's go ahead and do that. Open the data. And now let's explore this dataset with the describe command. Here we see that the variable sex is coded as a string variable. Let's take a closer look at that variable with tabulate, and use the nolabel option so that we see the data underneath it. And what we see is that this variable sex is coded purely in strings. Some observations have the string text female, but other observations have the string text male. These are clearly non-numeric characters. So what do we do? destring isn't going to help us very much here. It's simply going to generate a new variable that's full of missing observations. In this case, we can use the encode command. encode has no replace option, so we only need to specify a new variable with the generate option. So to deal with this particular variable, we can write encode, the variable sex, and then generate a new variable, which we'll call gender. We can then tabulate gender and see that we've turned it into a numeric variable that has the values one and two, so that's good. Likewise, the command decode reverses this process. So to turn gender back into a string variable, we can type decode gender, and we're going to generate a new variable.
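The encode and decode round trip can be sketched on a few invented rows. The data below is hypothetical and only stands in for the course dataset; the variable names gender and test follow the walkthrough.

```stata
* Hypothetical stand-in data with a purely textual variable
clear
input str6 sex
"female"
"male"
"female"
end

* encode has no replace option; it needs a new variable name
encode sex, generate(gender)
tabulate gender, nolabel

* decode reverses the process, again into a new variable
decode gender, generate(test)
tabulate test
```

encode numbers the categories in alphabetical order and attaches the original text as value labels, which is why decode can recover the strings.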
This time we'll call it test. Execute that, tab test, and we see we've now turned it back into a string variable where the cells contain the string values female and male. So those are the two main ways we can convert string data to numeric data in Stata. destring is useful for variables with lots of numbers, and encode is useful for variables with lots of non-numeric data.
21. Manipulating 9 - Combining Datasets: In this session, we're going to look at how we can join multiple datasets together. Combining datasets is a common feature of data analysis. Maybe you have a dataset with individual-level information and wish to merge this with local geographic information from another dataset. Or imagine that you have two financial datasets that contain the same variables but relate to different industry groups. You want to combine these to create a bigger dataset. Combining data is actually very simple to do in Stata. Stata can even merge extremely large datasets in the gigabyte range without any problems. However, it is important to fully understand the two main types of combining processes. One is appending and the other is merging data. Appending adds observations to existing variables. So if your two or more datasets contain the same variables but have different observations, then joining them is called appending. Merging datasets is different, as it adds variables to usually already existing observations. Because merging adds new variables, you'll always need a merge key when merging data. The merge key must be the same in both datasets, so that Stata knows which observation goes where. Without merge keys, you can't perform a merge. A merge key is also often called an identifier, or simply an ID variable. So let's go to Stata and explore this further. Here we are in Stata, and this session includes some custom data to demonstrate key principles. So let's start with append.
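Appending, as defined above, can be sketched without the course's custom files by splitting the auto data into two pieces with the same variables and stacking them again. The split points and the temporary file name second are assumptions made only for this illustration.

```stata
sysuse auto, clear
keep make price mpg

* Put observations 6 to 10 into a temporary file
preserve
keep in 6/10
tempfile second
save `second'
restore

* Keep observations 1 to 5 in memory, then add the other five underneath
keep in 1/5
append using `second'
list
```

The appended rows always go to the bottom of the data in memory, so the final order depends on which file was open first.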
Let's first examine the two different datasets that we want to append, called append one and append two. So let's clear our data, use append one and list it, and I will do the same for append two. So here are the two datasets. They're very small and contain only five observations each. Each dataset has three variables, where one is an ID variable and the other two are just data variables. To append these datasets, we can use the append command. It's very simple to use, so we won't explore its help file. We simply type append using append one. Remember, we currently have the dataset append two open, so in this case we're going to append it with the first one, append one. Let's execute that. Now let's look at the underlying data. And there we are. Stata has now appended both datasets together. Note that append does exactly what it says. It simply appends the requested data to the bottom of the currently used data. In this case, this might leave things a little bit unordered, and we might have to reorder our dataset ourselves if that's what we want to do. Now let's head over to the merge examples. I'm going to use two datasets called auto size and auto expense, which are subsamples of our frequently used auto training data. Let's have a look at both of them: use auto size, list, and use auto expense, list. And here are both our datasets. You'll see that we have one common variable in both datasets. That variable is called make. The other two variables are different in both datasets. Also note that one dataset is larger than the other: one contains six observations and the other contains five observations. To merge both datasets together, we need to use the merge command. The merge command actually requires a subcommand to work, so let's have a look at its help file: help merge. Here is merge's help file. We can see that there are four main types of merges: one-to-one, many-to-one, one-to-many or many-to-many. There's also a special type of merge, but it's rarely used.
I won't go through every subcommand here, but what we need is the one-to-one merge, as we're merging unique observations onto unique observations. Next, we need to specify a variable or a series of variables that we're going to merge on. These variables will act as our identifier variable. In our case, we're only going to specify one variable, make. However, you can specify more than one, and Stata will treat multiple specified variables as one big identification variable. Finally, we need to specify a dataset to merge. So in this case, we're going to use the auto size data. So let's close this and let me show you an example of the merge command: merge one-to-one, we're going to use make as our ID variable, and we're using the auto size data. Let's execute that. So our merge is now complete and we get the merge report. The merge report tells us which observations were merged using which dataset. In this case, one of our observations wasn't merged. Given that this is a small dataset, let's run a list command to see what's actually happened. And here in the list output, we see that we now have one big dataset that contains all the variables from the previous two datasets. All of them were merged on the variable make. We also see here that observation six was not successfully merged across both datasets. It has missing values for price and miles per gallon, which are the variables that come from one of the merging datasets. Note that the information on which observations were successfully merged or not is also stored in a new variable called underscore merge. This variable allows us to pinpoint which observations didn't make the merge, and it's useful in big datasets. We can tabulate that variable by typing tab underscore merge, and here again we see that five of our observations were successfully merged across both datasets and one was not. So there we are. We've successfully merged two datasets and created a new dataset that contains all four original variables.
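The merge just walked through can be approximated with pieces cut from the auto data. The temporary file autosize and the choice of weight and length as the size variables are assumptions made for this sketch, since the course's own files are not available here.

```stata
sysuse auto, clear
keep in 1/6

* Build a six-observation "size" file keyed on make
preserve
keep make weight length
tempfile autosize
save `autosize'
restore

* Keep a five-observation "expense" dataset in memory
keep make price mpg
keep in 1/5

* One-to-one merge on the key variable make
merge 1:1 make using `autosize'
tabulate _merge
list
```

The sixth car exists only in the size file, so it arrives with missing price and mpg and is flagged in _merge as coming from the using data only.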
Both examples highlight how easy it is to merge and append datasets in Stata, and this ability scales incredibly well for larger datasets.
22. Manipulating 10 - Macro's and Loop's: Let's explore the concept of macros and loops. Macros are often used for advanced coding purposes. They're useful, since they can make your code much shorter and more efficient. Basically, macros are abbreviations for a string of characters or numbers. For example, a common use of macros in Stata is to specify a variable list once and then recall it again and again. Macros are also linked to loop functions that allow you to repeat a series of commands. For example, imagine you wanted to create 100 new variables that are all closely related. Do you really need to type out 100 generate commands? This is where macros can help you. Stata's macros come in many flavours. The most important macros are the local macro and the global macro. Local macros are temporary macros that will disappear after you've executed them and will only work in one do-file. Global macros are persistent throughout Stata and in all do-files until you close your Stata session, after which they also disappear. Both are useful, but we often prefer to use local macros. This is because if you forget about a global macro that you specified, you may accidentally execute it in another dataset or in another part of your code, and you may get very undesired results. Before we head into Stata, let me give you a quick overview of some specific syntax you'll need to know about macros. First, you need to specify what macro you want to create. In most cases, this will be a local or a global. Next, you'll need to give it a name, and then you'll need to give the macro either a string or an expression to execute. A string is a set of characters such as ABCD, and an expression is something like one plus one.
If you want to enter a string, you'll need quotation marks around your string, and if you want an expression, you'll need an equals sign. Once you've created the macro and you want to execute it, you'll need to type a particular set of characters. For a local, you will need to call upon the local macro name and insert that between a backtick and a single quote. Stata is very specific about this. To execute a global, you'll need to type the global macro name with a dollar sign in front of it. So let's move to Stata and explore macros using the auto dataset. OK, here we are in Stata. First of all, if you want to read more about macros, you can type the command help macro to open the main macro help file. So let's do that: help macro. And here is the main macro help file. This is not an easy read, but here you can get an overview of all the macro commands that Stata has available. Local and global are just two. There are others that create temporary variables or files. For example, the macro command with the subcommands dir and drop allows you to manage any active macros that you currently have. But I'll leave this to you for further reading. Let's close this for now and let me show you a common example of a local macro. Let's define a new local macro called varlist. We're going to type local varlist and then, in quotation marks, insert the variables price, miles per gallon and rep78. Our next line of code will then summarize that local varlist, and finally, we're going to run a regression of the dependent variable length against that varlist. Now it is important to remember that, because this is a local macro, it will disappear immediately after we've executed this line of code. So the way to retain it in all of these lines of code is to highlight everything and then execute it. And there we are. We have now obtained summary statistics for the variables price, mpg and rep78, and we've obtained a regression of the variable length against price, mpg and rep78.
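Written out, the local macro example looks roughly like this. It assumes the auto data loaded with sysuse, and the second local, two, is an extra illustration of the expression form that is not part of the demonstration.

```stata
sysuse auto, clear

* Store a variable list in a local macro named varlist (a string, so in quotes)
local varlist "price mpg rep78"

* Recall the local by wrapping its name in a backtick and a single quote
summarize `varlist'
regress length `varlist'

* An expression uses an equals sign instead of quotation marks
local two = 1 + 1
display `two'
```

These lines need to be run together, for example by highlighting them all in the do-file editor, because the locals vanish once the executed block finishes.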
So all that happened here is that we stored a string in a local called varlist and then recalled that local in our summarize and our regress commands. However, note that if I execute the regress command alone, without including the local macro command, the code will fail. That is because locals are temporary and are lost after the relevant highlighted code is run. So here's an example of just running a regression of length against the local macro varlist, and we can see that the relevant variables were not included because the local has already been dropped. If we want something more permanent, we could repeat the same exercise with a global macro. To define the global, we simply call upon the global command and give it a name. In this case, we can give it the same name, because this is a different type of macro. And again, with the same contents, we're going to insert the variables price, mpg and rep78 in quotation marks, and we can now execute this. This is now stored in Stata until we close Stata, so we can repeat our exercise and ask for summary statistics of the global macro varlist, and ask for a regression of the dependent variable length against the global macro varlist. So this can be particularly handy when you're working on a large dataset with lots of different code sections or different do-files. Finally, let me show you how we can create a simple loop in Stata. The main loop command in Stata is called foreach. Let's take a quick look at its help file: help foreach. Here we see that foreach actually takes a variety of syntaxes. But the essence of all these little subcommands is the same. For all types of foreach commands, we specify a local macro name first, then we need to specify some kind of list, and then in curly brackets we specify some code. The foreach loop will execute the code in curly brackets for every item in our list. That is a loop. So let's close this and let me show you.
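As a sketch of the global macro and of the foreach structure just described (a local name, a list, then code in curly brackets), one might write the following. The log transformation in the loop body and the final macro drop are illustrative choices, not a fixed recipe.

```stata
sysuse auto, clear

* A global persists until Stata is closed; recall it with a dollar sign
global varlist "price mpg rep78"
summarize $varlist
regress length $varlist

* foreach: a local macro name (i), a list, then code in curly brackets
foreach i in price mpg rep78 {
    generate log`i' = log(`i')
    summarize log`i'
}

* Remove the global so it cannot leak into later code
macro drop varlist
```

Dropping the global at the end is one way to avoid the accidental reuse problem that makes locals the safer default.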
Imagine that we want to transform a specific set of variables to logs and then run a summary on each of those variables. That might involve quite a bit of typing. However, we can use the foreach command to create a looping code for that. The basic version of foreach creates a local from a list and keeps cycling through the list until it finishes the list. So all we need to do is define the name of the local, give it the list and then some exercise to perform. So in this particular case, we're going to call upon foreach, we're going to call the local i, and we're going to give the list. In this case, the list is price, miles per gallon and rep78, and then in curly brackets is our main code. For each item in the list, we're going to generate a new variable that is log i. So we're going to start with log price, and that variable is just the log of i. After we've generated a new variable, we're going to summarize that particular variable. So what's going to happen is that price is going to be inserted here, here and here, and then we're going to move over to mpg, which is going to be inserted here, here and here, et cetera, et cetera, until we finish the list. So let's execute this and see what happens. We've now generated three new variables, logprice, logmpg and logrep78, and summarized them. I hope you can see how useful this could be with large datasets or with lots of variables. We could have done this for tens or hundreds of variables instead of only three. Finally, let me show you a quick example with a sibling command of foreach, forvalues. This command allows us to quickly cycle through numeric values rather than long lists that we need to type out every time. So, for example, if we wanted to create 20 random variables very quickly, I could type the following: forvalues, we're going to call the local i, and then we're going to ask the list to count from 1 to 20.
Inside the loop, we're going to generate a new variable called x i that is randomly normally distributed. So let's do that. Execute that. And now do a sum of x star, and we see that we very quickly generated 20 new variables in our dataset. So this can be very useful to know. Local and global macros and loops are very useful commands to know about and often form the basis of more complicated programming code. Hopefully, you can see that having just a basic understanding of these commands can save you a significant amount of work.
23. Manipulating 11 - Accessing Stored Information: Let's examine how to access memory-stored results from Stata's estimation commands. All Stata commands store their results first and then access some of these results to generate an output table for users to see. This output table, however, won't include all results that are available, and sometimes you may want to see results that you don't actually see. Learning how to access such information will allow you to perform your own custom computations or introduce further elements of automation into your code. For example, imagine we want to multiply the mean of a variable by something, but we're still working on our final data sample. In other words, the mean of this variable may continue to change until we finish working. Rather than doing the calculation again and again by hand, we could introduce some code that takes the stored mean and then multiplies it by whatever value. To explore what is stored by each command, we're going to use the return command. The return command comes in various flavors, of which the two most important ones are the return commands for r-class and e-class type results. What are r-class and e-class results? Well, most of Stata's descriptive commands are r-class commands, whereas commands that perform estimation procedures such as regression analysis are often e-class commands. You type return list for r-class commands and ereturn list for e-class commands.
With these, we can explore what kind of information is stored by each of these commands. Once we have that information, we can do various things with it. In this session, I'll show you how to perform simple computations with the display command, which is Stata's version of a hand calculator. I'll also show you how to explore stored matrices with the matrix list command. And I'll show you a shortcut to access stored coefficients and standard errors by calling on Stata's system variables. These are also called underscore variables. So let's go over to Stata and see how this works. Here we are in Stata, and I've already preloaded the auto dataset. Let's go ahead and run a summary on price and see what it stores. summarize is an r-class command, so I'm going to use the return list command to access the stored results: summarize price, and return list. And there we are. We now see that summarize stores various scalars, or single numbers, and their names all take the form r with round brackets. We can now take the stored scalars and put them into another Stata command. For example, we can use Stata's display command, di, to compute what the mean times two of that variable would be. So let's type that: display r(mean) times two. Execute that. And in this case, multiplying the mean of price by two gives us around $12,300. Next, let's go ahead and look at the stored results from a regress command. First, we have to estimate the regression. So let's go ahead and do that. Let's regress price against miles per gallon and length. That's our regression, and regress is an e-class command, so we can examine its stored content by typing ereturn list. Let's go ahead and do that. And here we can see what regress stores. regress stores scalars, macros, matrices and a function. All of these can be used in further code. Scalars are single numbers, whereas macros consist of text.
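The stored-results steps shown so far can be sketched as follows, assuming the auto data is loaded with sysuse.

```stata
sysuse auto, clear

* summarize is r-class: inspect what it stores
summarize price
return list

* Reuse a stored scalar: twice the mean of price
display r(mean) * 2

* regress is e-class: inspect what it stores
regress price mpg length
ereturn list
```

Because r() results are overwritten by the next r-class command, a value that is needed later is best copied into a local or scalar straight away.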
Matrices contain various matrices with multiple numbers, and functions are normally just a single function that allows users to condition something they do later on the estimation sample. Let's go ahead and focus on the matrices that are stored from the regress command. regress stores two types of matrices. It stores both an e(b) matrix and an e(V) matrix. The e(b) matrix stores the estimated coefficient values, whereas the e(V) matrix stores the variance-covariance values. You can examine these stored matrices by using the matrix list command, like so: matrix list e(b) and matrix list e(V). And here we've now shown the estimated coefficients and the variances and covariances for each of our variables. We could use other matrix commands to perform matrix computations or to extract parts of these matrices. Thankfully, if we're only interested in extracting coefficients and standard errors, Stata makes this process simpler by storing these as system variables. System variables all begin with an underscore. Variable coefficients can be accessed by typing display underscore b, and then in square brackets the variable name. So to multiply the coefficient of miles per gallon by two, we would type display underscore b, square brackets mpg, times two. And there we are. We've now automatically multiplied the estimated coefficient of mpg by two. Likewise, we can do the same for standard errors by typing display underscore se, square brackets mpg, times two. And we've now multiplied the estimated standard error of mpg by two. The most common use for recovering stored information from a Stata command is for automating your do-files. The above techniques allow you to construct code that is independent of how your data looks or what numbers are produced. They allow you to be more dynamic and flexible when exploring and recoding new datasets, and they're something that advanced users of Stata should consider learning and doing.
24.
Manipulating 12 - Multiple Loops: Let's explore multiple loops. Stata loops can consist of multiple loops. In other words, we can have a loop inside a loop, inside a loop, and so on. The application of multiple loops is often highly contextual, but when used properly, it can save a lot of time. Often you'll need to have a specific dataset open and a specific manipulation or exploration problem to use double loops effectively. I can't say I've used double loops a lot in my career, but when I have used them, they've made a big difference. It's almost impossible to describe a standard situation for multiple loops, so you have to think carefully about when to use them. Here's an example of the code we'll explore later in Stata. I have color-coded this code to identify the different parts of what's happening here. Notice that there are three distinct elements to this code. The green code executes what we want Stata to perform many times over two different locals, local x and local i. The two locals are defined in two different loop commands. The first is the blue foreach loop, which defines a local x to be the entire variable list of the currently open data. The next local is defined in the forvalues loop, which counts a range of numbers from 10 to 50 in increments of 10. In other words, it will count 10, 20, 30, 40, 50. So ultimately what will happen with this code is that for each variable in our dataset, a new variable will be created that is the variable multiplied by the number i. We'll see in a moment how using this code on the auto data will create a new variable list almost instantly. So let's head into Stata and take a further look at all of this using the auto data. Here we are in Stata, and the auto data is already loaded. Let's go ahead and take another look at this double loop we have here. Let's start in the middle, where we call generate to create a new variable whose name is x and i, and that itself is equal to a computation, namely x times i.
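A sketch of such a double loop is given below, using the 10 to 50 range from the description. One adjustment is an assumption of this sketch rather than part of the code on screen: the outer list is restricted to numeric variables with ds, because the string variable make cannot be multiplied.

```stata
sysuse auto, clear

* Collect the numeric variables only (make is a string variable)
ds, has(type numeric)

* Outer loop over variables, inner loop over 10, 20, ..., 50
foreach x of varlist `r(varlist)' {
    forvalues i = 10(10)50 {
        generate `x'`i' = `x' * `i'
    }
}
describe
```

The variable list is fixed when the outer loop starts, so the newly generated variables are not themselves looped over.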
Now remember, x and i here are locals. So let's have a look to see where these locals are defined. i is defined up here, and this is simply a count. In this case, we're counting from 10 to 100 in steps of 10. So 10, 20, 30, 40, et cetera, all the way to 100. So we can already see there are ten different versions of i that are going to be inserted here and here. The x local begins up here, and this is simply the variable list. Because we're using a star here, every single variable in our dataset is going to be inserted into that local. So we're going to go through make, price, mpg, all the way down to foreign. So what's going to happen with this double loop, then, is that the variable make is going to be multiplied by 10, 20, 30, 40, 50, et cetera. And then we're going to move to price: 10, 20, 30, 40, et cetera. And then to mpg, et cetera, et cetera. Let's see what happens. OK, let's run a describe command. And we see that we immediately generated a gigantic dataset that contains all the original variables multiplied by the values 10, 20, 30, 40, et cetera. So this is an example of where we're using a double loop to perform a very quick data manipulation exercise. Had we done this by hand and typed out every generate command separately, you can imagine how long this would have taken. Doing it with two loops took seconds. Hopefully, this example demonstrates the power of double loops to you, but you will need to think carefully about how to use them. However, if you're stuck with a lot of repetitive data manipulation or exploration, this might be a way to save you a significant amount of time.
25. Manipulating 13 - Date Variables: Let's explore how to deal with date information in Stata. Date information often comes in string format and needs to be converted to a numeric format. However, string date formats can also take many different shapes. For example, in some countries, the month is written before the day, followed by the year. In other parts of the world, the sequence is day, month, year.
Alternatively, sometimes date information is already in a numeric format, but it could be spread across multiple variables. For example, one variable measures the year, another variable measures the month, and another variable measures the day. In that case we'll need to piece all of this together. And finally, you should be aware that a Stata date is simply a number which counts the number of days since the first of January 1960. In other words, Stata will interpret the number one as being the second of January 1960. Here's an example of a small dataset with date information. The first variable contains a date in string format. It has numbers, letters, and numbers. The second variable has some important data that we want to analyze. And the third variable shows where we need to get to with the first variable. This data is in the correct date format for Stata. This variable is a numeric variable that counts the number of days since the first of January 1960. To create such a Stata-specific date variable we'll need to use a variety of functions. The date function is a highly flexible and customizable function that can read many different string formats and convert these to Stata's numeric date format. The key to using it is to specify a variable that contains the string date information, and then to specify a permutation of MDY that translates this string information accordingly. For example, if days come before months before years, we would enter DMY into this code. Alternatively, if the string is in years, months, days, we would use YMD. One aspect to note is that sometimes years are given as two digits. In that case Stata might have problems knowing what century we are referring to. So we can specify the century before Y. Raw Stata date numeric variables can be hard to read for users since they only contain numbers. It is therefore common to format a numeric date variable to make it easier for users to read.
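As a rough illustration of the date function and display formatting described above, here is a minimal sketch. The toy data, the variable names datestring1, value and date2, and the exact string layout (day, three-letter month, two-digit year) are assumptions for illustration, not the course's training dataset.

```stata
* Minimal sketch: convert string dates to Stata's numeric daily dates
* (toy data; names and string layout are assumptions)
clear
input str9 datestring1 value
"01jan18" 10
"02jan18" 12
"07jan18"  9
"08jan18" 15
end

* Mask "DM20Y": day, then month, then a two-digit year in the 2000s
generate date2 = date(datestring1, "DM20Y")

* Raw values are days since 01jan1960 and are hard to read
list

* Apply the daily date display format so the values read as dates
format date2 %td
list
```

The underlying numbers do not change when the format is applied; only the way Stata displays them does.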
There are also four other useful functions to know about. The year function allows us to extract a year from a Stata numeric date variable, and the same goes for the month and day functions. Finally, the mdy function allows us to create a full date variable from three separate numeric variables. So let's go ahead and see how all of this works. Here we are in Stata with the associated training dataset already loaded. This is a custom dataset called date that is part of this session. Let's explore it by using the describe and list commands. Here we can see that this dataset contains two variables. One is a string variable called datestring1, and the other is a variable containing data. We can also see that there are 31 observations in this data, one corresponding to each date in January 2018. Now, let's go ahead and convert that date string variable to a numeric variable. We can do that by using the date function, like so: generate a new variable, let's call it date2, that is equal to the date function. We then specify the string variable, that is datestring1, and enter the permutation of MDY, which in this case is day, then month. And then here we're also going to add the 20 before Y, to tell Stata that we're talking about the 21st century. Let's go execute this. And then let's list our data. Here we can see that we've created a new variable called date2, which is full of numbers. And that's a good thing. However, these numbers don't have a lot of meaning to us right now, since they simply count the number of days from the original base date. Therefore, we often format such data using the format command, like so: format date2, and then we're going to use the %td format, which is time in days. Let's relist the data. And now we can see that the information makes more sense. We can now visually interpret the variable date2. Once we have a date variable set up, we can then use the functions year, month, and day to split out separate year, month and day information.
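The year, month, day and mdy functions mentioned above can be sketched like this. It is a self-contained sketch under stated assumptions: the 31 daily dates for January 2018 are built directly rather than loaded from the course dataset, and the variable names (date2, date3, value) are illustrative.

```stata
* Minimal sketch: split a daily date into parts and rebuild it
* (toy data; variable names are assumptions)
clear
set obs 31

* One observation per day in January 2018
generate date2 = mdy(1, _n, 2018)
format date2 %td
generate value = _n

* Extract the components as plain numbers
generate year  = year(date2)
generate month = month(date2)
generate day   = day(date2)

* Reassemble a daily date from the three numeric parts
generate date3 = mdy(month, day, year)
format date3 %td

* The rebuilt date should match the original on every row
assert date3 == date2

* One way to condition on a date: the first seven days of January 2018
summarize value if date2 <= mdy(1, 7, 2018)
```

Comparing against mdy(1, 7, 2018) avoids having to look up the raw day count by hand.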
So here's an example of that: generate year equals year of date2, and the same for month and day. And then let's list. And now we have three new numeric variables with the year, month, and day in numbers. And finally, we can use the mdy function to merge this information back together. So we can generate another new variable, let's call it date3, and then specify mdy, and then insert the variables month, day, and year. Execute that, and then relist our data. And here we can see that we've reconstituted the date information back into one variable, date3. So how can we use this date information in our statistical analysis? Well, let's assume that we want to condition some sort of summary statistics on only the first week of data in January 2018. We could do this in various different ways, but here are three possibilities using the previous functions we've learned about. We can summarize a variable if date2 is simply less than a specific number, in this case 21,191. And that corresponds to the first seven days of January 2018. We could also specify the three separate variables that we've created. So if the day is less than seven, the month is one and the year is 2018. And that gives us the same results. Or finally, we can use the mdy function to condition the date2 variable on a specific date. So summarize variable if date2 is equal to or less than, and then mdy, where we simply add the first month, the seventh day and 2018. And here we are with the same set of results. The last option is probably the most flexible way of doing it. There's a lot more to be learned about how to handle dates and time in Stata, but hopefully this session will have given you a short initial primer. And that concludes this introduction to date information in Stata.
26. Manipulating 14 - Subscripting over Groups: Let's take a look at Stata's subscripting capability and how we can use it to help us with data over groups.
If you're using complex data, there often comes a point when you need to generate new variables over group characteristics. In other words, you might be trying to figure out what the maximum income of each family member is in a household survey, or what the highest temperature in each city was. To perform any kind of group recoding in Stata, we need to know about three things. We need to understand what the underscore variables are and how they work. We need to learn about Stata's subscripting capability. And we need to learn about Stata's by prefix. So let's have a quick look at each of these three in turn. Stata has a few system variables up its sleeve that do very specific things when you call them. These are also called underscore variables because they all have an underscore in their code. The most commonly used underscore variables are _n (underscore small n) and _N (underscore capital N). Small n returns the number of the current observation, whilst underscore capital N returns the number of total observations in the dataset. Both are often used to create observation counts or even identifier variables. Next, let's take a look at Stata's subscripting capability. Subscripting in Stata's code allows individual observations on variables to be referred to. For example, we can ask Stata to evaluate the second value of a variable. This might be a very specific number that we're interested in, and we want to run further computations over that specific value. So-called explicit subscripts are specified by following a variable name with square brackets that contain an expression. That expression is often just a number specifying the observation number we want, but it can also include more complex code, for example with ifs and even the underscore variables. For example, to select the last observation in the dataset, we could specify _N in square brackets. We'll see how this might be useful in just a moment. Finally, we need to know about the by command.
This command allows us to repeat Stata commands over subsets of our data. This command comes in two flavors, one without a sort option and one with a sort option. The sort option is more commonly used. This command is a prefix command and therefore works with many other Stata commands. To use it, we simply specify the bysort prefix and one or more variables we want to repeat our analysis over, and then end the prefix with a colon. Anything we specify after the colon will then be repeated over the variable list we specified. Again, I'll show you how this can be useful in just a moment. So now that we have a basic understanding of these three concepts, let's go to Stata. Let me show you how we can use this in practice. Here we are in Stata with the training data already loaded. Let's start with the list command to look at some real data. I'll only specify the first 20 observations of a few variables to keep the information manageable. OK, so here we have some raw data for our cars. Let's concentrate on the variables price and repair status. We have various prices and repair categories in this data. Currently our data seems to be ordered alphabetically on the name of the car. Let's go ahead and use the two underscore variables and see what happens. The easiest way to use them is to generate two new variables: one that is equal to _n, and one that is equal to _N. Small n is often used to generate an ID count for observations, whilst capital N is often used to construct a total count. Let's go ahead and run the generate code and see what happens. OK, now that's done, let's list our data again and see what we have. We see that we generated two new variables: one that simply counts the observation number from 1 all the way up to 74, and another that counts the total number of observations in the data, that is 74. OK, good. That may or may not be useful at this point.
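The two generate commands just described might look like this. It is a minimal sketch assuming the auto data that ships with Stata; the variable names id and total are illustrative choices rather than the names used on screen.

```stata
* Minimal sketch of the underscore variables (variable names are assumptions)
sysuse auto, clear

* _n is the number of the current observation: 1, 2, ..., 74
generate id = _n

* _N is the total number of observations: 74 on every row
generate total = _N

* Look at the first 20 observations of a few variables
list make price rep78 id total in 1/20
```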
But let's push on and put it all together later. Next, let me show you subscripting. Subscripting allows us to choose specific values of variables. For example, to generate a new variable with the first value of price, we would use the following code: generate a variable, let's call it new1, equal to price, and then in square brackets we would insert the value one for the first observation, i.e. the first value. The expression in square brackets indicates that we want the first value of price. Remember, if we sort our data, this value might change. We're not choosing the maximum or minimum value, just the first value of whatever the current sort scheme is. However, we can introduce more complex expressions into our subscript. For example, we can tell Stata to choose the relevant row value by using _n, and then subtract one row from that by using the formula minus one. And that replaces the current price value with the previous price value. So let's go ahead and run both sets of code and explore what happens. Now let's list our data and these new variables. And look at that. The new price variable contains the first value of price for each observation, in this case 4,099. The new price2 variable contains the previous price observation. It's missing for the first value of our data because there is no price value before the first value. The second value now has data from the first, the third value now has data from the second, et cetera. Again, so far, this is probably not too useful. However, let's now integrate both types of code with the bysort command, which allows us to repeat code over subsets of our data. For example, an easy method to generate a count of observations in a group is to use the underscore variables with a bysort prefix, like so: bysort, and then we name a variable, in this case rep78.
And then we're going to generate a new variable that is equal to _n, or _N. The first line of code will generate the observation number for each group and then restart the numbering when it jumps to a new group of rep78. The second line of code generates the maximum number of observations for each group. So let's go ahead and execute both and see what happens. We'll also list our data. And look at that. We've now sorted our data based on the variable rep78. There are two observations with category one, and these are now coded as 1 and 2 in the variable new ID. The total count of observations in this group is two, and that value is given to us by the variable total. For the next category of rep78, there are a total of eight observations, coded one to eight. Perfect. Now let's take it one step further. Let's pretend we want to retrieve certain price characteristics of repair groups. A common price characteristic might be the maximum or minimum price of a car within a specific group. We can achieve that by modifying our previous code to the following. We first add a second variable in round brackets to the bysort prefix. This tells Stata to first sort on the first variable, rep78, and then sort on the second variable within the first. So we're going to get a data sorting that first sorts on repair record and then, within each group, sorts on price. The lowest price will be first and the highest price will be last. Therefore, to then obtain the maximum price, we can simply specify a new generate command that chooses the last value for maximum price, or the first value for minimum price. So let's go ahead and execute this code and see what happens. And there we go. We've now computed the maximum and minimum car price within each repair group. For the first group, the maximum price is 4,934 and the minimum price is 4,195. Nice job. Of course, this is a pretty easy example.
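Putting the three concepts together, the steps walked through above can be sketched as follows. This is a minimal sketch assuming the auto data that ships with Stata; the new variable names (new1, new2, newid, total, maxprice, minprice) are illustrative and may differ from those on screen.

```stata
* Minimal sketch: subscripting and the bysort prefix (names are assumptions)
sysuse auto, clear

* Explicit subscripts: first value of price, and the previous row's price
generate new1 = price[1]
generate new2 = price[_n-1]
list make price new1 new2 in 1/5

* Within each repair group: running count and group size
bysort rep78: generate newid = _n
bysort rep78: generate total = _N

* Sort by price within each repair group, then pick last and first values
bysort rep78 (price): generate maxprice = price[_N]
bysort rep78 (price): generate minprice = price[1]

list rep78 price newid total maxprice minprice in 1/10
```

One caveat with this approach: Stata sorts missing values last, so if a group contained a missing price, price[_N] would return missing rather than the largest observed price.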
But now that you understand the three concepts of underscore variables, subscripting and bysorting, you can generate very complex new variables yourself. Lastly, just a small tip on maximum and minimum values within a group. You can achieve the same effect with the egen command using the max and min options, like so. And there we go. We've just repeated our analysis. Using the egen command is slightly safer as it deals with missing values for you. In the previous code, I neglected that. Often egen will have a faster and safer option for you, but it's important to have the knowledge to do it by hand yourself if you must.
27. Visual 1 - Introduction to Graph command: Visually exploring and presenting data is an important part of data analytics. It helps oneself and others better understand what the data is doing. Often a picture is worth a thousand words, and whilst that is not always true, pictorial data representations can have enormous impact. Thankfully, Stata has a wide range of graphical capabilities. These videos will take a look at how to create some of the most commonly used graphs, and this video will explore some basic graph concepts and the graph command before moving on to more specific graph functions in other videos. Before we go to Stata, here's a small overview of exemplar graphs that Stata can produce. As you can see, Stata offers a wide range of professional-looking graphs to present data visually. Stata graphs can be produced using Stata's graphical user interface or its command language. Due to the complexities of creating professional-looking graphs, it is often a combination of both. Let's head over to Stata and take a look. Here we are in Stata with the auto data already open. Let's have a look at the help entry for the graph command. Let's type help graph. This opens up the help file for the command graph. The graph command is an omnibus command that has a variety of different subcommands.
For example, we can invoke the graph command with a relevant subcommand to draw a graph. However, in many cases, the primary graph command isn't actually needed. For example, the twoway subcommand also works without specifying the graph command first. The reasons for this are historical, which we don't need to go into. However, you will need to specify the main graph command if you want to save or load graphs, or if you want to combine graphs. The graph command is also used for a variety of other purposes, such as changing the scheme of graphs. Let me show you some examples. Let's go ahead and create a quick graph, and let me show you how we can then edit that graph. I'm going to call upon the graph twoway command here, which is a special command. It allows us to overlay multiple graphs on top of each other. The key thing about twoway is that we can wrap multiple graph commands in round brackets and then have Stata overlay these graphs on top of each other to create one picture. Take a look at this example, where I overlay two scatter plots on top of each other. We've now created our first graph in Stata. The graph itself is open in the graph editor, and this also contains a menu function to manually save the graph. This can be quite useful, as the menu function allows you to quickly change the file type of the graph in case you want to export it to a Word document or to a webpage. We can close the graph, but it will remain in Stata's memory. And even though we closed the graph, we can still save it manually by typing graph save and then test as the file name. Here we're using the option replace to overwrite any current graph that already exists there. In this case, we didn't specify a file path, so the graph is saved to the default directory. If you want to open the graph again, we type graph use and then the file name of the graph, and there's the graph again. You should also note that different schemes exist.
For graphs, a list of these schemes can be accessed by typing graph query with the option schemes, like so. These are the default schemes that come with Stata, but more can be downloaded from the Internet. Let's change our scheme to the economist scheme, which makes graphs look like the ones used in the publication The Economist. We can do that by typing set scheme and then economist. Now let's recreate our previous graph. That looks quite different, doesn't it? In addition, there are some really useful free add-on packages that can be downloaded to further allow you to modify your graphs and how they look. Now let's go ahead and do some manual editing of our graph. We can edit graphs by right clicking on the graph and then starting the graph editor. Stata graphs consist of multiple objects that are constructed together to create one graph. You can edit these objects directly via the right-hand menu, simply by clicking on them and then double clicking on them to edit. You can also double click anywhere in the graph and edit that particular bit, for example like so. Finally, if you're using a Windows PC, there will be tools on the left, and if you're using a Macintosh, there are tools here at the bottom that allow you to add extra text or markers to the graph, like so. Stata's graphs are incredibly customizable and offer a lot of flexibility. Mastering them, however, takes time.
28. Visual 2 - Bar Graphs and Dot Charts: Bar graphs and dot charts are commonly used graphical representations of data. Bars or dots and their relative differences are used to visually display statistical differences, such as frequencies between categories. Alternatively, values of another variable can be compared for each category, for example the average price of cars across different groups. Bar graphs and dot charts offer an effective way to advertise simple statistical properties between different categories and variables, and offer a good counterpoint to big
summary tables. In this session we'll explore how to create bar graphs and dot charts in Stata. So let's head over to Stata and create some graphs. Here we are in Stata, and I've already loaded the auto dataset, which we'll use to explore these graphs. However, before we continue: this is one of the few occasions where I recommend using the graphical user interface first, as opposed to the command syntax. That is because these commands need a lot of options to work properly. Let me show you by looking at the bar chart help file. Let's type help graph bar and execute that. Here's the bar chart help file. We can see that the command actually consists of two commands: the graph bar command, which produces a vertical bar chart, and the graph hbar command, which produces the same chart but horizontally. One of the reasons why this command is a bit complicated is that it comes in four different flavors, and each flavor needs a different type of code. It can be really hard to figure out the first time, so let's switch to the graphical menu, but keep the command syntax visible in our do-file. Let's start with a simple bar graph, and let's try to display the frequency count of the categorical variable rep78 graphically. To do this, we'll click on the Graphics menu and select Bar chart. This opens the bar chart menu, where we can select what we want Stata to do. There are the usual menu options, such as the if and in conditions, weights, et cetera. But there are also a few options we should know in more detail before we proceed. In particular, bar graphs in Stata come in four different flavors. We can impose summary statistics of other variables on the group or multiple groups of variables. This choice is the default option. We can also display frequencies of categories. We can also graph percentages of frequencies of categories. And finally, we can graph actual data as is. The last option is only useful
if you have a dataset with a unique observation per category. Let's select the second option to create a frequency graph. Somewhat confusingly, the menu now shows us no further options. We need to click on the next tab, which is the Categories tab. Here, let's select Group 1 and enter rep78 as a category. Now let's submit that, and there we go. Here's a frequency graph of the repair record. We see that the most common category is three, followed by four. Great. Did you notice that we were able to choose up to three categories for this graph? Let's add another variable to this graph to see what happens. Let's add the variable foreign. We've now split the frequency of the repair categories over the variable foreign. This is a really good way to compare frequencies across many different categories quickly. If we wanted these frequencies in percent instead, we can select the third option in the main menu. Let's submit that. Here is the previous graph, but instead of reporting frequencies, we now report percent. Now let's modify this graph, and rather than showing the frequency or percent for each category, let's plot the average value of price and also the standard deviation of price. We can do this by selecting the first option in the main menu and selecting the relevant statistics and variables that we want to display, like so. We're starting to display a lot of information on our graph. We can see that domestic cars with repair records of three have the highest car price and also the highest standard deviation. If we wanted to have a lot of summary statistics on such a graph, then a dot chart is another useful option. Rather than creating separate bars for each statistic, it plots the requested summary stats as dots on a single line for each category. To create the dot chart, let's go to the Graphics menu and click on Dot chart. Let's enter the same information, and there we are. This is a dot chart.
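In command form, the menu choices above correspond roughly to the following. This is a hedged sketch assuming the auto data that ships with Stata; the exact code the dialog writes to the Results window may differ (for example in option order), and older Stata releases may not accept (count) or (percent) without a variable.

```stata
* Rough command equivalents of the menu steps (a sketch, not the dialog's exact output)
sysuse auto, clear

* Frequency of each repair category
graph bar (count), over(rep78)

* Frequencies split by foreign within each repair category
graph bar (count), over(foreign) over(rep78)

* The same split, reported in percent instead of counts
graph bar (percent), over(foreign) over(rep78)

* Mean and standard deviation of price for each group
graph bar (mean) price (sd) price, over(foreign) over(rep78)

* The same summary statistics drawn as a dot chart
graph dot (mean) price (sd) price, over(foreign) over(rep78)
```

Swapping graph bar for graph hbar draws the same bar charts horizontally.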
The dot chart contains the same information as the previous bar graph, but by having each statistic on one line, it creates a more compact visual graph. If you have a lot of summary statistics, then this can be a good way to present data. However, note that dot charts are not for frequency tabulations. They're only for summary statistics. Hopefully, you can see that creating graphs in Stata is quite easy, and with a bit of practice you'll quickly move away from the graphical menu option to writing these graphs in code, as shown in the do-file.
29. Visual 3 - Distribution Plots: Examining how variables are distributed is an important part of understanding data. This helps avoid mistakes later on. For example, by exploring the distribution of a variable we can identify whether data is evenly spread out or whether it is concentrated around particular values. Likewise, we may want to know whether we have long data tails around our average data, perhaps to the right or the left, or even both ways. In practice, one should always explore the distribution of key variables before doing anything with them. In this session we'll explore how Stata can graph distributions. We won't have time to explore all the possible ways, but we'll have a look at the most commonly used methods. Let's go over to Stata and have a look. Here we are in Stata using the auto data. We're interested in the variable price. Specifically, we want to visually see what kind of distribution the variable price has. Let's begin with a histogram. Let's have a look at its help file first. Let's execute help histogram. The command histogram is a relatively simple command, but it has a few options we should know about. Histogram can take only one variable and will create a histogram of continuous and categorical data. Generally, however, it is more useful for continuous variables. Plotting binary variables doesn't really reveal all that much. Histogram is a type of nonparametric plot, and one of the important options used is the number of bins or the bin width. These options can radically alter the look of the histogram. Let's plot a histogram of price. To do that we'll type histogram price. Here is a traditional histogram with eight bins. Each bin represents how much data is in a particular data range. The graph tells us that the variable price is not normally distributed. It has a long tail to the right. In other words, there are a few observations with very high values. We can change the number of bins by using the option bin. Let's go ahead and do that. Let's execute the same histogram, but this time with 20 bins. This now gives us a bit more detail, but it also leaves some gaps in our histogram. You'll have to decide for yourself what is more useful to you. Here's another example where we changed the bin width to 5,000, and this results in a different kind of pictorial representation of the histogram. The command histogram has a few other options, and I recommend you explore these in your own time. But now let's move on to another type of distribution plot, a kernel density plot. A kernel density plot is very similar to a histogram, except that it uses some fancy mathematics which connects the bins via a line. What you get is a smoother, line-based representation of the distribution of a variable. Let's have a look at its help file first. We can do that by executing the command help kdensity. Like histograms, kernel density plots only take one variable. Also like histograms, kernel density plots have a bandwidth option that allows you to choose the size of the bandwidth that you want to use. A smaller bandwidth leads to a more detailed plot, and a higher bandwidth leads to a less detailed plot. Another option is the kernel option, which allows us to change how the underlying line is actually computed. I don't really recommend changing it unless you have very specific reasons.
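The histogram and kernel density commands discussed above can be sketched as follows, assuming the auto data that ships with Stata. The bin count, bin width and bandwidth values are the ones mentioned in the session; each command draws a new graph that replaces the previous one.

```stata
* Minimal sketch of the distribution plots (auto data assumed)
sysuse auto, clear

* Histogram with Stata's default number of bins
histogram price

* Same histogram with 20 bins
histogram price, bin(20)

* Same histogram with a fixed bin width of 5,000
histogram price, width(5000)

* Kernel density estimate with the default bandwidth
kdensity price

* A much smaller bandwidth gives a more detailed, less smooth line
kdensity price, bwidth(100)
```

The bin() and width() options are alternatives: specify one or the other, not both.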
Now let's go ahead and execute a kernel density plot with the variable price. To do that we'll type kdensity price. And there we are. This looks similar to the previous histogram, just a little bit smoother. We can clearly see that price is not normally distributed, and it has some extreme values on the right-hand side. Kernel density plots, like histogram plots, can be wrapped in the twoway command, allowing you to overlay multiple plots on top of each other. We'll explore that more in another video. However, this can be useful if you want to compare price distributions for particular groups, for example. And lastly, let's explore a kernel density plot that has a different bandwidth, in this case a bandwidth of 100. There we are. We can see that this results in a much more detailed plot. Again, whether this is more useful than the previous plot is something that you need to decide. The next plot is a quantile plot. A quantile plot orders the values of a variable against the quantiles of a uniform distribution. We're not going to examine its help file because it doesn't have a lot of options. We call a quantile plot by simply typing quantile price. There we go. The 45-degree line is a reference line, and if prices were distributed uniformly, then they would be plotted along this line. However, because all the points are below the reference line, we know that the price distribution is skewed to the right. If the points were above the reference line, the variable would be skewed to the left. We also know that this variable is not uniformly distributed because the data points are not on the reference line. Next, let's look at a box plot. Let's examine its help file first. We can do that by typing help graph box. There are actually two versions of this command. Vertical box plots are called with the graph box command, whilst horizontal box plots are called with the graph hbox command. Both commands can take multiple variables.
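A short sketch of the commands mentioned above, assuming the auto data that ships with Stata. The overlay of two kernel densities by group and the choice of price and weight as a two-variable box plot are illustrative assumptions, not examples taken from the session.

```stata
* Minimal sketch: overlaying densities, quantile plot and box plots
sysuse auto, clear

* Overlay price densities for domestic and foreign cars with twoway
twoway (kdensity price if foreign == 0) ///
       (kdensity price if foreign == 1), ///
       legend(order(1 "Domestic" 2 "Foreign"))

* Quantile plot of price against the quantiles of a uniform distribution
quantile price

* Vertical and horizontal box plots of a single variable
graph box price
graph hbox price

* Both commands accept more than one variable
graph box price weight
```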
The most useful option here is the over option, which allows us to examine box plots over different groups. Let me show you an example. Here we are. The box plot gives us an idea of the distribution of each variable by showing us the median, the 25th and 75th percentiles, the upper and lower whiskers of the distribution and any outliers. Box plots are particularly useful if you want to compare distributions across groups, and you can specify up to three over groups with this command. Here's a quick example, with foreign being used as the over group. Pretty neat, right? Foreign cars appear to be pricier than domestic cars. Finally, I'd like to show you a command called gladder. It doesn't have a lot of options, and it takes only one variable, so we won't bother looking at its help file. Gladder transforms a base variable with different power transformations. This can be a good way to examine what kind of distributions result from different transformations. Here's an example with price: gladder price. The actual variable in our dataset is represented at the top right of this graph. The other eight graphs are different data manipulations of the original variable, and we can see how different transformations lead to different distributions. As you can see, it's relatively easy to examine distributional properties of variables graphically in Stata.
30. Visual 4 - Pie Charts: Let's explore pie charts in Stata. Pie charts offer a powerful visual effect by summarising mainly categorical data into a circular picture. They're very intuitive to interpret and understand, and are often used in the business world. They're relatively easy to create in Stata, although, as for many graph commands, the underlying code can be somewhat complex when first encountered. That's why we'll focus on creating a pie chart using the graphical user interface first, but we'll make sure to capture the resulting code structures in our do-file.
Let's go ahead and explore how to create pie charts using the auto training dataset. Here we are in Stata, and I've already loaded the auto training dataset. Pie charts are often used for categorical variables. rep78 is a categorical variable. It has five categories, and its values are distributed across these categories. Let's tabulate it: tab rep78. Tabulating this variable gives us a one-way table that shows us the distribution of these values across the repair categories. If we wanted to turn this into a pie chart, we can do so using the command graph pie. However, this command needs quite a few options to function well, so it's better to use the graphical user interface the first time. But before we do that, let's take a quick look at the help file for graph pie. We can do that by typing help graph pie. The help file reveals that there are actually three modes to graph pie. We'll only focus on the third mode, which plots the frequencies of a categorical variable. It's the most commonly used version. It doesn't require variables after the main command, but it does require the over option to be invoked with the relevant categorical variable. There are also a lot of options on how to customise pie charts. We won't look at these in detail, but I will highlight some in the graphical user interface.
Let's close this and go to the Graphics menu and click Pie chart. There are various options here, but the default option is to graph the variable by categories, which is what we want. Let's select rep78 as the category variable and click Submit. We've now created our first pie chart. It shows to what extent the repair record of cars from 1978 is distributed across the various repair categories. Also note that Stata actually sends the command to the Results window. This is useful to have, as we can copy and paste this code into our do-file to regenerate the same graph later on, like so. Let's go back to the Graphics menu. This particular pie chart doesn't look very nice.
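As a rough sketch of what ends up in the do-file, the commands below tabulate rep78 and draw the frequency-based pie chart. The sysuse line is an assumption about how the auto data was loaded, and the exact text the dialog sends to the Results window may differ from what is written here. The last command is one plausible way to express the dialog's missing-value, exploded-slice and percentage-label settings as options.

```stata
* Load the auto training data (assumed to be Stata's shipped auto.dta)
sysuse auto, clear

* One-way table of the repair record categories
tabulate rep78

* Third mode of graph pie: slice sizes are the category frequencies
graph pie, over(rep78)

* Dialog settings written as options: show missing values as a slice,
* explode all slices, and label every slice with its percentage
graph pie, over(rep78) missing pie(_all, explode) plabel(_all percent)
```

Adding by(foreign) to either graph pie command draws one pie per value of foreign.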
But we can change the legend, titles and other formatting options via the Graph Editor. There are some things we can't do with the Graph Editor, so let's have a look at some of the pie chart options. We can specify if and in conditions and weights using the relevant menus here and here. Under the Options tab, we can also include missing values as a category. Let's click on this and see what happens. We've now included a new category that plots the frequencies of the missing values for this variable. Let's close this again. We'll leave the slice ordering and placement alone for now, and let's click on the next tab, the Slices tab. Let's customise the slice properties. Let's click on slice properties for all and then tick the explode slice box. And finally, we can also add percentages to the labels by clicking on label properties for all and then selecting percent as the label. Let's submit this graph and see what happens. That's looking a lot better and more informative too. We can obviously continue to edit this graph via the Graph Editor and also add titles, subtitles and legends.
Another neat feature of Stata is that we can draw several pie charts by another variable. In this case, imagine we wanted to compare the repair status of foreign versus domestic cars. Well, we can do that by specifying the by option in the pie chart menu. Let me show you, using the variable foreign. And there we go. We can now compare the pie chart for domestic cars versus foreign cars. Pie chart has many more options that you can explore further by yourself. But as you can see, creating simple pie charts in Stata is not very difficult. And once you've created a few, writing them in code will come easy and fast.
31. Visual 5 - Scatterplot and Lines of Best Fit: Scatter plots and lines of best fit are a great way to visualise relationships between two variables. Are y and x positively or negatively related? Perhaps the relationship is nonlinear. I personally often use scatter plots and lines of best fit to explore the functional form between two variables. I always do this before I implement regression modelling. However, scatter plots don't work too well with very large datasets. A dataset with a million observations or more will often result in a giant blob of data. That's where lines of best fit come in. They work with any sample size. But remember, lines of best fit are subject to the constraints that you set. Linear fits may not always be the appropriate form. In this session, we'll explore how to construct scatter plots and a few commonly used lines of best fit. We're also going to look at the twoway command more closely. This command allows us to overlay graphs on top of each other. The most common use of this is to overlay a line of best fit on top of a scatter plot to help visualise the relationship between two variables. So let's head over to Stata and explore how we can create scatter plots and lines of best fit.
Here we are in Stata, with the auto training data already open. In this session, we're interested in the relationship that the variables price and miles per gallon might have with each other. The command to create a scatter plot is the command scatter. Let's have a quick look at its help file and execute help scatter. There we are. The scatter command is relatively easy to use. It requires a y variable and an x variable, which are then plotted against each other. One thing to note is that scatter can actually take more than two variables. However, this is often not very useful, since only two axes can be displayed in a graph. I personally recommend only ever using two variables with scatter. If you wish to create a three-dimensional graph, then commands to create these can be downloaded from the internet. Scatter, like many other graphs, has a lot of visualisation options. We won't explore these now. Many can be manually edited once the graph has been created. Let's close this, and let's run a scatter.
We create a scatter plot between price and miles per gallon by typing scatter price mpg. And there we go. This creates a scatter plot between price and mpg. We can see from this particular plot that there appears to be a negative relationship between price and miles per gallon, although this relationship may be more curved rather than a straight-line one. We'll explore that further in just a moment. Scatter is part of the twoway plot family, which means we can use the twoway command to overlay multiple scatter plots on top of each other. Let's have a look at twoway's help entry and execute help twoway. Twoway is a great command that works with many different kinds of Stata graphs. At the top here you can see the scatter command. To use twoway, we call it as the first command and then envelop any graphs we want to lay on top of each other with round brackets. We can overlay as many graphs as we want. Further down on this list are prediction plots that allow us to fit linear and quadratic lines of best fit. There's also a local polynomial smoother plot, which is a type of nonparametric plot. These are great for letting the data speak for itself.
Let's close this and begin simple, say, with two different scatter plots. Imagine we want to plot the relationship between price and miles per gallon by whether a car is foreign or not on one single graph. Well, we can do that as follows: by first invoking the twoway command, twoway, and then the first graph, scatter price against mpg if foreign equals 0, and then we overlay the second scatter plot, which is scatter price against mpg if foreign equals 1. Let's execute that and see what happens. We now see a scatter plot that has two colours: blue for the first graph and red for the second graph. Unfortunately, Stata does not put the if conditions into its legend, so you have to identify which colour relates to which graph by the order of your scatter plots in the twoway command.
The next step is to overlay a fitted plot instead of another scatter plot. There are various fitted plots available in Stata, but the most common ones are a linear fit, a quadratic fit and polynomial smoothing. Let's look at these. Let's focus back on the relationship between price and miles per gallon and add a linear fit to our graph: twoway, scatter price against miles per gallon, and then overlay a linear fit of price against miles per gallon. And there we are. The graph now superimposes a linear fit onto the scatter plot and confirms the negative relationship between price and miles per gallon. However, a linear fit might not be the best fit, so let's go ahead and try a quadratic fit. We can do that by executing twoway scatter price against miles per gallon and then overlaying it with a quadratic fit of price against miles per gallon. That looks like a slightly better fit, but we can do even better and apply a polynomial smoother. This is a type of nonparametric estimation technique that allows the data to tell us what it thinks is the best relationship between price and miles per gallon, and we can implement it using the lpoly command: twoway scatter price against mpg and then lpoly price against miles per gallon. There we go. This plot tells us that there appears to be a quasi-linear fit between price and miles per gallon, with first a negative relationship, which then changes into a non-relationship at around 20 miles per gallon. Remember, this kind of plot is very much a function of the bandwidth you specify. By default, Stata will try to fit the best bandwidth, but you can override that with the bandwidth option.
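The overlays described above can be sketched as a short do-file. The sysuse line is an assumption about how the auto data was loaded. The legend() option on the first graph is an addition not used in the lesson, shown as one way to work around the unlabelled if conditions, and the bandwidth value of 1 is purely illustrative.

```stata
* Load the auto training data (assumed to be Stata's shipped auto.dta)
sysuse auto, clear

* Two overlaid scatters, one per origin, each in its own round brackets.
* legend() is an addition: it names the two plots by their order.
* The /// line continuation works in a do-file, not in the Command window.
twoway (scatter price mpg if foreign == 0) ///
       (scatter price mpg if foreign == 1), ///
       legend(order(1 "Domestic" 2 "Foreign"))

* Scatter with a linear fit
twoway (scatter price mpg) (lfit price mpg)

* Scatter with a quadratic fit
twoway (scatter price mpg) (qfit price mpg)

* Scatter with a local polynomial smoother at the default bandwidth
twoway (scatter price mpg) (lpoly price mpg)

* Same smoother with a manually chosen bandwidth (illustrative value)
twoway (scatter price mpg) (lpoly price mpg, bwidth(1))
```

The plots are numbered in the order they appear inside twoway, which is why legend(order(...)) can attach a label to each one.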
Here is an example of that: twoway scatter price against miles per gallon and then overlay a local polynomial smoother of price against mpg with a specified bandwidth. Let's execute that. And there we are. At this lower bandwidth, the graph looks much more jagged. Whether this is useful to you or not is something only you can decide. Finally, each of the above fitted plots has a twin command that ends with ci, which stands for confidence interval. This allows us to overlay confidence intervals of the relevant fit. Here's an example for the lpoly command: twoway scatter price against mpg and then overlay an lpolyci, the confidence interval version, of price against miles per gallon. And there we are. The graph plots the confidence interval around the estimate. This can be used to compare statistical differences across regions of the graph. As you can see, Stata is flexible with its relationship plots, and both the scatter and line of best fit commands are something you'll find yourself using a lot as you explore data with Stata.
32. Visual 6 - Drawing Custom Functions: Stata allows the drawing of custom functions. For example, we can plot any kind of equation or statistical function over a range of values, going from minus infinity to positive infinity. This can be useful if we're trying to demonstrate particular statistical properties. It can also be used to visualise regression estimates. The command to create custom functions is the twoway function command. The function command allows us to specify an equation in the form y equals a function of x, over a range of values between a and b. Let me show you what I mean by heading over to Stata. Here we are in Stata with an empty dataset. Before we continue, it's worth understanding what kind of functions Stata has available. We can access Stata's functions by typing help functions.
Let's go take a look. Here we see that Stata has a wide range of functions, from trigonometric functions to statistical functions. Here you'll find standard stuff like square roots or log functions and advanced functions like density functions. I definitely recommend exploring these entries in your own time. Let's close this for now, and let me show you an example of how to draw a custom function in Stata. Let's go ahead and generate a custom function that simply plots a sine wave. To do that, we would write the following: twoway and then the subcommand function, y equals the sine of x. Execute that, and here's our first graph. Now, because we didn't specify a range, this graph depicts the sine relationship between y and x between the values zero and one on the x axis. So let's go ahead and extend the x axis using the range option. We can do that like so: twoway function y equals the sine of x, and now we use the range option: comma, range, and now we're going to go between 0 and 10. Execute that, and there we are. We've now visualised a sine wave relationship between y and x over a much larger range of values for x. We can also create other custom equations. For example, we can plot a quadratic equation that might look something like this: twoway function y equals 1 plus 2x plus 3x raised to the power of 2, plotted over a range of x between minus five and five. Execute that, and there we are. We've now plotted a quadratic equation. We can use the function command to do more than simply plot equations. We can also plot statistical distributions. To plot the logistic cumulative density function, we can use the logistic function like so: twoway function y equals, in this case, the logistic function of x, and again we're using the range between minus five and five for x. And here we are.
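Written out as commands, the four custom functions just described might look like the sketch below. No dataset is needed. The logistic() function is assumed to be available in the installed version of Stata; invlogit(x) traces the same curve if it is not.

```stata
* Sine wave over the default x range of 0 to 1
twoway function y = sin(x)

* Same function over a wider range of x
twoway function y = sin(x), range(0 10)

* Quadratic 1 + 2x + 3x^2 for x between -5 and 5
twoway function y = 1 + 2*x + 3*x^2, range(-5 5)

* Logistic cumulative distribution function for x between -5 and 5
twoway function y = logistic(x), range(-5 5)
```

Multiplication has to be written explicitly with an asterisk, so 2x in the spoken equation becomes 2*x in the command.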
We can see the classic logistic curve that is bounded between zero and one. I recommend you try experimenting with the other functions that are available in Stata. Finally, we can also use the function command to plot estimated regression coefficients. I should add that using the margins command is probably a better way to do this in Stata, but this is another, more mathematical way of doing it. Let me show you an example using the auto dataset. I'm going to load the auto dataset and run a quick regression of price against miles per gallon and mpg squared. The important thing to note here is that our results take the form of a constant plus x plus x squared. So that looks very similar to a quadratic equation, and we can now plot this relationship using the function command. To make things easier, we can automate this process by calling the numbers right out of the stored estimation matrix. We can do this by using the _b prefix and wrapping the variable names in square brackets. This is called a system variable, and you can find more help on this by typing help system variables. Let me show you what I mean: twoway function y equals, and then _b with, in square brackets, the first term, which is the constant, and then plus _b of mpg times x, plus _b of the interacted term, miles per gallon times miles per gallon, times x to the power of two, over the range of miles per gallon. Execute that, and there we are. We've now plotted the regression relationship between price and miles per gallon as a custom function. The function command is very flexible, and there are many more things you can do with it.
33. Visual 7 - Contour Plots: Contour plots allow us to visualise three-dimensional data in two dimensions by using contours to represent the third dimension. They're often used in maps, where geography is displayed in two dimensions and height is added in as a third dimension via contours. This gives the illusion of adding depth to maps.
But contour plots can also be used for statistical purposes, for example, to display bivariate distributions or to examine complicated interaction effects from continuous variables. However, note that contour plots don't work well for categorical variables. So let's explore how to build and modify contour plots in Stata. I've already loaded a training dataset called sandstone. This dataset contains depth information about the subsea elevation of sandstone in an area of Ohio. So let's have a quick look at the data by doing a browse. OK, we can see that the dataset contains depth information for a variety of northing and easting geographic coordinates. Northing refers to northward-measured distances, or y coordinates, whilst easting refers to eastward-measured distances, or x coordinates. There are quite a few observations in this dataset, and that's a good thing. Contour plots are a type of nonparametric plot, and you'll find that if you have too few observations, your contour plot will look very poor.
So let's close this and call a contour plot in Stata. The command to call a contour plot is via the twoway graph command, with contour as the subcommand. So let's go ahead and look at its help file first. We can do that by typing help twoway contour. The basics of the command are relatively easy to understand. You'll need to specify a y variable and an x variable, which together make up the two-dimensional axes of the graph. You also need to specify a z variable, which will define the contours and add the third dimension to your graph. There are various options available, including interpolation options, colour options and level options. They're pretty self-explanatory, but we'll explore the levels option and the zlabel option to get more detail in our graph.
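The syntax just described, with the z variable first and then the y and x variables, can be sketched as follows. The sysuse line assumes the sandstone training data is the dataset of that name shipped with Stata, and the values 20 and #5 are example settings for the two options mentioned.

```stata
* Load the sandstone training data (assumed to be Stata's shipped sandstone.dta)
sysuse sandstone, clear

* z variable (depth) first, then y (northing), then x (easting)
twoway contour depth northing easting

* Ask for 20 contour levels instead of the default number
twoway contour depth northing easting, levels(20)

* Keep 20 levels but request roughly five labelled values on the z legend;
* the hash symbol asks for that many values rather than a label at 5
twoway contour depth northing easting, levels(20) zlabel(#5)
```

Without the hash, zlabel(5) would be read as a request to label the single value 5 on the z scale.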
So let's close this and call a contour plot with the data at hand. We can do that by typing twoway contour, then the z variable, which is depth, and then our y and x variables, northing and easting. Execute that, and there we go. We now have a contour plot that shows us depth measures for a particular geographic region. We can see that there are some areas of low depth shaded in blue and some areas of high depth shaded in red. We can change the granularity of our plot by defining how many contours we want to use. By default, Stata uses five. Well, we could change this to 20, for example, to get more definition in our graph. So let's go ahead and do that, and we do that by adding the levels option: twoway contour depth northing easting, comma, and then levels with 20 in round brackets. This has now produced a much more detailed and potentially more useful representation, although the quality of this map will be a function of how many observations are in our dataset. However, note that the z legend now also has 20 different cut points. That might be visually a little bit annoying, and we can reduce the number of presented labels via the zlabel option, like so: twoway contour depth northing easting, comma. We retain our 20 levels, but we ask Stata to only show five labels. It is quite important that you add the hash symbol before the five here. So let's execute that. This graph looks significantly better than before. We retain our fine detail of 20 contours but only use five different label values, which makes this graph look more professional. Good job.
One particularly useful statistical trick for contour plots is to use them to visualise continuous-by-continuous interaction effects from a regression. Such effects can be notoriously hard to interpret from regression coefficients alone. So let me demonstrate how we can use a contour plot with the margins command in Stata to visualise such a complicated interaction effect. Let's first load the auto training dataset. Remember, this is a training dataset that has data about car prices and car characteristics. So let's go ahead and run a complex regression with a continuous-by-continuous interaction effect. So let's regress price against miles per gallon and length and the interaction of mpg with length. We'll also control for the weight variable as an extra fourth variable. Let's execute this regression, and here we can see our regression results. We see that miles per gallon and length are not statistically significant, but we're going to ignore that for now. Mileage has a positive effect on car prices, whilst len