Transcripts
1. Data Collection and Cleaning: Hi, I'm Dr. Michael McDonald. Today I'm going to talk to you about data collection and cleaning, the first step in business intelligence. Let's start by talking about the different stages in this course and what you're going to learn. In Module One, we'll talk about assessing different types of databases and picking the one that's right for you. In Module Two, we'll talk about the basics of gathering your own data and building your own databases for use within a company. In Module Three, we'll talk about merging together different, distinct data sets and some of the pitfalls and perils you may face in doing so. In Module Four, we'll talk about cleaning the data that you've gathered and making sure that the data you're looking at to make decisions is exactly correct. In Module Five, we'll talk about a few more pitfalls you may encounter and some things you can do to combat them. Let's get started, shall we?
2. Assessing Databases: Module One: Assessing Databases. Now, when a lot of people think about databases, the first thing they do is think they should go out and look for commercial database software. And that's great, because commercial database software is very powerful and does make gathering and recording data easy. But it's not a panacea for the problems your organization may face, and too many people fail to recognize that. In particular, commercial databases can often create a black box that users rarely look beyond. This can be a big problem for your organization; in particular, it creates opportunities for faulty data, and that's especially true as companies start using mobile apps to gather data remotely. There are a number of different types of commercial database software out there. They can be broken down by a few key characteristics. One of these is whether you're going to use manual or automatic data entry. There is potential for problems with both manual and automatic data entry. With manual data entry, you can have opportunities for incorrect entries in your data. Imagine, for instance, that you have a data entry individual putting transactions into your database in a manual fashion. It's easy for them to transpose figures or things like that and create errors and potential problems for you to look at. Similarly, they may mis-log entries, inadvertently creating entire records that are simply incorrect. But the problem with automatic data entry is that it makes it easy for users to take data and apply it where it shouldn't be applied: for instance, in a simple example, using international sales data in the U.S. With automatic data entry, you may get better data accuracy but greater problems in actually applying that data correctly. There are a few points that you'll want to consider when evaluating databases. First, you want to think about the ease of reviewing data. Different types of databases may make this easier or harder. In particular, if you have data stored in multiple different sheets across a database, rather than merged together into a single sheet, it may be very difficult to view and browse that data and assess it, either for sanity checks or even basic analysis. Second, you'll want to consider the ease of editing your data. Is it simple to make additions to the data? If you have an existing database but want to update it on a regular basis, is it easy to go through and add things to it? Similarly, perhaps you have an existing database, but you want to add a new variable. For instance, perhaps you have a list of sales that your company has done, and you want to go through and add some sort of characteristic about the customer. This may or may not be easy, given the database that you're using. There are pros and cons to different databases, and the ease of editing data highlights this in particular. While easy editing may make it simpler to make additions to the data and thus allow more powerful data analysis, it can also lead to data governance issues, where you may have data incorrectly entered, or problems that are inadvertently created by adding too much data. In addition, when thinking about points to consider with regard to a particular database, you'll want to look for built-in tools that allow you to test for data reliability.
Different software systems may or may not have the tools that you need. Further, you want to think about how the software integrates with other programs. Is this database going to make it easy to interact with Excel, or to interact with some other analysis software you want to use? Finally, you want to think about whether or not the database software offers any ability to handle analysis on its own. Perhaps you have a system that will let you go from start to finish. That certainly makes it convenient and avoids the logistical issues of moving data between different types of software. But it also makes it easier for people to use faulty analysis without realizing it, in that it makes the software more of a black box to begin with. Now, there are some alternatives to traditional database software. The first option is to build your own data sets. Excel is the easiest solution if you take this route. Excel is very familiar to the vast majority of users out there, it's something that's easy to edit, and virtually all businesses have access to it. However, the problem is that some versions of Excel can only handle 65,536 rows of data. Even if you have a version that handles more, though, Excel can introduce transposition errors if you try to sort more than 20,000 or 30,000 rows of data across very many columns. The solution here is that you may want to use Excel for some initial data entry and then transfer the data from there to a more sophisticated data analysis program. It's really up to you at the end of the day, though.
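If you do take that route, here's a minimal sketch of the hand-off in Python, assuming the pandas package (with an Excel reader such as openpyxl) is installed; the file name and sheet name are hypothetical placeholders for your own workbook, not part of the course materials.

```python
# Minimal sketch: pull manually entered Excel data into pandas once it
# outgrows what Excel sorts and filters comfortably.
import pandas as pd

# "sales_entry.xlsx" and "Sheet1" are hypothetical placeholders.
sales = pd.read_excel("sales_entry.xlsx", sheet_name="Sheet1")

print(len(sales))      # no 65,536-row ceiling on this side
print(sales.head())    # quick sanity check of the first few records
```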
3. Gathering Data: Module Two: Gathering Data. Now, the first question we face when building our own databases is: where am I going to get the data that I'd like to use? Well, there are a few different options here. To begin with, your firm could buy data. This is particularly useful for types of data that are simply not easy to get on your own. Names and addresses in mailing lists, for instance, are a classic example. It's very easy to get names and addresses from providers of this data, and it's generally going to be much more accurate than going out and gathering the data yourself, especially given that people move so frequently. Another example of data that you might want to buy is financial data on publicly traded firms. While you can certainly go out and gather individual pieces of information about these firms from sources like Yahoo or CNBC, gathering data in large quantities, and gathering deep data on the financials behind companies, is often difficult if you don't buy it. Similarly, natural resource data is often very difficult to get unless you buy it. The second option in terms of getting data is to build it. The data on your customers is often the most valuable to your firm, and it's unlikely that you can buy that data from anywhere else. Instead, you're most likely going to have to build that database on your own. That's what most of the rest of this module is going to be focused on. Third and finally, you can gather your data for free. The federal government has reams of data available for free on macroeconomic conditions across the country, surveys of the U.S. consumer, basically any data that you could want on a macroeconomic level. The Fed probably has something for you. Now, if you have specific data needs, what type of data should you look for? Well, the data needs for your company are going to be driven by your particular project needs. You want to start by thinking about what you're trying to model. Financial economists always start by building a model, then getting the data. Once you're done with that, you want to go through and figure out what the driving factors are that will influence the outcome you care about. Whenever I'm doing data analysis projects as a financial economist, I always start by figuring out a basic hypothetical model, then go through and find the data that will support that particular project. That's much more effective than gathering the data and trying to build the model. If I gather the data first and try to build the model, it might turn out that I'm missing some critical pieces that I need for my analysis. For instance, sales are driven by internal factors like marketing, new product innovations, etcetera. But they're also going to be driven by external factors: macroeconomic conditions, competitive behavior, expectations about the future of the marketplace, etcetera. We can build a model that takes into account all of these different factors, but it's very important that we've built this model in advance so that we know what data to gather. Now, when it comes to gathering data, we probably want to start with the easiest stuff. The macroeconomic data that I mentioned previously from the Fed, for instance, is very simple to get. We can get that from the Federal Reserve's economic database, FRED; the website for that resource is shown here. Alternatively, you can actually gather this through a simple Excel add-in. That Excel add-in, once you've installed it, is shown here.
After you have installed the add-in, go to the FRED tab in your Excel model, and you can see a variety of different types of economic data: everything from, in this particular tab, real gross domestic product to, say, federal outlays, federal receipts, and the federal surplus or deficit. We also have data not just on the U.S. but international data as well. Similarly, if we're looking for data on, say, production or business activity, we can find data from the Fed on this related to industrial production, capacity utilization, housing starts, building permits; essentially any macroeconomic data that we need for a particular industry we can get from the Fed. Once we've gone through and found the data that we need, we're going to use a mnemonic code to gather it. Let me back up a second. For instance, if we were interested in vehicle sales, we'd go to Browse Popular US Data, then Production and Business Activity, and then click on vehicle sales for autos and light trucks. When we do that, we'll get the ALTSALES mnemonic. Next, we're going to go through and click on that, and when we do, the data will populate on its own. In this particular case, the data is on a monthly basis. It starts in 1976, and it's available up to March of 2016. The data is going to tell us about light weight vehicle sales for autos and light trucks, and it's from the U.S. Bureau of Economic Analysis. The important point here is that even though we're gathering data through the Fed's Excel add-in, it's not actually data from the Federal Reserve. That's the power of the FRED tool for Excel: it lets us harness a lot of different data sources, like, in this case, the BEA's data, through a simple add-in, and this makes it much easier to gather data. I would urge you, if you have an interest in this, to check it out. The add-in is free, it costs you nothing, and there's a lot of neat stuff in there. Let's move on, though. In addition to FRED, you can also try to get data from the U.S. Census Bureau. This is particularly useful for identifying characteristics of target customers based on census blocks. Google Trends is also a great resource. If you're trying to figure out, as an example, what's going on in particular industries or in social media or things like that, Google Trends will give you data on what's being searched for over time. Gathering data on customer comments and social media data online is another really hot topic. It's something that I'm asked about as a financial economist all the time in the context of different financial needs for companies. But gathering this type of data requires textual analysis; that will be the subject of a future course. Now, if you have a specific data set that you need, for instance company financial information for a broad set of large publicly traded companies, buying that data might be the only option. Some data, as I mentioned, is available through Yahoo Finance, through CNBC, etcetera, but collecting the data that way requires writing a Python script. Instead, buying data is usually the most realistic option, especially if you're interested in getting that data updated on a regular basis. Finally, building data sets from your own data is generally the most crucial skill for most companies. You can certainly tap your customer databases to do this, for instance, but you can also go about developing your own in-house methods of collecting data.
This is usually great because, on a daily basis, most companies generate reams of new data that could be useful in analyzing and making future business decisions. Surveys of customers can often be a great option too, but again, it's something that's really only available if you go about making the effort to survey your customers. For instance, I recently worked on an investment banking survey to help a small boutique investment banking firm predict the characteristics that helped them win deals. We went through and looked at both their customers and past prospects where they hadn't won the deal. So we looked at both the customers where they did win and the clients, or prospective clients I should say, where they would have liked to get a deal. We looked at both sets and then were able to use business intelligence methods to go through and figure out which type of deals this boutique investment banking firm should target in the future. That's just one example of where the financial industry, and in particular investment banking, can benefit from business intelligence, but I'm confident that if you think about it a little bit, you can find many examples in your own firm where such data analysis can be helpful as well. Now, in terms of fielding those surveys, there are a lot of different tools out there that you can use. For instance, SurveyMonkey, Critical Mix, and many others will help you, once you've generated your survey, to go about getting responses. This gives you the ability to get information on not only your own customers but other people's customers: the customers that you missed out on, potentially, like in the case of the investment banking work that I did, or potentially customers that have never heard of you but that you might like to target in the future. Next, you want to think about data biases in your survey. If I'm trying to figure out how to sell to other customers, are my existing customers representative of the rest of the world? For instance, if I'm looking to sell abroad in Germany, doing a survey of U.S. consumers that buy my product may or may not tell me anything useful. The point here is that it's important to make sure that any survey data I'm gathering is actually representative of the problem that I'm trying to solve. Again, this is where a good model of your data can help you, if you thought about it in advance of actually gathering the data.
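For those who prefer code to the Excel add-in, here's a minimal sketch of pulling a FRED series in Python, assuming the pandas_datareader package. I'm using the light weight vehicle sales mnemonic discussed above (ALTSALES, if I've recalled the code correctly; verify it on the FRED site); any other FRED series code works the same way.

```python
# Minimal sketch: fetch a FRED series in Python rather than through the
# Excel add-in. Assumes pandas_datareader is installed.
from datetime import datetime
import pandas_datareader.data as web

start = datetime(1976, 1, 1)
end = datetime(2016, 3, 31)

# Remember: FRED is the delivery channel here, but the underlying source
# for this series is the U.S. Bureau of Economic Analysis.
vehicle_sales = web.DataReader("ALTSALES", "fred", start, end)
print(vehicle_sales.tail())   # monthly observations, most recent last
```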
4. Merging Datasets: Module Three: Merging Data Sets. After you've gathered the data you need to pull together, it's important to start taking this myriad array of different data sets and putting them together into one cohesive whole that can be useful for your analysis. That sounds easy, but in reality it's not. There's a variety of different types of problems that you may run into. We looked earlier at light vehicle sales. That data was on a monthly basis. If we were to look instead at GDP, that data is on a quarterly basis. Economic data like this can often have different frequencies. That means that if we were to try to merge GDP data with light vehicle sales data, we'd run into a problem: one is reported monthly and one is reported quarterly. We need to find a way to reconcile that issue. Another example: oil prices are daily, housing sales are monthly, and GDP is quarterly. So if we're trying to look at all three of these different sets of data, we need to decide how we want to deal with that. Are we going to use that GDP data on a daily basis and simply change it once a quarter? Are we going to look at oil prices only quarterly, because that's how often GDP is reported? We need to figure out the relationships that we care about and then decide what type of database we want to build. Should a database that contains GDP also have our customer information in it? What do we want that database to look like? Once you've figured out what you want that database to look like, based on the issue that you're trying to solve, we should attempt to build one large spreadsheet. There are a few reasons for this. First, and perhaps most importantly, it's easy to review and easy to analyze one large spreadsheet. But this will also help us figure out what our unit of observation is. If we care about sales, our unit of observation might be the days the company is operating: how many sales we got Monday through Friday, every day that the company was open. Or it might be that our unit of analysis is customers. If we're trying to predict whether or not a customer is going to come back, or the amount that a particular customer will order, then instead of caring about individual days of sales, we might care about particular customers and their characteristics. Here are two examples. In the top, we have time-dependent data. You see here fictional data about the number of sales on any given day, related to, say, the state unemployment rate, whether or not we're running marketing, the number of salespeople we have, and our competitors' prices. In the bottom, we have time-static data. We have customer A, B, C, D, E, etcetera; the sales to that customer; whether or not we've offered the customer a discounted price; the average monthly orders for the particular customer; and the last price that something was sold to that customer at. Both of these data sets could be very useful, but they let us predict completely different things. Up top, we're more likely to be trying to predict something like the number of sales that we'll have next week, next year, next quarter, whatever. Below, we're much more likely to be trying to figure out something like what type of demand we can expect from a given customer if we change the price we're charging that customer. Both of these issues are important, but the type of database and the type of data we need to answer them is very different. If we spend time before we actually gather our data thinking about what we want our model to look like, and then what
we want our database to look like, it'll save us a lot of time and effort and, frankly, frustration later on down the road. Once you've decided on the unit of analysis, you need to go through and merge your data series. To do this, you need to find a common variable or feature to merge on. In time-dependent data, we want to go through and merge on date, for instance. The idea is that date will be common among the different variables, and hence we can merge these variables into a single large, unified data set. For time-independent data, time-static data, that is, we could merge on something like zip code, for instance. There's no single unit of analysis or common variable that we're always going to want to merge on. Instead, the merge is going to depend on the specific circumstances that we care about and what we're trying to analyze. But when we're reviewing our data set, we need to make sure that the merging variable is unique. This can create a big problem that a lot of people don't necessarily think about. For example, in some of the financial projects I've worked on, clients will often say, well, let's merge on, say, stock tickers. Every company has its own stock ticker, and that's true. But what many people fail to realize is that stock tickers are repeated over time. For instance, the ticker ABC today might refer to a specific company, Company One, let's say, where ten years ago it might have referred to a different company, Company Two. Company Two might have gone bankrupt or been merged into another firm altogether, been acquired, that is, and the ticker symbol ABC became available again until it was used by Company One today. As a result, stock tickers are not a unique variable to use when merging our data. If we're looking at a time series of data, they can be repeated over time for different companies that we wouldn't want to lump together. Instead, we need to use something called CUSIPs when we're looking at financial investment data. A CUSIP is simply like a Social Security number: it's specific to a given company, and it exists for all time, whether or not the company goes out of business, etcetera. It is never reassigned, as tickers are. If we're using a software program like SAS or Stata, we'll want to merge our data using code. In Excel, we'll want to merge using the VLOOKUP function. If you're going to merge using the VLOOKUP function, you should always check your outcomes after merging; they can be faulty. In particular, always use the range lookup argument in the VLOOKUP function and specify an exact match rather than an approximate match. If you only specify an approximate match, you will get numerous data problems. You can also use the HLOOKUP function, but it's better for analysis to have variables running across the top and the observations running vertically, rather than vice versa.
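If you're merging in code rather than with VLOOKUP, the same ideas carry over. Here's a minimal pandas sketch, with made-up numbers, that spreads a quarterly series across the months of its quarter and then merges it onto a monthly series on the date key, insisting that the key is unique, which is the role the exact-match setting plays in VLOOKUP.

```python
# Minimal sketch: merge a quarterly series onto a monthly series on a common
# date key. The figures below are invented for illustration only.
import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=6, freq="MS"),
    "vehicle_sales": [16.5, 16.7, 17.1, 16.9, 17.4, 17.2],
})
gdp = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-04-01"]),
    "gdp": [17700.0, 17900.0],
})

# Decide the unit of observation (monthly here), carry each quarterly GDP
# value forward across the months of its quarter, then merge on date.
gdp_monthly = gdp.set_index("date").resample("MS").ffill().reset_index()

merged = sales.merge(
    gdp_monthly, on="date", how="left",
    validate="one_to_one",   # raises an error if the merge key repeats
)
# Months after the last reported quarter will show missing GDP until the
# next release; decide up front how you want to handle that gap.
print(merged)
```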
5. Cleaning Datasets: Module Four: Cleaning Databases. When we're going through and cleaning data, it's important to understand that nearly all large data sets have some issues. These potential issues can include things like fraudulent data in the extreme, data errors that were entered at some point, usually inadvertently, genuine data that is simply not representative of typical circumstances, and data transpositions. Now, if we want to go through and test our data set for errors, there are specific procedures that we can use to do that. To test for data errors, we want to start by dropping or replacing any values that don't make sense. For instance, if we're looking at daily sales or company assets, there should never be any negative values. It's usually best to drop questionable values unless we have a small data set, in which case replacement of those values is going to be needed. Generally, we're going to think of a small data set as less than 500 observations. If we have less than 500 observations, at a minimum we should go through and try to make our best guess as to what the correct value is and replace that data. Ideally, though, with only 500 observations, if possible we'd like to go back and confirm that the values we're putting in are correct; that is, go through and correct our data set. With larger data sets of, say, 10,000, 20,000, or 150,000 observations, that's simply not going to be practical in most cases. And if we have 100,000 data observations, as long as most of our data is just fine, dropping a few values won't make a big difference. If we're dropping more than, say, 10% or 20% of our values, that of course creates a problem. But frankly, if more than three or four percent of our values have errors in them, we probably have a faulty data gathering process in the first place, so we need to go back and look at the policies and procedures that we have in place for gathering that data. There are likely going to be some issues there that we'll want to correct. Next, to test our data points, we want to go through and find the mean, median, and standard deviation of each variable. These statistical measures are going to be crucial to letting us do the type of hypothesis testing that I mentioned earlier, related to the correction of any possible data errors. In particular, we want to go through and run a check to flag all of our data points that are more than three standard deviations from the mean. The idea is that in a normal distribution, most data points should fall within three standard deviations. In particular, if we're thinking about this as a two-tailed hypothesis test, less than 1% of our data should fall more than three standard deviations from the mean. Because that's only going to be a very small portion of our data, it's a good idea to go through and flag those data points. Don't delete them; they are useful data, and this is not necessarily an indication that they're wrong. But we do want to flag them. If we find that much more than, say, 1% of our data is more than three standard deviations from the mean, that suggests that our data is in some sense unusual, or that there might be a problem with it. If our mean and median are dramatically different, for instance, that's going to tell us that our data is skewed.
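To make those checks concrete, here's a minimal sketch in Python, assuming pandas and numpy, run on a made-up daily sales series. It drops impossible negative values, reports the mean, median, and standard deviation, and flags, without deleting, anything more than three standard deviations from the mean.

```python
# Minimal sketch of the basic error checks on one hypothetical variable.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up daily sales: routine values plus one entry error (-5) and one
# extreme but possibly genuine day (3400).
daily_sales = pd.Series(np.round(rng.normal(130, 12, 60)))
daily_sales = pd.concat([daily_sales, pd.Series([-5.0, 3400.0])],
                        ignore_index=True)

# Daily sales should never be negative: drop them (or, in a small data set,
# go back and correct them).
clean = daily_sales[daily_sales >= 0]

mean, median, std = clean.mean(), clean.median(), clean.std()
print(f"mean={mean:.1f}  median={median:.1f}  std={std:.1f}")

# Flag, but do not delete, points more than three standard deviations from
# the mean; in roughly normal data well under 1% should be flagged.
flagged = clean[(clean - mean).abs() > 3 * std]
print("flagged values:", flagged.tolist())
```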
Whether skew is a problem depends on the issue that we're examining. Now, this same procedure, that is, looking at the number of standard deviations from the mean for any given data point, can be used to test for unusual values in variables that may not accurately represent reality, another one of the issues in data analysis that I mentioned earlier. Similarly, it's going to be useful to flag any observations in the top 1% of our data and the bottom 1% of our data. This is called winsorizing. Again, these observations don't need to be dropped, but we should go through and run our analysis with and without these data points to make sure they're not driving our results. One critical mistake that we might make, for instance, is to conclude that our sales can be dramatically higher if we follow XYZ procedures, when in reality that's only true for a small subsample of our data, say the top 1% or the bottom 1% of our customers. Flagging this data and then running our analysis with and without those particular data points lets us test whether the data really is similar for those winsorized points, or those flagged points, versus the bulk of the data set overall. And it also lets us make sure that our results aren't being driven by a subsample of our overall data. This leads to another very important issue: Benford's Law. Now, one of the trickiest things to deal with in data analysis is the potential for fake data. One of the best rules of thumb for testing large data sets for fake data is Benford's Law. Benford's Law says that in real, genuine data, the leading digit one should be the most common, the digit two should be the next most common, followed by the digit three, the digit four, etcetera. To illustrate why this is the case, think about the stock market. It took a lot longer for the Dow Jones Industrial Average to go from 1,000 to 2,000 than it did to go from 17,000 to 18,000. It's simply a matter of growth within the markets. Going from 1,000 to 1,100 is a 10% move in the markets; in theory, that 10% move ought to take about the same amount of time as going up 10% from, say, 16,800. Yet going from 1,000 to 1,100 only moves us a fraction of the way to the next thousand-point level, while the same 10% move from 16,800 more than covers the distance to 18,000. Thus, as the level gets higher, moving from one leading digit to the next requires a smaller and smaller percentage gain, so the index spends less and less time at the higher leading digits. Benford's Law simply captures this in an elegant form. When going through and looking at data, the digit one should be the most common leading digit, followed by the digit two, etcetera. Look for that pattern in our data and we can tell whether or not the data is real or fake. The chart below shows the frequency of each leading digit in genuine data. Now, bear in mind, there's going to be some variation from this in any given data sample, but on average, about 30.1% of all leading digits in genuine data should be the number one, 17.6% should be the number two, 12.5% should be the number three, 9.7% should be the number four, 7.9% should be the number five, 6.7% should be the number six, 5.8% should be the number seven, 5.1% should be the number eight, and 4.6% should be the number nine. If you go through and look at a data set and you find that it differs dramatically from this, it doesn't necessarily guarantee that it's fraudulent data.
But it does mean that it's probably prudent to go through and check the source of that data and decide on your own how trustworthy that data is. You wouldn't want to make any big decisions without being pretty confident that the data was accurate. Benford's Law may seem simple, but in reality it's an extremely powerful tool. For instance, in a famous research study, economists showed that Enron's data, Enron's financial data, didn't follow Benford's Law. If auditors had been looking at Benford's Law when evaluating Enron's books, well, let's just say that the outcome of that story might have been very different.
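Here's a rough sketch of how you might run that first-digit check yourself in Python, assuming pandas; the amounts below are made-up stand-ins for whatever column of genuine positive values you want to test.

```python
# Minimal sketch of a Benford's Law leading-digit check.
import math
import pandas as pd

# Expected Benford frequency for leading digit d is log10(1 + 1/d).
benford = pd.Series({d: math.log10(1 + 1 / d) for d in range(1, 10)})

# Hypothetical amounts; in practice use a real column (sales, invoices, ...).
amounts = pd.Series([1042.50, 23.10, 187.00, 9.75, 310.40, 1.99, 47.25,
                     1500.00, 120.00, 2.45, 16.80, 870.00])

# Take absolute values, then strip leading zeros and decimal points to
# isolate the first significant digit of each amount.
first_digits = (amounts.abs().astype(str)
                .str.lstrip("0.")
                .str[0]
                .astype(int))
observed = first_digits.value_counts(normalize=True).sort_index()

# Compare observed leading-digit frequencies with the Benford benchmark.
comparison = pd.DataFrame({"observed": observed,
                           "benford": benford}).fillna(0)
print(comparison.round(3))
```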
6. Pitfalls in Data Collection: Module Five: Pitfalls in Building Data Sets. There are a few major issues in data that you may come across where it's useful to understand how to deal with them. The first is what to do about missing too much data. The second is skewness in data. A third is unobservable variables, and a fourth is when we have a small subsample that might be driving our results. All of these problems can be very challenging to deal with, but we'll talk about some strategies for each of them as we go through this module. Let's start by thinking about data sets that are missing too many values. If we have a data set that's missing too much data, this can lead us to faulty conclusions. It's not clear why the data is missing in the first place, and without knowing that, we don't know whether or not it's a problem for our analysis. So you have to be very careful in these kinds of situations. For instance, if we're studying financial data from overseas firms, only data from the largest companies is going to tend to be available. In most cases outside the U.S., reporting requirements simply aren't followed as thoroughly, and they're not as stringent as they are in the U.S. As a result, only larger companies tend to accurately and consistently report their financials. Smaller firms don't tend to do that, so the data from the smaller firms is frequently missing. Thus, if we're trying to go through and run a simple analysis of, say, asset size for firms outside the U.S., this is going to produce a distorted picture. In that particular case, we'll believe, based on our analysis, that most foreign firms are much larger than they actually are, and that's what we see in practice. This could bias, for instance, any decisions that we might be making about whether or not to enter a foreign market; perhaps we're going to believe that the firms we'll be facing are much larger than they actually are in reality. This same type of problem can happen in other types of analysis. As a general rule of thumb, if you're missing more than about 25% of the values for any given variable in a data set, it's time to take a closer look at that data. You may or may not be able to correct the issue, but if you can't, you need to decide whether or not the conclusions that you're going to draw from that data are truly going to be valid. Next up, let's talk about skewness in data. Skewness in data can be an issue, depending on the data that's being examined. A classic example of this is income levels. If we're looking at the average, or mean, income for our customers, for instance, that's going to produce a distorted view. No one has an income level less than $0, while a few people have an income over $1,000,000. That skewness could distort an analysis about optimal pricing used in a price discrimination effort. For instance, in a recent project I was involved in, a company had gone through and was looking at their customers, trying to predict what kind of an optimal price they could charge by using the mean data. They actually had a few very wealthy customers, and they believed their customers were much less price sensitive than they actually were. In the company's case, that led the company to raise prices too much, hurting their sales. While price differentiation is very useful in this type of study, we need to make sure that we're using the appropriate metric.
For something like the typical income of our customers, the median is actually a far better indicator than the mean. If our data is skewed, it may or may not be a problem. Either way, data can't be un-skewed, so instead we need to use certain statistical tools when doing our financial and economic analysis. These tools aren't necessarily all that complicated, but you may or may not be familiar with them, and so it's important to go through and do a little bit of research before you get to that point. A third problem we might face is unobservable variables. Sometimes results for a business decision are simply driven by a variable that can't be observed. For instance, if we're trying to predict which job candidates would be the best employees, that can be a fruitless and frustrating task. It could be that the best employees are the most intelligent ones, but we can't measure intelligence directly, at least unless we're going to start paying for IQ tests, right? We can deal with this problem through observable variables that should be correlated with the unobservable one. For intelligence, for instance, we might go through and look at SAT scores or college GPA as a proxy. It's not perfect; of course, neither college GPA nor SAT scores directly measure intelligence, but they are related. It would be very unusual for someone who is not very intelligent to score very highly on their SATs or have a very high college GPA. Again, it's not perfect, and we need to be aware of that, but it may be the best choice that we have. We have to decide if we can find a good proxy variable for our unobservable factor. If we can't, we're going to need to use special statistical techniques in our analysis. Another problem we might have is the possibility of a subsample driving our results. Sometimes a subsample of our analysis can skew our conclusions. For instance, the majority of stock returns in any given year occur in the weeks of Federal Reserve meetings. The Fed meets periodically throughout the year, and studies have found that the majority of stock returns occur in the one-week periods before and after the Fed meets. That's a tiny portion of the overall number of trading days in the market, but it's the most important sample in the year. Looking at most other days during the year is going to lead to fewer meaningful conclusions about overall returns. To avoid problems with subsamples driving our results, it's always going to be best to run our analysis on different time periods. For instance, we could check the factors that we believe predict stock returns and see whether or not they have predictive power in every month of the year, rather than only in the months when the Fed meets. Let's go through and talk about what we've learned. To begin with, when we're assessing databases, we need to be aware of the differences among expensive commercial databases and whether or not they're right for us. The alternative is to use generic data collection methods. These have their own set of problems, though, and in particular they may require more effort from your staff. Next, we talked about gathering data. It's important to be able to look through and combine data that's been built, bought, and gathered from a disparate array of sources. Going through and taking that data we've gathered and putting it all together into one useful data set is what we refer to as merging our data. To merge our data, we need to decide what our unit of analysis is and then merge the data accordingly.
Remember that our unit of analysis needs to be unique, such that we can merge our data properly. Next, we talked about cleaning our data. To clean our data, we need to go through and test for a variety of potential problems, for instance things like missing data, skewed data, potentially fraudulent data, etcetera. To test for these issues, there's a variety of different statistical techniques that we can use. These range from things like winsorizing and looking at means and medians to rules like Benford's Law. Finally, we talked about pitfalls in data. It's always important to check your data for potential problems, and if you find something unusual, to have a technique to deal with the issue. I've tried to go through and outline many of the techniques that you'll need in doing that kind of analysis and that kind of checking in this presentation. I hope you've enjoyed this talk. I've certainly enjoyed this opportunity to talk with you. Thank you for watching. Look for future hands-on courses in business intelligence techniques coming soon. See you next time.