Transcripts
1. S7L1 Expectations: All right, congratulations and welcome to the final section of the course: case studies. This is where we're going to really start to have some fun. This section is all about taking the tools that you've learned up to this point and applying them to different data sets and analytical challenges. Until now, we've been focusing exclusively on our IMDB data, but now we're going to dive into a Pivot Table Case Studies file, which contains a bunch of new data sets that we'll dig into, ranging from stock market and social media data to shark attack records and even burrito rankings. But before we dive in, let's set a few expectations. Number one: the goal of these case studies is to demonstrate how the concepts covered can be applied to a range of different data sets. These are not intended to be comprehensive analyses, but simply starting points to help you practice using those tools in new and different contexts. Number two: some data may have been altered or fabricated to facilitate the demonstrations. Most of the data that you see is 100% real, but keep in mind that some modifications have been made in certain cases specifically for the purposes of this course. For example, certain fields may have been cleaned up or standardized, or certain records may have been added or removed to help demonstrate certain concepts. And then, finally, homework exercises will be provided after each case study. The exercises will follow each of my case study introductions and build upon the specific concepts that I cover. I will provide answers wherever possible, but if you do need additional support, feel free to reach out, either in the course discussion board or by messaging me directly. And with that, let's go ahead and dive in.
2. U.S. Voters: All right, the first case study that we're going to dive into is U.S. voters in 2012. A quick data set summary: this is 2012 population and voter registration data from the U.S. Census Bureau, broken down by state and by age group. You can see a screenshot of the data set here on the right side of the slide, and what I've done is format the column headers to indicate what we're working with in terms of dimensions or measures. Again, dimensions are your categorical fields; measures are your numerical or quantitative fields. The n = 255 just indicates that we're looking at 255 records, or rows of data, in this case, so a pretty small data set to kick things off. As far as dimensions are concerned, we've got state and age in columns A and B, and then for measures we've got total population, citizen population, registered voters and confirmed voters. The concepts that we'll cover in this particular case study are calculated fields and value settings, specifically percent of column and percent of parent. So with that, let's jump to Excel and dive in. All right, go ahead and open up the Pivot Table Case Studies file and jump into the first tab, the US Voters 2012 tab. From here, we can Ctrl+A to select all of our data and insert a new pivot table on a new worksheet, and before we dive in, let's name that sheet Voter Pivot. I actually want to highlight this tab a light shade of green; that way, as we continue to work through the case studies in this workbook, it will be clear which tabs are raw data and which ones are pivots. Now, as an analyst, given the data that we have on hand, some of the questions that I'd like to explore are things like: How does voter turnout compare across states? Are certain age ranges more or less likely to vote in general? And do these trends hold true nationwide, or are there state-specific insights that we can uncover here? As you know, these are all questions that pivot tables do an excellent job answering, so let's just see what we can come up with. Right off the bat, let's cut our data by state as rows and pull in some of the metrics that we care about. For now, let's use citizen population, since only citizens can vote, and confirmed voters to start. I'll go ahead and format both of those as numbers, with a thousands separator and no decimal points. There we go. And finally, I'll sort my states descending by citizen population. As expected, we see big states like California, Texas, Florida and New York at the top of our list, and down at the bottom are the less densely populated states like the Dakotas, Alaska, Vermont and Wyoming. No surprises there. From here, I want to add a calculated field, because what I'm interested in looking at isn't just the number of citizens and the number of confirmed voters, which is interesting but not entirely surprising. What I want to do is calculate a ratio, so that I can see, out of the citizen population, what percentage actually became confirmed voters. So I'll go into my tools, add a new calculated field, and name this Voter Population Percentage, and again, that's just going to equal the ratio of confirmed voters to the citizen population. Press OK, and we'll format this as a percentage with one decimal point. We can already see a completely new story start to emerge here.
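Just for reference, that calculated field definition is nothing fancier than a simple ratio; assuming your field names match the raw data headers (Citizen Population and Confirmed Voters), it reads = 'Confirmed Voters' / 'Citizen Population', and the percentage formatting is applied afterwards through the normal number format options.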
And we're not only seeing a state-by-state comparison in terms of the total number of voters, but also as a ratio of the eligible populations. So why don't we go ahead and sort by our new metric, Voter Population Percentage, to get a sense of where we see the largest share of voters. Definitely some interesting trends here. We see the District of Columbia, or D.C., at the top of the list, with a voter population percentage of 76.1%. That makes sense; it's the nation's capital, the political heart of the country. But then you also have states like Mississippi and Wisconsin, which may come as a surprise to some. Now, scrolling down, on the other end of the spectrum we see West Virginia, Hawaii, Oklahoma and Arkansas. This isn't necessarily the time or place to dig much deeper into why these trends are the way they are, so what actually causes the West Virginia population to be less likely to convert into voters compared to Massachusetts or Colorado, for instance. But a really interesting next step would be to pull in some additional data from the U.S. Census that you might be able to tie in here. Things like income per capita, education level or additional demographics might help to provide some additional context and give you tools to better understand the why, or the driving forces, behind the voter rates that we're seeing. One additional piece of information that we do have on hand is age. So let's go ahead and grab age out of our field list and add it to our view as secondary row labels. Now, as we look through this view, we can see some additional trends start to emerge. For example, in our top states we tend to see really strong voter population rates among older populations, especially the 65-plus age bucket, and then generally the younger age buckets tend to show smaller rates, but still relatively strong. So 62% in D.C. within the 18-to-24-year-old range, and just under 64% for 18 to 24 in Mississippi. Then, scrolling all the way down to the opposite end of the spectrum, one thing that's kind of interesting to see is that you still get relatively strong voter rates out of the older populations, not quite as strong as the top states, but still in the fifties and sixties. But when you start looking at the younger age demographics, the 18-to-24s and the 25-to-34s, you see much, much lower rates down here in states like West Virginia, Hawaii and Oklahoma. So, for instance, of those aged 18 to 24 in West Virginia, there are 162,000 citizens that fell into this age bucket, and only 37,000 became confirmed voters, which is a rate of 22.8%. It's a similar case in Hawaii at 26.4%, 27.2% in Oklahoma, and even lower in Arkansas. So still pretty directional at this point, but a really interesting insight nonetheless. Now, the next thing that I'd like to explore a bit more here is the composition of the voter population, and to do that I want to use some of our Show Values As calculations. Why don't we start with just age group to begin with? I'll pull state out into the filters, so that I'm just looking at my data organized by these five age group buckets. From here, if we're interested in looking at the distribution of the voter population specifically, let's pull in a second instance of confirmed voters right here in column D. And now, instead of showing these with no calculation, let's show them as a percent of the column total.
What this tells me is that among all confirmed voters in this data set, so in the year 2012, the largest share of those voters falls into the 45-to-64 range, followed by the 65-plus range at 22%. On the other end, you've got your smallest proportion, or share, of voters, which falls into 18 to 24. So the youngest voters make up only 8.54% of the total number of confirmed voters nationwide. Now let's drag state back in as the primary row labels and see how these trends look at a more granular level. In this case, rather than formatting them as a percentage of the column as a whole, which is kind of tough to read since the percentages get really, really small, I'm going to change that calculation to percent of parent and set state as my parent. And now we have an entirely new set of insights at our fingertips, so we can drill in state by state and see how these proportions by age range differ from place to place. D.C., for instance, is kind of an interesting story, in that the largest share of confirmed voters within D.C. actually comes from the 25-to-34 bucket, whereas most other states, like Mississippi, Wisconsin and Minnesota, show the largest share of voters coming from the 45-to-64 bucket. So for whatever reason, the voter population skews a little bit younger in D.C. than it does in many of these other states. Then, scrolling down, another interesting insight is that for these low voter population states, like Oklahoma, Hawaii and West Virginia, look how small these proportions are, particularly for the youngest age group, 18 to 24. Less than 5% of voters in Hawaii fell into the 18-to-24 bucket, and only 6% in Oklahoma fell into that range. So for whatever reason, the younger populations in these states really don't seem too inclined to get out there and vote. Definitely another interesting story, and we're really just starting to scratch the surface of the interesting angles that we can take with this data set. So that's our kickoff for the U.S. voter case study. Go ahead and give the homework exercises in the next lecture a shot and see what else you can discover.
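One quick note on what those two Show Values As settings are actually computing, since we'll lean on them again later: percent of column total divides each cell by the grand total of the column, so for an age bucket it's roughly confirmed voters in that bucket divided by confirmed voters across all buckets. Percent of parent, with state as the parent, swaps the denominator for the state subtotal instead, so each cell becomes confirmed voters for that state-and-age combination divided by confirmed voters for that state as a whole. Same underlying counts, just two different denominators.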
3. San Francisco Salaries: Next up, we've got our San Francisco salaries case study. A quick summary of the data set: we're looking at salary information for San Francisco government employees between the years 2011 and 2013, so just over 24,000 rows or records of data here, including a number of different dimensions and measures. On the dimension side, we've got the employee name, year, employee ID and job title, and for measures we have base pay, overtime pay and other pay. The concepts that we'll cover in this particular case study are sorting and filtering, grouping, calculated fields and table layouts. So let's jump in. All right, in your pivot table case study workbook, go ahead and jump to the San Francisco Salaries tab. We're going to Ctrl+A to grab all of that data and insert a new pivot table onto a new worksheet. Let's rename that sheet Salary Pivot and assign a tab color of light green, again just to differentiate between the raw data and the pivot tabs. First things first, let's pull in employee name, so that we're looking at data at the employee level to start, and I can check the boxes to bring in all of our measures: base pay, overtime pay and other pay. I'll format all three of those fields as currency, and I don't really need decimal points here. So here we go, now we're formatted. I also want to insert a new calculated field that we'll just call Total Pay, because I want to see the sum of all three of those measures: the base pay, plus any overtime pay that anyone earned, plus any other pay as well. So there you go, Total Pay equals base plus overtime plus other. Now I'm going to approach this in kind of an unbiased way, just start to explore the data a little bit and see what I find. As a first pass, maybe we just sort our employee names descending by that new Total Pay field we just created, since we want to see who the top earners were during this entire three-year period. So Judy Melnick earned 553K in total pay during this period, Mike Dreiling earned 460, and so on. Now, instead of looking at the overall sample, I want to see the top earners for just a given year, so I'll bring year into my filters box and filter to just 2013 to see who earned the most. Gary Altenburg, for instance, had a great year in 2013, earning total pay of just under 363,000. But what's kind of interesting about Gary is that he earned 129,000 in base pay and then another 221,000 from overtime, which is a huge amount of overtime pay, especially compared to some of these other employees who didn't earn anything in overtime, like Amy Heart, for instance, or Sharon, or James Dudley. So that's kind of an interesting trend, and that's something that, as an analyst, I'm interested in diving a little bit deeper into. What we can do is grab job title and pull that in, just to get a sense of what Gary and Amy actually do. When we do that, we see that Gary, it turns out, is a lieutenant, looks like in the fire department, whereas Amy is some sort of department head. And I suppose that makes sense; firefighters tend to earn more overtime pay just because of the nature of their work. So kind of an interesting insight there. Now, if we pull job title back out, we can continue to explore that overtime pay trend using another calculated field.
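Just for reference before we build it: the Total Pay field we created a moment ago is nothing more than the sum of the three pay columns, so assuming your field names match the raw data headers, the definition reads = 'Base Pay' + 'Overtime Pay' + 'Other Pay', and the new overtime field will simply divide overtime pay by that same total.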
What I want to do now is, instead of just looking at the volume of overtime in terms of dollars, look at the rate, so an Overtime Percentage, which can be calculated as overtime pay out of total pay. This will give me a different kind of lens to look at my data through. Let's make it a percentage with one decimal point. What this new field allows me to do is sort my employees descending by that overtime percentage and see who's earning the biggest chunk of their paycheck from overtime versus base pay or other pay. In this case, Janey Jennings is at the top of the list; 82.2% of her pay was categorized as overtime, so 33,000 out of $40,358. But there's kind of a lot of noise in this list. I've got some employees that are only earning a few hundred or a few thousand dollars, meaning that they're probably part-time or contract employees. So we can drill into our sorting and filtering options, and what I would do in this case is apply a value filter, greater than, and say OK, total pay needs to be greater than 50,000, and press OK. That will eliminate a little bit of noise and give me a more apples-to-apples comparison of the employees that I'm looking at. Once we've done that, Kimberly King Stitt is at the top of the list; she earned 63% of her pay from overtime, and there's our good friend Gary right there at number two on the list. Obviously, at this point, the number of different paths that we could take with this analysis is pretty much infinite; even just the number of ways we can filter and sort this employee list is essentially endless. For instance, if we wanted to only look at employees whose name starts with Steve, we can use a label filter to do that: label filter, begins with, Steve. Now we're only looking at the Steves in our data set, but you might notice that my $50,000 value filter has been overridden, so I've got some people in here who earned 14,000 or 5,000. And again, like we talked about earlier in the course, that's because by default the pivot table options will only allow you to have one filter applied per column. I can change that in my tools by going to Options, Totals & Filters, Allow multiple filters per field, and press OK. Now we still have our Steve label filter, so let's go ahead and reapply the greater-than value filter and say total pay is greater than 50,000 again. OK, and now we've got both of our filters applied: we're looking at only total pay greater than 50K and only names that begin with Steve, and both of these check marks indicate that both filters have been applied. At this point I really don't care about employees named Steve, so I'm going to clear all these filters from the employee name, and let's look at the data at a little bit of a higher level. So pull employee name out, and let's aggregate things by job title. One thing that's interesting here is that when you look at a metric like base pay, it's tempting to look at this and say, oh wow, account clerks make a lot more money than accountants, for instance. But before you do that, you really need to think about what you're looking at here, and in this case, this Sum of Base Pay column is summarizing, or adding together, the base pay of every employee in the data set that falls into each of these buckets.
My hypothesis would be that there are many more employees getting labeled as account clerks than accountants, which is why we see a base pay of $974,000 for account clerks and only 65,000 for accountants. We can check that just by dragging employee name in as secondary row labels, right here into our view, and that's exactly what's going on. We've got, it looks like, 23 employees with a job title of account clerk, and we've got only one accountant, Carlito. So this 973 figure is adding together the individual salaries of each of those 23 employees. So pull employee name back out. If we do want to get a better sense of apples-to-apples earnings by job title, which I think is an interesting thing to look at here, what I can do is change the summarization mode from sum to average. Rather than edit this one, I'm going to pull in a second instance of base pay so that we can compare, and this is the one whose summarization we'll change to an average, and format it as currency. So now it's taking the average of all 23 of those employees' salaries, which comes out to about 42,000, and that is in fact significantly lower than the accountant average salary of 65,392. So it's a subtle but really, really important clarification to make as you're interpreting numbers in a pivot table. From there, now that we have our average field in place, we can sort our titles descending by that average of base pay, and we get a cleaner, more accurate list to make those comparisons. We see titles like department head, medical examiner and deputy chief up at the top of the list, and then as you scroll down you start to see more titles like secretary, general laborer, inventory clerk, cashier, and so on and so forth. So already some interesting findings and insights bubbling up. Next up, what I want to do is group some of these titles together, since there are quite a few areas of overlap. I'll sort alphabetically and scroll through to see what these job titles are coming through as, and you'll probably notice that there are a bunch of variations of similar types of job titles; it looks like there are at least 10 or so airport jobs with slightly different titles. So what I want to do in this case is create groupings that roll up the job titles to a slightly higher level, call it job category as opposed to job title. To give you an example, why don't we start with some of these at the top of the list? I can go through the accountant-related titles, right-click, group those together, and instead of Group1 call it Accounting. If you recall, by default this pivot is set up in compact view, meaning that both my title field and the new group field that I created are nested in the same column. I don't want that to be the case; I want to be able to deal with both of those fields separately. So I'm going to my design options, Report Layout, Outline Form, and that breaks out those two fields. Now I can independently edit my original job title field and my new group field, so instead of Job Title2, let's call this Job Category. Now we can just continue the process like I did with the accounting titles. Maybe I want these admin ones together, so I'll right-click and group those, and instead of Group2 call it Admin. And then let's do one more with these airport jobs; I've got something like 12 different airport-related titles in here, so let's group those together and call it Airport. There we go.
And now, since I've changed the outline mode, if I want to roll up to the category level, I can just pull job title out and organize the data by that new field that I just created. So Admin as a whole, which includes, I believe, four roles or job titles that fall within it, generated a total base pay of 2.273 million and an average of 71,000. Now, this manual approach to grouping works just fine, but as a side note for those who are interested, a more elegant and efficient solution would actually be to create a separate lookup table that matches all of those job titles to specific categories, and then, from there, use either lookup functions or data modeling tools to actually integrate those job categories as a new field in the raw data itself. That would save me the trouble of manually scrolling through and grouping values together in the pivot, but that's a topic for another day and another course. So there you have it. That's our starting point for exploring the San Francisco salary data. Go ahead and play around with it, do some exploring, give the homework a shot, and give me a shout if you have any questions.
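One quick footnote on that lookup-table idea, for anyone who wants to experiment on their own: the sketch would be a small two-column table somewhere in the workbook, job titles in one column and your category labels in the next, plus a helper column in the raw salary data along the lines of =VLOOKUP(D2, CategoryLookup, 2, FALSE), where D2 stands in for the job title cell and CategoryLookup for whatever you name that lookup range; both are just placeholder names here. Once that helper column exists, Job Category shows up as a regular field in the pivot, with no manual grouping required.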
4. Shark Attack Records: All right, our next case study is a ton of fun; it's my all-time favorite. For this one, we're actually going to look at shark attack records recorded between 1900 and 2016, so we've got just under 5,300 records or observations in our data set. What's a little bit unique about this data set is that it actually doesn't contain any measures: there are no numerical variables or quantitative fields, it's all dimensions. That's relatively common when it comes to things like record keeping, because what's most relevant is the information contained within those records, as opposed to any sort of quantitative field. So we've got a ton of really fascinating dimensions to work with, starting with the case number and the date that the attack took place. We can cut the data by the type of attack, whether it was provoked or unprovoked; where it took place in terms of the country, area and specific location; what activity the victim was engaged in when they were attacked; their name, their gender, their age; what type of injury was sustained; whether it was fatal, yes or no; even the species of the shark, the investigator, and where the source came from. All of this is included in these shark attack records that we have on hand, so a really, really rich data set to use for all sorts of different types of analyses. In this case we're going to do a quick kickoff, and some of the concepts that we'll cover are date grouping, value settings and pivot charts. So let's jump in. All right, go ahead and head to the Shark Attack Records tab in our pivot table case study workbook. We're going to Ctrl+A to grab all of that data and insert a pivot on a new worksheet. Let's name this Shark Pivot and format that tab with a nice green fill. And my gosh, where do we even begin with a data set like this? There are so many cool angles that we can take, but at the end of the day, broadly, what we're trying to understand is the number of records, or count of attacks, broken down in all sorts of interesting ways: maybe by date, by location, by gender, by activity. Really a lot of options available to us here. And since my raw data is at the case level, meaning one row per recorded attack, this should be relatively simple to do. So let's drag case number here into our values box. As you can see, it defaults to count, because it really isn't a numerical field or a measure, and in this case the count is exactly what I want. Right now it tells me there are 5,292 records or rows in my data set, and if I jump back to my actual data sheet, I can confirm that aligns with the number of rows, so that's exactly what I want. Now it's a function of breaking down this count and slicing and dicing it in different ways. Why don't we start with trending? Take the date field and drag it into rows. As you can see, it's auto-grouping, which is OK; I'm going to pull quarters and date out so that I'm just looking at the attacks by year. Because the data is a little bit funky early on, in the beginning of the century, I'd like to limit this view to a more modern cut of the data. So what I'll do here is just select the last 12 years or so, 2005 through 2016, and then right-click, Filter, Keep Only Selected Items. And there you go, I've got this 12-year running count of shark attacks by year: 102 attacks in 2005, and so on. Now what I want to do is pull in a couple more instances of this count of case number field, so we can show these values in different ways.
For a second column, maybe we want to show this as a percent of the column total, which now tells me that the 139 attacks in 2015 made up the largest share within this 12-year sample, accounting for just over 10% of the recorded attacks. For a third instance, I can show the values as a running total, with year as my base field, so I can see how the attack total has grown, starting in 2005 and adding all the way up to 1,387 total attacks by the year 2016. So just different ways to look at this data and understand how the trending has looked over time. That's interesting enough; let's pull years out now, but preserve that 12-year view, because now we want to cut the data in a different way. I think I'll start with gender; I want to see if there are notable differences, female versus male. And you'll see that my third column, which was showing values as a running total, is no longer valid, since that base field of years no longer exists in my view, so I can go ahead and just pull that out. This view is interesting: what it tells me is that it's bad news to be a guy as far as shark attacks are concerned. I wouldn't say that they're being attacked by sharks because they're male; it's more likely because of the activities that they're engaged in. But pretty fascinating: in this 12-year window, 83% of the attacks were on men and just under 17% on women. Kind of a cool nugget there. And we can go even deeper: pull gender out and actually look at the attacks by age. Now, scrolling through here, one thing I notice at the end is that there's an Unknown bucket with a ton of observations; these are just attacks or cases where an age wasn't provided or wasn't recorded. I don't want that in the view, so I can go ahead and just deselect Unknown so that I'm only looking at known ages. And this is a great candidate for visualization, particularly for a histogram-style chart, which can really show the frequency of attacks by age and allow us to understand which ages tend to be more frequently involved in shark attacks. So I'll simplify this view a little bit and pull out the percent of column total calculation, so that I'm just looking at the count of case number, bucketed by age. Now, with any of these cells selected, I'll go into my tools, PivotChart, and this clustered column will give us a really nice histogram effect. I can go into Analyze and get rid of those field buttons. But take a look at this: a really clear and powerful visualization that shows me exactly where the highest frequencies of attacks take place in terms of age buckets. In this case it's pretty crystal clear that starting around age 15 or so, the frequency of attacks is much higher, all the way out to about 22 or so, and then it starts to come down and trails off quite a bit once people reach their fifties and sixties. Again, this doesn't necessarily address the causation, the why is this the case, but some hypotheses that we might start throwing out there are that younger people are more likely to be engaged in activities where they're simply more likely to be attacked by a shark, like surfing or bodyboarding or swimming, or whatever it may be. But it's a really interesting view here, and if we want to keep it, one option would be to just copy this pivot and create a new version elsewhere in the sheet or on a different sheet. In this case, I'm actually just going to drop this chart and continue to explore the data, so let's delete that. And now, rather than age, I'm kind of interested in this activity field.
I want to see what people were doing when they were attacked, kind of which activities led to the most attacks in the sample. So I'm going to pull activity in as my row labels and sort descending by the count of case number. Nothing really surprising here; it's kind of what you'd expect. A lot of attacks happened while people were surfing or swimming or spearfishing. If we wanted to drill in one layer deeper, we could also pull in this Fatal Yes/No field, which is a binary field: it takes a value of No if the attack was not fatal and Yes if it was. So we can see if there are any differences between, say, fatal attacks typically occurring when people are swimming or surfing versus non-fatal. Kind of the same story here, nothing too groundbreaking, but just different ways to slice and dice this data using the dimensions at hand. Now, one last thing to call out is that if we unfilter the year field, you'll notice that column A gets stretched way, way out, which becomes really annoying to work with. To fix that, you can right-click and change the column width back to 20 or so, so that it's back in view. The reason it's doing this is that there are certain activities here with very, very long strings, and Excel is trying to auto-fit the column width so that none of those strings get cut off. The problem is that any time you make any adjustment to your pivot, it's going to auto-fit again and again, and you have to keep changing the column width over and over. So to fix that, I'll just hop into my PivotTable Options and deselect Autofit column widths on update, and press OK. Now, when you make any adjustments to your pivot, column A will remain the same width as it currently is and not stretch out every time you make a change. So with that, I'd highly recommend that you explore this data a little bit deeper on your own; there are some pretty incredible records here, and I've really just begun to scratch the surface of this data set. My personal favorite so far, if I pull activity out and drop in name, is a man by the name of Cosimo Pechiney, who is kind of an interesting case because, judging by the activity recorded, he actually attacked a shark with his fists. But don't worry, he somehow escaped with just a lacerated arm. On that note, that's your intro to our shark attack case study.
5. Stock Market Data: All right, our next case study uses stock market data, and we have a pretty simple data set to work with here. It's a three-month sample of stock market data for about 500 different publicly traded companies, and looking at the screenshot on the right, we've got 29,440 observations with just a few simple dimensions and measures. The only dimensions that we'll work with here are date and symbol, and then for measures we've got pretty standard financial stock market measures: the open price, the high and low prices for the day, the close price and the trading volume. The concepts that we'll cover in this kickoff include some basic sorting and filtering, with a real focus on conditional formatting, using visualizations to tell a story. We'll practice highlight cell rules, do some really cool demos with data bars, and then wrap up with some value settings like percent difference from, which is a great tool to show day-over-day gains and losses in a really clear way. So pretty straightforward, pretty quick one; let's go ahead and jump in. All right, with your pivot table case study workbook open, go ahead and navigate to the Stock Market Data tab. Before you create a pivot, just one side note: this is a pretty standard data extract, and it's by day, meaning that this is a good candidate for an analysis that I might revisit time and time again and maybe stack more data onto as I collect more data in the future. Since that's the case, it might not make sense to select the entire data set only down to row 29,441, because that means if I add or stack new data beneath this, I would have to go into my pivot table tools, change the data source, and extend the row reference out to whatever row I've added data to. Like we talked about, there are two ways to deal with data that could potentially grow over time. One is to select the entire columns via the column headers, so that the pivot table reference has no row references and extends down beneath the last row of data; that way, as we add new data, all we need to do is simply refresh the pivot. The second approach is to convert this range into a table, because one nice feature of tables is that they can absorb new data as it's added beneath your existing data range. In this case, let's go ahead and take the latter approach. All I need to do is select any cell within my range and go to Insert, Table. You can see it selected the entire contiguous range containing data, through row 29,441, and yes, my data does have headers in row 1, so that box should be checked. Press OK, and there you go. Now, if you want to format your table, you can do that right here in the table styles, but this is my raw data and I really don't care much about how it looks. If I do add more data later on, though, it will be much, much easier to accommodate. So with that, let's go back in and drop a pivot table onto a new worksheet, and we can name this one Stock Pivot. There we go, and let's format that tab with a light green shade. Now, my goal here, in general, is to analyze trending, so the first thing I can do is grab date and pull it into my row labels. It's auto-grouped it, and with the date field selected, I really don't want months here; I just want to preserve the daily-level trending. So I can go into my Analyze tools and just press Ungroup, and that will revert back to the original format of that field, which in this case is what I want.
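One quick side note on that table conversion, by the way: once the raw data lives in an Excel table, the pivot's data source is the table name rather than a fixed cell range, so when new daily rows get pasted directly beneath the existing data, the table absorbs them automatically and a simple right-click Refresh on the pivot pulls them in; there's no need to touch the data source again.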
Let's go ahead and drag symbol into our filters box, since eventually we'll want to dive into performance for individual companies, and then we'll drag all of our fields into our values box: open, high, low, close and volume. Typically, when you're kicking off a brand new pivot, you'll need to go through a round of formatting, so let's get that done right now. We'll format the first four columns as currency, so open price, high price, low price and close will all be formatted the same: currency with no decimal points. There we go. Volume is not currency; volume is just the amount of trading taking place, so I'll format that as a number with a thousands separator and no decimals. OK, so this is a bit more clear. Let's also practice our custom column headers. If you want to get rid of these Sum of labels, you can change that right here in the formula bar, and since just Open is already a field name, I'm going to use my trailing space trick and format it that way. Same thing with high, low, close and volume; this just makes my table a bit more readable. And also, to help, I can grab all of those columns, go to Home and center the alignment, again just a little readability trick. Keep in mind we're still looking at the aggregation, or the sum, of prices and volume across all of the symbols in our data set, which in this case doesn't really help much. So instead, why don't we drill down into one symbol specifically? You can pick any one, but I'm going to start with AAPL, or Apple, and now this makes a bit more sense: we're looking at the actual Apple stock price day over day, as well as the trading volume. So now I have all of the raw information that I need to understand Apple's stock performance over this three-month sample, but there's definitely some work that we can do to add some visualization and really help the insights shine through. Why don't we start by really emphasizing the day-over-day gains or losses? To do that, I'm going to focus on this close column, and what I want to do is pull in a second instance of close. But for this one, instead of showing it with no calculation, I'd like to show this column as the percent difference from the previous date, so my base field is date, my base item is (previous), and press OK. This tells me how the close price has trended day over day; in this case, on August 24th it was down just under a hundredth of a percentage point compared to the 21st, and the reason there are missing dates is that the market is closed over weekends, so the 21st was a Friday and the 24th was a Monday when the market opened back up. This field is definitely helpful, but the problem is that it's still just a lot of numbers, and it's kind of hard to pinpoint trends, or what's really going on, or where my eye should focus. To help with that, conditional formatting will be a great tool to use. Before we do, let's just go into Design and get rid of our grand totals row, since we don't need that. And you'll notice how, any time we make a change, the columns automatically refit based on the data; I don't really like that, so I can change that option in my PivotTable Options and just uncheck that autofit column widths option, and press OK. Now I can customize the width of my columns, and they'll no longer change as I make updates to my pivot. And instead of Sum of Close, let's call this one Daily Change to make it a bit more clear, and I'll center that as well.
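Just to be explicit about what that Show Values As setting is computing: for each trading day, the percent difference from the previous date works out to the current day's close minus the previous trading day's close, divided by that previous close. A positive number is a day-over-day gain, a negative number is a loss, and weekends simply don't show up because there are no rows for them in the data.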
Now, to apply the conditional formatting, I'm going to Ctrl+Shift+Down to grab this entire column of data and go to Home, Conditional Formatting. There are a few options I could use here; I could do color scales, like red to green, but in this case I want to use highlight cell rules, because I only want these cells to take two formats, either red or green, based on a gain or a loss day over day. So start with a greater-than highlight cell rule and say any time this cell value is greater than zero, meaning there's a percentage lift compared to the previous day, that's a good thing, so we'll format it with green fill with dark green text, and press OK. Without changing the selection, I'm going to drill right back into my highlight cell rules and add a second rule for less than zero, and that's a bad thing, so format any cells that are less than zero, or negative, with light red fill with dark red text, and press OK. Now it's very, very clear which days showed a lift for Apple and which days showed a decline. It shows me periods of time where there are consecutive gains or consecutive losses, so those stories come through so much more clearly with something as simple as a highlight cell rule applied in my pivot. So that was definitely helpful. Next, let's draw attention to this volume field, since that's another really important element of the data that we can work with. To show volume trending day over day, there are obviously a bunch of different ways to do this; I could use a pivot chart with a line chart or a column chart, but in this case I want to add my visuals right here in the table itself. To do that, just like we did with close, I'm going to pull in a second instance of volume, call this one Volume Trend, and expand the width of this column quite a bit. Now, same deal here: Ctrl+Shift+Down to grab that entire column of data, then Conditional Formatting. This time I'm going to use something called data bars, which essentially puts a bar chart directly within the cells of the pivot, so it's a really nice tool to add visualizations without creating secondary charts. We'll do a blue data bar here, and now it's very, very clear where those peaks and valleys occur in terms of trading volume. In this case, for Apple, the highest volume day was the 21st, followed by the 20th; there was clearly a lot of activity on those days, for whatever reason. Then, as a final tweak, notice how the numbers are kind of redundant here, since we have volume already in column G, and they overlap with the bars, which just looks ugly. So a little pro tip: Number Format, Custom, three consecutive semicolons, press OK, and voila, your numbers become invisible, and I have this nice, clean data bar column on its own. What's great about this is that now that I've invested a few minutes in building this template and applying these conditional formatting rules, I don't have to do it over and over again as I filter or sort or slice and dice this data. For instance, instead of Apple, maybe we want to look at Amazon next. Press OK, and our daily change formatting and our volume trend data bars all update automatically. Very, very easy. And now here I see a very clear trend where something happened on the 23rd that caused Amazon's stock price to jump almost 27% and trading volume to pretty much explode compared to any other day in the sample. Now, I happen to know that earnings were released on the 23rd, which were really, really positive for Amazon, and that drove this gain that we see here.
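By the way, if you're wondering why three semicolons hide the numbers: an Excel custom number format has up to four sections separated by semicolons, covering positive values, negative values, zeros and text. Entering just ;;; leaves every section empty, so nothing gets displayed, but the underlying values are still there, which is why the data bars keep working and the pivot still calculates normally.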
So overall, that was a really quick, really easy way to integrate some visualizations and some dashboard-style tools directly within our pivot to help us tell these stories and understand exactly what's going on with the raw data itself. In your homework exercises, you'll go through a similar process, but instead of analyzing day-over-day differences, you'll dig a little bit more into price spreads per day. The same logic will apply, and with that, that's our quick kickoff to the stock market case study.
6. Baseball Team Stats: All right, for all you sports fans out there, this next case study is going to be a fun one. We're going to look at baseball team stats. The data set summary: we're looking at Major League Baseball team statistics by season for 21 seasons, from 1995 through 2015. So not a ton of rows of data, it's only 624 observations, but we have a huge number of dimensions and measures to work with. Dimensions include the season start date, the year, the league, the division, and the team name, both abbreviated and full, and then some binary yes/no flags to indicate whether the team won the division, the wild card, the league or the World Series; those binary flag fields will be really helpful tools that we can use to filter and sort. In terms of measures, you've got all of your classic baseball stats here: games played, wins and losses; hitting stats like runs scored, at bats, hits, doubles, triples, home runs, walks, strikeouts, stolen bases and caught stealing; and then pitching or defensive stats like runs allowed, earned runs, earned run average, complete games, shutouts and saves. We're actually only going to use a small handful of these fields, at least in the case study kickoff, but they're all available for you to explore however you choose. Concepts covered in this kickoff: we're going to talk about sorting and filtering, pretty basic stuff; we're going to use pivot charts and slicers; and then we'll practice a few calculated fields. So let's go ahead and dive in. All right, with your pivot table case study workbook open, go ahead and select the Baseball Team Stats tab. We just have a fixed range of data here, so we can press Ctrl+A to grab all of that data. Actually, we don't want these extra rows, so just select one of the cells in the range and press Ctrl+A again, and that will get you down to row 625, which is what we want. Then insert a pivot table on a new worksheet, and let's name this Baseball Pivot and format it with a nice light green shade. And here we go, all of these fields to work with. Again, we're not going to dig into everything; we're just going to start to scratch the surface here. Since we're looking at team-level data by year, let's take year and drag it into our row labels; now you can see that 21-season time frame that we've got to work with. I'm going to take the full team name and make it a filter up here. I don't care about a lot of these component stats for now; I just want to start by analyzing win-loss records. So I'm going to pull in W for wins and L for losses, and do a little formatting here: number format with a thousands separator and no decimals, and the same exact format for losses. There we go. And let's do custom header names here, Wins and Losses. Now, you might be wondering why these two fields are equal right now. It's a great question, but when you think about it, we're looking at data for the entire season across all teams, and for every given game one team has to win and another team has to lose, so every game equates to one win and one loss, since there are no ties in baseball. So this doesn't really make a whole lot of sense when we're looking at it across all teams, but when we drill into a particular team, like the Red Sox, for instance, now we're looking at that particular team's win-loss record season by season. So I want a better way to interpret these win-loss records.
What I want to do is add a calculated field that I'll call Win Percentage, and that will be defined simply as wins, or W, divided by G, which is games played. Press OK, right-click and format that as a percentage with one decimal point. There we go. Now we have a single column that captures the information in the wins and losses columns, and this is win percentage: in 1995, the Red Sox win percentage was 59.7%. Now I can use this field to really start digging into the data and analyzing it a level deeper. For instance, for the Red Sox, I can sort these seasons descending by win percentage and say, OK, 2004 was the Red Sox's best year in this sample in terms of win-loss at 60.5%, and 2012 was the worst year, with a win percentage of 42.6%. And the same goes for any other team; I can change the selection from the Red Sox to the Cubs, and now I can see that the Cubs' strongest year was 2008 at 60.2%, followed by 2015, and so on and so forth. I can also take team, drag it into my row labels, pull year out, and apply the same sorting logic, descending by win percentage, to get a sense of which teams had the strongest win-loss records over the course of the entire sample. We see that, in general, the Yankees were pretty much the most dominant team over this 21-year period, with an average win percentage of 58.7%. On the other end of the spectrum, you've got teams like the Marlins and Rays at 43% and 40%, respectively. Right now we're looking at all the teams laid out in a single list, but we can also show the hierarchies here, since we do have the division and league dimensions to work with. So, for instance, we could pull in division ahead of team in the row labels and split things out by the Central division teams, the East division teams and the West division teams. We can also pull league in to add one additional layer, so now it's the American League Central, the National League Central, and so on and so forth. That just gives a different context to our sorting options and our win percentages, because now we're looking at these teams ranked by win percentage within each particular division. What's cool about this view is that now I can drill down to any particular year or season, 2015 for instance, and this just visualizes the actual final standings at the end of the season: the Royals won the American League Central, the Blue Jays won the American League East, the Cardinals won the National League Central, and so on and so forth. So kind of a cool way to organize this view using a pivot table. But let's go ahead and drag league and division back out, and why don't we play with some of those binary flags that we talked about, for division win, league win, wild card win and World Series win, which you'll see here in these four fields. I'm going to drag each of these four fields into my filters. There we go. Now what I can do is deselect my year filter, so I'm looking at the entire sample again, and tweak these flag filters; you can see that they're all either yes or no. So say, OK, show me a list of teams that won their league; LgWin is a league win, so this is a list of teams that won their league in some season within this sample. I've got the Braves, White Sox, Angels, Indians, etcetera. And the same thing goes for any of these, so World Series win: show me all the teams in the sample that won the World Series in one of these seasons. So pretty cool.
Now I know which teams won the World Series at least one time in these 21 seasons, but we can also drag in year, like so, to see which teams won more than once and in which particular years they won. So the Braves won only once, in '95, the Angels won in 2002, and the Yankees, we can see, have five years listed. If I want to sort these, let's go into Design, Report Layout, Outline Form, which will split out those row labels and make this a little bit more readable as well. Now I can sort year smallest to largest instead of ranking by win percentage, and it's a little bit easier to read: the Yankees won five times, in '96, '98, '99, 2000 and 2009; the Red Sox won the World Series three times; the Marlins twice; the Giants three times; the Cardinals twice. Pretty cool stuff that we can extract with some very simple pivot table sorting and filtering tools. But now let's return to our original win percentage trending view. I'm going to clear out team and unfilter that World Series win field. And instead of showing both wins and losses along with the win percentage, we really only need win percentage at this point, so I'll pull wins and losses out. Like we talked about, the win percentage will always be 50% when you're looking at all teams rolled up, so why don't we go ahead and select a specific team? We can search for the Giants here and filter just on the San Francisco Giants. Now what I'd like to do is add a pivot chart to help visualize this trend and also potentially analyze some interesting correlations as well. Since I have my simple view set up here, I can go ahead and insert a pivot chart right from my pivot table tools options, and a column chart will do just fine. So insert that column chart, and in my Analyze tab I don't want these field buttons, so I'll get rid of those and just spread this out a bit. This does a nice job showing the Giants' win percentage and how it's trended year over year, so we can see where they've had strong seasons and where they've had weak seasons, where maybe they were rebuilding the team. A nice, simple but clear visualization here. But I want to go a level deeper, and I want to understand what might be correlated with win percentage. One thing that comes to mind, and you might take a different approach, but in this case, why don't we look at the team batting average, because batting average is a pretty good indicator of hitting ability, which should, in theory, translate to runs scored and therefore wins. So I'm going to select a cell in my pivot table, go into my tools, Fields, Items & Sets, and add a new calculated field for AVG, or average, and that will be defined as hits, or H, divided by at bats, or AB. So that's batting average. When I press OK, you'll see that the new field has been added to both my table and my pivot chart. To visualize this a little more effectively, what I'm going to do is change this from the clustered column it defaulted to into a different type of chart. I'm going to go all the way down here to Combo, and I want my win percentage to be shown as columns, as is, but I want average to be shown as a line chart on the secondary axis. I'm choosing the secondary axis so that I can edit each of these series and each of these axes independently. Press OK, and we're now showing our two metrics, or two measures, at once, and already we can see a pretty nice correlation between batting average and win percentage. Now, we know that correlation doesn't necessarily prove causation.
But in this case we can be pretty confident, at least directionally, that hitting performance is generally a solid driver of win percentage. So now let's make just a few formatting tweaks here. I'll right-click and format the data series; in this case, instead of orange, I want a nice dark blue line there, and I prefer the smoothed line, so I can check that box down at the bottom. And I've got my legend here showing me which series I'm visualizing as columns and which series I'm visualizing as lines. My last step, just to add a little bit more interactivity, is to insert a slicer as well for the team name. Right now I'm looking at just the Giants, which you can see, if we scroll down, is the one item selected, but this way you can select any team and get a sense for how the correlation looks and how the win percentage and batting average have trended for any individual team. You may have noticed that these markers showed up formatted in orange; to be honest, I have no idea why that happened, but if you're curious, or if you don't like them, I'll show you how to change those. You can go back into Format Data Series, go into your Fill & Line options, and there's this Marker tab you can click on. Drill into your marker options, and you'll see it just automatically added these weird orange circles. So let's do built-in markers with these little diamonds, size 6, fill them with white and give them a solid border in this dark blue, and I'll close this. There we go, that looks a little bit better than the random orange, and you can format everything else as you choose; you can put the legend down at the bottom, but obviously the formatting options are essentially endless here. So there you go. Obviously there are a ton of directions that we could have gone with this data set, but hopefully this approach will give you some ideas and some inspiration to get started exploring this data on your own.
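And just to recap the two calculated fields used in this kickoff, assuming the abbreviated column headers from the raw data: Win Percentage is simply = W / G, wins divided by games played, and AVG is = H / AB, hits divided by at bats. Since there are no ties in baseball, W / G works out to the same thing as wins divided by wins plus losses.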
7. San Diego Burrito Ratings: All right, this next case study is another one of my favorites, San Diego burrito ratings. The data set that we've got here includes burrito ratings and Yelp reviews from 65 different San Diego restaurants in 2016, so 237 observations. A relatively small sample, but some really interesting, kind of unusual dimensions and measures to work with. In this case, dimensions include the location, the burrito name, the date, a recommendation, who the reviewer was and any additional notes. Then for measures, we've got the Yelp rating, the cost of the burrito, and then a whole bunch of 0-to-5 ratings along different elements of the burrito itself: the tortilla, temperature, fillings, meat, volume, uniformity, salsa, synergy, whatever that means, and wrap quality. We're going to cover a few different concepts here. One of the most important ones is calculating using count columns, one of the trickier concepts that's really important to keep practicing. We'll practice summarize values by, show values as rank, and then do some really nice conditional formatting tricks to bring this data to life. So let's head to Excel and jump right in. All right, with your pivot table case study workbook open, we're going to be working with the San Diego Burrito Ratings tab. Let's just Ctrl+A to grab everything, insert a new pivot table dropped into a new worksheet that we can name Burrito Pivot, and right-click to change the tab color to light green, and we're good to go. Looking through the fields and the data set that we have to work with here, like we talked about, you might notice it's a little bit different from some of the other data sets that we've worked with up to this point: in this case, all of our measures, with the exception of cost, are ratings on a 0-to-5 scale. So this will give us a really good chance to practice some techniques like weighted averages and using count columns in our calculated fields. Let's kick off this case study with an analysis of scores by burrito, so I can take burrito and drag it right here into my row labels, and then any scores that we care about, we can pull into values. Maybe we don't need all eight; maybe we only care about a few of these, like tortilla. Temperature is important. Of course you need good fillings. And then let's grab synergy, which sounds like a good rating. So we've got four different ratings: tortilla, temp, fillings and synergy. But you can see they're evaluating as sums, which doesn't really make a lot of sense here. If we had two ratings for a given burrito on one field like tortilla, and one rating was a 4 and one was a 5, we want to evaluate that as a 4.5 average, not the sum of four and five, which would be a nine. So let's change the summarize values option from sum to average for each of these: summarize values by average. And then we'll also change the number format to make it a little bit more readable; I'm just going to do a number with two decimal points instead of the general format. There we go, now we've got our four average scores across these four individual elements, or components, of the burrito. Next, I want to create some sort of single aggregated average score based on these four component ratings, so why don't we play with this a little bit? Go into our calculated fields and call it Average Total Score.
Now, what you might think makes sense here in terms of the formula would be to open up a parenthesis and say, all right, our average total score should be our tortilla score, plus our temp score, plus our fillings score, and finally our synergy score; close the parenthesis, divide by four to take the average, and press OK. Now, when you do this and it drops in this average total score, you'll see some make sense. For the 619 burrito original, if I just select the four component scores down here, I can see the average is 3.75, which is correct here in column F; for the Addiction, the average is 3, which is correct. But then I have ones like the 2 and 1, where the average should be 3.63 and I'm getting 10.88, and the Adobada, which should be 3.55, and I'm getting 24.88. Now, I want you to take a minute, pause this video, and just think through exactly why this is taking place. Why am I getting scores that are greater than five when really they should all average out to the average of these four fields? So take a minute, think about it, and then press play when you think you have an answer. All right, so hopefully you were able to figure it out. This gets back to the calculated field concept, where any measures or metrics included in a calculated field have to be evaluated as a sum. So right now, because the calculated field is taking these four scores (tortilla, temp, fillings, and synergy), it's taking the sum of those four scores and then dividing by four. So in cases where there's only one rating for a given burrito, the sum will be the same as the average, so the calculated field will give us exactly what we need. The problem lies in burritos that have more than one review, and we can actually see what's going on there by pulling in another instance of burrito, which is a dimension, and evaluating it as a count in our values field. Since that's counting the number of times each burrito shows up in the data set, and because our data set is at the review level, we know that this indicates the number of reviews for each particular row label. So the cases that did evaluate correctly are all cases where there's only one review; in other words, there's no difference between evaluating a sum versus an average. Where we do see the incorrect average total scores are for any burrito that has more than one review, like the 2 and 1, the Adobada, the al pastor, or the California, and so on and so forth. And one thing that's kind of interesting to note here is that (I'm just gonna clear the pivot table field list for a second to get some more room) if we take the average total score from our calculated field and divide it by this count column, the number of reviews, then all of a sudden we're getting the correct numbers. So for the 2 and 1, for instance, the average should be 3.63: 3.63. For the California, the average should be 3.65: 3.65. So by taking the calculated field that we had created, which is built on sums, and dividing that average total score by the count of reviews, we end up with the correct number that we're looking for. But if you recall, we can't add an operator called count of burrito into our calculated field, because again, we need sums; we can't use counts. So as a workaround, what we need is a field or column in our data whose sum gives us these same values. And since count of burrito is simply counting rows in our data set, we can just create a new column and assign each row a value of one, so that we can then sum them up according to these row labels to give us the same answer.
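(Side note: if it helps to see the same arithmetic outside of Excel, here's a minimal pandas sketch of why a sum-based calculated field inflates multi-review burritos. The column names and individual scores are invented for illustration; the scores were only picked so the 2 and 1 totals line up with the 10.88 and 3.63 quoted above. This isn't part of the Excel workflow, just the same logic in code.)

```python
import pandas as pd

# Toy review-level data; column names and scores are assumptions for illustration.
reviews = pd.DataFrame({
    "Burrito":  ["619 original", "2 and 1", "2 and 1", "2 and 1"],
    "Tortilla": [4.0, 3.5, 4.0, 3.0],
    "Temp":     [4.0, 4.0, 3.5, 3.5],
    "Fillings": [3.5, 3.5, 4.0, 4.0],
    "Synergy":  [3.5, 3.5, 3.5, 3.5],
})
components = ["Tortilla", "Temp", "Fillings", "Synergy"]

# A calculated field only ever sees the SUM of each input per row label,
# so "(Tortilla + Temp + Fillings + Synergy) / 4" really means this:
sums = reviews.groupby("Burrito")[components].sum()
calc_field = sums.sum(axis=1) / 4
print(calc_field)  # 619 original: 3.75 (one review, fine), 2 and 1: 10.875 (inflated)

# Dividing by the number of reviews recovers the true per-review average:
n_reviews = reviews.groupby("Burrito").size()
print(calc_field / n_reviews)  # 619 original: 3.75, 2 and 1: 3.625
```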
So I know that sounds a little bit more complicated than it really is, but let me show you what I mean. Going into our burrito ratings raw data, I'm gonna insert a new column here, and we'll call it number of reviews, because our data is at the review level, and just type a one and apply it down so every row equals one. Back in our burrito pivot, we can change our data source, and since it's looking at only B through Q, we can just change that B to an A, or we could manually drag the selection, and press OK. Now, if I open up my field list again, you'll see this number of reviews field is now available to me. So when I drag that in right next to count of burrito, the sum of the number of reviews is now exactly equal to the count of the burrito dimension. And again, that's because both of these fields are just counting rows; one is doing it using a count summarization, and the other one, which we just created, is doing it through a sum summarization. And this is the one that I can now use as part of my calculated field. So let's pull count of burrito out, since we don't need that anymore, and keep an eye on this average total score. We're gonna go into our calculated field, we're gonna edit this definition, and basically what we're gonna do is take this exact definition as we've calculated it here and divide that whole thing by the number of reviews, which will give us the right answer. So first things first, surround all of this with another set of parentheses and then divide that by the number of reviews, press OK, and see how that updates to give us the exact number that we're looking for. So now the 2 and 1: 3.63, 3.63. The California should be 3.65: 3.65. And there you go. Now we have the correct version of average total score, which we achieved by using a counting column in our raw data to allow us to calculate that field properly. So let's go ahead and drag that average total score right up front, after number of reviews. And now that we have that average total score as our one main KPI or key metric here, I really want to draw attention to this column as my point of comparison across row labels. So one way to do that is to pull in a second instance of average total score, and instead of showing this with no calculation, let's show these values as a rank from largest to smallest, where my base field is the burrito name. Now I can see the average total score in the form of a rank out of this entire list. So the 2 and 1 is ranked 17th in this list in terms of the average total score, the Arizona is ranked fourth with a score of 4.13, the California (only cheese) is ranked third, and so on and so forth. And now, as always, I have these fields available for sorting options, so I can sort descending by that average total score, and now my number one ranked burritos are listed at the top: the Colima's burrito and the Veg Out, both with scores of five. And then on the other end of the spectrum, my lowest rated burritos, like the Campeon and the chicken asada, which got scores of 2.5 and 2.38, are listed down here at the bottom of the list. So this rank column is definitely helpful; we can actually name it rank just to make it a bit more clear. But now let's really try to bring this data to life using some conditional formatting. So I want to move the rank field up in front of my total score, I'm gonna control-shift to select this total score column, and I don't want the grand total.
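(And here's a rough pandas version of the fix described above: a hard-coded column of 1s in the raw data whose sum per burrito equals the review count, plus an approximation of "show values as rank." Again, the frame and its column names are invented purely for illustration.)

```python
import pandas as pd

# Invented review-level data (same illustrative scores as before).
reviews = pd.DataFrame({
    "Burrito":  ["619 original", "2 and 1", "2 and 1", "2 and 1"],
    "Tortilla": [4.0, 3.5, 4.0, 3.0],
    "Temp":     [4.0, 4.0, 3.5, 3.5],
    "Fillings": [3.5, 3.5, 4.0, 4.0],
    "Synergy":  [3.5, 3.5, 3.5, 3.5],
})
components = ["Tortilla", "Temp", "Fillings", "Synergy"]

# The helper column: every row gets a 1, so its SUM per burrito is the review count.
reviews["Number of Reviews"] = 1

pivot = reviews.groupby("Burrito")[components + ["Number of Reviews"]].sum()

# Corrected calculated field: wrap the old formula in parentheses and divide
# by the summed helper column.
pivot["Avg Total Score"] = (pivot[components].sum(axis=1) / 4) / pivot["Number of Reviews"]

# Rough equivalent of "show values as rank, largest to smallest".
pivot["Rank"] = pivot["Avg Total Score"].rank(ascending=False, method="min").astype(int)
print(pivot[["Number of Reviews", "Avg Total Score", "Rank"]])
```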
So go into Design and turn grand totals off; then, with a shift-click, I've selected the entire column for average total score. I'm gonna go into my Home menu, Conditional Formatting, and add a color scale here from green to red, because I want my high scores formatted green and my low scores formatted red. And as you scroll down, you can see these shades of orange and red as scores get worse, going up to yellows and greens as the scores get better. Now, to take this visualization one step further, I'm gonna select one of my component score columns and apply a highlight cell rule. In this case, all I'm trying to do is add a little bit more insight into which component scores are driving the average total score for a given burrito. So instead of doing color scales for all of these columns as well, which might get a little bit busy, I'm just going to do some highlight cell rules. In cases where the score is greater than four, let's highlight these green, since that's a really strong score, and then, with the same selection in place, let's do a less-than highlight rule for any scores that are less than three, which is a pretty bad score, with a red fill and dark red text. Everything else will just stay unformatted. So now, for that column, it's a lot easier to see unusually high or unusually low scores. And a little pro tip here: instead of going through that same process three more times, I can take this whole column and use a little tool in the Home menu called Format Painter. If I double-click that tool, I've copied this exact format, even though it's kind of a dynamic, conditional format, and I can just click the first cell in each of these columns and apply the same conditional formatting logic to each of them. Once I'm done, just press that button again, and there you have it. So now I've got a really nice visual showing me the overall ranked score as well as the individual component scores, highlighting unusually high or low values. So some interesting trends here: our top scorers tend to do well across the board, and we have some cases where a burrito like the Deborah special did really well in terms of temperature but not so well in terms of fillings, and vice versa. So there are tons of different stories that you can now start to extract just by looking at the visualizations here. Now, the final adjustment that I want to make, which you may have noticed, is that a lot of these burritos only have a single review, which isn't necessarily fair when we're comparing them against other burritos that might have three or six or 13 or even 60 reviews. So what I'd recommend doing to make this a little bit more of an apples-to-apples view would be to go ahead and apply a value filter to the burritos column and choose greater than, so that we can say let's only look at burritos where the number of reviews is greater than one. Press OK, and now we've trimmed down our view to a more apples-to-apples comparison among all burritos that have at least two ratings. And then, of course, you can apply any other formatting tools you want: you could center these values to make it a little bit more readable, you could change the column headers, whatever you choose. But what's great about pivot tables is that now that I've put in all that work to create this really cool view and template, I can just swap out row labels and aggregate or cut up our data in different ways without having to recreate any of these formatting rules.
So, for instance (and this is what you'll do in your homework), I can pull burrito out, and instead of rating by burrito, I can rate by location. The only thing that needs to change here is the rank column, since that was using burrito as its base field; now my base field is location, and there we have it. So I can go through the same process and sort my locations descending by the average total score, and there you have it. Just like we had analyzed burritos in the beginning, now we've very quickly applied the same type of analysis to locations. So really, really powerful tools here for extracting tons of insight out of this data, and that wraps up our case study kickoff. Now it's your turn to get your hands dirty.
8. Daily Weather Conditions: All right, our next case study is daily weather conditions, and the data set that we have to work with here is daily weather data from Boston, Massachusetts, from January through December 2016. So we actually have just under a full year of data, 363 rows, to work with. We've got dimensions like date and the type of conditions (clear, rain, snow), and then some measures to capture the max temperature for the day, the mean temperature, the minimum temperature, the maximum wind speed, and any precipitation in terms of inches. The concepts that we'll cover in this kickoff include sorting and filtering, pivot charts, date grouping, and calculating with count. So let's jump right in. In your pivot table case study workbook, go ahead and click on the daily weather conditions tab, and with any of these values selected, just press control A to grab the whole data range, insert a pivot on a new worksheet, call this one weather pivot, and format that tab color with the light green. There we go, we've got our starting point. Now, to kick things off, I'd like to start by looking at average mean temperatures to see how things trend day over day and over the course of months and seasons. So start by pulling in date. You'll note that it auto-groups them up to months; that's fine, I'm going to start by pulling months into the filters field so that I'm just looking by day to start. Then, in terms of values, let's grab mean temperature and drop it into values. You'll notice it defaults to the sum, which, because our data is at the daily level, actually gives us the correct answer. But we don't want to sum these temperatures as we look at the data by month or by year or at any aggregated level or view, so what we can do is change the summarization to average. As you can see, it's exactly the same as the sum at the daily level, but now we'll be taking the average when we roll this up to months or years or anything else. So with mean temperature here, to help me visualize how things are moving, I can select this entire column (actually, let's get rid of the grand totals and just select the data itself), and from here what I want to do is insert a color scale. I'll go into Home, Conditional Formatting, Color Scales, and I like this red-to-blue one because blue feels cold and red feels hot, so I'm gonna choose this color scale specifically. Already, as you scroll through, you can see how it shifts from blue to red and then back to blue as we head into the spring, summer, fall, and winter. So that's pretty cool on its own, just as a visualization tip. Obviously, I can also apply sort options here, so sorting descending by mean temp will tell me the hottest day of the year, which was August 12th, with an average temperature of 86 degrees. I could either scroll all the way to the bottom or just change my sort options to ascending to see the coldest day of the year, which was Valentine's Day 2016, a really cold one; the average temp was zero degrees, which is actually 11 degrees colder than the second coldest day, which was the 15th. And as you scroll through, it's kind of cool to see these color scales gradually shifting as the values change. So there you go, that's a nice way to explore temperatures at the daily level. But now let's bring date out and pull month into the picture, so this tells us the same exact story, just rolled up to a higher level of aggregation.
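(As a side note, the daily-versus-monthly idea is easy to sanity-check in pandas. The sketch below uses a synthetic year of made-up Boston-style temperatures rather than the actual workbook, and the column names are assumptions, so treat it purely as an illustration of why the summarization mode only starts to matter once you roll up.)

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily weather sheet (dates and temps are made up).
dates = pd.date_range("2016-01-01", "2016-12-28", freq="D")
rng = np.random.default_rng(0)
weather = pd.DataFrame({
    "Date": dates,
    "Mean Temperature": 50 - 30 * np.cos(2 * np.pi * dates.dayofyear / 365)
                           + rng.normal(0, 5, len(dates)),
})

# One row per day, so sum and average agree at the daily level;
# sorting the daily values surfaces the hottest and coldest days.
daily = weather.set_index("Date")["Mean Temperature"].sort_values()
print(daily.head(1))   # coldest day
print(daily.tail(1))   # hottest day

# Rolling up to months is where "average" (not "sum") matters.
monthly = weather.groupby(weather["Date"].dt.month)["Mean Temperature"].mean()
print(monthly.round(1))
```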
So now we can see the higher-level trend by month, where January and February are very, very cold, it starts to warm up in April, May, and June, then you have July and August, which are the hottest, and then it starts to cool down again in the fall heading into December. So this color scale does a really nice job visualizing the monthly changes in temperature, but I'd like to also add a line chart to help show this trend even more clearly. So we're going into pivot table tools, pivot chart, and let's select line here and press OK. I can get rid of some of these features: I don't need the field buttons, and I don't really need a legend here, since we're only looking at one series. For the title, we can just call it average temp by month, make that title bold, spread it out a bit, and there you go. And just in terms of formatting, this is kind of nitpicky, but in a case like this what I'd like to do is add data labels and then format them a bit. I want to show them at the top, so above the data points, and I want to change the number format a bit, make it a number with maybe one decimal point, and then we can close that out. That just helps you see the trend a little bit more. I'll also format the data series: going into Fill & Line, I like to use a smooth line here as well. Again, that's just my preference; you don't have to do any of that if you don't like to. But there you go, now we see a very clear bump up into June, July, August, and then a steady drop back down through October, November, December. From here, we could also go ahead and add a timeline, since we do have a date field, and drop a timeline in as well. Essentially, what this gives us is a tool that we can use to trim this view. So right now we're looking at all 12 months, but if we only care about the first half of the year, for instance, we can use the timeline to adjust this view, or maybe we only care about the second half of the year; there you go, and you can see both the table and the chart updating accordingly. So it's a nice little visualization tool that we can use to analyze temperatures by month. Obviously, we could do the same exact thing with max temp, min temp, wind speed, or precipitation, so I would recommend exploring this and maybe testing out similar visualizations with different fields. The next thing I want to demonstrate in this case study kickoff is that, now that I have a good sense of average temperatures by month, I'm curious about digging a little bit deeper into the max and min temps, and specifically, I want to look at the temperature spread by day. So I want to see which days saw the largest differences between the max temp and the min temp. Instead of manipulating this pivot and ruining my chart, I'm actually just gonna select that entire pivot table, copy it, and then paste it down here. That preserves the work that we've done; we just have a second instance of the pivot table to work with down here. So let's go ahead and pull months out and go back to the daily level, then drag in the minimum temperature and the maximum temperature. Again, since I'm at the daily level, it really doesn't matter what summarization mode I use, but because I know I will be rolling these up to monthly at some point, I'm gonna change these from sums to averages. And I want to create a calculated field to give me that spread that I was talking about, so I'm gonna go into pivot table tools, fields, items and sets, and let's call it temp spread.
And really, the only thing that makes sense here is to take max minus min, press OK, and this seems to be evaluating properly, at least at the daily level. So for any given day or date, I can see the spread: on February 14th, the low was negative nine and the high was 12, which is a spread of 21; on the 15th, the spread was from 4 to 35, which is a temp spread of 31. And I can use this temp spread field now as a sorting option, so let's sort descending by temp spread to see which day in our sample showed the largest difference between high and low temps. Turns out that was March 9th, which had a low temp of 40 and a high of 77, which equates to a spread of 37 degrees. So this is cool, and this is helpful. But now let's say I want to see this average temp spread by month instead of by day, to see if there's a larger spread for certain seasons or if there's some trend that might be interesting at the monthly level. So let's swap date with months, and here we see that our temp spread all of a sudden is totally crazy: we're seeing a temp spread of 433 degrees in January. By this point, you should start to get a sense of why this type of error occurs, and it's because temp spread is a calculated field. Calculated fields need to use sums for any input fields that go into the formula, and in this case, the input fields are minimum temperature and maximum temperature. So what I realistically want is a formula that says subtract the average minimum temperature from the average maximum temperature and give me that difference. In this case, really, what I want is 39.484 minus 25.516, meaning this really should be evaluating to 13.97 instead of 433. But like we talked about, we can't go into our calculated field, go into temp spread, and say give me the average max temp minus the average min temp; that just doesn't work. So what we need is a helper column, a new column in our actual data set, whose sum gives us the count that we can use in our formula. Let's go ahead into daily weather conditions, add a new column after column A, and call it number of days. Since our data is at the daily level, it's just gonna take a value of one; you'll see it defaults to a date format, so let's just change that to a number and apply it down. So now we've got a value of one in every row for this new number of days column. Back in my pivot, we can go ahead and refresh the pivot, and there's number of days. Now what we need to do is convert the formula that we had written from the sum of max temp minus the sum of min temp, which is the default (and only) option, to the average max minus the average min. To evaluate the average max temperature, I can take the sum of max temp and divide it by the number of days; that's the same thing as taking the average max temp. I'm gonna do the same thing here with minimum temp: open a parenthesis, divide by the number of days, and close it off. So now this translates to the average max temperature minus the average min temperature. Then press OK, and there you go: 13.967. I can just add that manual calculation in as a spot check again, drag it down, and it's right on point. So that new formula that we wrote using that counting column in our raw data did exactly what we wanted it to do. And from here, we can simply format this in a more user-friendly way; we can even pull out min and max if we choose.
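(If the 433-degree number still feels abstract, here's the same broken-versus-fixed logic as a tiny pandas sketch. The three days of temperatures are invented and the column names are assumptions; the point is only that a sum-based "max minus min" explodes at the monthly level, while dividing each sum by a column of 1s first gives the average spread.)

```python
import pandas as pd

# Invented daily data for one month (column names assumed).
df = pd.DataFrame({
    "Month":    ["Jan", "Jan", "Jan"],
    "Max Temp": [40.0, 35.0, 30.0],
    "Min Temp": [25.0, 28.0, 20.0],
})
df["Number of Days"] = 1   # helper column: its SUM per month equals the day count

m = df.groupby("Month")[["Max Temp", "Min Temp", "Number of Days"]].sum()

# Broken: a calculated field only sees sums, so "Max - Min" at the monthly
# level is (sum of daily maxes) - (sum of daily mins).
broken = m["Max Temp"] - m["Min Temp"]                       # 105 - 73 = 32

# Fixed: divide each sum by the number of days before subtracting.
fixed = m["Max Temp"] / m["Number of Days"] - m["Min Temp"] / m["Number of Days"]
print(broken["Jan"], fixed["Jan"])                           # 32.0 vs roughly 10.67
```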
But now, as always, we can sort based on that new field to say okay, in general, February, at least in 2016, was the month with the largest average temperature spread, at just over 18 degrees, whereas December had the smallest average temperature spread, at 13.36. So kind of interesting; not sure there's much of a trend here. It looks like the summer months like June, July, and August tend to show pretty large temperature spreads, and the colder months tend to show smaller ones, although February looks like an exception to that rule. Anyway, just another interesting angle that we can use to analyze this weather data. So there's our kickoff for the daily weather case study. Give the homework questions a shot and let me know if you have any questions.
9. Spartan Race Facebook Posts: All right, our next case study is a fun one, and it's actually looking at Spartan Race Facebook data. In this case, we've got 393 rows of data that show Spartan Race Facebook posts from August through October 2016. Now, we have some really interesting dimensions to work with here: the page name, which is Spartan Race, the date of the post, the copy of the post, a link to the actual post itself, what type of post it was (whether it was a photo, video, status, or an event), the hour it was posted, and the time of day, so afternoon, evening, morning, etcetera. And in terms of metrics, you've got your standard social media engagement metrics like shares, reactions, likes, and comments. So we'll cover a few different concepts in this case study: we'll do a quick review of show values as, with percent of column; we'll go into calculated fields; and then we'll do some demos with slightly more advanced calculations using count columns in the raw data. So let's go ahead to Excel and dive right in. All right, in the pivot table case study workbook, go ahead and find the Spartan Race Facebook posts tab, and with any of these metrics or cells selected, just press control A to grab all of that data. We're gonna insert a pivot table on a new worksheet, name it Spartan pivot, then right-click and change that tab color to a nice light green. Now, before we dive in, let's just pretend that we are the marketing analyst for Spartan Race, and we've been tasked with better understanding Spartan Race's Facebook performance and posting strategy. If that's the case, the metrics that I really care about in terms of KPIs here are these engagement metrics like shares, reactions, likes, and comments. And since we have some really interesting dimensions to work with, what I'd like to know is, how do those engagement metrics or engagement rates differ by post copy or post type, or by when Spartan Race is posting certain things? So hopefully I can use pivot tables and pivot charts to extract some interesting insights out of this data and use those to help Spartan Race optimize its Facebook strategy. With that, let's start pretty high level and simply use a count summarization to get a sense of how Spartan Race's posts tend to be distributed in different ways. If I'm looking for a count, I can pull in any of these dimensions here. Let's use post copy, for instance, and when I drag it into values, since it's not a numerical measure, it will default to a count, which simply counts the rows. So 393, that's the total number of rows in my data set, which is exactly what I want. Keep in mind that I could pull in anything here (date of post, page) and I'd get the same count metric, because we're just counting rows here, very simple. Now, what I can do with this count is pull another field into my row labels to break that count down. So let's start with post type: event, link, photo, status, or video. What this tells me is the number of times each type of post appeared in our data set, so we can sort these descending by the count and see that video posts were the most common (there were 195 video-specific posts in this sample), followed by photo at 156, and then really not many link, event, or status posts at all. And one very simple thing we can do, just to visualize this trend a little bit better, is add a pivot chart and throw something like a pie or doughnut in here as well, just to visualize these numbers that we're looking at in our table.
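(Just to show the "count of" idea in one other form, here's a quick pandas sketch of the same distribution logic; the post types and counts below are made up, not the real Spartan Race figures.)

```python
import pandas as pd

# Made-up post-level data: one row per Facebook post (column name assumed).
posts = pd.DataFrame({
    "Post Type": ["Video", "Photo", "Video", "Link", "Photo", "Video", "Status"],
})

# "Count of" any field just counts rows, so grouping by post type gives the
# number of posts in each bucket, sorted most common first.
print(posts.groupby("Post Type").size().sort_values(ascending=False))
```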
I'm just going to get rid of these field buttons, and I'm going to right-click the series; I want to make the hole a little bit smaller, which is a totally personal preference. Then I'll change the title of this to number of posts since, remember, the count is just counting rows, and rows represent individual posts. So essentially, what we're looking at here is the number of posts that fall into each bucket of our row labels. I can see that video accounts for just about half of the posts, photo makes up a large percentage, and there really aren't many of these other post types in here. Now that I have this doughnut chart in place, I can just swap in different dimensions without having to rebuild my chart. So, for instance, let's pull post type out and pull time of day in, and again sort these descending by the count. When we do, we can see that more than half of the posts were posted in the evening, followed by late night and then afternoon, and really not many posts in the morning at all. You can even drill down deeper than this: we could take time of day out and actually do hour of post and sort those descending by the count. Here you can see that 2 to 3 PM, kind of early afternoon, is a pretty common time to post. I know it's a little bit tough to read with all of these labels (I actually wouldn't recommend using doughnuts or pies when you have this many cuts of your data, but just to illustrate what I mean), and you've got very low volume times at 2, 3, 4, and 7 AM, which makes perfect sense. So now that we know the distribution of posts, we know when Spartan Race tends to post most often and what types of posts tend to be most common. Now, as an analyst, the next level of insight that I need to get to is how the engagement rates actually break down according to these different dimensions, and I really want to start with post type. So let's go ahead and pull hour of post out of here, get rid of our doughnut chart, and just pull post type back into our row labels, kind of where we started. And now, since we want to start looking at actual engagement, let's pull in some of those fields: shares, reactions, likes, and comments. You'll notice that shares defaulted to a count; that's likely because there is at least one blank row in the column, so as long as I'm aware of it, it's not a big deal. We can just change the summarization mode to a sum, and then reactions, likes, and comments all evaluated to sums, so those are all set. Now let's just go ahead and change the number format to add a thousands separator in there, really just to make it a bit more readable. You don't actually have to do this, but I like to keep everything consistent and polished as much as I possibly can. So there we go. Now, this is interesting: I've got these four component metrics, but I really want one metric to give me total engagements, and since likes are a subset of reactions, there are really only three components here that I care about: the shares, the reactions, and the comments. So let's go ahead into tools, fields, items and sets, and create a calculated field called total engagements, and the formula is as simple as taking the shares, plus the reactions, plus the comments. And there we go, we're good to go. So there's total engagements, and right off the bat, we can see that photos, our second line here, tend to generate a ton of total engagements, specifically a lot of reactions and likes, whereas videos also generate a lot of total engagements.
They drive more shares and comments in general. So some interesting trends are starting to emerge here already. But one thing to remember is that this is pure volume that we're looking at (we're looking at sums), and it's aggregated across posts, so it's not really a fair comparison without factoring in the actual number of posts that fall into each category, which we're capturing here in column B. For example, just because link posts only drove 9,000 total engagements, it doesn't necessarily mean they're lower performers than photos or videos, because there are only 31 posts that fell into that category, compared to 156 for photo and 195 for video. So a more appropriate, more apples-to-apples measure of performance or relative engagement is engagements per post. What that means is that we need a counting column that we can use in our calculated field formula. We can't take our total engagements calculation and divide it by the count of post copy (that would be exactly what we'd want to do), because unfortunately we're limited to using only the sum of a given field, so we can't divide by the count of anything. We've practiced this a few times, so it should be becoming second nature by now: we can jump back into our raw data and just insert a new column here. We'll call it number of posts, since our data is at the post level, set it equal to one, and apply it down; just make sure it went all the way to the bottom. Sweet. Back to our pivot, and refresh. Now, one thing to call out: you'll see that our total engagements column got all busted up; it's got that name error now. Sometimes this happens and sometimes it doesn't. When it does happen, it just means that it's thrown a reference error after you've refreshed the pivot, so we can go into our total engagements formula and see that it lost the reference to those fields, probably because those columns shifted as I added my new number of posts column in the raw data. Not a big deal; we can just redefine this field just like we had it before, as shares plus reactions plus comments. So just a heads up, sometimes that happens, other times it doesn't; I'm not sure why it only happens sometimes, but it's an easy enough fix to make. Now that we have our number of posts column in here, we can compare that against count of post copy and confirm that the sum does give us the same values as the count. So this sum field is the one that we can use in our calculation. We can pull count of post copy out, and now let's head back into fields, items and sets. We're going to create a new calculated field called engagements per post, and that's simply going to equal that total engagements field that we had calculated, divided by that new column we just created called number of posts, and press OK. Let's format this as a number; I don't really need any decimal points here. And now we have engagements per post, which factors in the number of posts and gives us the average amount of engagement for each of these buckets. So now, when we sort by that new column, engagements per post, we have a more accurate, more realistic measure of relative engagement. For photo posts, the average post generates 945 engagements; for video, only 734; and then we actually do see that the same kind of low-volume post types also underperform in terms of engagements per post. So that's an interesting finding, an insight that I might keep in mind, and it makes a lot of sense.
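(Here's a rough pandas version of the engagements-per-post logic, using the same helper-column trick as before. The engagement numbers are invented and the column names are assumptions; the shape of the calculation is the point.)

```python
import pandas as pd

# Invented engagement data (likes are left out of the total on purpose,
# since they're a subset of reactions).
posts = pd.DataFrame({
    "Post Type": ["Video", "Photo", "Video", "Photo", "Link"],
    "Shares":    [200, 50, 150, 80, 5],
    "Reactions": [900, 1200, 700, 1500, 40],
    "Comments":  [60, 30, 40, 50, 2],
})

# Helper column of 1s: its SUM per post type is the post count, which a
# calculated field can use (unlike a Count summarization).
posts["Number of Posts"] = 1

agg = posts.groupby("Post Type")[["Shares", "Reactions", "Comments", "Number of Posts"]].sum()

# Total Engagements = Shares + Reactions + Comments
agg["Total Engagements"] = agg["Shares"] + agg["Reactions"] + agg["Comments"]

# Engagements per Post = Total Engagements / Number of Posts
agg["Engagements per Post"] = agg["Total Engagements"] / agg["Number of Posts"]
print(agg.sort_values("Engagements per Post", ascending=False))
```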
And, you know, it also helps to validate why we're seeing so much attention given to photo and video posts, because, as we can see here, they tend to generate a considerably higher amount of engagement than other post types. So now we can continue this analysis by pulling post type out, pulling time of day in, and sorting time of day by engagements per post. Now we see that morning only had about 16 posts in the sample, but among those 16, the engagements per post was really, really strong, 909, which is higher than the average for afternoon, late night, or evening. If we want to drill even deeper, we can take post copy, drag that in as a secondary row label, and change our design report layout to outline, and now we can see which actual posts drove such strong performance; these are the 16 posts that fell into that morning bucket. If we sort these at the post level, descending by engagements per post (let me get rid of my field list just to make it a little bit more visible), now we're essentially ranking our posts by engagements per post, which at this level is exactly the same as total engagements. And we can see that this one post, a Spartan transformation post, really outperformed against the others; it captured 2,500 total engagements, which really helped to lift the overall average at the time of day level. So a really interesting insight once you drill down to the actual post level. Then you could even go back into your field list and pull time of day out, so now we're looking at all of the individual posts in our sample, sorted by engagements per post, just to see what floats up to the very top of our list. We've got this post copy that says "we're not 100% positive that he's performing a full push-up," blah blah blah. If we actually bring in the link as well, we can copy that link and paste it into a browser to see which post it was; in this case, it was a polar bear doing burpees. So there we have it. That's our kickoff to the Spartan Race Facebook data case study. There are a lot of really interesting ways to dig into this data, and we really only started to scratch the surface here. So go ahead, explore this data, play with it, try the homework exercises, and give me a shout if you have any questions.
10. Wrapping Up & Next Steps: All right. Congratulations. You made it to the end of the course. Nicely done. I hope you learned some new things. Hope you picked up some tips and tricks and tools along the way. I had a ton of fun teaching this. I hope you felt the same way about learning it and just want to cover some quick resource is and next steps to help you decide where to go from here. So Number one, if you're looking to become an absolute Excel rock star, recommend completing the full analytics stack with my other two courses. Got Advanced Excel formulas and functions. I've got data viz with Excel charts and graphs, both a really thorough, really comprehensive. If you can get through all three, then you'll be a certified Excel expert and rock star. So check those out next up. Here's some following resource is if you need some additional support questions. The support dot office dot com website is actually really helpful, especially with the basic stuff like formula syntax. You can also go there to check out Office 3 65 which I highly recommend. Basically, it's a subscription based service about 15 bucks a month. It will give you access to all of the latest office products in real time as they come out . So again, highly recommend Office 3 65 Next up Stack Overflow is a really helpful site, really great form for more advanced questions or project specific questions. And then finally, Power Query and Power bi I. It's a book by Rob Kali. You can buy it on Amazon, and for those of you who don't know Power Query and Power pivot Power Bi I thes air some really, really powerful tools and capabilities that Microsoft has been starting to build into excel , especially in recent years. And they opened up a whole new world of data modeling and data analysis capabilities that you can't do with traditional cell formulas. Pivot tables. So if you want to take your analytics and B I skills to an entirely different level, check out that book. See what it's all about. Highly recommend starting there and then last but not least, you know the drill ratings and reviews are what keeps courses like this alive. So please, please let me know what you think. I would love to hear from you. Last but not least, one final. Huge. Thank you. You guys are the best. Thank you for choosing this course. Thank you for sticking through it all the way to the end. Really? Appreciate you guys taking the time should be proud of what you've accomplished. And I hope to see you guys in the next one.