Transcripts
1. Introduction: Welcome to Easy Statistics: Data Visualization. Easy Statistics is a range of classes designed to provide you with a compact and easy-to-understand set of videos that focus on the basic principles of statistical methodology. In this class, I'm going to focus on basic and advanced data visualization. I'm going to cover a variety of techniques that graph, plot and visualize data, and that ultimately help deliver complex messages and patterns from data to audiences. Easy Statistics classes are designed to require no prior knowledge. There are no equations, and I will not explain the complex principles of how certain visualization plots are computed or generated. The aim is to give viewers as much exposure to various data visualization techniques as possible and to highlight the practical advantages and disadvantages of each method. The focus of this course is on the application and interpretation of data visualization. There are many techniques out there, and I want to teach you as many as possible without making things overly complex. So what is data visualization? Data visualization is the art of compiling numbers, a series of numbers and variables, or even an entire dataset into some sort of pictorial representation. There are many ways to structure this class, but my approach will be as follows. I'm going to explore how to visualize continuous variables. This is then followed by an examination of how to visualize discrete variables. I'm going to cover some of the most important graph and visualization options that are commonly used. These include density plots, histograms, bar graphs and pie charts. But I'm also going to look at some specialist graphs, such as violin plots and quantile plots. This class is designed to expose users to data visualization and to teach which techniques are appropriate and when. Many of my plots are created with Stata, and I have a different class that focuses on how to use Stata.
For Stata users, please use the attached code files to generate all graphs presented in my videos. However, I will not explicitly show you how to generate every single graph in Stata or other software packages. This course is for anyone who's interested in data analytics, data science or data analysis. Graphing and plotting data is a vital part of the modern workplace, so anyone looking to expand their knowledge of data visualization is more than welcome to hop in and join this class. There are no barriers to joining this class. So come on, let's learn about data visualization together.
2. Continuous Variables: Let's start exploring data visualization by examining the various graphical ways in which we can examine a single variable that is continuous. This is often the core of any initial data exploration, and it is vital for any data scientist to have a good understanding of the key variables that they're interested in. A continuous variable is one that is continuously distributed. In other words, its values can take any real number. A real number is any number between negative infinity and positive infinity, and, for obvious reasons, this isn't a countable set of values. However, you don't need an infinite amount of numbers to consider a variable or a set of numbers as continuous. Two numbers will theoretically do, as long as the possibility exists that you could add an infinite amount of extra numbers. Continuous variables are often expressed in terms of density functions, specifically probability density functions. In layman's terms, a probability density function is a mathematical and graphical way to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. In data science, we often plot these distributions on graphs to visualize where particular numbers will fall on the number line. This allows users to get an idea of where the mean, median, or mode might be and how the data varies around such values of interest, and also of the skew of the distribution and its kurtosis. The skew highlights whether the bulk of the numbers lies off-center, whilst the kurtosis tells us how fat the tails of the distribution are. All of these are important concepts for data scientists, since they tell us something about the properties of the variable and how it should be treated in any further analysis. Graphical techniques are vital to understanding distributions. Not all variables in data science have a neat mathematical derivation, and visualizing continuous data is absolutely critical to understanding it.
Before any further analysis, that is. Scientists have derived many ways to visualize continuous distributions: histograms, density plots, quantile plots. So let's go ahead and look at some of the most important ones.
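As a rough illustration of the summary values mentioned in this section, the following Python sketch computes the mean, median, skew and kurtosis for a small set of numbers. The temperature values and the function names are invented for this example and are not the course data. The formulas are the common moment-based definitions, which may differ slightly from what a particular statistics package reports.

```python
import statistics

def skewness(values):
    """Moment-based skew: the average cubed standardized deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3.0

# Hypothetical January temperatures (Fahrenheit), invented for this example.
temps = [12, 18, 22, 25, 27, 28, 30, 31, 33, 36, 41, 48, 54, 56, 67]

print("mean:", statistics.fmean(temps))
print("median:", statistics.median(temps))
# A positive skew means the longer tail sits on the right-hand side.
print("skew:", skewness(temps))
# Positive excess kurtosis means fatter tails than a normal distribution.
print("excess kurtosis:", excess_kurtosis(temps))
```

A mean that sits above the median, as it does for these made-up numbers, is a quick numeric hint of the right skew that a plot would show visually.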
3. What is a Histogram?: Let's take a look at histograms. A histogram is similar to a bar chart, but it groups numbers into ranges. These ranges are often user-defined, and the height of each bar shows how many values fall into that particular range. By stacking all the bars next to each other, a histogram provides a visual interpretation of the distribution of a series of numbers, often a variable in a dataset. Histograms are commonly used in data science to demonstrate how many numbers of a certain type occur within a specific range. They're often one of the first visual commands that statistical users execute, either in Excel, SPSS or other software. A key advantage of histograms is that they are so well known and understood that even those who do not have a statistical background won't have any problems understanding them. It's generally hard to put a foot wrong with a histogram when you have a single continuous variable that you're interested in. So let's have a look at a histogram. Here is a histogram of average city temperatures in January for various cities in the United States. What does this tell us? Well, it tells us a couple of things. On the x-axis, we see temperature measured in Fahrenheit. Higher values are on the right and lower values are on the left. On the y-axis, we see the word density. Histograms are often presented with densities on the y-axis. In layman's terms, densities represent the area covered by the bins, and in total, these areas all sum to the value one. Density is therefore a measure of relative frequency, which is frequency divided by the sum of frequencies, scaled by the bin width. From this graph, then, we can see that the average temperature in January is possibly bimodal. There is a peak at around 30 degrees, but also a smaller peak at around 55 degrees. Overall, anybody looking at this would get a very good visual impression of how the numeric values of temperatures are distributed. What are the advantages and disadvantages of a histogram?
The advantages of a histogram are as follows. One, they're easy to understand. The previous graph was incredibly intuitive. Two, histograms are useful when dealing with larger value ranges. If you have a lot of data and a large range between your values, then histograms are great at summarizing where your data lives. And three, you can modify their visual presentation by altering the bin width or the number of bins. However, this can also be a disadvantage. Disadvantages of histograms are as follows. One, it can be harder to compare histograms side by side. Two, the bin widths or the number of bins used can be very subjective and can significantly alter the visual presentation, depending on what you choose. Three, by default, histograms display densities, which can sometimes be hard to understand for people who don't understand statistical language. And finally, it can also be difficult to extract single numbers, since numbers are grouped into ranges. Next, let's look at some of the issues we need to be aware of when creating and presenting histograms. The first is that the shape and the look of the histogram is a function of the bin width. Many software packages will have a default formula for presenting you with a histogram. But there's nothing that says that the default look is good or bad. It is up to the creator to define appropriate bin sizes, and different bin sizes can radically affect how our histogram looks. Here's an example of me changing bin sizes. Notice how the graph composition changes as I decrease the bin size. Larger bin sizes lead to less detail in the histogram, and smaller bin sizes lead to more detail. But if they are too small, there's too much detail. Finding the right balance is a real art form, and there's no definitive guidance from a statistical point of view. Here are four histograms with different bin widths. Each graph has its own validity under different circumstances. Which one matches your conditions is something only you can decide.
As a general rule of thumb, you are advised to use the default sizes provided by your statistical software and change these default sizes up or down a little bit to see whether more or less detail helps your statistical narrative. There are also mathematical rules out there, but I personally recommend that you stick to your intuition. It will usually be correct. Another item you should consider is what to display on the y-axis. Many default histograms are density histograms, where the area under all the bars sums up to the value one. The density height on the y-axis can then be multiplied by the relevant bin width to obtain the percentage of observations in any particular bin. However, that can be somewhat cumbersome for people to figure out. So another option is to reformat the y-axis to either fractions, percentages, or even frequencies. This will make your histogram much more intuitive, especially if you are interested in how many observations are in each bar, as opposed to simply evaluating the shape of all the bars together. Here's an example with four different y-axes. In this particular case, the bin sizes are all the same, and therefore the graph distributions look identical. But notice that the y-axes have different values. You can decide which one you prefer. Commonly used values are frequencies and percentages, which are very intuitive for people to understand. And that concludes this basic overview of histograms. Next, we'll have a look at more complex histograms with varying bin widths.
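To make the relationship between frequencies, percentages and densities concrete, here is a minimal Python sketch that bins a handful of values by hand. The temperatures, the function name `histogram` and the choice of a bin width of 10 are assumptions made purely for illustration; statistical packages will have their own defaults.

```python
def histogram(values, bin_width, start=None):
    """Group values into equal-width bins.

    Returns one (left_edge, count, density) row per bin, where
    density = count / (number of values * bin width).
    """
    start = min(values) if start is None else start
    if start > min(values):
        raise ValueError("start must not exceed the smallest value")
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    n = len(values)
    return [(start + i * bin_width, c, c / (n * bin_width))
            for i, c in enumerate(counts)]

# Hypothetical January temperatures (Fahrenheit), invented for this example.
temps = [12, 18, 22, 25, 27, 28, 30, 31, 33, 36, 41, 48, 54, 56, 67]
rows = histogram(temps, 10, start=10)

for left, count, density in rows:
    # Density times bin width gives the share of observations in that bin.
    percent = 100 * density * 10
    print(f"{left}-{left + 10}: count={count}, "
          f"density={density:.4f}, percent={percent:.1f}")
```

The same bins can therefore be labelled with counts, percentages or densities; only the y-axis scale changes, not the shape. Re-running the sketch with a bin width of 5 or 20 shows how the level of detail shifts.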
4. What is an Unequal Bin Histogram?: In many use cases, histograms have similar bin widths. Often this is because the default behavior of a statistical program is to generate such graphs. And in reality, this makes a lot of sense, since histograms with similar bin widths are much easier to visually inspect and interpret. But there are advantages to modifying the individual bin widths of histograms. It can be a great way to show the percentile distribution of your data. For example, you could make each bar represent 10 or 20 percent of the total numeric values of a variable. Alternatively, it can be a good way to highlight custom data ranges. For example, certain temperature values could be flagged up as extreme values, such as very hot or very cold, and individual bars can represent to what extent data values fall into such specific categories. As so very often, context is everything, and you will be best placed to decide whether this is a viable data visualization option for your work. But being aware that the option exists is half the battle. So let me show you some examples of histograms with unequal bin sizes. In the previous session, we used temperature data to define custom bin widths, but all the bins had the same size. Now let's create a new histogram with custom bin widths, but where each bin is set exactly so that it contains 20 percent of the available data. In other words, the histogram will contain five bars. Here's how that would look. This is an equal probability histogram. An equal probability histogram shows the distribution of a variable constructed so that each bar represents the same fraction of the data. In this case, we specified five bars, so each bar represents 20 percent of our data. The nice thing about this histogram is that the interpretation of each bar is incredibly clear and relatable. The disadvantage is that we are not able to look into the wider bars to see how the data might be distributed in that region.
You can play with the number of bins to create histograms with different levels of granularity. Here is an example of two-bar, five-bar, ten-bar and 20-bar equal probability histograms. All four graphs have the advantage of displaying equal amounts of data in each bin, and it's up to you to choose the one that best fits your visualization needs. Another alternative, rather than specifying equal probability bins, is to specify custom bins that highlight specific ranges of your data. These bins might have a particular meaning, for example because they belong to an important range group in the data. Here's an example with road vehicle data. This histogram shows the number of accidents by age. The histogram contains unequal bin widths to highlight that accidents are often focused on particular age groups. Teenagers are most likely to have accidents. And the last bar belongs to elderly people over the age of 60. It may be that our investigation pertains to these particular people or that there is some other importance to that particular group. The unequal bin size draws attention to that group of people. You could even color the bin in a different color to draw even more attention to it. And as you can see, by color-coding the bar, specific ranges can be made even more pronounced and stand out in the histogram. Histograms are flexible and don't need to have similar bin sizes. It's recommended that you consider what kind of information you wish to display. But you should be aware that for all histograms, you can modify the bin sizes, have different bin sizes on the same graph, and modify the y-axis to represent different measurements.
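The equal probability idea can be sketched in a few lines of Python: the bin edges are simply quantiles of the data, and because every bar holds the same share, wide bars come out short and narrow bars come out tall. The function name `equal_probability_bins`, the example temperatures and the use of the "inclusive" quantile method are assumptions for illustration; other quantile conventions give slightly different edges, and heavily tied data could produce zero-width bars that this sketch does not handle.

```python
import statistics

def equal_probability_bins(values, n_bars):
    """Return (left_edge, right_edge, density) for bars of equal data share.

    Edges are the minimum, the interior quantile cut points and the maximum.
    Each bar covers roughly 1 / n_bars of the data, so its density is that
    share divided by the bar's width.
    """
    cuts = statistics.quantiles(values, n=n_bars, method="inclusive")
    edges = [min(values)] + cuts + [max(values)]
    share = 1.0 / n_bars
    return [(lo, hi, share / (hi - lo)) for lo, hi in zip(edges, edges[1:])]

# Hypothetical January temperatures (Fahrenheit), invented for this example.
temps = [12, 18, 22, 25, 27, 28, 30, 31, 33, 36, 41, 48, 54, 56, 67]

for lo, hi, density in equal_probability_bins(temps, 5):
    print(f"{lo:5.1f} to {hi:5.1f}: width={hi - lo:5.1f}, density={density:.4f}")
```

Changing `n_bars` to 2, 10 or 20 reproduces the idea of the different levels of granularity described above.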
5. What is a Density Plot?: Let's look at density plots, also known as kernel density plots or KDE. A density plot visualizes the distribution of data over a continuous interval. This plot is similar to a histogram, but uses kernel smoothing to plot values. This allows for a smoother visualization because noise is generally smoothed out. Many would say kernel density plots are graphically more appealing due to their smoothness when compared to a histogram. The peaks of a density plot display where values are concentrated most, while its troughs highlight areas with few data points or values. Just like a histogram, a density plot is a non-parametric way to visualize the probability density function of a variable. And just like with histograms, there are choices to be made that can severely affect how a density is rendered. Like histograms, density plots are used in situations where we want to display the distribution of a range of values of a data variable graphically. As a rule of thumb, in any situation where a histogram is required, that histogram can be completely replaced by a density plot without any loss of information or statistical precision. Often it is the style and visual effect that will determine whether you use one or the other. So let's have a look at kernel density plots. And here it is. This is a kernel density plot of average city temperatures in January for various cities in the United States. Instead of bars, we now see a continuous, uninterrupted line that goes up and down as we move along the x-axis. High values of density represent higher concentrations of data in that particular region of the x-axis. A key advantage of a kernel density plot compared to a standard histogram is the smoothness of the plot. This is less disruptive to the eye and allows casual readers to identify particular areas of the graph much more easily. A common trend these days is also to shade the area of the graph underneath the density plot.
And this in turn gives more contrast and a better visual effect to the relevant distribution. Take a look at this. I think compared to the previous plot, this one looks very neat and clearly highlights to anyone looking that there's a concentration of temperatures near the high 20s, but also another smaller concentration near the 50s. Very nice. Now let's look at some advantages and disadvantages of a kernel density plot, specifically compared to a histogram, since both are so similar. Advantages of a kernel density plot are as follows. Unlike histograms, a kernel density plot is a smooth curve and thus it better exhibits the general details of a distribution. It is also better at highlighting multimodality, in other words, various peaks in the distribution. It is generally easier in a kernel density plot to identify multiple peaks compared to a histogram. Also, the bias of the kernel estimator is less than that of a histogram. Finally, kernel density plots can be usefully extended to multivariate distributions or 3D plots. This is much harder to do with histograms. However, there are some disadvantages to kernel density plots. For example, density plots have a tendency to produce the appearance of data where none might exist. This often happens in the tails of the distribution. This can result in leakage, where you think you have data, but you actually don't. For example, it is common for kernel density plots to suggest small negative numbers for data distributions that actually end at 0 and are non-negative. Histograms do not suffer from this effect. Like histograms, kernel density plots take their shape from a bandwidth that needs to be carefully selected. However, kernel density plots also take their shape from the kernel that is used. This means that there are two important variables which have an effect on the shape of the curve. Histograms only have one parameter to vary, making them easier to manipulate.
Next, let's have a look at some issues we need to be aware of when creating kernel density plots. You may wonder how density plots compare to histograms and how they're generated. Here's a histogram and a density plot overlaid. Both are very similar, and a simplistic layman's explanation is often that density plots are basically just the midpoints of histogram peaks connected to each other in some sort of fashion. This intuition is in the right direction. But as you can see here, it's not actually correct. What actually happens is that a small portion of the data is looked at, and this is called the bandwidth. Into this bandwidth, a kernel is placed. A kernel is just another name for a particular distribution, and kernels can take various shapes. Common kernels include Gaussian kernels and Epanechnikov kernels. Depending on how much data is in this area, the kernel may have a wider variance or not. Then the bandwidth is moved one step to the right and a new kernel is created. Finally, at the end, all the individual kernel distributions are summed up to create one final kernel density estimate. When creating kernel density plots, two main things need to be considered before presenting the visual results. The first is the bandwidth. Just like for histograms, where the bin width plays a crucial role in determining the histogram shape, so does the bandwidth play a crucial role for kernel density estimation. Higher bandwidths lead to flatter shapes with less detail, but lower bandwidths lead to more detailed but potentially quite jagged shapes. Many software packages will have a default formula for computing an appropriate bandwidth. But right or wrong can only be determined by yourself. There's nothing that says that the default look is good or bad. Let me show you. Here's a video of me altering the bandwidth of a kernel density plot. Notice how the overall shape of the distribution changes as the bandwidth changes.
Larger bandwidth sizes lead to less detail in the density plot, and smaller bandwidth sizes lead to potentially too much detail. Just like with histograms, finding the right balance can be a real challenge and you should experiment with different bandwidths. My suggestion is to always adjust the default bandwidth up or down by a few notches to see what difference that makes. Let me show you this again in a more static fashion. Here are four density plots with different bandwidths. Each graph has its own validity in different circumstances. Which graph matches your conditions and your context is ultimately something only you can decide. Again, there are some mathematical rules out there that suggest an optimum bandwidth, but I personally recommend that you stick with your instinct. Often, it will be correct. Another item you should consider is the kernel. This is slightly less important than the bandwidth, but it can make a significant difference to the shape of your kernel density plot. The choice of kernel influences how data points further away from the center of the bandwidth are weighted. At its most extreme, a uniform or rectangular kernel weighs each data point in the bandwidth with the same level of importance. Alternatively, in a triangular kernel, the importance of data points further away diminishes rapidly from the center. Here's a graph with commonly available kernels. Notice how different kernel choices result in different distributional visuals. However, just because these are available doesn't mean that you should use all of them. In reality, only the Epanechnikov kernel and the Gaussian kernel are used frequently in data analysis. My advice is to stick to these two, unless you really know what you're doing and have a special reason to use another kernel. Here are some examples of what kind of density plots different kernels lead to. As you can see, the choice of kernel does matter.
Cosine and rectangular kernels produce very sharp and abrupt features in the distribution. But the Epanechnikov and Gaussian kernels produce significantly smoother distributions, and this is probably the reason that they are most commonly used. They are much easier on the eye and provide a better visual interpretation compared to the others. However, if you need detail, then changing the kernel is one way to get more detail without changing the bandwidth. And that concludes this basic overview of kernel density plots. Next, we'll have a look at how to deal with multiple density and histogram plots on one graph.
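For readers who want to see the two choices, bandwidth and kernel, in action, here is a small Python sketch of a kernel density estimate. It uses the common textbook formulation in which one kernel is centred on every data point and the kernels are averaged, which is a slightly different way of describing the process than the moving window used in this section. The data values and function names are invented for illustration.

```python
import math

def gaussian_kernel(u):
    """Bell-shaped kernel; gives every point some weight, however distant."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def epanechnikov_kernel(u):
    """Parabolic kernel; points further than one bandwidth away get zero weight."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def kde(x, data, bandwidth, kernel=gaussian_kernel):
    """Density estimate at x: the average of kernels centred on each data point."""
    total = sum(kernel((x - d) / bandwidth) for d in data)
    return total / (len(data) * bandwidth)

# Hypothetical January temperatures (Fahrenheit), invented for this example.
temps = [12, 18, 22, 25, 27, 28, 30, 31, 33, 36, 41, 48, 54, 56, 67]

# Same data, three bandwidths: small values give jagged curves, large flat ones.
for bandwidth in (2.0, 5.0, 15.0):
    curve = [round(kde(x, temps, bandwidth), 4) for x in range(10, 71, 10)]
    print("bandwidth", bandwidth, curve)

# Same bandwidth, different kernel.
print([round(kde(x, temps, 5.0, epanechnikov_kernel), 4) for x in range(10, 71, 10)])

# Leakage: the estimate is above zero at -1 even though no value is negative.
non_negative = [0.5, 1.0, 2.0, 3.0]
print(kde(-1.0, non_negative, 1.0))
```

The last line illustrates the leakage effect mentioned earlier: with a Gaussian kernel, the smoothed curve spills into a region where the data cannot exist.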
6. How to Visualize Multiple Densities: In many cases, we do not only want to look at the distribution of a single continuous variable. We often want to examine multiple distributions and compare these to each other. This might involve multiple histograms or multiple kernel density plots. These graphs might be subpopulations of the same data, for example, wages by men and women. Or they may be different data altogether, for example, age and educational distributions. Either way, we want to find some visual way to best compare these distributions. And you'll quickly find that simply plotting single distributions on individual graphs and then comparing these individual graphs side by side is not a convenient or even smart way to examine distributional differences. So what other ways are there to compare multiple distributions? Let's go take a look. Let's start with something that you probably shouldn't do: creating two distributions on two graphs and putting them side by side. Here's an example of two kernel density plots and two histograms plotted next to each other. One plot visualizes the wage distribution of those with higher education, and one plot visualizes the wage distribution of those with low education. At first glance, it's not immediately obvious which is which, although a longer look should reveal that the two graphs on the right belong to those with more education. However, it is difficult to judge where exactly the distributions differ. In addition, the y-axis scales are different, making comparison even more challenging. My advice is that you should generally stay away from presenting multiple distributions like this unless there's a very good reason for it. Another poor way to compare distributions is using a stacked histogram. Here's an example using the same data as earlier. It looks neat, but there's a fundamental problem with this graph.
The bar heights between both groups cannot be directly compared to each other because the bars all start at different heights. This is very counter-intuitive and requires addition and subtraction computations by the reader. For example, we can see that the total height of the light blue bar is higher than the dark blue bar at the value 15. But that is because the bars are very different here and it's easy to spot. At a wage value of 10, it's much less certain which bar is actually bigger: light blue or dark blue? This simply requires too much work for any reader, and that should never be the point of data visualization. So how, then, do we compare distributions correctly? Well, one suggestion is to create overlapping plots. The key to overlapping plots is opacity, in other words how transparent the colors are. Here's an example with two histograms, both of which are transparent. It is now much easier to identify how the two distributions relate and differ to each other. However, this approach also has some problems. It now appears that there are actually three different groups, rather than just two. Overlapping histograms with transparent colors don't necessarily always work well, because a semi-transparent bar drawn on top of another tends to look like a bar that is drawn in a different color. However, this approach does work better with kernel density plots. Here is an example of two overlapping density plots with partially transparent colors. Compared to overlapping histograms, overlapping density plots don't typically have the problem of color identification because the density lines are clearer and allow the eye to keep the distributions separate. Another interesting solution is to mirror the distribution. This can be done either along the vertical or the horizontal axis. Here's an example with a histogram. Graduate wages are presented in the top half of the chart, and non-graduate wages are presented in the bottom part.
For convenience, I removed the x-axis labeling here, but this can be added back in quite simply. The key to this kind of data visualization is a common y- and x-axis. Without a common and standardized comparison, it will be almost impossible to compare these two distributions. And here you can see what I mean by that. In this picture I repeat the previous histogram using two kernel density plots. The left-hand plot has dissimilar y-axes, whilst the right-hand plot has similar y-axes. Notice how important the scaling of the y-axis is in changing the visual image of the relevant density plots. Finally, what happens if you need to present and compare multiple distributions? For multiple distributions, histograms tend to become quite confusing, whilst density plots continue to work well, as long as the distributions are somewhat distinct. Here's an example of the previously used wage data. I'm now presenting the wage distribution across four distinct subgroups. And it's quite easy to see that non-graduate, non-union workers appear to experience lower wages. The distribution is significantly left-shifted compared to the other three distributions. However, this is where coloring may become a disadvantage again. Whilst area shading of density plots is a visual technique for identification, it can be confusing to the eye when multiple colors are overlaid on top of each other. You may want to consider going back to line density plots, or even consider black and white density plots and differentiating density plots using dashes and dots. Here's an example of a monochrome transformation of the previous density plot. This monochrome plot may look less spectacular compared to the color density plot, but the density estimates are so much easier to follow and trace, and therefore also to compare. So the lesson here is, don't be afraid to lose color when things get busy.
The overall takeaway message from all of this is that you should be careful when presenting multiple distributions. In my experience, kernel density plots often work best for presenting distributions of continuous data. Colors, especially area colors, are a great way to further enhance your visuals. But at the same time, don't be afraid to remove colors when graphs get crowded. And finally, whatever comparison you do, make sure you check your y- and x-axes. Without common axes, any comparison will be for nothing, as the eye won't be able to adjust easily for any differences in the relevant axes.
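One practical way to guarantee common axes is to compute the bins and the y-axis limit once, from both groups together, before drawing anything. The Python sketch below shows that idea. The wage figures, the group names and the function name `shared_bin_shares` are assumptions made for illustration, and the bars are expressed as shares of each group so that groups of different sizes stay comparable.

```python
def shared_bin_shares(group_a, group_b, bin_width):
    """Bin two groups on one common set of edges.

    Returns (left_edges, shares_a, shares_b), where each share is the
    fraction of that group's observations falling into the bin.
    """
    lo = min(min(group_a), min(group_b))
    hi = max(max(group_a), max(group_b))
    n_bins = int((hi - lo) // bin_width) + 1

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[int((v - lo) // bin_width)] += 1
        return [c / len(values) for c in counts]

    edges = [lo + i * bin_width for i in range(n_bins)]
    return edges, shares(group_a), shares(group_b)

# Hypothetical hourly wages, invented for this example.
graduates = [12, 15, 18, 20, 22, 25, 30, 35]
non_graduates = [5, 7, 8, 9, 10, 11, 12, 15]

edges, grad_shares, non_grad_shares = shared_bin_shares(graduates, non_graduates, 5)

# One y-axis limit for both panels, taken from the tallest bar in either group.
y_max = max(max(grad_shares), max(non_grad_shares))

for left, a, b in zip(edges, grad_shares, non_grad_shares):
    print(f"{left:>2}-{left + 5:<2} graduates={a:.3f} non-graduates={b:.3f}")
print("common y-axis upper limit:", y_max)
```

Whether the two series are then overlaid, mirrored or placed side by side, using the same `edges` and `y_max` for both keeps the comparison fair.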
7. What is a Ridgeline Plot?: Before moving on to the next topic, I want to briefly discuss the concept of ridgeline plots. Ridgeline plots, sometimes called joy plots, are a method to plot many kernel density plots on a single graph, but with no or only partial overlap. They do this by splitting off each density plot onto a separate x-axis, whilst keeping those x-axes all within the same graph. So at its core, a ridgeline plot combines elements of mirrored density plots and overlaid density plots. This gives it some advantages. The primary advantage is the ability to visualize many different kernel density distributions on a single plot. A disadvantage is that the x-axes are all slightly offset, and this requires users to navigate across the plot to compare different densities. However, often the eye doesn't need to travel very far, so this is only a minor disadvantage. Ridgeline plots do have natural limits. This will depend on the type of data that is being displayed, but generally, more than 10 or 20 density plots will still prove problematic. So let's have a look at a ridgeline plot. This graph plots political belief data for two groups, in this case US voters, over time. Each density plot is slightly offset to create the impression of a cascading series of density plots. Each x-axis represents a particular year in US election history. In this case, the plot visualizes that one group of voters has remained ideologically stable, whilst another group of voters has moved to the right, but also increased its variance of political ideology. This is a really nice presentation of lots and lots of information. The color shading on this plot adds to the visual power of the plot. However, one should note that this plot only works because the data has a clear pattern to it and it is sorted. Without sorting and without patterns, ridgeline plots can get very messy. But let's have a look at the advantages and disadvantages.
Ridgeline plot advantages are that ridgeline plots work very well when the number of kernel density plots is medium to high. You can introduce a lot of visual information into a ridgeline plot. Ridgeline plots are also great at comparing general distributional shapes across many groups. However, they are not good at identifying and comparing specific areas of data or presenting summary statistics. Disadvantages are that ridgeline plots don't work very well when there's no clear pattern to the result. Good ridgeline plots are often sorted according to the pattern before presentation. Another disadvantage is that groups can still overlap and create messy plot areas, and sometimes you need to invest significant effort to move things around to create a good visual. Finally, ridgeline plots are still density plots and require careful thought about bandwidth choices and y- and x-axis scaling to make sense. So let's have a look at some more ridgeline plots. Not all ridgeline plots need to have kernel densities. The point is to visualize data distributions, and you could also plot histograms or other plots such as dot plots. Here's an example of a dot plot. I'm going to look at dot plots in more detail in another session. In this case, the ridgeline plot is a plot of temperature data. Each different region has all of its data points plotted across the x-axis. In this case, it's possible to compare actual data points across different regions, although we'll probably need a microscope to do that properly. A density plot will probably provide a better visual narrative, in my opinion. So here's another version of the same data, but this time the plot is recast into densities. I think this plot is visually much more appealing. Readers get a much better understanding of the distributional differences in temperature data across all nine regions in the United States. However, notice how towards the bottom of this graph, there's a bunching of data.
The distributions are very similar and there's quite a lot of overlap between the kernel density plots. And adding color to this plot is unlikely to fix this problem, since colors will simply interfere with each other or confuse things even more. So this graph requires more careful sorting, and this is the point I made earlier. Often ridgeline plots require careful adjustment of the categories. So here's a ridgeline plot example that has excellent sorting, and also where color shading adds additional value. This is a really nice graph that perfectly displays temperature data across the months of the year. There is one density plot for each month. Now we can see that temperatures are higher in the summer and colder in the winter, but the variation is higher towards the winter months. Ridgeline plots are great for visualizing many densities in one graph, but they are complex to create and require specialized programs. But a well-made ridgeline plot can be worth 1000 words. Do take care when formatting ridgeline plots. Sorting and experimenting around with the categories is very important in these kinds of plots. Ultimately, ridgeline plots require a lot of creative attention to function properly.
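The two ideas in this session, sorting the groups and giving each density its own offset baseline, can be sketched in a few lines of Python. This is only an illustration using the standard library, not how the graphs in the video were made. The temperature values, the bandwidth, the offset and the function names `gaussian_kde` and `ridgeline_layers` are all assumptions chosen for the example.

```python
import math

def gaussian_kde(values, grid, bandwidth):
    """Return a Gaussian kernel density estimate of `values` at each grid point."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]

def median(values):
    """Plain median, used here only to decide the order of the ridges."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])

def ridgeline_layers(groups, grid, bandwidth, offset):
    """Sort groups by median and lift each density onto its own baseline."""
    ordered = sorted(groups.items(), key=lambda item: median(item[1]))
    layers = []
    for i, (name, values) in enumerate(ordered):
        baseline = i * offset  # each ridge sits `offset` above the previous one
        density = gaussian_kde(values, grid, bandwidth)
        layers.append((name, baseline, [baseline + d for d in density]))
    return layers

# Hypothetical temperatures (Fahrenheit) for three months, deliberately unsorted.
groups = {
    "July": [74, 76, 77, 79, 80, 81],
    "January": [18, 22, 27, 31, 35, 44],
    "April": [48, 52, 55, 57, 60, 63],
}
grid = list(range(0, 101, 5))
layers = ridgeline_layers(groups, grid, bandwidth=4.0, offset=0.1)

for name, baseline, curve in layers:
    print(name, "baseline:", baseline, "peak height:", round(max(curve) - baseline, 3))
```

Each curve in `layers` could then be handed to any plotting tool. A larger `offset` reduces overlap between ridges, while a smaller one makes the plot more compact.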
8. What is a Cumulative Density Plot?: Another way to visualize distributions is via cumulative density plots. These are sometimes known as cumulative distribution function (CDF) or empirical cumulative distribution function (ECDF) plots. Cumulative density plots are the percentage of the sorted numbers plotted over the numbers themselves. They are essentially the integral of a normal density plot. And rather than showing the fraction of data points at a given value of x, cumulative density plots show the sum of data points up to that point. Cumulative density plots aren't used as much as normal density plots. That is because users are often interested in how the distribution of a continuous variable looks and not in detailed statistics. For continuous variables, it can be a bit more difficult to visualize and understand the distributional properties of a variable from a standard CDF plot. But with a bit of experience, it's not too difficult. However, CDF plots can convey important statistical information that normal density plots cannot convey. So let's have a look at a cumulative density plot. Here it is. This is a cumulative density plot of average city temperatures in January for various cities in the United States. I previously used this data when I presented normal density plots. Let's have a closer look and see how this is different. First, the y-axis is called cumulative density. So it's still a density, but the cumulative part means that it has to range all the way to one. Why? Well, do you remember that I said that in the density plot, the area under the graph adds up to one? Well, because this is the cumulative plot, i.e. the addition of all the little areas under the graph, the final value of x must have a density of one. And this little bit of information helps us understand the rest of the graph. At the value 0, no data is present and the density starts at 0. As we move along the x-axis, more data points are covered and the cumulative density slowly increases.
At the end, all data points are covered by the density plot and we've reached a density of one. So to repeat, this plot shows the fraction of numbers from a given set of numbers that are less than or equal to the value of x. So how should you interpret the slope of this curve? Well, I did say that this is the integral of the density plot. In other words, the slope or the gradient of this curve represents the values of a density plot. Therefore, areas with high gradients represent areas where the data accumulates quickly, and areas that are shallower represent areas where there is little data. Often there's little data at the edges of data distributions, so cumulative density plots often have flat beginnings and flat ends, but steeper slopes in the middle. One of the main advantages of cumulative density plots is that statistics such as the minimum, maximum, median and particular quantiles or percentiles can be directly read and inferred from the cumulative density plot. For example, to figure out what the 20th percentile of the distribution is, simply plot a line from the 0.2 density across to the graph and then read off the x-axis. In this case, the answer is around 22 Fahrenheit. Any percentile value is easy to determine from this graph, including the median or 50th percentile. And this is a key advantage of cumulative density plots compared to normal density plots. You can only do this with a cumulative plot. In fact, let's have a look at some advantages and disadvantages of cumulative density plots, specifically compared to normal density plots. Advantages include the following. You can obtain a direct reading of key quantitative summary information, such as quartiles, medians, maximums and minimums. It can also be easy to detect outliers. They will often show up as long horizontal bars in the plot. Also, cumulative density plots are naturally suited towards comparing multiple variables.
And finally, cumulative density plots are more robust to bandwidth and bin size selection when compared to normal histograms and density plots. However, there are some disadvantages. First, they are more complex to interpret if you're only interested in where the data is distributed. Due to the cumulative nature, you need to look at the gradient, which makes it harder to figure out how the data is distributed across a particular numeric range. And secondly, they're not used regularly in standard data visualization. So they're not particularly layman-friendly and they may need additional explanation or guidance when they're being presented. Next, let's have a look at some more examples of cumulative density plots. Like normal density plots, you can use cumulative plots to compare distributions. Here's an example of temperatures in the months of January and July. We can see that, first of all, there's not a lot of overlap between these two distributions. We can see that temperatures in January appear to vary more than temperatures in July. The July plot is denser and has a steeper slope, indicating a bigger bunching of data, in this case around the 80 Fahrenheit mark. What is a real advantage here is that group medians can be easily compared. Simply draw a line from the 0.5 density across and read off the x-axis for January and July. In this case, it looks like the median January temperature is around 30 to 35, and the median July temperature is around 75. It's also possible to superimpose normal density plots onto cumulative plots. I personally don't think it works very well, mainly because of the y-axis differential, but it is possible. Generally, normal density plots don't reach high density values, whilst cumulative density plots always reach the value one. And if they are superimposed on top of each other, this often results in one big and one small plot, which can be hard to make sense of visually. So here's an example.
In this plot I overlaid two normal kernel density plots onto the two cumulative density plots. And you can see that the cumulative gradient is at its steepest where the normal density plot reaches its maximum value. However, the normal plots are very small in size and it's difficult to make out a lot of detail. But if you really need to do it, you can. You could also think about putting the normal density plots on a second y-axis and rescaling that to your liking, if you really need to have both on a graph for comparison and you need access to more detail. Finally, if you have a lot of cumulative density plots that you want to graph, consider losing the color aspect. Here's an example of the temperature data for four different regions in the United States. In this case, I removed the area color shading and I use different dashes and dots to represent different regions. This graph is much more visually intuitive and can be easily examined by most laymen. In this case, we see that the South and West regions clearly have higher temperatures in January, whilst the North Central and North Eastern regions are much colder in January. And that concludes this session on cumulative density plots. You should use them if you want to give an idea of the distributional properties of a continuous variable and, at the same time, you want to convey some basic descriptive information about median, minimum and maximum values. However, they are harder to interpret. So do think carefully when you use them.
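To make the "draw a line across and read off the x-axis" idea concrete, here is a minimal Python sketch of an empirical cumulative plot and a percentile read-off. It is an illustration only. The temperature values are made up, the function names `ecdf` and `percentile_from_ecdf` are assumptions, and the read-off rule used here (the smallest data value whose cumulative fraction reaches the target) is just one of several common percentile conventions.

```python
def ecdf(values):
    """Return the sorted values and the fraction of data <= each of them."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def percentile_from_ecdf(values, fraction):
    """Read a percentile off the curve: the smallest value whose
    cumulative fraction reaches `fraction` (e.g. 0.5 for the median)."""
    xs, fractions = ecdf(values)
    for x, f in zip(xs, fractions):
        if f >= fraction:
            return x
    return xs[-1]

# Hypothetical January temperatures (Fahrenheit), not the data from the video.
january = [12, 18, 22, 25, 28, 31, 33, 36, 45, 58]

xs, fractions = ecdf(january)
p20 = percentile_from_ecdf(january, 0.20)
median = percentile_from_ecdf(january, 0.50)

print("cumulative curve ends at:", fractions[-1])
print("20th percentile:", p20)
print("median:", median)
```

Plotting `fractions` against `xs` as a step line gives the cumulative plot itself. Steep stretches of that line are where neighbouring values of `xs` sit close together.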
9. What is a Spikeplot?: Sometimes you need to reveal the fine structure of data. In that case, you may not want to use a kernel density plot because it will smooth out small regions of data that you might be particularly interested in. If that's the case, you're better off running a histogram, but not just any normal histogram. You need a histogram with one hundred, two hundred or possibly 500 bins. And that will really reveal the detail in your data. And that is called a spike plot. It's simply a histogram with very many bins. Let's have a look at a spike plot. First is a normal histogram of wages. Here I'm using the default bin size provided by whatever software package I'm using. And it looks okay. There are approximately 30 bins here, and I get a good understanding of how the data is distributed across the range of x. But maybe I have a particular interest in the high wages, and around the 30 or 40 mark there's something going on there. And now I want to explore that further, but I can't quite see it right now. So let's go ahead and produce a spike plot with many, many bins. And here we go. This is a histogram plot with around 200 bins. Its purpose is to really show me what is happening in very minute parts of x. For example, I can now see that there are two very common high-wage values. In this case, these values are 38.8 and 40.8. Both may represent a particular unionized contract or something else that I haven't quite figured out yet. The point is that this data is now revealed and I can go ahead and examine it further. So that is a spike plot. Advantages of spike plots are that they are able to reveal fine data structures that could be important to your analysis. Disadvantages of spike plots are that they shouldn't be used when you have few observations. For example, if you're creating 200 bins, you want to have at least a few thousand observations in your relevant x-variable.
Also, spike plots have a lot of detail and need careful examination. They are not useful for generalistic distributional statements or for basic presentations of distributions. And finally, spike plots are not great at comparing multiple distributions. There's simply too much detail here. And let me show you why you should take care with spike plots and multiple distributions. The increased level of detail does not suit itself well towards plotting multiple distributions on one graph. Here's an example of me overlaying two different spike plots. As you can see, even with very translucent colors, it is a very hard task to figure out what exactly is going on here. It's not something I would recommend, and spike plots should not be produced for anything other than a single variable. Spike plots are histograms with many bins. Depending on the software you're using, you can either use a normal histogram command and simply specify many bins to produce a spike plot, or, if you are limited by your software to a certain number of bins, you may need to use a specific spike plot command. Either way, spike plots are very useful for revealing lots of fine and granular data in a distribution. However, they are not easy for laymen to look at, and they suffer significantly when you want to compare two or more distributions. And that concludes this quick session on spike plots.
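The difference between a normal histogram and a spike plot is only the number of bins, which a small Python sketch can show. This is an illustration under stated assumptions: the wage data is synthetic (an evenly spread range plus 40 repeated values at 38.8 to mimic a common contract wage), and the function name `bin_counts` and the bin numbers are chosen for the example.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of `n_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Synthetic wages: 1000 evenly spread values from 5 to 40,
# plus 40 workers who all earn exactly 38.8 (a hypothetical contract wage).
wages = [5 + 35 * i / 999 for i in range(1000)] + [38.8] * 40

coarse = bin_counts(wages, 10)    # ordinary histogram
fine = bin_counts(wages, 200)     # spike plot

print("10 bins, tallest vs typical bin:", max(coarse), sorted(coarse)[5])
print("200 bins, tallest vs typical bin:", max(fine), sorted(fine)[100])
```

With 10 bins the repeated value only makes one bar a little taller than its neighbours. With 200 bins the same value stands out as a clear spike, which is the effect described in this session.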
10. What is a Rootogram?: Let's take a look at rootograms. A rootogram is a graphical way of examining the goodness of fit of a particular data distribution with various theoretical distributions. They're closely related to frequency histograms and also to spike plots. Comparing the distribution of data with a theoretical distribution using an ordinary histogram can be somewhat difficult because often small frequencies are dominated by larger frequencies. And this makes it harder to examine differences between histograms and any theoretical distribution. Rootograms make this easier by rescaling the y-axis to the square root of frequencies. This draws attention to discrepancies in details of the data distribution in areas where there is often little data. In addition, rootograms are often presented as hanging rootograms. In this visualization technique, a distributional comparison is made easier by hanging the observed results from the theoretical curve. So any discrepancies are seen by comparison with the horizontal axis at 0 rather than with a sloping curve. And this is much easier on the eye. Rootograms are primarily used to visually compare actual data distributions with theoretical data distributions. So let's have a look at a basic rootogram. And here is a basic rootogram. In this example, I'm using labor market data on log hourly wages. Wages is often an empirical variable that is log-normally distributed. Therefore, taking the base variable wages and transforming it by taking logs should lead to a normal bell-shaped distribution. The rootogram here presents a plot of the square root transformation of frequency counts. If this variable is normally distributed, we should see a basic bell-shaped pattern emerge from our data. And for convenience, many rootograms also superimpose any theoretical distributions that are tested, in this particular case, the normal distribution.
Comparing the distance between the bars and the theoretical distribution allows observers to judge to what extent this particular data variable follows a normal distribution. I think in this particular case it looks okay, but there do appear to be some high-level outliers on the right-hand side of this data distribution. Overall, the aim of this visualization technique is to provide a very clear and basic picture of how a variable is distributed relative to a theoretical distribution. Rootograms often come in hanging form. A hanging rootogram compares a theoretical distribution with an empirical one by hanging the histogram from the theoretical distribution, instead of standing the histogram on the x-axis. And in this way, deviations are shown as deviations from a horizontal line instead of deviations from a curve. And this makes it much easier to spot any patterns in data deviations. Here's an example of a hanging rootogram using the previous data. As you can see, the individual bars are hanging from the theoretical distribution. Any data distribution that completely fills the area between the theoretical distribution and the dashed red line is a data series that perfectly matches that particular distribution. In this case, log hourly wages matches the normal distribution relatively well. But there might be some concern over outliers on the right-hand side. And there also appears to be a little bit too much data in the middle. Rootograms and hanging rootograms can be used with many different distributions, and we are not limited to testing the normal distribution. For example, we could test whether the original raw hourly wage variable is log-normally distributed, rather than whether its transformation is normally distributed. And here's an example of that. As you can see, hourly wages are fairly well log-normally distributed, although there are some outliers here on the right-hand side at high values, and also at values between five and 10.
Here's another example with the variable work experience. In this particular visualization, I'm testing whether the variable work experience has a chi-squared distribution. It seems relatively obvious that this variable is not chi-square distributed. The hanging spikes do not sit closely on the 0 line. It's also possible to introduce confidence intervals to actually test the distribution shapes that are visually presented. However, rootograms or hanging rootograms with many bins do not suit themselves very well to additional overlays. So it does make sense to reduce the number of bars that are presented. And here's an example of a hanging rootogram with fewer bars, but with confidence intervals superimposed. The test here is that all the confidence intervals should overlap with the red dashed line. If any do not, then that is evidence that the data does not conform to the relevant theoretical distribution. In this case, most of the bars are either above or below the 0 line. And this clearly suggests that this variable is not chi-square distributed. In sum, rootograms are a useful way to visually present and test a data distribution compared to a theoretical distribution. I recommend the use of hanging rootograms as they are easier to examine, although they may need some initial explaining when being presented to laymen. Rootograms are advanced graphs and their use should be carefully considered. They are also often used in regression analysis, especially for count and categorical regression models.
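The arithmetic behind a hanging rootogram is small enough to sketch in Python. For each bin, the bar hangs from the square root of the expected count and drops by the square root of the observed count, so the bottom of the bar lands near zero when the fit is good. This is an illustration only: the log wage values and bin edges are made up, the function name `hanging_rootogram` is an assumption, and the normal distribution is fitted simply from the sample mean and standard deviation.

```python
import math
from statistics import NormalDist, mean, stdev

def hanging_rootogram(values, edges):
    """For each bin return (left, right, observed, expected, top, bottom).

    top    = sqrt(expected count under a fitted normal distribution)
    bottom = top - sqrt(observed count); close to 0 means a good fit.
    """
    fitted = NormalDist(mean(values), stdev(values))
    n = len(values)
    rows = []
    for left, right in zip(edges, edges[1:]):
        observed = sum(1 for v in values if left <= v < right)
        expected = n * (fitted.cdf(right) - fitted.cdf(left))
        top = math.sqrt(expected)
        bottom = top - math.sqrt(observed)
        rows.append((left, right, observed, expected, top, bottom))
    return rows

# Hypothetical log hourly wages, not the labor market data from the video.
log_wages = [1.2, 1.6, 1.7, 1.9, 2.0, 2.1, 2.1, 2.2, 2.2, 2.3,
             2.3, 2.4, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.1, 3.4]
edges = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

rows = hanging_rootogram(log_wages, edges)
for left, right, observed, expected, top, bottom in rows:
    print(f"[{left}, {right}): observed={observed}, "
          f"expected={expected:.2f}, bar bottom={bottom:+.2f}")
```

A bar bottom below zero means the bin holds more data than the theoretical distribution expects, and a bottom above zero means it holds less. Swapping `NormalDist` for another distribution's cumulative function would test a different theoretical shape.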
11. What is a Box plot?: Now let's take a look at boxplots. Boxplots, or so-called box-and-whisker plots, are a very, very common data visualization technique. Boxplots are at their core all about quartiles. They graphically display the five-number summary of a set of data values. These are the minimum, the first quartile or 25th percentile, the median or 50th percentile, the third quartile or 75th percentile, and the maximum. Boxplots also contain whiskers that extend from the box. Whiskers can represent maximum values or specific data ranges. These will vary depending on what kind of command and software package is used to generate the boxplot. But the overall point is to highlight minimum or maximum value ranges. Boxplots are, by design, distributional plots. However, unlike histograms and density plots, they massively reduce the available information and concentrate on only a few select characteristics of any data distribution. And this marks their core advantage and also their core disadvantage. Their simplicity makes them very easy to understand and a great choice for comparing many different distributions. At the same time, they provide only general distributional information. If specific detail is needed or a specific range needs to be looked at closely, then boxplots will not be very useful to you. So let's take a look at a basic boxplot. Here is a boxplot of temperature ranges in January. The lowest data point, excluding outliers, can be found here at the bottom of the whisker. Then the plot moves on to the start of the box with the 25th percentile, also called the first quartile. Next is the median measurement or the 50th percentile. And the box ends with the 75th percentile or the third quartile. Finally, the plot finishes with a top whisker that marks the largest data point, excluding any outliers. You may have noticed that I used the words "excluding outliers" here. In some boxplots there can be data beyond the whiskers.
And these would be represented as individual data points. Most boxplots determine outliers as any data points that lie more than 1.5 times the interquartile range away from the upper or lower quartile. And the interquartile range itself is simply the length of the box. So here's an example graph with outliers. In this case, there appear to be outliers at the maximum end, beyond the top whisker. So to repeat, any data point that is more than 1.5 times the length of the box away from the top of the box is going to be determined to be an outlier and will be represented as an individual data point. Note that this is the default, and some whiskers can represent other properties, such as real maximum and minimum values or the second and 98th percentile values, for example. It will depend on whoever created the plot. Next, let's have a look at some advantages and disadvantages of the boxplot before we explore its use any further. A key advantage of a boxplot is its comparability across multiple variables or numeric ranges. Boxplots excel at allowing you to compare many different data distributions in one single graph. And as you know from our previous sessions, histograms and kernel density plots have problems with that and are limited in how many distributions they can realistically portray on a single graph. Another advantage is that you can get summary information on key statistics. Histograms and density plots do not make it easy to identify quartile values, for example. Boxplots make it very easy and are great at getting basic statistical information across from the distribution. Boxplots also show outliers. Now, this can be subjective, but boxplots attempt to identify to users where and how many outliers exist in the given data range. And this can be helpful in certain circumstances. It's not something that histograms or density plots show by default. Another advantage is that boxplots complement histograms and density plots really well.
In fact, boxplots work perfectly in sync with other distributional graphs. And you should consider combining boxplots with other graphs for extra visual power. However, boxplots also have some disadvantages. The main disadvantage is the lack of detail on the distribution. Boxplots only show quartiles and sometimes you need more detail than that. Perhaps you want to identify peaks in your data or whether it's bimodal. A boxplot will not allow you to do that. Another disadvantage is that, by default, boxplots do not display means. And laymen often confuse median values with mean values. And that can be dangerous. So you need to make it clear that medians are not means and explain the boxplot well. In some boxplots, you are able to plot the mean, and often this is denoted with an x. But then you still have the job of outlining to anybody who's reading the graph what is the median and what is the mean. A final disadvantage is that outlier calculation is not standardized. Whilst most software will use 1.5 times the interquartile range from the top and the bottom quartiles, this isn't necessarily what everyone uses. There is no absolute standardization on how outliers should be defined and therefore their meaning can change plot by plot and user by user. Next, let's have a look at some practical applications of boxplots. Let's explore the primary advantage of boxplots, their ability to allow for easy comparison of data distributions. Here is a not so good example. In this boxplot, I'm trying to get an idea of how the temperature values in January compare to the temperature values in July. Notice that I flipped the boxplots into the horizontal plane here compared to the earlier vertical plane. Boxplots can be in both horizontal or vertical formats. From this data visualization, we can see that the interquartile range of January values is well to the left of the interquartile range of July temperature values. So that is good, I guess.
But what is not so great here is that the whiskers overlap and this confuses the picture. Well, we could try playing around with opacity and different colors. But most boxplots differentiate different distributions by placing the plots above and below each other. So here is a better example of a boxplot. In this example, it is much easier to differentiate between the two different temperature values. We see there's some overlap between the two distributions, but not very much. Most of the July temperatures are located near the maximum of the January temperatures. Excellent. But this graph still doesn't visualize the full potential of a boxplot. Boxplots really stand on their own two feet when we use them to compare many different groups. So here's a boxplot with many groups. In this plot I'm presenting four groups over two different continuous variables. With a histogram or kernel density plot, this would be almost impossible. And from this graph, we can easily determine that January temperatures in the North East and North Central are lower than January temperatures in the South and West. July temperatures appear pretty similar, except possibly Western temperatures, which might be a bit lower, and maybe also Southern temperatures, which might be a bit higher. And we can also see that the West region has the highest temperature spread across both seasons here. So that's a lot of information, and it's really well visualized and easy to understand. And this is the true power of a boxplot. Now let me show you a boxplot with even more categories. And there you go. In this particular visualization, I'm using nine groups and two continuous variables. So I'm comparing 18 different data distributions with each other on a single graph. Yet it remains easy for me to examine these different data distributions and determine where higher or lower values are located, and how the spread of these different data distributions differs by various regional groupings. Fantastic.
So this is a really powerful graph that carries a lot of data information. Now let's have a quick look at some disadvantages of boxplots. Whilst boxplots may show summary statistics well and give you a rough idea of any data distribution, clusters of data and multimodality remain solidly hidden. Here are two examples where I vary the distribution of a particular data range. In each case the boxplot will not be able to reveal exactly what is happening as I change the distribution. In the first case I'm going to converge a random number sequence towards an integer random number sequence. In other words, the data points will slowly converge to integer values. Look at what happens to the boxplot relative to the histogram of the same data. Both plots appear to change. But remember, these are random numbers that I'm drawing, so there's bound to be a bit of random noise here. The histogram, however, changes significantly in how it is presented. The boxplot also changes, but because of the random noise, it doesn't converge towards anything. It simply cannot identify the significant changes that are happening to the data, compared to the histogram. In the second video, I'm going to split a single distribution into two distributions that start to diverge. In other words, I'm creating a multimodal distribution. Now take a look at what happens. In this video, we see that the original histogram distribution splits into two, and the two parts then move away from each other. However, the boxplot only gets longer. The boxplot identifies that the overall data distribution becomes more spread out, but it cannot identify the significant changes that happen in modality inside the data distribution. And the histogram can identify that. So there we are. I hope these two videos make it clear where the disadvantages of a boxplot lie. And finally, before I conclude, let me just pick up on that last point and turn it around into an advantage.
Whilst the disadvantage of a boxplot seems clear compared to a histogram, together they actually overcome each other's disadvantages. Histograms are terrible at highlighting basic statistics, whilst boxplots are bad at highlighting distributional detail. So consider presenting both of them in one single graph. And here's an example of that. In this example I've combined a horizontal boxplot and a histogram. The histogram shows the entire data distribution of temperature values in January, whilst the boxplot underneath highlights where the interquartile range is and where the median is, and in this particular case, where the mean is. In this case, the whiskers represent the actual maximum and minimum values and there are no outlying data points. The nice thing about this graph is that it combines two visual techniques and removes the disadvantage from both of them at the same time. This kind of visualization technique may be too complex for a layman, so consider your audience carefully. And that concludes this session on boxplots. Boxplots are used frequently to highlight distributional properties. They absolutely excel if you want to graph and compare many distributions against each other. But they also suffer from some disadvantages. Consider combining boxplots with other plots, such as histograms or kernel density plots, for an even more powerful visual effect.
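The numbers that a boxplot draws can be computed directly, which may help when explaining the box, the whiskers and the outlier rule. The following Python sketch uses the common default described in this session, where points more than 1.5 times the interquartile range beyond the quartiles count as outliers. The temperature values and the function name `boxplot_stats` are assumptions for the example, and the quartiles use the "inclusive" interpolation method of Python's `statistics.quantiles`, so other software may give slightly different quartile values.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Return the quartiles, whisker ends and outliers a default boxplot shows."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                      # the length of the box
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),    # lowest point that is not an outlier
        "whisker_high": max(inside),   # highest point that is not an outlier
        "outliers": outliers,
    }

# Hypothetical temperatures with one unusually warm city.
temperatures = [10, 12, 14, 15, 16, 18, 19, 21, 22, 24, 60]

stats = boxplot_stats(temperatures)
for key, value in stats.items():
    print(key, "=", value)
```

Changing the factor 1.5, or replacing the fences with fixed percentiles, changes which points are drawn as outliers. That is exactly why the session warns that outlier definitions are not standardized.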
12. What is a Violin plot?: At the end of the last session, I introduced the idea that boxplots can be combined with histograms to create a powerful visual technique that provides a lot of statistical detail on a continuous distribution. Modern data science has taken this a step further and created a concept called violin plots. A violin plot is similar to a boxplot, but has the addition of a rotated kernel density plot on each side. In other words, a violin plot is a mirror density plot in which there is a boxplot. And the combination of these graphical elements often creates graphs that look like violins, hence the name violin plot. In general, a violin plot is more informative than a plain boxplot. A boxplot only shows summary statistics such as mean, median, and interquartile ranges, whilst a violin plot shows the full distribution of the data in addition to the aforementioned summary statistics. However, violin plots are more complex and they are not frequently used, so they might be hard to explain to a layman. So let's have a look at a violin plot. Here is a violin plot for January temperatures in US cities. Notice how the graph looks a little like a violin. The boxplot in the middle provides us with the quartile values of this data distribution and the median, which is the white dot, and the whiskers represent minimum and maximum values. Outside the boxplot are two kernel density plots. Realistically, one should be enough. But the advantage of having a mirror density plot is that it helps the eye pick out important features of the data. Peaks and valleys of any distribution become twice as prominent due to the nature of this plot. In this case, we can see that there are a lot of temperature values clustered around the mid 20s and another smaller cluster near the 50s. Overall, this is a great visual technique to present the maximum amount of information on any particular distribution.
Like boxplots, violin plots are a practical way to compare many different data distributions in one graph. Here's an example with 18 different groups. I previously presented these as boxplots, but I've now recast this data into violin plots. This graph shows two different temperature distributions for a variety of regions in the United States. Whilst the individual shapes may all look a bit funny at first glance, this graph is actually a really powerful way to present a lot of complex information in one go. For example, we can see that the January temperatures in the Pacific are quite different from the January temperatures in the South Atlantic. One distribution seems unimodal, whilst the other distribution appears multimodal. So this is a really powerful visualization technique. Violin plots combine a huge amount of visual power into a single graph. They are not often used because they can be complex to understand. But if you need to show a lot of distributional information all in a single graph, they are almost unbeatable. So consider using violin plots in complex and advanced visualization applications.
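A violin plot is built from two ingredients: a kernel density that is mirrored around a center line, and the quartiles of a boxplot drawn inside it. The Python sketch below computes both. It is an illustration only. The temperature values are made up to contain two clusters, and the bandwidth, the grid and the function names `gaussian_kde` and `violin_outline` are assumptions chosen for the example.

```python
import math
from statistics import quantiles

def gaussian_kde(values, grid, bandwidth):
    """Return a Gaussian kernel density estimate of `values` at each grid point."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]

def violin_outline(values, grid, bandwidth):
    """Mirror the density around zero and attach the boxplot quartiles."""
    density = gaussian_kde(values, grid, bandwidth)
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "grid": grid,
        "left": [-d for d in density],   # mirrored half of the violin
        "right": density,                # ordinary half of the violin
        "q1": q1,
        "median": median,
        "q3": q3,
    }

# Hypothetical January temperatures with a large cluster in the 20s
# and a smaller cluster near 50.
temps = [20, 22, 24, 25, 26, 27, 28, 30, 48, 50, 51, 53]
grid = list(range(10, 61))

violin = violin_outline(temps, grid, bandwidth=3.0)
peak_20s = violin["right"][grid.index(25)]
valley = violin["right"][grid.index(38)]
peak_50s = violin["right"][grid.index(50)]
print("density at 25, 38, 50:", round(peak_20s, 4), round(valley, 4), round(peak_50s, 4))
print("median:", violin["median"])
```

Drawing `left` and `right` against `grid` on a rotated axis gives the violin shape. The dip between the two clusters appears on both sides, which is what makes peaks and valleys so prominent in this kind of plot.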
13. What is a Stem-and-Leaf plot?: Another way to visualize the distribution of data is via a stem-and-leaf plot. Technically, stem-and-leaf plots are not graphs, but simply long tables. However, stem-and-leaf tables have visual elements to them that are similar to kernel density plots. The advantage of stem-and-leaf plots is that they visualize and present every single data value of the specific data range. Stem-and-leaf plots are a compact way to present considerable information about a batch of data. And that can be handy if you're looking at data values in your distribution. At the same time, variables or data ranges that have thousands of values might be very difficult to examine with stem-and-leaf plots. These kinds of variables will simply have too much information. And there is a hard limit to how many numbers you can show in a stem-and-leaf plot and still get useful information out of it. In general, stem-and-leaf plots are of very little use, but they're still worth learning about. So let's take a look at a basic stem-and-leaf plot. Here is a stem-and-leaf plot of car prices. The stem-and-leaf plot provides a way to list the underlying data. The expression to the left of the vertical bar is called the stem, and the individual digits to the right are called the leaves. All the stems that begin with the same digit and the corresponding leaves, written beside each other, reconstruct an observation of the data. So if we look at the two stems that begin with the digit seven and their corresponding leaves, we see that we have one car in this dataset priced at $7,140 and another car priced at $7,840. Overall, we see that a lot of prices are bunched into the $4,000 to $5,000 range. There are a few high prices too. And we finish the data series at $15,906. The stem-and-leaf plot is very similar to a density plot or a histogram, except that the raw data values are tabulated in a table. And we can thus identify individual data points, which is not possible in a density or histogram plot.
And this is probably the one and only advantage of a stem-and-leaf plot. Unlike histograms or density plots, stem-and-leaf plots take their shape from how many stems there are and how many leaves there are. These parameters can be adjusted to create different stem-and-leaf plots of the same data. For example, here is a stem-and-leaf plot where each stem increments in units of one thousand. As you can see, this stem-and-leaf plot is much more squashed, although it still reveals all the data points of that particular data. Finally, here's a more extreme version of the same data. In this case, each stem measures tens of thousands and all the data is populated on only two stems. How useful this is, I don't know. But as you can see, stem-and-leaf plots can take different forms depending on how the stems and leaves are constructed. Stem-and-leaf plots are not commonly seen in practice. They are not particularly useful for large datasets and cannot compare different distributions. Their use in the real world is generally limited. But if for some reason you need to see all the raw data values stacked next to each other, then this plot could potentially be useful. But remember, the shape of the plot will depend on the parameters that you set. In general, I recommend going with the default parameters.
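To make the idea of stems and leaves a little more concrete, here is a minimal Python sketch. It is not how the plots in this class were produced; the function names are hypothetical, and most of the prices are made up for illustration (only 7,140, 7,840 and 15,906 are taken from the example above). It simply shows how the choice of leaf unit changes the number of stems.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: [leaf digits]}; digits below leaf_unit are dropped."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)      # truncate to the leaf unit
        stem, leaf = divmod(scaled, 10)   # last remaining digit is the leaf
        rows[stem].append(leaf)
    return dict(rows)

def render(rows):
    """Format the table as text, including empty stems in between."""
    lines = []
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return "\n".join(lines)

# Illustrative car prices (mostly invented for this sketch).
prices = [4099, 4749, 5189, 5799, 7140, 7840, 15906]

fine = stem_and_leaf(prices, leaf_unit=100)     # stems count thousands
coarse = stem_and_leaf(prices, leaf_unit=1000)  # stems count tens of thousands

print(render(fine))
print(render(coarse))
```

With the coarser leaf unit, all seven prices end up on only two stems, which mirrors the squashed version described above. The trade-off is that truncating to a coarser unit throws away the lower digits of each price.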
14. What is a Dot plot?: Another way to visualize the distribution of a continuous variable is via a dot plot, sometimes called a dot chart or strip plot. A dot plot is a statistical plot consisting of data points plotted on a simple scale. Dot plots produce a visualization that has elements of a box plot, a histogram, and a scatter plot. However, they are most closely related to stem-and-leaf plots. Like box plots, dot plots are most useful for comparing distributions of several variables or the distribution of one variable across several groups. And like histograms, a dot plot provides a basic estimate of the density of the data. Each dot in a dot plot represents one observation. Dot plots are incredibly easy to use and very intuitive to understand. They can be a really good way to display and visualize small and medium-sized data series to other people. So what does a dot plot look like? Here's an example. Here, I'm using the temperature data that I've used previously. I'm plotting all 954 temperature values in this data series on a dot chart. Each individual dot represents a data point. I can identify that temperatures in January have a peak at values of around 30. But there's also a second peak at values of around 50. So overall I get a good understanding of the distribution of this particular data. However, unlike a histogram or a kernel density plot, each data point is visualized and presented to me. And that is the primary advantage of a dot plot: seeing the actual raw data. The advantages of a dot plot are as follows. They are easy to understand. Each data point is visualized and users get a good idea of the distribution of the data. Dot plots are very well suited to small and medium-sized datasets. They're useful for highlighting clusters and gaps as well as outliers. And they're useful for presenting data across different groups. The other advantage is the conservation of numerical information. However, dot plots have disadvantages.
For example, they're not good for large samples of data because each data value takes up a single dot. Large data samples can quickly overwhelm a plot, making it very difficult to identify exact values. Moreover, dot sizes affect the appearance of the plot. This is similar to choosing a bandwidth for a kernel density estimate. So some care must be taken when choosing an appropriate dot size or dot width. Next, let's focus on dot sizes or dot widths. Do make sure you take care when choosing dot sizes, as they will determine what your dot plot looks like. Dots are bundled into groups and then stacked on top of each other. So that means that dot positions do not actually represent the exact data values. Changing the dot size will lead to a different pictorial representation. Here's an example of me producing several graphs of the same data with varying dot sizes. In these cases, I decided to stack the dot chart horizontally and not vertically like before. But the important thing here is that you can see that choosing a different dot size results in a different type of graph. This is exactly the same as with a histogram or kernel density plot. I always recommend that you play around with various sizes to examine different graph structures, just like you would do for histograms and density plots. A major advantage of dot plots is their ability to present information across many groups. Just like box plots, dot plots excel at visually comparing different distributions across various groups. Here's an example with the temperature data across four different regions. This is a really neat plot because it not only highlights the different distributional properties of each group, but every single data point can be identified. For example, we can see a cluster of data points right up here in the West region. This cluster of outliers may be important, or it may not be important.
Either way, a dot plot visualizes the data and we can now identify it and determine whether we want to investigate this further. A histogram, box plot, or kernel density plot would not be able to do that. Moreover, dot plots can be combined with other plots, such as box plots, to create an even more powerful visual effect. Here's an example where I generate multiple dot plots across nine different groups and then plot a box plot next to each. Look at that. The dot plots visualize the actual data points, while the box plots visualize the interquartile ranges, the medians, and also the outliers. As you can see, this is a very powerful visualization technique, with a lot of information available to people looking at this graph. At the same time, the complexity of this graph is reduced by the dot plots, which make it easy for any layman to figure out what exactly is going on in this graph. Dot plots or strip plots are relatively simple statistical charts that might be easily dismissed. However, I think they are brilliant at visualizing small to medium data distributions and excel at comparing distributions across groups. They become even more powerful when combined with other graphical techniques, such as box plots. Don't underestimate the power of a good dot plot. But keep in mind that the shape of a dot plot is to a large extent a function of the dot size or width.
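The effect of the dot width can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names are hypothetical, the temperature values are invented, and the binning rule (round each value down to a multiple of the dot width) is one plausible choice rather than the rule any particular software uses.

```python
from collections import Counter

def dot_stacks(values, dot_width):
    """Bundle values into bins of dot_width and count the dots stacked in each."""
    return Counter(int(v // dot_width) * dot_width for v in values)

def render(stacks):
    """Draw one row per bin, with one 'o' per observation."""
    return "\n".join(f"{pos:>4} | {'o' * stacks[pos]}" for pos in sorted(stacks))

# Invented January temperatures, purely for illustration.
temps = [28, 29, 31, 31, 33, 34, 49, 51, 52, 71]

narrow = dot_stacks(temps, 2)   # small dots: many short stacks
wide = dot_stacks(temps, 10)    # large dots: few tall stacks

print(render(narrow))
print()
print(render(wide))
```

Both versions show every observation as one dot, but the wide version produces a much taller stack around 30. That is the same data drawn with a different dot width, which is why it is worth trying several widths before settling on one.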
15. What is a Symmetry plot?: In one of the previous videos, I introduced the concept of comparing actual data distributions to theoretical data distributions. This is a common requirement in modern data analytics. Since we know the properties of theoretical distributions, it is useful to test whether data distributions are similar to them, because ultimately this allows us to infer something about the creation process behind any given data. For example, many regression analyses require that the error terms are normally distributed. So it would be obvious to test the distributional properties of any residuals from our regression analysis. Such tests can be performed using analytical tests, but they can also be done using visualization techniques. One technique I already highlighted is the hanging rootogram, but rootograms are not that commonly used. Quantile plots are much more common in applied data science when checking whether continuous distributions meet certain theoretical criteria. So in the next few videos, I want to take a look at quantile and symmetry plots. The key concept behind all of these plots is that they transform the data and plot a transformation of the data on a graph against a reference line. If the transformed data points are on or close to the reference line, then that data will have particular distributional properties. First, let's take a look at a symmetry plot. Symmetry plots do exactly what they say they do. They check whether a particular data distribution is symmetric or not. A symmetrical distribution is basically one that can be folded in half so that all the data points match each other. The fold point in this case is the median. And that's why these kinds of plots transform the data into distances above and below the median. Let me show you an example. I'm going to play you a video of a changing distribution. The distribution is visualized with a histogram. And at first everything is symmetric.
The symmetry plot above the histogram shows to what extent that distribution is symmetric. The first data point on the symmetry plot is a plot of the first value of the data series compared to the last value of the data series. The second is the second value of the data series compared to the second-to-last value, and so forth. However, a symmetry plot doesn't plot the actual data values against each other. It plots how far these data values are away from the median, i.e. the centre of the distribution. If the points lie along the reference line, defined as y equals x, then the data is symmetrically distributed, since the first half of the data matches the second half of the data. Points above the reference line indicate a skew to the right and points below the reference line indicate a skew to the left. So let's go and see what happens if we change the distribution. Here, we can see that as the data distribution slowly becomes skewed left or right, the data points move above and below the symmetry plot reference line. Notice that in this case, it is mainly the tails of the distribution that change. So the symmetry is mainly broken in values of x that are further away from the median. Looking at the graph, the symmetry plot stays relatively intact up until around 15 percent away from the median. And that is the symmetry plot. It tells you how a distribution is skewed. Now let's have a look at the advantages and disadvantages of such plots. A significant advantage of a symmetry plot compared to a normal histogram is that it is better able to highlight where distributions are non-symmetric. It is much easier to examine deviations away from a reference line than it is to split a histogram in half, mentally fold it, and check whether the two halves match each other. Also, symmetry plots can check for symmetry in any distribution and not just a normal distribution. They are a great way to highlight whether distributions are skewed or not.
A disadvantage of symmetry plots is that they are more complex to understand. The interpretation of distance below and above the median is not always user-friendly. To overcome this, I recommend adding text to the symmetry plot that verbalizes the relevant picture. So here's another example of a symmetry plot. In this example I'm testing to see whether hourly wages are symmetric or not. Anybody who does applied data analysis will know that wages always have a long tail to the right. The symmetry plot reveals that the wage distribution appears relatively symmetric around its median. This means there's a curve or a hump in the data that looks approximately symmetric. But after moving around 15 percent away from the median to the right or the left, the distribution diverges away from symmetry. It quickly becomes right-skewed. Values of wages further away from the median show that there's a big difference in the distance to the median, and that this part of the distribution is not symmetric anymore. In this case, I've labeled the graph to clearly indicate the two different regions. This will help laymen understand how this data is not symmetric. Symmetry plots are specialized visualization techniques and they're not used very often. Often distributions can be visually compared or examined using histograms or density plots, and most of the time it's pretty easy to determine any skew. However, a real advantage of symmetry plots is their ability to identify small skews that would be hard to notice with the naked eye. So use them if you're not sure about the symmetry of a particular data distribution.
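The transformation behind a symmetry plot is simple enough to sketch in Python. This is a minimal illustration, not the code used for the plots in this class; the function name and the two small data series are made up. Each point pairs the distance below the median of the i-th smallest value with the distance above the median of the i-th largest value.

```python
from statistics import median

def symmetry_points(values):
    """Return (distance below median, distance above median) pairs.

    The i-th smallest value is paired with the i-th largest value.
    Points on the line y = x indicate symmetry; points above it
    indicate a skew to the right.
    """
    xs = sorted(values)
    m = median(xs)
    n = len(xs)
    return [(m - xs[i], xs[n - 1 - i] - m) for i in range(n // 2)]

symmetric = [1, 2, 3, 4, 5]      # invented, perfectly symmetric
right_skewed = [1, 2, 3, 4, 10]  # invented, long tail to the right

print(symmetry_points(symmetric))
print(symmetry_points(right_skewed))
```

For the symmetric series every pair has equal distances, so all points sit on the reference line. For the right-skewed series the outermost pair has a much larger distance above the median than below it, which places that point above the line.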
16. What is a Quantile-Uniform plot?: In the previous session, we explored symmetry plots. Symmetry plots transform the data points of a continuous distribution to highlight areas where the distribution is not symmetric. Quantile-uniform plots continue the fashion of transforming the data points of a distribution, but this time in relation to a uniform distribution. Quantile-uniform plots test whether a data series or a variable is uniformly distributed. Remember, a uniform distribution, also called a rectangular distribution, is simply a distribution where the probability density function is horizontal. Each value along the distribution has the same probability of occurring. Like symmetry plots, an advantage of a quantile-uniform plot is that it highlights areas of the distribution where uniformity breaks down. A disadvantage is that these plots can be more complex to understand because of the data transformation that takes place. So let's have a look at a quantile-uniform plot. Here's an example of the temperature data that we've used a few times so far. In a quantile plot, each value of the variable is plotted against the fraction of the data that have values less than that value. The diagonal line is a reference line. If temperature was uniformly distributed, all the data would be plotted along the line. In this case, some data points are above the reference line, which suggests that data in the early part of our distribution is skewed to the left. Data points below the reference line in the latter part of the distribution suggest that the distribution is skewed to the right. This then suggests that the underlying distribution might be more bell-shaped than rectangular. And here's a video that highlights this even further. In this case, I'm changing a rectangular distribution into a skewed distribution. We can see that the quantile-uniform plot moves around appropriately to highlight where the distribution deviates away from a uniform distribution.
Quantile-uniform plots belong to the family of quantile plots. They're excellent for examining or testing a specific distributional property, in this case whether a distribution is uniform. The transformation used to create these plots makes them much better at spotting little differences compared to examining a normal histogram or a density plot. But at the same time, it also makes them more difficult to understand and to use in presentations to non-specialist audiences. These plots are often used by data scientists who are checking specific data transformations, often before some kind of multivariate analysis takes place.
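Here is a rough Python sketch of the quantile-uniform transformation. It is an illustration only: the function names are hypothetical, the data is invented, and the plotting position (i + 0.5) / n and the choice of range for the reference line are assumptions, since different software packages use slightly different conventions.

```python
def quantile_uniform_points(values):
    """Pair each sorted value with the fraction of the data below it.

    Uses the plotting position (i + 0.5) / n, which is an assumption.
    """
    xs = sorted(values)
    n = len(xs)
    return [((i + 0.5) / n, x) for i, x in enumerate(xs)]

def uniform_reference(p, low, high):
    """Value expected at fraction p if the data were uniform on [low, high]."""
    return low + p * (high - low)

def deviations(values, low, high):
    """Vertical distance of each point from the uniform reference line."""
    return [x - uniform_reference(p, low, high)
            for p, x in quantile_uniform_points(values)]

evenly_spread = [5, 15, 25, 35]                 # invented, uniform-looking
bell_like = [0, 15, 18, 20, 20, 22, 25, 40]     # invented, bunched in the middle

print(deviations(evenly_spread, 0, 40))
print(deviations(bell_like, 0, 40))
```

The evenly spread series sits exactly on the reference line, so every deviation is zero. The bunched series has positive deviations in the lower part of the distribution and negative deviations in the upper part, which is the pattern described above for data that is more bell-shaped than rectangular.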
17. What is a Quantile-Normal plot?: Let's continue examining visualizations that look at distributional properties. The quantile-normal plot, sometimes called the probit plot, transforms the data to test whether a distribution is normally distributed. Remember, a normal distribution is a bell-shaped distribution. Like the previous plots, an advantage of the quantile-normal plot is that it highlights areas of the distribution where normality breaks down. This is especially useful for distributions that look roughly normal but may not actually be normal. Such distributions can be hard to identify as non-normal using a standard histogram or a density plot. In fact, some experts believe that if a deviation from normality cannot be spotted by eye on a quantile-normal plot, then it's not worth worrying about. They also argue that statistical tests cannot identify where non-normality takes place in any given distribution, whereas a quantile-normal plot can. Of course, a disadvantage of a quantile-normal plot is that these plots are more complex to understand due to the data transformation that takes place. So let's have a look at a quantile-normal plot. Here is a quantile-normal plot of hourly wages. The hourly wage data makes up the y-axis, while the x-axis presents an inverse normal transformation of each data point. If the data is normally distributed, then the inverse normal transformation should result in each data point sitting on the diagonal reference line. In this case, we can see that our data does not sit on the diagonal reference line. Points in the middle of the wage distribution appear normally distributed, but points towards the edges of the wage distribution are far away from being normally distributed. Points that are above the reference line exhibit shallower bell-shaped curve behavior, while the points below the reference line are more compacted. Here's a video that highlights everything I've just said.
In this video, I'm changing a rectangular distribution towards a normal distribution. As the data becomes more normal, the data points converge towards the reference line. In this case, because the origin is a uniform distribution, most of the activity here takes place in the tails of the distribution. Like other quantile plots, quantile-normal plots belong to the family of plots that attempt to highlight how data distributions differ from theoretical distributions. They're excellent for examining or testing specific distributional properties, in this case whether a distribution is normal. They can be more challenging to use in a layman environment. However, they are excellent at visualizing deviations away from the normal distribution. These plots are often used by data scientists for exploratory purposes and/or after data transformations.
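The inverse normal transformation can be sketched with Python's standard library. This is a minimal illustration rather than the code behind the plots in this class. The function name and the wage values are made up, and both the plotting position (i + 0.5) / n and the use of the sample mean and standard deviation for the reference normal are assumptions.

```python
from statistics import NormalDist, mean, stdev

def quantile_normal_points(values):
    """Pair each sorted value with the value a fitted normal would give.

    x = inverse normal of the plotting position (i + 0.5) / n, using the
    sample mean and standard deviation; y = the observed value.
    Points on the line y = x are consistent with normality.
    """
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

# Invented hourly wages with a long right tail.
wages = [5, 6, 6, 7, 7, 8, 9, 10, 12, 40]

for expected, observed in quantile_normal_points(wages):
    print(f"expected {expected:6.1f}   observed {observed:3d}")
```

In this made-up example the highest wage sits far above the value a normal distribution would place there, so that point lies well above the reference line. That is the kind of tail behaviour a quantile-normal plot is meant to expose.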
18. What is a Quantile-Chi-squared plot?: Finally, let's have a look at a quantile-chi-squared plot. Like the previous visualization techniques, this quantile transformation tests whether a data series is chi-squared distributed. Each data point is appropriately transformed, and values that lie on the reference line are chi-squared distributed. If all the transformed values of the data lie on the chi-squared reference line, then the data is chi-squared distributed. As before, the main advantage of this visualization technique is its ability to highlight specific points of the distribution where data is and is not chi-squared distributed. The main disadvantage remains its complexity due to the aforementioned data transformation process. So let's take a look at a quantile-chi-squared plot. Here is the temperature data that I've been using previously. As you can see, this data is definitely not chi-squared distributed. None of the points are on or even near the reference line. Here's another example with data that is chi-squared distributed. As you can see, in this case the quantile-chi-squared transformation results in all the individual data points being on the reference line. I also plotted the actual distribution below for further reference. Note that chi-squared plots require the degrees of freedom to be specified. A chi-squared distribution with one degree of freedom looks very different to a chi-squared distribution with ten degrees of freedom. So compared to some of the previous distributions that we looked at in other videos, this extra parameter needs to be explored and adjusted for this type of plot. Quantile-chi-squared plots are great at revealing whether data distributions are chi-squared distributed and, if not, where exactly the data differs from the theoretical distribution. However, do remember to specify the degrees of freedom parameter correctly, or at least vary it, since chi-squared distributions can look very different depending on this parameter.
Overall, you should use this plot if you want to examine whether a variable or data series is chi-squared distributed.
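To show why the degrees of freedom matter, here is a rough Python sketch that uses only the standard library. It swaps in the Wilson-Hilferty approximation for the chi-squared quantile function rather than an exact routine, so treat the numbers as approximate; the approximation is poor for very small degrees of freedom. The function names, the data values and the plotting position (i + 0.5) / n are all assumptions made for illustration.

```python
from statistics import NormalDist

def chi2_quantile_approx(p, df):
    """Approximate chi-squared quantile via the Wilson-Hilferty formula.

    This is an approximation, not an exact inverse CDF, and it is
    unreliable for very small df (roughly df < 3).
    """
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def quantile_chi2_points(values, df):
    """Pair each sorted value with its approximate chi-squared quantile."""
    xs = sorted(values)
    n = len(xs)
    return [(chi2_quantile_approx((i + 0.5) / n, df), x)
            for i, x in enumerate(xs)]

# Invented data, used only to show how the reference values move with df.
data = [1.2, 2.0, 2.9, 3.5, 4.8, 6.1, 9.3]

for df in (3, 10):
    expected = [round(q, 2) for q, _ in quantile_chi2_points(data, df)]
    print(df, expected)
```

The same data is compared against very different reference values depending on the degrees of freedom, which is why this parameter needs to be set deliberately, or at least varied, before reading anything into the plot.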
19. What is a Quantile-Quantile plot?: Now let's take a look at the quantile-quantile plot. Quantile-quantile plots, often simply called Q-Q plots, are scatter plots of two variables with their quantiles plotted against one another. The idea is that if both sets of quantiles come from the same distribution, the data points will form a line that follows a 45-degree gradient. If both data distributions are linearly related but not identical, then the plot will still show a straight line, but not necessarily on the line y equals x. Q-Q plots are great for comparing two continuous distributions with each other. Importantly, the distributions do not need to be compared to the same theoretical reference distribution. A Q-Q plot simply tests whether two different variables have a similar or dissimilar distribution, not whether they follow a particular theoretical distribution. Q-Q plots are not often used in public data science, but they're often used for exploratory analysis, either before data manipulation or after data manipulation. So let's go and have a look at an example. Here is a Q-Q plot of the temperature data that we've looked at previously. This Q-Q plot plots the quantiles of temperatures in US cities in January versus temperatures in July. One would expect January temperatures to be distributed differently than July temperatures. January temperatures should on average be much colder than July temperatures. At the same time, we would expect the temperature variance to be the same. There'll be warmer days in January and colder days in January, and the same should happen in July. That means these two distributions may not be exactly the same, but they might be linearly related. And the Q-Q plot here shows evidence of this. The two temperature distributions are not identical. The quantile markers do not lie on the reference line, y equals x. However, there is some evidence that the quantile markers are linearly sloping.
And that means the two distributions have approximately equal shapes, but they're somewhat shifted. In this case, the January temperatures are left-shifted compared to the July temperatures because January temperatures are, on average, colder than July temperatures. And that's it. That is the Q-Q plot. Advantages of Q-Q plots are that the sample sizes do not need to be equal. Often they are, but it's not a requirement. Also, Q-Q plots are really just a straight-line test, and it's much easier to look for a straight line than it is to compare different kernel density plots with each other. Q-Q plots also allow for many distributional aspects to be tested simultaneously. For example, shifts in location, shifts in scale and changes in symmetry can all be identified from a Q-Q plot. Disadvantages of Q-Q plots are that laymen often confuse the markers with data points. That's not correct. These are quantile markers and they simply summarize the two data distributions. That can be hard to explain, and it makes Q-Q plots more difficult to use in public presentations. Also, kernel density plots and histograms can offer a more intuitive understanding of how two distributions either look the same or different. Q-Q plots are useful plots that allow data scientists to quickly evaluate two different continuous distributions. They can be superior to density plots by providing more summary information and by providing an easy eyeball test: is the line straight or is it not straight? Is it on the 45-degree slope or not? These are very easy things to see and test. However, Q-Q plots are not easy to understand for laymen, mainly because of the complexity of the data transformation. They probably shouldn't be used very often in public, but they should be used for exploratory analysis when looking at complex data.
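The idea of plotting quantiles against quantiles can be sketched with Python's standard library. This is a minimal illustration under simple assumptions: the function name is hypothetical, the temperatures are invented, and the July series is constructed as the January series shifted by 40 degrees so that the shift is easy to see.

```python
from statistics import quantiles

def qq_points(sample_a, sample_b, n_quantiles=10):
    """Return matching quantile pairs for two samples.

    The samples may have different sizes, because only their
    quantiles (here the deciles) are compared.
    """
    qa = quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))

# Invented temperatures; July is January shifted up by 40 degrees.
january = [18, 22, 25, 27, 30, 31, 33, 36, 40, 45, 52]
july = [t + 40 for t in january]

points = qq_points(january, july)
gaps = [b - a for a, b in points]
print(points)
print(gaps)
```

Every quantile pair is the same distance apart, so the markers form a straight line parallel to y equals x rather than on it. That is the signature of two distributions with the same shape but a different location.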
20. Discrete Variables: Now let's examine how we can display data from variables that are discrete. Discrete data is probably much more common than continuous data in real-world data applications. So it is important to understand what visual options are available for discrete data. A discrete variable is a variable that contains countable data. So basically, it has a certain number of values. Of course, many real-world continuous variables are also countable. However, there tends to be a point, somewhere around the 20 to 50 unique values mark, where data scientists stop seeing data as discrete and consider it continuous. But of course, this is very case-specific. Examples of discrete data that are often treated as categorical data include the number of people in a room, Likert scales, days in the week, ethnicities, how many children families have, gender, occupational categories, political parties, et cetera. Discrete data may have an ordering to it, or it may be unordered. For example, age groups are discrete but have a natural ordering to them, whereas mode of transport, such as train, bus, or car, does not have a natural order. Finally, because discrete data often have a natural grouping to them, we don't need to consider concepts such as bin sizes or bandwidth, although we do sometimes merge certain groups together. Discrete data from single variables are frequently presented in tabular form via a one-way table. This is often a natural way to present such data, since tables are flexible in their design. Tables are very customizable, and they allow you to insert different statistics and create different groups from discrete data. Most layman users will have no problems understanding a well-formatted table. Here's an example of a basic table that displays discrete data. In this case, I'm looking at transportation data and presenting frequency values, percent values, and also cumulative percent values.
It's very easy to read this table and to get a good understanding of how this particular data is distributed. However, there are graphical ways in which this table can be represented. But I should say there's a limit to how many graphical ways there are to present single-variable discrete data. Ultimately, all we can do is present the actual data itself, often as percentages or frequencies. But many of the graphs that I present here are the basis of more complex graphs that visualize discrete data across multiple dimensions. Such graphs are often used for relationship plots, but that is for another course. So let's have a look at how to visualize basic discrete data.
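A one-way table with frequency, percent and cumulative percent columns can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical and the transport values are made up, not the data shown in the video.

```python
from collections import Counter

def one_way_table(values):
    """Return rows of (category, frequency, percent, cumulative percent).

    Rows are ordered from the most to the least frequent category.
    """
    counts = Counter(values)
    total = sum(counts.values())
    rows, cumulative = [], 0.0
    for category, freq in counts.most_common():
        percent = 100 * freq / total
        cumulative += percent
        rows.append((category, freq, round(percent, 1), round(cumulative, 1)))
    return rows

# Invented mode-of-transport observations.
trips = ["car"] * 5 + ["bus"] * 3 + ["train"] * 2

print(f"{'mode':<8}{'freq':>6}{'percent':>9}{'cum.':>8}")
for category, freq, percent, cumulative in one_way_table(trips):
    print(f"{category:<8}{freq:>6}{percent:>9.1f}{cumulative:>8.1f}")
```

Sorting by frequency is a design choice that suits unordered categories such as transport modes. For ordered categories such as age groups, it would usually make more sense to keep the natural order so that the cumulative column stays meaningful.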
21. What is a Bar graph?: Let's have a look at bar graphs, also known as bar charts. These are by far the most common type of data visualization technique, and they have a tremendous amount of flexibility to them. Virtually everyone can recognize them, and many of us are taught them at school. Here I'll try to focus on some of the more interesting and important features of bar graphs. A bar graph represents discrete and categorical data with rectangular bars whose height or length is proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart can also be called a column chart. Bar charts are very similar to histograms, with the exception that the bins themselves are now discrete. This means there's no need to think about bin widths, since each value of the relevant variable is now its own category. Of course, you can still merge categories together, but these new categories continue to be their own discrete categories. At no point does the user need to think about where categories might start or stop. The primary advantage of a bar chart is its simplicity. Bar charts can often convey data in a fraction of a second, whereas a table of the same data might take multiple seconds for a reader to comprehend. So let's have a look at a basic bar graph in this video. In this and the next few videos, I'll only look at single or multiple discrete variables. That means we're ultimately only interested in how the data is distributed across one or various groups, not how multiple discrete variables are related to each other. So here's a basic bar graph of the variable race from a labor force dataset. The graph highlights how many observations are in each particular category. For this particular dataset, we can quickly and easily observe that the majority of observations are white observations. And the next biggest category is black observations.
After that comes another category, which bunches all the other ethnicities together in a category called other. Therefore, this visualization tells us that this particular data largely consists of white and black labor force observations. Some users like to put labels at the top of each bar that contain the actual data values. That would look something like this. Now we can see the exact number of observations in our data. That may be very useful, although it introduces an additional element of complexity to the graph. Because of that, in my opinion, it distracts from the original objective of a bar graph. Bar graphs are visualizations of data and are designed to provide an approximate visual overview of the data. If you find that you need to put precise frequency or percentage numbers on top of the bars, you're probably better off with a table anyway. The whole point of this kind of visualization is to generalize and to approximate. So you'll find that I often don't include the actual data values in bar graphs. This bar graph presents frequency statistics, but another common statistic to present is percentages. Here is a percentage bar graph. Visually, this bar graph repeats the previous narrative. The data largely consists of white observations, some black observations and very few other observations. However, the y-axis range has now been adjusted to percent. That can make it even faster to identify different data patterns. We can quickly see that approximately 75 percent of the data is white and approximately 25 percent of the data is black. The other category accounts for around 1 percent. This is usually my preferred way of graphing discrete data. Percentages are very easy to interpret and understand, and they save the audience from having to do manual computations from individual frequencies. Before we explore some other examples, let's have a look at the advantages and disadvantages of bar graphs.
Advantages of bar graphs are that they are very easy to create and virtually anybody can understand them. Bar graphs also become more powerful the more discrete values a variable has. If a variable has many categories, then a table of statistics will likely have many numbers that can be hard to compare quickly. A bar graph won't have that problem. Also, when ordered, bar graphs are better at highlighting trends than tables or other discrete visualization options. Bar graphs are also very versatile and easily extendable to multiple groups and variables. They can quickly become multidimensional and convey a lot of complex information. And you don't have to worry about bin widths or complex kernel density functions. As long as the variables of interest have discrete categories, you don't need to worry about the graph changing shape radically because you're making some choices about bin width. Disadvantages of bar graphs are as follows. Depending on their complexity, they can require additional explanation. However, this usually applies to complex bar graphs and not simple ones like the ones I'm showing here. Bar graphs can also be easily manipulated to give false impressions. Of course, this can be done with almost any graph, but bar graphs are often the tool of choice when people want to present data in a biased way. Bar graphs also have issues with extreme number ranges. A huge number in one category and many small numbers in the other categories can lead to a very poor visualization. Now let's have a look at a few more examples of basic bar graphs. A common variation on the standard bar graph is to flip the bars into the horizontal plane, creating a horizontal bar graph. Here's an example. This graph shows the various occupation categories in the labor force data and what percentage of that data falls into each category. Often the decision to use horizontal or vertical bars is a matter of preference.
But I have noticed that I am more inclined to use horizontal bars if I have many categories. Mainly this is because of labeling issues. As you can see here, the long occupational labels all fit easily into this graph. This would be much harder to do with vertical bars. Here's an example of me trying to cram all the label information into a vertical bar graph. As you can see, while it kind of still works, I had to use rotated text to cram in all the information, and my eyes still need to adjust to a 45-degree plane to read the text. Another useful property of bar charts is the ability to sort bars. While it is not possible to rearrange continuous data, it is often possible to rearrange discrete data. For example, ordering values from largest to smallest will clearly highlight to the end user which categories have the highest count. Here's an example of that. It's very easy and simple to see in this graph that sales occupations dominate this labor force data. They are by far the biggest category. Next come professional occupations, and then laborers. Finally, do be careful not to create a simple bar graph for a variable that has too many categories. As I highlighted earlier, discrete variables are any variables that can be counted. But not all things that can be counted should necessarily be treated as discrete. Here's an example with hours worked from the labor force data. As you can see, there are many hours categories, around 80. And yes, we can identify a spike at the 40-hour mark, which shows that 40 hours is a common amount of time worked per week. But the overall graph adds little to our understanding. This is a scenario where a density plot or a histogram, treating the underlying data as continuous, would be much more useful in summarizing the information for this particular variable. So as a rule of thumb, watch out for cases where the labeling doesn't fit naturally, like here, where I had to use a small font size and alternating spacing to fit in all my labels.
In such cases, you should probably look for another visualization technique and not a bar graph. Now finally, please do be responsible when it comes to axis labeling. I previously mentioned that bar graphs can often be used to misrepresent data. And here's a basic example. In this example I'm graphing the number of people in the data who have and who do not have a college degree. Both graphs use exactly the same data. However, on one graph, the y-axis ranges from 20 to 80, whilst in the other graph I used the full range from 0 to 100. Anyone not paying careful attention to the axis labeling would assume that there's a gigantic difference when looking at the second graph. And there is indeed a big difference in how many observations have a degree and do not have a degree. But the difference is not as big as the second graph makes you think. So do take care with that. Finally, let's have a look at some examples where we want to present bar graphs for discrete data across different groups. For example, it's common to visualize the distribution of a discrete variable by another variable, such as occupations by union membership. The most common way to display such data is by plotting the bars of union and non-union membership right next to each other across each discrete value for occupation. Here's an example using a horizontal bar chart due to the large number of categories and the long labels in this graph. This is a great example of the power of bar graphs in quickly presenting a lot of discrete data. It's easy to see that most union and non-union members are located in the sales occupation. Other occupations have very few observations for both union and non-union members. By changing the x-axis to percent, we can also repurpose the graph to highlight the proportion of union and non-union members in each occupation. This graph suggests that the highest concentration of union members can be found in household workers and other occupations.
Again, it's very easy to interpret this type of visualization, although there is a limit to how many categories can be introduced. One way to reduce the number of bars but keep the same data in the graph is to use a stacked bar chart. This method stacks the values of the discrete union variable on top of each other for each occupation category, resulting in a less cluttered visual look. Here is an example where I combine both the frequency and percent results from earlier. This is a neater chart because it is easier to read off, in each occupation category, what percentage are union members and what percentage are non-union members. It's also easy to identify how many total union and non-union observations are in each group, although figuring out how many are union and how many are non-union is more complex in the stacked bar graph. And this is one of the disadvantages of stacking. It does make the plot visually a lot easier to read when using percentages, but it makes the plot visually harder to read for frequency data. Finally, another good way to present data across groups is by using a mirror plot. Here's an example of a population pyramid that is commonly used in demography. This plot depicts a horizontal bar graph with two categories, males and females. But rather than stacking data or presenting bars next to each other, the bars are flipped and mirrored against each other. The advantage of a mirrored bar graph should be evident. It's easier to identify rough patterns across many discrete groups. For example, we can see there are more females in the higher age categories. At the same time, reading off the exact values of the numbers is more difficult. It can also be visually harder to identify groups that are of equal size, since the eye has to travel a lot across the graph. Bar graphs are very flexible and we'll see some of that power again later. But at their core, bar graphs are just plain easy.
And easy goes a long way in anything, to be honest. Use them when you can, as audiences will silently thank you for not putting up messy tables or very complex data visualizations. You should take care when formatting bar graphs. But there's a very good reason why they are the most commonly used graph type. Any data analyst should use them often. Finally, also keep advanced bar graph visualization techniques in mind when dealing with a lot of discrete data categories. Examples of that include stacking or mirroring.
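To make the axis labeling warning from this section concrete, here is a minimal Python sketch. It compares how much longer one bar looks than another when the axis starts at 0 versus when it starts at 20. The two percentages are hypothetical and are not the values shown in the video; the function name `drawn_ratio` is also just an illustrative choice.

```python
def drawn_ratio(larger, smaller, axis_min=0.0):
    """Ratio of the two bar lengths as drawn when the axis starts at axis_min."""
    if axis_min >= smaller:
        raise ValueError("axis_min must be below the smaller value")
    return (larger - axis_min) / (smaller - axis_min)


# Hypothetical percentages (assumed for illustration, not taken from the video).
no_degree, degree = 75.0, 25.0

full_axis = drawn_ratio(no_degree, degree, axis_min=0.0)   # axis starts at 0
truncated = drawn_ratio(no_degree, degree, axis_min=20.0)  # axis starts at 20

print(f"ratio on full axis: {full_axis:.1f}x")
print(f"ratio as drawn on truncated axis: {truncated:.1f}x")
```

With these assumed numbers the data differ by a factor of 3, but the truncated axis draws one bar 11 times as long as the other. The data are the same in both cases; only the axis start changes the visual impression.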
22. What is a Pie chart?: Next, let's have a look at the pie chart. I'm going to be honest up front and say this is not a very useful visualization technique. You should probably not use it in most circumstances. A bar chart can do everything a pie chart can, and probably do it much better. But pie charts are well-known. They're often taught in school, and they're pushed on people who use software like Excel. They are visually very catchy, and that seems to be their primary reason for existence. Sometimes known as circle charts, pie charts are circular visualization graphics which divide data into slices that illustrate numerical proportions. In a pie chart, the arc length of each slice, and its area, is proportional to the quantity it represents. Bigger slices equal more data. So that's very simple. Moreover, the circular nature of the visualization method has a particular aesthetic appeal. There are no sharp discontinuities, and the eye can glide over all the information with relative ease. So let's go and have a look at a basic pie chart. Here it is. This is a basic two-dimensional pie chart that highlights how the underlying data, in this case census data, is distributed across four regions of the United States. This pie chart shows that the South region has the most observations, whilst the Northeast region has the fewest observations. Because the data is given in a circular visualization, and the arc length or central angle determines the size of each slice, all size interpretation is done between 0 and 360. A slice with an angle of 90 degrees will contain a quarter of the data. And to me, that's one of the main problems of pie charts. Whilst having a value of 360 stand for 100 percent will be intuitive to many people, it won't be as intuitive as a value of 100. Anyone wanting to access more detailed statistics from a pie chart will have to perform angle measurement and radial computation. A bar chart is a much easier way to get statistics across.
So to me, this makes pie charts quite unattractive. They're only good for showing very generalized proportions of discrete data, like this example. One category is large, one category is small, and the other two categories are pretty similar. Some programs can also create fancier 3D versions of pie charts. But in reality, this is mostly a bad idea. Visualizations should be kept compact and to the point. If an extra visual dimension doesn't add any extra data information, you should usually not add it. So what are some of the advantages and disadvantages of pie charts? Advantages are that they are very visual and have an uninterrupted plane of visualization. This makes them easy on the eye and great at communicating very basic and very general information about discrete data. At some level, they're even easier to understand than bar graphs. Pie charts excel at speaking to laymen and those with little knowledge of statistics or data analysis. Pie charts are often used in the popular media. Unlike other graphs, pie charts are easy to manipulate to present a particular aspect of the data. For example, it's common to explode slices of the pie to emphasize certain data categories. Disadvantages of pie charts are that they're very poor at revealing the underlying values. Moreover, computing values manually from the chart requires additional mathematics due to the radial nature. And pie slices that don't have a neat 90, 180, 270 or 360 degree angle can be somewhat hard to interpret. Pie charts are not very useful for many categories of data. There's a natural limit to how many slices users can interpret. And I would say any more than five or six slices requires an alternative visualization method. Finally, pie charts are not very flexible and can't be extended much beyond simply showing discrete data for a single variable. Unlike bar charts, which can accommodate multiple variables as we'll see later, pie charts can only accommodate one discrete variable.
So let's continue and have a look at a few more pie charts and some common transformations. One natural critique of pie charts is their default inability to accurately display the size of each discrete category. That can be fixed in most circumstances by simply adding the percentage label to the actual slices. Here's an example. In this case, I replaced the name of each slice with the percent size of each slice. The color legend continues to identify which slice belongs to which discrete category. And we can now see that the North Central category is actually slightly smaller than the West category. Of course, it's also possible to highlight actual frequencies. Here is an example. I replaced the percentage values of each category with the underlying frequency data. Now we can see that this dataset is not very big and only contains 50 observations, in this case one observation per state. Of course, it's possible to add both percentages and frequencies to a pie chart. But to a large extent, all of this is not very satisfying. If you create a pie chart that needs numeric values in the slices, you're probably better off using a table or a bar chart. One useful aspect of the pie chart is its ability to highlight particular discrete categories. Of course, this can be done in any graph by simply using colors. But pie charts allow a second alternative by exploding one or many slices outwards. Here's an example. In this example I am highlighting the South category by exploding that slice out of the circle. Whatever your opinion of pie charts, this is certainly a very eye-catching technique to draw attention to a particular data category. And I would say this is probably one of the better advantages of pie charts. This technique is often used in the media or in marketing to make a very forceful point about a specific data category. However, it works best when only one slice is exploded.
If you explode all the slices, you'll simply end up with a very difficult graph to interpret. And here's an example of that. This pie chart defeats itself by removing the extremely basic visualization setup and introducing visual discontinuities. It is much more uncomfortable to interpret this pie chart than a regular one. So limit your explosions. Another common variation of a pie chart is the donut chart. Donut charts remove the center of the pie, creating a visualization that resembles a doughnut. Here's an example of that. Donut charts are even worse visualization techniques than pie charts, because it now becomes impossible to measure the central angle of each category. All that can be looked at is the arc length, which makes comparison of groups even harder. Donut charts are only useful for very, very basic data representation. They are often filled with other images, icons and text to enhance the message they're trying to give. But as pure data visualization tools, they often don't add very much value. Finally, let me show you why pie charts are poor visualization choices for many categories of discrete data. Here's an example with labor force data and the variable occupation. As you can see, one of the primary issues with this chart isn't so much the small-angle slices for some of the occupations. It's actually that the legend runs out of colors and has to recycle colors. We could label the slices with names, but small pie slices will be very hard to label accurately, and a lot of the text will probably overlap. We could also use slightly different gradient colors. But the reality is that there's a limit to how many strongly contrasting colors the human eye can interpret. In this case, 11 occupational categories are simply too many, and the graph becomes messy and difficult to analyze. Finally, pie charts don't lend themselves well to multiple comparisons.
Here's an example of a very simple pie chart that has only two categories, but spread over five different groups. These pie charts plot the proportion of foreign and domestic cars by each repair category in a car dataset. Whilst it is easy to identify that repair group five has the highest proportion of foreign cars, to identify this we need to look at each pie chart individually and examine all the angles and surface areas to make that determination. That is a lot of work for a reader. And it gets considerably more difficult the more groups are included in the pie chart. Here's an example of a pie chart over multiple groups with many discrete categories. This multiple pie chart graph contains no labeling. Well, it's clear that comparison across pie categories is going to be quite strenuous for any casual observer. The eye needs to perform a lot of movement, and for the brain it's just a lot of computational tasks to figure out what's going on here. My recommendation is to stay away from such graphs. I don't think they add much visualization value to data. Overall, I recommend that data professionals stay away from pie charts unless they have something very specific in mind. Pie charts are a poor way to visualize discrete data, since it's very hard to get statistical feedback from them. However, if your primary motivation is to present a very simple message in a visual way using a single discrete data variable, then consider using a pie chart with an exploded slice.
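The angle arithmetic mentioned in this section (a 90 degree slice holds a quarter of the data) can be sketched in a few lines of Python. The region counts below are illustrative assumptions, chosen only to be consistent with the description in the video (50 observations, South largest, Northeast smallest, North Central slightly smaller than West). They are not read from the actual dataset, and `slice_angles` is a hypothetical helper name.

```python
def slice_angles(counts):
    """Map each category to (percent of total, central angle in degrees)."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {name: (100 * n / total, 360 * n / total) for name, n in counts.items()}


# Illustrative counts (assumed), consistent with the description in the text.
regions = {"Northeast": 9, "North Central": 12, "South": 16, "West": 13}

angles = slice_angles(regions)
for name, (pct, deg) in angles.items():
    print(f"{name:<14}{pct:6.1f}% {deg:7.1f} degrees")
```

Reading a value back off a pie chart means running this conversion in reverse by eye, which is the extra work a bar chart with a percentage axis avoids.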
23. What is a Dot chart?: Now let's have a look at dot charts. Dot charts, or dot plots, are almost identical to bar charts in the underlying concept. They simply replace the bar with a series of dots, often 100 dots in a line, and then plot a further dot or marker on that line of dots to indicate what each discrete value's statistic actually is. In all other respects they are identical to bar charts, but they offer an important visual alternative to bar charts, which can sometimes feel cluttered when too many categories are introduced. Dot charts are better at handling many categories and excel at displaying many discrete categories over many groups. So here is a classic dot chart. In this case, I'm plotting ethnicity data from a labor market dataset. It's obvious that most of the data consists of white observations, followed by black observations. Other ethnicities are hardly noticeable in this particular dataset. So far, so similar to the bar chart. The only noticeable difference here is the focus on dots. Each discrete category has a row of 100 dots, and a larger blue dot identifies how many observations are in each category. At this number of discrete categories, in this case three, it's mainly a matter of personal visual preference whether you go with a dot chart or a bar graph. It doesn't really matter, although I suspect the bar graph might be more visually appealing here. It has more defined boxes and fewer empty gaps in the graph, leaving less whitespace for the eye to get lost in. However, dot charts excel at plotting discrete data with many categories. And this is where too many bars from a bar chart might get in each other's way. So here's an example with the variable occupation from the same labor force dataset. As you can see, this dot chart is very easy to interpret. Most occupations in this data are sales occupations, followed by professionals and labor occupations. For dots that are close to each other, it now also becomes possible to compare small differences.
This is because we can count the number of dots: each little dot represents around one percentage point, since the dots add up to 100. So that can be really useful if you're interested in comparing groups that are very close together. Here's another example using the variable called school grade. This discrete variable has many categories and is often treated as a continuous variable. But a dot chart can easily distinguish between the 16 different school grades that are in this particular dataset. We can see that 12 years of schooling has over 900 observations in this data, and this constitutes by far the largest category. Finally, we can also present multiple discrete variables on a single dot chart. Two visualization options exist here, and you need to decide which one best suits your purposes. Here's the first example. This dot chart presents the frequency distribution of the occupation variable by union status. The largest non-union occupation is sales, whilst the largest union occupation is also sales. However, depending on how the comparison is meant to be made, we can also rearrange this dot chart into a different visualization. And here it is. This dot chart puts the emphasis on the union versus non-union comparison across occupations, whilst the previous dot chart emphasized the reverse. We can easily compare the frequency distributions within occupational groups. For example, we can quickly determine that the other occupations category has about the same number of union and non-union respondents. Dot charts excel at showing many discrete categories. If your single discrete variable has a lot of categories, consider visualizing them via a dot chart.
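The row-of-dots idea from this section can be sketched as plain text in Python. This is only an illustration of the concept, not how any particular statistics package draws a dot chart. The counts and the axis maximum are assumed round numbers that loosely echo the ethnicity example, and `dot_row` is a hypothetical helper.

```python
def dot_row(value, axis_max, width=100, dot=".", marker="O"):
    """Return one dot-chart row: `width` characters with one marker at `value`."""
    if not 0 <= value <= axis_max:
        raise ValueError("value must lie within the axis range")
    pos = round(value / axis_max * (width - 1))  # marker position along the row
    return dot * pos + marker + dot * (width - 1 - pos)


# Illustrative counts and axis maximum (assumed, not read from the dataset).
counts = {"white": 1600, "black": 600, "other": 30}
axis_max = 2000

rows = {name: dot_row(n, axis_max, width=50) for name, n in counts.items()}
for name, row in rows.items():
    print(f"{name:>6} {row} {counts[name]}")
```

Because every row has the same number of dots, small differences between categories can be read off by counting dots between two markers.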
24. What is a Radar plot?: Now let's turn to radar plots. Radar plots, sometimes called spider plots, are another way to visualize discrete data. They're not as intuitive as a bar chart or a pie chart, but can be useful in certain circumstances. Radar plots consist of a sequence of equiangular spokes called radii, with each spoke representing one discrete category. The data length of a spoke is proportional to whatever is being measured, usually frequencies or percentages. Finally, a line is drawn between all data values for each spoke. And this gives the plot a star-like or spider-like visual appearance. An important advantage of a radar chart is comparing a large number of discrete categories. However, the key advantage is probably the ability to overlay multiple data groups on top of each other and obtain group-wise comparisons across many discrete categories. This is often used in sports statistics, for example, where different player skills and statistics are shown on a single radar plot. So let's have a look at a simple radar plot. Here is an example. In this example I'm plotting the previously used ethnicity data from a sample of labor force observations. Each category is highlighted as a spoke on the radar plot. And we can read off the data values by looking at where the blue line intersects each spoke. We can see that there are around 1,600 white observations and around 600 black observations. However, this is not actually a good example of a radar plot. The key problem with this graph is that there are too few categories and the eye has to travel far across the picture to figure out what's going on. Radar plots are more useful when there are many categories, such as education for example. Here's an example of that. Here there are 16 school year categories on the graph. And we can identify that most observations are in category 12, which represents the 12th grade of school.
Finally, here is a radar plot that examines the frequency distribution of education over two different groups, union and non-union members. This is both a good and a bad example of a radar plot. It's a good example because it highlights where radar plots can be useful: comparing and contrasting many discrete categories over different groups. However, as you can see from this graph, most of the data is located in education category 12, and we can hardly make out the frequency sizes in some of the other education categories. Moreover, the union radar plot is very small and visually squashed by the non-union plot. It's hard to figure out what exactly is going on here, except that there are fewer union members than non-union members. All in all, this type of visualization would not work well for this particular data. A dot chart or a bar chart is likely to perform much better. Advantages of radar plots are as follows. They are good at comparing many categories, but excel at highlighting many categories over different groups. For example, they can quickly provide an overview of how certain groups, such as different people, differ in their individual characteristics. Another advantage is that the radial nature lends itself to many categories. Radar plots can contain more categories than pie charts, whilst keeping a circular visual narrative that can be easy on the eye. Disadvantages of radar plots are that they are complex to interpret. The radial nature makes them more complex, and the addition of spokes requires a lot of movement of the eye, making comparison harder for untrained individuals. Some users may confuse the area of a radar plot with importance. But this is not correct. It's the intersection with the spoke that matters. And the area and shape of the polygon can change significantly depending on how the axes are positioned around the circle. It can also be difficult to compare discrete categories that are not next to each other, because the eye has to travel far around the circle.
And finally, radar plots become difficult to read when one category has high values and other categories have low values. As I've demonstrated previously, this can lead to very spiky radar plots that are visually quite demanding to handle. Radar plots are also often colored to further highlight how the data is distributed. But I personally recommend doing this very sparingly. Many readers confuse area with importance because they're so used to it from pie charts. And that's just not the case for radar charts. Only the length of the spoke matters, not the area of the graph. Generally, radar plots should be avoided for single discrete variables. They can be good for plotting many discrete categories, but I think their use should be primarily related to plotting many discrete categories over other groups of data. This is where they have a particular strength. However, even when the data setup is correct, the actual numbers may lead to spiky and very asymmetric radar plots that don't look good visually. So the reality is that the appropriateness of a radar plot will very much depend on how the data actually falls within your dataset.
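The point that the area of a radar polygon is not meaningful can be checked with a small Python sketch. It places values on equally spaced spokes, then computes the polygon area with the shoelace formula for the same four values in two different spoke orders. The values and function names are made up for illustration.

```python
import math


def radar_vertices(values):
    """Place each value on its own spoke; spokes are spaced at equal angles."""
    n = len(values)
    return [
        (v * math.cos(2 * math.pi * k / n), v * math.sin(2 * math.pi * k / n))
        for k, v in enumerate(values)
    ]


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# The same four hypothetical values, assigned to the spokes in two different orders.
alternating = polygon_area(radar_vertices([4, 1, 4, 1]))
grouped = polygon_area(radar_vertices([4, 4, 1, 1]))

print(f"alternating order: area {alternating:.1f}")
print(f"grouped order:     area {grouped:.1f}")
```

The spoke lengths are identical in both cases, yet the enclosed area differs, which is why the area should not be read as a measure of importance.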
25. What is a scatter plot?: Now let's take a look at scatter plots. Anyone working in data analysis will not be able to avoid the basic scatter plot. Scatter plots are the go-to visualization of choice for anyone seeking to explore how two continuous variables might be related to each other. A scatter plot, also called a scatter graph, uses dots to represent values of two different continuous variables. One variable is placed on the y-axis, and the other variable is placed on the x-axis. And I say dots; that is the default for many graphs. But it could also be squares, crosses or any other kind of marker. The positioning of each dot on the horizontal and vertical axis indicates the values for each individual data point. And what better method than simply plotting the actual data on a graph? This in turn is one of the great advantages of the scatter plot. We are showing the raw data. And because you can actually see the data, anybody who sees the graph can make up their own mind about what is happening in that graph. In statistical lingo, we often call that non-parametric, and this is as non-parametric as you can get. At the same time, this is also one of the biggest disadvantages. There's only so much the eye can see or a graph can display. Once the observation count goes into the hundreds or thousands, it can be tough to figure out what exactly is going on in a scatter plot. So let me show you a basic example of a scatter plot. In this scatter plot, I am plotting 74 observations. Each observation represents a specific car type, and we have data on the price and the weight of each car. This scatter plot suggests there is a general positive relationship between car prices and the weight of each car. So far, so good. But note that the positive relationship between car prices and weight has to be inferred from the data. The scatter plot does not aggregate the data into a relationship coefficient. It simply presents the raw data to the eye, and our brains have to work out the rest.
And that can be good in certain situations, but not all. Not everybody thinks the same, and other commentators or viewers of this graph might not clearly see the positive relationship in this graph. So let's have a look at some advantages and disadvantages of scatterplots. Advantages of scatterplots are as follows. First, they present data as it is, and they don't make any judgment about relationships. The reader has to do that. Scatter plots can also show a lot of data, much more than many other visualization types, although even here there are limits. Also, although scatter plots are meant for continuous variables, they can accommodate some discrete variables. It will depend on the data structure. The scatter plot is flexible and can be extended into different subtypes to accommodate more variables and information. This includes three-dimensional scatter plots, bubble charts, or even heat plots. Finally, scatter plots are often combined with lines of best fit. And together, these two graphs work really well at creating a very powerful visual narrative. Disadvantages of scatterplots are as follows. If you want everybody to think the same, then a scatter plot may not achieve that. They present the raw data and everybody might interpret that slightly differently. Scatterplots are also not very useful for discrete variables. If you have two discrete variables that are binary, for example, a scatterplot will definitely not help you. Also, it is very common for readers to interpret the chart as causal, so that a change in x leads to a change in y. But there's nothing causal about a scatter plot. A scatter plot simply shows a correlation, but people often assume that X causes Y. And finally, scatterplots do not show statistical summary information. We might wish to get an idea of what the mean or standard deviation of either the X or the Y variable might be. But that needs to be inferred, and scatterplots don't always make it easy. Now let's have a look at some variations.
First, while scatterplots are meant for continuous variables, remember that sometimes variables treated as continuous are actually discrete variables. And that's not a problem for a scatterplot as long as there are enough discrete categories to make sense of in the graph. Here's an example where one variable is continuous and the other one is discrete, but a scatterplot still works. This scatter plot tells us that there appears to be a negative relationship between car prices and car mileage. But the mileage data is not continuous. It increments in units of one, and that's why the markers are vertically aligned in this graph. But because there are so many discrete categories and the data only has a few observations, there is still significant variation and we can still make sense of this particular scatterplot. Here's an example where both variables are discrete and have only a few categories. It looks like this scatter plot only has seven observations, but actually there are 74 observations. All the markers are sitting on top of each other. So clearly, scatterplots need to be restricted to continuous or quasi-continuous variables to work properly. Another important feature to be aware of is the ability to change markers. Markers on scatter plots are often solid dots, but they come in all shapes and sizes. Color, for example, is a common feature to adjust in a scatter plot. For example, here is the same scatter plot of car price against mileage as previously, but with the data divided into foreign and domestic cars. We can now distinguish between two different categories on our scatter plot and examine whether the data or relationship differs between these two categories. Of course, this can also be represented with different markers. For example, you may want to visually differentiate the data with squares and crosses, like this. Look at that. This is a very distinctive scatterplot. Finally, you can also add text to markers or replace markers with text entirely.
Here's an example with a subset of the auto data. Each observation has a label next to it, allowing viewers to clearly identify what each marker represents. So overall, there are many ways to manipulate scatterplots to bring the raw data out of the picture and make it a very eye-catching picture. Don't be afraid to manipulate markers or colors to suit your particular needs. Another common problem that scatter plots face is too much data. Even in small datasets, raw data points may overlap. You can get around this by using transparent or semi-transparent markers, or by changing the dot size, to give readers an idea that data points sit on top of each other. But what if you have a lot of data points, let's say 100 thousand? Take a look at this plot. If your scatter plot looks like this, then you may have a problem of too much data. Most people looking at this would think there's a positive relationship between y and x. But we're only inferring that from the outline of this scatterplot. In other words, we're using the outline of the data to infer the relationship. But inside this big blob of blue, the data may have a completely different distribution, and the relationship may actually be negative. If that's the case, our final option is to reduce marker sizes. Here's an example of that. In this video, I am slowly reducing the marker size until we see more and more data. However, at some point, the markers become too small to be of practical use. So clearly, a balance needs to be found. Ultimately there is a hard limit on what a scatter plot can show, although as you can see, it is possible to get thousands of data points onto a single graph. Scatter plots are very powerful because they can show lots of data, and they show that data in a raw, unformatted form. They often act as a visual data dump and allow you to throw lots and lots of information at other people in a way that doesn't distort the data much. However, scatterplots are not without problems.
They don't summarize the data and they don't provide a lot of detailed statistics. Moreover, different readers can have different interpretations of what they see. Scatter plots are not often used in very basic data visualizations, but they are used extremely often when analyzing data relationships. They are the go-to graph when talking about correlations or relationships, or how something relates to something else. However, I do recommend that you generally combine scatterplots with lines of best fit, as these two create a very powerful visual dynamic.
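Since a scatter plot leaves the relationship to be inferred by eye, it can help to see what a line of best fit and a correlation coefficient add. Below is a minimal Python sketch using ordinary least squares and the Pearson correlation, written with the standard library only. The data points are invented for illustration and are not the car data from the video.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept, plus the Pearson correlation r."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables need some variation")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Invented example data: x could be weight, y could be price.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x, correlation r = {r:.3f}")
```

The fitted line summarizes the direction and strength of the relationship, but like the scatter plot itself it describes a correlation and says nothing about causation.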
26. What is a heat plot?: Let's explore some variations of scatterplots. One of the previously highlighted issues that scatterplots can sometimes encounter is that of too much data. That often results in one big blob of markers where users can't identify anything other than extreme outliers. While it's possible to change marker attributes such as their transparency or their size, there is another way to deal with this problem, and that is to recast a scatter plot into a heat map or a heat plot. Heatmaps bin portions of the data into certain shaped bins, and then assign a value to each bin, depending on how much data is inside that bin. It's very similar to a histogram, except there are now two histograms that interconnect via the y and x-axis. And the result is a plot of squares across a graph. So let me show you. Here is a basic scatterplot of personal weight data versus personal height data. This data is from a medical dataset and, as you can see, it contains a lot of observations, over 10,000. Now let's convert this into a heatmap and see what happens. There we go. The general outline of this graph is the same as the previous scatterplot, but we are now presented with a color gradient that identifies what proportion of the data sits inside each square on this graph. We can now see that there's a massing of data in the center of this large blob. And it appears that the relationship within this mass is still positive. So the overall relationship between weight and height is positive. But we confirmed this via a heatmap and not via the outline of a scatterplot. Heatmaps are visually more complex for users to follow and interpret, and you should be careful about how you use them. However, when faced with very large datasets that contain a lot of overlapping data points on a scatter plot, they can be a great way to visualize what is happening inside the data blob. So consider using them for complex scatterplots.
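The binning step behind a heat plot can be sketched in Python as a two-dimensional histogram: each point is assigned to a rectangular bin, and the bin's share of all points is what the color gradient would encode. The six height and weight pairs and the bin sizes are made-up illustration values, and the helper names are hypothetical.

```python
import math
from collections import Counter


def square_bin(points, bin_width, bin_height):
    """Count how many (x, y) points fall into each rectangular bin."""
    if bin_width <= 0 or bin_height <= 0:
        raise ValueError("bin sizes must be positive")
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / bin_width), math.floor(y / bin_height))
        counts[cell] += 1
    return counts


def bin_shares(counts):
    """Convert bin counts into the proportion of all points in each bin."""
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


# Made-up (height in cm, weight in kg) pairs, for illustration only.
people = [(162, 58), (168, 64), (171, 70), (174, 72), (176, 75), (183, 90)]

cells = square_bin(people, bin_width=10, bin_height=10)
shares = bin_shares(cells)
for cell, share in sorted(shares.items()):
    print(f"bin {cell}: {share:.0%} of the data")
```

As with a histogram, the picture depends on the chosen bin size, so it is worth trying a few bin widths before settling on one.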
27. What is a hex plot?: In the previous section, we explored heatmaps. Traditionally, standard heatmaps use squares to plot bivariate densities. Squares are easy to compute and most people understand what a square is and how a square relates to another square. However, on scatterplots, squares have a minor problem. Squares only connect in four directions: north, east, south, and west. But scatterplots often show positive and negative relationships that require more diagonal articulation of the data, and squares are very bad at diagonal visualization. So an alternative way to present heatmaps is with hex plots. These are exactly the same as heatmaps but replace squares with hexagons. Hexagons have the advantage that they have six edges and not only four. Six edges allow for a more visual connection into the diagonal, and hexagons also resemble circles more closely. This often results in a smoother and more fluid visual representation that is easy on the eye and easier to understand. Let me show you what I mean using the previously used data of weight and height of people. Here's a basic heatmap. This square heatmap suggests a positive relationship between height and weight. Now let's replace the squares with hexes and see what happens. At first glance, it doesn't look like a lot happened, but if you look closer, it's enough to make a small difference. The visualization of the hex plot is smoother, and pockets of data are more circular and not as square or rectangular as before. The upwards relationship in this data remains visualized, but somehow seems clearer. While this may be subjective, overall I feel this is a softer but better graph than the previous heatmap. Hex plots are just heatmaps with hexagons. Hexagons are more complex than squares, but it is unlikely that the focus of the picture will be on each individual hexagon. Overall, I think hex plots provide a better visual narrative than square heatmaps.
The visual depiction is often smoother and more calming than some of the sharp micro-discontinuities in a square plot. So I think you should consider using hex plots instead of heatmaps when considering bivariate densities.
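For readers who want to see how points can be assigned to hexagons, here is a minimal Python sketch. It assumes a grid of "pointy-top" hexagons indexed by so-called axial coordinates, which is one common convention and not necessarily what any particular plotting package does. The function names hex_bin and hex_center and the example points are hypothetical.

```python
import math
from collections import Counter

def hex_bin(x, y, size):
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y).

    `size` is the distance from a hexagon's center to any of its corners.
    """
    # Fractional axial coordinates of the point.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    # Round to the nearest hexagon, then repair the coordinate with the
    # largest rounding error so that q + r + s stays equal to zero.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr

def hex_center(q, r, size):
    """Return the (x, y) center of the hexagon with axial index (q, r)."""
    return size * math.sqrt(3) * (q + r / 2), size * 1.5 * r

# Hypothetical, already rescaled points (not the height and weight data from the video).
points = [(0.1, 0.2), (0.3, -0.1), (1.8, 0.1), (0.9, 1.4), (1.0, 1.6)]
counts = Counter(hex_bin(x, y, size=1.0) for x, y in points)

# Every hexagon has six neighbours, all at the same center-to-center distance.
neighbours = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]
distances = [math.hypot(*hex_center(q, r, 1.0)) for q, r in neighbours]
print(counts)
print(distances)
```

The six equal neighbour distances are the numerical version of the point made above: a square bin has four edge neighbours and four more distant diagonal neighbours, while a hexagon treats all six directions alike. In practice the x and y variables usually need rescaling to comparable units before binning.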
28. What is a sunflower plot?: Now let's have a look at another type of heat plot, the sunflower plot. A sunflower plot is a type of heat plot that uses color and marker symbols to visualize a bivariate distribution. This type of visualization is said to represent a sunflower, hence the name sunflower plot. Specifically, a sunflower plot draws a density distribution plot where light and dark hexagons are combined with petals inside the hexagons that together indicate the amount of data in any given region of the scatterplot. Sunflower plots are very unique and not often used in applied data visualization, but their uniqueness is the point. Sometimes unique graphs can make a significant impact because they force the viewer of the graph to focus and consider what is being displayed. So let's have a look at a sunflower plot. Here it is. I'm using the previously used data that scatters body weight against body height. This is a big dataset, and in a scatterplot the data points all sit on top of each other, making it hard to figure out what's going on. The sunflower plot visualizes this relationship by using colors and petals. First, each sunflower is actually a hexagon. The hexagon displays information on all the data points that lie in that particular area. Areas with no hexagon or sunflower contain single data points, each depicted by a blue circle. Next are the light shaded hexagons. We can get an idea of how many observations are underneath by looking at the number of petals. The legend up here indicates how many observations one petal corresponds to. One petal is worth one observation in the light sunflowers and eighteen observations in the dark sunflowers. So a light sunflower hexagon with three petals will have three observations in that hexagonal space. Likewise, a dark sunflower with three petals represents three times 18, so 54 observations in that particular area.
And we can see that the number of black lines, or petals, in our data increases rapidly as we head towards the center. So a lot of our data is distributed in that particular region. This is the basic concept of a sunflower plot. The advantage of sunflower plots over hex plots is that they combine colors with markers to visualize how much data there is in any given region of the graph. This can be useful if you want to avoid the excessive color gradients that are used in traditional heat plots. It's also possible to avoid hexagon bins altogether and simply keep the marker symbols. That will look something like this. The advantage of this particular visualization is that it's very parsimonious. There is still a little bit of color differentiation, but less than before, and it's clear where the bulk of the data is scattered. Finally, don't forget that, like all heat plots, the actual visualization will depend on the size of the hexagons, often simply denoted binwidth. If the bins are large, then they will encompass more data and will be visually larger. Here's a quick video of me demonstrating how the plot changes depending on the binwidth size. As you can see, the sunflower plot appears to bloom and expand as the bin size increases. Note that the petal legend changes constantly. As different bins have different amounts of data within them, each petal represents more or less data. As usual with these types of plots, I recommend sticking to the default bin sizes provided by whatever computer program is generating these plots, but do consider varying them a little bit. Small bin sizes will lead to too much detail and large bin sizes will block it out. And that concludes this introduction to sunflower plots. They're identical to heatmap plots, but use less color and more marker symbology to represent data patterns. Choosing one will always be a matter of taste.
But if you want to avoid over-colorizing a data visualization, then this might be a viable alternative plot that displays the same information as a hex plot.
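The petal arithmetic from the example above (one observation per petal in light sunflowers, eighteen per petal in dark ones) can be sketched in a few lines of Python. The function name sunflower_symbol and the thresholds light_min and dark_min are hypothetical choices for illustration. Real software picks its own thresholds and petal weights, and the petal weight changes with the bin width.

```python
def sunflower_symbol(n_obs, petal_weight=18, light_min=3, dark_min=13):
    """Decide how one hexagonal bin would be drawn.

    Returns (kind, count): individual points, or a light or dark
    sunflower with its number of petals. Thresholds are assumptions.
    """
    if n_obs < light_min:
        return ("points", n_obs)  # too few observations: plot them as single markers
    if n_obs < dark_min:
        return ("light", n_obs)   # light sunflower: one petal per observation
    # Dark sunflower: one petal stands for petal_weight observations.
    return ("dark", max(1, round(n_obs / petal_weight)))

for n in (1, 3, 54):
    print(n, sunflower_symbol(n))
```

With a petal weight of 18, a bin holding 54 observations becomes a dark sunflower with three petals, which matches the reading of the legend described in the video.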
29. What is a line of best fit?: Scatterplots are often combined with lines of best fit to create a powerful visual story. So let's have a look at lines of best fit and see how they can be used to enhance the relationship story that you're trying to tell. In general, lines of best fit attempt to summarize two-dimensional data relationships. They can do this via parametric or non-parametric means, and we'll discuss what these two are in just a moment. Lines of best fit are often associated with two-dimensional data structures such as scatterplots, where a y variable is correlated with an x variable. In general, the y and x variables are continuous and not discrete, although, depending on how many discrete cases you have, some lines of best fit can also work with discrete cases. If you have more than two variables, lines of best fit turn into planes or hyperplanes of best fit, and we commonly call that multiple regression analysis. Finally, as highlighted already, in many cases line of best fit visualizations are a great complement to other visualization techniques, not just scatterplots. Lines of best fit can be used as standalone plots, but more often than not they're combined with other plots. So let's have a look at a line of best fit visualization. Unlike connected plots, lines of best fit do not connect individual data points. Lines of best fit are lines that best express a relationship between a set of data points. Here's a basic example of a line of best fit. This picture actually consists of two different visualizations: a scatterplot and a line of best fit overlaid on top of that scatterplot. The line of best fit summarizes the data relationship in such a way that the total distance between all the individual data points and the line is minimized. Any other straight line through this data will result in more distance between the data points and the line.
A key advantage of lines of best fit is that they clearly indicate to a user which way the data is sloping and what the general relationship between y and x is. In this case, the relationship between car prices and the mileage of a car is negative. The slope has a coefficient of minus 238, which means that for every one unit increase in mileage, car prices appear to decrease by $238. One thing to note is that lines of best fit often don't make much sense on their own. Here's an example of a line of best fit without an underlying scatterplot. Technically, there's nothing wrong with this picture. The picture highlights the same negative relationship between price and mileage as before. Yet it feels empty and more difficult to understand because the previous scatterplot is missing. A user looking at this plot would not get a good idea of the underlying data. They would only get an idea of the general relationship in the underlying data. And that can be good or bad. It can be bad in cases such as before, where the actual data points added an important visual element to the story and contextualized the line of best fit in a really great way. But it can be good when there are so many data points that you can't see much anyway. For example, here is a plot with a million data points. It's unlikely here that the scatterplot of a million data points adds anything to the overall relationship that we're trying to visualize. So this is a case where a line of best fit is best on its own. And here we are. That looks a lot better than before. Before looking further at lines of best fit, we need to distinguish between parametric and non-parametric. Parametric lines of best fit are lines that apply some kind of parameter to the data. It can be one or it could be many parameters. But this raises the question: what is a parameter? A parameter is a numerical quantity. When fitting lines, parameters will often be in the form of equations such as y equals 1 times x.
In this case, the parameter is one. An advantage of this method is that it works well with multidimensional data, in other words, with lots of variables. Another advantage is that it's transposable. I can tell you that y equals 1 times x, and you can now imagine what this relationship means and apply it to another dataset if you wish. A disadvantage is that it requires strong assumptions about the shape of the data. Non-parametric lines of best fit let the data speak for itself and apply less restrictive parameters. The advantage is that you need to make fewer assumptions about the relationships in the data. The disadvantage is that this method is not transposable. It's very hard to tell others that the line goes up and down, left and right. And that can often not be replicated in other datasets. Also, non-parametric methods work poorly in multidimensional data environments. Let's have a look at parametric versus non-parametric methods. Here's the previous scatterplot with some data. The rough relationship looks negative, but let's see what our two methods say about that. The non-parametric method will use a bandwidth to compute a local average value in a small subset of the data. That bandwidth is then moved across the data and each of the average estimates is stitched together. So here's an example. The non-parametric fit between y and x starts off negatively but ends up being almost horizontal. Different bandwidth sizes will result in different lines of best fit. Next, let's fit a parametric relationship. We need to choose how many parameters we want to fit. Most lines of best fit use one, two, or three parameters. A one-parameter line of best fit will fit a linear line of best fit. A two-parameter line of best fit will fit a quadratic line of best fit, and so forth. Let's go for the one-parameter line of best fit. A parametric fit will plot a linear line through the data that looks like this, showing a negative relationship between both variables.
The slope of this line is that one parameter. Now let me show you an example of a line of best fit with two parameters. This results in a quadratic curve that is defined by two parameters: y equals something times x plus something times x squared. This line of best fit suggests that the relationship is at first negative, but then slopes up again. In other words, the relationship is non-linear. And this can be a bit of a disadvantage of parametric fits. The line of best fit can look completely different depending on how many parameters are applied and what kind of minimization function is used. This is less the case with non-parametric lines of best fit, although their shape will still depend on the bandwidth parameters that are chosen. And that is the difference between parametric and non-parametric lines of best fit. Let's have a look at the advantages and disadvantages of lines of best fit. The advantages are that lines of best fit excel at highlighting how many data points are related to each other. Nothing can come close to beating their ability to quickly visualize how two continuous variables are related to each other. You can also often plot many different lines of best fit on a single graph without worrying too much that they clutter the visual representation, although, as always, there is a limit. Also, advanced lines of best fit can include information such as confidence intervals that allow users to get an idea of whether relationships are statistically significant. And lines of best fit are often a powerful complement to scatterplots. They can significantly increase the potency and visual power of a standard scatterplot. But they have some disadvantages. By themselves, lines of best fit may not mean much. They don't show how the data is distributed, and readers are left without proper data context. Also, there are many different types of lines of best fit, resulting in many different ways to draw relationships.
The most common ones are non-parametric lines of best fit with a default bandwidth, or linear or quadratic parametric lines of best fit. However, these can still result in significantly different relationships, and you need to make the choice about which one of these you want to display. Showing all is likely to confuse your reader. Now let's look at a few more examples of lines of best fit. First, it's important to understand that axis scaling can significantly affect how lines of best fit are displayed on a graph, especially when the underlying scatterplot is not shown. Here is an example of a line of best fit with the y-axis scaling done so that it results in a visualization that has an almost minus 45 degree relationship. And here is an example of exactly the same data and the same line of best fit with a re-scaled y-axis that results in a very different relationship that looks almost flat. However, the parameter is identical. All that has happened is that the y-axis scaling is different. Of course, all data visualizations are subject to such manipulations, but lines of best fit suffer from this a little bit more. So do take care with that. Next, what happens when you have discrete variables? Here's an example with two discrete variables. The y-axis variable measures car repair records using five categories, and the x-axis variable is a binary variable measuring whether a car is foreign, using a 0 or 1 dummy variable. It looks like there are only seven data points, but actually there are 74 data points in this dataset. Many of the data points are sitting on top of each other. Surprisingly, lines of best fit sometimes still work with this kind of setup. Here's an example of a linear line of best fit. It's not ideal, but the plot clearly suggests that foreign cars have higher repair records than non-foreign cars.
In this particular case, the interesting information isn't actually the gradient, but the level difference between the start and end points of the line of best fit. However, lines of best fit only work if there is ordering in both variables, discrete or continuous. If repair record were a nominal or unordered variable, then this plot would probably not make much sense. Also, non-parametric plots don't tend to work at all with discrete data. The general rule of thumb, therefore, is the more discrete values you have in your variables, the better. Another important consideration is out-of-bounds prediction. Parametric lines of best fit are able to predict outside the bounds of the data. In other words, they can predict values of x and y where there is no actual data. That can be useful, but it's also dangerous, since the further one gets away from the actual data, the less likely it is that the original relationship still holds. Here's an example using a previous chart, but with the line of best fit extended well beyond the actual data range. The danger here is that the lay reader will assume that the linear relationship exists along the entire spectrum of y and x. But if you look at this graph a little bit more closely, you'll see that the line of best fit doesn't make sense in particular regions of the graph. For example, there are no cars with negative mileage, so everything in this bit of the chart is nonsensical. Also, car prices don't tend to drop below 0. It's quite unlikely that anybody will give you money to buy a car. So the line of best fit in the negative range of the y values also doesn't make sense. So this is a danger of lines of best fit and an easily misused feature, and you should take extra care with any out-of-bounds prediction. A useful advanced property of lines of best fit is confidence intervals. Lines of best fit are ultimately just estimation procedures.
Most software packages are able to produce confidence intervals around lines of best fit. This gives the reader an idea of how statistically significant certain slopes are. Here's an example with a linear line of best fit using the previous scatterplot. In this example, an area around the line of best fit is shaded that gives us the 95 percent confidence interval. Any overlapping parts are not statistically significant. For example, the predicted value of price is around 9,000 when mileage is at 15, and around 3,000 when mileage is at 40. The lower confidence interval of the 15 mileage estimate does not overlap with the upper confidence interval of the 40 mileage estimate. This means the linear trend is statistically significantly downward sloping. This also works really well with non-parametric plots. Here's an example. Here, confidence intervals are really useful in identifying where the relationship between y and x matters more and where it matters less. Wider intervals are less certain, while tighter intervals are more certain. Advanced users often use confidence intervals to add more information and let the reader make up their own mind. And finally, multiple lines of best fit can also be used on one graph. Here's an example. Depending on how close the lines of best fit are together and to what extent they overlap, you may want to use a monochrome scheme, but in this particular instance color differentiation works. A problem with multiple lines of best fit is that, without additional explanation, readers may be confused about what they're looking at and which line they should focus on. Which relationship is the one they should be looking at? It can be good to give readers all the options and let them decide. But often it will make sense to simply choose one line of best fit and present that. So that's a quick introduction to the line of best fit. Lines of best fit absolutely excel at highlighting scatterplot relationships.
In most cases, if you're not using a line of best fit with a scatterplot, or a scatterplot with a line of best fit, you're probably missing a trick. Combined, they create a powerful visual narrative that can be hard to beat by any other visualization method. However, take care: lines of best fit are a statistical technique, and there are many different ways to summarize data relationships. A lot of complex lines of best fit have a lot of maths and statistical assumptions hidden within them, and you should have some idea of what line of best fit most suits your needs. Also be careful about using too many lines of best fit, and about out-of-bounds prediction. Lines of best fit can be easily misused. But use them if you're looking at continuous and sometimes discrete relationships.
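To make the parametric versus non-parametric distinction a little more concrete, here is a minimal Python sketch. It assumes the linear fit is computed by ordinary least squares (which minimizes the sum of squared vertical distances) and uses a simple moving-window average as a stand-in for the non-parametric smoother. Neither is necessarily what the software in the video does. The function names and the mileage and price numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Parametric fit by ordinary least squares. Returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def local_average(xs, ys, bandwidth):
    """Non-parametric fit: at each x, average the y values of all points
    whose x lies within +/- bandwidth, then stitch the averages together."""
    fitted = []
    for x0 in sorted(xs):
        window = [y for x, y in zip(xs, ys) if abs(x - x0) <= bandwidth]
        fitted.append((x0, sum(window) / len(window)))
    return fitted

# Hypothetical mileage and price values, NOT the car dataset from the video.
mileage = [12, 14, 17, 18, 21, 22, 25, 28, 30, 35, 41]
price = [11500, 13400, 6200, 5900, 4700, 4400, 4100, 4500, 3900, 3800, 4300]

intercept, slope = fit_line(mileage, price)
print(f"parametric fit: price = {intercept:.0f} + ({slope:.0f}) * mileage")

# Out-of-bounds prediction: the formula happily returns a value where no data exists.
print("predicted price at mileage 80:", round(intercept + slope * 80))

# A different bandwidth gives a different non-parametric line.
for bandwidth in (3, 10):
    print(bandwidth, [(x, round(y)) for x, y in local_average(mileage, price, bandwidth)])
```

The parametric result can be passed on as two numbers, which is the "transposable" property described above, while the non-parametric result is only a list of stitched local averages that depends on the chosen bandwidth.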
30. What is a line plot?: Now let's have a look at a connected scatterplot, more commonly known as a line plot, sometimes called a curve. Line plots are identical in composition to scatterplots, with the key difference that markers are connected to each other via a line. Note the word connection, and this is quite important here. Line plots are not lines of best fit that try to summarize the scatterplot data into some kind of trend line. Line plots simply connect each marker to the next marker. The general idea is that such lines give some idea of a trend or relationship between two data variables. But the lines are simply a visual guide to which data point comes next. Essentially the lines act as a visual guide for the eye to jump from one data point to the next data point. This is very different to lines of best fit that try to summarize all the markers into one single line, whatever the shape of that line might be. A key component of line plots is the ordering of one of the variables. Without some kind of ordering, drawing lines between marker points won't make much sense. And that is a big difference between a line plot and a scatterplot. Line plots tend to have a natural ordering to them. That's why line plots are often used when one of the variables is time, such as day of the week or hours of the day. Time has a natural ordering to it, and line plots and their derivatives are often used in financial data visualization. However, line plots can be used in any circumstance where there's at least one variable that can be ordered. Finally, note that ordered variables can be continuous or discrete. Time is often treated as a discrete variable, but can also be treated as a continuous variable. The important thing is the order, not how the data is measured exactly. That said, a line plot with only two time points won't look very attractive, so a minimum amount of data is generally required. But this goes for all graphs.
Let's have a look at a basic line plot. Here is a graph of US life expectancy over the last 100 years. This dataset consists of 100 observations. There is one life expectancy observation per year, and each marker on the graph is connected directly via a line. We can clearly identify that life expectancy has made significant improvements in the last century. But we can also observe a significant dip around the time of the Spanish flu epidemic in 1918-19. An important aspect of this visualization is the connected line between each marker, and that gives a much better visual representation of the overall trend. In this case, the markers themselves are actually hidden. We simply observe the line. If we were to convert this graph into a normal scatterplot, we would get the following visualization. At first glance, this visualization seems pretty similar to the previous line plot. But note that it becomes significantly more difficult for the eye to identify a trend. In the early part of the 20th century, life expectancy varies considerably year by year, and it becomes difficult to figure out which marker comes next. It can also be easy to miss the significant outlying marker in the year 1918. So here's a visualization of markers with connected lines. This combines the advantages of both a scatterplot and a pure line plot by clearly highlighting actual data points and highlighting which marker comes next via the use of a connecting line. Depending on what you want to achieve, pure line plots or connected scatterplots are advisable when faced with an ordered variable. Finally, note that the ordered variable is often placed on the x-axis. This is because many people naturally read from left to right or right to left, and therefore they have an easier time with horizontal ordering. It would be quite odd to make the y-axis ordered. If this were done, people would have to follow the line in the vertical, which is much more alien, but it can be done.
Now let's have a look at some advantages and disadvantages of line plots. Advantages are that line plots absolutely excel at giving a quick analysis of the data without having to resort to statistical aggregations via lines of best fit. Line plots are great for visualizing clear trends. They can also be used for predicting outside of trends, but not as well as lines of best fit. Line plots clearly visualize outliers and sudden changes in the trend. Because each data point is connected to the next data point, it's very easy to identify sudden changes in the data. Line plots can also give you a good idea of missing data. Their trending and ordered nature allows you to better guess what the missing values might be when they're not there. But line plots also have some disadvantages. They require data ordering. At least one variable needs to be ordered. Otherwise, you'll end up with a spaghetti plot visualization, and that won't help anyone very much. Also, when using line plots without markers, it can be hard to identify the exact data values. Another disadvantage is that plotting too many lines that are close to each other can make the graph look cluttered and confusing to read. Also, line plots are often interpreted as causal by observers, which can be a big fallacy. Most people will associate a change of one unit in x with causing a one unit change in y. But of course, line plots, like scatterplots, present a correlation and not a causal relationship. And like many other plots, line plots are easy to manipulate through axis scaling. That gives them a particular visual look that can accentuate or diminish a particular argument, even though the data obviously remains the same. Finally, line plots are not very useful when there's a lot of data variation from marker to marker, because this often results in a very messy plot with many ups and downs, and it just looks very spiky.
Now let's have a look at a few more line plots and learn a little bit more detail about them. Line plots simply connect data markers, but how they do this is subject to user definition. Most default packages will draw a straight line from one marker to the next marker, and that's called the direct method. But another method might be the staircase method, where markers are connected by horizontal and vertical lines only. This results in a different visualization that can significantly alter our inference of the data, especially if there are only a few data points. Here's an example using a subset of the previous life expectancy data, and I'm showing direct connections and staircase connections. Stepwise connections result in a slightly different visual presentation. You should be careful about how the markers are connected depending on how much ordered data you have and what you think the connection process from data point to data point might be. Next, let's have a look at what happens when your data is not sorted. Here's an example of a small car dataset that plots mileage against weight. We would expect cars that are heavier to have a lower mileage. But without sorting the data, a line plot results in something like this. A spaghetti plot is not very useful for anyone and highlights the importance of sorting data. Without data sorted in a natural ordering, line plots simply don't work. Here's the same plot, but with the data sorted. And we can now see the negative relationship between mileage and weight. The line plot connects all the individual marker points across this data space in an ordered fashion. However, this also reveals a disadvantage of line plots. When there's a lot of marker-to-marker variation in the data, line plots can result in quite a messy plot. And this is the case here, where we see many ups and downs in our line plot.
I think a line of best fit plot would probably be a better option here. And finally, here's an example of a line plot with multiple categories. In this case, I'm plotting life expectancy over the 20th century for males and females separately. It's easy to see that for both groups, life expectancy increased over the last 100 years, but the difference between men and women appears to have increased somewhat. Also, in times of crisis, such as during the Spanish flu pandemic, both males and females experienced similar life expectancy. Like many other graphs, when you introduce a lot of data series across different groups, colors start to interfere with each other. Here's an example with four different groups. While it's still possible to make out what's going on in this graph, these kinds of graphs are often better visualized by different line connections. So here's an example of a monochrome version of the same line plot. But not all the lines are continuous. Some of them are dashed and dotted to highlight the different line plots. I personally think this is a significantly easier graph to follow than the previous multicolored line plot. Line plots are very powerful when you have ordered data. Often used with time as a key component, they visualize trends in an easy fashion and are used very frequently in data analysis. Their key strength comes from the visualization of the ordered data markers via line connections between one marker and the next. This creates a powerful visual narrative that tends to make sense to most people reading the graph. Line plots tend to be very user-friendly and require very little explanation. But be careful: line plots are often interpreted as cause-and-effect type plots, which of course they are not. Moreover, how the data points are connected matters. Nonetheless, despite some disadvantages, line graphs are one of the most used visualization techniques out there.
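The two practical points from this section, sorting before connecting and direct versus staircase connections, can be sketched in Python as follows. The function names are hypothetical, the year and life expectancy pairs are invented, and the staircase variant shown here (hold the previous value, then step) is only one of several possible conventions.

```python
def direct_path(points):
    """Sort by the ordered (x) variable; consecutive points are joined by straight lines."""
    return sorted(points)

def staircase_path(points):
    """Sort by x, hold each y value until the next x, then step vertically."""
    ordered = sorted(points)
    path = [ordered[0]]
    for (_, y_prev), (x_next, y_next) in zip(ordered, ordered[1:]):
        path.append((x_next, y_prev))  # horizontal move at the old value
        path.append((x_next, y_next))  # vertical move to the new value
    return path

def horizontal_travel(path):
    """Total left-right distance covered when following the path in its given order."""
    return sum(abs(x2 - x1) for (x1, _), (x2, _) in zip(path, path[1:]))

# Invented (year, life expectancy) pairs in a deliberately unsorted order.
unsorted_points = [(1919, 55), (1916, 52), (1918, 39), (1917, 51), (1920, 54)]

print("direct:   ", direct_path(unsorted_points))
print("staircase:", staircase_path(unsorted_points))

# Connecting unsorted data makes the line double back on itself (a spaghetti plot).
print("travel unsorted:", horizontal_travel(unsorted_points))
print("travel sorted:  ", horizontal_travel(direct_path(unsorted_points)))
```

A sorted path crosses the x range exactly once, while the unsorted path travels back and forth, which is what produces the spaghetti effect described above.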
31. What is an area plot?: In the previous session, we introduced the concept of a line plot. Line plots connect markers in scatterplots to create a more visual trend throughout the data. An important requirement for line plots is that there is an ordering to the data. However, there are other variations of line plots that create slightly different visual effects. Each of the plots we'll look at now will still produce a general trend across the data, but the visualization that they use will differ slightly. We're going to have a look at different area plots, and each has advantages and disadvantages. Let's start by repeating the previous line plot that plotted life expectancy across time. The original visualization looks something like this. This graph shows life expectancy in the United States and that it increases over time, with a significant dip during the 1918 Spanish flu pandemic. Now let's examine four new ways to visualize this, still using the concept of a line plot. Here are four different visualizations. They are all area plots. The first one is called an area plot. The second one is called a bar plot. The next one is called a spike plot, and the final one is called a dropline plot. Each plot emphasizes slightly different visual aspects of the data relationship. However, all plots have in common that they highlight the area underneath the data points. This gives a better and more intuitive understanding of how far individual data points are away from some sort of reference line, often 0, but in this particular case it's 40. Area plots shade the entire area underneath the graph and provide a strong visual narrative. Bar plots do something similar, but remove some of the spikiness that results from the direct line connection that area plots use. The horizontal bar focus makes the connection points vertical and horizontal and can produce a smoother visualization than area plots, depending on how much the data varies.
If you have a lot of variation, then a bar plot might be preferable to an area plot. Spike plots are similar to bar plots, but reduce the size of the bar width. This leaves more whitespace in between the data points, thereby reducing the visual impact of the area visualization and allowing for better comparison between adjacent data points. Spike plots are a neat alternative if you have a lot of data points and you want to continue to allow readers to examine some data points very closely. Finally, dropline plots are similar to spike plots, with the difference that they also put an emphasis on data markers, which the other plots suppress. This gives dropline plots the advantage of showing actual data points whilst also providing an overview of how much area is underneath the graph. Importantly, these four visual variations of line plots become more useful when you have a lot of variation in your data, and this is often the case with finance data. Here's an example. Here is a line plot that visualizes stock market changes over time. Changes can be positive or negative, and the line plot reveals significant day-to-day variation in how the stock market evolves. As you can see, this line plot has very sharp edges and a lot of variation, and it does not create a pleasurable visualization experience for readers. So here's the same data using the four different area plots from before. Each of these plots does a significantly better job at highlighting how the stock market evolved over time. My particular favorites are the bar or spike plots in this case. However, different line trends might make me choose different plot types. So it always makes sense to plot all four of them and choose whichever visualization works best for you. Line plots do not need to consist only of connecting lines. Area visualizations are also possible, and multiple alternatives exist. 
And these can be especially useful for data that has lots of variation, especially around some kind of base level. They excel at highlighting rapid changes over an ordered x variable. So keep them in mind, especially when plotting data across the time dimension.
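What all four area-style plots share is that each data point is shown by its distance from a reference line. Here is a minimal Python sketch of that idea, using made-up numbers and a crude text rendering of a spike plot. The function names `spike_heights` and `text_spikes` are illustrative assumptions and do not come from the class.

```python
def spike_heights(values, baseline=0.0):
    """Signed distance of each data point from the reference line."""
    return [v - baseline for v in values]


def text_spikes(values, baseline=0.0, scale=1.0):
    """Draw one text spike per data point: '+' above the baseline, '-' below."""
    spikes = []
    for height in spike_heights(values, baseline):
        length = round(abs(height) * scale)
        spikes.append(("+" if height >= 0 else "-") * length)
    return spikes


# Made-up daily stock market changes, for illustration only.
changes = [1.5, -2.0, 0.5, 3.0, -1.0]
for change, spike in zip(changes, text_spikes(changes, scale=2)):
    print(f"{change:+5.1f} {spike}")
```

Changing `baseline` from 0 to another value, such as the 40 used for the life expectancy graph, changes every spike length without touching the data.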
32. What is a range plot?: In the previous session, we introduced the concept of line plots and area line plots. In this session we will extend the concept of area line plots to range plots. Range plots are identical to area plots, with the only difference being that the reference in question is not a fixed point on the y-axis anymore, but another reference line plot that can vary constantly. Range plots excel at highlighting differences between two different line plots and how any such differences evolve across the range of the x-axis. They're often used in finance, where they denote opening and closing values of the stock market, for example. Larger areas denote bigger differences between the two data series, while smaller areas denote smaller differences. The key advantage is that the area shading between the two lines makes it significantly easier and faster for any reader to identify differences in the plot. And as before, there are several versions available. So let's start by examining the previous line plot of life expectancy. But this time, let's plot this for men and women. Here is the plot. We can identify that both male and female life expectancy increased over time. But there are also differences that appear to either increase or decrease over certain years. Now let's examine four different ways to visualize this using range plots. Here they are. The four different visualizations are an area plot, a bar plot, a spike plot, and a capped spike plot. Each plot emphasizes slightly different visual aspects of the data relationship. However, all plots have in common that they highlight the area between the two data series. This gives readers a far better and more intuitive understanding of how far the individual data points are away from each other. Years that have large differences between male and female life expectancy have larger shaded areas, and years where the difference is smaller have smaller shaded areas. 
As before, the advantage of a range plot is the shaded area between the two data series, which provides a stronger visual narrative than a non-shaded visualization. In terms of actual differences here, area plots simply connect each data marker by a straight line, and this can result in a somewhat jagged and spiky visualization. These are often better suited if you have a lot of data points on the x-axis and not so much variation from data point to data point. Bar plots can remove some of the spikiness. The horizontal bar focus makes the connections between points vertical and horizontal and can produce a smoother visualization than area plots. If you have a lot of variation in your data and/or not a lot of horizontal data points, then bar plots are often preferable. Spike plots are similar to bar plots, but reduce the size of the bar width. This leaves more whitespace in between the data, thereby reducing the visual impact of the area visualization and allowing people to better focus on the comparison between adjacent data points. Finally, capped spike plots are similar to spike plots, with the difference that they also put an emphasis on data markers. In this case, the data markers are represented by horizontal bars which cap the spikes, but different visualizations are possible. Of course, range plots are often used with financial data, where they are used to visualize opening and closing prices. Here's an example. Here are four range plots that visualize stock market opening and closing prices. Each of these plots does a great job at highlighting how the stock market evolved in terms of opening and closing prices over a specific time period. As before, I personally favor the bar or spike plots in this visualization. But of course, everybody is free to choose their own favorite visualization. The important thing here is the area component. Range plots are identical to area plots, with the difference that they visualize the area between two data series rather than the area relative to a fixed horizontal line. 
They're great for visualizing the differences between two different data series. So keep them in mind if you want to really bring out the difference in values between two different data series.
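The shaded area in a range plot is simply the gap between the two series at each x position. A minimal Python sketch of that calculation follows. The life expectancy numbers are made up for illustration, and the function name `range_gaps` is a hypothetical helper, not something from the class.

```python
def range_gaps(upper, lower):
    """Height of the shaded area at each x position: upper minus lower."""
    if len(upper) != len(lower):
        raise ValueError("both series need one value per x position")
    return [u - l for u, l in zip(upper, lower)]


# Made-up life expectancy values, for illustration only.
years = [1900, 1950, 2000]
women = [50.0, 71.0, 79.0]
men = [48.0, 65.0, 74.0]

for year, gap in zip(years, range_gaps(women, men)):
    # A longer run of '#' stands in for a larger shaded area.
    print(year, gap, "#" * round(gap))
```

Whether the gap is then drawn as a shaded area, a bar, a spike, or a capped spike is only a choice of rendering. The underlying quantity is the same in all four versions.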
33. What is a dropline plot?: Now let's have a look at dropline plots. In the previous sessions, we looked at area and range plots in some detail. One type of area plot is known as a dropline plot. A dropline plot draws a marker on a scatter plot and then drops a line from the marker to the horizontal axis, most often where y equals 0. When there are many markers on a scatter plot, dropline plots often serve as another type of area plot, and we've already explored that in the previous session. But dropline plots are also useful when there isn't a lot of data. In that case, the visual narrative moves away from area plotting to narrowly focusing on particular markers. So let me show you what I mean. Here's a dropline plot with lots of data. The x variable could be treated as continuous. In this case, this dropline plot visualizes stock market price changes over time. Each day has a marker associated with it, and this marker has a dropline attached to it that connects the marker to the horizontal axis, where y equals 0. In this instance, the dropline plot simply acts as an area plot, where the lines offer a quick visual guide to what extent the stock market changes were either large or small and negative or positive. So far, so good. Now let's have a look at another visualization of a dropline plot. But in this case, I'm going to plot the life expectancy data against gross national product for various North American countries. Here is the plot. Notice how there are only a few data points in the scatter plot; there simply aren't very many countries in the continent of North America. Moreover, some GNP values are very close to each other and some are very far away from each other. A dropline plot here works especially well because the individual data points are also labeled. A reader can immediately identify which data point refers to which country. 
In this plot, it's easy to quickly skim across the country list, choose a country of interest, and then follow the dropline down to identify what value of log GNP that country has for a given life expectancy. So dropline plots can also be used in circumstances where area shading may not be of primary interest. Instead, they excel at highlighting particular discrete data points and then allowing viewers to quickly identify and home in on those data points. You might wonder, why not have two droplines, one to the horizontal axis and one to the vertical axis? Well, this is probably not advisable, as the plot will very quickly fill up with lines, and even only a few data markers will quickly lead to a full plot that will be hard to interpret. A better choice might be to flip the axes around if you wish to change the narrative of the story that you're trying to tell. Here's an example with the axes of the previous graph flipped around and the marker text angled. The data remains the same as before, but the interpretation may now have changed. Instead of more GNP leading to higher life expectancy, now more life expectancy may lead to more GNP. And this is often the consequence of changing the axes. People interpret the x-axis as the variable that causes y. Of course we know that's not true, but that is another matter. Here, we're only interested in how to best visualize the data. So dropline plots have more uses than simply visualizing area shading. When you only have a few observations or particular discrete points that you want to emphasize, consider using dropline plots to augment your scatter plot. Dropline plots can significantly speed up the process of a reader identifying interesting data points and reading the graph. However, often this will only work with limited data, and it's also suggested that you label your markers.
35. What is a rainbow plot?: Let's have a look at rainbow plots. Rainbow plot is a generic name for all plot types that emphasize color gradation across a range of ordered data. Rainbow colors are often used, as these encompass the full spectrum of visible colors that the human eye can distinguish. However, rainbow plots do not necessarily need to be in rainbow colors. They can be in any colors, even very boring colors, such as grayscale colors, and I will demonstrate this in a moment. The key thing is that a rainbow plot is a plot of all the data, with the added feature being a color palette based on an ordering of the data. This is very similar to a heat plot, where color is used to indicate data density. However, in a rainbow plot, color is used to represent data ordering rather than data density. It can be used in many circumstances with many different plot types, but the stereotypical example is a line plot that has many data series. So let me show you what I mean. Here's a basic example of a rainbow plot. This graph is a series of line plots that show life expectancy in the United States over the 20th century. There are nine different groups in this dataset, and all nine groups are plotted on this graph. That makes the graph relatively full of information, and some kind of mechanism must be used to highlight the different groups. A key feature of these line plots is that the trends for all the groups don't really interact. All of them appear to be relatively parallel to each other, and that makes this perfect for a rainbow plot. Each different data series is given a slightly different color from the rainbow colors, with the important aspect being that the colors gradate from violet to red, as in a rainbow. Red colors represent data series with high average values, and blue and violet colors represent data series with low average values. However, the coloring is unrelated to any statistical value. It simply denotes the ordering of the data. 
And that makes this plot visually much easier to interpret, as the eye is naturally led across the groups via the color gradation. However, this isn't actually a perfect example of a rainbow plot. Rainbow plots are often most useful when there are many more groups: 20, 50, or even 100 groups. In that case, the legend falls away entirely, and the coloring is there purely to make the visual acquisition and interpretation of the data easier. So here's an example of a line plot with circa 50 different groups. This plot is a line plot of the weight gain of individual animals over time. Each line represents one particular animal and how it grows over time. Color is used in this plot to make it easier to identify the starting position. Each starting position is colored as a function of the rainbow. Red colors denote a lower starting weight. Violet colors denote a higher starting weight. And we can see that by week 8 or 9, the rainbow pattern has been disrupted. Some animals with a low starting weight now have a high final weight. We can determine this mainly because of the coloring in the final few weeks; it would be impossible to trace all the individual lines through this number of data series. At the same time, the final rainbow coloring is still relatively intact. So this does suggest that, on average, animals with a low starting weight also maintain a low final weight. All in all, coloring is used to good effect here. It highlights the evolution of a large number of line plots over time. There's no need to use rainbow colors. We can use any colors as long as a gradient can be derived between them. So here's an example of a two-color gradient where I use the colors yellow and green. This, I think, also works very well, and it continues to highlight how the different data series evolve over time. Another derivation of a rainbow plot is a grayscale plot with one time series highlighted. Here's an example of what I mean. 
Rainbow plots are often used to highlight mass data changes over time. However, this grayscale plot emphasizes one single data series, in this particular case one particular animal, and then relates it to all the other data series. The light gray data series give a total context to the data. They highlight all of the animals that gain weight over time. But the dark gray series highlights one particular animal and how its weight evolves over time in comparison to everyone else. So this is also a form of rainbow plot, although no rainbow colors are used here. Finally, rainbow plots are not only useful for line plots; they can be used with many different plot types. Here's an example of a bar plot. The key aspect of this bar plot with rainbow colors is that the colors help the eye travel across the visualization. The eye is automatically drawn to the starting red color and then shifts across the color spectrum to the right, towards the blue colors. And in this particular case, it does it twice. This is an example where the groups are not ordered by default, but the rainbow coloring forces a pseudo-ordering onto the data, in this case a simple left-to-right ordering. This could be useful if you want to force readers to look at all the bars together and not only focus on one particular category, for example. Here's another example with a series of box plots. Again, the rainbow coloring enforces a pseudo-grouping of the data and nudges the reader towards evaluating the data on this graph top to bottom. So these two examples are very much cases where gradation coloring is used to make plots smoother for the eye and create a bespoke type of ordering. Rainbow plots are all about color gradation. Colors can be used to enhance plots and highlight data series better, or they can be used to create artificial grouping effects. In either case, the end result is that the eye is drawn across the graph based on color gradation, often rainbow colors, but they can be any colors. 
These plots are especially useful if you have a large amount of overlapping data and traditional transparency methods stop working. However, these kinds of plots are advanced plot types and should be considered carefully before being used. For example, they increase the complexity for lay viewers, and accessibility may be reduced for certain individuals, for example people with color blindness. However, they absolutely excel at creating a visual ordering in busy plots.
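The core of a rainbow plot is a rule that turns the ordering of the data series into a color. Here is a minimal Python sketch of one such rule, assuming the series are ordered by their mean and that colors run from violet (lowest) to red (highest). The function names, the group names, and the mean values are made up for illustration, and the HSV hue range used is only one possible choice of gradient.

```python
import colorsys


def rainbow_colors(n):
    """Return n hex colors grading from violet (first) to red (last)."""
    colors = []
    for i in range(n):
        frac = i / (n - 1) if n > 1 else 0.0
        hue = 0.75 * (1.0 - frac)  # hue 0.75 is violet, hue 0.0 is red
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors


def color_by_order(series_means):
    """Map each series name to a color based on the rank of its mean."""
    ordered = sorted(series_means, key=series_means.get)
    return dict(zip(ordered, rainbow_colors(len(ordered))))


# Made-up average values for three groups, for illustration only.
palette = color_by_order({"group_a": 70.0, "group_b": 50.0, "group_c": 60.0})
for name, color in palette.items():
    print(name, color)
```

The color depends only on the rank of each series, not on the size of its value, which matches the point that the coloring denotes ordering and nothing else. Swapping the hue calculation for an interpolation between any two colors would give a two-color gradient instead.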
36. What is a jitter plot?: Scatter plots do not tend to work well when both the x and y variables are discrete with only a few data categories. If you've ever created a scatter plot of two binary-coded variables, then you'll know what I mean. All you get is four dots, one in each corner of the scatter plot, and that is not very useful at all. However, there are ways to make scatter plots work with this kind of data, and the secret is jittering. Jitter plots, also called scatter plots with jitter, are plots that randomly displace marker values away from their actual values. That creates more space for markers to be visualized and can reveal data densities. And because jittering is random, the overall graph will continue to look similar to the actual raw data graph from before, even though the marker values will not accurately represent the data points. So let me show you how this works. Here is a scatter plot of a small car dataset. This dataset has 74 observations, and I'm scattering the values of the categorical variable car repair record, which has five categories, against a binary variable called car origin, which has two values, domestic and foreign. As you can see, there are data points on the scatter plot, but it's not very informative. Most of the data points sit on top of each other. Now let's go ahead and randomly displace the markers using a jitter option. And there we go: the individual markers are now moved away from their true values to a random value around their true value. This allows us to see how many markers sit in each category and gives us a better visual representation of how the data is distributed across these two discrete variables. So, very neat. The advantages of jitter plots are as follows. They help visualize discrete variables and how they are related to each other, and the random nature of jittering means that the overall plot still remains valid. The disadvantages are that the markers now do not reflect the true data anymore. 
And people might get confused if they don't understand the concept of jitter. Also, jittering has limits, which will depend on how much data sits within each category. Here are some more examples of jittered scatter plots. Jittering can be useful for any variables that are discrete, not only for variables that have very few discrete categories. Here's an example of a scatter plot with one categorical variable, repair record, which has few categories, and another discrete variable, MPG, that has many discrete categories. I've also split the data across another binary variable, domestic and foreign. In this case, I'm showing you an example of multiple jitter. So here's the same plot with jittering. In this plot, the individual data points slightly better reveal how many values there actually are. But the plot also shifts and jitters the foreign and domestic markers slightly off the horizontal axis position of each repair category to better identify the marker points of each of the foreign and domestic categories. So jittering can happen across multiple dimensions. However, be careful: there's a limit to how much you can achieve with jitter plots. Here's an example of two binary variables with jittering that contain a lot of data. As you can see, the jitter plot here is not particularly useful, since there are so many data points that we still can't identify exactly how this data is distributed. Of course, we could increase the size of our jittering, but that might result in a plot that looks like this, and I question whether this plot has any value other than to confuse readers. Finally, here's an example of a jitter plot with many categories. This plot scatters occupation categories against industry categories from my labor market dataset on employment. Whilst we can identify that some industry-occupation combinations have nobody in them, we cannot see which occupations are prevalent in the industries that actually have occupations. So here is the same plot, jittered. 
Well, we can now see that operatives are prevalent in manufacturing industries, and managers and professional occupations in the professional services industry. Scatter plots generally do not work well when data is discrete. The fewer categories there are, the worse it gets. However, by jittering the data, some sense can be made of such scatter plots. Jitter plots can be really useful in presenting categorical data distributions. They're often excellent at visualizing data that has fewer observations. However, if the observation count is large, they will struggle. Also, lay readers may not fully understand that the presented data markers are not the true data markers anymore, and of course that must be made clear when using jitter plots.
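Jittering itself is a very small operation: each marker is moved by a random amount around its true value before plotting. Below is a minimal Python sketch, assuming uniform random noise and made-up data. The function name `jitter`, its `amount` and `seed` parameters, and the values are illustrative assumptions; plotting software may use a different noise distribution.

```python
import random


def jitter(values, amount=0.2, seed=None):
    """Shift each value by uniform random noise in [-amount, +amount]."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Made-up observations of two discrete variables, for illustration only.
repair_record = [3, 3, 3, 4, 4, 5]
origin = [0, 0, 1, 0, 1, 1]  # 0 = domestic, 1 = foreign

x = jitter(repair_record, seed=1)
y = jitter(origin, seed=2)
for true_x, true_y, jx, jy in zip(repair_record, origin, x, y):
    print(f"true=({true_x}, {true_y}) plotted=({jx:.2f}, {jy:.2f})")
```

The `amount` parameter is the trade-off discussed above: too small and markers still sit on top of each other, too large and neighbouring categories blur into one another. Fixing the `seed` makes the displaced positions reproducible from one run to the next.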
37. What is a table plot?: Now let's have a look at a table plot. A table plot is just what you imagine it might be: it is simply a data table converted into a plot form using a specifically formatted bar chart. The advantage of a table plot over a jitter plot is that it resembles a table more closely and offers a less cluttered and easier visualization of the data. So let's have a look at the basic problem. Here is a table with data. This data table contains two discrete variables. One variable measures health outcomes and the other looks at age groups. One would expect health to get worse with age, and if we look closely at the numbers in this table, we can see some evidence of that. However, there are quite a few categories in this table and a lot of numbers, and cell sizes vary between the tens and the hundreds, almost reaching 1,000. So any data comparison or analysis by readers of this table will take quite some time, since they need to visually acquire each number and then relate it to another number. It's great if you want a lot of detail, but not too good if you want to quickly summarize the information from this data. So this is where table plots can help. Here's an example of the previous table converted into a table plot. This is a basic table plot. It takes each cell of the previous table and converts it into a small bar chart. One bar in this plot is set at its maximum size, and all the other bars are relative to that bar. In this case, the maximum bar is this cell, and it contains 809 observations. Visually, this plot allows readers to quickly identify how the discrete data is distributed across these various categories. It appears that there is a bit of a gradient in this data, and it looks like older people are more likely to have poor health outcomes. This interpretation can be done in a split second thanks to the clear and easy visual presentation of a table plot, and would take significantly longer using a data table. 
The advantages of table plots are that they quickly summarize a table pictorially and offer readers a much faster interpretation of a complex data table. They probably wouldn't be very useful for binary values, but when you're faced with many discrete categories, they will make a real difference. The disadvantage of table plots is that they tend to contain less detail, which can make it harder for readers to compute precise statistics themselves. The trade-off is simple: you can either choose less statistical information and a faster visual interpretation of the data, or more statistical information that requires complex computation by the reader. Table plots can also be amended to provide more detail. For example, the individual cell frequencies can easily be added into the plot, like this. We can now see the individual cell sizes and perform any kind of custom complex computation that we would want to. However, this does have the disadvantage of making the plot more complex than before. In fact, this plot is more complex than the original table, since it now contains the visual elements from a table plot and the numerical elements from the data table. So this kind of plot is useful if you want complexity paired with a visual narrative. Just like any two-way table, it is also possible to present various alternative representations of the raw data. For example, we can highlight column percentages instead of frequencies. Here's an example of that. This plot further highlights the difference in health outcomes between older and younger people. Younger people are much more likely to have good health outcomes, whilst older people are more likely to have regular or bad health outcomes. Table plots are a great visualization technique for two-way tables. They can help ease the burden of having too many tables in a presentation or a report, and provide a simple but effective visual narrative for how two sets of discrete data are distributed. 
However, they do lose some statistical precision in their simplest form.
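The two calculations behind a table plot, scaling every cell against the largest cell and converting frequencies to column percentages, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the table is made up (only the maximum cell of 809 echoes the example above), and rows are taken to be health outcomes with columns as age groups.

```python
def relative_bar_lengths(table, max_length=20):
    """Scale every cell against the largest cell, which gets max_length."""
    biggest = max(max(row) for row in table)  # assumes at least one cell > 0
    return [[round(cell / biggest * max_length) for cell in row] for row in table]


def column_percentages(table):
    """Express each cell as a percentage of its column total."""
    totals = [sum(column) for column in zip(*table)]  # assumes no empty column
    return [
        [100.0 * cell / total for cell, total in zip(row, totals)]
        for row in table
    ]


# Made-up two-way frequency table, for illustration only.
# Rows: good, regular, bad health. Columns: young, middle, old age group.
table = [
    [809, 400, 100],
    [300, 350, 200],
    [50, 120, 250],
]

for row in relative_bar_lengths(table):
    print(" | ".join(("#" * length).ljust(20) for length in row))
```

Printing `column_percentages(table)` instead of the bar lengths corresponds to the column-percentage version of the plot, where each age group is compared on its own scale.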
38. What is a balloon plot?: Let's have a look at a variation of a table plot called a balloon plot. A table plot visualizes frequency or other statistical information of two discrete variables in matrix form. However, the usage of bars in each cell of the matrix may not be to everybody's liking. Bars tend to have a sharper look about them, and comparing their relative sizes can only be done along the vertical axis, and that may not be what's wanted at times. So an alternative visualization option is to replace the bars with circles, where the size or radius of the circle represents the relevant statistic. This is called a balloon plot. This has the advantage that size comparisons between different cells can be done in any compass direction, not just vertically. Larger circles have more data and smaller circles have less data. Overall, a balloon plot can sometimes be a smoother visualization than a table plot. Its disadvantage is that minute differences can sometimes be harder to spot. So let's have a look at an example. Here's an example of a balloon plot using the previously used health and age data. The individual categorical combinations each have a circle or balloon that represents the frequency size of that particular cell. Larger circles have more observations than smaller circles. This plot is very similar to a table plot, but the visualization is slightly different. Often it will be a matter of personal preference, rather than any serious advantage or disadvantage between these two plots, that makes you choose a particular one. I personally prefer table plots, but I can see a balloon plot working in certain circumstances, especially when a softer visual approach is warranted. And of course, just like a table plot, a balloon plot can be amended to include the actual data values. 
This makes the plot more complicated and defeats the simplicity of it, but it does allow you to combine the visual narrative of a table or balloon plot whilst keeping the statistical power of a simple two-way table. So here's an example of that. This plot now includes data frequencies in each circle. A disadvantage of a balloon plot with data values is that some circles are so small that they make reading the actual number very hard. A simple solution, such as moving the marker values off center, won't always work, because some values may overlap with the perimeter of the circle. Table plots are less prone to such problems, since the marker values can always be placed below the bar. Balloon plots are very similar to table plots, but offer a slightly softer visual approach. Keep them in mind if you want to display data from two categorical variables and how they're distributed against each other.
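How a frequency is turned into a circle size is a design decision. The Python sketch below shows one common option, which is an assumption here and not necessarily what the graphs in this session use: the circle's area is made proportional to the frequency, so the radius grows with the square root of the cell count. The function name `balloon_radii` and the numbers are made up for illustration.

```python
import math


def balloon_radii(table, max_radius=10.0):
    """Radius per cell so that circle AREA is proportional to frequency."""
    biggest = max(max(row) for row in table)  # assumes at least one cell > 0
    return [
        [max_radius * math.sqrt(cell / biggest) for cell in row]
        for row in table
    ]


# Made-up two-way frequency table, for illustration only.
table = [
    [400, 100],
    [25, 0],
]

for row in balloon_radii(table):
    print([round(radius, 2) for radius in row])
```

With this scaling, a cell with a quarter of the observations gets half the radius. Scaling the radius directly by the frequency would instead make large cells look disproportionately big, since the eye tends to judge circles by area.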
39. What is a stacked bar chart?: Now let's take a look at bar charts again, but this time stacked bar charts. Stacked bar charts are a useful visualization option when you have two discrete variables. Each bar in a standard bar chart is divided into a number of sub-bars which are stacked end to end, each one corresponding to a level of the second categorical variable. The advantage of stacked bar charts is that they show the relative decomposition of each primary bar based on the levels of a second categorical variable. The total length of each stacked bar is the same as before, but we can now see how the secondary groups contribute to that total. So let's have a look at an example. In this example, I'm using the previously used health dataset that contains information on the health outcomes and the age of individuals. Each variable contains a number of distinct categories. Remember, a regular table of this data involves a lot of numbers. A stacked bar chart offers an alternative way to visualize this data. So here is a stacked bar chart. The categories on the x-axis represent different age groups, and on the y-axis we can read off the individual frequencies. Young people have higher frequencies than old people: there are more young people than old people. But in each bar we can observe how many of those individuals in that bar have good or bad health outcomes. And it's easy to see that older people appear to have a higher frequency of poor or bad health outcomes. That is the power of a stacked bar chart: it reveals data patterns across two discrete variables. Of course, you can also flip a stacked bar chart into the vertical, like so. This kind of visualization is especially useful if you have a lot of text in the categorical labels. It's also possible to change the categories. By design, one variable is on the x- or y-axis, whilst the other variable is colored into the graph. 
And of course these can be flipped around for a different visualization that has a different emphasis. So here's an example of me switching the two categorical variables. The categories on the x-axis now refer to health outcomes, and the age categories are stacked into each bar. Notice, however, that the small frequencies of some of the groups and the number of groups in this chart make it much less readable than the previous one. It's impossible to see what's happening in the very bad group, and even the bad health group is difficult to interpret. So do consider switching between variables when using a stacked bar chart to see which visualization might suit your narrative best. Finally, it's common to replace frequency statistics with percentage statistics in a stacked bar chart. This leads to a different visualization where each stacked bar reaches 100 percent, and it becomes much easier to compare relative differences across groups. Here's an example. We can now see that the proportion of people with bad health outcomes is significantly higher in the 75-plus group compared to some of the younger age groups. Stacked percentage bar charts are fantastic for allowing readers to get a very fast understanding of relative differences between groups across two categorical variables. Stacked bar charts are a very intuitive visualization of discrete data. They're great for presenting data distributions across two categorical variables with a few or a medium number of categories. They're easy to understand and often used in data science. However, it can be difficult to compare groups that are small or that sit in the middle of the stack. Judging sub-bar length can be a challenge. And of course, they can also be open to abuse. However, the general goal of a stacked bar chart is to make easy relative judgments about the second categorical variable. Precise judgments are generally not the main goal.
If precise judgments are the main purpose, then you'll probably need to consider another visualization technique.
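The difference between a frequency stacked bar and a percentage stacked bar can be sketched in a few lines. This is only an illustration in Python, which is not the software used in the videos; the counts are made-up numbers rather than the health dataset, and the function names `stack_segments` and `to_percent` are hypothetical.

```python
# Hypothetical counts (NOT the course's health data):
# age group -> health outcome -> number of people
counts = {
    "16-34": {"good": 80, "fair": 15, "bad": 5},
    "35-74": {"good": 50, "fair": 20, "bad": 10},
    "75+": {"good": 8, "fair": 6, "bad": 6},
}


def stack_segments(row):
    """Return (category, bottom, height) for each sub-bar of one stacked bar."""
    segments = []
    bottom = 0
    for category, value in row.items():
        segments.append((category, bottom, value))
        bottom += value  # the next sub-bar starts where this one ends
    return segments


def to_percent(row):
    """Rescale one bar so that its sub-bars add up to 100 percent."""
    total = sum(row.values())
    return {category: 100 * value / total for category, value in row.items()}


# Frequency version: bar heights differ because group sizes differ.
frequency_bars = {group: stack_segments(row) for group, row in counts.items()}

# Percentage version: every bar has the same total height of 100.
percent_bars = {group: stack_segments(to_percent(row)) for group, row in counts.items()}

for group in counts:
    print(group, frequency_bars[group], percent_bars[group])
```

In the frequency version the 75+ bar is short and its "bad" segment looks small, while in the percentage version the same segment takes up a larger share of the bar than in the younger groups, which is the relative comparison described above.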
40. What is a mosaic plot?: Now let's take a quick look at mosaic plots. A mosaic plot, sometimes called a Marimekko plot or a spine plot, is a graphical method for visualizing data from two categorical variables. Mosaic plots are almost identical to stacked bar graphs, with the addition that a second x-axis is introduced that highlights the relative size of each category of the variable on the x-axis. The idea of a mosaic plot is to provide two sets of statistics to the reader in a visual fashion. It provides the same information as a stacked bar graph, in other words, the percentage of observations of one set of discrete choices within another set of discrete choices. But it also provides the overall percentage of how much data is in each discrete choice for one of the discrete variables. That might all sound a little bit confusing, so it's best to take a look. Here is a mosaic plot of the health data we've used in the previous session. Remember, the overall pattern of the health data shows that younger people tend to have better health outcomes than older people. That is what we can see here. The proportion of people in each age group with very good health outcomes decreases as we move through the age categories, and the proportion with poor health increases. So far, so stacked bar graph. However, now let's take a look at the top. There is a second x-axis, and this x-axis shows the percentage of observations in each of the age group categories. So we can now identify that the number of people in the last age group is actually quite small compared to the first age group. Only around 5% of observations are in that final age group category. So essentially, each stack has a different width, and that width represents how much data is actually in that stack. And that is a mosaic plot. Mosaic plots are an advanced way to visualize two discrete variables and how data are distributed across multiple categories. In essence, they combine elements of two stacked bar graphs.
One, a frequency stacked bar graph, and two, a percentage stacked bar graph. They have an advantage over a single stacked bar graph in that they visualize the size of each stack and allow users to determine whether a stack is important or not. Single stacked bar graphs can't do that. However, this does introduce additional complexity, so careful labeling is required with mosaic plots, and even then lay readers may still struggle to fully understand what's happening in the plot. Finally, categories with very small cell sizes may result in extremely thin stacks that don't present well, so manual adjustments may be needed to render a mosaic plot useful.
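The geometry of a mosaic plot can be sketched as follows: the width of each stack is the group's share of all observations, and the height of each tile is the category's share within that group. This is a rough Python illustration, not the software used in the videos; the counts are invented and the function name `mosaic_tiles` is hypothetical.

```python
# Hypothetical counts (NOT the course's health data):
# age group -> health outcome -> number of people
counts = {
    "16-34": {"good": 80, "fair": 15, "bad": 5},
    "35-74": {"good": 50, "fair": 20, "bad": 10},
    "75+": {"good": 8, "fair": 6, "bad": 6},
}


def mosaic_tiles(table):
    """Return one rectangle per cell of a two-way table, in unit-square coordinates."""
    grand_total = sum(sum(row.values()) for row in table.values())
    tiles = []
    left = 0.0
    for group, row in table.items():
        group_total = sum(row.values())
        width = group_total / grand_total  # share of all data in this stack
        bottom = 0.0
        for category, value in row.items():
            height = value / group_total  # share of this category within the stack
            tiles.append({
                "group": group,
                "category": category,
                "left": left,
                "bottom": bottom,
                "width": width,
                "height": height,
            })
            bottom += height
        left += width
    return tiles


tiles = mosaic_tiles(counts)
for tile in tiles:
    print(tile)
```

Because width times height equals the cell's share of the whole dataset, the area of every tile is proportional to its frequency. That is why a small group, such as the final age group here, shows up as a thin stack.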
41. What is a contour plot?: Now let's have a look at contour plots. Contour plots are a type of 3D plot that attempts to visualize the relationship between three different variables. They do this by visualizing data in two dimensions and then using contours to represent a third dimension. Contours are constant slices of the third, z-axis variable, plotted across the two dimensions of y and x. Contour plots allow you to answer the question of how a variable z changes as a function of y and x. Contour plots are often used in topographic mapping, where the contours are used to represent height. For example, you will often see that mountains are represented by ever smaller concentric rings. However, contour plots also have a use in data analytics, where they can be used to great effect to highlight parametric relationships between variables. They are less useful in raw data exploration, as non-parametric analysis in higher dimensions often leads to something called the curse of dimensionality, and this often results in plots that take forever to create and look terrible. Also, contour plots are generally not very useful if the y or x variable is categorical; generally, both y and x need to be continuous. There is a bit more flexibility in the z variable, but even here at least five or more categories are needed. So let's have a look at a contour plot. Here is a classic example of a contour plot. This contour plot visualizes the subsea elevation of sandstone in an area in the United States. So it's basically a map, and that's how most people will have some familiarity with contour plots: via maps. In this case, the y- and x-axes represent a set of coordinates, and together they form a square that represents an area of land. The z variable measures depth, in feet in this case. The legend identifies which color values and contours represent which areas of depth. You can imagine that this legend represents a third dimension, the z-axis.
And we can see in this graph that there are some areas of low depth shaded in yellow and some areas of high depth shaded in purple. Overall, we get a good understanding of the geography of this particular area. However, contour plots are subject to various choices, including how many contour lines should be computed and how the data should be interpolated. For example, in this diagram, only five contours were chosen, and this means a lot of individual data is aggregated together and averaged. So just like density plots need bandwidth choices and histograms need bin width choices, contour plots need contour choices. You need to specify how many contours you want to use. More contours imply more detail, but with too many, the plot may not make much sense. Here is an example of the same plot, but with 20 contours. This plot offers a lot more detail in comparison to the last plot. We can detect multiple valleys emerging in the landscape, and we can also detect more curvature and detail in the contour lines. Of course, the amount of detail that you need will depend on what aspect of the data you're interested in. Are you interested in large-scale relationships or in minute micro-regions of the data? Only you can answer that question. You need to play around with the number of contour lines to determine which suits your needs. Let's have a quick look at some advantages and disadvantages of contour plots. Contour plots offer an alternative visualization to 3D plots. They use color and contour lines to visualize a third dimension, the z-axis. Another advantage is that contour plots will often be innately familiar to people, as most people will have some understanding of map reading. Contour lines also excel at visualizing complex three-way relationships that are smooth and have clear gradients. However, contour plots also have some disadvantages. Contour plots require the y and x variables to be continuous. The z variable should have many categories or also be continuous.
Any other data setup won't really work. Contour plots need a lot of data. Any area of the graph that has no data needs to be guessed, and this is called interpolation. Generally, contour plots need datasets with thousands of observations, not hundreds. Color choices, contour levels and interpolation choices matter. There are complex computational aspects behind contour plots, and they can significantly change how a plot looks, just as density plots and histograms look different depending on bandwidth and bin width choices. Lastly, contour plots are usually not good for exploring random variables that are not clearly related. Contour plots are great for clear relationships that have easy patterns. However, complex variables that have a lot of variation will lead to very messy plots that are likely of little use. So let's have a look at some more examples of contour plots. I previously mentioned that the number of contours matters significantly. However, it's not just the number of contours that matters, but also the spacing and how each contour is colored. This is an example of me specifying 20 custom contours on the previously used sandstone data. This looks like a poor plot. The contours are similar to before, but the coloring suggests almost uniformity in the landscape. That is because the z range is too large. The yellow and green colors simply don't come into the picture, and everything is essentially colored blue and purple. So this is an example of a poor color and gradient choice. A better example of coloring is something like this. This plot uses three colors instead of the previously used two and has 100 contours. The three-color range extends the available color spectrum and helps visualize high, medium, and low regions in the data more accurately. In addition, the 100 contours create a much more diffuse plot that smooths out a lot of the previous granularity. We now get a much smoother and more detailed understanding of how depth varies across the geography.
Note that this only works because the data detail is available and computational performance is not a factor. These kinds of plots can take a long time to compute and create. So far, we've looked at geography as an example. Now let's look at something that's a bit more applied to data science. I personally find contour plots invaluable in helping me explain and understand complex regression models. Here is an example using a previously used labor force dataset. Imagine that I'm interested in the relationship between wages, job tenure, and hours worked. I could build a parametric model to examine this relationship, and I would do this using a regression model that has an interaction effect. The results might look something like this. These are the results of a regression model. This regression model says that for each increase in hours worked, wages go up by 0.07, and for each year of tenure increase, wages go up by 0.19. However, as hours and tenure increase together, wages go down. We can see this by the negative sign on the interaction term. Actually, this term is statistically insignificant, but we're going to ignore that for now. We'll come back to that later. One problem that I often have, and other people have, is that the interpretation of the interaction coefficient is very complicated. I just can't quite get my head around what this negative coefficient means. So one easy way is to simply predict the wage results from this model across the range of hours and tenure. These predictions are then stored in another dataset, from which they can be loaded and fed into a contour plot. The end result might look something like this. This is a contour plot that visualizes the previous regression model. Let me explain it. As job tenure increases in years, wages go up, and we can see that by the coloring of the contour plot. As hours worked increases, wages also go up.
So we can clearly see from this contour plot that as both the y and x variables increase, wages increase, because we head into the darker, higher areas of wages. Great. At the same time, we also see that there's a slight curve to the contours, and this curve actually represents the negative effect of the interaction term. As both tenure and hours increase together, the expected wage doesn't actually increase linearly. It increases slightly less, and we can see that because the contours bulge out. In other words, higher hours and higher tenure together lead to a less-than-linear increase in wages. That is exactly what the interaction term signifies in our regression. But nonetheless, the interaction effect is very small, and those with high tenure and high hours still earn the most according to this model. Perfect. We've now understood what the regression model estimated through the use of a visual contour plot. However, I did say that the interaction term was statistically non-significant, and actually, this model is not a very good model. The underlying data relationship is more complex than this. You might wonder, why not simply apply the contour plot straight to the data? Why bother with a parametric technique such as regression analysis? Why not just let the data speak for itself? Well, here's the result if we do that. This looks like a very poor and complicated contour plot. The data is all over the place and there's no distinct pattern in this contour plot. Moreover, it took many minutes to compute this graph and it wasted a significant amount of my analysis time. No general trend can be made out from this plot, and the end result is that everyone is left confused. This exemplifies why contour plots are not suitable everywhere. They're best suited when there's a clear pattern among the x, y, and z variables, which was the case in the regression example. For non-parametric data exploration, you will often end up with messy plots that look like this.
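The step of predicting wages across a grid of hours and tenure can be sketched like this. It is only an illustration in Python, not the software used in the videos. The coefficients 0.07 and 0.19 come from the results discussed above; the intercept and the size of the negative interaction coefficient are not given in this class, so the values below are assumptions chosen purely to make the sketch run.

```python
# Coefficients quoted in the class
B_HOURS = 0.07   # wage change per additional hour worked
B_TENURE = 0.19  # wage change per additional year of tenure

# ASSUMED values: the class only says the interaction is negative and small,
# and does not report an intercept.
B_INTERCEPT = 1.0
B_INTERACTION = -0.002


def predict_wage(hours, tenure):
    """Predicted wage from a linear model with an hours x tenure interaction."""
    return (B_INTERCEPT
            + B_HOURS * hours
            + B_TENURE * tenure
            + B_INTERACTION * hours * tenure)


# Grid of x (hours) and y (tenure) values over which to predict z (wage).
hours_grid = list(range(0, 81, 10))   # 0, 10, ..., 80 hours
tenure_grid = list(range(0, 26, 5))   # 0, 5, ..., 25 years

# surface[i][j] is the predicted wage at tenure_grid[i], hours_grid[j].
# A table like this is what a contour routine would take as its z values.
surface = [[predict_wage(h, t) for h in hours_grid] for t in tenure_grid]

for tenure, row in zip(tenure_grid, surface):
    print(tenure, [round(wage, 2) for wage in row])
```

With these assumed numbers, wages still rise in both directions across the grid, but the gain from raising hours and tenure together is smaller than the sum of the two separate gains. That shortfall is what shows up as the curvature of the contours.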
Contour plots are powerful plot types that can help visualize complex data patterns. They are pseudo-3D plots that use colors and contours to visualize a third variable. They are best used when there is a smooth relationship between the x, y, and z variables. This is often the case in geography, for example, where mountains and valleys are very smooth topographic features. They tend to be less useful when you have little data, categorical data, or complex data relationships. Each of these may lead to poor visual results. However, they can also be used for more than basic data exploration. They can be used to visualize results from complex regression analysis, for example. Overall, contour plots are a good plot type to know about, although their use will remain specialized.
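The effect of choosing few or many contours can be sketched as follows. This is a simplified Python illustration, not the software used in the videos: it assumes evenly spaced contour levels between the smallest and largest z value, the depth values are invented, and the function names `contour_levels` and `band_index` are hypothetical. Real contour routines may choose their levels differently.

```python
from bisect import bisect_right


def contour_levels(z_values, n_contours):
    """Return n_contours evenly spaced levels strictly between min(z) and max(z)."""
    z_min, z_max = min(z_values), max(z_values)
    step = (z_max - z_min) / (n_contours + 1)
    return [z_min + step * i for i in range(1, n_contours + 1)]


def band_index(z, levels):
    """Index of the color band that z falls into (0 is the lowest band)."""
    return bisect_right(levels, z)


# Hypothetical depth measurements in feet (NOT the sandstone data)
depths = [0, 12, 31, 39, 47, 60]

coarse = contour_levels(depths, 5)   # few contours: wide bands
fine = contour_levels(depths, 20)    # many contours: narrow bands

for z in depths:
    print(z, band_index(z, coarse), band_index(z, fine))
```

With five contours, the depths 31 and 39 land in the same band and would be drawn in the same color, so the difference between them is averaged away. With 20 contours they fall into different bands, which is the extra detail that more contours buy.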