Transcripts
1. Course Introduction: Hello and welcome to the Beginner's Data Analysis Bootcamp with SQL course. I have been consulting for Fortune 500 companies for the past 18 years and have successfully implemented multiple data projects. I was fortunate enough to go to one of the best schools in my field, and I have taught multiple IT courses to professionals. I'm excited to launch this new comprehensive course on data analysis. You don't need to have any programming experience to enroll in this class; you can be from any walk of life and will highly benefit from this course. I will explain each topic in detail in my video lectures and have extensive live exercises with multiple quizzes to make sure you get a complete learning experience. If you're not satisfied with this class, you have a 30-day money-back guarantee. We will learn the four key pillars of data analysis that will become your foundation in understanding how to work with data effectively. First, databases: having a basic understanding of how databases work is very important. We will be using MySQL as the preferred database system for this class; MySQL is one of the leading databases for Fortune 500 companies. Second, data modeling: understanding how data is related to each other, and the different types of relationships between data, becomes critical as we dig deeper into driving value from our business data. Third, data insight: we will learn the SQL language in detail. We have lab exercises to access the data in our database; we will start from the basic operations and then get to advanced analytics to address multiple use cases. Fourth, data visualization: we will learn the skills to tell a compelling story with our data by creating multiple visualizations using Tableau, which will be our preferred data visualization tool. My goal is to make sure you learn these data skills in a way that can help you apply them in your professional life: get the salary raise that you were looking for, make your resume stand out, and, if need be, make a shift to the data analysis field. I'm really excited about teaching this course to you, and I hope you join me in this learning journey.
2. Introduction to Databases: What are databases? Rather than getting into the technical details, let's try to understand how we come across databases in our day-to-day lives. If you're searching the web, anything you write, or if you're searching for a particular keyword, any result set that you get back is coming from a database. If you go to an ATM machine and key in your passcode when you swipe your ATM debit card, that information gets checked against a database. Anytime you log into a web page with your username and password, the username and password get checked against a database. Anytime you pull up a bank statement, that bank statement is created from the data that is stored in a database. So that's the way we interact with databases. What you see on the slide here is that I have an Excel worksheet. A lot of folks are still saving data in Excel worksheets: you keep adding to your dataset, you keep adding rows, so it's rows and columns, and that's great. For a lot of smaller companies, maybe Excel might be a solution, and I see people just saving data in text files. But companies are producing data 24/7, and with the age of the internet, data has grown astronomically. So what is the problem that we see here? What happens in your Excel file when the data grows from a hundred lines to 2 million records? How would Excel behave? Would it crash? Or do you have to split the files, and then how do you link the files back together? So definitely there's going to be a problem there. Ensuring the security and reliability of your data in this setup will also become a problem. Database software systems were designed to solve these problems, and I've just listed a few of the key issues that database systems solve. Let's go through the first one: volume. Data is increasing; it's compounding astronomically, growing into terabytes, petabytes, zettabytes, and we need sophisticated software to manage that much volume of data. Security is another key issue that companies are facing when their data gets hacked, so we need database software systems to have those security and encryption layers added so we can protect our data. We have web applications that have millions of customers accessing their data, so we need sophisticated database systems that can handle these transactions simultaneously. Reliability becomes so important for companies where data is their business; for a stock exchange or a flight scheduling system, we cannot have data that is not correct, and data has to be available. Data is increasing, security is needed, millions of transactions are happening at the same time, and data has to be reliable. Database software, or database management systems, are needed to solve these problems. You just need to understand that and get that in your head before we get into looking at the features of database management software. In the previous lecture, we looked at databases, what they are, and also went through some of the rationale as to why database management systems were created. In this slide, I wanted to give you the architecture view of how database management systems are used. This diagram is going to help you understand the different components that are used when you access data through a database management system. First, the user application layer: think of this as where you have a client installed on your laptop. In our case, we installed MySQL Workbench. There are other client tools that can be used to access the database management system as well; I've shown Toad as one of those softwares, and other reporting softwares can be used as well. Your client application needs to have an ODBC connection initiated; ODBC stands for Open Database Connectivity. This driver needs to be in place for your client application to connect to the database management system, or your database server. Once that connection is established, you can easily write a SQL script and access your data. I have listed a few of the database management systems for you: IBM DB2, Microsoft SQL Server, MySQL, and Oracle Database, and there are a lot of other vendors providing this software as well. There are different functionalities that are embedded in the database management system: we have functionality that handles transaction management, and we have areas for handling security, data manipulation, defining data, and loading data. The database server also has an engine that accepts the SQL language to be processed; when we write a SQL script as simple as select star from a table, that SQL statement gets processed at the DBMS. The DBMS then needs to interact with the operating system to get to the data storage area, or the data storage layer. The last layer is the data storage area: your physical data files reside on the disk. I'm showing customer, order, and product data being stored on the physical disk, but you would need a database management system to access this data, and you would need a client like MySQL Workbench, used via a SQL interface, to get that data and have it presented to you. These individual components will help you understand how the database architecture is set up. It's very important to understand what the user application does, what the DBMS does, and where the physical data files are stored. I think once you can make that distinction, it will really help you understand the database management system. Hi everyone. One of the key points I want you to take from this slide is that MySQL is being used by a lot of industry leaders. If we look at the top left of the slide, these are just some of the companies that are using MySQL in production; from Facebook to Walmart to Uber, these are all companies that are either using MySQL for their warehouse or for their web applications. If you look at the bottom left of the slide, I've given a timeline of the inception of the software. It was built in 1995 by a Swedish company, and over the course of the years it has been acquired by different companies; Oracle acquired it via Sun. It has hundreds of millions of installations throughout the world, and it's built on some of the key features for a database, which are performance, reliability, ease of use, and cost. If you look at the bar chart I've included, it tells you that compared to some of the other database solutions it is very cost effective, so the total cost of ownership (TCO) is really, really low compared to other software.
3. MYSQL Windows Install: Hi there. Great, so let's kick-start this installation. We're gonna be installing MySQL for Windows. It's a very easy install; just follow the URL that I've given. All right, so we're on the MySQL download page, following the URL that I've given. We will be downloading the MySQL Installer 8.0.12, the first link here, the one that I'm highlighting, and the operating system is going to be Microsoft Windows. Then click download. It'll ask you to log in or sign up; just say no thanks and move on. You're gonna get a Save As option, so just create a folder and save the file. The installer is going to start, which is the 1.4 version. It's gonna ask you to install an upgrade, so just say yes; we will just get the new version as well. All right, so the first thing you're gonna get is the license agreement; read through and accept that. There are different options for the setup type, so we're gonna go with Custom. We just need a server and Workbench, so we're going to select the server option and move it to the right as a selection, then we're going to go to Workbench and get that selected as well. Also, we're going to select the ODBC connector and the documentation. So we're just doing a limited software install rather than everything. I've fast-forwarded the video, so it should take you no more than 10 to 15 minutes. Hit next. We're going to keep the replication setting at the default and hit next. Networking type, just keep it default, next. Authentication method as well, just keep the default, next. And then just put in a password. Please make sure you remember this password because it is very important; just put it on a piece of paper or in a notepad somewhere. Next is the service name, in case you want to manually start the service. Hit execute; it's gonna run a bunch of configuration steps. Again, I have fast-forwarded the video. Just hit finish and launch MySQL Workbench. It gives you a default connection, but we're going to create our own MySQL connection: just hit the plus sign and we'll give it a name; we'll just say sql_class. Everything is already filled in, so just hit OK. Double-click on sql_class; we're gonna be launching Workbench now. So this is the graphical user interface for MySQL, and we will write all our scripts there; we will create tables, load data, and run different queries using this GUI interface.
4. MYSQL Macos Install: Hello, students. We're gonna install MySQL on macOS. I've given the URL you can use to download; go to the MySQL website. We're gonna be downloading the MySQL Community Server and also MySQL Workbench, so we need these two pieces of software for the installation. All right, so we're gonna first install the MySQL server. Just follow the link and click on the Community Server. We're gonna be downloading the DMG file for the MySQL Community Server 8.0.12. There are different types of files; we'll just do the DMG. Again, just click download, ignore the log in and sign up at this time (just say no thanks), and click on the link. I fast-forwarded the video, so now you have the downloaded disk image on your desktop, and next we're gonna do the MySQL Workbench install. Scroll down and you'll find one file, which is also the DMG file for Workbench 8.0.12, and click download. All right, so we have both files on our desktop. We'll start with the server file: click on that and it's gonna start the install. You would need a password for your database, so please make sure you remember the password; write it on a piece of paper or in a notepad, or save it somewhere. You'll need this every time we connect to the database, so just please, please, please remember it. Now we should see the installation status as successful. Now that we have actually installed the server, we can go to System Preferences and see MySQL in the list of software available, so we're good to go there. Now we'll click the Workbench DMG file and let it install. All right, so let's open the MySQL Workbench application. Great, so we have MySQL Workbench installed. Let's click on the icon for the connection that's already there; we'll have a default connection created at install time. We'll just type in the password that we keyed in earlier (hopefully you remember it), and it's going to try to connect to the database. All right, so we were successful in connecting to the MySQL server through MySQL Workbench. This is the graphical user interface we're gonna be using to write our SQL scripts, and during the course of the class we'll go step by step and you'll learn this interface for your programming. Thank you.
5. MYSQL Cloud Install: Hi everyone. We will also try to access MySQL via a cloud instance. I'm gonna be using Amazon Web Services, and, looking into the future, I think cloud implementations are gonna be prevalent across a lot of companies, so it's really good to have some experience using some sort of cloud service. We're gonna have the MySQL instance in the cloud, and we're gonna use our MySQL Workbench to connect to that cloud instance. You need to have an AWS account set up, and I have given a URL at the bottom with some steps on how to set up that account, so please follow that; hopefully the setup will be easy for you. Once you have your account set up, sign into the console and you'll get into the main portal of AWS. You'll see a list of services; Amazon has a bunch of services, from spinning up servers to different storage options to database options, and we're not going to go through all these details. What we're interested in is setting up a MySQL server. In the database category we have RDS, a subcategory where we have all the relational database systems that Amazon provides. So now please go ahead and click on the button that says get started now. These are the relational databases that Amazon is providing: you have SQL Server, Oracle, PostgreSQL, MariaDB, MySQL, and Amazon Aurora. We're gonna go with MySQL. We have two versions: one is the Community Edition and the other one is Amazon's Aurora, which is a different flavor of MySQL; we're gonna be using the Community Server edition. Please go ahead and use the Dev/Test instance because it uses the free tier so we don't get charged, and check the option to only show the options that are eligible for the free tier. We'll leave everything else here the same. We need to specify a DB instance identifier; in my case I'll just type one in. For the username I'll use student, and I have to give a password as the next step. For the database name we'll just say MySQL cloud. We'll leave all the rest as is; for backups, we don't need any for this one, so just set it to zero, and launch. This will take a little bit of time to spin up the server, so what I'll do is just pause the video. All right, so now our server is up and running and you can see the status. Now we need a connection string, or an endpoint, that we're gonna use in our MySQL Workbench to connect to this instance. If we go to the instance actions, you'll get some more details on the server; you'll also get the endpoint. Okay, so this is our MySQL Workbench, and now we're gonna create a new connection to connect to the MySQL server on the cloud. Let's go back to the details of the server: this endpoint is basically our connection string, so just copy it. In MySQL Workbench we'll create a new connection: click on the plus sign and a small window will open, so we'll create a connection name; I'll call it sql_class_cloud. For the hostname we'll put the connection string there. We already have the port field, so take the port out of the connection string. For the username, I had initially put in student. We'll just test the connection and enter the password that you set up. All right, so we were able to make a successful connection to the MySQL instance, which is in the cloud.
It'll create a small icon for the connection, so just double-click that. Great, so we're able to connect our MySQL Workbench to the MySQL server instance on AWS.
6. MYSQL GUI INTRO: Hi there. All right, so now we have our MySQL Workbench started. The first thing you will do is try to connect to a database. We have MySQL connections; it just gives you icons for the connections that are preset. We're gonna use the first one that we set up earlier. If you double-click on that, it's gonna just let me in because I've already entered my password before, but it might ask you for a password, so just key in the password that you set up. Now we are in the graphical user interface of MySQL, and this is the place where we're gonna be writing different scripts. We have the query pane in the middle, where we write our SQL statements. We have an object browser, where we're gonna have different schemas and tables listed for metadata information. On the right-hand side, we have SQL statements, or snippets, that we can use for our SQL. So now we have a common understanding of the interface. If you look at the object browser area, we have a sys schema that's created by default, and it has objects underneath it: we have tables, views, stored procedures, and functions. For our class we will be interested in creating tables, loading data, and querying those tables. If you look at that, we have one table called sys_config. Let's say you want to just access that table: we'll come into our query pane and just write a simple select statement ending with a semicolon. If you look at this lightning icon at the top, it executes the selected portion of the script, so just click that and you'll see a result grid, which gives you the results you need from the table sys_config. When you initially look at this table, sys might not be your default database, so just make sure you right-click on sys and set it as the default schema so you don't need to prefix the table with the schema name like this; leaving the prefix in will also work. Great. If you look at the right, this is what we were initially talking about, the snippets that MySQL gives you. These are different categories that they have split into, the DDLs, DMLs, and so on. Click on one and just check what the script would look like with all the options; it's a handy tool to have when you're trying to write new scripts, and as you can see it has insert, join, select. The output pane here is basically a high-level summary of when we ran the script, how much time the script took, the records that were returned, and a count of those records. If any error comes up in the script, it will be listed here as well. Great, I will see you in the next lecture.
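For reference, the two quick checks from this walkthrough look like this in the query pane (sys_config is a stock table that ships with MySQL's sys schema, so the exact rows you see will vary by installation):

-- Query the sys_config table with the schema prefix (works from any default schema):
SELECT * FROM sys.sys_config;

-- After right-clicking sys in the object browser and setting it as the default schema,
-- the prefix is no longer required:
SELECT * FROM sys_config;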
7. Create Database and Load Data: Great. So now we have the MySQL software installed and MySQL Workbench up and running, so now we need to create a database. This is the fun part: we're gonna download a sample database that MySQL provides. Just follow the link. All right, so we're on the MySQL website; scroll down and you will see the Sakila database (you can pronounce it different ways; we'll just call it Sakila). Just download the zip and do a Save As. So we have a schema file, which is where we're gonna have the SQL statements to create our table structures; then we have a data file, which is gonna be a lot of insert statements populating this Sakila database; and then we have the data model of how these tables are structured and related. All right, so we're gonna go to File, Open Script, and open the Sakila schema SQL file. So we have our schema SQL file loaded in MySQL Workbench. It has a bunch of create statements, creating the actor table, the address table, and so on. We're going to be creating all our data structures through this script, and it also gives you additional information on the columns that we're gonna be adding. We'll just run this script all at once, and it's gonna create all the tables we need for our exercise. Make sure the cursor is at the start of the script and then click on the execute icon to execute the script, so just hit that icon. You can ignore the warnings in the output; it's still going to create the tables. If you go to the object browser, right-click and refresh, you'll see the Sakila database, and if you open the tables node you'll see the list of tables that were created. Say select star from the actor table and click execute to run: if you see we did not get anything back, that's because there's no data in it yet. So now we need to load data into the database. Same thing: File, Open Script; we will select the data SQL file and load it. This is the sample data that we're gonna be loading, with the SQL insert statements to fill up our sample database. We will execute the script by clicking on the icon. These insert statements will take a few minutes to load the data, and you can see the status in the output pane as to how much time it is taking and how many records are being inserted. Let's try to see if we have data in one of the tables: we'll just run the same select star from the actor table and see if we get data. If you see the result, great; the data is populated in the actor table with the first name, last name, and a bunch of rows, so we're good to go here. The same way, we can try the address table, select star from address, and we should be able to see more data there. I will see you in the next lecture.
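As a quick recap of this lecture, once the schema script and the data script have both been executed, the verification steps look roughly like this (assuming the sample database was created under the schema name sakila, which is what the standard Sakila scripts use):

USE sakila;                      -- make the sample database the default schema
SELECT * FROM actor;             -- should now return a couple of hundred rows
SELECT COUNT(*) FROM actor;      -- a quick row count to confirm the load
SELECT * FROM address LIMIT 10;  -- peek at the address table as well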
8. Introduction to Data Modeling Part1: Hi everyone. So now we're going to start a new section on data modeling. Understanding how data is related to each other becomes really, really important as we start our journey into the data analysis field. We have to make sense of how datasets are related to each other so we can derive value out of the data; the data becomes more meaningful if we are able to relate the data with each other. The byproduct of data modeling is a data model that gets created. It's a graphical layout, and it becomes a communication tool that can be used by different groups to understand how the data is related to each other. We can use MySQL Workbench or Erwin software to do data modeling. You must have seen a picture of a data model, and what I'm showing on the slide right now is just a sample set of tables and some of their relationships; the connecting lines between the tables show the relationships, and we will go into detail in the later slides. We ran the schema file to create the table structures and have already loaded the data, and now we're going to be opening up that data model. There is a .mwb file, the last file when you unzip, so just double-click on the file and it should open the data model in MySQL Workbench. If this window shows, just go ahead and rename it, and it'll go ahead and load the model. Click on the ER diagram icon, and then the model opens. They have color-coded different sections based on the type of data: there is customers and inventory, and all the related tables in certain subject areas are color-coded. So this is the interface where you can view the data model and how it's related. Tables are one of the foundational structures in data modeling, and I've given an example of a country table. Now we're going to be talking about the building blocks for data modeling. The first one is the table, or entity; you can use these names interchangeably. A table can be something for which you want to store or collect data. For example, it can be customer, if you want to collect data about a customer and its behavior; it can be automobile, the type of cars that are going on a particular highway; it can be products. So it's anything that you want to collect data for. Second is the column, or attribute; these are just characteristics of the table. So if you want to collect data for the customer, the customer being a table, the columns for the customer table can be first name, last name, phone number, address, etc. Third is datatypes. Each of those columns can have a certain type of datatype. All a datatype is, is that if it's a phone number, we have a number as the format of the column, and if it's a name, it's more of a textual format; these are just different flavors of those columns. Fourth is cardinality. The cardinality of a relationship is what defines how two tables or entities are linked with each other; it's a descriptive way of saying how they are related to each other, and we're going to go into detail on the different types of cardinalities and relationships there are. Fifth is the primary key; it's what defines the customer. Is it the social security number? It is the primary identifier. And sixth, foreign keys, is a concept that will become apparent as we go along explaining the different types of relationships and how tables are linked. These six building blocks are very important to understand how data is related to each other, so we'll go and explain each of these building blocks in detail in the following slides. Tables are one of the foundational structures in data modeling; as the definition states, a table is a collection of related data. So, as we were talking about, if it's customer, we want to have a customer table; it's a structured format in which databases create these tables. I've given an example of a country table, so let's try to look at each section of the table. The stuff that's highlighted in orange is basically the attributes, or columns: we have countryID, you have country, and then we have the last updated column. CountryID is listing an individual unique ID, the country column is listing textual information, and then there's a time value for last updated. So these are columns, or attributes. If you look at the bottom section of the slide, highlighted in green, those are actual values, or data elements. If you look at that last row, countryID nine, so nine, Austria, each of these individually is a data element. If someone asks how many columns you see on this slide, I would say three: countryID, country, and last update. Data is added as rows, or tuples; so if you look horizontally at countryID two, Algeria, and the last update value, that row becomes one record in the database. Another important concept is the primary key: each of these rows that you see is identified by a unique identifier. So country ID one through nine is unique, one, two, three, up to nine; you don't see that being repeated. Understanding the way tables are structured is very important for defining the relationships that we're going to learn in the following slides. Next we will be talking about datatypes, or, in other words, the type of format that a column has in the database. Whenever you try to insert data, you need to know if it's a date, if it's a string, or if it's a number. So for the date column, date of birth, in this example the format is first YYYY, which is the year portion, then dash MM, which is the month, and dash DD, which is the day; so 2000-10-10 would be the format of a data element. The second date format is a date where we have hours, minutes, and seconds as well, so for certain scenarios we can have the date field defined in a more granular way. So date would be the first datatype. The second datatype is strings. Strings are, more or less, in the case of a customer table, the customer first name, last name, anything textual. We have two variations, varchar (variable character) and char (character); we don't need to go too much into detail, but they both hold string values. The third category of datatypes comes under number. Anytime we want to store numbers, for instance your car's mileage, like we were talking about with an automobile table: if you have mileage for the car, that mileage column can be stored as a number. For a number, you can have a datatype of an integer or a small int, depending on how big the number is. Or if you have decimal places you want to represent, like a customer paid amount (if you paid for a Starbucks coffee and the amount was some dollars and some cents), we can represent that with a datatype of decimal. If you know the datatypes in the categories of date, strings, and numbers, you can do the majority of data analysis knowing these datatypes. Next, we will be talking about the notations that we will be using in our relationships. Before we get into the different types of relationships, I just want to make sure we understand the language in which we will be depicting our relationships. We're going to be using information engineering notation; the process of designing the tables and their relationships is called entity relationship modeling. If you're looking at the cardinality, that's just the maximum number of connections you can have between the two tables: one would be a single vertical line, and many would be a crow's foot. Modality is just the least number of connections you can have in a relationship, zero or one. When you see a relationship in action, you will have both the maximum and the minimum number of relationships those two entities or tables can have; that's why, if you're looking at the right-hand side, it's showing the cardinality and modality together. The first relationship is showing one or many, and that's why you see the symbol for one as a vertical line and for many as a crow's foot. The one at the bottom is zero or many, and you have a circle and a crow's foot. Just getting an understanding of these symbols will really help you in the following slides.
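Before moving on to the relationship types, here is a small illustrative CREATE TABLE that ties the datatype discussion from this lecture to concrete column definitions. The table and column names are my own example rather than the ones on the slides; it simply shows one column for each datatype category covered above.

-- Illustrative only: one column per datatype category discussed in this lecture.
CREATE TABLE customer_example (
    customer_id   INT,           -- number: a whole value such as an ID
    mileage       SMALLINT,      -- number: a smaller integer when the range is limited
    amount_paid   DECIMAL(5,2),  -- number with decimal places, e.g. a coffee purchase of 4.75
    first_name    VARCHAR(30),   -- string: variable-length text
    last_name     CHAR(30),      -- string: fixed-length text
    date_of_birth DATE,          -- date: stored as YYYY-MM-DD, e.g. 2000-10-10
    last_login    DATETIME       -- date with hours, minutes, and seconds
);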
9. Introduction to Data Modeling Part2: All right, now we will talk about the different types of relationships that two tables or entities can have. The first one is one-to-one. As the name suggests, it says entity one can only have one relationship with entity two, and entity two can only have one relationship with entity one. If you look at this example, we have a relationship between a manager table and an office table. If you just look at the manager table, we have manager ID, first name, last name, so there are four different managers in our table. And if you look on the right-hand side, we have an office table with four different offices: Chicago, New York, Seattle, and Columbus. So that's the data we're dealing with. If you look at the manager table, we have manager ID as the primary key, and in the previous lecture we defined what a primary key is: a unique identifier. So manager ID 1 only relates to Eddie, and manager ID 4 only relates to Paul. If you look at the office table, we have primary IDs as office ID one through four, and then we also have another column, manager ID, in the office table. What it shows is that office one is managed by manager one and is located in Chicago. With manager ID being in the office table, this particular column is a foreign key, because it is the primary key of the manager table. So now there's a parent-child relationship. And if you look at the line that connects these two tables, it's a straight line with two dashes on the side. What that depicts is that the manager can only manage one office, and if you look from the office to the manager side, it says an office can only be managed by one manager. So it's a one-to-one relationship. It's a mandatory relationship, as the modality is also showing one with the vertical bar, and the maximum relationship is also one. Another example can be a US citizens table and a social security table: a US citizens table which has a first name and last name, and then another table for social security. A citizen can only have one social security number, and a social security number can only be related to one citizen. So hopefully that clears it up as far as the relationship settings. It's a one-to-one: in this case, a manager manages one office, and an office can only be managed by one manager. All right, so the second type of relationship is the one-to-many relationship. This is one of the most common relationships that you will find between tables. If you look at the example, it's between a customer and an order table. Just logically think of a scenario where a customer can have multiple orders; like if you go to Amazon, you can put three orders. That's a legitimate business case, right? You can have multiple orders, but each individual order, say order one, the first of the three orders you put, only relates to you, none of the other Amazon customers. So the order is linked to one customer, but the customer can put multiple orders. If you go with that mindset and look at the relationship in the middle between these two tables, we have a vertical line on the left, meaning that the order can have only one customer. On the right we have a crow's foot depicting the many portion of the relationship, with the vertical bar depicting at least one; so the customer can have at least one order, or many. And again, we have a primary key and a foreign key: the primary key on the customer table is the customer ID, and the primary key in the order table is the order ID. But for this relationship to work, the order table also has a customer ID in it, and that customer ID in the order table is a foreign key; that customer ID is what the relationship is built on. The majority of tables will fall into this category. Again, if you look at this example, customers can put multiple orders, and an order can only be placed by one customer. The third type of relationship is the many-to-many relationship. Let's look at the example that we have. In the first table, we have a student; it just gives you a list of students that we have, so student ID, first name, last name. Great, so we have a student table. Then if you look at the far right, we have a class table; it just gives you the different classes that are being offered, so English, math, chemistry, etc. The student and class tables by themselves look fine: student has student information, class has class information. But as we relate the student and class data together, there is an apparent many-to-many relationship between them, since a student can take multiple classes and each class can have multiple students enrolled in it. We also have a scenario where this relationship can capture other information, like student grades and enrollment dates, and for this it'll be better to make a junction table called enrollment that will capture this information. If you look at the way a student is linked to enrollment, we have a one-to-many relationship, where a student can be enrolled in one or more classes. And if you're looking at the relationship from right to left, from class to enrollment, a class can have one or more enrollments from the students. As you go forward in the data analysis field, you'll see more and more relationships that fall into this category. The fourth type of relationship is called a recursive relationship. If you look at the example of an employees table, you have employee ID, first name, last name, and you also have a fourth column called manager ID. The managers are also taken from the list of employees, so there is a relationship between manager ID and employee ID. If you look at the bottom half of the table, we have a relationship going from manager ID to employee ID: a manager can manage multiple employees. If you look at the tail end of the relationship, it's a crow's foot saying many, and at least one. An employee can be managed by one manager, or in some cases may not have a manager at all; that's the other side of the relationship, and that's why you see employee one not having any manager ID. And if you look at employees two and three, they are being managed by Eddie, who has an employee ID of one. Nobody's managing Eddie; he's a top-level employee, and that's why you see his manager ID has no information. So this is an example of a recursive relationship, where we have a relationship within a single table, and manager ID is a foreign key to the employee ID column. Hopefully this was helpful. This type of relationship is rare to have, but it's good to know when doing data analysis. So now we have looked at the building blocks of data modeling: we know what a table is, what columns or attributes are, what a primary key or foreign key is, what data types are, how tables are related, and the types of relationships that can exist between tables. Understanding normalization becomes important as more and more data comes into your database. Normalization is a process in which we try to break a bigger table into smaller tables, and what it does is help you reduce redundant data, or duplicate data. Normalization goes through a few steps, and each of these steps is called a normal form: we go through first normal form, then second, and then third. If you look at some of the benefits of normalization overall: you save disk storage, it's easier on maintenance for the database because you're updating less data, I/O activity gets improved, and it also helps querying and reporting. I will show you an example of an employee table, and we're going to take that table from first normal form to second and then third normal form. You will see the transition as to how we go through this whole process, and hopefully with this example the normalization process of a table will become easier. We have been given this employee table and are tasked to normalize it up to the third normal form. Let's take a step back and look at the data closely. We have an employee ID, which uniquely identifies an employee, and a department code that is unique for a department. We have employee name, department, three columns that have phone numbers in them, employee start date, and an employee vested indicator. So that's the data we're dealing with, and we have to take this table through the normalization process from first normal form up to the third normal form. When you're dealing with the first normal form, you need to ask two questions. First, are there any repeating groups, meaning, are there two or more columns that are closely related to each other? If there are, we need to create a new table for them. Second, make sure that all the attributes are single-valued attributes. Let's take a look at the employee table. If you look at the employee name, we have first name and last name in one column; that's our first candidate, as we should not have any multi-valued attributes. So in the first normal form, we split the employee name into the first name and last name to take care of the multi-valued attribute. If you look at the first phone number, all the phone numbers there are different, so it has to be the employee phone number. For phone number two, it seems like that phone number is dependent on the department; if you see, the finance department has the same number, so it is the department phone number. Phone number three has all the phone numbers being the same; that can be looked at as a company phone number. Employee start date and employee vested indicator can move as is. We also improved the metadata, or the column names, for all three phone number columns and added employee, department, and company labeling to them, as it makes more sense. Now our employee table is in first normal form. Let's get the ball rolling to the second normal form. The most important question you want to ask when you're transforming your table from the first normal form to the second is whether all non-key columns are functionally dependent on the entire primary key. So what are the non-key columns? Employee last name, employee first name, department, employee phone number, department phone number, company phone number, employee start date, and employee vested indicator: these are all non-key columns. The rule says that they need to be functionally dependent on the entire primary key. The primary key here is employee ID and department code, right? Those are the unique identifiers. But if you look at the employee first name and last name, those are dependent on the employee ID: the employee ID should be able to give us the employee first name and last name. If you look at the department, the department is dependent on the department code. So you see we have some columns in the table depending on one part of the key, and the other columns are dependent on the other part of the key. What we need to do is split the table so that the columns are dependent on the whole key. That's why you see the employee table being split and having the employee first name, last name, phone number, employee start date, and employee vested indicator; they are all dependent on the employee ID. For department, you see we have department, department phone number, and the company phone number, which are dependent on the department code. Great, so we have those two tables transformed into the second normal form right now. We also need to capture the relationship between the employee and the department data, right? An employee can work in different departments, and a department can have multiple employees. As you recall from the earlier lecture, to resolve a many-to-many relationship we had to create a junction table. In this case, it would be the employee assignment table, where you're going to have an employee ID and a department code as the two columns in the relationship. If you look at the relationship from employee to employee assignment, it is one-to-many, and if you look from department to employee assignment, it is also one-to-many. So that is how you're going to capture the relationship, with the employee and department now being in second normal form. All right, let's get the ball rolling for our table to be in third normal form. We should not have a situation where an attribute depends on another attribute that is not the primary key of that table. This is also called a transitive dependency; it's a really important rule that the non-key columns in the table should not have dependencies between them. Secondly, we should not have any derived data, such as total columns that are derived from other columns in the table. If you look at the employee table, we have the employee vested indicator, which is dependent on the employee start date: for a certain start date, that indicator either turns yes or no. So that is definitely one candidate that we can address in the third normal form. If you look at the employee table in third normal form, we only have first name, last name, phone number, and start date. All those columns are only dependent on the primary key, and there's no dependency between the non-key columns. If you look at the department table, that is already in the third normal form. As you go through different projects and as you go through your data modeling assignments, you'll find that the third normal form is where the majority of the transformation stops. There are some higher normal forms, like Boyce-Codd normal form and fourth and fifth normal form, which do occur and people do use them, but very rarely. I will try to add some slides for those normalizations at a later time. All I will say is that with practice, data modeling becomes easier and easier. You need to understand the business as to how the data is captured and what the data means. Practicing the normalization rules and being close to the data will help you in your data modeling projects.
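Before moving on to SQL, here is a rough sketch of how the relationship types from this lecture could be declared as MySQL tables. The column names follow the lecture's examples, but the exact DDL is my own illustration; note that the junction-table pattern shown for student and class is the same pattern the employee assignment table uses in the normalization example.

-- One-to-many: a customer can place many orders, and each order belongs to one customer.
CREATE TABLE customer (
    customer_id INT PRIMARY KEY,
    first_name  VARCHAR(30),
    last_name   VARCHAR(30)
);

CREATE TABLE customer_order (      -- "order" is a reserved word, hence customer_order
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL,      -- foreign key back to the customer
    order_date  DATE,
    FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);

-- Many-to-many: student and class resolved through an enrollment junction table.
CREATE TABLE student (
    student_id INT PRIMARY KEY,
    first_name VARCHAR(30),
    last_name  VARCHAR(30)
);

CREATE TABLE class (
    class_id   INT PRIMARY KEY,
    class_name VARCHAR(50)
);

CREATE TABLE enrollment (
    student_id      INT,
    class_id        INT,
    enrollment_date DATE,
    grade           CHAR(2),
    PRIMARY KEY (student_id, class_id),    -- one row per student/class pair
    FOREIGN KEY (student_id) REFERENCES student (student_id),
    FOREIGN KEY (class_id)   REFERENCES class (class_id)
);

-- Recursive: manager_id points back to the same employee table.
CREATE TABLE employee (
    employee_id INT PRIMARY KEY,
    first_name  VARCHAR(30),
    last_name   VARCHAR(30),
    manager_id  INT NULL,              -- NULL for a top-level employee with no manager
    FOREIGN KEY (manager_id) REFERENCES employee (employee_id)
);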
10. Introduction to SQL: Hi there. We went through learning about databases and then data modeling, and now we're going to be learning about SQL. SQL is a programming language used to interact with relational databases. SQL is an acronym for Structured Query Language. The syntax of SQL is very easy to learn; it's pretty descriptive, and a lot of the keywords that are used to do the operations are self-explanatory. Different relational database systems have their own flavor of SQL that they have rolled out, and each flavor has some functions that might be different from the other versions, but overall the basic SQL syntax remains the same. We will be using SQL to interact with our MySQL database. We're going to be going through the different categories in which SQL statements can be placed. SQL was standardized by the American National Standards Institute (ANSI for short) in 1986, and it has gone through multiple revisions. We can put SQL commands into different classifications. The first one is DDL, data definition language. Whenever we are trying to create a table, alter it, or drop it, these SQL keywords can be categorized within DDL, and we have a separate slide where we can go through examples of DDL. Second is DML. DML statements pertain to either inserting, updating, or deleting data, and we will go through some examples of those. Third is DQL, data query language, and this is the select keyword. You must have seen a select statement in the past as well; this is basically the generic SQL statement which can help you retrieve data. The fourth one is DCL, data control language. The statements in DCL are normally used by database administrators, like granting privileges or revoking them. We won't spend time doing data control language examples, but we will go through examples that fall under DDL, DML, and DQL.
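To make the four classifications concrete, here is one illustrative statement per category. The table and user names are made up for the example, and the GRANT assumes the user account already exists:

CREATE TABLE demo (id INT);                 -- DDL: define or change structures
INSERT INTO demo (id) VALUES (1);           -- DML: insert, update, or delete data
SELECT id FROM demo;                        -- DQL: retrieve data
GRANT SELECT ON demo TO 'report_user'@'%';  -- DCL: manage privileges, typically a DBA task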
11. DDL: All right, so now the fun begins: we're gonna be writing SQL code. We're gonna go through DDL statements; as I was saying in the previous slide, this is the data definition language. We will be creating a table, and then we can also alter it, drop it, or truncate it. Let's take the first example and create a new table. All right, so we'll be creating a new table called students, and we're gonna be using the keywords create table, then the name of the table, which will be students, then an open bracket and a close bracket, and inside we will list the different columns. You can think of columns as attributes: what kind of data do you want to capture for the students? You have your student ID, first and last name, date of birth, phone number, and address; it's just general information about the students. Whenever you want to add a column, you have to give it a data type, which defines the type of data that will go into the column. So, for instance, for student ID we use INT, as in integer. First name and last name are gonna be variable character, varchar for short; they're gonna be string values. Anytime you have a string value, just use a variable character or a character; you can use either one. One is fixed length, the other one is variable length. And whenever you're defining a string, you will give the length of the string value, which in this case is 20. Date of birth is gonna have a data type of date, phone number again is gonna be a string, and address is going to be a string. If you see, I have kept address at 100, since we might have more information in the address. We need to add a semicolon to indicate the end of the statement. So just highlight this portion and execute. Great, so now we have successfully executed a create statement. If you go under the tables section of the Sakila database, we need to refresh the tables (do Refresh All), and then you will see the students table. You can open the students table here and drill down into the columns, and you'll see the list of columns that we created. Great, so now you have created a students table. All right, so now that we have created the students table, let's try to add some columns to the table and drop some, and we can even go ahead and modify a data type. If you look at the alter table syntax, we have alter table as the keywords: we have to use alter table and then the name of the table that we want to alter, so it's gonna be alter table students. We want to add address line 1 and address line 2 as two new columns, so we're gonna use the add keyword and afterwards give the data type; in this case I'm gonna be giving a length of 100 for each column, separated by a comma. We will then be using the drop keyword to drop a column; we'll be dropping the address column. And if you want to modify a column, we just use the word modify, then last name; initially this string data type was of length 20, so we're gonna be using 30 now. Now we can just highlight the whole statement and run it. Great, so our alter statement was executed successfully. We can come down to the object browser and look at our students table: we have student ID, first name, last name, phone, and address. It's actually still showing the older columns, so just do a Refresh All, and if you look at the last name, it's now a variable length of 30. Great, so now we have altered the table. To drop a table, or delete it entirely, we just say drop table and the name of the table.
So what this will do is drop the students table. We also have a truncate table statement: if we do truncate table, it basically empties out all the records in the table. Right now our students table is empty, so it won't perform any particular action because the data is already empty. So we run the drop command and, as you see, the students table has been deleted.
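Written out, the statements walked through in this lecture look roughly like the following. This is my reconstruction from the walkthrough, so minor details such as the exact column lengths may differ from the on-screen script.

-- Create the students table described above.
CREATE TABLE students (
    student_id    INT,
    first_name    VARCHAR(20),
    last_name     VARCHAR(20),
    date_of_birth DATE,
    phone_number  VARCHAR(20),
    address       VARCHAR(100)
);

-- Add two columns, drop one, and widen another, as in the walkthrough.
ALTER TABLE students
    ADD address_line1 VARCHAR(100),
    ADD address_line2 VARCHAR(100),
    DROP COLUMN address,
    MODIFY last_name VARCHAR(30);

TRUNCATE TABLE students;  -- empties out all the records but keeps the table structure
DROP TABLE students;      -- removes the table entirely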
12. DML: The next category in which we can put SQL statements is DML, data manipulation language. With it we can write statements to either insert data, update data, or delete data from a table. We'll practice each one of these statements and see how the database behaves. All right, so we'll be inserting a record into the country table. Let's see what we have in the country table already. Great, so we have a bunch of records; we have about 109 records in the country table. If you want to insert data into the table, you have to use the keywords insert into, then the table name, open brackets, and then you have to list the columns. The country table has country ID, country, and last update, so we'll list those three columns and then close the brackets. So the first set in brackets is the columns that are in the table, and then we have values as a keyword, and then we list the data values that we want to insert into the table, separated by commas. For country ID, we're gonna put 110 (if you look, we already have the last record in the table with the ID being 109), and we'll put the country name as My Land. For the last update, this is a function, don't worry about it, you can just put it as is: it says current underscore time with an open and close bracket, and that just gives you the current time when you insert the record; then close the bracket. So: insert into, the table name, list the columns that are in the table, then the values keyword, open bracket, and then the data values separated by commas. All right, so let's execute this. Okay, so we have affected one row, if you look here at the bottom. Let's try to select data from the country table and see if we got our record. If we scroll down, as you can see, the record with country ID 110 was inserted. Great. Now we want to update the same record and set the country name to No Man's Land. We use the keyword update, then the table name, then set country (the column that we want to update) equals, and we put the new value in quotes, No Man's Land. We use the where keyword: country ID equals 110. This will filter the data down to the record which we inserted, which was 110, and the set keyword is gonna update the value in the country column for the record where the country ID is 110. So we run this, and, see, it has been updated to No Man's Land. Great. If you want to delete a record from the table, we use the keywords delete from, then the table name, and then we use where when we want to qualify and filter down to a particular record in the table: so where country ID is equal to 110. Let's run this. Now if you do select star from country, we should not have any record with country ID 110, so that has been deleted. Great.
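For reference, here is roughly what those three statements look like written out against the Sakila country table. I have used CURRENT_TIMESTAMP for the last update value, which may differ slightly from the function shown on screen, and the apostrophe in the updated name is escaped by doubling it.

-- Insert a new country (IDs 1 through 109 already exist in the sample data).
INSERT INTO country (country_id, country, last_update)
VALUES (110, 'My Land', CURRENT_TIMESTAMP);

-- Update the record we just inserted.
UPDATE country
SET country = 'No Man''s Land'
WHERE country_id = 110;

-- Delete it again and confirm it is gone.
DELETE FROM country WHERE country_id = 110;
SELECT * FROM country WHERE country_id = 110;  -- should return no rows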
13. DQL1: We will be looking at the most common SQL statement that is used to get data from the database. It's a very simple statement, but it's very powerful at the same time. I have tried to put together this visual for the select statement, because I want to make sure that you understand each of its components. A lot of times people know the select statement only partially and miss the other details, so hopefully this picture will help you understand each component of the select statement that is used to retrieve data from the tables. Select lists the columns you want; from lists the tables you want to get data from; where is where you put the filter condition to filter your data; order by is where you list the order in which you wanna see the data; and limit is where you limit the number of rows you get back. All right, so let's get this kick-started. The first script we have is select star from actor, so we can retrieve data from the actor table. We're using a semicolon at the end to end the statement, and we're using the select and from keywords. The star, or asterisk, means you want to bring back all the columns from the table; otherwise you'd have to list them, and if the table has 20 columns you don't want to list each one of them. Let the script run, and you'll get the data from the table, about 200 records. Now let's say you just want to bring back first name and last name: we do select and just list out the columns separated by a comma, so first name, comma, last name, then the from keyword, then actor, and a semicolon. So we just get those two columns; it's pretty straightforward and self-explanatory. Let's say you want to label the first name as employee name: you have to use the keyword as, and then just use the column name followed by the label you want, and you have employee name and the rest of the data comes in. Great. Now let's say you want to order the data, order by first name. The first piece of the select statement stays the same: we select first name, last name from customer. Then we use the keyword order by, and the column we order by is first name, ascending. Let's run that. Great, so you see the data comes back ordered by the first name, ascending. Let's take an example with the payment table and order by amount descending, so the amount should go down as the data comes back; you see, first we have a higher amount and then it goes down. We can also add a sort on two columns. We can sort the customer data by first name and last name together, descending, by just using order by first name, comma, last name descending. Let's run that: great, so it is ordered. Now let's say we want to bring back just the first five records from the customer table. This is gonna be a traditional select star from the table, order by customer ID, and as you can see, about 600 records come back. So we have to use the keyword limit and then give the number of records we want, and then a semicolon. Let's run this: we only get five records back. So this is done by using the keyword limit and then the number of records you want back. You can also pick a range when using the limit keyword: if you say 2 comma 5, that means it's gonna bring back five records, but it's going to start from the third record of the table.
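Here are the statements from this walkthrough written out against the Sakila tables so you can run them in one pass; the exact row counts you see will depend on the sample data.

SELECT * FROM actor;                                      -- all columns, all rows
SELECT first_name, last_name FROM actor;                  -- just two columns
SELECT first_name AS employee_name FROM actor;            -- relabel a column with AS

SELECT first_name, last_name FROM customer
ORDER BY first_name ASC;                                  -- sort ascending

SELECT * FROM payment ORDER BY amount DESC;               -- highest amounts first

SELECT first_name, last_name FROM customer
ORDER BY first_name, last_name DESC;                      -- sort on two columns

SELECT * FROM customer ORDER BY customer_id LIMIT 5;      -- only the first five rows
SELECT * FROM customer ORDER BY customer_id LIMIT 2, 5;   -- five rows, starting at the third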
14. DQL2: This is a list of SQL comparison operators that you can use when you're trying to filter data from the dataset. Whenever you use the keyword where, you can take a particular column and then, if you want the column to have a value equal to something, not equal to something, greater than, or less than, you can use these mathematical operators. We also have operators like in, with an open bracket and a close bracket, which gives you the option to list multiple values. Between gives you an option to pick a range, whether it's a range of dates or amounts. Null and not null are for when you have an absence of data and you just want to pick the rows where there's no data in the column. Like is also a very popular operator, which you can use to pick data that follows a certain pattern. You will be using the where clause to filter the records coming back from the database. We already know how to select data from a table, right? So let's highlight the portion where it says select star from customer; you will get pretty much all the records from the table. Now let's say you want to only pick Mary: you use the where keyword followed by the column name that you want to filter on, equals, and then Mary. It has to be in single quotes since it's a string value, and then a semicolon. Let's run this: we found one record in the customer table which satisfied this condition. If you want to pick customer data where the first name is not equal to Mary, you just use an exclamation mark and an equals sign together, and then Mary. If you run this, you will filter out all the records where the first name is Mary. Now let's take an example where we want to apply two conditions to filter the data, using the payment table. We want to pick all the records from the payment table where the payment date is greater than 2006 January 1st and the paid amount is greater than four, so we have to apply two conditions to filter the records. We'll do select star from payment where payment date greater than the set date in single quotes, and amount greater than four. Let's run this: you'll see all the payment dates are greater than 2006 January 1st, and the amounts are greater than four. Great. Let's try to filter the last name of the customer against a given list of names. We can use not in, open brackets, and give the list of those names. If you run this, all the last names are returned except that the Davis, Miller, and Wilson records are filtered out. Let's take an example where we want to filter the payment records based on a date range. We can use the between keyword and give the starting date in single quotes and the ending date; the records that are returned have a payment date between these two dates. We can also add an order by payment date. Let's run this: 2005 July 5th is in the range of 2005 July 1st to 2005 July 31st, and all these records are in the month of July. We can also filter customer data based on a pattern that we find in a string value. We can use the keyword like, followed by a single quote, the first two characters of the first name, in this case AN, followed by a percent sign and a closing quote. Let's run this: it's returning all the first names where the first two characters are AN. Likewise, we can run the same like command but put the percent sign at the start, meaning that the last character should be Y in this pattern. Let's run that: if you look at the first names, the last character is Y in all these records. This is a very powerful operator. Great.
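The filters from this lecture, written out so you can try them. The dates and names follow the walkthrough; in the Sakila sample data the customer names are stored in upper case, but MySQL string comparisons are case-insensitive by default, so these still match.

SELECT * FROM customer WHERE first_name = 'Mary';         -- equality, string in single quotes
SELECT * FROM customer WHERE first_name != 'Mary';        -- not equal

SELECT * FROM payment
WHERE payment_date > '2006-01-01' AND amount > 4;         -- two conditions combined with AND

SELECT * FROM customer
WHERE last_name NOT IN ('Davis', 'Miller', 'Wilson');     -- exclude a list of values

SELECT * FROM payment
WHERE payment_date BETWEEN '2005-07-01' AND '2005-07-31'
ORDER BY payment_date;                                    -- a date range

SELECT * FROM customer WHERE first_name LIKE 'AN%';       -- first names starting with AN
SELECT * FROM customer WHERE first_name LIKE '%Y';        -- first names ending with Y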
15. Inner Join: One of the tasks that data analysts do is bring data together from different data sets, and that is achieved by joining tables together. The first type of join we're going to look at is called an inner join. If you look at the Venn diagram, table A and table B have something in common, and that common data element will be used as a key to join these two tables together. The syntax for the inner join can be written in an explicit way or an implicit way. Explicit is when you're using the INNER JOIN keyword. So let's take an example of the first syntax: select ... from table A inner join table B on, and then we list the keys from both tables that we want to join on. That's the explicit way of saying it. Instead of ON, we can also use the USING keyword and just list the key that is common between those two tables. The implicit way is where you say from table A comma table B where A dot column equals B dot column. That's just a different flavor in which you can get the same results. You can open the data model from the file itself; click the ER diagram icon to open the model. When joining tables, it is really important to know what the data model looks like, right? So let's say we want to join country and city; we want to know which column the relationship between these two tables is built on. On what column in the country table and what column in the city table can the two tables be joined? It's going to be: select the country and city columns from the city table inner join the country table using country ID as the column that is common between both those tables, where country is equal to United States. So whenever you're trying to join tables, you need to look at the data model, right? The data model will give us the relationship of how tables are joined together. Let's switch to the tab where we opened the data model. We're looking at the country and the city tables, and you can see we have country ID common between them, so that means we can use the country ID column in our join condition. We have used the country ID column as the joining column between the city and country tables, and we're also using the keyword INNER JOIN. What that means is we want to bring back only those records where the country ID has the same data in both tables, and we want to filter where the country is United States. So what the SQL will return is the list of all the cities in the United States. Next, you're given the task to find the list of films whose language is not English. We're going to go back to our data model: the film and language tables are joined on language ID. So let's go back to our SQL: we select title and name from film inner join language using language ID, where name is not English. We want to bring back all the films where the language in which the film was made is not English, and we don't have any records in the database right now that satisfy this condition. We can modify the script to bring back films that were made in English by taking the NOT out and just keeping the name equal to English. Now let's say we have a scenario where we want to generate the list of customers who live in Barcelona. Let's go back here: it will be a three-table join of customer, address, and city. Let's look at our SQL: we're selecting a bunch of columns from these tables, from customer inner join address using the address ID.
So if you look, this is how the customer and address tables are joined, on address ID, and then we're joining address with an inner join to the city table on city ID here. They're joining on the city ID column. These will be our joins for the three tables, where city equals Barcelona. You see the column city; to get to that, we had to go from customer to address to city to relate all this data together. Now we run this, and we should get all the customers living in the city of Barcelona. We have one customer that lives in Barcelona. This is a very powerful join statement, and as you can see, you have joined three tables on these common data elements. Now let's say you want to bring back customers that live in Barcelona and that, during their rental period, have paid more than $5 in rental cost. You will be joining the payment table with the customer table. If you go back here, there's a customer ID that joins the customer and the payment tables right there. The rest is the same; we already did this join, so we're just adding the payment table, and then we have amount greater than five. Let's run this. You see we're getting records for the same customer: he has two transactions where the amount was greater than five. So it's giving you a transaction history for the customer living in Barcelona where the payment amounts are greater than five.
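The joins walked through in this lecture could look roughly like this; a sketch based on the Sakila schema, using USING for the shared key columns:

-- Cities in the United States
SELECT country, city
FROM city
INNER JOIN country USING (country_id)
WHERE country = 'United States';
-- Customers living in Barcelona: a three-table join
SELECT c.first_name, c.last_name, ci.city
FROM customer c
INNER JOIN address a USING (address_id)
INNER JOIN city ci USING (city_id)
WHERE ci.city = 'Barcelona';
-- Add the payment table to see that customer's payments over $5
SELECT c.first_name, c.last_name, p.amount
FROM customer c
INNER JOIN address a USING (address_id)
INNER JOIN city ci USING (city_id)
INNER JOIN payment p USING (customer_id)
WHERE ci.city = 'Barcelona' AND p.amount > 5;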
16. Left Join: The second type of join is called a left join. A left join brings all the data from table A along with the data that matches between A and B. The syntax for a left join is select ... from, and we use the keywords LEFT JOIN: table A left join table B using the common key fields between those two tables. We want to get the list of staff members and indicate which ones are managers, but the list should also give you the staff members who are not managers. For that, we will be using a left join. We need to insert these two records first so we have enough data to make this example work. If you come back to the model, we have the staff and store tables linked by the staff ID: the store table has the staff IDs that are managers, and then, of course, we have the relationship on store ID as well. So we will be joining the staff table and the store table with a left join. That means it is going to bring all the records that are in the staff table along with the ones that are in store, and we're going to join on staff ID equals manager staff ID. This should give us all the records in the staff table along with those staff members that are managers. If you see, first name and last name are coming from the staff table; we have four staff members right now in our database. Manager staff ID is coming from the store table, and only Mike and John are managers. For Johnny and Jimmy there's no record in the store table; that's why we have nulls, meaning no data. This way you can see that left joins are useful for scenarios where we want to bring all the data from one table and link it with the records that are matching between the two tables. Another variation of the left join is the left outer join, meaning that I just want to get the records in table A which are not present in table B. What you can do is select ... from table A left join table B using the key that is common between the tables, where table B's key is null. If you pick this last option, where table B's key equals null, you will get only the records in table A that have no match in table B. So let's say you want to get the list of staff members who are not managers. We'll do a left outer join, and the way we achieve that is we join staff to store with the LEFT JOIN keyword, on staff ID equals manager staff ID. We're using the ON keyword in our join condition since the column names in the two tables are not spelled the same way. Up to here it's exactly the same as the left join, but then we add the WHERE clause: the staff ID from the store table is null. Then it only returns those records from the staff table and not the ones that are matching between the two tables. If you compare this with the other left join, this one is only bringing the staff members that are not managers. Great.
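A minimal sketch of both variations against the Sakila staff and store tables; remember the lecture also inserts two extra staff rows first so that non-managers exist:

-- All staff, with the matching store row where the person is a manager (NULL otherwise)
SELECT s.first_name, s.last_name, st.manager_staff_id
FROM staff s
LEFT JOIN store st ON s.staff_id = st.manager_staff_id;
-- Only the staff members who are NOT managers
SELECT s.first_name, s.last_name
FROM staff s
LEFT JOIN store st ON s.staff_id = st.manager_staff_id
WHERE st.manager_staff_id IS NULL;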
17. Right Join: The third type of join is a right join. Basically, it's the reverse of the left join, and the syntax would be select ... from table A right join table B using the common keys. It brings all the data from table B and links the records that match between table A and B. We also have the right outer join, which is the opposite of the left outer join: select ... from table A right join table B using the common keys, where table A's key is null. In the previous lecture, when we had the left outer join, we were using table B's key as null, but in this one we will use table A's key as null. So it's just the reverse of the left outer join. For the right join and the right outer join, I don't have any lab exercises for you; that's something you can practice on your own using the Sakila sample database.
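Since the lecture leaves this one as practice, here is one possible sketch you could try against the same staff and store tables; it simply mirrors the left join examples:

-- Right join: every store row, with the matching staff row where one exists
SELECT st.store_id, s.first_name, s.last_name
FROM staff s
RIGHT JOIN store st ON s.staff_id = st.manager_staff_id;
-- Right outer variation: store rows whose manager_staff_id has no match in staff
SELECT st.store_id
FROM staff s
RIGHT JOIN store st ON s.staff_id = st.manager_staff_id
WHERE s.staff_id IS NULL;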
18. Cross Join: The fourth type of join is called a cross join. It's basically joining two tables that don't have an apparent relationship, and this type of join is also called a Cartesian product. One of the things you have to be careful about is that you should only do a cross join between two tables if the number of records in both tables is pretty small; I'm talking about 5 to 10 records, maybe 20 records or so, and they're normally code values with some descriptions. So if you have a specific use case where you have to create some sort of a mapping, a code mapping, between two tables, then you can use a cross join. The syntax for this is: select the list of columns from table A cross join table B. It's going to show one value of table A being repeated for each value in table B; that's why the number of records gets multiplied by the number of rows you have in table A and table B. So let's look at an example of a cross join. We'll create two tables, one t_shirt_color, the other t_shirt_size. One has the colors and the other has the sizes a t-shirt can have, so we'll just run that. Let's look at what data we have in the t-shirt color table: we have two records, one is black and one is white. Let's look at the t-shirt size table: we have four sizes, small, medium, large, and XL. Now if we want to find out the maximum number of combinations between the color and the t-shirt size, we can join the t-shirt color table with the t-shirt size table in a cross join. It's going to give us the maximum possible combinations between the colors we have and the sizes. Great. So with black you can have four sizes, small, medium, large, and XL, and with white you can have four sizes as well.
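A sketch of this lab; the column names are made up since the exact create statements aren't shown in the transcript:

CREATE TABLE t_shirt_color (color VARCHAR(20));
CREATE TABLE t_shirt_size (size VARCHAR(10));
INSERT INTO t_shirt_color VALUES ('Black'), ('White');
INSERT INTO t_shirt_size VALUES ('Small'), ('Medium'), ('Large'), ('XL');
-- Every color paired with every size: 2 x 4 = 8 combinations
SELECT c.color, s.size
FROM t_shirt_color c
CROSS JOIN t_shirt_size s;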
19. Self Join: The fifth type of join is a self join. You have to create aliases of the table: let's say the table is A, you have to create aliases A1 and A2 to join the table to itself. Self joins are normally rare, but there are some use cases in which you have to use one. For instance, if you have an employee table and there's a column in which you have a manager ID, that manager ID points back to the employee ID column in the same employee table, so the table has an inherent relationship within itself. Let's go to an example and see how this can work in practice. Let's say we want to get a list of managers and their direct reports. For this example, we have to create a table called employee_manager, so this script creates the table. Let's insert a number of records here. Great. We have six employees with their full name and their manager ID. For example, employee one, Matthew, is being managed by Grace, and Grace is being managed by Alice, so there is a relationship within the table itself. Let's take a look at this self join SQL piece by piece. We're creating an alias for the employee_manager table, E1, and we're doing a right join to employee_manager E2. So you're doing a right join between E1 and E2 on employee ID. Think of E1 as the manager list and E2 as the employee list, and we're selecting E2's employee full name, which means we want all the employees to be listed; we're just giving it a label as employee. Then all the IFNULL statement is doing here is that if the manager value is null, meaning the employee does not have a manager, we label it with No Manager; this is just a label to give the column. If we run this, you'll see we get the list of employees and their corresponding managers. If you look at the last row, Alice Cooper does not have a manager. Let's look at the employee table again: for Alice Cooper, the manager ID is null. So when we ran the query, it found the null and replaced it with the No Manager value, which we give here. Self joins are rare, but it's good to have the knowledge about them.
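A sketch of the self join described above; the column names are guesses since the create statement isn't spelled out in the transcript:

CREATE TABLE employee_manager (
  employee_id INT PRIMARY KEY,
  full_name   VARCHAR(50),
  manager_id  INT
);
-- e2 is the employee list, e1 the manager list; employees without a manager get a label
SELECT e2.full_name AS employee,
       IFNULL(e1.full_name, 'No Manager') AS manager
FROM employee_manager e1
RIGHT JOIN employee_manager e2 ON e1.employee_id = e2.manager_id;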
20. Intersect: The second type of set operator is intersection. Set one intersect set two brings back only whatever is common between those two sets. MySQL doesn't support the INTERSECT keyword, but we can get similar results by using the EXISTS or IN operator. So again, we're going to have query one, EXISTS, and then query two. In query two, we're going to join the two tables together based on column one and column two, and when we run this, it is going to bring back only the records that are common on those two columns. Similarly, we can use the IN operator. With the IN operator we don't need to physically join the columns; we can just do a lookup. Think of IN as a lookup: select column one, column two from table A where, and in the brackets, we give what columns we want to look up, column one and column two, in the second query, and then we select the columns that we want to look up against. In my experience, EXISTS performs better with a lot of data, and IN is a little bit slower from a performance perspective. So we want to get back the data that set one and set two have in common. If you look at version one of the script, we're saying select star from set one where X and Y, which are the columns of set one, IN, and then we have open and close brackets with select star from set two. This will bring only those records that are common between the two sets. As you remember, set one had four records, and set two had two records that are also present in set one, so we should get two records back. Great: we have Eddie and Charlie coming back. We can also run the same command using the EXISTS keyword. With EXISTS, we have to introduce a join condition. So we'll select star from set one, giving it an alias A, where exists, open brackets, select star from set two, alias B, where... This join condition on column one and column two needs to be present in the second query to link it back to set one. This physical linkage needs to exist when we're using this type of SQL; when we were using IN, we did not need the join condition. That's the apparent syntax difference, but EXISTS is faster when there's a lot of data, and we get the same result. Great.
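A sketch of both flavors, assuming two small tables called set_one and set_two with columns x and y (the names are guessed from the lecture's description):

-- Intersection with IN: rows of set_one whose (x, y) pair also appears in set_two
SELECT * FROM set_one
WHERE (x, y) IN (SELECT x, y FROM set_two);
-- The same result with EXISTS and an explicit join condition
SELECT * FROM set_one a
WHERE EXISTS (SELECT * FROM set_two b WHERE a.x = b.x AND a.y = b.y);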
21. Minus: The third type of set operation is minus. Basically, set one minus set two will give you the elements that are in set one but not in set two. If you look at the example on the slide, set one has 1, 4, 10 as data elements and set two has 9, 10, 11. If you do the minus operation there, you will get 1 and 4 as a result, just like the mathematical operation of subtraction. MySQL does not support the MINUS keyword, but we can still achieve the same results by using the NOT EXISTS or NOT IN keyword. As far as the syntax goes, it is the opposite of what we did earlier in the intersection lecture: instead of EXISTS we will use NOT EXISTS, and instead of IN we will use NOT IN. For the lab exercise, I've attached the SQL scripts to the lecture notes, which will help you go through the minus-like operations within MySQL.
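Following the same pattern as the intersect lecture, a minus-style query might look like this (same assumed set_one and set_two tables):

-- "Set one minus set two" with NOT IN (be careful if the columns can contain NULLs)
SELECT * FROM set_one
WHERE (x, y) NOT IN (SELECT x, y FROM set_two);
-- The same with NOT EXISTS
SELECT * FROM set_one a
WHERE NOT EXISTS (SELECT * FROM set_two b WHERE a.x = b.x AND a.y = b.y);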
22. Union: In SQL we have an area where we can actually do set operations. Set theory is a branch of discrete mathematics, but when it comes to databases and SQL, you can perform set operations by using the keywords UNION, INTERSECT, and MINUS. MySQL supports UNION only; intersection and minus can be mimicked by some variation of the SQL itself. If you look at the union operation, set one has 1, 4, 10 as data elements and set two has 9, 10, and 11. If you do set one union set two, we're going to get all the data from set one and set two, and there won't be any duplicates. If you look at the bottom portion of the slide, you have query one union query two, and one of the things you have to be cognizant about is that column one and column two in both queries need to have the same data type when you're doing the union operation. The columns that you select in query one need to match the columns from query two; when I say match, the two columns need to have similar data types between the two queries. Also, you have to make sure that the number of columns between the two queries is the same, that the order in which the columns are listed is the same, and that they hold similar data pertaining to each other. If you use a slight variation of union and use UNION ALL, it will not remove the duplicates, so you will have duplicate values with UNION ALL. So it's a simple operation which brings data from two tables together and appends the data from query one to query two. We will create two tables, set one and set two, and we will insert some data into those two tables. Let's see what we have in set one: we have four values, Eddie, George, Charlie, and David. Let's see what's in set two: we have Eddie and Charlie. Now let's say we want to bring data from both tables. Eddie and Charlie are in set two, so set two is a subset of set one. When we run the union statement, the first query is select star from set one and query two is select star from set two; we should get all the records from set one, since set two's records are common with set one. So let's run this. We still get the same four records. Great. But if we use UNION ALL, we should get six records, even though two of them are duplicates; you see, we get Eddie and Charlie twice.
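And the union itself, on the same assumed set_one and set_two tables:

-- UNION removes duplicates, UNION ALL keeps them
SELECT * FROM set_one
UNION
SELECT * FROM set_two;

SELECT * FROM set_one
UNION ALL
SELECT * FROM set_two;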
23. Introduction to Aggregate Functions: Aggregating and summarizing data is at the core of data analysis. To be good at data analysis, you need to understand how aggregation works. I'll be introducing a new keyword, GROUP BY, that is normally used with a lot of aggregate functions, and we will go through a lot of examples where we are using the GROUP BY clause. GROUP BY can also be used with the ROLLUP keyword. All this will make sense when we go through the examples in our lab. So we have a list of aggregate functions. If you look at the first table on the left, we have average, count, maximum, minimum, and sum; these are some of the common aggregate functions that you can use on the dataset. If you look at the table on the right, these are statistical functions that you can apply on numerical data in your dataset to derive standard deviation and variance. I've put together this diagram to just show you what the aggregate functions look like when they're applied to the data. This is just an imaginary table with a customer ID and a payment column, and if you look on the right-hand side, that is the syntax to actually apply the aggregate functions on the payment column. So it's SELECT SUM, open bracket, the column that you want to sum on, close bracket, comma. And likewise you can put AVG, MIN, MAX, comma, and then COUNT(1), which is going to give you the total number of records in the given table. Once you run these aggregate functions, you will get a summary record. That summary record is going to add up all your payments, average the payments, and give you the minimum, the maximum, and the count of the records. This is a quick way to get your summary statistics. In this example, we are not using the GROUP BY clause yet; we will use that in a later example. If we look at the payment table in our database, it has an amount field. These are all the amounts that customers paid when renting the movies. We want to get the minimum payment, the maximum, the average, the total, and then also the number of transactions we have. So we're going to use aggregate functions, and it should give us one record. Let's run that. If you see, we get one record back summarizing the data in the payment table. The amount field has the minimum payment being 0, the maximum payment being $11.99, the average, the total sum of the payments, and then the total number of transactions we have, which is 16,049. So this is a quick way to just get summary data from a table by using these aggregate functions.
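The summary query from this lecture would look roughly like this against the Sakila payment table:

SELECT MIN(amount) AS min_payment,
       MAX(amount) AS max_payment,
       AVG(amount) AS avg_payment,
       SUM(amount) AS total_payment,
       COUNT(*)    AS num_transactions
FROM payment;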
24. Groupby Part1: If you look at the diagram, what the GROUP BY does is partition the table based on the customer ID. If you look at partition one, there are four transactions that customer one did, with 20, 30, 30, and 40. That partition is going to get aggregated, and you will see one record for customer ID one; same thing with partitions two, three, and four. So the GROUP BY is basically isolating and aggregating each customer's data. If you look at the syntax, there's a new addition to our previous SQL, which is GROUP BY customer ID, and we have also added that in the select piece of the SQL. So we're adding customer ID, comma, and then the aggregate functions that are being applied to the payment column. This is a very powerful SQL statement that gets used a number of times in multiple projects, so if you get to understand this SQL, it will really help in your data analysis activities. For this lab, we'll be creating an employee table. Just run the create statement and the insert statements, and you should be able to see the employee table's records. We have ID, employee name, department name, salary, and age; that is the data we have loaded right now. We want to see the total salary per department, and we also want to see the number of employees in each department. To get that data, we need to group by department name, right? That would be our partitioning field. Whatever GROUP BY column we use, we have to use it in the select as well. We're going to sum on the salary and label it as total salary, and then we're going to do a count; this count is going to go per department, and that would be the total number of employees. So let's run this. Great. We see that for each department we're able to get the total salary and the total number of employees. Such a simple GROUP BY statement has produced really important information that can be used by the business. Let's take another example. Keeping the SQL the same as the previous one, we'll just add the WITH ROLLUP keyword after GROUP BY department name. What this does is add another record apart from what we already get from the GROUP BY, giving us a total summary record along with the GROUP BY records. If you look at the last record, it has added up the salary column and added up the employees column, so the total salary is 44,000 and the total number of employees is 11. This roll-up gives us a summary record that gets added to the GROUP BY results, so it can come in really handy when needed.
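A sketch of the two GROUP BY queries from this lab; the employee table and its column names (department_name, salary) are my guesses from the lecture's description:

-- Total salary and head count per department
SELECT department_name, SUM(salary) AS total_salary, COUNT(*) AS num_employees
FROM employee
GROUP BY department_name;
-- The same, with a grand-total row appended by ROLLUP
SELECT department_name, SUM(salary) AS total_salary, COUNT(*) AS num_employees
FROM employee
GROUP BY department_name WITH ROLLUP;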
25. Groupby Part2: If you want to filter the GROUP BY results, you can use the keyword HAVING, then an aggregate on the grouped data, and you can use an operator to filter the records. If you look at the example, we're saying HAVING sum of payments greater than 120. We're using the HAVING clause, so it's going to bring back only those customers where the sum of their payments is greater than 120. If you look at this table, the sum of the payments for customer four is greater than 120 and the rest are less than 120, so that record is going to be returned. So the HAVING keyword can be used to filter GROUP BY records further. If I just run the piece we did before, we're going to get production, service, and sales with the total salary in each department and the total number of employees. Now, if we want to filter this result set to only show those salaries where the sum is greater than 18,000, only the production record should show. So we'll use the HAVING keyword and then do sum of salary greater than 18,000, so only the production record should show. Let's run this. Great, so you can see that the HAVING keyword can be used to filter GROUP BY records further.
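And the HAVING filter on top of the same grouped query (same assumed employee table as above):

SELECT department_name, SUM(salary) AS total_salary, COUNT(*) AS num_employees
FROM employee
GROUP BY department_name
HAVING SUM(salary) > 18000;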
26. Introduction to Subqueries: Subqueries are like nested queries: you have a main query and then a subquery. There are two types of subqueries. One is correlated and the other one is non-correlated. A correlated subquery is dependent on the main query in such a way that for every row in the main query, the subquery gets executed at least once. Non-correlated subqueries are independent of the main query, and they get executed once. We have a list of predicates that can be used between the main query and the subquery: EXISTS, NOT EXISTS, and IN are some of the common predicates that we can use, and we had some earlier lectures on that. An additional set of predicates are ANY, ALL, and SOME, and we'll see some examples of those, and some comparison operators that can be used as well. Hopefully this visual will help you understand the building blocks of setting up a subquery routine. We will do some lab exercises in which the concept of correlated and non-correlated subqueries will become much clearer. So let's look at the first example of a subquery. We want to get the list of customer names where the rental duration is more than five days, meaning they had rented the movies for more than five days. For this information, we need the customer table and the rental table. The rental table has information about the movie, as to when the movie was rented out and when it was returned, and the customer table has the first name and last name. So we'll start with select first name, last name from the customer table. This is pretty straightforward. Then we're going to use the EXISTS keyword, open bracket, and we will start our subquery. This is our subquery: we select star from rental, and we give it an alias R, where... And this is the join: customer ID is common between the rental and the customer table, and that is what makes this subquery a correlated query. DATEDIFF gives you the difference between two dates; it's just a function that MySQL has, and we'll go into date functions in a later lecture. For now, just think of this function as returning the difference between these two dates, and it should be greater than five. So let's run this. This gives you the list of customers who rented a movie for more than five days. Let's take a second example: we want to get the list of movies that are not available in any stores. We need two tables for this; one is film and the other one is inventory. Here we are going to use NOT EXISTS. So we're going to say select star from film, give me all the films, NOT EXISTS, and then your subquery: select star from inventory. The inventory table has a column film ID, so we can join the film table to the inventory table on film ID. Since it says NOT EXISTS, this will return all the films that are not in the inventory table. You can see there's a bunch of movies that are not in any stores, and this is an example of a correlated query as well. Now let's take examples of subqueries that are not correlated. For the first one, we want to get a list of all the payment transactions that are above the average payment amount. If you look at this script: select star from payment where amount, and we use the operator greater than, and then a subquery in which we just say select average amount from payment. So we are using the same table in the subquery, which returns the average amount, and our main query says select star from payment where the amount is greater than the average amount that gets returned from the subquery. Let's run that. This list gives you all the amounts that are greater than the average amount. The second example is where we want to find out how many stores are in Woodbridge. Our main query is select star from store where address ID in... So we're using the keyword IN, and we're using the address ID. The subquery is select address ID from the address table, joining it with the city table. So the subquery itself has two tables that we're joining on city ID, where city is Woodbridge. The subquery returns all the address IDs that exist in Woodbridge, and then the main query pulls data from the store table where the address ID is in this list. Let's run this. So we have one store that is in Woodbridge, and you can check that by profiling the individual tables as well.
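Rough sketches of the three subqueries discussed above, using Sakila table names:

-- Correlated: customers who kept at least one rental for more than five days
SELECT c.first_name, c.last_name
FROM customer c
WHERE EXISTS (SELECT * FROM rental r
              WHERE r.customer_id = c.customer_id
                AND DATEDIFF(r.return_date, r.rental_date) > 5);
-- Non-correlated: payments above the overall average amount
SELECT * FROM payment
WHERE amount > (SELECT AVG(amount) FROM payment);
-- Non-correlated with IN: stores located in Woodbridge
SELECT * FROM store
WHERE address_id IN (SELECT a.address_id
                     FROM address a
                     JOIN city ci ON a.city_id = ci.city_id
                     WHERE ci.city = 'Woodbridge');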
27. Date Functions Introduction: Date functions: you either love them or hate them. Date functions become really important when it comes to data that's time sensitive. As data comes from different parts of the world, the date format becomes an issue; as you can see from this picture, date formats in different regions of the world do not conform to one standard. If you look at the blue colored countries, we start with the day: DD dash month MM dash year. That's kind of a common format. But if you look at the yellow countries, we have the year first, and if you look at the countries in red, we have the month first in the date format. The day, month, and year portions get interchanged as we look at different parts of the world, so understanding how to format a date field becomes really important, and I can't emphasize enough that over the course of years I have come across so many occasions where I had to deal with formatting date data. Hopefully this picture gives you an idea of the different date formats we can get when we're dealing with data that's more international. SQL has a lot of powerful date and time functions. I put together this diagram for you to understand the anatomy of the date and time format as it is used in databases. If you look at the left of the bar, we have the date portion: a four-digit year, a two-digit month, and a two-digit day. If you look at the right side of the bar, that's the time portion: a two-digit hour, a two-digit minute, a two-digit second, and then fractional seconds that can go up to six digits. If you're able to understand this format inside out, you really have an edge over other analysts. In our lab exercise we will go through multiple functions that deal with either the date portion or the time portion of this field.
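As a tiny illustration of that anatomy (the literal value here is just an example):

-- Date portion | time portion, with up to six digits of fractional seconds
-- YYYY-MM-DD     HH:MM:SS.ffffff
SELECT CAST('2018-10-20 11:52:58.000123' AS DATETIME(6)) AS example_value;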
28. Date Functions Part1: Below is the set of functions we're going to go through in our lab. The set of functions in the lighter green color are built-in date and time functions that give you whatever the current date is. The second set is for when you want to extract a certain component of the date from the date field. The third set, in orange, gives you the time portion that you want to extract from your date and time field. And the fourth one, highlighted in yellow, is a formatting function that lets you format the date in different flavors. So let's go to our lab exercise to try these out. Let's take a look at script one; we're using the built-in date and time functions that MySQL gives. Now let's run this script. Current_timestamp gives you the date and the time, and remember, we went through the format of the date portion and the time portion, so this mimics that. NOW() also gives you the date and time together. Current_date just gives you the date: the year portion, then the month, and then the day. So it's four Y's, two M's, and two D's. Current_time gives you just the time portion. These functions are good to remember when you're doing your application development as well. Let's look at script two; we're going to use functions that can extract the year, the month, the week, or the day portion of a date and time. So let's run our second script. We already used the current_date function before, so it just gives you the date portion: we have 2018-10-20. If you say YEAR, open bracket, and use this current date inside the bracket, what this will do is just extract the year. Same thing with quarter, month, week, and day. It's pretty self-explanatory, and these functions are named pretty well. Let's say you use DAY and in the bracket you use the current date; it's just going to pull out that today is the 20th day. Same thing with week, month, quarter, and year. These functions become really important in reporting when the business asks you to split the date and tell them what the quarter is, what the month is, what the week is for their sales data, so it's really good to know these functions that can derive and isolate the year, quarter, month, week, and day. Very helpful. Let's go to script three. Now we're using functions that can extract the time portion and dissect the time down to the hour, minute, and second. We're going to use the current_time function to give us the current time, and then these additional functions will derive the hour, the minute, and the second. Same as what we did with the current date, we're going to use HOUR, MINUTE, and SECOND, and in the brackets we're going to give the current time. If you look at the current time, it's 11:52:58, so it takes the hour out, the minute out, and the seconds out. These functions become important if, for some compliance reason, you want to know the minute of your transaction when it was paid, or you want to know the hour or the second; these granular attributes become very, very important. Now we're going to look at the current timestamp, which is the date and the time portion together. So let's run this.
So if you see the current time, we have 2018-10-20 and then the time portion. Date_format can be used for date and time values. Let's say you have your date and time value right here; right now we're using our current timestamp, but in production systems you will get an actual date and time value. You need to place that here first, then a comma. You see these letters, percent k, percent i, percent s, percent d, percent m, and percent Y; these are predefined letters. Percent k depicts the hour, percent i the minute, percent s the second, percent d the day, percent m the month, and percent Y the year. This is something you just have to remember. Just make sure you remember that the date_format function takes two parameters: one is the actual date and time, and the other is what elements you want to extract. When you look at the result set, we were able to extract the hour, the minute, the second, and the day, month, and year. This is an important function as well. Let's say you get the current date and time and you're asked by the reporting analyst to tell them what week it is and what day it is, not just numerically but from a textual perspective: is it a Saturday, is it a Monday? How would you go about deriving that? Let's take a look at this script. We're using date_format, and we're going to use NOW(). If you look at these predefined letters, percent W will give you the weekday name, percent M the month name, percent e the day of the month, percent Y the year, and then the time portion: H is the hour, i is the minute, and p is AM or PM. So you see we got the date and time here; it's saying it's a Saturday, the month is October, the 20th is the date, 2018, then the actual time, and then whether it's PM or AM. As you can see, this derived date and time is more descriptive, and sometimes for your reporting you would need that; you would want to convert a date and time from the raw format down to this one. So just remember that these are predefined letters that are used in the SQL function. That's something you have to memorize, and hopefully, as you practice more, these functions will become very, very easy for you to understand.
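A condensed sketch of the scripts from this lecture; the formatted output shown in the comment is only an example:

-- Built-in "right now" functions
SELECT CURRENT_TIMESTAMP, NOW(), CURRENT_DATE, CURRENT_TIME;
-- Pulling individual pieces out of a date or time
SELECT YEAR(CURRENT_DATE), QUARTER(CURRENT_DATE), MONTH(CURRENT_DATE), WEEK(CURRENT_DATE), DAY(CURRENT_DATE);
SELECT HOUR(CURRENT_TIME), MINUTE(CURRENT_TIME), SECOND(CURRENT_TIME);
-- A descriptive format, e.g. 'Saturday, October 20, 2018 11:52 AM'
SELECT DATE_FORMAT(NOW(), '%W, %M %e, %Y %h:%i %p') AS descriptive_date;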
29. Date Functions Part2: The following set of date functions helps you add or subtract dates and/or times from a date field. These date functions become important when we have more timeline-oriented data; they help you move along the timeline, either by adding days, months, or years, or by subtracting them to get the desired result. Let's go use them in practice. Let's look at script one. A lot of times we are asked to either add or subtract a certain number of days from a date field, and this activity is really important as we're trying to report certain metrics. You might come across use cases where you have to subtract or add days or months to a particular date field, and it might be for some reporting need. So let's look at the script: select current date, and then we're adding five days to the current date and just giving it a label. Let's run this script; you see we added five days, so it's 2018-10-25. To add the days, you need to use the keyword INTERVAL, then the number of days you want to add, and then the unit of what you want to add, in this case days. Let's look at the next example. We can achieve a similar result by using the ADDDATE function as well. ADDDATE takes two parameters: the first is the actual date field that you want to add to, then a comma, and then we can use INTERVAL five days. This gives us the same result; it has added five days. If you are asked to take days out, you can use the negative sign and say INTERVAL 10 days, and it will deduct 10 days from the current date; you see it's 10-10. Like adding days, we can either add or subtract months. Let's say we want to subtract 10 months from today; you see it's 2017-12-20. Likewise for the year portion, we can also take 10 years out of the date, and it's giving us 2008. Pretty nice, handy functions. Now let's apply these date and time functions to our database. We have a payment table which has a payment date column. We want to list out the month and the total number of transactions for the customers in that month. So let's look at the script: select payment date as paid date, so that's our field in the table; MONTHNAME is going to extract the month from the date field and actually tell you if it's January, February, or March; and then we're doing a count, because we want the number of transactions, from payment. We have to use GROUP BY, and we want to group it by payment date, but only the month portion of the payment date. Then we're ordering by the count, because we want the first record to show the most transactions, going descending. So let's run this. You see from the results that July has the highest number of transactions, then August, June, May, and then February. Looking at this data, the store can actually see that July is the busiest time for movie rentals, so they can add more resources in their stores, and they can also see which ones are the least busy months. This can impact the business directly, so knowing the date functions can really help the business decipher the type of data you have in your database. Now let's say you're asked to produce the average number of days a customer holds a movie for viewing. To calculate the average number of days a customer holds a movie, first we need to get the difference between the return date and the rental date: how many days the customer
actually holds the movie. The first script just gives you that; you can see these are the numbers of days customers are holding the movies, and we're using the DATEDIFF function to calculate the difference. Now, to calculate the average, we just use the average function. So the average number of days a customer holds a movie is five. Next we want to know which day the customer rents the most movies, so we want to find out which day of the week customers are renting the most movies. Let's look at our script for this: select DAYNAME from the rental date, which is just going to tell you if it's a Monday, Tuesday, or Wednesday, and count the number of movies rented, from rental; again we have to group by the rental date. If you look at this data, Tuesday seems to be the day when we have the most movies being rented. In summary, date and time functions play a really important role in your data analysis journey. Time and time again, you will come across situations where you have to use the date and time functions. So my advice would be to go through these exercises a few times, use different data, and try to get used to these date and time functions.
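The date arithmetic and the reporting queries from this lecture could be sketched like this:

-- Moving along the timeline
SELECT CURRENT_DATE + INTERVAL 5 DAY;
SELECT ADDDATE(CURRENT_DATE, INTERVAL -10 DAY);
SELECT CURRENT_DATE - INTERVAL 10 MONTH;
-- Transactions per month, busiest month first
SELECT MONTHNAME(payment_date) AS pay_month, COUNT(*) AS num_transactions
FROM payment
GROUP BY MONTHNAME(payment_date)
ORDER BY num_transactions DESC;
-- Average number of days a movie is kept, and the busiest weekday for rentals
SELECT AVG(DATEDIFF(return_date, rental_date)) AS avg_days_held FROM rental;
SELECT DAYNAME(rental_date) AS rental_day, COUNT(*) AS num_rentals
FROM rental
GROUP BY DAYNAME(rental_date)
ORDER BY num_rentals DESC;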
30. Numeric Functions: SQL allows you to perform a lot of numeric functions. When we look at our database, we might have numeric fields like quantity, paid amounts, or a particular price. Any time you see numeric data, you can apply these numeric functions. They can be characterized as mathematical functions and arithmetic operators. If you look at the first table, for mathematical functions we have functions like FLOOR, ROUND, and SIGN. It's pretty simple to apply these mathematical functions; I'll show you in the lab, along with the arithmetic operators for division, multiplication, and addition. These operators come in really handy when we're trying to add up numeric data or whenever we're trying to apply business rules on numeric fields. Now let's see these functions in practice. Let's take a look at our first script. We're applying an absolute function, short ABS, open bracket, and we're giving the value negative two, and we're giving a label for how we want to see the column name. Comma, CEIL, again open bracket, and we're giving the actual numeric value; then FLOOR, ROUND, and SIGN, and we're not giving a particular table. You can use these functions by themselves without giving a table, and MySQL will calculate them. So let's run this. The absolute of negative two brought back two, which is correct. Ceiling brought back 31 because it's the upper number. Floor got 30, which is correct. Round got 20 because we're rounding the decimal value down to 20, and sign is basically giving you what the sign of the value is. It's pretty simple to apply these mathematical functions, and you can try out your own variations: put some different numeric data here and see how the functions behave. The second category was arithmetic operators. In the second script, we're doing division; we're also doing subtraction and addition. It's pretty simple, and you can run these with your own variations of numeric data. Result one is one and result two is one; we were using DIV instead of the slash for division, so both can be used to give you the same result. Then a negative sign and a positive; it's pretty simple. So hopefully you can try out some examples on your own, with both the mathematical functions and the arithmetic operators.
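A sketch of the two scripts; the literal numbers are placeholders, so substitute your own values:

-- Mathematical functions applied directly to literals (no table required)
SELECT ABS(-2) AS absolute_value,
       CEIL(30.24) AS ceiling_value,
       FLOOR(30.24) AS floor_value,
       ROUND(20.24) AS rounded_value,
       SIGN(-2) AS sign_value;
-- Arithmetic operators: both / and DIV perform division
SELECT 10 / 10 AS result_one, 10 DIV 10 AS result_two, 5 - 3 AS subtraction, 5 + 3 AS addition;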
31. Introduction to String Functions: Being able to effectively use string functions becomes really important in your data analysis. String functions really come in handy when you're trying to clean bad data. Having data in a certain format becomes really, really important for companies that get audited on a yearly basis, and string functions can be used to get your data into a particular format. We will go through each of these categories, and I'll show you examples of how each function can be used. These functions might look overwhelming at first, but I'll try my best to go step by step and explain how they can be used. We will be looking at the converting category of string functions, and the functions in the list are ASCII, CAST, and SOUNDEX. So let's go to our lab. Let's say you're given a task to bring back those customers where the last name is Smith. We'd run this simple select star from customer where last name is Smith. At first glance, we can say we don't have any records where the last name is Smith. But what if the data quality in your table is not good and somebody has fat-fingered the last name, meaning there are some spelling mistakes? You would, at first glance, say we don't have any customers with the last name Smith, but this is where one of the string functions comes in handy: we use the SOUNDEX function. What this function does is bring back the data that sounds like Smith, and it's pretty neat. So we can run the same select star from customer statement, where SOUNDEX is the name of the function, open bracket, the name of the column that we want to check, equals SOUNDEX again and the value we are trying to check against. It's going to compare the sound of both of these data points, and if you run this, you get two records back where the last name sounds like Smith. When you have a lot of data and you're trying to find a particular name, this type of function will help you return names that sound like the given name you have been tasked to find, and it will help you profile the data in an effective way. The second script is where we want to use the CAST function, which converts data from one data type to the other. We have a select with a date written as a text value; we're just using this text value and casting it as datetime. Basically, the CAST function takes the value that we are trying to convert, and then we use the keyword AS and the data type that we want to convert it to, just giving it a label as cast value, from dual. DUAL is a dummy table in MySQL that can be used to test your functions. As you can see, the cast function converted the text value into a datetime format. Casting can be used when you are ingesting a lot of data that is not standardized and you want to cast that data into a particular format; it's a very powerful function that gets used a lot. In the third script we will be using the ASCII function, and ASCII functions can be used when you're doing data conversions; some use cases might ask you to derive an ASCII value. If we run this script, this is basically your last name column, and it converts it into an ASCII value. I've also given you the ASCII value for an empty string, which is 0. Formatting string functions are used a lot when writing SQL scripts; they let you manipulate string values, and having knowledge of formatting string functions will definitely give you an edge. Let's go to our lab section. Let's take the first example: we are tasked to convert a first name and the email address to lowercase. In this script, we have the first name column and LOWER, which is the name of the function, open bracket, and then the name of the column you want to change the case of; then we lowercase the email as well, and we're just going to limit the number of records to five. So if you see, the first name was all uppercase in the database, and we converted it to lowercase; the email also gets converted to lowercase. It's a pretty simple function. The second script is where we are going to be using a trimming function. The TRIM function removes spaces from the value. Let's look at the script. In the first column, we're concatenating the first name with three spaces; if you see 1, 2, 3, we have three spaces after the first name. Then we're getting the length: what's the length of this once we add three spaces to the first name? And then we're going to use the trim function, RTRIM of first name, so it's going to trim the spaces at the end of the column, and then we're going to see the length of the final value. So let's run this. Look at the first record: Mary's length is four plus three, so it's seven. But when we use the right trim function, it trims the spaces from the right side of the characters. So here you can see we have used a trim function that trims the spaces from the right. The third script is where we will be using a padding function. Let's look at the script: select first name, last name, RPAD. RPAD means right pad: first name, comma, then the length of the total output, which needs to be 15; the third parameter is the actual padding value, and in our case we're going to be using a dot. And then we're also going to right pad the last name. So let's run this. For the first name Mary, the length of the name is four, so there are four characters, M, A, R, Y, and it adds 11 dots, so the total length becomes 15, and that's what we have. We qualify it with the length that we need this field to be, and then we give what we need to use for the padding value. So it will be the source column, the total length of the desired value, and what the padding value needs to be, in quotes. This is a function that can be used to format a lot of string values. String functions can also be categorized under the extracting and manipulating categories, and we're going to be trying out some of these functions in our lab exercises. In a lot of data projects, being an analyst, you'll come across different scenarios where you're given a task to either manipulate the data or extract some value out of the data, and these are the functions that you can use to get your desired results. Let's try some of these functions in practice. Let's take a look at script one. It's asking us to find the position of the characters RY in the first name column of the customer table, and we're going to be using a string function called instring, short INSTR, which gives you the position of the characters you're trying to find in the string value. So let's look at our script: select first name, comma, INSTR, that's the name of the function, open bracket, first name, that's the column in which we want to look, comma, RY, those are the characters we're looking for, close bracket, from customer. I'm using this in the where clause to bring back only those records where the first name does have RY in it; since this particular function gives you the position, if the position is greater than 0, that means that the first name does have RY in it. So let's run it. If you look at the result set, for the first record, Mary, the position from which RY starts is three: M, A, R, so R is at the third position. If you look at Cheryl, it starts at the fourth position, RY right here. So this is a string function that can give you the position of a particular character in the data pretty quickly. Let's look at script two under extracting. We want to use the LEFT, RIGHT, and MID functions to split the phone number that we have in the address table. Let's look at this script: select phone, comma, LEFT, open bracket, phone, comma, three. That means we are bringing three characters from the left of this value, and we're going to give it a label as area code. Then we'll do a MID. MID takes two numbers: we'll start at position four, and the three characters from position four would be the mid phone. And for the right, we need the rightmost four characters. So let's run this. You can see from the results that we got back that 998 came into the area code, 834 came into the mid phone column, and the last four digits came into the right phone. Now let's look at some manipulating string functions in script three. Let's say you want to reverse your first name; it's pretty simple, and we just use the REVERSE function and give it the name of the column. Let's run that. You can see you can easily reverse a string value just by using the REVERSE function. We can also use the REPLACE function to replace a value in the data. Let's look at this script: select first name, comma, REPLACE, the name of the field, comma, Linda to Lynn, and we'll just give it a label, from customer where first name is Linda. What this does is find Linda in this column and replace it with Lynn. Let's run this. You see Linda is replaced by Lynn now. So it's a quick way to replace string values. The last script shows that you can repeat data multiple times by using the REPEAT function. So let's run this. As you can see, the value is Eddie and it gets repeated twice. These are just some examples in which I'm showing you how a string function can be used based on a particular business need; you can use a combination of all these string functions to actually get the desired result.
32. Conditional Statements: Conditional logic in SQL can be applied using a CASE statement. It's similar to an if-then-else statement, if you've used any other programming language. A CASE statement can help you transform data from one set of values to another, and it can also help you prevent errors if you're doing some sort of a calculation. Now, let's look at the syntax. We use the keyword CASE to start the case statement, then give the particular column whose value we want to evaluate: WHEN the value, THEN whatever value we want to transform it to. We can have multiple conditions by using the WHEN and THEN keywords. ELSE is for when none of those conditions are met, and we can give it a particular default data value, and then END. ELSE is optional. Let's look at the case statement in practice. Let's use the case statement on the film table. Let's see what we have in the table first: we have a title and a rating. What we want to do is expand on the rating acronym and give a description, so we can use a case statement to expand these acronyms and let our users know what they mean. If you look at the first script, we're using the CASE keyword, then we're giving the column we want to transform, in this case rating; WHEN, and then the value of the column. If the value of the column is R, then we're going to change it to Restricted. These are multiple WHEN and THEN conditions that can be applied to change these acronyms to more descriptive information. Then you use ELSE; if none of these conditions are met, the ELSE condition is triggered. Then AS, and we give this column a label, movie rating description, and we also order by title, ascending. So let's run this. Looking at the results, if you see PG, I had a condition that if the rating was PG, transform it to Parental Guidance Suggested. So you see, this new column is giving us the transformed value. This case statement can become really, really important when you're doing a migration from one platform to another and there are requirements to change values based on a particular business rule. So you see, all the acronyms have been successfully transformed. There's another flavor in which we can use the case statement. In the script that we ran, we were using the column right after the CASE keyword. What we can also do, if you look at our second script, is CASE WHEN the column equals R THEN transform it to Restricted; here we are listing the column in each of the conditions and then assigning a particular transformation rule to it. Both of these will give us the same results. Hopefully you'll find these examples helpful.
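A sketch of both CASE flavors on the Sakila film table; only two of the rating conditions are shown, and the ELSE label is my own placeholder:

-- Flavor 1: the column named once after CASE
SELECT title,
       CASE rating
            WHEN 'R'  THEN 'Restricted'
            WHEN 'PG' THEN 'Parental Guidance Suggested'
            ELSE 'Other'
       END AS movie_rating_description
FROM film
ORDER BY title;
-- Flavor 2: the column repeated inside each condition
SELECT title,
       CASE WHEN rating = 'R'  THEN 'Restricted'
            WHEN rating = 'PG' THEN 'Parental Guidance Suggested'
            ELSE 'Other'
       END AS movie_rating_description
FROM film
ORDER BY title;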
33. Introduction to Window Functions: Window functions are one of the coolest features
I think SQL has. Once you note the mechanics of working with window functions, you can use it for
different use cases and calculate the
desired metrics. The windowing process that's
you divide that result set into groups or froze
called partitions. And then frames let you operate on even a subset
of partition data. By breaking partitions into
smaller sequences of rows. There are a lot of functions
that can be applied. We had the windowing concept, some of the use cases that
can use that window function. If you want to calculate the top 10 products this
year by total sales. If you want to calculate moving averages, are running totals. We will go through some
scenarios where you can see how window
functions can be used. You can apply the
window function on a particular column while still maintaining the detail
of the whole table. Like if you remember in
our group by functions, we used to get data
that user condensed the number of records based
on the group by column. With window functions, you can still maintain the detailed
set of records off the table and still apply grouping on the data based on different
aggregate functions. So the windowing
concept becomes really important when
you're trying to see that data in its totality. Next, we'll see how the window function
is constructed and what are the components that make up a window function work? Window functions are one of the coolest features
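As a preview of that anatomy, here is a minimal sketch of the pieces a window function is built from, assuming an employee table like the one used in the upcoming labs; the column names are placeholders.

```sql
-- General shape: function(...) OVER ( [PARTITION BY ...] [ORDER BY ...] [frame] )
SELECT employee_name,
       department_name,
       salary,
       SUM(salary) OVER (
            PARTITION BY department_name            -- groups of rows: partitions
            ORDER BY salary
            ROWS BETWEEN UNBOUNDED PRECEDING
                     AND CURRENT ROW                -- frame: a subset of the partition
       ) AS running_salary_in_department
FROM   employee;   -- every detail row is kept alongside the windowed value
```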
34. Window Aggregate Function Part1: The following categories give you the list of functions that can be used in the windowing logic: you can use aggregate functions, rank functions, functions for statistical distribution, and also positional functions. These functions become really handy once used with the windowing logic. We're going to start by using the aggregate functions in our lab. We'll start by creating the employee table and inserting some data; some of you might have already done that in the previous lab. So we have employee ID, employee name, the department that they work in, their salary, and their age. That's the data we're working with. Let's start by finding the total salary per department. You know from the earlier lecture that we can use GROUP BY department name to get the total salary by summing it up; here you can see the GROUP BY clause has grouped each department and summed up the total salary. Now let's say you want to calculate the average and the total sum of all the salaries in the table, presented in a way that the actual records in the table still show when you run the query. For that, you're going to use the OVER clause, which initiates the windowing process. We use the aggregate function SUM, pass the column that we want to aggregate on, add the OVER clause, and give it a label. Same thing with the average: we give it the column name, the OVER clause, and a label, and just run the script. As you can see, we got all our records in the employee table, employee ID, employee name, department name, and salary, so we have 11 records showing up. But if you look at the last two columns, the total salary and the average salary, this OVER clause has summed up the salaries of all employees and listed that at every row, along with the average salary at every row. Our third script asks us to sum up the amounts for Mary and Barbara, where each amount they paid is greater than $5. For that, we have to join the customer table and the corresponding payment table, so we join those two tables on customer ID. We're still using the WHERE clause where the amount is greater than five and the first name is Mary or Barbara. If you look at our SELECT statement, we want to know which month they paid these amounts in; we're summing up the amount, using the OVER clause, and labeling it total payment. Let's run this. Here you can see Mary's transactions: in June she had two transactions that were greater than $5, and in July she had three transactions. But if you look at the total payment, the window function is adding up all these amounts and showing, at each row, the sum of all these amounts together. This shows you the aggregation displayed alongside the detailed records. Our fourth script is a small variation on the third. In script three we were not grouping or partitioning the windowing function, but here we're going to introduce a partitioning clause and partition by each customer: we partition by Mary and then by Barbara. The script is pretty much the same; the only difference is that when we sum the amount and use the OVER clause, inside the OVER clause's brackets we use the PARTITION BY clause and partition by customer ID. Let's see the results: now the total amounts are aggregated by customer. All of Mary's amounts have been totaled up, and all of Barbara's amounts have been totaled up.
This PARTITION BY clause comes in really handy when we want to partition on a particular column using the windowing functions. You can see the total amount here, $41.94, only relates to Mary Smith's transactions when you sum them up. Great. I've added a slide to explain these results further: calculating the total payment amount for each customer while showing their detailed records alongside it. You can see we sum the amount, use the OVER clause with PARTITION BY customer ID, and label it as total payments. If you look at the last column, using the PARTITION BY clause created two partitions in our result set, one for Mary and one for Barbara, and by using the PARTITION BY keyword you're able to isolate your aggregate functions so they're only applied to that partition. Hopefully this visual was helpful. In our fifth script, we want to calculate the total and average salary per department and list the total employees per department as well. As you had seen, the employee table has employee name, department, salary, and age. We're going to use the OVER clause, and we can also use PARTITION BY; we're going to partition by department. Here you can see the first three records relate to the HR department, and their salaries are here. The total salary per department, 12,900, is the sum of these three salaries in the HR department, because we partitioned by department name. Likewise for the average salary per department and the total employees, so we have three employees.
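Here is a minimal sketch of the aggregate window patterns walked through above, assuming an employee table with employee_id, employee_name, department_name, and salary, plus Sakila-style customer and payment tables; the names and filter values are illustrative.

```sql
-- Grand total and average salary listed on every detail row
SELECT employee_id, employee_name, department_name, salary,
       SUM(salary) OVER () AS total_salary,
       AVG(salary) OVER () AS average_salary
FROM   employee;

-- Per-department totals while keeping every employee row
SELECT employee_name, department_name, salary,
       SUM(salary) OVER (PARTITION BY department_name) AS total_salary_per_dept,
       AVG(salary) OVER (PARTITION BY department_name) AS avg_salary_per_dept,
       COUNT(*)    OVER (PARTITION BY department_name) AS employees_per_dept
FROM   employee;

-- Per-customer payment totals for amounts over $5 (Mary and Barbara)
SELECT c.first_name,
       MONTH(p.payment_date) AS paid_month,
       p.amount,
       SUM(p.amount) OVER (PARTITION BY p.customer_id) AS total_payment
FROM   payment p
JOIN   customer c ON c.customer_id = p.customer_id
WHERE  p.amount > 5
  AND  c.first_name IN ('MARY', 'BARBARA');
```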
35. Window Aggregate Function Part2: Now let's take a scenario where we want to calculate the running sum of the paid amounts per month by customer, where each transaction is greater than $5, for Mary and Barbara. The script remains the same; the only portion that changes is that we add the ORDER BY paid month clause after the PARTITION BY customer ID. What this does is further segment the data by month after it has already been grouped by customer ID. Let's run this. If you look at Mary's transactions, we have months 6, 7, and 8, so the transactions are split across three months. For the month of June, this windowing function added up the two June transactions. Then it added the transactions for June plus July, so $35.95 is the sum of all the June and July transactions. The last row for Mary, for the month of August, shows $41.94: it adds the $5.99 transaction to the $35.95 we calculated in the previous rows, so $41.94 is the total of these amounts across all the months. It's a running total, as it keeps adding the transaction amounts as the current row moves down. If you don't specify ROWS or RANGE and give a particular boundary, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Let's try to understand this through a visual aid. When we use the windowing function and ORDER BY payment month, if we don't use any range or set any boundaries, by default MySQL processes this as a RANGE clause and does BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Look at the result set: if the ORDER BY value is the same, in this case the month of July where we have three transactions, RANGE calculates the aggregation for that block of data together, and that's why you see $35.95 repeated on each row for the month of July, as the calculation happens on that logical group. Hopefully this visual helps you differentiate between a ROWS and a RANGE calculation. Now let's update the previous script and add another windowing function where we explicitly say ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW after the ORDER BY clause, and leave the other running total payment range as is. Let's run this. The running total payment with ROWS adds the amounts row by row: each physical row gets added to the total amount, and it does not matter if the month is the same. But if you compare it to the RANGE running total, when RANGE sees the month being the same, it adds the total amounts and gives the same amount on those two rows; then it comes to the month of July and adds all the July amounts to the previous total. The difference between ROWS and RANGE is that with ROWS, the ORDER BY clause does not impact the row calculation, and the aggregation basically works on each row as it goes through. RANGE, on the other hand, logically groups based on the ORDER BY and calculates the total sum based on the ORDER BY column. Let's look at the ROWS calculation through a visual aid. The moment we use the ROWS keyword, the framing happens at each row. As rows are added, the frame gets bigger; as you can see from the visual on the right side, frame one through frame six is an increment from the first row to the last row. As the frame gets bigger, the running sum gets added as the current row moves down to the last row of Mary's transactions, and likewise with Barbara.
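A minimal sketch of the running-total comparison just described, on the same assumed customer/payment join; the default RANGE frame and the explicit ROWS frame are shown side by side.

```sql
SELECT c.first_name,
       MONTH(p.payment_date) AS paid_month,
       p.amount,
       -- Default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):
       -- rows sharing the same month show the same running total
       SUM(p.amount) OVER (PARTITION BY p.customer_id
                           ORDER BY MONTH(p.payment_date)) AS running_total_range,
       -- Explicit ROWS frame: the total grows one physical row at a time
       SUM(p.amount) OVER (PARTITION BY p.customer_id
                           ORDER BY MONTH(p.payment_date)
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND CURRENT ROW)       AS running_total_rows
FROM   payment p
JOIN   customer c ON c.customer_id = p.customer_id
WHERE  p.amount > 5
  AND  c.first_name IN ('MARY', 'BARBARA');
```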
Now let's say we want to calculate a moving sum, meaning a sliding window that goes through our set of records. We'll keep the rest of the script the same: we still use the SUM of the amount, the OVER clause, PARTITION BY customer ID, and the ORDER BY month remains the same. That information we're not changing; we're going to add ROWS BETWEEN and give a boundary of 1 PRECEDING AND CURRENT ROW. Likewise, we can also add RANGE BETWEEN 1 PRECEDING AND CURRENT ROW and see how the calculations differ between ROWS and RANGE. Let's run the script. If you look at our ROWS column, the moving total payment with ROWS, we have a one-row-prior sliding window. We start with $5.99: there's no prior row, so it just presents $5.99. As you go to the next row, it adds one previous amount plus this amount and shows $15.98. As it moves forward, it only ever adds one previous amount; if you look at the last row, the reason we have $13.98 is because it added $5.99 to $7.99. So it's a sliding window of one prior row, by using ROWS BETWEEN 1 PRECEDING AND CURRENT ROW. If you look at the RANGE calculation, it again summed up the two amounts based on the payment month: for the month of June we have $15.98. As it went through the next month, it added the July payments to the June payments, but it repeats that value for the same month. If you look at the last row of Mary and look at the moving total payment range, it's $25.96. The reason it is $25.96 is because it added $5.99 to the three transactions in July. So knowing the difference between ROWS and RANGE calculations becomes really helpful as you start doing your analysis. Let's look at the sliding effect using the ROWS clause through a visual aid. This slide is showing the moving total payment for each customer by month, and we were able to achieve that by setting up a boundary and saying ROWS BETWEEN 1 PRECEDING AND CURRENT ROW. If you look at the framing, it basically has two rows as it moves through the data set; it's called a sliding window. If we had used BETWEEN 2 PRECEDING AND CURRENT ROW, the window would have had at most three records as it moved through the table.
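And a minimal sketch of the sliding-window (moving sum) variant, again on the assumed customer/payment join; note that the numeric RANGE offset is valid here because the ORDER BY expression is numeric.

```sql
SELECT c.first_name,
       MONTH(p.payment_date) AS paid_month,
       p.amount,
       -- Physical window: the current row plus the one row before it
       SUM(p.amount) OVER (PARTITION BY p.customer_id
                           ORDER BY MONTH(p.payment_date)
                           ROWS BETWEEN 1 PRECEDING
                                    AND CURRENT ROW)  AS moving_total_rows,
       -- Logical window: the current month plus the previous month's rows
       SUM(p.amount) OVER (PARTITION BY p.customer_id
                           ORDER BY MONTH(p.payment_date)
                           RANGE BETWEEN 1 PRECEDING
                                     AND CURRENT ROW) AS moving_total_range
FROM   payment p
JOIN   customer c ON c.customer_id = p.customer_id
WHERE  p.amount > 5
  AND  c.first_name IN ('MARY', 'BARBARA');
```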
36. Window Ranking Function: Ranking is another category of windowing functions that gets used a lot. Rank functions can be used to calculate the top number of records in a data set. RANK assigns the same rank to equal rows but leaves gaps, versus DENSE_RANK, which has no gaps, hence the term dense. NTILE slices the data into the given number of subdivisions, and ROW_NUMBER just gives a sequential number to each physical row in your partition. Let's see these functions in practice; I really like rank functions. We're going to be using the employee table for running our rank examples. I want to update one of the salary records, for employee five, so we can see the difference between RANK and DENSE_RANK; please run that update statement. Great. For script one, we want to calculate the RANK and DENSE_RANK on the salary of the employees. We use the keyword RANK, open and close brackets, OVER, PARTITION BY department name, so for every department I want to rank the salary. Same thing with DENSE_RANK: DENSE underscore RANK, open and close brackets, OVER, PARTITION BY department, and ORDER BY salary, so I want the ranking by salary within a department. Let's run this. You can see HR gets ranks 1, 2, 3 and dense ranks 1, 2, 3, so far RANK and DENSE_RANK are giving us the same ranks. Let's go to the production department: the first row of production gets both ranks as one, and the second and third get two, because the salary is the same. But for the fourth record, the RANK function gives four and DENSE_RANK gives three. So here you can see there's a gap in the RANK function, it's skipping three, but DENSE_RANK does not skip and there are no gaps when using it: it systematically says 1, 2, 2, 3, and 4, while RANK says 1, 2, 2, 4, and 5. I prefer using DENSE_RANK for this reason. Now let's say we want to calculate the top five employees by salary. What we can do is apply the DENSE_RANK function, use the OVER clause, and just ORDER BY salary descending, so we want the highest salary first and the lowest at the bottom, and we're not partitioning by department right now because we just want the top five salaries regardless of the department. We're using an inline SQL query here: if you look at the brackets, this particular SELECT gives you a dense rank for the salaries, starting at one for the highest and going all the way down to eight. But we're looking for the top five employees, so we need to filter this data further. To filter the SQL inside, we wrap it in another SQL query outside, keep the inner SQL in closed brackets, and give it an alias. The whole script inside gets an alias of x, so we say SELECT * FROM the inner SQL, this x, WHERE salary dense rank is less than or equal to five, which filters these records where the salary dense rank is less than or equal to five. If we run that, we should get the top five employees by salary. There you go. It's a handy way of getting a top N number of people within a group based on a particular metric.
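A minimal sketch of the rank comparison and the top-five filter from script one above, on the same assumed employee table; the alias x and the label names are illustrative.

```sql
-- Script 1: RANK vs DENSE_RANK on salary within each department
SELECT employee_name, department_name, salary,
       RANK()       OVER (PARTITION BY department_name ORDER BY salary) AS salary_rank,
       DENSE_RANK() OVER (PARTITION BY department_name ORDER BY salary) AS salary_dense_rank
FROM   employee;

-- Top five employees by salary, regardless of department
SELECT *
FROM  (SELECT employee_name, salary,
              DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_dense_rank
       FROM   employee) x
WHERE  x.salary_dense_rank <= 5;
```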
Now let's look at script two. We want to use the NTILE function to split the payment data of the customers into four groups. How are we going to do that? We use the NTILE function with an open bracket and four; basically, this tells it how many slices of the data you need. Then OVER, which initiates the window function, ORDER BY the amount descending, AS quartile, and we join the payment and the customer tables on customer ID. So, basically, we want to see the amounts paid by the customers and then group them into four distinct groups. Let's run this. If you look at the quartile column, it's assigning these rows group one; if you scroll down the records, we have some records assigned as two, some as three, and then four. So this is an easy way to categorize the records in your table into a particular group, using the NTILE function and giving the number of groups you need in the parentheses. In script three, we're going to calculate the top-earning employee per department: for every department, I want the highest-paid employee. We're going to use the ROW_NUMBER function, partition the data by department name, and then order the salary descending. ROW_NUMBER basically just assigns a number to each physical row. If we run the script highlighted on the screen, we get a sequential number assigned to each row in our employee table; for the HR department, we're getting 1, 2, 3, and the ordering is based on salary descending. The way we can filter this and only bring back records where the employee rank equals one is to have an outer SQL: SELECT * FROM open and close brackets, where we have the inline SQL given an alias x, WHERE employee rank equals one. Let's run this. This gives us the highest-paid employee in each department based on the salary. Hopefully, these examples help you understand the ranking functions.
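And a minimal sketch of the NTILE and ROW_NUMBER scripts, assuming the same payment/customer join and employee table as before.

```sql
-- Script 2: split customer payments into four groups (quartiles) by amount
SELECT c.first_name,
       p.amount,
       NTILE(4) OVER (ORDER BY p.amount DESC) AS quartile
FROM   payment p
JOIN   customer c ON c.customer_id = p.customer_id;

-- Script 3: highest-paid employee per department via ROW_NUMBER
SELECT *
FROM  (SELECT employee_name, department_name, salary,
              ROW_NUMBER() OVER (PARTITION BY department_name
                                 ORDER BY salary DESC) AS employee_rank
       FROM   employee) x
WHERE  x.employee_rank = 1;
```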
37. Window Positional Function: Positional window functions help you navigate data by its location in the result set and are helpful in business reporting. The LAG and LEAD functions let you access the data before the current row or after the current row, depending on the offset you give. FIRST_VALUE and LAST_VALUE give you either the first or the last value in an ordered set. Let's look at these functions in practice. What script one is doing is calculating the first and the last value of the employee ID, partitioned by department name. It's a pretty simple script: it gives you the first employee and the last employee. So from the first row we're getting the first employee here, and from the last row we're getting the last employee here, which is Grace Lee. Pretty simple, straightforward function. Let's run script two, which calculates the LAG and the LEAD function on the employee ID, partitioned by department name and ordered by ID. Let's run that. If you look at the HR department, we have three employees. For the first employee, there's no previous row. If you don't give an offset or a default value for LAG and LEAD, the offset defaults to one record: LAG is one record before, and LEAD is one record after. So the next row, using the LEAD function, is going to show two and three from the current rows. For the third employee in this particular partition, there's no next record, so it's going to be NULL. But if you use an offset for the LEAD function, like in this case I've used two, that means the function fetches the record from two rows down. So when you're on the first row, on employee one, two records down is Grace Lee, so we're getting three here for the LEAD with an offset of two. But there are no records two rows down from the second and third rows, and that's why we're getting NULL for those. Using the LEAD and LAG offsets, you can really navigate the data that you have in the table within the partition, and for every new partition, the window function starts again.
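A minimal sketch of the positional functions from these two scripts, on the same assumed employee table; the explicit frame on LAST_VALUE is an assumption so that it returns the last row of the whole partition rather than of the default frame.

```sql
SELECT employee_id,
       employee_name,
       department_name,
       -- First and last employee ID within each department
       FIRST_VALUE(employee_id) OVER (PARTITION BY department_name
                                      ORDER BY employee_id) AS first_employee,
       LAST_VALUE(employee_id)  OVER (PARTITION BY department_name
                                      ORDER BY employee_id
                                      ROWS BETWEEN UNBOUNDED PRECEDING
                                               AND UNBOUNDED FOLLOWING) AS last_employee,
       -- One row before / one row after the current row (default offset of 1)
       LAG(employee_id)     OVER (PARTITION BY department_name
                                  ORDER BY employee_id) AS prev_employee,
       LEAD(employee_id)    OVER (PARTITION BY department_name
                                  ORDER BY employee_id) AS next_employee,
       -- Two rows after the current row; NULL when no such row exists
       LEAD(employee_id, 2) OVER (PARTITION BY department_name
                                  ORDER BY employee_id) AS two_rows_ahead
FROM   employee;
```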
38. Introduction to Data Visualization: One of the reasons why I added the data visualization
section to this course is that
as a data analyst, you would be asked to present your data findings in very concise and impactful ways so the business can understand the value of your analysis, along with the key metrics that
impact their business. Think of data visualization
as a tool for communication. One of the best ways to
explore and try to understand large datasets is through
data visualization. The concept of
storytelling is how we set the stage to
showcase our data. So it can be understood
without going into too much technical detail. And with data volumes growing, businesses are leaning more and more towards leveraging data visualization to better educate themselves on
discovering patterns, spotting trends,
and highlighting information where there are
opportunities to improve. The goal of all
data-centric organizations is to make better
data-driven decisions, which would lead
to better insight and that would ultimately
drive business. Aesthetics of data visualization
becomes very important as we start using visualization to
communicate the message. Aesthetics describe every aspect of a given graphical element: position describes where the element is located, so placement becomes important as you showcase your visualization; shape can be used to show different datasets; size can show magnitude; color can show intensity; and lines with varying width and type can show patterns. So all these small
intricacies can help your data visualization
be more impactful. And we'll see a lot of
examples in the other slides. Once we get our
hands on a dataset, the first step is to understand the type of data values we want to visualize. The two categories into which data values can be split are categorical and numerical. Categorical data is more qualitative and is further split into either nominal or ordinal. Nominal categories don't have a particular order and are used more for labeling purposes. Examples are either
eye color or gender. So for instance, gender might
have values of m as male, and then F for female. Ordinal values are more ordered. Ordinal data, as
the name suggests, has a certain order to it. So a customer survey might have data values of one being satisfied, two being neutral, and three being unsatisfied. The numerical category is
more quantitative. It can be further split into either discrete or continuous. Quantitative values
can be measured. A discrete dataset, for instance, can take only certain values. For example, a room can hold either six or seven people, but not 5.5. Other examples are the number of sides on a die and the number of pages in a book. A continuous dataset is where the data can be split into ever finer slices, so it's more or less infinite. Continuous data values are those you can divide into finer and finer increments. For instance, with temperature or a weighing scale, you can further subdivide the measurement into smaller units, so it's more or less infinite, it's continuous. The distinction of data into these groups
will become really, really important as we start to use different types
of charts and graphs, which we'll see in
the later slides. Hopefully this visual will help you understand how we need to think of data before we start putting the data into
a more visual form. You might be faced with
a particular situation where you need to present
your data based on a specific requirement that
can either be categorized as comparing the data values or showing composition
of your dataset, or presenting how data is
distributed or how it is related. For each of these categories, we have a set of charts
and graphs that can be used to deliver the
desired results. I'm showing a condensed list of chart types within these
categories on this slide. We will go through each of
these chart types in detail in the later slides. I've attached to the lecture notes a chart selection diagram created by Dr. Andrew Abella that is pretty elaborate
and will help you pick the right chart
for your data type. An important question you can ask before selecting a visualization is, how many variables do you want to show in a single chart? Is it one, two, three, or more? How many data points do you want to display for each variable? And will you display values over a period of time, or among items or groups? The answers to these questions
will help you pick the right chart for
your visualization.
39. Tableau Software and Datasets: We will use the Tableau software for our data visualization course. There are many other data visualization tools in the market, but Tableau is by far one of the leaders. If you follow the link given in the slide, it will take you to the public version of Tableau, which is free of cost. You can download it either for Windows or for Mac. The only downside is that you cannot save your visualizations locally; you can only save them online on Tableau's public portal. So this is only good for testing or university-based assignments; you don't want to share sensitive data on a public website. We're going to first install Tableau on macOS: just give an email address, and we have successfully installed Tableau on macOS. Let's install the Tableau public version on Windows. Great, our software has been installed. We have two datasets we're going to use during the course: one is the world indicators dataset, and the other is the sample superstore dataset. You have no dimensions or measures yet, and pretty much a blank canvas at this point. So what you want to do is click on Connect to Data to get those two datasets loaded. We'll go to Excel, because we have those datasets in Excel, and we need to link to these two datasets. We'll go first to the World Bank CO2 Excel sheet and load it. Once you're connected to the Excel sheet, it's going to show you the tabs that you have in your Excel worksheet. We're going to click on the CO2 data clean tab and double click that. Great, so we have successfully connected to the world CO2 Excel sheet. You have your country code, country name, region, year, your CO2 emission, and your per capita information. Now if you click on sheet one, we should have the dimensions and measures loaded. Great, so we have those too. Now we need to connect to the second dataset: click on the data source here, click on this icon, and we're going to connect to the Excel sample superstore file. We'll be able to look at the tabs that are in that Excel sheet, and we'll double click on the orders tab; this is the dataset that we're going to be connecting to, the order data. Let's go to sheet one. Now you see we have the CO2 data connected and also the superstore data connected as well, so you'll see dimensions and measures for the superstore dataset, and if you click on the world CO2 dataset you'll see the dimensions and measures related to that dataset. As an exercise, you can connect to the world indicators dataset on your own. Great, so we have the data now.
40. BAR Chart Visualization: Bar charts are one of the most commonly used visualizations; they show numeric values as bars split across clear categories. They're effective in comparing magnitudes and discovering highs and lows in the data, which are shown by the height of the bars. A bar chart is sometimes called a column chart. You can show the data on either vertical or horizontal bars: if you have more categories or labels, at times it becomes harder to show them vertically, and we have the option to show them horizontally. The X axis shows the dimensions and the Y axis shows the measures. Bar charts show discrete data that is categorical in nature and therefore answer the question: how many within each of these categories? Let's take a look at the anatomy of the bar chart. If you look at the diagram, the horizontal axis has the categories or dimensions, and the vertical axis has the measures. The height of the bar shows the magnitude, the width of the bar is the same across, and you read from left to right. Each bar shows a particular category. This visualization is very easy to understand, and it becomes very impactful when you can see the variations in the height of the bars. We have another variation of the traditional bar chart called the stacked bar chart. Here you see the same bar chart, and the only difference is that each bar is further segmented into subcategories. This becomes effective when you have a hierarchy in the category: let's say you have a region, and the region is split further into two to three different subregions. So this is an effective way to show those subregions within a bar. A stacked bar chart shows finer details of the dataset, as the data can be seen split into further categories. Let's go to Tableau and create these charts. For this exercise, we will connect to the world indicators dataset: click on Microsoft Excel, and you need to open the world indicators dataset. So it's connecting now. What Tableau does is let you see the data once you connect to it, in a browser format, so you can see what data you have. You can also improve the quality of the data if you want to change a particular value once you see it, but for us, we're going to use the data as is. Let's go to sheet one. Now you can see that the dimensions and measures have been filled in from the dataset. Our first task is to show how the inbound tourism dollar amount is increasing per year in our dataset, so double click on the year; we have data from 2000 to 2012. Double click tourism inbound, which is a measure, go to Show Me, and select the horizontal bars, so you can see that the inbound tourism has been increasing worldwide. You can also switch the chart from horizontal to vertical by clicking on this icon. Great, so we have our first chart. If you look at this plus sign, we can create a new sheet. We'll do the same thing: we'll double click on the year, double click on the tourism inbound, and then double click on the region. Now in Show Me we need to click on the stacked bar chart. Great, so now you can see we're able to create a stacked bar chart where each bar shows the total inbound dollar amount spent for each segment, and in this case our segment is the region. You can see Europe is leading in each of these years in tourism; Africa is increasing over the years, but compared to Europe it is still far less. Great, so this is a stacked bar chart. We'll create a third chart.
You can create a duplicate version of an existing sheet by clicking on the duplicate option in the menu, and this will be a side-by-side bar chart. I really like this visualization. What it's showing is that each region is shown in a column, and you have the number of years of inbound dollar amount data showing. What we can do is, instead of the year being on the color palette, we can replace that by region, so each region has its own color. Let's try to add a dashboard. The dashboard is this icon right here; you can combine multiple worksheets in a single dashboard. We can drag the sheets onto the pane, and you'll see how the dashboard puts the sheets side by side. This really becomes a very powerful communication tool in your presentations. We will make a few modifications to the dashboard. First, we will rename the worksheets to have more descriptive names that go along with the charts they represent. Second, on our sheet one, we will change the color of the bar chart to a more neutral tone. Third, we will adjust the size of the dashboard appropriately. Let's go make those updates: we'll click on the color and just pick a gray. Great. I have renamed the three sheets with the desired names; you can do that on your side as well, by right clicking on the sheet you want to change the name for and selecting rename from the menu. So you have your layout and your dashboard, then you have the size. It's an automatic size, but you can increase or decrease the width, and you can decrease the height if you want, so this gives you an easy way to minimize or expand the size. Now you can see these three visualizations in a dashboard, which really gives you a compelling story. My favorite is the total inbound tourism per region detail, which basically gives you each region and the way the inbound tourism dollar amount is increasing; you can see Asia has had the sharpest increase over the years, and also that Europe is leading by far. The total inbound tourism chart gives us a total picture of how the tourism dollar amount is increasing every year, and the stacked bar chart gives you how each segment within the year compares for each region. So this gives you an overall dashboard that can be used to show different perspectives of how the inbound tourism in the world compares among different regions. Hopefully, this was helpful.
41. Line Charts: Line charts are like time series graphs; they're used to visualize data trends over an interval of time. The data points on the chart are connected through a straight line, showing numeric values plotted as lines over date-related fields. The difference between the bar chart and the line chart is that the line chart needs to include a time dimension. We will also be using a stacked area chart, which can also show time series data; its main purpose is to show individual parts and how they make up the whole or the total. Let's see these charts in action using Tableau. We will be connecting to the world indicators dataset, so click on Microsoft Excel and open the dataset. Great, we got connected. Let's go to sheet one. All right, so our dimensions and measures have been loaded. What we need to calculate is the total CO2 emission over the period of time and show which region is showing a spike in the CO2 emission and where we have a dip. Let's add the dimension year by double clicking on it, add the measure CO2 emission by double clicking on that as well, and then double click on the region. So you see we have a tabulated form of data here, and it's really hard to see a dip or a peak in the CO2 based on the region; we really need a visualization on this data to make sense of it quickly. So click on the Show Me drop down and pick the line chart. This line chart really becomes an impactful communication tool, where you can see Asia by far has a peak: the CO2 emissions have been going up quite significantly from 2002 onwards for the region of Asia compared to the other regions. Let's create a new worksheet for our stacked area chart. Click on the plus sign, and we'll pick the same dimensions and measures: double click on year, double click on CO2 emission, and then double click on the region. Click on the Show Me drop down and pick the area chart. Great, so the area chart is showing the total summed-up CO2 emission in the world, and it's doing it by stacking each region on top of the others. This chart becomes helpful for seeing the bigger picture. Now let's put these two charts on our dashboard by clicking the plus icon; we can just drag and drop these two sheets onto the pane. Great. You can change the size of the dashboard based on your own need; we'll just expand it so it fills up the screen. Let's rename these two worksheets, do the same for the next sheet, and add a title. Great, now we have our dashboard ready to be presented.
42. Histogram: A histogram shows the frequency distribution of your continuous data. It divides the dataset into clear groups called bins. The frequency of the bin data becomes the measure of the histogram and gets calculated automatically. A histogram uses one field that is numeric, which gets split into bins. Histograms can also be useful to visualize the outliers in your dataset. In our past visualizations with bar charts, the columns showed a categorical variable; in a histogram, the columns show a quantitative variable. If you look at the anatomy of the histogram, on the X axis you have fixed-size bins where your numerical data gets grouped, and there are no gaps between the bars, as the values are continuous. On the Y axis, we have the frequency of the bin data, and the height of the bar shows how many records you have per bin, so it shows the magnitude. We will be creating two histograms, one for sales data and the other for profit, from our superstore dataset. Let's see these in practice. Let's connect to the superstore dataset and bring the orders to the pane. Great, so we have the data. Let's go to sheet one. We have to look at the sales measure and find the frequency of the distribution for this dataset. So we need to create bins of the sales data and figure out what size of bins we need; for this exercise, I'm going to pick a bin size of 10. Click on the sales measure, click on the drop down (there's a small icon on the right side), click on Create, and then click on Bins. There's a default bin size; you can go ahead and override it. The moment you create a bin for the sales data, the bin field goes into the dimensions. We need to change the type of this data to continuous: you can click on the drop down and convert it to continuous. Double click on the sales bin. Great, so you got that on the columns. On the rows, we need to count the number of records we have for the sales data, so we'll just write count, bracket, number of records, close bracket, and hit enter. Great, so we got our histogram. Now we need to filter out some of the sales data. You can do this by dragging the sales to the filter pane, and you'll get this window; click all values. You can reduce the maximum value to about 600: you can either type it in the window or just drag this slider, then hit apply. Great, so now we can see our histogram, and as you can see, the majority of the concentration of the values is between 0 to 50. Look at the tallest bin; that has the maximum number of records in it. What we can do is label it: just do a right click and mark the label, so it shows you the number of records we have in that bin. Great. Let's create a new worksheet. Now we'll create the histogram with the profit measure. We'll again pick the bin size for profit; I will pick 100 as the bin size. Now you need to change the bin into a continuous variable and double click on it so it shows on the columns, and on the rows you need to do a count of the number of records. Great, so we got our profit histogram. Now we need to filter the profit between negative 1000 and positive 1000. You can do this by moving the profit to the filter pane, and a window will open; click all values. Now you can give a range between negative 1000 and positive 1000, type it in the small window, and hit apply.
Now we can see the majority of the transactions are between $0 and $200, and you can label the bar with the highest number of records. As you move to the right and the left of the highest bar, you'll see that the number of records in those bins goes down quite a bit. Let's add these two histograms to our dashboard by clicking the plus sign. You can just double click or drop those two sheets onto the pane. You can rename both the worksheets with more appropriate labels, and whenever you are showing a histogram, always try to mention the bin size on your diagram. We can rename the sheet as well by right clicking and adding a title. You can adjust the size of your dashboard as well. Great, there you have your dashboard with the selected histograms.
43. Scatter Plot: Scatter plots show the relationship between two measures, one plotted on the X axis and the other on the Y axis. Scatter plots can show a large number of data points and are well suited to driving actionable insight. Unlike bar graphs, which use length as a visual cue, scatter plots use position on the X and Y coordinates. Scatter plots can help show correlation between variables, positive and negative, and can depict clusters within your data points. If you look at the anatomy of the scatter plot, it is basically made up of an X and Y coordinate, and both axes have measures listed on them. This type of visualization helps you capture a lot of data points in a very efficient manner. In our lab we'll be creating a scatter plot between the profit and sales data of the superstore dataset, and also identifying clusters within this dataset by color coding it. It will be fun. This slide shows two plots: one showing a positive correlation between birth rate and infant mortality, meaning if one variable goes up, the other goes up as well, hence positive. The other shows a negative correlation between life expectancy of females and birth rate: as one goes up, the other comes down, hence negative. These correlation scatter plots will be added at a later time, and you can do this on your own and see if you get the desired results. For now, we'll go to the lab and create a scatter plot for our superstore dataset. Let's connect to the superstore dataset, click on the orders tab, and click on the new worksheet. All right, so we have to get two measures, profit and sales, on the pane; double click on those two. Now you see that we have a single aggregated point where all the profit and sales have been aggregated. We need to turn off this aggregation by going to Analysis and unchecking the Aggregate Measures selection. Great, so now we have a spread of all our profit and sales data in a scatter plot. As you can see, we're able to show a lot of data points. Our end goal is to find the clusters of customers we have in our dataset. We can drag the customer name onto the detail marks card. Now what we need to do is go to the analytics pane, where under the model section you have the cluster option; drag that and drop it on the pane. Great, you can see Tableau, based on its k-means algorithm, has clustered our dataset into three distinctive clusters, clusters 1, 2, and 3, and they are color coded for distinction. In cluster one, as you can see, we have a lot of customers, but the profit is minimal. In cluster two, you can see we have customers where we're making a lot of profit, much better than cluster one, so that's definitely one cluster you want to focus on for your sales. Cluster three is where we have very few customers, but on each customer we're making a much larger profit margin. We can also move the cluster from the marks card to the dimensions, and what that does is create a customer group that can be used later in another visualization. Let's create a new worksheet. Let's bring in latitude, longitude, and the postal code for the customers; this will show us where all our customers are in North America. Now what we need to do is drag the customer group that we made from the clustering into the filter and select which cluster we want to see on our map. Now you can see we can select clusters, and the map gets updated based on the cluster we pick. So if you want to do targeted marketing based on a certain cluster, we can see where those customers are.
You can drag the customer group to the color marks card. Looking at these clusters on a map really gives a powerful visualization for us to do our targeted marketing, seeing which cluster has a better impact and which location needs to be focused on more. Let's add our scatter plot to the dashboard and rename it.
44. PIE Chart: People either love pie charts or just hate them; I'm not too fond of them. Pie charts are good to use when you have few categories in your dataset. They help in showing how one category relates to the others in proportion; I would say three to six categories would be appropriate. Pie charts can show either nominal or ordinal data. If you look at the anatomy of the pie chart, each category has a certain percentage, and they add up to 100%. You can call each portion or category of the pie a slice or a wedge. If you show more than six categories, the pie chart becomes crowded and it will be harder to convey the message, so please keep the categories under six. We will be creating a pie chart of the profits from our superstore dataset per region. This visualization shows us sales data per product category within each state of the USA on a map; you can try to create this visualization on your own and see if you get the desired results. Let's try to create a pie chart in Tableau. Let's connect to our superstore dataset, select the orders data, and click on sheet one. Double click profit from measures and double click region from dimensions. Go to Show Me and click on the pie chart. Let's try to fit this on the screen. Great. Now what we need to do is label each portion or wedge: bring the region and profit to the label marks card. Great, we have our labeling done. Let's label the sheet as well. Let's increase the size of the font and make it bold. We'll do the same for the region and the profit: format it, increase the font, and embolden it as well. Let's add it to the dashboard and adjust the size. We will remove the total profit. Great, we have our pie chart visualization done.