Beginners Data Analysis Bootcamp with SQL | AMG Inc | Skillshare

Playback Speed


  • 0.5x
  • 1x (Normal)
  • 1.25x
  • 1.5x
  • 2x

Beginners Data Analysis Bootcamp with SQL

teacher avatar AMG Inc, Technologist

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more

Lessons in This Class

44 Lessons (3h 36m)
    • 1. Course Introduction

      2:41
    • 2. Introduction to Databases

      8:54
    • 3. MYSQL Windows Install

      3:59
    • 4. MYSQL Macos Install

      3:54
    • 5. MYSQL Cloud Install

      5:58
    • 6. MYSQL GUI INTRO

      3:19
    • 7. Create Database and Load Data

      4:21
    • 8. Introduction to Data Modeling Part1

      11:16
    • 9. Introduction to Data Modeling Part2

      16:44
    • 10. Introduction to SQL

      2:22
    • 11. DDL

      5:05
    • 12. DML

      4:23
    • 13. DQL1

      4:57
    • 14. DQL2

      5:12
    • 15. Inner Join

      7:36
    • 16. Left Join

      4:03
    • 17. Right Join

      1:00
    • 18. Cross Join

      2:27
    • 19. Self Join

      3:17
    • 20. Intersect

      3:04
    • 21. Minus

      0:56
    • 22. Union

      3:24
    • 23. Introduction to Aggregate Functions

      3:02
    • 24. Groupby Part1

      3:20
    • 25. Groupby Part2

      1:37
    • 26. Introduction to Subqueries

      6:01
    • 27. Date Functions Introduction

      2:15
    • 28. Date Functions Part1

      6:42
    • 29. Date Functions Part2

      5:54
    • 30. Numeric Functions

      3:02
    • 31. Introduction to String Functions

      12:18
    • 32. Conditional Statements

      3:16
    • 33. Introduction to Window Functions

      3:13
    • 34. Window Aggregate Function Part1

      5:57
    • 35. Window Aggregate Function Part2

      7:41
    • 36. Window Ranking Function

      6:15
    • 37. Window Positional Function

      2:26
    • 38. Introduction to Data Visualization

      5:19
    • 39. Tableau Software and Datasets

      4:02
    • 40. BAR Chart Visualization

      7:47
    • 41. Line Charts

      3:20
    • 42. Histogram

      5:58
    • 43. Scatter Plot

      4:53
    • 44. PIE Chart

      2:32
  • --
  • Beginner level
  • Intermediate level
  • Advanced level
  • All levels

Community Generated

The level is determined by a majority opinion of students who have reviewed this class. The teacher's recommendation is shown until at least 5 student responses are collected.

1,653

Students

--

Projects

About This Class

Learn the four key pillars of Data Analysis:  Databases , Data Modeling , SQL and Data Visualization.

This course will teach you the four building blocks of data analysis in a very structured way.

* We will first learn about databases and database management systems with the focus of data accessibility. Our preferred DBMS will be MYSQL.

* We will then learn the foundation of Data modeling for relational DBMS and gain knowledge about how data is related and how the design of data models help in capturing business use cases.

* We will take a deep dive in learning SQL; from writing simple scripts to advance analytical functions to gain insight.

* The last section will focus on data visualization and we will use Tableau software as our data visualization tool with many examples to showcase our story telling skills.

By taking this course you will gain experience in designing data solutions which will make you more marketable and competitive in the job market.

I will try my best to keep this course up to date by adding more relevant lectures and lab exercises.

Meet Your Teacher

Teacher Profile Image

AMG Inc

Technologist

Teacher

Our company goal is to produce best online courses in Data Science and Cloud Computing technologies. Below is our team of experienced instructors.

Instructor1

Adnan Shaikh has been in the information technology industry for the past 18 years and has worked for the fortune 500 companies in North America. He has held roles from Data Analyst , Systems Analyst, Data Modeler, Data Manager to Data Architect with various technology companies.

He has worked with Oracle , SQL Server, DB2 , MySql and many other relational and non relational databases over the course of his career. He has written multiple data access applications using complex SQL queries and Stored Procedures for critical production systems.

He has a masters degree from Northwestern University of Chica... See full profile

Class Ratings

Expectations Met?
    Exceeded!
  • 0%
  • Yes
  • 0%
  • Somewhat
  • 0%
  • Not really
  • 0%
Reviews Archive

In October 2018, we updated our review system to improve the way we collect feedback. Below are the reviews written before that update.

Why Join Skillshare?

Take award-winning Skillshare Original Classes

Each class has short lessons, hands-on projects

Your membership supports Skillshare teachers

Learn From Anywhere

Take classes on the go with the Skillshare app. Stream or download to watch on the plane, the subway, or wherever you learn best.

Transcripts

1. Course Introduction: Hello and welcome to the beginner's data analysis Boot camp with sequel course. I'm a non shake. I have been consulting for the Fortune 500 companies for the past 18 years and have successfully implemented multiple data projects. I was fortunate enough to go to one of the best schools in my field. I have taught multiple I T courses to professionals. I'm excited to launch this new comprehensive course on data analysis. You don't need to have a programming experience to enroll in this class. You can be from any walks off life and will highly benefit from this course. I will explain each topic in detail in my BDO lectures and have extensive live exercises with multiple quizzes to make sure you get a complete learning experience. If you're not satisfied with this class, you have a 30 day money back guarantee. We will learn the four key pillars of data analysis that will become your foundation in understanding how to work with data effectively. Having a basic understanding of how databases work is very important. We will be using my sequel as a preferred database system for this class. My sequel is one of the leading databases for the Fortune 500 companies. Data modelling how data is related to each other and understanding the different types of relationships between data becomes critical. As we dig deeper in driving value for our business data Insight, we will learn sequel language in detail way. Have the lab exercises to access the data in our database way will start from the basic operations and then get to the Advance analytics to address multiple use cases. Data visualization. We will learn the skills to tell a compelling story off our data by creating multiple visualizations using tableau, which will be our preferred data visualization tool. My goal is to make sure you learn these data skills in a way that can help you apply them in your professional life. Get the salary raise that you were looking for. Make your resume stand out and if need be, make a shift to the deep analysis field. I'm really excited about teaching this course to you, and I hope you join me in this learning journey 2. Introduction to Databases: What our databases. So rather than getting into the technical details, let's, let's try to understand how we come across databases in our day-to-day lives. So if you're searching the web, anything you write, or if you're searching for a particular keyword, any results set that you get back. That results set is coming back from a database. If you go to an ATM machine and key in your passcode, when you swipe your ATM debit card, information gets checked against the database. Anytime you log into a web page with your username password, the username password gets checked against a database. Anytime you pull up a bank statement, that bank statement is created from the data that is stored in the database. So that's kinda the way we interact with databases. So what you see on the slide areas that I haven't Excel worksheets. So a lot of folks are still saving data. In Excel worksheets. You keep adding to your dataset. You keep adding rows. So it's like a row and columns. And that's great. And for a lot of smaller companies, maybe Excel might be a solution. And I see people just saving data in text files. And companies are producing data 24, 7. And with the age of internet, data has grown astronomically. So what is the problem that we see here? Like what happens in your Excel when the data grows from a 100 lines to 2 million records, how would excel behave? What it crash? Or do you have to split the files and how do you link the files back together? So definitely there's going to be a problem there. Insuring their security and reliability of your data in this setup will also become a problem. Database software systems were designed to solve a problem. And I've just listed a few of the key issues that the database systems solve. So let's go for the first one. Volume. I mean, data is increasing, it's compounding astronomically creasing and terabytes, petabytes, zetabytes. And we need sophisticated softwares too, manage that much volume of data. Securities and other key issue that companies are facing when their data gets hacked. So we need database software systems, do have those scarcity encryption layer added so we can protect our data. We have web applications that have millions of customers accessing their data. So we need sophisticated database systems. I can handle these transactions simultaneously. Reliability becomes so important for companies where data is their business. I mean, for a stock exchange flight scheduling system, we cannot have data. That is not correct. Data had computer Bible. Data is increasing. Securities needed, millions of transactions happening on the same time, data has to be reliable. So database softwares or database management systems are needed to solve these problems. You just need to understand that and get that in our head. Before we get into kinda looking at the features of the database management softwares. In the previous lecture, we looked at database is what they are, and also go through some of the rationales as to why database management systems were created. So in this slide, I wanted to kind of give you the architecture view of how database management systems are used. This diagram is going to help you understand the different components that are used when you access data to a database management system. The user application layer. Think of this as where you have a client installed on your laptop. In our case, we installed MySQL Workbench. There are other client tools that can be used to access the database management system as well. I've shown towed as one of those softwares and other reporting softwares can be used as well. Your client application needs to have an ODBC connection initiated. Odbc stands for open database connectivity. So this driver needs to be in place for your client application to connect to the database management system or your database server. Once that connection is established, you can easily write a SQL script and access your data. I listed for you of the database management systems. Ibm Db2, Microsoft SQL Server, MySQL, and Oracle database systems. And there are a lot of other vendors providing this software as well. They are different functionalities that are embedded in the management system. And we have functionality that handles transaction management. We have areas for handling security, data manipulation, defining data, loading data, and the database server also has an engine that can SAP SQL language to be processed and then rewrite or SQL script as simple as select star from a table. At SQL statement gets processed, we add the DBMS. The DBMS needs to interact with the operating system to get to their data storage area or the data storage layer. The last layer is the data storage area. Your physical data files or reside on the disk. I'm showing customer order and product data is stored on the physical disk. But you would need a database management system to access this data. And you would need a client like MySQL Workbench to be used via a SQL interface. Do get that data and be presented to you. These individual components will help you understand how that database architecture is set up. And it's very important to understand what the user application does and what the DBMS does, and where the physical data files are stored. And I think once you can make a distinction, it will really help you understand the database management system. Hi everyone. One of the key points I want you to take from this slide is that MySQL is being used by a lot of industry leaders. So if we look at the top left off the slide, these are just some of the companies that are using MySQL and production. So from Facebook to a Walmart, to Uber, these are all the companies. Either they are using MySQL for their warehouse or their web applications. And if you look at the bottom left of the slide, I've given like timeline, kinda the inception of the software. And it was built in 1995 by a Swedish company. And over the course of years it has been acquired by different companies and Oracle acquired it via sun. So it has hundreds and millions of installation throughout the world. And it's built on some of the key features for a database which are performance, reliability, ease of use, and cost. And if you look at the bar chart I've included, I mean it tells you, compared to some of the other database solutions, it is so cost effective. So the total cost of ownership TCO, is really, really less compared to other software. 3. MYSQL Windows Install: Hi there. Great. So let's kick start this installation. So we're gonna be installing my sequel for Windows. It's a very easy install. Just follow the U. R L that I've given are great. So we're on my secret war page and follow the Ural that I've given. We will be downloading the mice equal installer 8.0 point 12. And the first link here, the one that I'm highlighting co operating system is going to be Microsoft Windows for the S. And then click down little do log in or sign up Just saying no, thanks. And move on and you're gonna get, ah, save as, um options and just save as Creative folder and save the file. The installer is going to start, which is 1.4 version. He's gonna ask you to install an upgrade, so just say yes. We will just get the new version as well. All right, so the first thing you're gonna get is the license agreement you're gonna accept, read through and accept that, and they're different options to install a set up type. So we're gonna go to the custom. We just need a server and workbench. So we're going to select the server option moved to the right as a selection when they were going to go to the work bench. Um, get that selected as well. Also, we're going to do the or DBC connection in a documentation. So we're just doing limited software install rather than everything. I've fast forwarded the video, so it should take you no more than 10 to 15 minutes. It next. What? We're going to keep that replication to default. This do next networking type. Just keep it default next authentication method as well, just to a default next. And then just put up a password. Please make sure you remember this password because very important, I just put it in a piece of paper or note. Pat somewhere, it's just gonna be the this service name. If you want to manually start the service this to execute, it's gonna, uh, run a bunch of configuration files again. I have fast forwarded the video just to finish on launching my secret workbench. It gives your default connection, but we're going to create our own. My secret connection just hit the plus sign and we'll give it a name. We'll just say sequel Underscore Class. Everything is already filled up just to Ah, OK, double click on sequel Underscore Class. We're gonna be launching workbench now. So this is the graphical user interface for my equal, and we will, right. All our scripts there will create tables, load data and run different quarries using this GUI interface. 4. MYSQL Macos Install: Hello, students. So we gonna install my sequel on Mac Os. So I've given the euro. Um, you can use to download and go to the my secret website. We gonna be downloading my sequel Community server and also my sequel. Work Bends. So we need these to sell fares for the installation. All right, so we're gonna first install the my sequel server. Just follow the link. I'm click on the community server. We're gonna be downloading the DMG file for the mice Equal Community server 8.0 point 12. So these are different types or files. We'll just do the DMG again. Just click download, ignore the log in and sign up at this time, So just say no, thanks. And click on the link. I fast forwarded the video so you have that amount of metal disc image on your desktops, and now we're gonna do the my sequel workbench Install, scroll down and you'll find one file, which is also the DMG file or workbench. Eight points or 80.12 and click the download. All right, so we have both of files on our desktop, so we'll start with the server file. Click on that so it's gonna start. Ah, the install. Or you would need a password for your database. So please make sure you remember the password right on a piece of paper or note pad or save it toe. It'll need this every time we connected the database. So just please, please, please remember now we should see the installation status as being successful. You know, we have actually installed server. We can go to Preferences and just see my sequel. Um, in the list off software is being available, so we're good to go there. Now we'll click the workbench DMG file and that it install. All right, so let's open the my secret workbench application. Great. So we have my secret workbench installed. Let's click on the icon for the connection. That's already there. We'll have a default connection, can create it at the install. Will just type in the password. That be Ah, we keyed in earlier. So hopefully you remember that he's going to try to connect to the database. All right, so we're successful in connecting to, but my sequel server through Ah, my secret workbench does the graphical user interface we're gonna be using to write our sequel scripts. And, um so hopefully, during the course of the class, we'll go step by step. And you learned this interface for your programming. Thank you. 5. MYSQL Cloud Install: everyone. So we will also try to access my sequel. We have the cloud instance, I'm gonna be using Amazon Web services and, you know, looking into future, I think cloud implementation is gonna be prevailing across a lot of companies going forward . So I think it's a really good to have and experience using some sort of glad service. So we're gonna have the my sequel instance in the cloud, and we're gonna use our my sequel workbench to connect to the cloud Instance. So you need to have an AWS account set up, and I have given a You're left at the bottom for some steps as to how to set up that account to please follow that. And hopefully the set up would be easy for you to create that account. Once you have your count set up, sign into the council and you'll get into the main portal or feet up to us. You you'll see a list of services. So Amazon has a bunch of services from spinning off servers, different storage options date, obvious option, and so we're not going to go through all these details. What we're interested in is setting up my sequel server. So in the database category, we have already. SRT s is a subcategory where we have all the relational database systems that Amazon provides. So now please go ahead. Uh, click on the button. Get started now. So these are the list of relational databases that Amazon is providing. You have sequel server, Oracle posts, Crests, Maria DB and my sequel and Amazon Aurora. So we're gonna go to my sequel. So we have two versions. One is the community addition and the other one is Amazon's Aurora. It's, ah, different flavor of my sequel, and we're gonna be using the Community server edition. They're gonna go click here, Please go ahead. Used the Devon test instance because it uses the free tier so we don't get charged. But we want to see only showed those options that are eligible for free tears that we'll just click that we'll leave everything else here the same. So we need to specify a db instance. So in my case, I'll just use, uh, AMG Inc. Use the name student. I have to give a password like a step so we can leave all this as a default database name will just say my sequel, Cloud. So we'll just leave all this as if back up. We don't need any backups or this one just 10 to 0 relaunch. So this will take a little bit of time to spin off the server. Um, So what I'll do is I'll just pause the video. All right, so now our service up and running and you can see the status. So now we need a connection string or an endpoint. Ah, that we're gonna use in our my sequel work bench to connect to this instance. So we go to instance actions, you'll get some more details on the serving. You also get endpoint. Okay, so this is our my sequel, workbench. And now we're gonna create a connection, a new connection to connect to the my sequel server on the Cloud. So let's go back to our details off the server. So this endpoint is basically our connection string. So just copy this in my secret workbench would create a new connections to click on the plus sign are there will get a small window open so we'll create a connection name, so I'll call it sequel. Underscore Class underscore Cloud as far as the hosting will put the connection string there. We already have the port, so take the port out out of the connection string. So for using him, I had initially put in student Well, just test the connection. Enter the password that you set up. All right. So we were able to make a successful connection to my sequel instance, which is in the cloud. So it'll create a small Aiken for the connection, So just double click that great. So we're able to connect our my sequel workbench to the my sequel server instance on AWS. 6. MYSQL GUI INTRO: Hi there. All right, so now we have our my secret work. Ben started. So? So the first thing you will do is try to connect to a database. So we have my sequel connections. It just gives you Aikens for your connections that preset. So we're gonna use the 1st 1 that we set it up early. If you double click on that, I've already entered my password before, so it's just it's gonna just let me in. But it might ask you for a password. So just key in your password that you set it up. So now we are in the graphical user interface of my sequel, and this is the place where we're gonna be writing different scripts. So we have the quarry pan in the middle where we write a sequel statements. We have an object browser. We're gonna have different schemers and tables listed for metadata information. On the right hand side, we have sequel statements or snippets that we can use for our sequel. So now we have kind of a common understanding about the interface. So if you look at the object browser area, so we have ah s Y s says schema that's created by default, and it has objects underneath it. So we have tables, abused or precision functions. We for our class will be interested in creating tables, loading data enquiring those tables. So if you look at that, we have one table called six. Underscore conf IQ. Let's say if you want to just access that table so we'll come into our quarry pan and just write a simple, select statement. It's Colon. And if you look at this ah, lightning icon at the top, this is execute the selected portion of the script. So just click that and you'll see a result. Great. Which kind of gives you the results that that you need from the stable six underscore conflict. So when you initially load the stable, sis might not be your default database, so just make sure your right click on sets and just set as default schema so you don't need to prefix it with something like this. Leaving this will also work Great. If you look at the right, this is what we initially we're talking about the snippets that my sick will give you. I mean, these are different categories that they have split into the details of the M Els and etcetera. Click on it and just check what the script would look like with all the options. It's just kind of in a handy tool to have when you're trying to right near scripts and, as you can see it has insert joined Select. It's kind of handy in the output. Here is basically kind of a high level summary as to rebrand the script, how much time the script took and the record second returned account of those records. And if any editor comes into scripts able to be listed here as well, great, I will see in the next lecture. 7. Create Database and Load Data: great. So now we have that My sequel software installed. We have my sequel, Workbench up and running. So now we need to create a database. So this is the fun part. We're gonna download a sample database that my sequel provides. You're gonna follow the link. Are so we're on the my sequel website. Your scroll down, You will see the Secura. You can pronounce it different way. We'll just call it Secura. A database. Just download the zip. We'll just do receive. As so we have a schema. That's where we're gonna have the sequel statements to create our table structures. Then we have, ah, data, which is gonna be a lot off insert statements creating this Secura database. And then we have the data model off how these tables are structured and related. All right, so we're gonna do a file we're gonna open Script open, secure a schemer sequel are So we have our schemo that's equal file loaded in my sequel workbench. So it has a bunch of creates statements creating actor table, address, table. So we're going to be creating all our data structures through this script, and it also gives you additional information on the columns that we're gonna be adding. So we'll just run this script all at once, and it's gonna create all the tables we need for our exercise. Make sure the cursor is on the start off the script line and then click on the execute icon to execute the script. So just hit. Hit the, uh, hit that I can. You can ignore the warnings that were using the output. It's still going to create the tables. So if you go to the object browser on the says I can right click and refresh, you'll see the Secura database. And also, if you open the tables like on, you'll see the list of tables that was created to, say Select Star from the actor table Click Execute to run. If you see we did not get anything back, there's no denying it. So now we need to load data into the state of eso are simple to school. Open, open script. We will select the data dot sequel file and load it. So this is sample data that we're gonna be loading were the sequel insert statements to fill up our sample database. We will execute the scripts by clicking on the icon. So these insert statements will take a few minutes to load for data, and you can see the status in the output pan as to how much time is taking and how many records of being inserted. Let's try to see we have data and into one off the table. So we'll just to run the same select star from actor table and see if we get data. So if you seen the result great, Um, the data is populated in the actor table for the first name, last name and a bunch of rose. So we're good to go here. Same way we can try to run the address table. Select star from address. We should be able to see more data there. I will see in the next lecture. 8. Introduction to Data Modeling Part1: Hi everyone. So now we're going to start a new section on data modelling. Understanding how data is related to each other becomes really, really important as we start our journey into the Data Analysis field. So we have to make sense of how datasets are related to each other. So we can derive value off the data. The data becomes more meaningful if we are able to relate the data with each other. The byproduct of data modeling is a data model that gets created. It's a graphical layout. It becomes a communication tool that can be used for different groups to understand how the data is related to each other. We can use MySQL workbench or urban software to do data modeling. You must have seen a picture of a data model. And what I'm showing on the slide right now is just a sample set of tables in some of their relationships. Connecting lines between the tables showed that relationship. And we will go into detail in the later slides. Re-ran the schema file to create a table structure is already loaded the data already. And now we're going to be opening up that data model. So there is a dot m WB file, last file when you unzip. So just double-click on the file, it should open the data model in the MySQL workbench. If this window shows, just go ahead and rename. And it'll, it'll go ahead and load the model. Click on the ER diagram icon. And then the model opens. They have color quarter different sections based on the type of data there is customers and inventory, and all the related tables and certain subject areas and color-coded it. So this is the interface where you can view the data model and how it's related. Tables are one of the foundational structures in data modeling peoples. So I've given an example of a country table. Now we're going to be talking about the building blocks for data modeling. First one being table or entity. You can use these names interchangeably. A table can be something for which you want to store data for collect data for examples, can be costumer. If you want to collect data about a customer, it's behavior. It can be automobile, the type of cars that are going on a particular high of a, it can be products. So anything that you want to collect data for second column or attribute, these are just characteristics of the table. So if you want to collect data for the customer, the customer Being a table, the columns for the customer table can be FirstName, LastName, phone number, address, etc. Third is datatypes. So each of those columns can have a certain type of datatype, datatype, all it is is that if it's a phone number, we have a number as a format of the column. If it's a name and it's more of a textural data. So just a different flavors of those columns. Forward is cardinality of a relationship is what defines how the two table or entities are linked with each other. And just kinda is modal for descriptive way of saying how they are related with each other. And we're going to go into detail the different types of cardinalities and relationships they are present. Fifth is the primary key is kinda what defines the customer. Is it the social security number is kinda the primary identifier. And then foreign keys is a concept that will become apparent as we go down explaining the different type of relationships and how tables are linked. So the six building blocks are very important to understand how data is related to each other. So we'll go and explain each of these building blocks in detail in the following slides. Tables are one of the foundational structures in data modeling as the definition states its collection of related data, right? So we were talking about if it's customer, we want to have a customer table. So it's a structured format in which databases create these tables. So I've given an example of a country table. Let's go try to look at each section of the table. So the stuff that's highlighted in orange are basically the attributes or columns. So we have countryID, you have country. And then we have last updated column, countryID testing, individual unique E. And then the country is listing a textural information. And then there's a time value for last updated. So these are columns or attributes. If you look at the bottom section of the slide, created that in green, those are actual values. Data elements. So data elements are, if you'd look at that last row, countryID nine, so nine, Austria. Each of these individually are data elements. If someone says, how many columns do you see in this slide? So I would say three, countryID, country and last update, data is added as rows or tuples. So if you'd look at horizontally countryID to Algeria and the last update value, that role becomes one record in the database. Another important concept is of primary key. Each of these rows that you see are identified by a unique identifier. So country ID one through nine, He's unique, 123 up to nine. So you don't see that being repeated. Understanding the wave tables are structured is very important. Defining the relationships that we're going to learn. In the following slides. We will be talking about datatypes, or in other words, you can say is to type of format that a column has in the database. So whenever you try to insert a data, would need to know if it's a date, if it's a string on, if it's a number. So photo date, column, date of birth. In this example, the format is first, why, why, why, why is the year portion dash m, M is the month and dash d. D is the days that do 1000. Dash 1010 be the format of a data element. And the second date format is a date where we have hours, minutes, and seconds as well. So for certain scenarios, we can have the date field defined in a more granular way. So date would be the first datatype. Second data type is strings. So strings are more or less. In a case of customer table, we can have customer FirstName, LastName, anything textural. And then we have two variations, or varchar variable character and the character. We don't need to go too much into detail, but they bolt hold string values. The third category of data types come under number. So anytime we want to store numbers, for instance, your car's mile, like we were talking about an automobile table. So if you have mileage for the car, that mileage column can be stored as a number. So for number, you can have a data type of an integer or a small int, depending on how big the number is. Or if you have decimal places we wanted to present, like for instance, a customer paid amount, like if you paid for a Starbucks coffee. And the amount was some dollars and then some sense. So we can represent that by a datatype off decimal. If you note that datatypes and the category of date strings and numbers, you can do a majority of the data analysis knowing these datatypes. We will be talking about the notations that we will be using in our relationships before we get into the different types of relationships, I just want to make sure we understand the language in which we can there be depicting our relationships. So we're going to be using information engineering notation. So the process of designing the tables and its relationships is called entity relationship modelling. So if you're looking at the cardinality, That's just the maximum number of connections you can have between the two tables. So one would be just one more to go Line, many would be offbeat. Modality is just the least number of connections you can have a relationship with 0 or one. When you see your relationship and action, you will have both the maximum and the minimum number of relationships those two entities or tables can have. So that's why if you're looking at the right-hand side, showing a cardinality and modality together. So the first relationship is showing one or many. And that's why you see this symbol for one as a vertical line and for many as a crow feet. The one at the bottom is 0 or many, and you have a circle and a crow foot. So just getting an understanding of the symbols were really helped you in the following slides. 9. Introduction to Data Modeling Part2: All right, Now we will talk about different types of relationships that two tables or entities can have. First one is one-to-one. As the name suggests, it says, one entity can only have one relationship with the entity to entity 2k and only have one relationship, the entity one. So if you look at this example, we have a relationship between a manager table and an office. If you just look at the Manager table, we have manager ID, FirstName, lastname. So there are four different managers in our table. And if you look on the right-hand side, we have an office table. So we have four different offices, Chicago and New York, Seattle, and Columbus. So that's the data we're dealing with. If you look at the Manager table, we have manager ID as the primary key. And in the previous lecture we defined what a primary key is, a unique identifier. So manager ID 1 own relates to Eddie, manager ID for only relates to Paul. If you look at the office table, we have primary IDs as Office ID one through four. And then also we have another column manager ID in the office table. So what it shows is that office one is managed by a manager, one location in Chicago. So Manager ID being in the office table. This particular column is a foreign key because it is present in the Manager table. So now there's a parent relationship. And if you look at the line that connects these two tables, it's a straight line with two dashes on the side. So what that depicts is that the manager can only manage one office. If you look at from the office to the manager side, it says office can be only managed by one manager. So it's a one-to-one relationship. It's a mandatory relationship as the modalities also showing one with the vertical bar and the maximum relationship is also one. And another example can be US citizens and social security table. So US citizens table, which has a FirstName, LastName, and then another table as a Social Security. As citizen can only have one social security. And a Social Security can only be related to one citizen. So hopefully that kinda clears that up as far as the relationship settings. So it's a one to one. So in this case, a manager manages one office. An office can only be managed by one manager. All right, so the second type of relationship is one-to-many relationship. This is one of the most common relationships that you will find between tables. If you look at the example, it's between customer and order table. So if you just logically think and a scenario where a customer can have multiple orders, like if you go to Amazon, you can be can put three orders. So that's a legitimate business case, right? So you can have multiple orders, but that individual order, if you put three orders that it order one or the first-order only relates to you. None of the other Amazon customers, right? So the order is linked to one customer, but the customer can put multiple orders. And if you go with that mindset, if you look at the relationship in the middle between these two tables, we have a vertical line on the left. Meanings that the order can have only one customer. On the right we have a crow foot depicting the many portion of the relationship with the vertical bar, depicting at least one. So the customer can at least have one order or many. And again, we have a primary key and a foreign key. So the primary key on the customer table is the customer ID. And the primary key in the order table is order ID. Great. But the order table also has a customer ID in it. For this relationship to work. And that customer ID in the order table is a foreign key. And that customer ID is what that relationship is built on. Majority of the tables would fall into this category. Again. So if you look at this example, customers can put multiple orders and an order can only be placed by one customer. Third type of relationship is many-to-many relationship. Let's look at the example that we have. The first table, we have a student. So it just gives you a list of students that we have. So student ID, FirstName, LastName, great. So we have a student table. Then if you look at the far right, we have a class table. It just gives you the different classes that are being offered. So English, math, chemistry, etc. So as long as student and class tables by themselves look fine. Student has student information, class has class information. As we'd relate the student and class data together, there is an apparent many-to-many relationship between them. Since a student can take multiple classes and each class can have multiple students enrolled in it. We have a scenario where this relationship can capture other information like student grades, enrollment dates. And for this, it'll be better to make a junction table called enrollment that will capture this information. If you look at the way a student is linked to enrollment, we have a one-to-many relationship where a student can be enrolled in one or more classes. And if you're looking at the relationship from right to left from plaster enrollment, a class can have one or more enrollments from the students. As you go forward in the data analysis field, you'll see more and more relationships would fall into this category. Fourth type of relationship is called recursive relationship. And if you look at the example of employees table, so you have Employee ID, FirstName, LastName. You also have a fourth column called manager ID. So the managers are also taken from the list of employees. So there is a relationship between manager ID and employee ID. So if you look at the bottom half of the table, we have it a relationship going from manager ID to employee ID. So a manager can manage multiple employees. So if you look at the tail end of the relationship, It's a crow feet saying many and at least one. An employee can be managed by one manager. Or in some cases may not have a manager at all. So that's the other side of the relationship and that's why you see employee one not having any manager ID. And if you look at my ch and j employ 23, they are being managed by Eddy who has an employee ID of one. And nobody's managing Eddie, he's a top-level employee and that's why you see Manager ID has no information. So this is an example of a recursive relationship where we have a relationship within a single table. And manager ID is a foreign key to the employee ID column. So hopefully this was helpful. This type of relationship is rare to have, but it's good to know when doing data analysis. So now we have kinda looked at the building blocks of data modeling. Know what, what a table is, what our columns or attributes, what is a primary key or foreign key? What are data types and how tables are related and types of relationships that can exist between tables. Understanding normalization becomes important as more and more data comes into your database. So what normalization is a process in which we try to break a bigger table into smaller tables. And what it does is that it helps you reduce redundant data or duplicate data. Normalization goes through a few steps. In each of these steps are called normal forms. We go through first normal form, then second and then third. And overall normalization. If you look at some of the benefits is that you save disk storage. It's easy on maintenance for the database because you're updating less data, IO activity gets improved and also it helps query and reporting. I will show you an example of an employee table. And we're going to take that table from first normal form to second and then third normal form. And you will see their transition as to how we go through this whole process. And hopefully with this example, the normalization process of a table will become easier. We have been given this employee table and are tasked to normalize it up to the third normal form. Let's take a step back and look at the data closely. We have an employee ID which uniquely identifies an employee or department code that is unique for a department. We have employee name department, three columns that have four numbers in them, employ start date and employee vested indicator. So that's the data that we're dealing with. And we have to take this table through the normalization process of first normal form up to the third normal form. When you're dealing with the first normal form, you need to ask two questions. Are there any repeating groups, meaning, are there two or more columns that are closely related to each other? If they are, we need to create a new table for them. Second, make sure that all the attributes are single-valued attributes. Let's take a look at the employees tables. If you look at the employee name, bf, FirstName, LastName in one column, that's our first candidate, as we should not have any multi-valued attribute. So what we're doing, the first normal form, we split the employee name into the firstName and lastName to take care of the multi-valued attribute. If you look at the first phone number, all the phone numbers There are different and it has to be the employee phone number. Phone number 2. It seems like that phone number is dependent on the department. If you see finance is finance department has the same number. So it is the department phone number, phone number three has all the phone numbers being the same. That can be looked at as a company phone number, employee start date, and employee vested indicator can move as is. We also improved the metadata or the column names for all the three phone number columns, and add employee, department and company labeling to them as it makes more sense. Now our employee table is in first normal form. Let's get that ball rolling to the second normal form. The most important question you want to ask is when you're transforming your table from the first normal form to the second, is that to make sure that all non-key columns are functionally dependent on the entire primary key. So what are the non-key columns? Employee last name, employee first name, department employee phone number, Department phone number, company phone number, employee start date and employee vested indicator. These are all non-key columns. So what the rule says that they need to be functionally dependent on the entire primary key. So the primary key here is employee ID and department code, right? Those are the unique identifiers. But if you look at the employee first name, last name, that is dependent on the employee ID. And employee ID should be able to give us the employee FirstName, LastName. If you look at the department, department is dependent on the department code. So you see we have some columns in the table depending on one part of the key and the other columns are dependent on the other part of the key. So what we need to do is we need to split the table where the columns are dependent on the whole key. So that's why you see the employee table being split and having the employee first name, last name, phone number, employee start date and employ best indicator. They are all dependent on the employee ID. For department. You see we have Department Department phone number and the company phone number that are dependent on the department code. Great. So we have those two tables transformed into the second novel form right now. We also need to capture the relationship between the employees and the department data, right? Employee can work in different departments and the department can have multiple employees, as you recall from the earlier lecture, to resolve a many-to-many relationship, we had to create a junction table. So in this case, it would be the employee assignment where you're going to have an employee ID and a department code as being two columns in the relationship. So you'd look at the relationship from employees to employee assignment is a one-to-many. And if you look at from department to employ assignment is also one-to-many. So how are you going to capture the relationship and the employee and department being into second normal form. Great. All right. Let's get the ball rolling photo table to be in third normal form. We should not have a situation where an attribute depends on another attribute that is not the primary key of that table. It's also called transitive dependency, really important role that the non-key columns in the table should not have dependence between them. Secondly, we should not have any derived data such as like total columns that are derived from other columns in the table. So if you look at the employee table, we have employee vested indicator that is dependent on the employees start date. So for a surgeon start date, that indicator either turns yes or no. So that is definitely one candidate that we can address and not third normal form. So if you look at the employee table into third normal form, we only have FirstName, LastName, phone number, and start date. All those columns are only dependent on the primary key. And there's no dependency between the non-key columns. If you look at the department that is already in the third normal form, as you go through different projects and, and as you go through your data modelling assignments, you'll find that the third normal form is where majority of the transformation stops. They are some higher normal forms like bark called normal form, fourth, fifth normal form, which do occur and people do use them. But very rarely. I will try to add some slides for those normalizations at a later time. All I will say is that with practice, data modelling becomes easier and easier. You need to understand the business as to how the data is captured and what the data means. Practicing the normalization rules and being close to the data will help you in your data modelling projects. 10. Introduction to SQL: Hi, there. We went through learning about databases and then data modeling. And now we're going to be learning about SQL. Sql is a programming language use to interact with the relational databases. Sql is an acronym for structured query language. The syntax of SQL is very easy to learn. It's pretty descriptive. A lot of keywords that are used to do the operations, self-explanatory. Different relational database systems have some flavor of SQL that they have rolled out. And each flavor has some functions that might be different from the other version. But in all the basic SQL syntax remains the same. We will be using SQL to interact with our MySQL database. We're going to be going through different categories in which SQL statements can be placed. Sql was standardized by American National Standards Institute and C for short in 1986, and it has gone through multiple revisions. We can put SQL commands into different classifications. The first one is DDL data definition language. So whenever we are trying to create a table, alter it, drop it. These SQL keywords can be categorized within DDL and we have a separate slide wherever you can and go through examples for creating DTLs. Second is DML. Dml statements pertain to either inserting, updating or deleting data. And we will go through some examples of those. Third is DQL Data Query Language. And this is the select keyword. You must have seen a select statement in the past as well. So this is basically a generic SQL statement which can help you retrieve data. Fourth one is DCL, data control language. The statements in DCL are normally used by database administrators, like granting privileges or revoking them. We won't spend time doing data control language examples, but we will go through examples that will fall under DDL, DML, and DQL. 11. DDL: All right, So now the fun begins, we're gonna be writing a sequel code, so we're gonna go through D D. L statements. As in the previous slide, I was saying, It's the data definition language. We will be creating a table and then also we can either alter drop it or truncated. So let's let's take the first example in creating a new table. All right, so we'll be creating a new table called students, and we're gonna be using a keyword called Create Table and then the name off the table, which will be students Open bracket and close brackets will list different settle columns. You stink of columns as attributes like, what kind of data do you want to capture for the students? Okay, you have your student i d. His first and last name date of birth phone number at is just a general information about the students. So whenever you want to add a column, you have to have a data type what defines the type of data that will go into the column. So, for instance, student I d. Reuse i nt as indigent first name last name is gonna be very well character war car for short, it's gonna be string values. Anytime you have a string ready, just use a very real character or character you can use either or one is fixed length. The other one is rebelling. And whenever you defining a string, you will basically give the length off the string rallies in this case is 20. Date of birth is gonna be a data type of date. Phone numbers again is gonna be a string address is going to be a string. And if you see address, I have kept it at 100. He might have more information on the address. We need to add a semi colon to indicate the end off the statement. So just highlight this portion and execute. Great. So now we have successfully executed a create statement. So if you go under the tables section off the database Secura, we need to refresh tables and do fresh all. And then you will see the students table. You can open the students table hair and drill down into the columns and you'll see the list of columns that we created. Great. So now you have created a student's table. All right, so now we have created the students table. Let's try to add some columns to the table and drop some, and we can even go ahead mortify a data type. So if you look at the older table syntax so we have the altar table as keyword, we have to use altar table and then the name off the table that we want to alter, right? So it's gonna be altar table students, and we want to add address, line and address to as to new columns. So we're gonna use the ad key word and what's after give data type. In this case, I'm gonna be giving 100 length for each column. It's gonna be comma when we end. Would be then using the drop key word to drop a column. So we'll be dropping the address, Call him. And if you want to modify a call and we just use the word mortify last name and we're gonna be changing Initially, this string data type was off length 20. So we're gonna be using now, 30. So now we can just run this, highlight the whole statement and just run it. Great. So are ultra Statement was executed successfully. We can come down to the object browser and look at our student table. So we have student I d. First name, last name, phone and address. So it's actually showing the older emits to just do it a fresh all. And if you look at the last name, it's a variable length 30. Great. So now we have altered the table to drop her table or to delete David. Suppose we just say drop table and the name off the table. So what this will do is it will drop the student table and then we also have a truncate table statement. If we do truncate table, it basically empties out all the records and the tables. Right now we have our student table is empty, so it won't perform a particular action because of the day with his empty. So we run the drop command and as you see, the student table has been deleted 12. DML: the next category in which we can put the sequel statements are D. M. Els. So data manipulation language is so we can create statements to either insert data update data or delete data from the table. So we'll practice each one of these statements and see how the get a base behaves. All right, so we'll be inserting a record into the country table. So let's see what we have in the country table already. Every great. So we have a bunch of records. We have about 109 records in the country table. So if you want to insert data into the table, do you have to use the keyword insert into, and then the table name open brackets and you after listed columns, Country table has country I. D. Country and last updates. We'll list those three columns and then we close the brackets. So the first set of columns are gonna be the number of columns center in the table, and then we have the values as a keyword. Then we list the data values that we want to insert into the table separated by commas. Go for country I d. We're gonna put 1 10 If you look, we already have the last record in the table as I d being one or nine. So we'll put 1 10 will put as a country name as my land. And for the last update, this is a function. Don't worry about it. Just you can just put it, as is it say's current underscored time open and closed bracket That's just gives you the current time when you insert the record close bracket. So insert into table name list the columns that are into table than the values keyboard open bracket and then insert the data and separated by a comma. All right, so let's execute this. Okay, so we have affected one droll if you look here at the bottom. So let's try to select data from the country table and see if we got our record. And so let's scroll down, says you can see the record with Country 81. 10 was inserted. Great. Now we want to update the same record and said the country named to no man lands. So we used the key word update the table name set country. The column that we want to update equals two. And when you put this in Ah, Coats, No man lands. We use the where keyword country I d equals 1 10 So this is gonna filter the data. This will filter the data to the record, which we inserted That was 1 10 and this will The set keyword is gonna update the value in the country column for the record where the country ideas 1 10 So we run this. See, it has been updated to no man's land. Great. So if you want to delete a record from the table, we used the key word delete from table name and then we use where is when we want to qualify and get to a particular record into the table and filter. So where country idea is equal to 1 10? Let's run this. So now if you do so like star from country, we should not have any record with country I d. One time. So that has been deleted. Great 13. DQL1: we will be looking at the most common sequel statement that is used to get data from the database. It's ah, it's a very simple statement, but it's very powerful. At the same time, I have tried to put this visual for the select statement. I just wanted to make sure that you understand each component off the select statement. A lot of times people probably no select statement partially and they missed the other details. So hopefully this picture will help you understand each component off the select statement that is used to retrieve data from the tables. Select selects the number of columns from basically less the tables. You want to get data from where is basically where you put the filter condition to filter your data order by is where you list the order in which you wanna see the data and limit is where you wanna limit the number of rows you get back. All right, so let's get this kick start. Its the first script we have is select star from actor, so we cannot retrieve data from the actor table. We're using a semi colon in the end to end the statement we're using the select and from keyword, the star or ass trick depicts. You want to bring back all the columns from the table. So if you don't I mean other ways you have to list. Let's say the table has 20 columns. You don't want to list each one of them. I let the script and run, so you'll get the data from the table about 200 records in the table. Now let's see if you just want to bring back first name and last name, we do select. We just list out the first name, and columns will be separated by a comma. First name, comma, last name from keyword and an actor and semi colon. So we just get those two columns. It's pretty straightforward and self explanatory. Let's say if you want to label the first name as employed name, you have to use the keyboard as and then just used the column name as the given label name from the table, and you have employees name and the rest of the data comes in. Great. Unless say, if you want to order their data order by first name, so we'll still do the same. This piece of the select statement. It means the same. We have to select first name last name from customer. We used the key word order by and which call him do you on order by his first name Ascending. So let's run that. Great. So you see the data comes back and is ordered by the first name of sending and let's take an example off a payment table. Let's use the payment table and order by amount descending so amount should go down when the data comes back. So you see, first we have a higher amount that it goes down. Let's see if you want. We can also add a sort on two columns, several first sort by the first team so we can sort the customer data by first name, last name together as descending by just using order by first name Coma, Last name descending. It's run that so great suit is ordered it. Now let's say we want to bring just the 1st 5 records from the customer table. So this is gonna be a traditional select star from table order by customer, I D. And as you can see, what 600 records come back and say We have to use the key word limit and then give the number of records we want and then Colon. So let's run this. We only get five records by. So this is by using the key word limit and then the number of records you want back. You can also pick arrange when using the limit. Keyword. So if you say to comma five, that means that it's gonna bring five records, but it's going to start from the third record off the table. 14. DQL2: This is a list off sequel comparison operators that you can use when you're trying to filter data from the data set. So whenever you use the keyword where you can use a particular column and then if you want the column to have a value equal to something or not equal to something greater than less than so, you can use thes mathematical operators, we also have operators like in open bracket and close that gives you the option to list multiple values. Also bit mean gives you an option to go pick a range. Rieff. It's the range in the date or the amount null. And not now is where you have an absence of data and you just want to pick the value where there's no data in the column like is also very popular operator, which you can use to big data that follows a certain pattern. You will be using the where clause to filter the records coming back from the database. We already know how to select data from my table, right? So let's highlight the portion where it's a select star from customers, so you will get pretty much all the records from the table, right? So let's say if you want only pick Mary, you use the where keyword falling by a column named that you Want to filter on equals Mary has to be in single Coates is a strength string value and then semi colon. So let's run this. So we found one record in the customer table, which satisfied this condition. If you want to pick customer data where the first name is not equal to marry, you just use explanation mark and then equal together and then marry. So if you run this, you will filter out all the records where the first name is Mary or let's take an example. We want to apply to conditions to filter the data. So we using the payment table. We want to pick all the records from the payment table where the payment date is created than 2006 Jan. First, and the paid amount is greater than four. So we have to apply to conditions to filter the records. We'll do select star from payment where payment date will use greater than the set date in single coats and amount greater than four. Let's run this so you'll see all the payment dates I created than 2006 Jan. First, and the amounts are greater than four. Great. Let's try to filter the last name off the customer from a given list of names. We can use them not in open brackets and give the list of those names if you run this So all the last names, a return back and Davis, Miller and Wilson records are filtered out. Let's take an example where we want a filter that payment records based on a date range. We can use the between key Vered and give the starting date the single coat and the ending date. So the records that have returned or a payment date between these two dates, we can also use the order by payment date. So let's run this 2005. July 5th is in the range of 2005. July 1st 2000 Julys 31st. All these records are in the month of July. We can also filter customer data based on a pattern that we find in a string value so we can use the key word like, followed by a single coat and the two characters off the first name in this case, a N, followed by a percent sign and close coat. Let's run this. It's returning all the first names where the 1st 2 characters are a n. Likewise, we can run the same like command, but have the percentage in the start, meaning that the last director would be. Why in this pattern. So let's to run that you look at the first name. The last character is why in all these records, and this is a very powerful operator, great. 15. Inner Join: one of the tasks that data analysts do is bring data together from different data sets, and that is achieved by joining tables together. The first type of joint we're going to look at is called inner join. So if you look at the Venn Diagram Table A and Table B have something in common, and that common data element will be used as a key to join these two tables together, the syntax for the inner join can be in an implicit way on an explicit implicit is basically when you're using the inner join keyword. So let's take an example off the first syntax. Select from table a inner joint table, be on, and then we list the keys from both the table that we want to join on. That's kind of an implicit way off saying it. We can also use instead off on. We can use the using keyword and just list the key that is common between those two tables . Explicit ways where you say from table a calm a table, be where a dot column equals B dot columns. That's just a different flavor in which you can get the same results you can open the data model from the file itself. Click the ER diagram Aiken to open the model when joining tables is really important to know what the data model looks like, Right? So let's say, if you want to join country and city, we want to know which column the relationship between these two tables are built on. So what column In country table In what column? In the city table, the two tables can be joined on, so it's gonna be select Country and City column from city table inner join country table using Country I D. As the column that is common between both those tables where countries equal to United States. So whenever you're trying to join the tables, you need to look at the data model rights of what data model will give us the relationship off, how tables are joined together. So let's switch to the tab where we opened the data model. So we're looking at the country and the city table, and if you can see we have country I d common between those two tables, so that means that we can use the country I D column in our joint condition so we have used the country I D column as the joining column between city and country table and we also using the keyword inner join. And what that means is we want to bring back only those records. We're the country. I d has same data between those two tables, and we want a filter where the country's United States So what the sequel will return is the list of all the cities in the United States. You're given the task to find the list of films. Language is not English, so we're gonna go back to our data model so Phil and Language Table are joined on language I D. So let's go back to our sequel. So select. We're gonna select title and name from Phil Inner join language table using language i d. Where name? Not in English. So we want to bring back all the films where the language in which the film was made is not English, and we don't have any records right now on the database that satisfied this condition, we can modify the script to bring back films that were made in English by taking the not out just having the name in English. Let's say we have a scenario where we want to generate the list of customers which live in Barcelona. So let's go back here. It will be a three table joint customer address and city. Let's look at our sequel Select. So we're selecting a bunch of columns from from these tables from customer inner join address using the address I D. So if you look, this is how the custom an address are joined on address I D and then we're joining address inner join to the city table on city I d here. So they're joining on the city i d column. So this will be our joints for three tables where city equals Barcelona. So you see the column City to get to that, we had to go from customer to address to city to relate all this data together. Now we run this. We should get all the customers living in the city of Barcelona. So we have one customer that lives in Barcelona. So this is a very powerful joint statement, and as you can see, you have joined three tables on these common data elements. So let's say you want to bring back customers that live in Barcelona, and during their rental period they have paid more than $5 in rental cost. So you will be joining the payment table with the customer table. So if you go back here so there's a customer, I d their joints, the customer and the payment table right there. So you're joining so the rest are the same. We already did this joint, so we're just adding another payment table, and then we have an amount greater than five. So let's run this. So if you see we're getting records for the same customer. So where he has two transactions where the amount was greater than five. So it's giving you a transaction history for the customer living in Barcelona, where the payment amounts are greater than five. 16. Left Join: the second type of joinus called left join. So left Shown brings all the data from Table A along with the data that matches between A and B, the syntax four left joinus select from and we used the key word left joint table a left join table B using the common key fields between those two tables. So we want to get the list of staff members and indicate which one are managers. But the list should also give you staff members who are not managers as well. So for that we will be using a left join. We need to insert these two records so we have enough data to make this example work. So if you come back to the model, we have staff and store table linked by the staff. I d. The table store has those staff I ds that are managers. And then, of course, we have the relationship on store idea as well, So we will be joining the staff table and the store table with a left joint. That means that is gonna bring all the records that are in the staff table along with the ones that are in store and we're gonna join on staff. I d equals manager staff I d. So this should give us all the records in the staff table along with those staff members that are managers as well. So if you see first name, last name is coming from the staff table. So we have four staff members right now in our database manager staff ideas coming from the store table. And only Mike and John are managers. So for Johnny and Jimmy, there's no record in the store table. That's why we have no means, no data. So this way you can see that left joints are useful for certain scenarios toe, where we want to bring all the data from one table and link it with the records that are matching between the two tables as well. Art Another variation off left join his left outer join, meaning that I just want to get the records in table A which are not present in table B. So what you can do is that you can do select from table a left, join table B using the key that is common between the table. We're it will be his keys equal no. So if you picked this last option where table bees key, which means this table equals no, you will get all the records in table A. So let's say, if you want to get the list of staff members who are not managers, we'll do a left outer join and the wavy achieved that is we joined staff to store with keyword using left. Join on staff I d. Equal manager Staff I d. We're using the own keyword in our joint condition. So the column names in the two tables are not spelled the same way. So till here it's exactly the same as a left join. But when we add the very clause, the staff i d from the store table is no. Then it only returns those records from staff table and not the ones that are matching between those two tables. So if you compare this with the other left joint, this is only bringing those staff members that are not manager great 17. Right Join: third type of join is a right joint. Basically, it's the divers off the left. Join and syntax would be select from table a right joint table. Be using the common keys. It brings all the data from table B and links the records between Table A and B. We also have right outer join, which is the opposite off left outer join. So select from table a right joint table. Be using the common keys. Where nine. The previous one when we had left outer join We were using table bees G as no. But in this one, we will use table a key as no. So it's just a reverse off left outer. Join for the right joint and the right outer join. I don't have any lab exercises for you. Was that something that you can try to practice on your own? Using the Secura sample database 18. Cross Join: four type of joint is called Cross Join is basically trying to join two tables that don't have an apparent relationship, and this type of joint is also called Cartesian product. One of the things you have to be careful about is that you can do cross shine between two tables if the number of records in both the tables are pretty small. I'm talking about 5 to 10 records and maybe 20 records or so, and they're normally court values with some descriptions. So it's like if you have a Pacific use case where you have to create a some sort of a mapping a code mapping between two tables, then you can use cross join the syntax for this is select the list of columns from Table A cross Joint Table B. So it's going to show one value of table A being repeated for each value and table B. That's why the number of records get multiplied by the number of rows you have in Table B and A So let's look at an example off across showing so would create two tables won t shirt , underscore color. The other T shirt underscore size. So one has number of colors and the other has the number off sizes T shirt can have, so we'll just run. Let's look at what data we have in T shirt color table. So in the T shirt, we have two records. One is black and one is white. Let's look at the T shirt size table, so we have four sizes small, medium, large and excel. So now if you want to find out the maximum number off combinations between the color and the T shirt size, we can join the T shirt color table with their T shirt size table into across join. It's going to give us the maxim possible combinations that apostle between the colors we have and the sizes Great. So with black, you can have four sizes, small medium logic cell, and with white, you can have four sizes as well. 19. Self Join: fifth type of join is a self join. You have to create an alias off the table. So let's say if the table is a you have to create an alias, a one and a two to join the table together. Self joint, normally our rear. But they are some use cases in which you have to use a self joint. For instance, if you have an employee table and there's a column in which you have a manager, I d. That manager ideas pointing back to the employee i D column in the employee table. So the same table has an inherent relationship within the table itself. So let's go to an example and see how this can work in practice. So let's say we want to get a list of managers and their direct reports. So for this example, we have to create a table called employee Manager so Discover had created table. Let's insert ah, number of records here. Great. So we have six employees and their full name and the manager 80. So, for example, employees one Matthew is being managed by grace and Grace is being managed by Alice and is a relationship between the table itself. So let's take a look at this self joined sequel. So let's take Take it piece by piece. So we're creating an alias for the employee in the school manager Table E one. We doing the right join employees in the score manager E, too. So you're doing a right John between E one and E two on employee I d think of the one as the manager list and think of E two as the employee list and we're selecting E two employees full name, which means we want all the employees to be less dead. We just giving a label as employees. So all this statement is doing here is that if no meaning, that if the manager value is now meaning the employee does not have a manager, we label it with no manager. And this is just a label to give the column. So if we run this, you'll see we get all the list off employees in their corresponding managers. If you look at the last row, Alice Cooper does not have a manager. Let's look at just the employee table again. Alice Cooper, the manager. I DS No. So when we run this. It found no and it replaced it with no manager value, which we give here so self joins our rear, but it's good to have the knowledge about them. 20. Intersect: second type of set operator is Intersection. So sad, one intersex said to He's gonna only bring back whatever is common between those two sets. So my sequel doesn't support the Intersect keyword, but we can get the similar results by using exists or in operator. So again, we're gonna have curry. One exists. And then we're going to use the quarry to so inquiry to We're going to be joining the two tables together based on calling one and column, too, and maybe run. This is going to only bring back the records that are common on those two columns. Similarly, we can use the in operator in the in operator. We don't need to physically join the columns. We can just do a look up. Think off in as a look up. So so select Colin. One column to from Table A where so in the brackets, we give what columns we want to look up. Call the one column two in the second quarry, and then we select the columns that we want to look up to. In my experience exists performs better, would data and in his little bit slower from a performance perspective. So we want to get back data that set one and set to have in common. So if you look at the version one off the script So we're seeing Select Star from Sat one where X and Y are the columns of set one in and then we have open brackets and clothes and select star from Set to So this will bring only those records that are common pretenders to sets. So as you remember, set one had four records and set to had to records ever present and sat one. So we should get to records back. Great. So we have Eddie and Charlie that comes back. We can also run the same command using the exists key word with exists, we have to reduce a joint condition. So we'll select star from sat one. I'm giving it an alias. A Where exists? Open brackets select star from set to alias Be where? So this joint conditions on column one and column to needs to be present in the second quarry to link it to to set one. So this physical linkage needs to exist when we using this type of sequel when we're using in. We did not need it. The joint condition. So that's kind of the parents. Syntax difference, but exist is faster when there's a lot of data, so we get the same result great. 21. Minus: the third type of set operation is minus, basically set one minus. That, too, will give you elements that are inset one, but not inside, too. If you look at the example on the slide set, one has won 4 10 data elements and said to has 9 10 11 If you do the minus operation there , you will get one and four as a result, just as a mathematical operation off subtraction, my sequel does not support minus keyword, but we will still achieve the same results by using, not exist or not in keyword. As far as the syntax goes, it is the opposite of what we did earlier for the Intersection lecture. So instead, off exist, we will use, not exist, and instead of in we will use not in for the lab exercise. I've attached the sequel scripts to the lecture notes, which will help you go through the minus like operations within my sequel 22. Union: in sequel. We haven't area where we can actually do set operations. I mean Set is is a branch of discrete mathematics. But when it comes to databases and sequel, you can perform set operations by using the keyword union intersection and minus my sequel supports Union, Onley, intersection and minus can be mimicked by some variation off the sequel itself. So if you look at the union operation set one has won 4 10 data elements and said to has 9 10 and 11. If you do set one union set to, we're gonna get all the data from sent one and set to, and there won't be any duplicates. So if you look at the lab bottom portion of the slide, you have quarry one union quarry to, and one of the things you have to be cognisant about is that column one and column to in both quarries need to have same data type when you're doing the union operation. The columns that you select inquiry one need to match the columns from quarry to when I say match. The two columns need to have similar data types between the two quarries. Also, you have to make sure that the number of columns between the two Coreys are the same, and the order in which the columns are listed is also the same and also have similar data pretending to each other. If you use a slight variation to union and use union all, it will not remove the duplicates, so you will have duplicate values if you just to union. All So it's a simple operation, which brings data from two tables together and a pence data from Cree one to Cory, too. We will create two tables, set one and set to, and we will insert some daytime to those two tables. So let's see what we haven't set one, right? So we have four values Eddie, George, Charlie and David. But see what's inside, too. So we have Eddie and Charlie. So let's say we want to bring data from both the tables Eddie and Charlie isn't set to. So said to is a subset of set one. So when we run the union statement, so the first Cory Select star from Set one and the quarry to is select cell from set to we should get all the records from sub one as their common and said to as well. So let's run this. We should still get the same value for records. Great. But let's say, for use union, all we should get six records. Even two of them are duplicates to see. We get Eddie and Charlie. 23. Introduction to Aggregate Functions: Aggregating and summarizing of data is at the core of data analysis. To be good at data analysis, you need to understand how aggregation works. I'll be introducing a new keyword group by that is normally used with a lot of aggregate functions. And we will go through a lot of examples where we are using group BY clause. Groupby can also be used with a rolled-up keyword. All this will make sense when we go through the examples in our lab. So we have a list of aggregate functions. If you look at the first table on the left, We have average count, maximum, minimum. And some, these are some of the common aggregate functions that you can use on the dataset. If you look at the table on the right, these are statistical functions that you can apply on numerical data in your dataset to derive standard deviation and variants. So I've put together this diagram to just show you how the aggregate functions when they're applied to the data would look like. So this is just an imaginary table with customer ID and payment column. And if you see on the right-hand side the syntax to actually apply the aggregate functions on the payment column. So it's select some open bracket and the column that you want to sum on and then close bracket comma. And likewise you can put average Min, max comma and then count one. It's going to give you the total number of records in the table from the given table. So once you run these aggregate functions, you will get a summary record. And there's somebody backward is going to add up all your payments, average the payment, give you a minimum, maximum, and the count of the records. This is a quick way to get your summary statistics. So in this example, we are not using group BY clause yet, and we will use that at a later example. So if we look at the payment table in our database, it has an amount field. So these are all the amounts that customers paid when renting the movies. So we want to get the minimum payment, maximum, average total, and then also the number of transactions we have. So we're going to use aggregate functions and it should give us one record. So let's run that. So if you see we get one record back summarizing the data in the payment table. So the amount field has the minimum payment as being 0, Maximum payment being $11 and 99 cents, average total sum of the payment. And then the total number of transactions we have, which are 16,049. So this is a quick way to just get a summary data from a table by using these aggregate functions. 24. Groupby Part1: If you look at the diagram, what the group by does is that it partitions the table based on the customer i d. So if you look at Partition one therefore transactions. That customer one did with 2030 30 and 40. So that partition is going to get aggregated, and you will see one record for customer i. D. One same thing with Partition 23 and four. So the group by is basically isolating and aggregating each customer's data. You look at the syntax. There's a new addition to our previous sequel, which is a group by customer I. D. And also, we have added that in the select piece of the sequel as well. So we're adding a customer i d comma and then the aggregate functions that are going on to the Payment column. This is a very powerful sequel statement that is used number of time in multiple projects. So if you get to understand this sequel, it will really help in your data analysis activities for this lab will be creating an employee table to just run this create statement and the insert statements, and he should be able to see an employee tables record. So we have i d employee name, department name, salary and age. So this is the data that we loaded right now. So we want to see the total salary per department and also you want to see the number of employees in that department. So to get that data, we need to group by department name, right? So that would be our partitioning field. Whatever group, by calling we use, we have to use it in that select as well. We're going to some on the salary label it as total salary, and then we're gonna do a count, and this count is gonna go per department and that would be the total number employees. So let's run this great. So we see that for each department were able to get the total salary and were able to get the total number of employees. So such a simple group by statement has produced really important information that can be used by the business. So let's take another example, keeping the sequel the same as the previous one will just add. That would roll up key word after the group by department. Me. So what this does is that It adds another record apart from what we already get from the group by and gives us a total summary record along with the group buy records. So if you see the last record, should added up all the salary column and added up the employees column. So the total salary is 44,000 and the total number of employees are 11. So this roll up gives us a summary record that gets added to the group by results so it can come really handy when needed. 25. Groupby Part2: So if you want to filter the group by results, you can use the keyword having and then the group by column, and you can use an operator to filter the records. So if you see the example, we're saying having some off payments greater than 1 20 So we're using the having clause. It's gonna Onley bring back those customers where the some off their payments is greater than 1 20 So if you look at this table to some off the payments for customer for is greater than 1 20 The rest are less than 1 20 so that record is gonna be returned. So having keyword can Hughes to filter group buy records further if I just run that piece what we did before. So we're going to get a job production, service and sales with their total salaries in that apartment and the total number of employees. So now, if you want a filter, this result set to only show those salaries where the sum is greater than 18,000 Onley production records should show, so we'll use the having keyword and then do some salary created than 18,000 so only production record should show Let's run this great so you can see that having key word can be used to filter group buy records further. 26. Introduction to Subqueries: Subqueries I like nested queries. So you have a main query and then a subquery. There are two types of subqueries. One is correlated and the other one is non-correlated subquery. Correlated subquery is dependent on the main query in such a way that for every row in the main query, the subquery gets executed at least once. Non-correlated subqueries are independent of the main query. And they get executed once. We have a list of predicates that can be used between the mean and the subquery exist, NOT exists. An n are some of the common predicates that we can use. And we had some earlier lectures on that. Additional set of predicates are any ALL or some. And we'll see some examples of those and some comparison operators that can be used as well. Hopefully this visual will help you understand the building blocks of setting up a subquery routine. We will do some lab exercises in which the concept off correlated and non-correlated subqueries will become much more clearer. So let's look at the first example of the subquery. We want to get the list of customers name where the rental duration is more than five days. So they had rented the movies for more than five days. So for this information, we need the customer table and the rental table. So rental table will have information about the movie as to when the movie was rented out and when it was returned. And the customer table has the first lastname. So we'll start with select FirstName, LastName from customer table. So this is pretty straightforward where we're going to use the exists keyword open bracket, and then we will start our sub-query. So this is our sub-query. So we're going to select star from rental and we're gonna give it an alias are where. And so this is the joint customer ID is common between rental and customer table. So this makes this subquery correlated query. Date dif gives you a difference between two dates. It's just a function that MySQL has and we'll go into date functions at a later lecture. But for now, just, just think of this function returning the difference between these two dates and it will be greater than five. So let's run this. So this gives you the list of customers who rented the movie for more than five days. Let's take a second example. We want to get the list of movies that are not available in any stores. So you want to get the list of movies that are not available in any stores. So we need two tables for this. One is film and the other words inventory. So where are we going to use not exist. So we're going to say select star from film, give me all the films not exist. And then your subquery select star from inventory. So inventory table has a column Philip ID, so we can join the film to the inventory table on film ID. So it says not exist. So this will return me all the films that are not in the inventory table. So you can see there's a bunch of movies that are not in any stores. And this is an example of a correlated query as well. So let's take an example of subqueries are not correlated. So the first one is really want to get a list of all the payment transactions that are above the average payment amount. So if you look at, so you're looking at this script, select star from payment where amount we can use, use the operator greater than. And then we can use a subquery and then that we'd just say select average amount from payment. So we using the same table in the subquery that returns the average amount. And our main queries saying is that select star from payment where amount is greater than the average amount that gets returned from the subquery. So let's run that. So this list gives you all the amounts that are greater than the average amount. The second example is where we want to find out how many stores are in Woodbridge. So our main query select star from store, where address in. So we're using the keyword in and we're using the address ID. And the subqueries select address ID from reusing the address table and joining it with the city table in our subquery. So the subquery itself has two table that we're joining on city ID, where cities would rich. So the subquery returns us All the address IDs that exist in Wooldridge. And then the main query is pulling data from the store table where the address ID is in this list. So let's run this. So we have one store that isn't what Rich. And you can check that by profiling the individual tables as well. 27. Date Functions Introduction: date functions. You either love it or hate it. Date functions become really important when it comes to data that's time sensitive. As data comes from different part of the world, date format becomes an issue as you can see from this picture that date formats in different regions of the world is not conforming to one. If you look at the blue colored countries, we start with the day d D dash month MM dash year. So that's kind of a common format. But if you look at the yellow countries, we have the year first. If you look at the countries and red, we have the month first in the date format. The day, month and year portion gets inter changed as we look at different parts of the world, so understanding how to format a date field becomes really important, and I can emphasize more that or the course of years. I come across so many occasions where I had to deal with formatting the date data, so hopefully this picture can give you an idea off the different date formats we can get when we're dealing with data that's more international. Sequel has a lot of powerful date and time functions. I put together this diagram for you to understand the anatomy of the date and time format as it is using the databases. So if you look at the left off the bar, we have the date portion. It's a four digit year, two digit month and two digit day. If you look at the right side of the bar, that's the time portion. So we have two digit, our two digit minute, two digits. Second and then we have milli seconds. That could go up to six digits. If you're able to understand this format inside out, you really have an edge on the other. Analysts. Hopefully, in our lab exercise will go through multiple functions that deal with either the date portion or the time portion off this field. 28. Date Functions Part1: below is the set of functions we're going to go through in our lab. The set of functions in the lighter green color, basically our in built date and time functions that give you whatever the current date is. The second set is basically, if you want, extract a certain component off the date back from the data field. The third set, which is the orange color, basically gives you the time portion that you want extract from your date and time field. And the 4th 1 which is highlighted in yellow, is basically a formatting function lets you format the date in two different flavors. So let's go to our lab exercise to try these out. So let's take a look at script once we're using the in bill, date and time function that my sequel gifts. So now let's run this script So current underscore. Timestamp gives you the date and the time. And remember, we went through the format off the date portion and the time portion off the date and time . So this kind of mimics that now also gives you date and time boat together. Current date just gives you the date, so it's the year portion, then the month and then the day. So it's why, Why, why four wise two m's and two D's current time gives you just the time portion. So these functions are good to remember when you're actually doing your application development as well. So let's look at script to so we're gonna be using functions that can extract either the year portion, the month or the week or the day portion off a date and time. So let's run our second script. So if you see we already used the current underscored date function before, so he's just going to give you the date portion, so we have 2018 10 20. So if you say a year open bracket and then used this current date inside the bracket, what this will do is just extract the year same thing with quarter month, week and day. It's pretty self explanatory, and these functions unnamed pretty well. So let's say if you want to say day and in the bracket you used the current underscored date, it's just gonna pull that. Today's the 20th day. Same thing with week, month, quarter and year, and these functions become really important in reporting when business is asking you to split the date and tell you what the quarter is, what the month is, what the week is for their sales data so really, really important is really good to know these functions that can drive and isolate the year , quarter, month, week and day. Very helpful. Let's forget script three. So now we're using functions that can extract the time portion and dissect the time down to our minute and second. So now we're gonna use the current underscore time function to give us the current time and then these additional functions that will drive the hour, the minute and the second same thing what we did with the current date, we're going to use the hour, minute and second and then in the brackets, we're gonna give the current time. Did you look at the current time? It's 11. 50 to 58 so it takes the hour out the minute out and the seconds and you know these functions will become important. If for some compliance reasons, you want to know the minute off your transaction when it was paid or you want to know the hour or you want to know the second thes granular attributes become very, very important. Now we're gonna look at the current time stamp, which is the date and the time portion together. So let's run this. So if you see the current time, we have 2018 10 20 then the time portion, so date underscore format can be used for date and time values. So let's say you have your date and time value right here. Right now, we're using our current time stamp, but in production systems you will get an actual date and time value. So you need to place that hair first and comma. You see these letters percentage K percentage. I ask d M. And why these are pre defined letters that depict like percentage K depicts. It's an hour I depicts is a minute s second D day M month and why year? So this is something that you just have to remember. Just make sure you remember that date underscore format function takes two parameters. One is the actual need in time, and the other is actually what element you want extract when you look at the results said were able to extract the hour, The minute, the second and the day, month and years. This is an important function as well. They say you get the current date and time and you're asked by the reporting analyst to tell you what week it is. What day is it? Not just numerically, but from a textual perspective is a Saturday. If it's a Monday, how would you go about driving that? So let's take a look at this script. So we're using the date underscore format, and we're gonna use now if you look at these pre defined letters there, so percent W will tell you the week day name percent m is gonna be month name percent. He's gonna be day off a month percent Why is gonna be the year, then the same portion ages, our eyes minute and species am or PM. So you see, we got the date and time here then, So we're seeing it's a Saturday. The month is October 20th is the date 2018 and then the actual time. And then if it's a PM or in AM so as you can see this derived date and time is more descriptive and sometimes for your reporting you would need that. You would want to convert a date and time from this format down to here. So just remember that these are pre defined letters that are used in the sequel function. That's something that you have to remember, and hopefully, as you practice more, these functions will become very, very easy for you to understand. 29. Date Functions Part2: the following set of date functions either help you add or subtract date and or time from the date field. These sets of date functions become important when we have more time line oriented data that's needed. These sets of data and time function helps you move along the time line, either by adding days, months, years or subtracting them to get the desired result. Let's go use them in practice. So let's look at Script one there. A lot of times when we are asked to either add or subtract a certain number of days from the date fields, this activity is really important. As we're trying to report a certain metrics, you might come across some use cases where you might either have to subtract or add days or months to a particular date field, and it might be for some reporting need. So let's look at the script, select current date, and then we're adding five days to the current date and then just giving a label. So let's run this script so you see, we added five days, asserts 2018 10 25 to add the days you need to use the keyword interval and then the number off days you want to add, and then the green off. What you want to add. In this case, it's the day. Do you want to add? Let's look at the next example. We can achieve a similar dissolved by using the ad date function as well, so add it takes two parameters. The first is the actual date field that you want to add to comma, and then we can use internal five days, so this should give us the same result as well. So it's added five days. So if you asked to take days out, we can use the negative sign and see Interval 10 days it will take. It will deduct 10 days from the current date. Soc It's 10 10 like heading days weekend. Either add or subtract months. So let's say, if you want to subtract 10 months from today Associates 12 17 12 20. Likewise, the year portion. We can also take 10 years out of the date. It's giving us 2008 pretty nice handy functions. So now let's apply these date and time function to our database, so we have a payment table. She has a payment date column We wanna list out the month and the total number of transactions for the customers in that month. So let's look at the script so select payment date as paid dates. So that's our field in the table Month name is gonna extract the month from the state field and actually extract. If it's January, February, March and then we're doing account, we won number of transactions from payment. We have to use the group by so you want to group it by Beeman Date, but only the month portion off the payment date. Then we just ordering by the count. So we want. Basically the first record should show the most number of transactions, and it should go descending. So let's run this. So you see from the results that July has the highest amount off transactions, then August, June, May and then February. So looking at this data, the store can actually see that July's are mostly the busiest time for movie rentals, and they can actually add more. Resource is in their stores, in which one are the least busiest months, and this can impact the business directly, so knowing the date functions can really help business decipher the type of data you have in your database. Let's say if you're asked to produce an average number of days, a customer holds a movie for viewing. So to calculate the average number of days a customer holds a movie first we need to get the difference between the return date and the rental dates. How many days? Actually, the customer holds the movie, so the first script just gives you that. You can see that these are the number of days a customer's holding the movie now, and we're using the date day function to calculate the difference. Now to calculate the average, we just used the average function. So the average number of days a customer holds a movie is five. If you want to know which day the customer rants the most off the movies. So we want to find out which day off the week customers are renting the boast movie. So let's look at our script for this select day name from rental date. It's just gonna give you if it's a Monday Tuesday Wednesday, a count the number of movies rented from rental again we have to grow by the rental date. So if you look at this data, Tuesday seems to be the day when we have the most amount of movies being rented in a summary daytime functions player Really important role in your data analysis journey. Time after time again, you will come across situations where you have to use the date and time functions. So my advice would be is to go through these exercises a few times, use different data and try to get used to these date and time functions, I think. 30. Numeric Functions: sequel allows you to from a lot of numeric functions. When we look at our database, we have we might have numeric feels like quantity paid amounts or a particular price. So any time you see in America data, you can apply thes numeric functions. These functions can be characterized in mathematical functions and or rich thematic operators. So if you look at the first staple for mathematical functions, we have functions like floor round sign. It's pretty simple to apply these mathematical functions. I'll show you in the lab as well as the arithmetic operators for division, multiplication and addition. So these operators come really handy. Where we were trying to either add up in America data or whenever we're trying to apply business rules on these numeric fields, these numeric functions become really, really important. Now let's see these functions in practice. Let's take a look at our first script. So we're applying an absolute function short A. B s open bracket, and we're giving the value negative two and we're giving a labour last or how we want to see the column name Mama Seal again Open bracket and we're giving the actual American value floor round and sign, and we're not giving a particular table. So you can also just use thes functions by themselves without giving a table, and my sequel will calculate it. So let's run this now. The absolute negative to brought back to which is correct. Ceiling brought back 31 because it's the upper number. Floor got 30 which is correct. Round got 20 because we're rounding the number from 20.2 to 4 to 34 to 20 and sign is basically giving you what the sign of this value is. So it's pretty simple to apply these mathematical functions, and you can try out your own variations. Put some different America data here and see how the functions behave. The second category was Earth Matic functions. In the second script, we're doing division. We're also doing subtraction and addition. It's pretty simple, and you can run these with your own radiation off numeric data. The result. One is one result to his one. We were using a D. I V instead of the slash for divide as well, so both can be used to give you the same result. Just a negative sign. In a positive. She's pretty simple, so hopefully you can try out some examples on your own and and try out both the mathematical functions and the automatic operators. 31. Introduction to String Functions: Being able to effectively use string functions become really important in your data analysis. String functions really come handy when you're trying to clean bad data. Having data in a certain format becomes really, really important for companies that get audited on a yearly basis. And strength functions can be used to get your data in a particular format. We will go through each of these categories and I'll show you examples of how each function can be used. These functions might look overwhelming at first, but I'll try my best to go step-by-step and expanding how they can be used. We will be looking at the converting category for string functions. And the list of functions are ascii, cast and soundex. So let's go to our lab. So let's say if you're given a task to bring back those customers where it's the last name is Smith. So we'd run this simple select star from customers where last name is Smith. So at the first glance, we can say we don't have any records where the last name is Smith. But what if the data quality in your table is not good and somebody is fat fingered the lastName, meaning that there are some spelling mistakes. So you will, at first glance say we don't have any customers, but the last name is Smith. But this is where one of the string functions come handy in use to sound x function. What this function does is it's going to bring back that data that sounds like Smith. And it's pretty neat. So we can run the same select, select star from customer where. So we'll use a select star from customer statement where sound x is the name of the function, open bracket and the name of the column that we want to check equals soundex again and what value we are trying to check against. So it's going to compare the sound. So both of these data points. And if you run this, you get two records back where the last name sounds like Smith. And when you have a lot of data, and let's say you're trying to find a particular name. This type of function will help you to return names that sound like a particular given name that you have been tasked to find and will help you in profiling the data in an effective way. The second strip is where we want to actually use the Cast function, which converts data from one data type to the other. So we have select 2010, 10 as value. So we're just using this tax value. And then we're casting this text value as datetime. Basically the function Cast uses the value that we are trying to convert, and then we use the keyword As and that data type that we want to convert it to, just giving it a label has cast value from dual. Dual is a dummy table in MySQL debt can be used to test your functions. So as you can see that the cast function converted the tax value 2000 1010 into a datetime format. So casting can be used when you are ingesting a lot of data that is not standardized and where you want to cast that data into a particular format. So very powerful function that gets used a lot as well. The third script we will be using an ascii function and ask you, functions can be used when you're doing data conversions. So some use cases might ask you to derive an ascii value. So if we run this script, so basically this is your last name column and it converts it into an ascii value. I've also given you the ascii value for an empty screens, which is 0. Formatting string functions are used a lot when writing SQL scripts that lets you manipulate string values. Having a knowledge or formatting string functions will definitely give you an edge. Let's go to our lab section. So let's take the first example. We are tasked to convert a first name and the email address to lowercase. So this script, we have FirstName column on L4. That's the name of the function open bracket, and then the name of the column you want to change the case to, and then lowercase e-mail as walls, we just going to limit to the number of records being five. So if you see the first name was all uppercase in the database, so we converted it to a lowercase. The e-mail also gets kind of order to a lowercase. So it's a pretty simple function. The second script is where are we going to be using a trimming function? Trim function removes spaces. From the value. So let's look at the script. So the first column, we're concatenating the firstName comma. And so if you see 123, so we have three spaces after the firstName. And then we're doing the length. What's the length of this? Once we add three spaces to the firstName. And we're going to use the trim function, right? Trim firstName. So it's going to trim the spaces in the end of the column. And then when you're going to see the length of the final value. So let's run this. Let's look at the first structure is Mary's length is four plus three, It's like the seven. But when we use the right trim function, it traps the spaces from the right side of the characters. So here you can see we have used a trim function that trumps the spaces from the right. And the third script is where we can be using a padding function. So let's look at the script. Select FirstName, LastName. Arpad means right pad, firstName comma. So the length of the total output needs to be 15. Third parameter is the actual padding value. And in our case we're going to be using a dot. And then you're also going to write pad the last name. So let's run this. So we had the FirstName, lastname. So FirstName, the length of the name is four. So there are four characters, MAR why? It adds 11 dots. So the total length becomes 15, and that's what we have. We are qualifying it, but the length that we need for this field to be at. And then we give what we need to use for the padding value. So it'll be to source column, the total length of the desired value and what the padding value needs to be in quotes. So this is a function that can be used to format a lot of string values. String functions can also be categorized under expression extracting and manipulating categories. And we're going to be trying out some of the functions in our lab exercises for these in a lot of data projects as being an analyst, you'll come across different scenarios where you're given a task to either manipulate the data or either extract some value out of that data. So these are the functions that you can use to get your desired results. Let's try some of these functions in practice. So let's take a look at the script one. So it's asking us to find the position of character RY and the first name of the customer table. And we're going to be using a string function called int string short IN STR, which gives you the position of the character you're trying to find in this string value. So let's look at our script. So select FirstName comma IN STR. That's the name of the function. Open bracket FirstName, that's the name we want to find, where the character exist. Comma RY. That's the character we're looking for, closed brackets from customer. I'm using this in the where clause to bring back only those records where the firstname does have RY, as this particular function gives you the length. So if the length is greater than 0, that means at the firstname does have our y in it. So let's run it. See if you look at the results set. The first record marry for the position from which our y starts from is three. So M a R. So R is at the third position. If you look at Cheryl, It starts at the fourth position, RY right here. So this is a string function that can give you a position of a particular character in the data pretty quickly. So let's look at script to under extracting. So we want to use left, right, and mid functions to split the phone number that we have in the address table. So let's look at this script. Select phone comma left open bracket phone comma three. So that means every are bringing three characters from the left of this value. And we're going to give it a label as area code. Then we'll do a mid. Mid takes two numbers. We'll start at the position four. And up to the third characters after position four would be the mid phone. And the right would be, we need the rightmost four characters. So let's run this. So yeah, you can see from the results that we got back is that 99 eight came into the area code. 834 came in the mid phone. Column, one to 75 came in the right phone. We look at some manipulating string functions and script three. So let's say if you want to reverse your first-name, it's pretty simple and we just use the reverse function and give it the name of the column. Let's run that. So you can see can easily reverse a string value just by using the inverse function. We can also use it Replace function to replace a value in the data. Let's look at this script. Select FirstName comma, replace the name of the field comma Linda to Lynn. And we'll just give it to label from customer where firstName is Linda. So what this does is that it finds Linda in this column and replace it with Berlin. Let's run this. You see Linda is replaced by Lynn now. So it's a quick way to replaced string values. The last script is basically you can repeat data multiple times by using the repeat function. So let's run this. As you can see, the value is Eddie and it's get repeated twice. These are just some examples in which I'm showing you how a string function can be used based on a particular business is you can use a combination of all these string functions and actually get the desired result. 32. Conditional Statements: conditional logic and sequel can be applied using a case statement. It's similar to an if then else statement. If you've used any other programming language, Gates statement can help you transform data from one settle values to the other, and also it can help you prevent enters. If you're doing some sort of a calculation. Now, let's look at the syntax. We'll use the keyword case to start the case statement. Then give the particular column whose value we want. Evaluate when the value, then whatever value we want to use to transform that we can have multiple conditions by using the Wen and then keywords else. If none of those conditions are met, we can give a particular data value a point and then end. Case else is optional. Let's look at the case statement in practice. Let's use the case statement on the film stable. Let's see what we have in the table first, so we have a title in a rating. So what we want to do is we wanna basically expand on the rating acronym and give a description. So what? We can uses a case statement to actually enhance thes acronyms and let our users know what they mean. So if you look at the first script we're using the case keyword, then we're giving the column. We want to transform, and in this case is gonna be rating when and then the value of this column. So if the value of the column is our, then we're going to change it to restrict it. So these are multiple conditions when and then that could be applied to change these documents to more descriptive information. And then you use else if none of these conditions are met than else, condition is gonna be triggered. And as and we're gonna give it a label for this column as movie rating description and also ordered by Title 31 order it a sending. So let's run this. So looking at the results that if you CPG I had a condition where of reading was PG, then transform it to parental guidance suggested. So you see, this new column is actually giving us to transformed value, So this case statement can become really, really important when you're doing migration from one platform to the other, and there some requirements where you want to change values based on a particular business requirement. So you see, all the actives have been successfully transformed. So there's another flavor in which we can use the case statement. The script that we ran. You were using the column after the case. What we can also do if you look at our second script, is case when the column equals R, then Transformers to Restricted and Harry listing the column in each of these conditions and then assigning a particular transformation ruled to it. Both of these will give us the same results that hopefully you'll find these examples helpful. 33. Introduction to Window Functions: Window functions are one of the coolest features I think SQL has. Once you note the mechanics of working with window functions, you can use it for different use cases and calculate the desired metrics. The windowing process that's you divide that result set into groups or froze called partitions. And then frames let you operate on even a subset of partition data. By breaking partitions into smaller sequences of rows. There are a lot of functions that can be applied. We had the windowing concept, some of the use cases that can use that window function. If you want to calculate the top 10 products this year by total sales. If you want to calculate moving averages, are running totals. We will go through some scenarios where you can see how window functions can be used. You can apply the window function on a particular column while still maintaining the detail of the whole table. Like if you remember in our group by functions, we used to get data that user condensed the number of records based on the group by column. With window functions, you can still maintain the detailed set of records off the table and still apply grouping on the data based on different aggregate functions. So the windowing concept becomes really important when you're trying to see that data in its totality. Next, we'll see how the window function is constructed and what are the components that make up a window function work? Window functions are one of the coolest features I think SQL has. Once you know the mechanics of working with window functions, you can use it for different use cases and calculate that desired metrics. The windowing process lets you divide results set into groups or froze called partitions. And then frames let you operate on even a subset of partition data. By breaking partitions into smaller sequences of rows. There are a lot of functions that can be applied. We have the windowing concept, some of the use cases that can use that window function. If you want to calculate the top 10 products this year by total sales. If you want to calculate moving averages, are running totals. We will go through some scenarios where you can see how window functions can be used. You can apply the window function on a particular column while still maintaining the detail of the whole table. Like if you remember in our group by functions, we used to get data that user condensed the number of records based on the group by column. But when window functions, you can still maintain that detailed set of records off the table and still apply grouping on the data based on different aggregate functions. So the windowing concept becomes really important when you're trying to see that data in its totality. Next, we'll see how the window function is constructed and what are the components that make a window function work? 34. Window Aggregate Function Part1: the following categories give you the list of functions that can be used in the rendering logic so you can use aggregate its ranks functions for statistics, distribution and also positional activities. So these functions become really handy. Once used via the rendering logic, we're going to start with using the AC create functions in our lap. Now we'll start recreating the employee table and inserts some data, and some of you might have already done that in the previous lab. Or so we have employee I D. Employee names the department that they work in their salary in the age. So that's the data we're working with. So let's start. But finding the total salary per department, right? So, you know, from the rarest lecture, we can use a group by department name to get the total salary by summing it up here, you can see the group bike loss, has grouped each department and summed up the total salary. Let's say you want to calculate the average and the total sum off all the salaries in the table, presented in a way that the actual number of records in the table still show when you run the query for that, you're going to use the over clause with initiates the rendering process. So we're gonna do the aggregate function off some passing the column that we want to aggregate on than the over claws. And this give it a label. Same thing with the average. We're gonna give it the column name over clause and the label and just run the script. So, as you can see, we got all our records in the employee table. Employee, I d. Employing name, department name and salary. So we have 11 records showing up. But if you look at the last two columns, the total salary and the average Sally's toward this over Klaus dead was summed up all the salaries of all employees enlisted it at every dro, along with the average salary at every row. Our third strip is asking us to sum up the amounts for Marion Barbara, where each amount they paid is greater than $5. So for that we have to join the customer table and the corresponding payment table. So we're gonna join those two tables on customer I d. We still using the where clause where amount is greater than five and The name is Mary and Barbara. If you look at our select statement, we want to know which month they paid these amounts in your summing up the amount and then using the over claws and then giving a total payment. So let's run this here. You can see the Mary's transactions in June, she had two transactions that were greater than $5. In July, she had three transaction. But if you look at the total payment, the or function is adding up all these amounts and showing it at each row, some off all these amounts together. So this kind of shows you the aggregation being shown with the detailed records are four. Script is a little variation from the third script in scripturally we were not grouping or partitioning the win doing function. But here we're gonna introduce ah, partitioning claws and we're gonna partition by each customer. So we gonna partition by Mary and then we're gonna partition by Barbara. So the script is pretty much the same. The only difference we have is that when we some the amount and use the over clause in the over claws open bracket, we're going to use the partition by Klaus, and we're gonna partition by customer i d. Let's see the results if you see now, so the total amounts are aggregated by the customer. All of Mary's amounts have been totaled up, and all of Barbara's amount have been total up. This partition by Klaus comes really handy when we wanna partition on a particular column using the rendering functions, you can see the total amounts here $41.94 only relate to Mary Smith's transaction When you some these up. Great. I've added a slight to explain these results, further calculating the total payment amount for each customer, ensuring their detailed records alongside it. So if you see we some the amount and use the over claws and then used partitioned by customer i D and labeled as total payments. So if you look at the last column using the partition by Klaus created two partitions in our results that one for Mary and one for Barbara. And by using the partition keyword, you're able to isolate your aggregate functions to be only applied on that partition. Hopefully, this Richard was helpful. In 1/5 script. We want to calculate the total on average. Sally per department. Enlist the total employees per department as well. You had seen the employee table. We had employee named Department and salary and their age going to use the or clause we can also use the partition by. We're gonna partition by department. So? So here. You can see the 1st 3 records relate to the HR department. Salaries are here, So the total salary per department. So there's 12,900. Is this some off these three salaries in the HR department? Because we partitioned by department name. Likewise, average salary per department and the total employees, so we have three employees. 35. Window Aggregate Function Part2: Now let's take a scenario we wanted, classically it. They're running some off the paid amounts per month by customer, where the each transactions are greater than $5 for Marion Barbara, so the script remains the same. The only portion that will change is that we're going to add the order by paid month clause after the partition. By custom righty. So what this does is that it further segments out the data by month after it has already grouped by the customer i D. So let's run this. If you look at Mary's transactions, we have months 67 and eight. So we have three months in which the transactions air split. So for month of June, this winnowing function added up the two transactions for the month of June. Then it came and added the transactions for the month of June, plus the month of July, So 35.95 is adding transactions for all the month of July and June. The last transaction for Mary for the month off August is $41.94 so it adds $5.99 to the $35.95 we calculate in the previous role to $45.94 is the total for all these amounts for all these months as well. It's a running total as it keeps adding the transaction amounts as the current role moves down. If you don't specify a row and range and give a particular boundary, that default is range between unbuttoned proceeding and control. Let's try to understand this through a visual aid. So when we use the renewing function and do the order by payment month, if we don't use any range or don't set any boundaries by default, my sequel will process this as a range claws and then do between unbounded, preceding and current row. You look at the results set if the order by value is the same in this case for the month of July via three transactions range would calculate the aggregation for that block of data together. And that's why you see $35.95 being repeated for each row for the month of July. As the calculation happens on that logical group, hopefully, this ritual will help your differentiate between a row and arrange calculation Now let's update the previous script and add another rendering function where we're actually explicitly seeing rows between unbowed and proceeding, and the current after the order by clause, and leave the other running total payment range as is. Let's run this so the running total payment rose. What it does is that it adds the amounts row by row, like a physical row gets added to. The total amount does not matter if the month is the same. But if you compare the range running total if it sees the month being the same, it as the total amounts and give the same amount on those two rows, then it comes to the month of July. It add all the July and the previous difference between rose and range that the order by clause does not impact the row calculation, and it basically works. It's aggregation on each row as it goes through. On the other hand range, basically logically groups based on the order by and calculates the total sum based on the order by column. Let's look at the roads calculation through a visual aid. So the moment we used the Rose keyword, the framing happens at each row. So as Rose are being added, the frame gets bigger as you can see from the visual on the right side that the frame one through frame six is an increment from the first row to the last row. So as the frame goes bigger, the total sum or the running, some gets added as the control goes down to the last row off Mary's transaction and likewise with Barbara. Now let's say if you want to calculate a moving sound, meaning that a sliding window that goes through our set of records will keep the rest of the script the same will still use the some off the amount over Klaus partitioned by customer I D. And the order of the month remains the same. So this information we're not changing, we gonna add Rose between and give a boundary a one preceding and current crow. Likewise, we can also add the range between one proceeding and country and see how our calculations are different between rose and range. Let's run the script if you look at our rows Columns, um, moving total payment rose bait amount one day, so we have ah, one day prior sliding window. So now we have $5.99. So we don't have any prior RO just gonna represent fighters nine cents as you go to the next column. It's gonna add one previous amount and this amount and put $15.98 as it moved forward is just gonna only add one previous amount because you look at the last roll. The reason why we have $13.98 is because it it added $5.99 to $7.99. So it's a sliding window with one month prior by using the row between one preceding and current row. If you look at the range calculation, it again summed up the two amounts based on the payment month. So for the month of June, we have $15.98 as it went through the next set off month. It added again for the month of July, so it added July payments to the June payments, but it repeats it for the same month. If you look at the last Ruoff Mary and look at daughter moving payment range came in one day, it's $25 in 98 96 cents. The reason it is $25.96 is because it added $5.99 to these three transactions in July. So knowing the difference between rose and range calculation becomes really helpful. As you start doing your analysis, let's look at the sliding effect using the rose clause through a visual aid. So this late is showing you are moving total payment for each customer by month and were able to achieve that by setting up a boundary and by saying Rose between one preceding and current. And if you look at the framing, it's basically has two rows as it moves through, their data sat in school a sliding window. If he would have used between two preceding and current, the window would have had at the most three records as it's moving through the table 36. Window Ranking Function: ranking is another category of rendering functions that gets used a lot. Rank functions can be used to calculate the top number of records in a data set rank functions assigned the same ranks to equal rose versus danced rank that has no gaps. Hence the term dense and tile slices. The data into the given number of subdivisions you give andro number just gives a physical number two each row. In your petition, let's see these functions in practice. I really like rank function. So we're gonna be using the employee table for running our rank examples. I want to update one off the Sally Records for employees five. So we can see the difference between the rank and dance rank. Please run that update statement. Great for script one. We want to calculate the rank and danced rank on the salary off the employees. So we're gonna use the keyword drank open and close bracket over partitioned by department name. So for every department, I want to rank the salary. Same thing I want to do with dense rank. Dense underscore rank open and close bracket over partition, my department and order by salary. So I want the ranking by salary within a department. So let's run this. So if you C H R gets 123 rank and dance rank 123 So far, rank and dance rank is giving us the same pranks. Let's go to production department if you look the first row or production gets both ranks as one second and third gets, too, because the salary is the same. But the fourth record would rank function gets four, and dense strength gets three. So here you can see there's a gap in the rank function. It's skipping three. But the dance rank does not skip, and there no gaps when using the dense rank. And so it systematically say's 12 to 3 and four. On the other hand, ranks say's 1224 and five. So I prefer using dense ranks. For this reason, let's see if you want to calculate the top five employees by salary, what we can do is we can apply the dense rank function, used the over claws in just order by salary descending, so we want the highest salary first and lowest at the bottom, and we're not partitioning by department right now. because we just want the top five salaries regardless of the department. So we're using an in line sequel right here. So if you look at the brackets, this particular select, it's giving your dance. Drank for the high salaries one and then kind of goes down all the way down to eight. But we're looking for the top five employees, so we need to filter this data further. So to filter the sequel inside, we will use another sequel outside this and keep this equal in close brackets given an alias. So the whole script inside will get an alias off acts So we'll say, Select star from the inner sequel. This acts where salary underscored dense rank ISS less than unequal to five, which is filter these records where salary dense rank is less unequal to fight. So if we run that we should get the top five employees by salary. There you go. So it's a handy view of getting an end number off people within a group based on a particular metric. Let's look it script to we want to basically use the entire function to split the payment data off the customers into four groups How are we going to do that? So we're gonna use an end tile function Open bracket four. Basically, this gives you how many slices you need off the data over which initiates the window function order by the amount descending as quartile. And we're gonna join the payment and the customer table on the customer I d. So, basically, we want to see the amounts paid by the customer and then group them into four distinctive groups. Let's run this. If you look at the quartile College assigning these roads Group one, if you scroll down to records, we have some records assigned as to sometimes three and then four. So this is an easy way to categorise the number of records in your table toe a particular given group using the n tal function and giving a particular number of groups you need in this parentheses right here in script three, we're gonna be calculating the top earning employees per department. So for every department I want the highest paid employee, we're gonna use the role underscore number function, partitioned the data by department name and then order the salary descending. We roll numbers basically assigned just a number to each physical row. If we run the script highlighted on the screen, we gonna get a sequential number assigned to each row in our employees table. So for HR department, we're getting 123 and then the ordering is based on salary descending the way we can filter this record and only bring records where the employee ranks equal to one. We need to have an outer sequel, Select Star from Open Brackets and Close Bracket, where we have the in Line sequel given an alias, acts where employees rank equals want. So let's run this. So this gives us the highest paid employees in each department based on the salary. Hopefully, these examples help you understand the ranking functions. 37. Window Positional Function: positional window function helps an advocate data location in the results. That, and helpful in business reporting, lag and lead functions let you either access the data before the control or after the control, depending on the offset you give. First value and last value gives you either the first or the last value in an order to sat . Let's look at these functions in practice. So what script one is doing It is calculating the first and the last value off the employee I d. Partitioned by department name. So it's a pretty simple script. Give you the first employee and the last employees. So if you see we're getting frame first row, we're getting the first employee here and cream last road. We're getting the last employee here, which is Grace Lee. Pretty simple, straightforward function. Let's try to run script to, which is calculating the lag and the lead function or the employee i d. Partitioned by department name and ordered by ideas were ordering it by I D. So let's run that If you look at the HR department, we have three employees. So for the first employee, there's no previous row. If you don't give an offset. What a default value for lag and lead. So it's gonna basically set the offset by one record. So lag would be one record before, and lead would be one record after. So the next row, which is using the lead function, is going to be two and three from the current trolls. It just gives you that for the third employees in this particular partition. There's no next record and it's gonna be no. But if you use the offset for the lead function, like in this case I've used to that means that the function would fetch record from two rows down. So when you're on the first row on employees, one to record Stone would be Grace Lee. So we're getting three here for the lead with setting it off set off to, but they're no records after two records from two. That's why we're getting no, no, we're using the lead and lag offset. You can really navigate the data that you have in the table within the partition. So for every new partition, the window function would start again 38. Introduction to Data Visualization: One of the reasons why I added the data visualization section to this course is that as a data analyst, you would be asked to present your data findings in a very concise and impactful ways so the business can understand the value of your analysis along with the key metrics that impact their business. Think of data visualization as a tool for communication. One of the best ways to explore and try to understand large datasets is through data visualization. The concept of storytelling is how we set the stage to showcase our data. So it can be understood without going into too much technical details. And with data volumes going, business as leaning more and more towards leveraging data visualization, do better. Educate them on either discovering patterns, spotting trends, and highlighting information where there are opportunities to improve. The goal of all data-centric organizations is to make better data-driven decisions, which would lead to better insight and that would ultimately drive business. Aesthetics of data visualization becomes very important as we start using visualization to communicate the message. Aesthetics described every aspect of a given graphical element. So like position, which describes where the element is located. So location becomes important. You showcase your visualization. Shape can be used to show different datasets. Size can show magnitude, color can show intensity, lines with varying width and type control patterns. So all these small intricacies can help your data visualization be more impactful. And we'll see a lot of examples in the other slides. Once we get our hands on a dataset, the first step is to understand the type of data values we want to visualize the two categories in which data values can be split is categorical and numerical. Categorical is more quantitative and is further split into either nominal or ordinal. Nominal categories don't have a particular order and it's used for more labeling purposes. Examples are either eye color or gender. So for instance, gender might have values of m as male, and then F for female. Ordinal values are more ordered. Ordinal data, as the name suggests, has a certain order to it. So customer survey might have data values as one being satisfied, do being neutral and three, being unsatisfied. Numerical category is more quantitative. It can be further split into either discrete or continuous. Quantitative values can be measured. Discrete dataset, for instance, can take only a certain value. For example, the number of people in a room can either hold six or seven people, but not 5.5. So the examples are a number of sites. You have four dice and number of pages in a book. Continuous dataset is where the data can be split to further slices. And it's more like in finite continuous data values are for which you can divide them into finer and finer increments. So for instance, a temperature or a weighing scale, you can further subdivide the weight into smaller units, so it's more or less in finite, it's continuous. The distinction of data into these groups will become really, really important as we start to use different types of charts and graphs, which we'll see in the later slides. Hopefully this visual will help you understand how we need to think of data before we start putting the data into a more visual form. You might be faced with a particular situation where you need to present your data based on a specific requirement that can either be categorized as comparing the data values or showing composition of your dataset, or presenting how data is distributed or it is related. For each of these categories, we have a set of charts and graphs that can be used to deliver the desired results. I'm showing a condensed list of chart types within these categories on this slide. We will go through each of these chart types in detail. In the later slides. I've attached a chart selection diagram, lecture notes created by Dr. Andrew Abella that is pretty elaborate and will help you pick the right chart for your datatype. And important question you can ask before selecting a visualization is, how many variables do you want to show in a single chart? He's a 123 or more. And how many data points you want to display for each variable. And will you display values or period of time? What among items or groups? Answer to these questions will help you pick the right chart for your visualization. 39. Tableau Software and Datasets: We will use the tableau software for our data visualization course there many other data visualization tools in the market, but tableau by far is one of the leaders. If you follow the link given in the slide, it will take you to a public version off tableau, which is free of cost. You can download it either for Windows or Mac version. The only downside is that you cannot save your visualization on the desk. You can only save it online on tap those public portal. So this is only good for testing or university base assignments. You don't want to share sensitive data on the public website. We're gonna first install tableau on the Mac OS. So just given an email address, we have successfully installed tableau on Michaela's Let's install tableau public version on Windows. Great. Our software has been installed. We have two data sets we're gonna use during the course. One would be the world indicator, Data said. The other one is going to be the sample superstore data set. You have no dimensions or measures, and pretty much you have a blank canvas at this point. So what you want to do is you want to click on this, connect to data to get those two data sets loaded. So we'll go to excel because we have those data, sets and excel. We need to link to these two data sets. So we'll go first to the World Bank CEO to Excel sheet and load it once you're connected to the Excel Sheet is going to show you the tabs that you have in your Excel worksheet going to click on the CEO to data clean and double click that great. So we have actually successfully connected to the world CEO to Excel Sheet. You have your country court country named Region Year, your CO two emission and your per capita information. So now if you could click on the sheet one, we should have the dimensions in majors loaded. Great. So we have those, too. Now we need to connect to the second data, sets a click on the data source here, click on this Aiken, we're gonna connect to Excel Sample superstore Excel. So be able to look at these taps that are in that excel sheet so we don't double click on the orders. But this is the data set that we gonna be connecting to is the order data. Let's go to sheet one. Now you see, we have the C 02 data connected, and also the superstore data connected as well. So you'll see dimensions and mayors for the superstore data set. And if you click on the world C 02 days that we're seeing the dimensions and measures related to that data set as an exercise, you can connect to the world indicators data set on your own. Great. So we have the data now. 40. BAR Chart Visualization: bar charts Air, one of the most commonly used visualisation that shows numeric values as bars split across clear categories. They're effective in comparing magnitudes and discovering high and low in the data, and that is shown by the height of the bars. Bar chart is sometimes called column charts. You can either show data on a war tickle or a horizontal bar, and let's see if you have more categories or labels at times becomes harder to show that vertically, and we have the option to show them horizontally. X axis showed the dimensions and why axes show the measures. Bar charts. Discrete data is categorical in nature and therefore answers the question. How many? Within each of these categories? Let's take a look at the anatomy of the bar chart. If you look at the diagram, the horizontal axis has the categories or dimensions, and the word ical axis has the measures. The height of the bar is showing the magnitude, and the width of the bar is the same across, and you read from left to right. Each bar shows a particular category. This visualization is very easy to understand. This visualization becomes very impactful when you can see the radiations. In the height of the bar, we have another variation off the traditional bar chart that's called the Stack Bar Chart. Here you see the same bar chart, and the only difference is that each bar is further segmented out into sub categories. This becomes effective in you have, ah hierarchy in the category. So let's say you have a region, and then the region is split further into 2 to 3 different subregions. So this is an effective way to show those sub regions within a bar. A stack bar chart show finer details on the data set as the data can be seen. Split in further categories. Let's go to tableau and create these charts. For this exercise, we will connect to a world indicator data set click on Microsoft Excel, and you need to open the world indicator data set. So it's connecting our power. So what Taepo does is that it lets you see the data once you connect to it in a browser format so you can see what data you have. You can also improve the quality of data if you want to change a particular value once you see it. But for us, we're going to use the data as is Let's go to sheet one. So now you can see that the dimensions and mayors have been filled up piece of the data set . So our first task is to show how the inbound dollar amount is increasing per year in our data set, so double click on the year. So we have data from 2000 to 2012. Double click tourism inbound, which is a measure go to show me and will select the horizontal bars so you can see that the inbound tourism has been increasing worldwide. You can also switch the chart from horizontal to vertical by clicking on this Aiken. Great. So we have our first chart. If you look at this plus sign, we can create a new sheet, so we'll just to hand so we'll do. The same thing will double click on the year we'll double click on the tourism inbound that would click on the region. Now in the show me we need to click on the stock bar chart. Great. So now you can see we're able to create a stack bar chart where each bar shows total inbound dollar amount spent for each segment. And in this case, our segment is the region you can see Europe is leading in each of these years in Tourism in Africa is increasing over the years, but compared to Europe is still for less great. So this is a stack bar chart will create 1/3 chart. You can create a duplicate version, often existing sheet by clicking on the duplicate option in the menu, and this will be a site by side bar chart. I really like this visualization. What it's showing is that each region is shown in a column, and you have the number of years for the data for inbound dollar amounts showing. What we can do is instead of the year being on the color palette, we can replace that by region, so each region has its own color. Let's try to add a dashboard. The dashboard would be this Aiken right hair. You can combine multiple work sheet in a single dashboard. We contract the sheet on the pan and you'll see how the dashboard actually puts the sheet side by side. This really becomes a very powerful communication tool in your presentations you will make a few modifications to the dashboard. First, we will rename the worksheets to have a more descriptive name that goes along with the chart it's representing Second on our sheet one. We will change the color off the bar chart to have a more neutral tone. Third, we will adjust the size off the dashboard appropriately. Let's go make those updates or click on the color. I will just pick a gray. Great. I have renamed the three sheets with the desired names. You can do that on your side as well, by right clicking on the sheet. You want to change the name for and select rename from the menu selection so you have your layout and your dashboard. Then you have the size. It's an automatic size, but you can increase so you can increase the wit or decrease the width. You can decrease the height if you want, so this kind of gives you an easy way to minimize or expand the size. Now you can see these three visualizations in a dashboard, which really gives you a compelling story. My favorite is the total in monitors and per region details, so it's basically gives you each region and the way inbound tourism dollar amount is increasing. If you can see Asia has have the sharpest increase or the years and also Europe is leading by far. The total inbound tourism just gives us a total picture off how every year the dollar amount on the tourism is increasing and the stack march are kind of gives you how eat segment within the year for the region is compared. So this kind of gives you an overall dashboard that can used to show different perspectives off how the inbound tourism in the world can be compared among different regions. Hopefully, this was helpful. 41. Line Charts: line charts are like time. Siri's graph there used to visualize data trends over an interval of time. The data points on the chart are connected through a straight line, sharing numeric values plotted as lines or date related fields. The difference between the bar chart on the line chart is that the line chart needs to include Time dimension in it. We will also be using stacked area chart, which can also show time series data. Its main purpose is to show individual parts and how they make up the whole or the total. Let's see these charts in action. Using Top low, we will be connecting to the world indicator data sets a click on Microsoft Excel and open the data set. Great, we got connected. Let's go to sheet one. All right, so our dimension and measures have been loaded. So what we need to calculate is the total CEO to a mission or the Peter of Time and show which region is showing a spike in the year to mission and where we have a dip. Let's have the Dimension Year the book Click on the Dimension at the Measure CO two emission double click on that as well, and then double click on the region as well. So if you see we have a tabulated form of data here is really hard to see a dip or a peak on the sea or two based on the region. So we really need to get a visualization on this data to make some sense quickly. So you click on the Show me drop down and pick the line chart. So this line chart really becomes an impactful communication tool where you can see Isha by far from 2000 to has a peak, meaning the C 02 emissions have been going up quite significantly from 2000 to and own words for the region of Asia. Compared to the other regions, let's create a new worksheet for are stacked area chart. Click on the plus sign will pick the same dimension and majors double click on year double click on See or to Mission and then double click on the region. Click on the show me drop down and pick the area chart. Great. So the area chart is showing the total summed up see or to mission in the world, and it's doing it by stacking each region on top of each other. This chart becomes helpful to see the bigger picture. Now let's try to put these two charts on our dashboard by clicking the plus Aiken. We can just drag and drop these two sheets on the pen. Great. You can change the size off the dashboard based on your own need. Well, just expand it so it's fills up the screen. Let's try to rename these two worksheets. Do the same for the next sheet. Had a title. Great. Now we have our dashboard ready to be presented. 42. Histogram: history. Graham shows the frequency of distribution off your continues data. It divides the data set into clear groups called bins. The frequency off the Ben data becomes the measure off the history Graham and gets calculated automatically history. Graham uses one dimension that is in America, which gets split into Ben's history. Grams can also be useful to visualize the outliers in your data. Set in our past visualization with bar charts, the columns were showing categorical variable in history. Graham, the columns show quantitative variable. If you look at the anatomy off the history Graham on the X axes, you have fixed size bins where your numerical data gets grouped. There no gaps between the bars. As Baylies continues on the Y axis, we have frequency off the been data, and the height of the bar is showing how maney number of records you have per bid. So it shows the magnitude we will be creating to history Grams one for sales data and the other for profit from our superstore data set. Let's see these in practice. Let's connect to the superstore data set. Bring the orders to the pan. Great. So we have the data. Let's go to sheet one. We have to look at thesis Ailes measure and find the frequency off the distribution for this data set. So what we need to do, we need to create bins off the sales data. We need to figure out what size of the bins do we need for this exercise? I'm gonna pick the bin size as 10. Click on the sales measure, click on the drop down. There's a small icon on the right side, click on the create, and then click on Vince. There's a default. Been size. You can go ahead or write it the moment you created bin for the sales data. The Bend data goes into the dimension. We need to change the type of this data to be continues. You can click on the drop down and click and work to continues. Double click on the sales. Been great. So you got that on the column On the real we need to actually count the number of records we have for the sales data. So we'll just write count bracket Number of records closed bracket just hit. Enter. Great. So we got our history. Graham, what we need to do is we need to filter out some of the sales data. You can do this by dragging the sales to the filter pan, and you'll get this window. Click all values. You can reduce the maximum value to about 600. You can either type in the window or just drag this lighter hit. Apply. Great. So now we can see our history, Graham. And as you can see, that majority of the concentration off the values is between 0 to 50. If you look at the tallest Ben, that has the maximum number of records in it, so what we can do is we can label it. Just do a right click mark label all results so it shows you the number of records we have in that been great. Let's create a new worksheet. Now we'll create the history. Graham, with the prophet measure well again, pick the bin size for profit. I will pick 100 as a been size. Now you need to change their Ben into a continuous very well double click on it. So it shows on the column and on the road you need to do account number of record. Great. So we got our prophet history. Graham. What we need to do is we need to filter the profit between negative 1002 positive 1000. You can do this by moving the prophet to the filter pan and you'll get a window open. Click all values now. You can give a range between negative 1000 to positive 1000 you can type it in the small window it apply. Now we can see the majority of the transactions are between 0 $200 and you can label the bar with the highest number of records. As you move to the right and the left off the highest bar, you'll see that the number of records in those bins go down quite a bit. Let's try to add these to history grams into our dashboard by clicking the plus sign. You can just double, click or just drop those two sheets into the pan. You can rename both the worksheets to a more appropriate label, and whenever you are showing a history, Graham always try to mention the bin size on your diagram. We can rename sheet you as well by right clicking and added title. You can adjust the size, cough your dashboard as well. Great. There you have your dashboard, which are selected history. Grams 43. Scatter Plot: scatter plots show relationship between two measures one plotted on the X axis and the other on the Y Axis Scatter Plot scan showed a large number of data points and is suited to drive actionable insight. Unlike bar graphs, which uses at length as a visual clue, scatter plaice Youth position X and Y coordinate. Scatter plots can help show correlation between variables positive and negative and can depict clusters within your data points. If you look at the anatomy of the scatter, plot is basically made up often X and Y coordinate, and both axes have measures listed on them. This type of realization helps you capture a lot of data points in a very efficient manner . In our lab will be creating a scatter plot between profit and sales data off the superstore data set. Also identifying clusters within this data set by color coding it It will be fun. This light shows two plus one, showing a positive correlation between birthdate and infant mortality, meaning if one waited, what goes up the other goes as well, hence positive. The other shows negative correlation between life expectancy of females to birth rate. As one goes up, the other comes down, hence negative. The correlation Scatter plots will be added at a later time, and you can do this on your own and see if you get the desired results. For now, we'll go to the lab and create a scatter plot for our superstore data set. Let's connect to the data set superstore. Click on the orders Time click on the new worksheet. All right, so we have to get to measures, profit and sales on the pan. Double click on those two. So now you see that we have an aggregated point where all the profit and sales have been aggregated. We need to uncheck this aggregation by going to analysis and un checking the aggregate majors selection. Great. So now we have a spread off all our profit and sales data in a scatter plot. As you can see, we're able to show ah, lot of data points. Our end goal is to find the clusters off customers we have in our data set. We contract the customer name through the details marks card. Now what we need to do is go to the analysis time and you can see in the model you have the cluster option. Distract that and drop it on the pan. Great, you can see tableau based on its came ing algorithm, has clustered our data set into three distinctive clusters. Cluster 12 and three, and they are color coded for distinction. Our cluster one. As you can see, we have a lot of customers there, but the profit is minimal and cluster to you can see we have customers where we're making a lot of profit much better than cluster one. So that's definitely one cluster you want to focus on for your sales cluster threes where we have very few customers. But on each customer we're making a much larger profit margin. We can also move the cluster from the mark's car to the dimension, and what we'll do is it will create a customer group that could be used later on another regionalisation. Let's create a new worksheet. Let's bring in latitude and longitude and the postal code for the customers. This will show us where all our customers in in North America now what we need to do is we need to track the customer group that we made from the clustering into the filter and select which cluster we want to see on our map. Now you can see we can select clusters and the map is gonna get updated based on the selection off the cluster we pick. So you want a target marketing based on a certain cluster, we can see where they are. You contract the customer group to the colors. Marks card. Looking at these clusters on a map really gives a powerful visualization for us to do our targeted marketing and seeing which cluster has a better impact and which location needs to be focused. More. Let's add our scatter plot to the dashboard. Let's rename it. 44. PIE Chart: people just either love pie charts to just hate them. I'm not to phone of them. Pie charts are good to use when you have few categories in your data set. They help ensuring how one category is related to the other in proportion. I would say 3 to 6 categories would be appropriate. Pie charts can either show nominal or orginal data. If you look at the anatomy of the pie chart, each category has a certain percentage that adds up 200%. You can either call each portion or category of the pie as slices or veg is. If you showing more than six categories than the pie chart will become more crowded. It'll be harder to convey the message, so please keep the categories under six. We will be creating a pie chart off the profits from our superstore data set per region. This visualization is showing US sales data per category off the products within each state of USA on a map. You can try to create this visualization on your own and see if you get their desired results. Let's try to create a pie chart in top low. Let's connect to our superstore data set. Select orders data. Click on sheet one. Double click profit from measures and Double Click Region from Dimension. Go to Show Me and click on the pie chart. Let's try to fit this on the screen. Great. So now what we need to do is we need to label each portion or veg bring the region and profit to the label marks pan. Great. We have our labelling done. I was trying to label the sheet as well. Let's increase the size of the phone and make it bold. We'll do the same for the region and the profit. Format it and increase the want. Emboldened as well. Let's add it to the dashboard. Let's adjust the size. We will remove the total profit. Great. We have our pie chart. Regionalization done.