Data Engineering : How to Structure, Load and Present Information | Henk Van Zyl | Skillshare


Data Engineering : How to Structure, Load and Present Information



Lessons in This Class

12 Lessons (1h 4m)
    • 1. Intro

    • 2. Resource Requirements

    • 3. Taking a Look at Our Data

    • 4. Creating Information Packages

    • 5. Creating a Star Schema

    • 6. Creating the Database Structure

    • 7. Loading Our Temp Tables

    • 8. Populating the Dimension Tables

    • 9. Populating the FACT Table

    • 10. Creating a View of Our Data

    • 11. Loading Data to Power BI

    • 12. Creating a Dashboard (Course Project)






About This Class

This course will teach you data engineering from start to finish. It covers everything you need to know, from the physical design of our data structures through to implementation, with step-by-step instructions to follow along.

We will take unformatted source data and convert it into usable and valuable information. This process entails everything you need to know: data design, ETL, information access, data analysis, and lastly information presentation in the form of business intelligence.

This course includes data files that you can use to follow along in the step-by-step tutorials, ensuring that you have all the information you need at your disposal to make this course a breeze.

We will be looking at the processes that I have found to work best in a real-world environment. These are methods that can be employed in any business, and the teachings of this course will provide you with the information required to start your journey in data engineering.

We will be looking at the following main topics in this course:

  • Data Design

  • Data Structuring

  • Dimensional Modeling

  • UML (Unified Modeling Language)

  • ERDs (Entity Relationship Diagrams)

  • SQL (Structured Query Language)

  • Power BI

  • etc.

I look forward to seeing you in the course and sharing this learning experience with you!

Meet Your Teacher


Henk Van Zyl


My name is Henk van Zyl.

I am an experienced Business Analyst with work experience in the E-commerce, Insurance and Banking sectors.

My working career has always centered on data analysis and data science, and I have completed numerous projects to provide organizations with business intelligence.

The data science field is fascinating with an abundance of opportunities and I am excited to share my knowledge with you!





1. Intro: Welcome to the course on Data Engineering from design to implementation. I'm excited to teach you all you need to know, from starting off with unstructured information to getting this data into an optimal state that we can utilize in our database. The idea of this course is to provide you with all the information you need to implement a data design from start to finish. That is why we will be using step-by-step examples with resources provided, so that you can follow along through the data engineering process. The course takes a step back to first analyze our data and get a deeper understanding of what we're aiming to achieve and the process that we will implement. We'll be going through the entire process from data design to implementation, and actually showing this information on dashboards, teaching you the technical skills needed to conduct these processes by yourself. The course is already expansive in its content, but I will also be updating it based on feedback from students to ensure that we cover all topics where there is demand. I'm very excited to teach you everything there is to know about data engineering and can't wait to hear from you in the course.

2. Resource Requirements: For the next section, there are two course resources that we need to download in order to get started. The two files attached that we'll need are the sales data CSV and the sales rep data CSV. Please download these files, as we will be using them in the exercises to follow.

3. Taking a Look at Our Data: Okay, so for the next section, we're going to take some raw data. We're going to design this data, put it into a database structure, and actually go through the process of loading this information into our database in the correct structure and, at the end of the day, report on this information. So the first thing we need to do is to look at our data. So firstly, we've got sales data. So what can we find in this CSV file?
We can see that we have a customer ID, name and surname, a product ID and name, a product price, the quantity purchased, the store ID and store name, the region, product main category and subcategory, the purchase date, and a sales rep ID. So what we can immediately do is see that there are a few dimensions within this data. We've got almost an object-oriented kind of view: we have a customer object, we've got products, we've got stores, our regions, categories, and then the purchase date and a sales rep ID. For the sales rep, we've got the IDs here, but we don't actually have the information, and that's where the second file comes in. We can see that we have sales rep data here, with the IDs and the sales rep details. And what we'll need to do is actually join these two and get the information into one place within our database. I purposely split these into two files so that we go through that exercise of consolidating separate files into one data structure. Most times when working with CSV files that we get from organizations, you'll see that it's a bit scattered: each CSV or Excel report that you typically receive will have information based off of one focal point of the business. It's your job to go and make sense of this information and put it into a database structure, so we can get an overall view of all the information that we've captured. So that's exactly the process that we are going to go through now. I'll see you in the next lecture, where we will start to design our data.

4. Creating Information Packages: So we had a brief look at our data and we've established what we're working with. The next step we need to take is to create information packages and start designing our data and the structure of our database. Looking back at our data, we basically have two files: our sales data and our reps data, which links back to the sales file on our rep ID.
As we've seen previously, we have the customer ID, name and surname, the product ID and name, the product price, the quantity sold, the store ID and store name, the region, the product main category and subcategory, the purchase date, and a sales rep ID that links back to our sales rep table with our sales rep details. So before viewing this video further, just maybe pause here and try to do the following. I want you to see if you can create our information package from the data provided, based on what we've learned previously. The title of the information package will be Product Sales. Try and identify all the dimensions and the attributes, and also identify all the facts in our data. Take a second here, pause the video, and see if you can create the information package based off the knowledge that you've gained in the previous lessons. So here's our solution. We have our product sales information package and we have the following dimensions: customers, products, stores, regions, product main category, product subcategory, and the sales rep. For facts, we have the quantity and the price. One thing I want to mention here is that we are trying to keep our data as flat as possible. So in the information package that you designed, if you, for example, wanted to put the product main category and subcategory under the product, there's nothing wrong with that. That's completely fine. What we would do then is basically snowflake that dimension to include this. But for our exercise now, we're actually going to split those into separate dimension tables to keep our data even more flat, with the end goal of having faster query processing, obtaining that data fast, and keeping it a bit easier to maintain. Please let me know what you guys think. For this course, we're only going to do it in this structure; if there is a need to also cover snowflake schemas and try those out, please let me know and I'll include that in future lectures. 5.
Creating a Star Schema: Okay, so we've created our information packages, and the next thing we need to do is create our star schema design. So just to review the information package that we have created: we had a product sales information package that is our focus point. We have the dimensions of customers, products, stores, regions, the product main category and subcategory, and the sales representative. We've identified the facts of the quantity of sales and the price of the sale. So once again, before we get to the solution, I want you to see if you can create a star schema based on the information package that we have created. You need to remember to use the system keys for each table in the design, as we've discussed in previous lectures. You also need to create all the dimensions with their attributes and include the data types for each of these attributes; for example, whether it's an integer, text, or numeric value and so forth. You also need to create the fact table with all the metrics that we will measure, and link the dimension tables to our fact table. Please take a moment to create a star schema as an exercise. It's good to get into the process of thinking about these things and creating your own designs. And in the next slide we'll be looking at the solution. So here's the solution for our star schema design based off of our information package. As you can see, we have the fact table in the middle, called our fact sales table, and we have all our dimensions surrounding the fact table, linked through foreign and primary keys to our main table. So from the top left, we have our customers; we have the customer key that is of type serial. One thing that's important to note: if you, for example, specified an integer value, that's completely fine. I specified a serial value that we'll be using in the Postgres scripts going forward. What it means is that Postgres will auto-increment the value, and it will never be the same value.
Again, it's just a bit of an easier way to manage these kinds of things, as Postgres will automatically increment this value for us. So we have our customer ID and name; for the products we have the same, the ID and name; stores once again, and so forth. We've basically gone through our information package and listed all our dimensions and all the attributes that we have identified. For all the text values, we use the text datatype, and for all integer values, that is all whole numbers, we've used the integer datatype. The only place we've used the numeric datatype is on the price of the fact sales table. The reason being is that we might have decimal numbers in our prices, and it's good to have a numeric value there, because if we had used integer values, we wouldn't have been able to capture the decimal values as well. The last remark I want to make on this design is that in the fact sales table, we can see that the primary key is not only our fact key, but also all the foreign keys that we've referenced from all our dimension tables. This brings us back to the idea we talked about in one of the previous lectures: a concatenated primary key. We'll be using all of the keys from our dimension tables, as well as the fact key, to make one large concatenated primary key. This enforces referential integrity within our data set. Basically, we are saying we can't have a sale go through without all of the dimensions being present, and we're enforcing this rule by making all of our dimension foreign keys part of the primary key within our fact table. So that does it for the design phase of our exercise. The next thing we'll do is actually go into Postgres, start creating scripts, create all the dimensions and the fact table, and load our CSV data into our database. 6. Creating the Database Structure: Okay, so we've reached the point where we actually want to import our data into Postgres. The first thing that we will need to do is open our IDE.
And as stated in this course, we will be using our database IDE, which you can just open up, and this is the interface we'll be looking at. Let's start off by just creating our new database. You can do this by right-clicking here, going to create and clicking on database, and we'll just call this database product sales. Okay, once the database is created, just set it as the default database, as we will be working with this. And we can see that our current script has updated to show that we're working on the product sales DB. So what we can do now is look back at our star schema design; this is basically the blueprint that we will be working from to create our database. We'll start with our dimension tables and then our fact table at the end. So let's start off with our customer table. You can start off just by writing create table customers; that is the first table that we will be working with. And we can specify the customer key as our primary key for this table. As I've stated before, we have used the datatype serial for all our system-generated keys within the tables in this database. Once again, all serial is going to do is auto-increment our primary key, so we don't have to worry about that and we don't have to keep track of it either. So you can just continue the table by writing the customer ID as type integer and the customer name as type text. One thing I want to show here is that while we create these create scripts, in front of each table we can just write a drop statement. This will just make it a bit easier as you're working through the data. So maybe you make a mistake; then you can just have all your drop statements at the beginning of each table, so you basically delete the table and create it again. So basically we specify drop table if exists customers.
So we're just saying if this table exists, drop it from our database; if not, just continue. It's really important to write the if exists part, because if you haven't created this table yet, you are going to get errors saying the table doesn't exist. It's just a neat trick to make sure that you don't have any issues while creating these scripts. Another thing to mention is that when we deploy these scripts into our production environment, just remove all the drop statements; that's not really something you want to do on a production database. You just want your create statements there, and it's a bit more safe and secure. But while we're creating this, it's just easier to get through everything. So when you run your statement, you can see that it executed, and we can just select everything from our table to make sure that everything is there. As you can see, our customer table has been created and we've selected everything from the table. Obviously there's no data yet, but we can see that the table structure is created and is what we are looking for. Another thing I want to mention: you can see here I have a capital letter over here at customers, another one in the ID here, and once again at customers; that's all done on purpose. I just want to show you guys that with capitalization in Postgres, when you write it like that, it's not actually going to save it like that in your database; it's going to lowercase everything. If you want to keep that kind of naming convention, you can put it in double quotes, but make sure then that you reference it exactly the same in all of your scripts going forward. So as an exercise, this is a pretty straightforward process; it's basically just going through each of these dimension tables and adding them. I want you to pause the video here and start doing all your dimension tables.
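Putting the drop-and-create pattern from this lesson together, the customers script might look something like the following. This is a sketch rather than the instructor's exact script: the table and column names are my own rendering of what the transcript describes.

```sql
-- Recreate the customers dimension from scratch (dev convenience only;
-- remove the DROP before deploying to production, as noted above).
DROP TABLE IF EXISTS customers;

CREATE TABLE customers (
    customer_key  serial PRIMARY KEY,  -- system-generated, auto-incrementing
    customer_id   integer,             -- ID as it appears in the source CSV
    customer_name text
);

-- Sanity check: the table exists but holds no rows yet.
SELECT * FROM customers;
```

Note the lowercase identifiers: as mentioned in the lesson, Postgres folds unquoted names to lowercase anyway.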
And once you are done, we'll review it again and I'll show you what the result is, and afterwards we'll start with our fact table. Okay, so we've created all our dimension tables. I'll just run it over here to make sure everything's in order. Basically we did exactly what we did with the customer table: we created each table based off of our star schema design and inputted all the attributes with the datatypes as we identified them. The one thing I just want to mention, which slipped my mind when discussing the star schema, is that for the sales rep table we've actually indicated a sales rep name and a sales rep surname. If you go back to your data, you will see that we only have sales rep details. This means that we have the sales rep name and surname put together into a single field in our data, so we actually need to split this information into a name and a surname. The reason why we split this in our database design, because there's nothing really stopping us from just inputting the details as is, is that it's a bit easier to maintain. Maybe we just want to see the names or the surnames, and it might otherwise be a bit hard to query that information from our database. So that's something we will actually resolve in our ETL process: split the name from the surname and have a more standardized table that's easier to maintain. Another thing I want to mention: I see that the serial datatype in my IDE is not highlighting, which is very weird. I'm not sure why this is happening; it should actually always highlight, but it's definitely a data type. As you can see, when you run your script it executes; that means it has accepted the datatype serial. So don't worry about that.
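The remaining dimension tables follow the same pattern. A sketch, with table and column names assumed from the star schema described in the transcript (the actual course script may name them differently):

```sql
DROP TABLE IF EXISTS products;
CREATE TABLE products (
    product_key  serial PRIMARY KEY,
    product_id   integer,
    product_name text
);

DROP TABLE IF EXISTS stores;
CREATE TABLE stores (
    store_key  serial PRIMARY KEY,
    store_id   integer,
    store_name text
);

DROP TABLE IF EXISTS regions;
CREATE TABLE regions (
    region_key  serial PRIMARY KEY,
    region_name text
);

DROP TABLE IF EXISTS product_main_categories;
CREATE TABLE product_main_categories (
    main_category_key  serial PRIMARY KEY,
    main_category_name text
);

DROP TABLE IF EXISTS product_subcategories;
CREATE TABLE product_subcategories (
    subcategory_key  serial PRIMARY KEY,
    subcategory_name text
);

-- Name and surname are separate fields, to be split during ETL as discussed.
DROP TABLE IF EXISTS sales_reps;
CREATE TABLE sales_reps (
    sales_rep_key     serial PRIMARY KEY,
    sales_rep_id      integer,
    sales_rep_name    text,
    sales_rep_surname text
);
```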
So once we've created all our dimensions, you should have a customer, products, stores, regions, product main category, product subcategory, and a sales rep dimension created within your database. The next thing we need to do is actually create our fact table, and we're going to do this exactly the same as with our dimensions, following our star schema. Looking back at our schema design, we can see that we have a fact key; that is our primary key, of type serial. Then we have all the other primary keys from our dimension tables, with which we will create a concatenated primary key. We can see that these data types are integer. This is fine, because we don't want these to also auto-increment within our fact table; we just want to get the auto-incremented value from our dimensions and use it as an integer value to reference back from our fact table. So going back to the view, we can start creating our fact table. As with our dimensions, we write a drop statement just to make it a bit easier. But an interesting thing we need to remember now is that our fact table is going to be reliant on our dimension tables, so our drop statement actually needs to be at the beginning of our script. Because if you try to drop, for example, the customers table, we are going to get an error telling us that we can't drop the customers table, because the fact table is reliant on it. So we need to drop our fact table first. At the beginning of our script we can write drop table if exists and our fact table name, which is fact sales. Just to stress it again: we don't actually need to do the drop statements. I just think it's a bit easier to follow along, because there's a chance that you might make a mistake, and it's a bit easier to iterate than having to do things separately. So we start off the same as with our dimensions: we want to create a table fact sales, and we'll start off with its primary key as well.
That is our fact key. Once again, the fact key is going to be of type serial; we want this key to auto-increment within the table. You might have noticed I haven't specified that this is a primary key from the beginning, and there's a reason for that: when we get to our concatenated primary key, that is actually where we will specify what the primary key is. So for now we can just list all the primary key values as we've identified them in our star schema design. Okay, once you've listed all your primary key values, it should look something like this. We have our fact key as a serial and each of our dimension keys listed underneath as type integer. And this is where we will now specify our primary key for the fact table. You can start off by writing primary key with open and closed brackets, and within these brackets we're going to list all of the keys that make up the primary key for the fact table. We start off with our fact key and just separate each of them with a comma, moving over to the customer key, product key, and so forth. Once you have indicated all the primary keys within your fact table, it should look something like this. The next thing we need to do is identify where we are getting these primary keys from, and that is going to be by indicating foreign keys. The way we specify foreign keys is by writing down foreign key and then referencing our internal key. So we're basically saying that our customer key within this table is a foreign key. And where are we getting this value from? We're saying that our customer key references our customers table's customer key. So basically we have the concept of an inner and an outer value. Our inner value is customer key, and I know it seems a bit odd because we've named them the same, but it's easier to keep to standard naming conventions. Basically we're saying that our internal customer key references our external customer key that we're getting from the customers table.
So the next thing you need to do is, for each of our primary keys in the fact table, go and list the foreign keys and the table we are getting them from. When you've listed all of your foreign keys, it should look something like this, where you have a foreign key reference for each of the primary keys within our fact table, and we've specified where those foreign keys are being accessed from by indicating the table names over here. When you've reached this point, you can actually go and create your fact table. And if we select everything from the fact table, we will see that what we've created is basically just a linkage. We'll add our two facts now, and that's very simple, but we can at least see that we've incorporated our foreign key constraints correctly; we are now actually referencing the keys outside our fact table. So the last thing we need to do is just add the facts that we will be measuring, and then we're done with our database creation. Okay, so we've just added the quantity and price facts to our fact table. Just to elaborate on these two: we specify the quantity as integer because you can't buy a decimal quantity of a product. You can't go and say, I want to buy half a chocolate; when you go to the shopping mall, you have to buy a full quantity of that product. For price, we're saying it's numeric because you can have decimal values in prices; the product might be priced at $1.40, so you know you're going to need decimal values. Another thing about the price is that it's rounded to two decimal points. So for price, we've indicated that this is a numeric value, and we've made the range of the price 21 and 2. It's a bit of a big value, but basically what we're saying is that we are catering for 21 digits in our numeric value, and it can go up to two decimal points. If you're very concerned about optimizing a database, you can be a bit more specific on how large this numeric field should be.
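Before moving on, the fact table described in this lesson can be sketched as follows, under the assumption that the dimension tables use the key names from the earlier sketches (the transcript does not show the exact script, so the names are illustrative):

```sql
-- Drop the fact table first: it depends on the dimension tables.
DROP TABLE IF EXISTS fact_sales;

CREATE TABLE fact_sales (
    fact_key          serial,
    customer_key      integer,
    product_key       integer,
    store_key         integer,
    region_key        integer,
    main_category_key integer,
    subcategory_key   integer,
    sales_rep_key     integer,
    quantity          integer,        -- whole units only
    price             numeric(21, 2), -- up to 21 digits, 2 after the decimal
    -- Concatenated primary key: every dimension must be present for a sale.
    PRIMARY KEY (fact_key, customer_key, product_key, store_key, region_key,
                 main_category_key, subcategory_key, sales_rep_key),
    FOREIGN KEY (customer_key)      REFERENCES customers (customer_key),
    FOREIGN KEY (product_key)       REFERENCES products (product_key),
    FOREIGN KEY (store_key)         REFERENCES stores (store_key),
    FOREIGN KEY (region_key)        REFERENCES regions (region_key),
    FOREIGN KEY (main_category_key) REFERENCES product_main_categories (main_category_key),
    FOREIGN KEY (subcategory_key)   REFERENCES product_subcategories (subcategory_key),
    FOREIGN KEY (sales_rep_key)     REFERENCES sales_reps (sales_rep_key)
);
```

In Postgres, numeric(21, 2) means a precision of 21 total digits with a scale of 2 decimal places, matching the "21 and 2" described in the lesson.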
We should be able to get away with only five or six digits at most, but just for simplicity's sake, we're saying this is 21 digits long with two decimal points. And as you can see, once we've selected everything from the fact table, we now see a blank table with all our reference keys and the two metrics that we will be measuring. So the next thing we need to do is actually go and populate our dimension tables and our fact table.

7. Loading Our Temp Tables: Okay, just before we start populating our dimension and fact tables, I want to show you guys the scripts I used to load our temp tables. This is something that we have covered previously, and I want everybody to be in sync, using the same table names and making sure that we get this right. Basically what I've done is I've created a temp table for our sales and another temp table for our reps, and I've used the same sequence and naming convention as the columns listed in the CSVs that we will be importing. Once you've created the two tables, you can write this statement, where we say we want to copy temp sales from, and you just need to specify the path where you saved your CSV file. As I've recommended previously, it's better to save it into the temp folder of your Postgres instance, and rename it to also indicate that this is a CSV, so there's no issue when you import this data. Once you're done, you can select everything from your temp sales and temp reps, and you should get results down here with all your sales rep and all our sales information. And just to keep things consistent, please use the same naming convention as I have for the temp tables, that's temp sales and temp reps, as we will need to reference these tables, and I don't want there to be any confusion when we start importing into our dimensions and fact table.
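The temp-table load described in this lesson might look roughly like this. The column list follows the sales CSV fields named in lesson 3, but the exact column names, file names, and the /tmp path are assumptions; adjust them to your own setup:

```sql
-- Staging tables mirroring the column order of the two CSVs.
DROP TABLE IF EXISTS temp_sales;
CREATE TABLE temp_sales (
    customer_id      integer,
    customer_name    text,
    customer_surname text,
    product_id       integer,
    product_name     text,
    product_price    numeric(21, 2),
    quantity         integer,
    store_id         integer,
    store_name       text,
    region_name      text,
    main_category    text,
    subcategory      text,
    purchase_date    date,
    sales_rep_id     integer
);

DROP TABLE IF EXISTS temp_reps;
CREATE TABLE temp_reps (
    sales_rep_id      integer,
    sales_rep_details text
);

-- Load the files; HEADER skips the column-name row in each CSV.
COPY temp_sales FROM '/tmp/sales_data.csv'     WITH (FORMAT csv, HEADER true);
COPY temp_reps  FROM '/tmp/sales_rep_data.csv' WITH (FORMAT csv, HEADER true);

SELECT * FROM temp_sales;
SELECT * FROM temp_reps;
```

COPY ... FROM reads files server-side, which is why the lesson recommends placing the CSVs in the Postgres instance's own temp folder.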
8. Populating the Dimension Tables: Okay, so we've done all the preparation needed to actually start importing our data into the structure that we have defined with our star schema. Once again, I think it's a good idea to keep the star schema open as a reference point, so you can visually see how we are going to load this information. Now everything is going to tie together; in essence, that is the point of creating these ERDs, so that we have a blueprint of what we need to do. Another thing: I've added the two select statements for our temp tables. I think it's always a good idea to see the data that you're working with. We'll start off with the customers table, and we know that this information is in the sales table. So we can just run this query here, and we can see the data that we will be working with. In our star schema, we've identified that we'll be using the customer ID, name, and surname. So let's start by writing an insert statement for our customers table. We start off by indicating that we want to insert into our customers table, and which fields we want to insert into, by looking at our star schema design and at our temp table data. We can see we want to insert a customer ID, the customer name, and the customer surname, so I specified these three fields here. But if we look back at our customer table, we can see that we also have a primary key. So let's just expand this section here, our customer table. Ah, it's actually good that this happened: don't panic if this doesn't open; all you need to do is refresh it, and you can see that all the information is populated again. So we want to go to our customers table, and we specifically want to see the customer key. As I've indicated, we're using the serial datatype; this means the database will automatically increment the values. So we never have to worry about this, and this is the system-generated primary key that we will be using.
So all we need to worry about is the customer ID, the name, and the surname. As I'm doing this, I see that the surname is actually missing from the customer table, and I figured this is a great time to show you guys how to use an alter statement to add that field to our customer table. So let's just put this on pause quickly. We can comment out any code by using this syntax: we do a forward slash and a star, and the same where we want our comment to stop. This section of code won't run anymore, and we can focus on altering the table and adding this customer surname field. We can alter our table and add a new column by writing the following statement: we say that we want to alter the table customers and add the column customer surname of type text. I've already run this, and as I've stated previously, it won't immediately reflect here, but you can just right-click on the customer table, say refresh, and we can see that we've added the customer surname column to our customers table. So going back to our original discussion, we will need to add these fields from our temp tables. We now have a customer surname as well, and we need to add these field values from our temp table. The best way I find to do this is, just below this, to write a select statement to get the information from the temp table that we need. And once we're happy with our select statement, we will then insert those values into our customer table. So what I've done here is I've said that I want to select the distinct values for customer ID, customer name, and customer surname from our temp sales. Just to clarify, distinct means I don't want any duplicates, because we don't want duplicates within our dimension tables. So basically what we've done here is we've created a unique list of all our customers within our temp sales table.
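The comment-out trick, the alter statement, and the preview-then-insert workflow from this step can be sketched in SQL like this (a sketch using the field names assumed earlier, not the instructor's exact script):

```sql
/*
INSERT INTO customers ...   -- anything between these markers is ignored
*/

-- Add the column that was missing from the original create script.
ALTER TABLE customers ADD COLUMN customer_surname text;

-- Preview first: a de-duplicated customer list from the staging table.
SELECT DISTINCT customer_id, customer_name, customer_surname
FROM temp_sales;

-- Once happy with the preview, load it. customer_key is omitted:
-- the serial column fills it in automatically, in order.
INSERT INTO customers (customer_id, customer_name, customer_surname)
SELECT DISTINCT customer_id, customer_name, customer_surname
FROM temp_sales;

SELECT * FROM customers;  -- verify the rows and the incrementing key
```

The column order in the INSERT list and the SELECT list must match, as stressed in the lesson.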
And just note here, I didn't run the insert statement; I only ran the select statement, to make sure that all my data is right. This is a good habit to get into: basically going through your data, making sure that you have what you need, and then, once you're happy, committing to it and finishing up the insert statement. So now what we can do is simply run this entire statement together. We're saying we want to insert into our customers table, and we specify the values, as you can see here, that we want to insert. It's very important that we keep the correct sequence when doing this: in my select statement I have the customer ID first, so we insert the customer ID first, and so forth. So you can just run this entire statement here, and you can see that we've added 499 values. And if we select everything from our customer table, we can see that the dimension table is now populated with the correct information, and that we have a serial primary key that has incremented correctly. As we add new customers, this serial primary key will keep incrementing. So that's basically the insert statement formula that we need for each of our tables. As an exercise, I want you guys to go through the rest of the tables and write the same insert statements as we've done here: specify the table name that you want to insert into, specify the columns where you will add your information, write a select statement that specifically looks at distinct values, and add all those values to your dimension tables. Okay, once you've added all the insert statements for the dimensions, it should look something like this. We have inserted our information into our customers, products, stores, regions, the product main category, and the product subcategory. And just to pause here for a moment, I want to show you something, with the subcategory as an example.
We can see that we have a system-generated key and a name for each of our subcategories, as we didn't have an ID for each subcategory in the source data. This is another reason why we like to use system-generated keys: otherwise we would only have the string values to use as our primary keys, and that assumes they are always unique, which might not be the case. For example, take an operational system where there might be some spelling errors or the like. We might have a product subcategory captured as "New Releases", but somebody might type into the system just "new release". Technically those are two different strings, even though they mean the same thing, so they can't both serve as the same primary key. With our own dimension tables we can aggregate or clean these over time; there are a few things we can do, but the point is these are two separate values that we'd want to treat as the same. So when we have our system-generated primary key, it gives us a bit more flexibility when dealing with cases like this. The next step we have to work on is our sales rep table. As we've seen, we have the sales rep details, but in our star schema and in our database we actually want the sales rep name and surname as separate fields. This means we need to split this information, so let's look at our sales rep temp table for now. We can see that we have the sales rep name, containing both a name and a surname, and we need to identify a way to split it. Looking at this information, quite an easy way to split it is at the space between the name and surname. This isn't a foolproof technique, because sometimes you get data like two first names or a middle name, and that might mess this up a bit. But just to show you how we can manipulate our data, we will go with the rule that for each of our reps, whatever comes after the space is the surname.
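As a small illustration of the point above, here is a hypothetical sketch of how two near-identical spellings would each get their own surrogate key, which keeps them mergeable later; the table and column names are made up for the example:

```sql
-- Hypothetical subcategory dimension with a surrogate SERIAL key.
CREATE TABLE product_subcategory (
    subcategory_key  SERIAL PRIMARY KEY,
    subcategory_name TEXT
);

-- Two spellings of the same real-world category: two rows, two keys.
-- If the string itself were the primary key, we could never record
-- both and then map them to one another during cleanup.
INSERT INTO product_subcategory (subcategory_name)
VALUES ('New Releases'), ('new release');
```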
And it doesn't mean that if there are middle names we can't do this at all; it just means our logic would need to change a bit. So after looking at our data, and that's always the first place we go back to, we can see that for our dataset, at least, each surname comes after a space, so we can use that logic in this instance. Let's start off again by writing an insert statement: we want to insert into our sales reps table the ID, the name and the surname. Now we need to start working on our select statement to get the information into this format. As a starting point, we can just say that we want the rep ID and name from our reps temp table, and now we need to start splitting this into our two fields, that is, name and surname. So what I've done here, and the first step we need to take, is to get the position of the space using a position function. Going back to our reps table, we basically want the position of the space so that we can calculate how long the name is and how long the surname is, because these aren't fixed lengths. So what we've done here: we still select the ID, because that already fits within our insert statement, but we now also get the position of the space within our rep names, and you can see that this position differs from name to name. With this position in hand, the next thing we can do is take a substring to get the section of the string we're after. Okay, so what we've done here is add our substring function, and I just want to talk you through it quickly. To explain: a substring means that we want a specific section of a text value, and we need to indicate the start position and how far to go within that text value. So what we're saying here is that we want a substring, a piece of our rep name field, and we want it to start at position 1, because we know the name is at the beginning.
So start at one and go all the way until you reach the position of the space in our rep name. If we execute this, we can now see that we have our rep ID and our first names, as we've basically split our string from the beginning to the first instance of a space. The next thing we need to do is get the surname of all our reps. And just as a bit of a mini challenge, I want to see if you can get the surname by yourselves. We'll actually need an extra function to do this, and the tip is that you need to use the length function. So maybe pause the video here and see if you can get the surname for each of our reps. What we've done is start once again with a substring of rep name, that is the field value in our temp reps table, but this time we want to start from the position of the space. We're starting there because we already have the first name, so we can forget everything before that point: from the space in rep name, all the way to the length of rep name. What length does is tell us the total number of characters in our text value. So what we're saying to the substring function is: start at the space, say that's position six, and take the substring from six to the end of the value, that is, its total length. If we execute this query, we'll see that we have our rep ID, our names and our surnames, and this is exactly what we need for our insert statement. So all we need to do now is run this entire section together like this. And let's also just select everything from our sales reps table to see what we've got. When we select everything from our sales reps table, we can see everything is exactly the way we want it.
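The name-splitting query built up in this lesson might look like the following PostgreSQL sketch; the temp table and column names are assumptions based on the course's naming:

```sql
INSERT INTO sales_reps (rep_id, rep_name, rep_surname)
SELECT
    rep_id,
    -- Everything before the first space is the first name
    -- (the - 1 drops the space itself).
    substring(rep_name FROM 1 FOR position(' ' IN rep_name) - 1),
    -- Everything after the space, up to the full length of the
    -- string, is the surname.
    substring(rep_name FROM position(' ' IN rep_name) + 1
              FOR length(rep_name))
FROM temp_reps;
```

As noted in the lesson, this rule assumes exactly one space per name; middle names or double first names would need extra logic.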
We have our system-generated key, the IDs, and we've split the details we got into a name and a surname field, which is a bit easier to maintain and also gives us a bit more flexibility. 9. Populating the FACT Table: Okay, so with our dimension tables now populated within our database, the next thing we need to do is load our fact table. If we go back to our star schema design, you can see the interesting thing about the fact table is that it has all these links with our dimension tables. This means that we firstly need to populate the fact table with each of these dimension table keys, and then add our quantity and price metrics afterwards. It also means that our fact table will be the table that references the temp table information, or in other words, the CSV file information that we have. So that's the outer reference we'll be using to populate this table and make the links between our dimensions and metrics. Back in our query tool, we can now go and write an insert statement for our fact table, and this is exactly the same as we did for the dimension tables. So once again, pause here for a minute and see if you can create the insert statement for the fact table; naturally, don't load any of the data yet, just base it on all the columns that we will need when loading this information. Another useful tip: because we have this view of our database tables, we can just expand the fact table to see all its columns. Once again, refresh this if it doesn't populate automatically. We can now see everything that we will need to populate in our fact table's insert statement; note that we're excluding the fact table's serial key, as this will be generated automatically by the database engine.
So once you've done that, your insert statement should look something like this: we are going to insert into our fact sales table the customer key, product key, store key, region key, product main category key, subcategory key, the sales rep key, and then, lastly, our two metrics, namely quantity and price. And, the same as with the dimensions, it's very easy to just start off with a select statement and make sure that we get everything right there before inserting this information into our fact table. Now, you might be wondering how we are going to reference our outside information, that is, the temp table, against our internal information that now sits within our dimensions. There are a few ways of doing this. One might be to just do a few joins, and that might actually be faster, but the way I find easy to explain, and also easy to read, is using subqueries. Just a bit of theory quickly while we're doing this practical: a subquery is basically a query within a query, if that makes sense. It might be a bit difficult to understand at first, but the best way to learn these things is to implement them and see how they work. So let's start off with a basic select statement. Once again, where will we be selecting from? Our outer data lies within our temp sales table, and that is where all the information for our fact table resides. So let's start writing a select statement to get everything from our temp sales table. If we run this query, we can see that this is all the information we want to get into our fact table. Naturally, some of the information is stored in our dimensions, like the customer name and so on. What we need to do now, focusing first on the customers, is get the customer key from our customer dimension table and link it with this information. This is where a subquery comes in. To start off, I'm just going to put a few placeholder values here just to get it running.
In a moment I'm going to remove this, and we'll start with a subquery here. The same as with a normal select statement, we write a select statement, but within these two brackets. An important thing to note with a subquery is that it can only return one result for each of our rows. Your subquery can't, for example, select everything from our dimension table or anything like that, as this will cause an error; we need to specify exactly what we want, and in this instance we want the customer key. So we'll start off our subquery by stating that we want the customer key from our customers table. And now, how do we reference that against our outer table? This is basically what the subquery is going to look like: we're saying that we want to select the customer key from the customers table, and I've aliased the table name so we can keep track of it; it's just an easier way to see where we are. So we want the customer key from our customers table where our customer ID equals the temp sales table's customer ID. What are we saying here? We are getting the customer key from our dimension that has already been loaded, and we're linking it with the customer ID that we have within our temp sales table. Basically, the reference point we're using is the ID, because we have an ID within our customers dimension and the same ID within our outer data in the CSV. So this is the linkage that we are going to be using in our fact sales table. If we run this, for example, you'll see that we're still selecting from the CSV data, but we don't want the CSV's customer columns in this case; we want the customer key that links to our CSV data. Just to do another example, let's move on to the next one, that is, our product information. For our products we do exactly the same thing, but stating that we want our product key, which is in another dimension table, and we link this where our dimension's product ID equals the CSV's product ID.
So once again, if we run all of this, we'll see that we have our product's internal key, which now references our CSV's outer information. This is basically the core strategy: we take our dimension tables, which are internal to the database and have the system-generated keys that we've specified, and link them with our CSV information. What I want you to try next, before going on with the video, is to do the same for each of these dimensions, and just stop when you get to the last two metrics. Those are actually the easiest ones, but let's first populate our fact table with all of our dimension keys, and I'll see you when we add our metrics. Okay, so I just want to pause here in case there was some confusion on any dimension tables that don't contain any IDs. When you get to the regions table, we can see that we only have a region name to work from. This is not an optimal situation, but sometimes we are faced with information that has no ID values and just gives us names, and this is exactly what I tried to capture with the data that has been provided. In this case we only had a region name; as you can see, there's no region ID within our data, so we actually need to link our information solely based off of the region name. Our select statement will look exactly the same: we select the region key, once again, from our dimension table, but we make the linkage based off of our dimension table's region name, which links with our outer table's region name column, which is called region. If you run this again, you'll see that we have the region keys based on our regions. This works as well because these are all unique regions, and once again we have the fail-safe of a system-generated key for instances where we need to manipulate the data on our side to make sense of it. Okay, so once you've added all the other dimensions, and I've actually gone and added the two metrics as well...
...we'll have this kind of structure. We will have created subqueries to select all our dimension keys and link them with our outer information in the CSV. We specify the quantity and price straight from the CSV, as you can see here, because there's no linkage needed; these are the actual metrics that we will be measuring in our fact table. So when we run our select statement here, we can see that we've built our fact table rows with all of the keys that link to our dimensions, as well as the two metrics that we will be measuring, everything coming from the temp sales CSV file that we've created. You can then finish this off by inserting this information into our fact table. If we then go and select everything from the fact table, we will see that it reflects everything we had in our select statement, and that basically does it for our fact table: we've inserted all the information here and we've got all of our connections. The next thing we need to do is see how we can get our information out of the fact table in a way that makes sense, because we have all this descriptive information, but at the moment we're basically just looking at numbers, the keys referencing our dimensions. In the next lesson we will see how we can create a view to get useful information out of our database structure. 10. Creating a View of Our Data: Okay, so we've done everything we need to do to load our data into our database, and everything's in perfect order now to start reporting on that information. As we've seen with our fact table, this is really useful information, and we've done all that work to get our dimensions in. The next thing we need to do is create a view where we can actually see all this information together, as with our CSV file. And as a bonus, this view will be used in the next section of this course, where we'll be reporting on this information using Power BI.
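Putting the whole fact load together, the statement might look like this sketch; every table and column name here is an assumption based on the course's naming, so adjust to your own schema:

```sql
INSERT INTO fact_sales (customer_key, product_key, store_key, region_key,
                        main_category_key, subcategory_key, rep_key,
                        quantity, price)
SELECT
    -- Each subquery returns exactly one key per temp-table row,
    -- linking the internal dimension to the outer CSV data.
    (SELECT c.customer_key FROM customers c
      WHERE c.customer_id = t.customer_id),
    (SELECT p.product_key  FROM products  p
      WHERE p.product_id  = t.product_id),
    (SELECT s.store_key    FROM stores    s
      WHERE s.store_id    = t.store_id),
    -- Regions have no ID in the source data, so we link on the name.
    (SELECT r.region_key   FROM regions   r
      WHERE r.region_name = t.region),
    (SELECT m.main_category_key FROM product_main_category m
      WHERE m.main_category_name = t.main_category),
    (SELECT sc.subcategory_key  FROM product_subcategory sc
      WHERE sc.subcategory_name = t.subcategory),
    (SELECT sr.rep_key FROM sales_reps sr
      WHERE sr.rep_id = t.rep_id),
    t.quantity,   -- the two metrics come straight from the CSV data
    t.price
FROM temp_sales t;
```

Running the SELECT portion alone first, as with the dimensions, is a cheap way to verify every key resolves before committing the insert.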
Let's start off by writing a select statement from our fact table and joining all our dimensions together to get a full view of the data at our disposal. What I've done here is duplicate the select statement, both from our fact sales table, the reason being that we're going to use one as a reference and the other one is actually going to become our complete select statement. We'll be using the top one as the new select statement we'll build on; I'm just going to indent it to keep everything neat. And we'll start with our fact table to get all the information. So firstly, let's select our fact key. Very straightforward: we just want to select the fact key, and I'm actually going to add the table name in front, as this keeps things a bit easier to read and you know exactly what we're doing. The reason I'm also doing this is because we are going to be using subqueries again, and it just makes things a bit more legible and easy to follow. One thing I've done in the select statement is indicate that this column should be named fact key, exactly in this manner, because that's the way we want it presented at the end of the day. If we run this, we can see that we've kept the name that we've given here and we've got all of our fact keys. So nothing special yet, but we'll start bringing in our dimensions now. The next thing we need to do is get our customer information. We can also just select the customer key and start joining our customer information onto this. So once again, as with everything in this course, we can do a mini exercise: see if you can write a subquery and select the customer information from our dimension tables. If you look at our customer table over here, we can see that we have the key that we want, but we also have the ID, name and surname that is not presently in our fact table, and that's exactly why we're creating this view.
See if you can go and select all the customer information from our dimension table and join that with our fact table through a subquery. Now, there's nothing wrong with actually using a join in these statements if you're comfortable with doing that, but I like using subqueries because they're a bit more flexible and you know exactly which column each value comes from. They can be a bit slower, and there are times we can't use them, but sometimes subqueries are necessary, specifically for views like this, so this is valuable to learn, and that's why we are going to be using subqueries in our view. Okay, so once you've written your subqueries, it should look something like this. We have now linked our customer table with the fact table: we use the customer keys within our dimension table and link them with the fact table's customer keys, and we're getting the customer ID, the customer name and the customer surname. If we run our select statement, we should see all of our customer information now linked with our fact table. You can see we have the fact key and customer key and all of our customer information. So basically, what we've done here is link the fact table with our dimension table, and this is actually the same information as within our CSV. We need to do the same for the remainder of the columns within our fact table: we still have our product, store, region, the categories and the sales reps to do before we're done creating our view. In order to add all of the dimension information within our database to the fact table, we'll be using the same method as we did for customers, writing the subqueries. See if you can also go and add the remainder of our dimensions to the fact table, to give us descriptive information around the quantities and prices of our products sold. Okay, so we've reached the point where we've added all of our dimension tables' descriptive information to the fact table, and if we run our select statement, we should get something like this.
So basically we have the fact key and all of the keys for all of our dimensions, along with the descriptive information. As we've done, of course, with the customer information: if you go down to our products, we have the product ID, and this is actually a spelling mistake, so let's just rename that to product name and run this again. We can see that we have the product ID, product name, store key, ID, name, region key and region name, and so forth. So basically, what we've done here is create a select statement that gets all of that descriptive information from our dimensions and joins it to our fact table. Once again, you can write joins down here; there's nothing wrong with that, and it might actually be faster, as I've stated. But this is a useful trick to learn, because sometimes you need to create temp tables or nest subqueries within one another when we get to really complex select statements. With that said, the next thing we need to do is actually create our view, and the nice thing is we've already specified what we want in it. We can just write at the top that we want to create a view, and we're going to call this view sales view, and this is how we want it created. So once again, we select all of this and run it. And I've missed an AS here; basically, what we're saying is that we want to create our sales view as the select statement that we've provided. So let's run all of it again, and we can see that we've created our view. Getting at that information is the same as selecting from a table, but we're going to select from the view this time. Once we do this, we can see that all of the information in our view is here as specified. So congratulations on getting to this point: this is basically all you need to know to get data from a raw format into a database and actually start reporting on information. And as I've stated, we are going to be using this view.
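The finished view definition might be sketched like this, with each subquery pulling a descriptive field for the matching key; the view, table and column names are assumptions, and only a few of the dimension columns are shown for brevity:

```sql
CREATE VIEW sales_view AS
SELECT
    f.fact_key AS fact_key,
    (SELECT c.customer_name FROM customers c
      WHERE c.customer_key = f.customer_key) AS customer_name,
    (SELECT c.customer_surname FROM customers c
      WHERE c.customer_key = f.customer_key) AS customer_surname,
    (SELECT p.product_name FROM products p
      WHERE p.product_key = f.product_key) AS product_name,
    (SELECT r.region_name FROM regions r
      WHERE r.region_key = f.region_key) AS region,
    -- ...the store, category and sales-rep columns follow the same shape...
    f.quantity,
    f.price
FROM fact_sales f;

-- Reading from the view works exactly like reading from a table:
SELECT * FROM sales_view;
```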
In the next section we will be adding this view to Power BI and getting some cool insights into the data that we have within our CSV file. Just as a side note: this view holds the same data as the CSV file we started with, so you might be asking what benefit we're getting from all this. Well, the benefit is that we've actually structured our data into these dimensions and fact tables, so we can write custom select statements, queries and views based on whatever we need. We can also enrich this data by adding our own information. Let's say, for example, we get a request that the regions of New Mexico and West Virginia should now be treated as one, and this grouping doesn't exist in the operational system. Because we've got the data in our own database, we can do this on our side. As we saw in the previous lecture with altering tables, we could, for example, alter the table and add a region aggregate column or something like that, and actually start aggregating these values into a more refined form. That's something we wouldn't be able to do in the operational system or its database; the idea of a data warehouse is exactly this, that we can enrich this data and report on it in ways that would be far more cumbersome with just CSV files or working in Excel. We also have the added bonus, as we'll see now, that we can view all this information in a dashboard. And not only can we bring the data onto a dashboard: with our information in a database, we could also do more complex things like machine learning and feed data into external processes. With all that said, this is a great place to be. We've gone through the entire process of understanding our data, formatting it, doing the full extraction, transformation and loading process, and now creating a view to see the information that we have captured. 11.
Loading Data to Power BI: Okay, so we've reached the exciting part, where we actually start reporting on the information that we've loaded into our database. When you open Power BI, you'll be greeted with this display, and the first thing we need to do is load our data into Power BI. We can do that by simply clicking on the Get Data button. From here you can scroll down and select the type of database that you are using; we are using Postgres, so you can make the PostgreSQL database selection and click it. Once we've made that selection, we need to specify the server where our database is located. For this course, we've loaded our database onto our local machine, so we'll be typing in localhost, and we also need to specify the database name; we used product sales as our database name. Once we've done this, we can click OK, and we will now need to enter our database credentials. The username, if you left it at the default, will just be postgres, and you can input your password and hit Connect. And here we have a view of our database. To keep things simple, since we created a view in the previous lessons, we will just be importing that into Power BI. So basically, the methodology that I employ is to use views created in the database, and only bring those into the BI platform to share information. This is a bit easier and keeps the logic on the database side, but it's up to you whether you'd like to keep that logic on your BI platform side or on the database side. For me, it makes more sense to create a specific view for a specific purpose and then import those views into Power BI to report on. So we will be making the selection of our sales view, and we'll get a preview of all our data here on the right-hand side. Once we've done that, we can just click on the Load button, and the data will be imported from our database into Power BI.
On the right-hand side we can see our Fields pane, and you can see that we've imported the public sales view, which is the view that we created in our database. We can expand this to view all of the columns within our view. And that basically does it for importing data into Power BI; it's a very simple process, we just need to specify the data source. We could, for example, just have imported Excel, but as I've explained in a previous lecture, there are a lot of benefits to rather having this information in a database, where we can actually contribute to that information and provide added value. In the next lesson, we'll start creating our own dashboard. 12. Creating a Dashboard (Course Project): Okay, so we've reached the point where we can actually start developing our dashboard. We've loaded all our data, and the next thing to do is start creating visualizations of the information within our view, to present as business intelligence. I want to have a bit of fun with this exercise, so I'm going to give you a bit of an overview of Power BI: how we can start using the fields that we've imported with visualizations and start reporting on that data. As we've seen, on the right-hand side we have our fields, where we've imported our view with all of its columns. Next to that we have the Visualizations pane; this is where we will create graphical representations of the fields within our view. The last thing I want to talk about here is the Filters pane. What we can do here is pre-filter the information on the dashboard depending on what we want to show the user. You can, for example, have a filter on your dashboard where a user can make specific selections, but the other option is to create predefined filters for each dashboard. If we look back at our data, we know that we have regions.
And as a pre-filter type of idea with regions, you could go and isolate specific regions on the dashboard and not give that option to the user. It's also important to note that we have the capability to filter on the whole page or on a specific visual that we've implemented. Looking at the ribbon at the top, we know that we can get our data from here, but another important thing is that you can also refresh the data from here. So if your database gains extra fields or new rows get added, you can just hit the Refresh button; as we haven't loaded anything extra now nothing will change, but it will refresh the data from our database and bring in the latest information at our disposal. Some nice things you can also do from here are creating measures, or you can do that from the Fields pane, and I'll actually show you a measure in this exercise; we'll get to that at a later stage. The last thing I want to mention here is that you can publish your dashboards from here. If you have a business or student account with the Power BI online portal, you can publish the dashboards that you make in the desktop view, and these can then be embedded into applications to show to your users. Another important thing to mention is that you can alter the data types of a field from this view. If we, for example, go to our region field, we can see that this is of type text, which is correct, and Power BI has interpreted it as such. You can also change this data type from this view: we can say that this is a whole number, a decimal number, a date/time, and so forth. It's just important to know that if some of the data hasn't been imported in the format you'd like, or you want to make additional changes on the dashboards, that option is available by selecting the specific field and then coming over here and making that data type selection. So let's start off by creating our first visual. Let's say we want to see sales per region: we can go and select our region.
Basically, what Power BI does is make an estimation of what type of visual you'd like to see, and it pre-populates it; in this instance it's given us a table view. Let's maybe stick with this. So let's say, once again, we want to see the quantity sold per region. In our fact table we have a quantity, so we can add this, and we can see the amount sold within each region. Rather than the table, this might be more applicable as a chart, so we can go to our Visualizations pane and make the pie chart selection. We can expand this and create a nice view on our dashboard, and we can see that we now have the regions and the quantity sold within our pie chart. You can also hover over this specific information, so we can see that Mississippi is the region where the most products were sold, making up around 29 percent of all products sold within our dataset. Another thing to note: the Visualizations pane shows us the fields that each visual requires. We can see that our legend is regions, which is what we split the chart up by, and our values field is the quantity; this is how we know what is specified for each one of these visuals. It's also important that we can format these visuals. Let's say, for example, I don't want this quantity-by-region title at the top; I can simply go to the title setting, deselect it, and it will no longer show. I can also go to the data labels: at the moment they're showing the data value and the percentage, but I might rather want to see the category and the value. With this, we can now easily see it's Mississippi with 2K quantity sold, and this is a rounded number; it's been rounded down from around 2,200 to 2K. To expand on this further, I might not want to see the legend anymore, because we're including that in the data labels, so I can deselect the legend, and now we have all the data labels and values within our pie chart. And there's no right or wrong here.
Basically, what you want to provide to the user, and the way the information can be communicated most effectively, is the optimal path; there's no right and wrong, only what makes sense in the current context. Another thing I want to show you is that Power BI provides us with drill-down capability. We know that we have our regions, but maybe we also want to view this information at store level. We can simply go to our stores and select the store name field that we have here. And what I see now is a great opportunity to show the refresh capability of Power BI: within the store view I see a numeric value, because in my view, where the name should have been the store name, I actually specified the store ID. Looking back at our database, I can see that the store name column is indeed pulling the store ID, which is a mistake from my side. I can easily change that to store name, and we can run this to recreate our view. Getting back to Power BI and clicking Refresh, we now see that store name has been updated successfully and is a name value. Let's take out the ID, which we can drag and drop over here, and we can see that we have the store names now. Going back to what I was getting to: we have drill-down capability within each of these visuals, so I can go and add store name below our region. What's happened now is that we've built a hierarchy within this visualization. Firstly we've got the region, which makes sense, and within each region we have a store. When I click the two arrows here, you can see that we drill down to store level and actually see all the store information with the quantity sold. Another important thing to note: clicking the standard drill-down goes across the entire dataset, so it's no longer looking at regions, it's looking at all the stores and all the quantities. But the cool thing about Power BI is that we can also drill down into specific parts of the hierarchy.
By clicking on this other arrow, we can say that we want to view the stores and quantities within Mississippi only. When I click on Mississippi, I can see that these three stores are within Mississippi and these are the quantities of products sold. That is the amazing thing about Power BI: it gives us the capability not only to drill down along the hierarchy, but also to drill into specific subsets of the information.

Another thing I want to show in Power BI: let's take the products and the product price. You can see in our dataset that we have the product names and the product price, but we don't actually have the total amount for each of those products. If we add quantity over here again, we can see that for this product a quantity of 141 was sold at this price, but we don't have the entire value of those sales. This is something we could have done in the database view by simply calculating quantity multiplied by price, but I wanted to show you that it is also possible from within Power BI.

The way we do that is to go to our table, click these three dots here, and say that we want to add a new column. We are adding a new calculated column onto our dataset that will hold the total price, so we can call it total price. Now we need to specify the calculation. The formula language that Power BI uses is called DAX, and it is quite extensive; it feels a lot like Excel-style calculations. Here we state that total price is the quantity multiplied by the price: we take our sales view's quantity column, add the multiplication operator (indicated with a star), then the sales view's price column, and insert it. When we go back to our visualization, we can see that a new total price column has been added.
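The calculated column is a simple row-by-row product, which can be sketched in pandas as follows. The prices and quantities here are invented for illustration; only the quantity-times-price logic comes from the lecture.

```python
import pandas as pd

# Hypothetical rows mirroring the fact data: product, unit price, quantity sold.
sales = pd.DataFrame({
    "product":  ["Bacardi", "Absolut"],
    "price":    [12.59, 9.99],
    "quantity": [41, 30],
})

# The calculated column: total price = quantity * price,
# the same row-level logic as the DAX calculated column in the lecture.
sales["total_price"] = sales["quantity"] * sales["price"]

print(sales)
```

In DAX the equivalent would be a calculated column defined as quantity multiplied by price on the same table; the pandas version just makes the arithmetic explicit.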
By adding that column to our data, we can now see the total price for each of those sales.

Next, on our Visualizations tab, if we click these three dots we can state that we want to get more visuals. If you have a business or student account, you can log into your Microsoft account, and this gives you access to a marketplace where custom visuals are published, so you can make a specific selection on a visual and what you want to show to the end user.

One more important thing I want to show is that we can also add filters onto the dashboard; these are called slicers. We can say that we want to filter this information based off our product name, so we add this slicer here, and we can see that we get a list of all our products. A list might not be the most effective view for selecting a product, so we can go to this drop-down here and say that we don't want a list format but a dropdown format instead, and then make our selection. For example, let's go to Bacardi: we want to see all the sales for Bacardi. Once again, we have filtered the pie chart, so it shows only the sales for Bacardi, and we can see that the region with the most sales is Mississippi, with a quantity of 41.

Let's say we want to see the total price as well. We can add this value onto our tooltips, and now, when hovering, we can see the quantity of 41 with a total price of 516. This also works with the drill-down capabilities: once again within Mississippi, we have these stores with their quantities, and hovering over them we can see the total prices.

The slicers can also be given a hierarchy. Let's say, for example, we want to see the products, but we also want to see which sales rep sold which products. We can add the sales rep name over here.
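Conceptually, a slicer is a row filter applied before the visual's aggregation. A minimal sketch with hypothetical data (the Bacardi figure of 41 in Mississippi is taken from the lecture; the other rows are invented):

```python
import pandas as pd

# Hypothetical data; a slicer filters rows before the visual aggregates them.
sales = pd.DataFrame({
    "product":  ["Bacardi", "Bacardi", "Absolut"],
    "region":   ["Mississippi", "Tennessee", "Mississippi"],
    "quantity": [41, 10, 25],
})

# "Slicing" on product name, then aggregating quantity per region,
# like the pie chart filtered to Bacardi in the lecture.
bacardi = sales[sales["product"] == "Bacardi"]
per_region = bacardi.groupby("region")["quantity"].sum()

print(per_region)
```

Adding a second slicer field (such as sales rep) just adds another filter condition before the groupby.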
Now we can go and say that, for the product Absolut, we only want to view the sales made by a specific sales rep. We can click on that, and we can see that these are that specific rep's sales for that specific product.

So I want you guys to play around in Power BI. There are a lot of possibilities, and there is no right or wrong, as I've stated: it is basically just about providing the information to the user in a way that it can be consumed most effectively. Let me know if you guys would want a more extensive overview of Power BI and what specifically you would like to see, and I'll definitely add it into future lectures. But that does it for this section. Thank you for joining me, and I'll see you in the next one.