Data Warehouse ETL Testing & Data Quality Management A-Z

Lorenz DS, BI & DWH Solutions Architect

19 Lessons (1h 52m)
    • 1. Introduction (5:11)
    • 2. What is ETL/ELT Testing and Data Quality Management? (4:48)
    • 3. Data Categories and Classifications (7:38)
    • 4. Data Quality Rules and Dimensions (6:32)
    • 5. Data Requirements and Test Design - Exercise (4:22)
    • 6. What are Database Views? (4:25)
    • 7. View Data Quality Completeness - Exercise (6:29)
    • 8. View Data Quality Uniqueness - Exercise (2:24)
    • 9. View Data Quality Validity - Exercise (4:19)
    • 10. View Data Quality Consistency - Exercise (4:43)
    • 11. View Data Quality Integrity - Exercise (4:02)
    • 12. View Data Profiling - Exercise (5:24)
    • 13. Create a Database Connection (4:02)
    • 14. Data Quality Completeness Dashboard - Exercise (6:31)
    • 15. Data Quality Uniqueness Dashboard - Exercise (6:53)
    • 16. Data Quality Validity Dashboard - Exercise (9:35)
    • 17. Data Quality Consistency Dashboard - Exercise (7:23)
    • 18. Data Quality Integrity Dashboard - Exercise (10:37)
    • 19. Data Profiling & Summary Dashboard - Exercise (6:34)
  • --
  • Beginner level
  • Intermediate level
  • Advanced level
  • All levels
  • Beg/Int level
  • Int/Adv level

Community Generated

The level is determined by a majority opinion of students who have reviewed this class. The teacher's recommendation is shown until at least 5 student responses are collected.

209 Students

About This Class

Learn the essentials of ETL Data Warehouse Testing and Data Quality Management through this step-by-step tutorial. This course takes you through the basics of ETL testing, frequently used Data Quality queries, reporting and monitoring. In this tutorial we will learn how to build database views for Data Quality monitoring and build Data Quality visualizations and reports!

Learn to build data quality dashboards from scratch!

Learn some of the most common mistakes made when performing ETL/ELT tests.

Forget about manual ad-hoc ETL testing and learn how to build automated ETL and data quality reports.

The course contains training materials, where you can practice, apply your knowledge and build an app from scratch. The training materials are provided in an Excel file that you can download to your computer.

Good luck, and I hope you enjoy the course.

Pre-requisites:

  • Basic knowledge of SQL

  • Some experience with visualization tools is helpful, but not required

  • A basic setup of a database (PostgreSQL or Oracle) and a visualization tool (Qlik Sense) is recommended

Course content:

The course consists of the following modules:

  • Introduction

  • What is ETL/ELT Testing and Data Quality Management?

  • Build database views for Data Quality Monitoring

  • Build dashboards for Reporting

  • Exercises

  • Final Quiz

Who should follow this course?

  • Students who want to learn the basics of ETL/ELT testing and Data Quality Management

  • Business Analysts and Data Analysts who would like to learn more about ETL/ELT testing, frequently used queries and practical examples

  • Software Engineers who would like to build an automated solution for ETL/ELT testing using database views and dashboards

  • Data Stewards and Managers considering applying data quality standards within their organization

Meet Your Teacher


Lorenz DS

BI & DWH Solutions Architect

Hi there! My name is Lorenz. I am a certified IT professional with several years of experience working as a consultant and functional analyst, mainly in the Business Intelligence domain.

Working experience:

I have worked for clients within the Energy and Financial sector, and have been involved in the roll-out of Business Intelligence and Data Warehouse projects and the implementation of BI solutions.

As an instructor, I want to share my knowledge and experience in BI and Data Visualization software. During these courses, I will explain the basic concepts, but also tips and tricks for creating Visualizations and Reports that best suit your purpose or the purpose of your end-user.


Transcripts

1. Introduction: Hello. My name is Lorenz. I am a business intelligence consultant and I will be your instructor throughout this data quality and ETL testing tutorial. So what is this tutorial all about? In this course we're going to look at data quality and ETL testing. We're going to start with a brief introduction to ETL testing: what ETL testing is, and what the difference is between ETL, which stands for extract, transform and load, and ELT, which stands for extract, load and transform. We're also going to look at what a data warehouse is and how ETL or ELT fits within a data warehouse, data lake or data repository environment. Next, we're going to look at data categories and data quality. Before we actually look at data quality, we're first going to look at the different categories of data that we can have: transactional data, master data, reference data, metadata and so on. After that we look into data quality rules and data quality dimensions. Data quality dimensions are the different dimensions we use to measure data quality within a data quality framework, and data quality rules help us measure that quality: based on our data quality dimensions we're going to create rules that measure our level of data quality for each of those dimensions. We have different dimensions for the validity of our data, the consistency of our data set, data lineage, the integrity of our data, and so on. Then we're going to look at how we can build database views to get an actual view of our data quality rules and monitor our data quality within a database view for each of those dimensions. We're going to look at integrity, at the completeness of our data set and at duplicates within our data set, and we're also going to build dashboards based on those database views. We're going to use a PostgreSQL database in this exercise, but you could also use an Oracle database, for example, and then we're going to use a visualization tool; in this case we're going to use Qlik Sense, but you could also use Tableau, Spotfire or MicroStrategy, as just about any visualization tool will do. Finally, we're going to end with a course summary and a final quiz, just to summarize everything and practice what you learned throughout this course. This is a very practical course, because we're going to look at how we can use SQL queries to build database views and test our ETL flows. We're going to use queries, for example, to test the completeness of our data set by counting the number of null records, and we're also going to build test queries based on the business rules and exercises within this tutorial. We're going to store these queries in database views; these are all examples of database views within PostgreSQL. In this example we're going to use PostgreSQL, which is an open-source database, but you could also use Oracle or Microsoft SQL Server, for example. After that, we're going to build data quality dashboards in Qlik Sense, which is a visualization tool, and report on our data quality requirements. So who should follow this course? Basically, everybody who wants to learn more about ETL or ELT testing and data quality.
But in particular, if you're a business analyst or a data analyst and you want to learn more about ETL testing, then you should definitely follow this course. It is also aimed at software engineers who want to automate part of the testing process by using database views and building dashboards, and at data stewards and managers who want to apply data quality standards within their organization and just want to get a general understanding of data quality and ETL testing. There are also some prerequisites if you want to follow this course: some basic knowledge of SQL would definitely be useful, and also some experience with visualization tools like Qlik Sense, Tableau, Power BI and so on. In this tutorial we're also going to use a database; we're going to use PostgreSQL, but you could also use Oracle, for example, and we're going to use Qlik Sense as a visualization tool. So to be able to follow the exercises in the course, it's recommended that you have a basic setup consisting of a database and a visualization tool. Of course, it's up to you which database and which visualization tool you would like to use; in this tutorial we're going to use PostgreSQL and Qlik Sense. So good luck and enjoy the rest of this course. If you have any questions, just let me know and send me a message or leave a message on the course landing page, and don't forget to give this course a rating. 2. What is ETL/ELT Testing and Data Quality Management?: Okay, so before we actually get started, let's first have a look at what exactly ETL or ELT testing is. Let's start by having a look at what a data warehouse is, because to understand ETL or ELT testing you should know what a data warehouse is and how it fits into the ETL process. Typically, when we talk about a data warehouse, we're talking about source systems which supply data to a staging area, which is basically a one-to-one copy of the source data that you receive from your different source systems. This data is then integrated into what we call a data warehouse, where we want to have one single version of the truth of our data. After this data is integrated into our data warehouse, we can start building data marts for each of our end users. A data mart is basically a subset of our data warehouse that we publish to certain end users, and these data marts also have certain restrictions: end user A, for example, will not see the data mart of end user B, and end user C will not see the data mart of end user A, so that end users don't have access to restricted data they shouldn't have access to. A data warehouse also provides some governance, security and monitoring of your data, which brings us further to data quality and master data management later in this course. To move our data from our source systems to, for example, our staging area, and from the staging area to the data warehouse and the data marts, we use a process called ETL or ELT. ETL stands for extract, transform and load, and basically extracts data from our source, transforms this data and loads it into our database. We can use basic SQL statements, or ETL tools that generate the SQL statements for us. So when we're talking about ETL, we're typically talking about traditional data warehouse environments where we first extract, then transform and load data into our target system.
When we're talking about ELT, we're typically talking about data lakes, where we want to keep the data in its original form. We first extract the data and then load the data without touching it, so we take the data as is and load it into our staging area and our data repository, and we only transform the data if needed within our data marts, which we publish to specific end users. The whole idea behind ELT is that we want to keep the data in its original raw format without touching the original state of the data. We typically see an ELT process within a data lake, and here we don't talk about a data warehouse but about a data repository. When we're talking about ETL and ELT testing, we basically just want to know whether the data that comes in is the same data that goes out. ETL and ELT testing provides some assurance throughout your data lifecycle, because you can see the data lifecycle as a whole chain of operations that happen from your source system to your target reporting system, and you want your data to remain consistent and valid throughout that whole lifecycle. Within ETL or ELT testing there are different categories of testing. One category is transformation mapping, where we look at the different transformations from our source to our target system and validate each of those transformations. Then there is source-to-target testing, which is similar to data lineage or integrity testing: we want to make sure that we have the same number of records in our source system as in our target environment. When we are performing database tests, we're typically talking about different levels of testing. At the top level of database testing you have schema testing, where you look at the whole schema in your database. The schema can contain different tables or objects, and when we're talking about table testing we're looking at the different tables. We can also test at the attribute level, where we look at the precision and data type of the different attributes and see if those match within our database. 3. Data Categories and Classifications: Okay, so before we actually start looking at data quality, let's first have a look at data. What is data? Let's first have a look at the different data categories. In general, when we're talking about information, we're typically talking about data that has been processed and serves a certain purpose, data that has a certain meaning and context to it; for example, when a company is presenting its financial results. Then we also have transactional data. Transactional data describes an event or a transaction; an example of an event is a customer who places an order or makes a payment. Those are all events and transactions that occur at a certain time and also have a certain transaction ID, for example. Master data typically describes our transactional data. When we're talking about master data, we're typically talking about, for example, customer information, names of customers, addresses, those types of things, but also, for example, your bank account number. Those are all master data. When we're talking about reference data, we're typically talking about a standardized list of values that is shared within your organization.
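As an aside on the source-to-target testing just described, a minimal record count comparison in SQL could look like the sketch below; the table names are placeholders rather than the course's own objects, so adapt them to whatever source and target you are comparing.

    -- Hypothetical source-to-target record count check (table names are placeholders).
    -- A non-zero difference signals that records were lost or duplicated during the load.
    SELECT
        (SELECT COUNT(*) FROM public.staging_orders)             AS source_count,
        (SELECT COUNT(*) FROM public.dm_customer_regional_sales) AS target_count,
        (SELECT COUNT(*) FROM public.staging_orders)
      - (SELECT COUNT(*) FROM public.dm_customer_regional_sales) AS difference;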
Coming back to reference data: a typical example is currency codes, which are standardized into a list of predefined ISO codes shared within your entire organization, so that everybody uses the same standard when talking about currency codes. Metadata is data that describes your data; for example, the properties of a media file, or the properties of the columns within a table in your database, are all metadata. These are the most important categories of data. Of course there are other categories as well, but this gives you a broad idea of the different categories of data that exist. When we're looking at data quality, and at quality and master data in general, we're actually looking for one version of the truth of our data, and we want to make sure that the view of our data is consistent across all our different reporting systems. In this example you can see that for contract ID 2356 there is one record with currency code Great British Pound and one with currency code Euro. We have two different currency codes, so we don't know which one is correct. This is a typical example of a data quality and master data management problem, because we want to understand why this currency code is different, and whether we're dealing with bad data quality or simply with two different currency codes for the same contract. When we're talking about data lineage, we want to get a view of the data from the source to the target system: a complete view of our data throughout its whole lifecycle. When we're talking about data governance, we're talking about principles and best practices we put in place to improve our data and keep it aligned throughout its whole lifecycle. Here are some examples of master data. You can have involved parties such as customers, business partners, suppliers and organizations; those are all examples of parties. You can have agreements such as contracts and contract information, for example currencies; all of those are also part of the master data. You can have financial master data such as billing and settlements, bank details such as the country of billing or the country where your bank is located, and accounting information such as accounting units. So there are all different types of master data. When we're talking about reference data, we're talking about a unified list of values that is shared within our organization. A typical example of reference data is the currency code. You can see here that we have four different ways of defining a currency code; they all have the same meaning, but they're expressed differently. What we try to do with reference data is to have one single definition of a currency code. This can be the ISO definition, the currency symbol, or the currency written out in full, like dollar, euro or pound, whatever you want, but it's important that we have a unified list so that we're talking about the same currency code throughout all our systems. So it's important to know which list of reference values we're going to use. As examples of reference data, currency codes are a typical case.
You also have ISIN codes for financial instruments such as stocks or bonds; these are codes that are shared within a certain industry or domain. There are specific ISIN codes for financial instruments, and there are also specific codes for electricity or gas meters, so those are really industry- or domain-specific. You also have accounting standards, which are shared across many organizations. Then there is organization-specific reference data: reference data that you only use within your organization and that is not shared with the entire world or the entire sector. For organization-specific reference data, think of codes for the business units within your organization, or customer segments for marketing purposes. You can also have technical standards and units of measurement; an example is a date and time standard, for instance when we want to report over a certain period. When we're talking about metadata, we're talking about data that describes our data or provides context to our data. In this case we have two tables. In the first table we don't have any metadata available, so when we want to retrieve all the contracts with currency code Great British Pound, we see that we have three contracts. But only one of the three records is actually valid, because when we look at the effective-from and effective-to dates and the insert timestamps, which are all metadata, we can see that only one of them is still effective. So metadata helps us understand our data, helps us put a meaning to our data, and also helps us judge whether our data is correct. Examples of metadata are titles, descriptions, tags, insert dates and timestamps within your database, and file extensions. You can also have different data classifications, for example sensitive, public and private data. Metadata can also help you understand whether your data is still relevant, as in the previous example, where only one of the three records was still relevant and the two others were outdated. So metadata also helps you understand the status or current state of your data. 4. Data Quality Rules and Dimensions: Okay, so next we're going to have a look at the data quality dimensions and data quality rules. What are data quality dimensions? Data quality dimensions are sets of data quality rules for measuring data quality performance, and there are several frameworks that define these dimensions. The DAMA framework is one of the most well-known data quality frameworks, and it consists of a few data quality dimensions: completeness, uniqueness, timeliness, validity, accuracy and consistency. These are what we call the data quality dimensions. So let's have a look at some examples. When we talk about completeness, we're mostly asking whether our data set is complete: does it contain any null records? Basically, we don't want to see any data missing, so one way of looking at data quality completeness is by looking at the number of null records or null values within your data set. Validity tells you whether your data is still valid at a certain point in time; in this example, my validity date shows me whether my data is still valid at the current date.
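A check like that can be written as a simple date filter. This is only a sketch: the contracts table and its effective_from and effective_to columns are assumed names for illustration, not part of the course data.

    -- Keep only the records that are effective today, using the metadata columns.
    SELECT contract_id, currency_code
    FROM   public.contracts
    WHERE  CURRENT_DATE BETWEEN effective_from AND effective_to;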
Within the consistency dimension, you want to know whether your data is consistent and reflected in the same manner across all your data sets. For example, you want to know if the values within one column are consistent; in this case you can see that some of the currency codes read GBP and some other currency codes read Great British Pounds, so there is some inconsistency in the way we represent this currency code. When we look at uniqueness, we want to know if we have any duplicates within our data set, and whether each record that should be unique is indeed unique. Accuracy describes whether your data is up to date at this moment; in this example we want to see if our contract is still up to date today. And within the timeliness dimension, you want to know whether your data is available on time. Based on these data quality dimensions we can build a data quality repository, and within this central repository we can store all our data quality rules according to each of those dimensions. Why do we need this data quality rules repository? Because data quality is important and poor data quality is everywhere. Most organizations tackle data quality ad hoc and don't really have one common strategy for tackling data quality issues. Maintaining good data quality is also very time consuming, and therefore it's important that we think about data quality and about building a rules repository. Here are some examples of data quality rules. Within the consistency dimension, for example, our currency needs to be within a list of allowed ISO values, so that all our currency codes are consistent within our data set and follow the same standard. For validity we have, for example, a rule that tests whether age is between 0 and 120; this is our business rule, which defines when an age is valid. For completeness we want to make sure we don't have any missing values within our data set; an example of such a data quality rule is that name is mandatory. Another example of a validity rule is that an email address has to be in the right format; of course, it's up to you how you define the right format. These are all examples of data quality rules, and in the exercises of this course we're also going to create some data quality rules which we will later build in PostgreSQL. When we're talking about data quality, we're not talking about a one-time thing but about an iterative process: we want to continuously improve our data quality by following a few steps, and this process is called data quality monitoring. First we start by profiling our data; within data profiling you look at your metadata to understand some of the key characteristics of your data. Within your data quality analysis you can set up a list of business rules and data requirements based on, for example, your data quality dimensions. Data cleansing is the process of cleaning your data, finding matches within your data, and eliminating errors, duplicates, inconsistencies and so on. When we talk about data profiling, we're talking about metadata, and metadata is all the data that describes your data. For example, for each column within our tables we have the data type, the length, the precision and an indicator of whether it is nullable or not, and we can use this information to learn more about the data that we're actually using. Other examples of metadata are keywords and tags that describe our data.
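As a side note, PostgreSQL exposes exactly these column characteristics through the standard information_schema catalog. A minimal profiling query could look like the sketch below; the schema filter is an assumption, so adjust it to your own setup.

    -- Sketch of column-level profiling metadata from information_schema.
    SELECT table_schema,
           table_name,
           column_name,
           data_type,
           character_maximum_length,
           numeric_precision,
           is_nullable
    FROM   information_schema.columns
    WHERE  table_schema = 'public'
    ORDER  BY table_name, ordinal_position;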
Another example of metadata is the relationships between tables, such as primary and foreign key relationships. Within data profiling we're also interested in descriptive statistics of our tables, so we want to learn more about minimum and maximum values, the total number of values, and the distribution of values across our data sets. We can perform data profiling at various levels: at the highest level we have database profiling, then schema, entity and attribute profiling, depending on the level of validation you're looking at. Of course, there are also several tools that can help you with data quality monitoring. One of the leaders in data quality tools is Informatica, and what you can see in this quadrant is that companies with a background in building ETL tools also have a strong position in data quality tools, because these data quality tools use some sort of ETL in the background. So you see Informatica there as a market leader; you also have some challengers like IBM and SAP, you have Oracle, and then there is Talend, which has an open-source solution for data quality. These are the most important data quality tool providers at the moment. At the top of the list we have Informatica, which has Informatica Data Quality. Talend offers an open-source solution for data quality. Then you have IBM, which has QualityStage as part of the InfoSphere platform, and Information Analyzer, which is also an IBM product; we mostly see that companies that already use IBM InfoSphere DataStage also use QualityStage as their data quality tool. Then we have SAS Data Quality, which is frequently used by companies that already use SAS, and of course there are also Oracle and some other data quality providers. 5. Data Requirements and Test Design - Exercise: Okay, so in the following exercise we're going to build some data quality rules. You can find some sample CSV files that you can load into your database before we start digging into the exercises. In this exercise I've created some staging tables and some data warehouse tables, which are a one-to-one copy of our CSV files, and one small data mart, customer regional sales, which combines data from the data warehouse orders table and the data warehouse customers table. Finally, we're also going to build some database views on top of our data warehouse, and we're going to use Qlik Sense to report on our data quality and build some data quality dashboards. In this exercise we're not going to focus on building our staging, data warehouse and data mart tables; we're going to focus on how we can build database views to monitor and govern our data quality, and on building a custom dashboard in Qlik Sense to monitor our data quality. So let's have a look at our data requirements. First, we have data quality completeness. In total we have five tables, one for each of the CSV files that we loaded into our database, and for the customers, managers, orders, returns and products tables we can't have any empty fields, so we want to build a database view to check whether these tables contain any null values. Next, we're going to test our data uniqueness. Each customer name should be unique within the customers table, and there should only be one manager for each region within the managers table, so each manager should be unique.
Within the orders table we should have a unique order ID for each order, and within the returns table one order can only be returned once, so each order return has to have a unique order ID. Within the products table, each product name should be unique. These are our data requirements for data uniqueness. Then we have some validity requirements as well. These are all just business rules that you can define for each of your tables. Within the staging orders table we're going to build data validity rules for discount, order date, order ID, order quantity, row ID, sales, ship date, shipping cost and unit price. We have the following requirements. Discount has to be smaller than one and greater than or equal to zero, because we can never have more than 100% discount. In this exercise we're also going to build a business rule for order date: it has to be greater than 2000-01-01. Then we have order ID, which has to be greater than or equal to one, because we don't want to see any order ID which is zero or negative. Then we have order quantity, which has to be greater than zero, because we cannot have any negative quantities. Then we have row ID, which has to be greater than or equal to one, and sales, which has to be positive as well, because we don't want to see any negative sales amount. We have ship date, which has to be greater than the order date, shipping cost, which has to be greater than zero because we don't have any zero-cost shipments, and unit price, which has to be greater than zero as well. So these are our data validity requirements. Then we have data consistency. Here we're going to look at the consistency of values for each of the following columns: within the customers table we look at province, region and customer segment; within the managers table we look at region; and we also look at order priority, ship mode, status, product category and product subcategory. Then we have data integrity, where we're going to look at the lineage of our data from source to target, from the staging orders and staging customers tables to our data mart customer regional sales. And finally, we're also going to look into data profiling, where we're interested in the following validations: we're going to count the total number of records and look at data types, lengths, precisions and nullable indicators. So these are our data requirements. Now let's have a look at database views and how we can build those requirements within PostgreSQL. 6. What are Database Views?: Okay, so what we're going to do next is have a look at database views. So what are database views? Most of you might know that a database can contain several schemas, and within each database schema you can have different objects. You can have database tables, which actually store data within a certain table, but you can also have database views, which don't store the data but are stored queries that fetch the data when you execute the view. So database views are not the same as tables, because they don't contain the data itself, but only view the data of one or multiple tables through a select statement. A simple example of a view is one where you just select all the data within one table, but you can also have a view which returns the result of multiple joined tables.
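As a minimal illustration of both cases, the sketches below define one view over a single table and one over a join; the view and table names are placeholders rather than the course's own objects.

    -- A view over a single table: only the query is stored, not the data.
    CREATE OR REPLACE VIEW v_orders AS
    SELECT *
    FROM   public.source_orders;

    -- A view over two joined tables (illustrative names).
    CREATE OR REPLACE VIEW v_orders_with_customers AS
    SELECT o.order_id,
           o.order_date,
           c.customer_name
    FROM   public.source_orders    o
    JOIN   public.source_customers c ON c.customer_id = o.customer_id;

Querying such a view with a select star behaves just like querying a table, but the result is always computed from the current contents of the source tables.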
As we said, database views are database objects that, contrary to database tables, don't store the data, but store the query to fetch the data from your source tables. They are virtual and don't contain the actual data; they are the result of a stored query, which is simply a select query that can use multiple joins, formulas, aggregations, filters and so on to fetch data from one or more tables, just as if you would execute a select query. The syntax of a database view for PostgreSQL and Oracle is something like: create or replace view, then you give your database view a name, and then you define your view by writing your select statement, which can be, for example, select star from source table. So, as we already said, database views are not the same as database tables because they don't contain the actual data. Why do we want to use database views instead of tables? Well, it's quite simple. If we use a database view, we get an ad hoc view of our data quality requirements and of the source tables that comply with these requirements, without having to insert, update or delete a new table that contains the same information. So we can always see the latest result of our data quality requirements within our database view. It's also more dynamic than creating a separate database table which contains the same information but then actually stores that information, and a view takes less storage within our database, because we don't store the data itself; we just view the data through a select statement on our source tables. So what's next? We're going to build a few data quality views that give us an ad hoc view of the data quality requirements for the different tables within our database. We're going to build a total of six views. The first will be a data profiling view, where we just want to have a general look at our metadata. Then we have our data quality uniqueness view, where we'll check our different tables for uniqueness constraints and see if there are duplicates within our database tables. Then we have the data quality completeness view, where we will check for null records. In the data quality validity view we will have a look at the validity constraints or business rules for our different database tables. And then we have the data quality consistency view and integrity view, where we'll have a look at the consistency within our different tables and at the data lineage. So what's next is that we're going to build these views within our database. In this example we're going to use a PostgreSQL database, but you could also use an Oracle database, for example. 7. View Data Quality Completeness - Exercise: Okay, so once we have our connection set up with our database, we can start building some dashboards. I start by adding my first sheet, which I call data completeness, and I can start by adding a table which gives me the total number of records and the total number of null records for each table and each attribute. So I filter on my data quality completeness view and I add my table name, my attributes for each table, the total number of null records and also the total number of records, and I call this visualization my detailed analysis. Okay, so next we can also add two pie charts, which give me an overview of the total number of null records for each table and each attribute.
So I start by adding my table, and I also add my count of the null records. I call this one summary chart: null records per table. I want to have a pie chart, so I change the chart type to a pie chart. I do the same for my attributes, also change the chart type to a pie chart, and call this visualization summary chart: null records per attribute. If I now click on done, I can see my visualization sheet, and I can see that I have some null records for the staging products and the data warehouse products tables, and I also have some null records for product base margin. If we click on a table, for example the staging products table, then you will see that there are 63 cases where the product base margin is null within this table. Each time I click on a visualization, Qlik Sense will automatically filter the records for my selection. If I select the data warehouse products table, you will see that there are also 63 cases with a null value for the product base margin. If I now look at my database view, you can see that these cases are present within my view as well; you can see the product base margin here and also here. I can also use a query to find the cases where the product base margin is null, and you can see that for these cases I have null values within my product base margin. If I now count the total number of records here, you will see that I also get 63 results, which is the same as in my visualization when I filter on the staging products table. If I now update my staging products table and replace these null values with the value zero where the product base margin is null, then I no longer have any records with a null product base margin, and you can see in my data quality completeness view that the product base margin for staging products does not have any null records anymore. Then I reload my data from my data source; if I now reload this data and go back to my analysis sheet, you can see that I don't have any null records anymore for the staging products table. 8. View Data Quality Uniqueness - Exercise: Okay, so what we're going to do next is build our data quality uniqueness sheet. I call this sheet data uniqueness, and I filter on my data quality uniqueness view and start adding my table. I also add my constraint, my count of the distinct records, count of records and count of duplicates, and I change my visualization type from scatter plot to a table. I call this visualization my detailed analysis. Next we add a summary chart with all our duplicates for each table. I start by adding my table, I also add the count of the duplicates, and I call this chart summary chart: duplicates per table. As you can see here in my dashboard, I have some tables which have duplicate records, and if I simply select these tables, you can see the number of duplicates for each table and each unique constraint. You can see that for order ID I have some duplicate records, about 2900 of them, and that product name also contains some duplicates. I can also take a snapshot and export this data to a PDF, for example. And if I go back to PostgreSQL, I can see which records contain duplicates, and here we can see our duplicate order IDs. 9. View Data Quality Validity - Exercise: Okay, so next we're going to build our data validity dashboard.
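Before that, note that the completeness fix applied above comes down to a single update statement along these lines; the table and column names are assumptions about how the exercise tables are named.

    -- Replace missing base margins with zero (illustrative table/column names).
    UPDATE public.stg_products
    SET    product_base_margin = 0
    WHERE  product_base_margin IS NULL;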
I call this dashboard data validity, and here we want to see which records are valid according to a certain set of business rules. I start by adding my data quality validity view, and I add my table. I also add my validity constraints for each table, my validity rules, the count of the total number of valid records for each validity rule and the total number of invalid records for that rule, and I want to use a table, so I call this table my detailed analysis. For now we've only added data requirements for the staging orders table, but you could add other tables as well. I'm also going to add a summary chart, so I start by adding my invalid records, my table and my validity constraints. We can choose a different visualization type; we can add, for example, a treemap or a bar chart, and we call this one summary chart: invalid records. You can see that for order quantity within the staging orders table I have some invalid records with an order quantity that is smaller than or equal to zero: a total of 165 records with an invalid order quantity. Okay, so let's now have a look at these records within our database. I'm going to select all the records from the staging orders table where the order quantity is smaller than or equal to zero, and we see that we don't have any orders with an order quantity smaller than or equal to zero. Still, we have some invalid records for our business rule, which says that the order quantity should be greater than zero. So let's have a look at the cases where the order quantity is null, for example. We also don't have any cases where the order quantity is null, and if I look at my data quality view, I can see that I don't have any invalid records for this data quality rule. This probably means that my Qlik Sense data needs to be refreshed, because I made some changes to my view. I just go back to Qlik Sense, go to my data, reload my table from the source and load the data again. If I go back to my analysis dashboard and look at my validity chart, I see that I no longer have any order quantities that violate this data requirement. 10. View Data Quality Consistency - Exercise: Okay, so next we're going to build our data quality consistency sheet. Within the data quality consistency sheet we just want to have an overview of the consistency of values within our data set; we want to know if there are any values that don't match the list of reference values that we have predefined. We call this sheet data consistency, and I start by filtering on my data quality consistency view. First of all, I want to add my detailed analysis table, so I start by adding a table. I also add my columns, the distinct values for each column and the number of occurrences of each distinct value per column, and I change this visualization type to a table and call it detailed analysis. Next I also want to add a visualization that shows me the distinct values for each column and highlights any special cases or values that don't occur that often and might point to inconsistency. So I start by adding my table, I also add my columns, the distinct values for each column and the number of records, and within the treemap you can see that the big blocks are my tables and the smaller blocks are my columns.
Then, for each column, I can see the smallest blocks, which are the different values, and the size of a block represents the number of occurrences of each distinct value. We can also choose another visualization type: if I remove my table and change my visualization type here, we can use, for example, a distribution plot, and I call this visualization summary chart: distribution of distinct records. We can also sort by number of records and inspect the distinct values for each of our columns to see if those values are consistent. If we want to show the distinct values for a certain column, we can simply select that column here, for example customer segment, and I can see the distinct values for customer segment and the number of occurrences. I can do the same for my order priority, and I can see that I have five distinct values for order priority, ranging from low to critical; I also have one category which is not specified. The great thing about Qlik Sense is that you can easily filter on one visualization and the filter is automatically applied to all the visualizations within the same sheet. For example, if I filter on one column, product category, then my summary chart is automatically updated and I can see the distinct values within my summary chart for product category. This makes it very easy to detect inconsistencies within your data set without having to query your database all over again. 11. View Data Quality Integrity - Exercise: Okay, so next we're going to build our data integrity sheet. I call the sheet data integrity, and I start by adding my detailed analysis table, which has a source table, a source column, a target table and a target column, with the staging orders and staging customers tables as the source tables and my data mart customer regional sales as the target. I also add the integrity mismatch, and I don't have any integrity mismatches. I call this visualization my detailed analysis. Then I want to add my summary chart, so I start by adding my source column, then my target column and the integrity mismatch, for which I need to specify that it is a measure, so I add it as a measure. I call this visualization summary chart: mismatches per column. My chart cannot be displayed because it contains only zero values, since I don't have any integrity mismatches. This is of course a good thing, but in this exercise we're going to create some mismatches to see whether they show up within our summary chart. So I go back to PostgreSQL, to my data mart table, and here I write an update to set, for example, my customer name to a test value where the order ID is smaller than 100. In this case I should have some mismatches. If I now look at my data quality integrity view, you can see the integrity mismatch for customer name, and this should also show up within my dashboard. If I now go back to Qlik Sense, reload my data from my data source, load this data again and go back to my dashboard, you can see that the new record has been added as an integrity mismatch. So you can see here that I have an integrity mismatch between customer name in the staging customers and data mart customer regional sales tables. 12. View Data Profiling - Exercise: Okay, so finally we're going to create a summary sheet, where we just summarize everything we've built so far.
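As an aside, the kind of comparison behind that integrity mismatch can be sketched roughly as below; the table and column names are assumptions for illustration, not the course's exact definitions.

    -- Hypothetical integrity (lineage) check between a source and a target column:
    -- count customer names present in staging but missing from the data mart.
    SELECT COUNT(*) AS integrity_mismatch
    FROM   public.stg_customers s
    LEFT   JOIN public.dm_customer_regional_sales t
           ON t.customer_name = s.customer_name
    WHERE  t.customer_name IS NULL;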
We're going to name the sheet summary, and here we just want to get a quick overview of our data quality. I go back to my sheets, copy the summary chart from data completeness, paste it in my summary sheet and rename it to data completeness: null records per table. Then I do the same for my data uniqueness, and also for data validity, data consistency and data integrity. I don't want to add a detailed analysis for each one, because for that I can go to my detailed sheets. Then I rename my visualizations: I call this one data uniqueness, this one data validity, then data consistency and data integrity. Then I add a title, which I call data quality measures, and a second title, which I call data profiling, because we're also going to add some data profiling statistics. Here I select my data profiling view, count records; I add my schema, my table, and the total number of records for each table. I change this to number of records per table, and I change the visualization type to a table. Here you can see my final dashboard, which contains my data quality measures and also the number of records for each table as part of my data profiling. In this dashboard I get a quick snapshot of my data quality, and I also get a view of the total number of records, a summary statistic, as part of my data profiling. If I want to dig deeper into one of the data quality dimensions, I just select the sheet I want and get the detailed statistics for each dimension. 13. Create a Database Connection: Okay, so once we've created our data quality views, we can actually start building a data quality dashboard where we can get a good overview of our data quality. In this tutorial I'm going to use Qlik Sense, but you can use any visualization tool; I could also use TIBCO Spotfire, MicroStrategy or Tableau, for example. Just about any visualization tool will do. I'm connecting with my personal account, and I'm going to create a new app, which I name data quality dashboards. If you're interested to learn more about Qlik Sense, there's also a tutorial on Qlik Sense where you can learn step by step how to load data into Qlik Sense and build custom visualizations, and you can use the coupon code shown here to get a special discount and learn to build your own Qlik Sense apps. For now, we're just going to add a new data source, and we're going to connect to a PostgreSQL database in our case, but you can also connect to an Oracle, MongoDB or SQL Server database, depending on what you're actually using. I'm going to enter my database properties: my host name is my local host, so I enter localhost; my database name is postgres; then I enter my database user name and my password, and I connect to my database. You can see my connection here. I can select my database owner, which is public, and then I can see all the tables within my schema. I just select my data profiling and data quality views, and I add these to my data sets within Qlik Sense. Okay, so now we have created a connection between our database and our visualization tool, and we have selected the views that we want to include in our data set. What we're now going to do is load this data into Qlik Sense. I don't want to create any associations between the views; I just want to load the data, so I click on Load data.
The data was loaded successfully, and I can now start editing my dashboard. 14. Data Quality Completeness Dashboard - Exercise: In this example we're going to look at the completeness of our data set, so we want to count the number of records that have a null value. If we look at our data set, for example the records in our staging orders table, we can check whether there are any null records present. I can also copy this query into the query panel and use a where statement to select only the records that have an order ID that is null, and this query will give me all the records with a null order ID. I can do the same for, for example, the order date. As you can see, I don't have any records with a null order ID or order date. What we're going to do next is create a database view which gives us all the null records for each column in each table within our schema. So what you can do next is have a look at the SQL statements that are included in this course and copy-paste the statement for creating the data quality completeness view into your database. As you can see here, we're going to create or replace a new database view, which is called dq_completeness, and to define this database view we use create or replace view dq_completeness as, followed by our select statements. Each select statement within our view will give a new row in the view, because we use a union, and within each select statement we give the table name and the attribute name, and we take the sum of the cases where the attribute, or the column, is null. We do a union and do the same for the second column, which is province, the same for region, the same for customer segment, and the same for order ID, order date and sales, which are all the columns that we have in our data mart customer regional sales. If we execute this query, the new view is created within our database; if we refresh the views, we'll see that there is a new view called dq_completeness. Let's have a look at this new view. Inside you can see all the tables (in this case we just added one table, which is the data mart customer regional sales), all the attributes or columns of this table, and all the null counts. Here you can see that we have some null records for order ID, order date and sales, and if we look at the data mart, we also see that null records are present for these three columns. If we want to add another table to this view, we simply create a union, select our new table, which is, for example, data warehouse customers, take our attribute name, which is customer name, and take the sum of the cases where the customer name is null, from public, which is the schema name, dot the table name, which is data warehouse customers. Okay, and we do the same for the other columns within our table. We then execute our query again, and if you now look at the view, you will see that the new table, data warehouse customers, has been added to our view, and you can see that we don't have any null records for any of its columns. We can also use a query to select data from our view, and we can even order by the count of null records descending, which gives us the most null records at the top and the fewest at the bottom. We see that we have some null records within the data mart table for order ID, order date and sales.
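A trimmed-down sketch of such a completeness view could look as follows; the table and column names are assumed spellings of the objects described here, and only three of the columns are shown.

    -- Sketch of a completeness view: one row per table/column with its null count.
    CREATE OR REPLACE VIEW dq_completeness AS
    SELECT 'dm_customer_regional_sales' AS table_name,
           'order_id'                   AS attribute_name,
           SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS count_null_records
    FROM   public.dm_customer_regional_sales
    UNION ALL
    SELECT 'dm_customer_regional_sales',
           'order_date',
           SUM(CASE WHEN order_date IS NULL THEN 1 ELSE 0 END)
    FROM   public.dm_customer_regional_sales
    UNION ALL
    SELECT 'dwh_customers',
           'customer_name',
           SUM(CASE WHEN customer_name IS NULL THEN 1 ELSE 0 END)
    FROM   public.dwh_customers;

    -- Usage: list the most incomplete columns first.
    SELECT * FROM dq_completeness ORDER BY count_null_records DESC;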
We also have some null records within the staging products table for product base margin, and within the data warehouse products table for product base margin as well. 15. Data Quality Uniqueness Dashboard - Exercise: Okay, so what we're going to do next is have a look at data quality uniqueness. What we want to validate here is that we don't have any duplicates within our tables, so we can basically do a count of the total number of records within our table and compare this to the total number of distinct records for a combination of columns. For example, in the orders table I can look at the total number of records within my table by using a select count star from, and then I select my staging orders table. As you can see, I have a total of 8399 orders within my orders table. If I want to look at the distinct order IDs, I use a select count with distinct order ID, because I want the distinct order IDs, from public staging orders, and I get the total number of distinct order IDs within my table. As you can see, I have 5496 distinct orders within my table, so I have more records than distinct order IDs, which means my order ID is not unique. We can also have a look at the orders that have duplicates. If I now add a group by to group my records by order ID and count all the records for each order ID, so I add count star and also my distinct order ID, I get the number of duplicates, or the number of entries, for each order ID. You can see that some orders only have one occurrence, but some orders have multiple occurrences. If I order by count star descending, I get the orders with the most occurrences on top, and I can also add having count star greater than one to get only the duplicates. Okay, so what we're going to do next is create our database view. I start by adding create or replace view data quality uniqueness, which is the name of our database view, and then I define all the fields and all the records that I want to include within my database view. We want a couple of things within our database view: the total number of records for each table, and the total number of distinct records according to a certain criterion. I start by adding some default values, then I add the total number of records, the total number of distinct records, and the difference between the total number of records with an order ID and the total number of distinct records, which will be the number of duplicates within my staging orders table. We also have some other uniqueness requirements, so I can add those by using a union. I add, for example, my requirement for my returns table, where the order ID should also be unique, so I add count order ID and count distinct order ID from the staging returns table. Then finally we add all the other columns and tables, and I press F5 to execute the query. As you can see, we have now created the data quality view. If I refresh here, you see there is a new view called data quality uniqueness, and if I view the data within this view, you will see all of my tables, all my columns, the total number of records within each table and the total number of distinct records. If we now take a look back at our data requirements, we can see that we have some duplicate order IDs within the orders table.
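In compact form, the checks just described come down to queries like these; stg_orders is an assumed spelling of the staging orders table.

    -- Total vs. distinct counts for the uniqueness constraint on order_id.
    SELECT COUNT(*)                            AS count_records,
           COUNT(DISTINCT order_id)            AS count_distinct,
           COUNT(*) - COUNT(DISTINCT order_id) AS count_duplicates
    FROM   public.stg_orders;

    -- Only the order IDs that occur more than once, most frequent first.
    SELECT   order_id, COUNT(*) AS occurrences
    FROM     public.stg_orders
    GROUP BY order_id
    HAVING   COUNT(*) > 1
    ORDER BY occurrences DESC;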
16. Data Quality Validity Dashboard - Exercise: What we're going to do next is have a look at the validity of our data. A typical example of a data validity check: if you have a bank account number, your data quality check could be that the first characters of the account number follow a certain country code. Data validity checks are important because you want to know what your data looks like and whether it matches the list of values you expect; these expected values can be stored in a separate reference table, and this can be part of your general master data management process. Let's have a look at the staging orders table. Within the staging orders table we have a few amounts that have to be greater than zero. For example, order quantity always has to be greater than zero, and the same goes for the sales amount: we always want our sales to be positive. Discount should always be between zero and one, because we cannot offer more than 100% discount. Our unit price should also be greater than zero, and our shipping cost can be zero or greater, but cannot be negative. We also have a ship date and an order date, and according to our requirements we cannot have an order date that is later than the ship date, because a customer first has to place an order before it can be shipped.

Let's now open the query tool and start by selecting the records that have an order quantity that is less than or equal to zero. Good: we don't have any invalid records for order quantity. Let's also check sales; for sales we don't have any invalid records either, according to the data requirement. We can also check the order ID, which is also looking good, and the row ID, where we also don't have any invalid records. These are all basic checks, but what if I want to check, for example, whether my order date is later than my ship date? If I add a where clause with ship date less than or equal to order date, I get 991 rows where the order date is greater than or equal to the ship date, which can still be valid if the ship date and the order date are the same. So let's look at the cases where the order date is strictly greater than the ship date: there are none, so all my records are valid according to this validity check.

Of course, we want to put all our validity checks within one view, so we create a new view starting with CREATE OR REPLACE VIEW, and we call this view data quality validity. To define the view we use a select statement and select our default values, which include the name of the table; in this case that is staging orders. Our first validity check, our validity constraint, is that the row ID has to be greater than zero. So we want to count the cases where the row ID is greater than zero as the valid cases, and the cases where it is not as the invalid cases. I take the sum of the cases where the row ID is greater than zero (then 1, else 0) to get the total number of valid records, and the opposite to get the total number of invalid records. I add my source table, which is the staging orders table within the public schema, and my new view is created. If I now look at the views, I can see that the data quality validity view has been added. We can do the same for our other validity requirements: I use a UNION ALL to add each requirement with the same kind of select statement. In this case I use the order ID to check whether the order ID is greater than one, and I can remove this part. I also add a validity check to see whether my order date is later than the first of January 2000, so I select the cases where the order date is greater than 2000-01-01 and take the sum of those valid cases. I also add my order quantity and my sales amount, which both have to be greater than zero, my discount, which has to be between zero and one, my unit price and my shipping cost, which also have to be greater than zero, and finally the ship date, which has to be greater than or equal to the order date. Then I create my view. If we now have a look at the view, we can see all of the validity constraints for our staging orders table together with the validity rules, the total number of valid records according to each criterion and the total number of invalid records. You can see that some validity constraints have been violated, for example for order quantity, where we have 165 records with an order quantity that is less than or equal to zero.
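A minimal sketch of such a validity view, using three of the rules mentioned above; the table and column names are assumptions based on the narration, and records where the rule evaluates to NULL fall into the invalid bucket here.

    -- Validity: count valid vs. invalid records per business rule
    CREATE OR REPLACE VIEW dq_validity AS
    SELECT 'stg_orders'         AS table_name,
           'order_quantity > 0' AS validity_rule,
           SUM(CASE WHEN order_quantity > 0 THEN 1 ELSE 0 END) AS count_valid,
           SUM(CASE WHEN order_quantity > 0 THEN 0 ELSE 1 END) AS count_invalid
    FROM   public.stg_orders
    UNION ALL
    SELECT 'stg_orders',
           'discount between 0 and 1',
           SUM(CASE WHEN discount BETWEEN 0 AND 1 THEN 1 ELSE 0 END),
           SUM(CASE WHEN discount BETWEEN 0 AND 1 THEN 0 ELSE 1 END)
    FROM   public.stg_orders
    UNION ALL
    SELECT 'stg_orders',
           'ship_date >= order_date',
           SUM(CASE WHEN ship_date >= order_date THEN 1 ELSE 0 END),
           SUM(CASE WHEN ship_date >= order_date THEN 0 ELSE 1 END)
    FROM   public.stg_orders;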
17. Data Quality Consistency Dashboard - Exercise: What we're going to do next is have a look at the consistency of the data within our data set. If we look at the staging orders table, for example, we have some attributes that quantify our data, such as sales, order quantity and profit: these are all metrics, basically numeric values. We also have other attributes, like order priority or ship mode, that don't measure or quantify the data but classify it. Within a typical data warehouse environment you will have a list of referential values that you can map to these attributes, and you want to see the distinct values of these attributes to check whether there are any inconsistent records. So when we test our classifications against a list of referential values, which is closely related to master data management, we are doing data quality consistency checks.

Let's now have a look at the distinct values within our staging orders table. I go back to the query panel, select the distinct values of order priority, and do the same for ship mode. If I execute the query for my distinct order priorities, I see that I have five distinct values for order priority, and I don't have any special cases except for "not specified", which could still be a valid case. If I want to see the total number of records for each of these values, I simply add a count and a GROUP BY order priority, and I see all my different order priorities together with the number of occurrences for each priority. I can do the same for my ship mode, and you can see that I only have three different classifications for ship mode.

What we're going to do next is create our database view. I start with my create statement and call this view data quality consistency. First I add my staging orders table with order priority, and because I want the total number of records for each classification, I add the name of my table (the staging orders table), the name of my attribute (in this case order priority), and a GROUP BY order priority. If we now execute the query, we can see that the view is created. If I have a look at our views, the data quality consistency view has been added, and we can see the distinct values for order priority within the staging orders table. We can do the same for the other data requirements. I now add my ship mode, and if I look at the data quality consistency view again, you can see that the classifications for ship mode have been added as well. Let's do the same for status within the staging orders table, and then add province, region and customer segment from the staging customers table. We also want to add region from the staging region manager table, and product category and product subcategory from the staging products table. Let's now re-execute the query and have a look at the data quality consistency view: we can see all the data requirements in here and the number of occurrences for each distinct classification.
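A sketch of the consistency view under the same assumptions about table and column names; each block lists the distinct classifications of one attribute together with how often they occur.

    -- Consistency: distinct classification values and their frequencies
    CREATE OR REPLACE VIEW dq_consistency AS
    SELECT 'stg_orders'     AS table_name,
           'order_priority' AS attribute_name,
           order_priority   AS classification,
           COUNT(*)         AS count_records
    FROM   public.stg_orders
    GROUP  BY order_priority
    UNION ALL
    SELECT 'stg_orders',
           'ship_mode',
           ship_mode,
           COUNT(*)
    FROM   public.stg_orders
    GROUP  BY ship_mode;

The same pattern extends to status, province, region, customer segment and the product categories: one SELECT ... GROUP BY block per attribute, joined with UNION ALL.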
18. Data Quality Integrity Dashboard - Exercise: What we're going to do next is have a look at integrity, or data lineage. What is data lineage and why is it so important? Data lineage shows you how your data moves from your source to your target, because basically we want to know that the data that we get in is the same data that we send out. How do we check this? Within our exercise we have a data mart called data mart customer regional sales, and we want to check that all the records within our staging tables (the data that we get in) are also present within our data mart. We do this by taking the difference between our data mart and our staging table. To get the difference between two select statements in PostgreSQL we can use EXCEPT, which is similar to MINUS in Oracle. We simply create our first select statement: in this case I take the order ID from the target table, which is the data mart customer regional sales. Then I use EXCEPT and write my second select statement, where I join the orders table with the data mart table based on the order ID. If I execute this query, you can see that there are no differences between the order IDs within our data mart and our source staging table. If we now want to do the same for the order date, I select the order date within my target table, select the order date within my source table, and join my source and target tables. If I execute the query again, you can see that for order dates my data is also perfectly aligned.

What we're going to do next is create our database view, which we call data quality integrity. We add the staging orders table as our source table, the order ID as our source column, the data mart customer regional sales as our target table, and the order ID within this data mart as our target column. Then we add our minus (EXCEPT) query and join our source and target tables. Good: my database view is now created, and I can add my other data requirements. I start by adding the order date, and then I add sales, which is also present within my data mart. We can do the same for the customer columns, customer name, province, region and customer segment, which are present within the staging customers table, so I add these four columns. When we try to execute this query, it tells me that my date and numeric data types cannot be matched, because I try to select a date and a numeric field within the same column. To prevent this, I can cast all my columns to a common (text) data type, and then my database view is created. If we now have a look at the view, we can see that we don't have any mismatches, so all my data is perfectly aligned and my data lineage is correct.
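A sketch of the integrity/lineage check with EXCEPT, again using assumed table names. Each block returns target-side values that cannot be matched to a source record; the cast to text lets date and numeric checks share one column, and the join key used for the order date block is my assumption.

    -- Integrity: values in the data mart that cannot be traced back to staging
    CREATE OR REPLACE VIEW dq_integrity AS
    (
      -- order_id: rows in the target with no matching row in the source
      SELECT 'stg_orders'                 AS source_table,
             'order_id'                   AS source_column,
             'dm_customer_regional_sales' AS target_table,
             CAST(t.order_id AS text)     AS unmatched_value
      FROM   public.dm_customer_regional_sales t
      EXCEPT
      SELECT 'stg_orders',
             'order_id',
             'dm_customer_regional_sales',
             CAST(t.order_id AS text)
      FROM   public.stg_orders s
      JOIN   public.dm_customer_regional_sales t ON t.order_id = s.order_id
    )
    UNION ALL
    (
      -- order_date: same pattern, joined on order_id (assumed key)
      SELECT 'stg_orders',
             'order_date',
             'dm_customer_regional_sales',
             CAST(t.order_date AS text)
      FROM   public.dm_customer_regional_sales t
      EXCEPT
      SELECT 'stg_orders',
             'order_date',
             'dm_customer_regional_sales',
             CAST(t.order_date AS text)
      FROM   public.stg_orders s
      JOIN   public.dm_customer_regional_sales t ON t.order_id = s.order_id
    );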
19. Data Profiling & Summary Dashboard - Exercise: What we're going to do next is have a look at data profiling. Before you actually start looking at data quality rules and data quality in general, you can already look at the metadata of your tables to get a quick insight into your data. Data profiling helps you get a better view of your data and shows some of the most important metadata characteristics of your data sets. Typical data profiling checks on a database are simply counting the total number of records for each table and looking at minimum and maximum values and averages, but also looking at the data types of the different attributes, primary key constraints, relationships between tables, and so on. Within PostgreSQL, and also within Oracle, you can look at this metadata through the data dictionary views, and I can simply use a select statement to get a view of all the metadata within my schema. If I select from pg_stat_all_tables, you can see that I get the number of inserts, updates and deletes, and also a count of the total number of records. To find my tables, I can simply add a where clause on relname (the name of my table) like STG for my staging tables, or DWH, or DM, and as you can see I get all my tables within my public schema. We can use this metadata, for example, to create a data profiling view that counts the total number of records for each table: we select the schema name, the table name and the number of inserts from pg_stat_all_tables where the table name contains STG, DWH or DM.
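A sketch of that record-count profiling view based on pg_stat_all_tables. The view name is my underscored guess at "data profiling count records", the stg/dwh/dm prefixes follow the naming used in the course, and note that n_tup_ins counts rows inserted since the statistics were last reset, so it is only an approximation of the current row count (n_live_tup is an alternative estimate).

    -- Data profiling: approximate record counts per table from PostgreSQL statistics
    CREATE OR REPLACE VIEW data_profiling_count_records AS
    SELECT schemaname AS schema_name,
           relname    AS table_name,
           n_tup_ins  AS count_inserted_records
    FROM   pg_stat_all_tables
    WHERE  relname LIKE 'stg%'
       OR  relname LIKE 'dwh%'
       OR  relname LIKE 'dm%';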
If we want to look at the data types of the different columns, we can look at information_schema.columns within PostgreSQL. I can do this with a simple select statement, and you can see the catalog, the schema, the table names, the column names, the position within the table, whether the column is nullable (so whether it can contain null values), the data type of the column, its maximum length, and so on. There is a whole bunch of information within information_schema.columns, which is similar to Oracle's ALL_TAB_COLUMNS data dictionary view. For example, if I add a where clause where the table name is like STG, DWH or DM, I get all of my STG, DWH and DM tables within my public schema, and you can see all their columns in here: the position of each column, whether it is nullable, and its data type. We can now also create a new database view, similar to the previous one, which we call data profiling information schema, and you can see here that my two new views, data profiling count records and data profiling information schema, have been created.
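And a matching sketch for the column-metadata view on information_schema.columns; again, the exact view name and the table-name prefixes are assumptions based on the narration.

    -- Data profiling: column-level metadata for the staging, DWH and data mart tables
    CREATE OR REPLACE VIEW data_profiling_information_schema AS
    SELECT table_schema,
           table_name,
           column_name,
           ordinal_position,
           is_nullable,
           data_type,
           character_maximum_length
    FROM   information_schema.columns
    WHERE  table_name LIKE 'stg%'
       OR  table_name LIKE 'dwh%'
       OR  table_name LIKE 'dm%';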