Azure Data Factory Essentials Training | Everton Oliveira | Skillshare

Azure Data Factory Essentials Training

Everton Oliveira

Lessons in This Class

61 Lessons (3h 39m)
    • 1. Introduction

    • 2. Getting Started

    • 3. Understand Azure Data Factory Components

    • 4. Ingesting and Transforming Data with Azure Data Factory

    • 5. Integrate Azure Data Factory with Databricks

    • 6. Continuous Integration and Continuous Delivery (CI/CD) for Azure Data Factory

    • 7. Sign up for your Azure free account

    • 8. Setting up a Budget

    • 9. How setup Azure Data Factory using Azure Portal

    • 10. How setup Azure Data Factory using PowerShell

    • 11. ADF Components - Linked Services

    • 12. ADF Components - Pipelines

    • 13. ADF Components - Datasets

    • 14. ADF Components - Activities

    • 15. ADF Components - Pipeline Parameters

    • 16. ADF Components - Activity Parameters

    • 17. ADF Components - Global Parameters

    • 18. ADF Components - Triggers

    • 19. ADF Components - Azure Integration Runtime

    • 20. ADF Components - Self-Hosted Integration Runtime

    • 21. ADF Components - Linked Self-Hosted Integration Runtime

    • 22. ADF Components - Azure-SSIS Integration Runtime

    • 23. Quiz - Module 3

    • 24. How to Ingest Data using Copy Activity into Azure Data Lake Gen2

    • 25. How to Copy Parquet Files from AWS S3 to Azure SQL Database

    • 26. Creating ADF Linked Service for Azure SQL Database

    • 27. How to Grant Permissions on Azure SQL DB to Data Factory Managed Identity

    • 28. How to Grant Permissions on Azure SQL DB to Data Factory Managed Identity

    • 29. Copy Parquet Files from AWS S3 into Data Lake and Azure SQL Database

    • 30. Monitoring ADF Pipeline Execution

    • 31. Mapping Data Flow Walk-through

    • 32. Mapping Data Flows Transformations - Multiple Inputs/Outputs

    • 33. Mapping Data Flows Transformations - Schema Modifier

    • 34. Mapping Data Flows Transformations - Formatters

    • 35. Mapping Data Flows Transformations - Row Modifier

    • 36. Mapping Data Flows Transformations - Destination

    • 37. Defining Source Type; Dataset vs Inline

    • 38. Defining Source Options

    • 39. Spinning Up Data Flow Spark Cluster

    • 40. Defining Data Source Input Type

    • 41. Defining Data Schema

    • 42. Optimizing Loads with Partitions

    • 43. Data Preview from Source Transformation

    • 44. How to add a Sink to a Mapping Data Flow

    • 45. How to Execute a Mapping Data Flow

    • 46. Quiz - Module 5

    • 47. Project Walkthrough - Integrating Azure Data Factory with Databricks

    • 48. How to Create Azure Databricks and Import Notebooks

    • 49. How to Create Azure Databricks and Import Notebooks

    • 50. Validating Data Transfer in Databricks and Data Factory

    • 51. How to Use ADF to Orchestrate Data Transformation Using a Databricks Notebook

    • 52. Quiz - Module 6

    • 53. DevOps - How to Create an Azure DevOps Organization and Project

    • 54. DevOps - How to Create a Git Repository in Azure DevOps

    • 55. DevOps - How to Link Data Factory to Azure DevOps Repository

    • 56. DevOps - How to version Azure Data Factory with Branches

    • 57. DevOps - Merging Data Factory Code to Collaboration Branch

    • 58. DevOps - How to Create a CICD pipeline for Data Factory in Azure DevOps

    • 59. DevOps - How to Execute a Release Pipeline in Azure DevOps for ADF

    • 60. Quiz - Module 7

    • 61. Wrap-up






About This Class


This course will introduce Azure Data Factory and how it can help in the batch processing of data. Students will learn with hands-on activities, quizzes, and a project, how Data Factory can be used to integrate many other technologies together to build a complete ETL solution, including a CI/CD pipeline in Azure DevOps. Some topics related to Data Factory required for the exam DP-203: Data Engineering on Microsoft Azure, are covered in this course.


Azure Data Factory is a cloud-based serverless ETL (extract, transform, and load) service. It offers a code-free intuitive user interface for authoring, orchestrating, and monitoring data-driven workflows. With over 80 out-of-the-box connectors, you can build complex pipelines that also natively integrate compute resources such as Mapping Data Flows, HDInsight Hadoop, Databricks, Azure Machine Learning, and Azure SQL Database.

Learn by Doing

Together, you and I are going to learn everything you need to know about using Microsoft Azure Data Factory. This course will prepare you with hands-on learning activities, videos, and quizzes to help you gain knowledge and practical experience as we go along.

At the end of this course, students will have the opportunity to submit a project that will help them understand how ADF works, what its components are, and how to integrate ADF and Databricks.

Student key takeaways:

  • The student should understand how ADF orchestrates the features of other technologies to transform or analyze data.
  • The student should be able to explain and use the components that make up ADF.
  • The student should be able to integrate two or more technologies using ADF.
  • The student should be able to confidently create moderately complex data-driven pipelines.
  • The student should be able to develop a CI/CD pipeline in Azure DevOps to deploy Data Factory pipelines.

Who this course is for:

  • Data Professionals
  • Data Architects
  • Business Intelligence Professionals
  • Data Engineers
  • ETL Developers
  • Software Developers

What You’ll Learn:

  • Introduction to Azure Data Factory. You will understand how it can be used to integrate many other technologies with an ever-growing list of connectors.
  • How to set up a Data Factory from scratch using the Azure Portal and PowerShell.
  • Activities and components that make up Data Factory, including Pipelines, Datasets, Triggers, Linked Services, and more.
  • How to transform, ingest, and integrate data code-free using Mapping Data Flows.
  • How to integrate Azure Data Factory and Databricks. We'll cover how to authenticate and run a few notebooks from within ADF.
  • Azure Data Factory deployment using Azure DevOps for continuous integration and continuous deployment (CI/CD).

Data Factory Essentials Training - Outline

  1. Introduction
  2. Modules introduction
    1. Getting Started
    2. Understand Azure Data Factory Components
    3. Ingesting and Transforming Data with Azure Data Factory
    4. Integrate Azure Data Factory with Databricks
    5. Continuous Integration and Continuous Delivery (CI/CD) for Azure Data Factory
  3. Getting started
    1. Sign up for your Azure free account
    2. Setting up a Budget
    3. How to set up Azure Data Factory
      1. Azure Portal
      2. PowerShell
  4. Azure Data Factory Components
    1. Linked Services
    2. Pipelines
    3. Datasets
    4. Data Factory Activities
    5. Parameters
      1. Pipeline Parameters
      2. Activity Parameters
      3. Global Parameters
    6. Triggers
    7. Integration Runtimes (IR)
      1. Azure IR
      2. Self-hosted IR
      3. Linked Self-Hosted IR
      4. Azure-SSIS IR
    8. Quiz
  5. Ingesting and Transforming Data
    1. Ingesting Data using Copy Activity into Data Lake Store Gen2
      1. How to Copy Parquet Files from AWS S3 to Azure SQL Database
        1. Creating ADF Linked Service for Azure SQL Database
        2. How to Grant Permissions on Azure SQL DB to Data Factory Managed Identity
        3. Ingesting Parquet File from S3 into Azure SQL Database
      2. Copy Parquet Files from AWS S3 into Data Lake and Azure SQL Database (intro)
        1. Copy Parquet Files from AWS S3 into Data Lake and Azure SQL Database
      3. Monitoring ADF Pipeline Execution
    2. Transforming data with Mapping Data Flow
      1. Mapping Data Flow Walk-through
      2. Identify transformations in Mapping Data Flow
        1. Multiple Inputs/Outputs
        2. Schema Modifier
        3. Formatters
        4. Row Modifier
        5. Destination
      3. Adding source to a Mapping Data Flow
        1. Defining Source Type; Dataset vs Inline
        2. Defining Source Options
        3. Spinning Up Data Flow Spark Cluster
        4. Defining Data Source Input Type
        5. Defining Data Schema
        6. Optimizing Loads with Partitions
        7. Data Preview from Source Transformation
      4. How to add a Sink to a Mapping Data Flow
      5. How to Execute a Mapping Data Flow
    3. Quiz
  6. Integrate Azure Data Factory with Databricks
    1. Project Walk-through
    2. How to Create Azure Databricks and Import Notebooks
    3. How to Transfer Data Using Databricks and Data Factory
    4. Validating Data Transfer in Databricks and Data Factory
    5. How to Use ADF to Orchestrate Data Transformation Using a Databricks Notebook
    6. Quiz
  7. Continuous Integration and Continuous Delivery (CI/CD) for Azure Data Factory
    1. How to Create an Azure DevOps Organization and Project
    2. How to Create a Git Repository in Azure DevOps
    3. How to Link Data Factory to Azure DevOps Repository
    4. How to version Azure Data Factory with Branches
      1. Data Factory Release Workflow
      2. Merging Data Factory Code to Collaboration Branch
    5. How to Create a CI/CD pipeline for Data Factory in Azure DevOps
      1. How to Create a CICD pipeline for Data Factory in Azure DevOps
      2. How to Execute a Release Pipeline in Azure DevOps for ADF
    6. Quiz


1. Introduction: Hello dear friends, and welcome to the Microsoft Azure Data Factory Essentials Training. My name is Everton Oliveira. I'm a Microsoft Certified Solutions Architect and a Microsoft Certified Trainer. This course is for beginners. You don't need any previous experience in Azure Data Factory. We'll start at the very beginning and work our way through it step by step. Also, if you're planning to take the Microsoft data engineer certification, this course is perfect for you, as we will be covering some of the topics in the exam syllabus. You will learn all the fundamentals of Data Factory, how it can help in batch processing, and how it connects with many other technologies through all the connectors available. We will also explore activities as well as the components that make up Azure Data Factory, for example pipelines, datasets, triggers, linked services, and more. In addition to the components, we'll cover how to transform, ingest, and integrate data code-free using mapping data flows. There will be a little project: we will integrate Azure Data Factory with Databricks by running a few notebooks from within ADF. How cool is that? And finally, you will learn how to deploy all the stuff we'll learn during this course using Azure DevOps continuous integration and continuous deployment, a.k.a. CI/CD. Well, I hope to see you soon in this course, and thanks for watching.

2. Getting Started: Hi everybody. Before we get our hands dirty, I'm going to give you a little walk through the modules we will be seeing throughout this course. The first module is Getting Started. Within this module, we sign up for an Azure free account, and we will see how we can make the most of that account. We're going to take a look at all the services that we can use for free and the credits that are available to us. We're going to set up a budget so we can make sure our free credits are always under control.
Also, we're going to see how we can set up an Azure Data Factory from the portal and from PowerShell. See you in the next lesson.

3. Understand Azure Data Factory Components: Welcome to the second module of the Azure Data Factory Essentials course. In this module, we're going to see the fundamental elements that compose the Data Factory ecosystem. We will go through linked services, pipelines, datasets, and the various activities available to us in ADF. Also, we will see things like how to reuse objects by creating parameters. We'll also deep dive into the available triggers to schedule pipeline executions and how we can use each one of them. This module is full of good content, and this is where we start our journey from zero to hero in ADF. I can't wait for us to get started.

4. Ingesting and Transforming Data with Azure Data Factory: Hi everyone. In this module, we will get our hands dirty and put into practice everything that we learned in the previous module. We will use key ADF components to work on very common use cases, for example how to ingest data using the Copy activity in ADF into an Azure Data Lake Gen2, and how to transfer data from Amazon S3 to Azure SQL Database. We will set up these transfers step by step. In the second part of this module, we will work with mapping data flows. Mapping data flows are a powerful tool that allows us to run Spark clusters without writing a single line of code. We will go through each and every transformation available in mapping data flows, for example joins, derived columns, search operations, and so on. Thanks for watching.

5. Integrate Azure Data Factory with Databricks: Hello dear friends. This module demonstrates how Data Factory can enable data engineers to integrate ETL pipelines with Azure Databricks and leverage Apache Spark to transform and process data at scale. We will do a role-play.
We are part of an analytics team that was tasked to report on the aggregate number of crimes in several cities in the US. We will see how we can provision an Azure Databricks workspace from the portal. Once the workspace is created, we will create an Apache Spark cluster, followed by a few notebooks to help us work with the data that we need. Then we will attach the notebooks to the cluster. This way, we can run all the notebooks directly from Data Factory. Be assured that it will be a challenging path, but you will enjoy it. I'll see you soon. Thanks for watching.

6. Continuous Integration and Continuous Delivery (CI/CD) for Azure Data Factory: Hi friends. Our final module is the icing on the cake. We'll see what's required to set up an Azure DevOps pipeline that will work as continuous integration and continuous deployment, also known as CI/CD, for Data Factory. We will create an Azure DevOps organization from the ground up, which includes creating a Git repository, versioning Data Factory with branches, and understanding how the release workflow works for Data Factory. We'll also see how we can create a CI/CD pipeline in Azure DevOps. How cool is that? This is where we learn how to package up all the cool stuff we learned during this course and deploy it to a higher environment, i.e. production. Well, see you in the next module, Getting Started.

7. Sign up for your Azure free account: Hi. We're going to go through a lot of demonstrations during our course, and I would recommend you follow along. This way, you can make sure you learn in practice what we have seen. I recommend you create a Microsoft Azure free account. To do that, you can search on Google, and we have the first link here. Let's click on that. Here, we can get an idea of how it works. We get $200 in credits to explore Azure services for 30 days. If for any reason you use all your credits before the 30 days are up, all your services will be shut off.
Or, if you don't use the $200 credit within the 30 days, you lose the rest. You cannot carry it over. However, you get a lot of popular services free for 12 months, which means that even if you use up all your $200, you can still use your account for 12 months with its free services. If we keep scrolling down here, we can get an idea of the other services that we can use for the first 12 months. Basically, we can get two virtual machines of a small size. We can also use other services, such as disks to attach to your virtual machines, Blob Storage, load balancers if we're studying load balancers, archive storage, and a lot more. You also get always-free services on top of the 12 months, for example Event Grid, DevTest Labs, and so on. If you want to check out all the free products, you can click on this button here and you should be able to see the full list. Now let's scroll up and click Start free to start creating our account. From here you have a few options. You need a Microsoft account to be able to create an Azure free account. If you don't have one, you can create one right now from here. Or, if you have a GitHub account and want to use it, you can use that as well, and there are other options here too. I already have an existing account, and we will see how to go through that process. Okay, now let's click on Next. Once you have typed your password, it redirects you to the page where you sign up. There are a few things that you have to fill in here. First, you have to agree with the details that they show you, such as the privacy statement, offer details, and so on. You also have a box to tick here if you want to get emails from Microsoft about offers. I won't do that at this moment. Here, you can see a summary of the benefits you get. As you can see, since I'm in Ireland, it has been adjusted to my currency.
So the credit is shown in euros rather than dollars. Then we can click on Next. Then you have to provide a phone number, and Microsoft is going to send you a text. You can either receive a text or choose to receive a call. It's really up to you, so I will select Text me. Once you provide your verification code, you will have to provide a credit card. Microsoft confirms that they won't charge you after the 30 days. They won't take any money from your credit card. This is just for validation purposes. Once the 30 days are over, they will ask you if you want to continue with a pay-as-you-go model. At this point, I already got the text and confirmed the code from the previous page. Now I can see the section where I can enter my credit card information. I won't go any further from here because I already have my account. Once you proceed with this option, you should be able to access the portal using your new account. To access the portal, go to the portal homepage. This is going to redirect you to the main page of the portal. Here you have a bunch of services. We're going to go through all the services that we will be using during this course. Stay tuned, and see you soon.

8. Setting up a Budget: Hi friends. In our previous lesson, we created an Azure free account to start using cloud services for free. In this lesson, we will take a look at how we can set up a budget on our subscription to keep our credits under control. With a budget in place, we can be notified when the cost of our subscription or a resource group goes beyond the thresholds that we define. Throughout this course, we're going to focus on Data Factory, which shouldn't consume our credits very soon. But to avoid any surprises, let's set up a budget so we can make sure everything is under control. If we go to the search box and search for Cost Management, we have a few options in here. On the left pane, let's click on Cost Management. And also here on the left pane we have Budgets.
Here is where we start defining our budgets. We can have more than one budget and define the scope for each. We have the subscription by default; you'll probably have your default subscription in here. Let's click on Add to add a new budget. Here you can define the scope of the budget you're going to create, and you can add filters; for instance, if we wanted to create a budget for a specific resource group, that would also be possible. It's important that we define a unique name for our budget. This name must be unique within our subscription, so you cannot have two budgets with the same name. Then we have the reset period. We have two types: calendar months and invoices. Here we have the creation date. The creation date is when we create our budget, and for the next month, our budget will start from the same day. Here you have a little description about that as well, and the expiration date for this budget. You can set it years into the future if you'd like. I'm just going to set up a budget. Here I will stick with monthly calendar months. I will start from the first of July, and the expiration date I will leave as it is. And here is the amount I want to set up. Imagine that we've got a $200 credit. That's the maximum amount of money that we can spend. Here we have a little forecast as well, based on your previous usage. So let's stick with 200 in that example; in my case it would be 175 euros. And we can go next. This is where we define our alert conditions. The conditions are basically the thresholds for our budget. We can have more than one, and every time our cost goes beyond a budget threshold, we get notified. All right, so let's select Actual cost. Then here we have a percentage of the budget. I'm going to get notified when we reach 50 percent, and the portal automatically calculates the amount for us, about 87 euros. And here you can see this little dot is the first threshold.
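The amount behind each alert is simply a percentage of the budget. As a minimal sketch of that arithmetic, assuming the 175 budget from this lesson and alert thresholds at 50 and 80 percent (the function name is made up for illustration and is not part of any Azure tool):

```python
def alert_amounts(budget, percentages):
    """Return the cost at which each percentage threshold would fire."""
    return {pct: budget * pct / 100 for pct in percentages}


# Assumed values: a budget of 175 and thresholds at 50% and 80%.
amounts = alert_amounts(175, [50, 80])
for pct, amount in amounts.items():
    print(f"{pct}% of the budget is reached at a cost of {amount}")
```

This gives 87.5 for the first threshold and 140.0 for the second, which lines up with the figures the portal shows in the lesson.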
Now I'm going to set up a second one: I want to get a notification when we pass 80 percent, which is 140 euros. So here we have the second line, and this one is the maximum. Now we have to enter an email address so we can get notifications. There are two ways to get notified. One is by setting up an action group; we're not going to look at this. We also have the alert recipients, which is enough. You can just put your email address here and you will get an email from Microsoft whenever one of the thresholds is reached. All right, as an example I'll just put a test address. I will stick with the defaults and hit Create. That's it. We have our budget created, and we will get notified whenever we reach the thresholds defined for this budget. Here we have a little progress bar. As you progress through your budget and start spending more money, you'll see this little bar start to move. Let's click on it. This is just a summary of our budget: our two thresholds, the expiration date and creation date, our scope, and the emails that will get the notifications. So that's it for now, guys. We now have peace of mind that we're not going to spend all our money in one day. If you have any questions, just let me know: send me an email or post in the comments and I will get back to you as soon as possible. See you in the next lesson. Thanks for watching.

9. How setup Azure Data Factory using Azure Portal: Hi folks. As we're going to spend most of our time in Data Factory, let's start by creating our instance of Data Factory. We're going to see how we can create a factory using two different methods. The first method is going to be from the portal, and the second method will be using PowerShell. I am going to provide all the code, and the document will be available from the lesson resources. So let's get started. Here we have the portal; that's the main page. And we can start by typing data factories.
And here we see the Data Factory homepage. Let's click on Create. Here we have a few options to fill in, and we're going to go through them step by step. As you can see, we have a few tabs up here. Microsoft uses a tabbed configuration: as you move through the steps to create services or open existing services, you will always have tabs. Let's start by selecting our subscription. I only have one available. And here we have a resource group. A resource group is a collection of resources that usually share the same life cycle. For instance, a virtual machine would have disks and network cards. Everything that is required for a virtual machine should usually be placed inside the same resource group. So let's create a new resource group for our Data Factory. We can call it anything we want; the name is not required to be globally unique. I will click on Create new and type my ADF, dash RG for resource group. Click OK. The region is where your resource is going to be physically deployed. It's usually advised to create a resource close to your physical location. In my case, I will type North Europe. Next, I need to give a globally unique name to my resource. This is because when you access a data factory, you are accessing a DNS domain name, and it must be globally unique. I'll just type something random here so we can move forward. Then we have the option to select version V1 or V2. V1 is just for compatibility purposes; we should always stick with V2. Let's move on and click on Git configuration. We're going to go through what Git is and how it integrates with Data Factory later in this course. At this moment, we will just tick Configure Git later. Then we have networking. You have two options here to connect the self-hosted integration runtime to your Data Factory: using a private or a public endpoint.
We're going to go through the self-hosted integration runtime later in the course. Let's stick with the public endpoint for now. We don't want to enable the managed virtual network for the AutoResolve integration runtime; this is mostly required if you want to access your resources privately. There are some configurations around that which we don't want to use at this point, so I will not tick this option. Let's go to Advanced. Here we have the option to encrypt Data Factory resources using a customer-managed key. That means you can bring your own key, place it inside a Key Vault, and then use that key to encrypt your resources. However, Data Factory already encrypts your data at rest by default. For example, when it caches data for data movement, or when you're creating your linked services, they are encrypted by default. If you select that option, you have to provide the address of your key, and the data will be double encrypted: on top of the Azure-managed key, you will also have your customer-managed key. Since we don't have a customer-managed key at this point, let's move on without checking this box. Here we have the option to create tags if you want, for example to assign a cost center to your resources or something like that. We don't have any tags at the moment, so we'll leave this empty. And finally we have the place where we review our configuration. As you can see, it's pretty straightforward to create a Data Factory; making an instance of Data Factory available is quite simple. We don't have a lot of configuration to do here. So let's click on Create and create our first Data Factory. The resource has been created successfully. Let's go to the resource by clicking on this button. And here we have the homepage of the Data Factory instance we just created. Here you can only see some metrics; we don't have a whole lot here. Data Factory is really accessed from Author & Monitor.
This is the place where you're going to be authoring and monitoring all your pipelines and data flows. So let's click on Author & Monitor. Cool. That's the homepage of Data Factory. We are going to go into all the options we have here, and we're going to spend a lot of time here as well. Stay tuned, and let's see how we can create the same data factory from PowerShell. See you soon.

10. How setup Azure Data Factory using PowerShell: Hi folks, and welcome to another lesson. In this lesson we're going to see how we can create a Data Factory using PowerShell. In our last lesson, we saw how we could create a Data Factory from the portal. So let's go and search for data factories again. This is the Data Factory we just created. Now let's see how we can create one using PowerShell. Let's go back to the homepage. Here on the top bar we have Cloud Shell. Let's click on that. Cloud Shell is a great tool if you want to manage your Azure resources programmatically. You have the option to choose Bash, from Linux, or PowerShell. What Azure does is provide you with a container behind the scenes with attached storage. This storage is ephemeral, meaning once you finish with your creations or the things that you're working on, all your data will be erased. Here you have some options for your session. For example, you can change the font and text size. You can upload and download files. You can also click here, and it will redirect you to a page where you can access PowerShell or Cloud Shell in full screen. Also, here you have the editor, which is pretty cool: you can actually edit files directly from here and save them. So it looks like a computer of your own with access to your Azure environment. Let's work from here and create our data factory. I'm going to pull up VS Code with the commands that we need to run.
I have set up three variables here: the Data Factory name, the resource group name, and the location where we're going to deploy our Data Factory. As you can see, this is quite similar to the options that we had in the portal. In this step, we're going to create a new resource group using the values assigned to the variables. Then finally, we are going to create our Data Factory V2 inside the resource group that we created in the first step. Okay? Again, we need to provide a globally unique name for the Data Factory, so I've just typed a random name here. Then my resource group name will be my ADF resource group, and the location will be North Europe. I'm going to provide this file for you to download so you can try it on your own. Let's copy it here and move back to the portal. Now we paste this here and hit Enter. So we have all the values assigned. If you want to check, you can type one of the variables here, such as the Data Factory name, and you can see it returns the value that we assigned. As the second step, we need to create our resource group. Let's copy that. If you copy the comment as well, that's fine. Let's return to the PowerShell session and paste it here. PowerShell automatically adds a new line here because of the backtick. Let's hit Enter, and the resource group has been created. As you can see, it's pretty fast. Finally, we need to create our Data Factory. Let's copy that again, paste it in here, and hit Enter. It's processing the creation of our data factory. The factory has been created. Let's move back to the portal homepage and see the resource group and also the Data Factory. Since we still have this other session here, we can minimize it. Then let's search for data factories. Since I already have it in my recent services, I will click on that.
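The script itself is in the lesson resources and is not reproduced in this transcript. As a rough sketch of the three steps just described (set three variables, create the resource group, create the factory), the Python below only renders what the two commands would plausibly look like. It assumes the Az PowerShell cmdlets `New-AzResourceGroup` and `Set-AzDataFactoryV2`; the names are placeholders, not the instructor's values, and `render_commands` is a helper invented for illustration.

```python
# Placeholder values (hypothetical); the instructor's real values are in the
# downloadable script and are not shown in the transcript.
data_factory_name = "adf-essentials-demo-001"  # must be globally unique
resource_group_name = "my-adf-rg"
location = "North Europe"


def render_commands(factory, resource_group, region):
    """Return the two Az PowerShell commands as strings, in execution order."""
    create_group = (
        f"New-AzResourceGroup -Name '{resource_group}' -Location '{region}'"
    )
    create_factory = (
        f"Set-AzDataFactoryV2 -ResourceGroupName '{resource_group}' "
        f"-Name '{factory}' -Location '{region}'"
    )
    return [create_group, create_factory]


commands = render_commands(data_factory_name, resource_group_name, location)
for command in commands:
    print(command)
```

The printed lines can be pasted into a Cloud Shell PowerShell session one at a time, resource group first, which mirrors the order used in the lesson.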
And if you pay attention to this list, sometimes I don't see the resource that I just created. That's because sometimes it's not immediate. It takes a few seconds to refresh when you create the resource from the API or PowerShell. Let's refresh this. And let's confirm the ADF name that we just created. Let's copy it and then paste it in here. So the resource is visible from here. And if you remember, this is the resource group that we just created. If we click on that, we can also see the resource from the resource group view and confirm that it has been created successfully. Cool. Now we have what's required for us to move on with our course. Stay tuned, and see you in the next lesson.

11. ADF Components - Linked Services: Welcome to another lesson. In this lesson, we are going to see how we can create a linked service in Azure Data Factory. A linked service is a key element for orchestrating data-driven workflows in ADF. You can compare linked services to the connection strings used in application software, where you must provide certain information to connect to a data source: for example, the address of the data source, the credentials, and, most importantly, the driver for that connection. A SQL Server database requires a specific driver. A NoSQL database would require another driver. An API requires another set of parameters to retrieve or send the data. Linked services in ADF are pretty much the same. ADF offers a variety of connectors that, out of the box, have all the required parameters that allow you to connect to the various services. As a requirement for this lesson, we will create a Data Lake Storage Gen2 account, and we are also going to link the data factory and the data lake. Stay tuned. I'm now in the portal. Let's open up ADF to start working with linked services. We're going to type data factories in the search box. And let's use the ADF we created earlier. Click on Author & Monitor. And let's wait until it loads the page.
Now we're on the homepage. Let's go to Manage on the left side. You can see that Linked services is the very first option in the left pane. You have the option to create one by clicking on New or on Create linked service. Let's click on that. As we can see, it pops up a large number of different kinds of linked services for you to choose from. At the moment of this recording, there are over 80 different kinds of linked services. And as you can see, you have native Azure services as well as third-party services such as SAP HANA, Salesforce, and even Shopify. For our experiment, we are going to be using Data Lake Gen2, since this is a native Azure service. As you can see, there are a few options that you need to fill in, such as the name, the description, the integration runtime, which kind of authentication you want to use, and where your service is located. Since I don't have any storage accounts created yet, let's create one and see how it works. Let's come back to the portal. In the search box, I'm going to search for storage accounts. It should be the first option. Let's click on New. I'm not going to go into the various details that make up the storage account, because our goal here is just to create a storage account and connect ADF to it. I will pick my current subscription. I will choose the resource group we created before. I will give it a name. I'm going to stick with the North Europe region. All the other options will be default for me, and I will skip all these options, networking and data protection, and go to Advanced. Under Data Lake Gen2, I will say Enable. This is what confirms that I want a data lake. Then I will skip tags and go to Review and Create. Cool. Now we have our storage account created. So let's go to the resource. There is only one thing that we need to do in here, which is to authorize ADF to access the data that's going to be stored on this data lake. Otherwise, we cannot create the linked service.
So let's select Access control, then Add role assignment. I will choose the role called Storage Blob Data Contributor. An interesting thing here is that when you create a Data Factory, behind the scenes an identity is created in Azure Active Directory with the same name as the data factory. This way, you can authorize the Data Factory itself to access the resource instead of creating a new account. So let's search for the name of our data factory. And there you go. We have it here. Let's save, and let's wait until the role assignment is created. The role has been assigned. You can confirm this under Role assignments. You can see the icon of the data factory here, and you have the object ID as well, which is this big string that's composed of the subscription ID, the resource group name, and the resource type. Great, let's come back to ADF. Here we still have the new linked service page for Data Lake Gen2 open. I'm giving it a name. I will not give any description. I will stick with the default options, and the authentication method is going to be managed identity. Then, under my subscription, I will choose my current one, and I will search for the new storage account that I just created. But you can see that it's not on the list yet. This is because sometimes it takes a little while to propagate the new resource, so we can refresh. And there we go. It's in here. As you can see, the managed identity has the name of the data factory. This is what we granted permission to on the storage account, and this is the object ID. We can test the connection. And there we go, successful. Let's create the linked service. We just need to publish. Okay, we're all set. Let's see how we can use this from a dataset.

12. ADF Components - Pipelines: Hi. Before we dive into datasets and activities, you've got to understand what a pipeline is. A pipeline is nothing but a logical grouping of activities that together perform a task. For example, you could have a set of activities that ingest and clean log data.
And then you would have a mapping data flow to analyze the log data. The pipeline allows you to manage those activities as a set instead of each one individually. So let's say you deploy a pipeline: it works as a container where you can group a lot of activities to achieve a goal. Once you deploy the pipeline, you can schedule it, instead of scheduling activities independently. Inside a Data Factory, you can have one or more pipelines. You could even nest pipelines, so that an activity within one pipeline calls a second pipeline, a third pipeline, and so on. Let's see how it works in practice. There are two ways to create a pipeline. One is from the homepage of Data Factory. You could click on Create pipeline, and it would redirect you directly to this page. Or you could come here to Author and click on the three dots here, then New pipeline. As you can see, this is basically an empty canvas for you to start building your complex pipelines. Once you have a pipeline, you can give it a name. It's important that you give it a proper name, so that you can find it easily afterwards. On the right side, you have the panel where you can give it a name and a description. You also have the option to set concurrency. As you can see, this is the number of simultaneous pipeline runs that are fired at once. Bear in mind that Data Factory has a soft limit. So you can control the number of pipelines that you are running at any given time. An annotation is basically a tag that indicates what kind of pipeline you are dealing with. This is something that you can set, and it's quite useful when you are looking at things like Monitor. Once you decide what kind of activities you want to use, you can start dragging and dropping them onto your empty canvas. There is not much more to the pipeline itself, and we'll dive more into it as we go along during this course. This is all for now. Let's continue with datasets.

13. ADF Components - Datasets: Now that we have our linked service created, we need to create a dataset. A dataset is a named view of the data that simply references the data you want to use in your activities. It also identifies data within the data stores, such as tables, files, folders, and documents. For example, a blob dataset specifies the blob container and folder in Blob storage from which the activity should read the data. Now let's see how we create that. Let's take a look at the Author option. As you can see, we don't have any dataset created yet, so let's create one. Take a look at the three dots here and click Actions, New dataset. For our dataset, since we used Data Lake Storage Gen2, let's pick this option. Now that we have our pointer to a linked service, we need to specify what kind of files we're going to be using. So let me select the file format. Hit Continue. We must give it a name. Then our linked service should be available to us. Let's pick this option. Here I can define the source, or the location where my data is stored. Since we don't have any folder or any structure created yet, I will just stick with None. Let's click OK. From this panel, as you can see, we have our object created. Here, the linked service is pre-populated for us, and I can test the connection again. From this page, we could edit the linked service to change the connection or the description, something like that. Or I could create a new linked service directly from here as well. So there are many entry points where we can create an object. Here is the type of compression that I want to work with. It could be to write my file or to read it with this type of compression. Let's pick Snappy. This is pretty cool, because I can browse my data lake. If I had folders in here, I would be able to expand all of them. I could even preview my data if I had selected a file. Cool. For now, that's all we need to do.
So let's hit Publish and then Publish again. Great, we're all set for now. Let's see how we work with activities.

14. ADF Components - Activities: Cool. Now we have our two basic requirements, the linked service and the dataset. We need to define a way to move the data, or to transform the data, from point A to point B using our linked service and dataset. The main components that make it work are the data movement activities, the data transformation activities, and the control activities. Now, for us to work with activities, we must create a pipeline first. So let's go to Author, Actions, New pipeline. Of the three types of activities that we've classified, let's start with the data movement activity. A data movement activity is generally associated with the Copy data, or Copy activity, task. When working with Copy data, you can see you have source and sink, which translate to your data source and your destination. As you can see here, it requires a dataset. This is how we start plugging together our linked services and datasets. Here as well, we need a dataset. There is a wide variety of datasets, and the list is always evolving. For data transformation activities, we have different options. We could transform the data from within Data Factory with its own resources, which we call data flows or mapping data flows. With mapping data flows, you have a large number of transformations. You could also call compute resources to do the transformations for you or to enrich the data, such as Databricks, Batch Service, Azure Synapse, and virtual machines. And the list keeps growing. So you have several options for working with transformations. Finally, with control activities, you can orchestrate the flow of your activities to make them execute in the way or the order that you desire. You could chain activities in a sequence, do branching, and many more things. Let's take a look at how it works. Let's suppose we want to chain our activities so that they execute in a certain sequence.
So you could plug them together and even define when each one should be executed. If you right-click on the arrow, you can see you have Success, Failure, Completion, and more. But it doesn't stop there. You can find more options in here. For instance, if we go to Iteration & conditionals, we have ForEach, we have If Condition, we have Switch and Until. Those are control activities, and you can use them to modify how your pipeline executes. The ForEach activity defines a repeating control flow in your pipeline. This activity is used to iterate over a collection and execute the specified activities in a loop. The If Condition can be used to branch based on a condition that evaluates to true or false. Another quite useful activity is the one called Execute Pipeline. So let's search for it. Execute Pipeline. This activity allows you to invoke another pipeline. But why is this useful? When we talk about pipelines, there is a maximum limit on the number of activities in a single pipeline. This number is 40. If you're working on a big data warehouse where you can have a lot of tables, you might need to create a chain of pipelines. Meaning you can come here and call the execution of a second pipeline, and here you could call another pipeline. Also, for organization purposes, you could call the dimension tables in here, the fact tables in here, and so on and so forth.

15. ADF Components - Pipeline Parameters: Hi there. In the ecosystem of Data Factory, we can create various objects: for example, datasets, pipelines, data flows, and the list goes on. But how do you create a dynamic solution, in a way that avoids a large number of objects that exist just because they have different parameter values? With parameters, you can create a very robust and dynamic solution. Creating hardcoded datasets or pipelines is not a bad thing in itself. It's just that when you start creating many of these objects, things get a little bit time-consuming.
And not to mention, there is always the risk that comes with manual intervention: when you start creating a lot of the same objects, you just tune out and make a lot of silly mistakes. We are going to split this demonstration into three parts. In the first part, we are going to create a pipeline and a dataset, and we're going to pass the value from the parameter that we have on the pipeline down to the dataset. The second part is to pass parameter values between activities, so this would be more internal to the pipeline. Then, in the third part, we're going to use a global parameter, so that we have a higher scope, and then pass the value of that parameter to one pipeline, to many pipelines, or to a dataset. There are some elements of the demonstrations that we haven't seen before, for example, how we create a data lake. But don't worry, I'm just jumping ahead a little bit to save some time. We'll see throughout this course how we create all those objects. Here I have an empty canvas, which is my pipeline. So I'm going to rename the pipeline to PL. For our demonstration, we are going to get the data from a REST API and store that data in a data lake. So, first of all, we need to get our activity, and the one we can use is Copy data, since we are going to be transferring data. There are two main things here, and we're going to go straight to the point: we have source and sink. Our source is going to be the REST API. So we need to use a dataset, and since we don't have one yet, let's create one. I'm going to search for REST. It should be this one here. Let's click OK. Okay, we have our dataset created. Now we need a linked service. Our linked service is our connection string, and the dataset is the type of connection that we're going to be using. So let's open our dataset. As you can see, it's empty. If we wanted, we could pass the relative URL from here. Let's create a new linked service.
As a naming convention, I'm going to use ls for linked service, followed by API. As for all these options, we're going to go through them later in this course. We will just stick with auto resolve, and for the authentication type we are going to use Anonymous. Now we just have to pass the URL. So I'm going to get the URL that I already have and paste it in here, and we can test the connection. Okay, that's successful, which is great. Let's click Create. Now we have our linked service created, and our dataset called rest is using that linked service. So let's change the name of this dataset to stick with our naming convention. I'll put ds for dataset, then underscore, REST API. Cool. Now we get to the point where we start working with parameters. So let's move back here, and I'm going to add a new parameter. Here I'm telling the dataset that it's going to be receiving a parameter, and I can assign a default value to it. So we have three fields here. I will just name my parameter relative URL. And my default value is going to be a forward slash. And that's it. I'm saying: hey, dataset, you have a parameter and a default value. If nothing is passed, this value here is going to be assumed. Let's save it. Now we have created our dataset and linked service. Now we've got to go back to our pipeline. As you can see, as soon as I add a new parameter to my dataset, it gets populated in here. It's expecting a value from my pipeline. So what we can do is click anywhere outside of the activities, so we change the scope. We can get this information here at the bottom, and we have a few options. One of them is Parameters. Let's create a new parameter and also call it relative URL. Same type, string. And I'm going to pass the same value, which is speakers. Okay? I'm going to save this again. What's happening here is the following.
We have a parameter which is created outside of the activity scope here, on the pipeline. And my activity is using a dataset which is already parameterized. This dataset is expecting a parameter. If we click here, you can see that Add dynamic content appears immediately. Let's click on that. From the drop-down here, we have a list of available parameters. If I had more parameters in my pipeline, they would be listed here as well. So let's click on that. One interesting thing to notice here is that the name comes fully qualified. It shows the scope where my parameter was created, and that's quite important. Okay, so let's click on Finish. Cool. So now we are using the parameter that we created in our pipeline and passing that value to our dataset. Now let's go ahead and click on Sink. We're going to create a new dataset to land the data that we are getting from our REST API. Okay, so let's go ahead and create a new dataset. We're going to be using Data Lake Gen2. Let's click on that and continue. I'm going to use the JSON type. Let's click Continue. I will just name it DS data lake. I already have a linked service created for a data lake, and I will just select that one, but we could create a new one from here. Then, from here, I set the path where I'm going to store the file. So I will just browse in here and see which containers and which root folders I have available. I will click here on my CSV container, and this data will land at the root level of the container. So let's click OK. Great. We have everything that we need right now. So let's go ahead and test this pipeline and see how it behaves. I'm just going to click on Debug. You can see that, if I wanted to, I could pass that information from here as well, and then it would be passed to my dataset. I will just stick with the default value and click OK.
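The hand-off described above, from pipeline parameter to dataset parameter to the relative URL, can be sketched as follows. The dictionaries mirror the general shape of ADF's JSON definitions, but the object names (`pl_rest_to_datalake`, `ds_rest_api`, `ls_rest_api`, `ds_datalake`) and the value `/example-path` are assumptions for illustration. The resolver function is only a stand-in that imitates the fallback to default values; it is not how Data Factory evaluates expressions internally.

```python
import re
from typing import Any, Dict, Optional

# Dataset: declares a parameter with a default and uses it for the relative URL.
dataset: Dict[str, Any] = {
    "name": "ds_rest_api",  # hypothetical name
    "properties": {
        "type": "RestResource",
        "linkedServiceName": {"referenceName": "ls_rest_api",
                              "type": "LinkedServiceReference"},
        "parameters": {"relativeURL": {"type": "string", "defaultValue": "/"}},
        "typeProperties": {
            "relativeUrl": {"value": "@dataset().relativeURL", "type": "Expression"}
        },
    },
}

# Pipeline: declares its own parameter and hands it to the dataset reference.
pipeline: Dict[str, Any] = {
    "name": "pl_rest_to_datalake",  # hypothetical name
    "properties": {
        "parameters": {"relativeURL": {"type": "string",
                                       "defaultValue": "/example-path"}},
        "activities": [{
            "name": "Copy data1",
            "type": "Copy",
            "inputs": [{
                "referenceName": "ds_rest_api",
                "type": "DatasetReference",
                "parameters": {"relativeURL": {
                    "value": "@pipeline().parameters.relativeURL",
                    "type": "Expression"}},
            }],
            "outputs": [{"referenceName": "ds_datalake",
                         "type": "DatasetReference"}],
        }],
    },
}


def resolve_relative_url(pipeline: Dict[str, Any], dataset: Dict[str, Any],
                         run_parameters: Optional[Dict[str, str]] = None) -> str:
    """Illustrative stand-in for expression evaluation (not the real engine)."""
    supplied = run_parameters or {}
    # 1. Pipeline scope: a value supplied at run time wins over the default.
    pipeline_values = {
        name: supplied.get(name, spec.get("defaultValue"))
        for name, spec in pipeline["properties"]["parameters"].items()
    }
    # 2. The dataset reference passes pipeline values down to the dataset.
    reference = pipeline["properties"]["activities"][0]["inputs"][0]
    passed = {}
    for name, expression in reference.get("parameters", {}).items():
        match = re.fullmatch(r"@pipeline\(\)\.parameters\.(\w+)", expression["value"])
        passed[name] = pipeline_values[match.group(1)] if match else expression["value"]
    # 3. Dataset scope: anything not passed falls back to the dataset default.
    dataset_values = {
        name: passed.get(name, spec.get("defaultValue"))
        for name, spec in dataset["properties"]["parameters"].items()
    }
    expression = dataset["properties"]["typeProperties"]["relativeUrl"]["value"]
    match = re.fullmatch(r"@dataset\(\)\.(\w+)", expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression}")
    return dataset_values[match.group(1)]


default_run = resolve_relative_url(pipeline, dataset)
override_run = resolve_relative_url(pipeline, dataset, {"relativeURL": "/other"})
```

Running with no arguments corresponds to accepting the default in the Debug dialog; passing `run_parameters` corresponds to typing a different value there.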
You can see that it's now queued, and it has succeeded. So our pipeline used the parameter and passed it down to the dataset. Now we have to create the second part of our demonstration.

16. ADF Components - Activity Parameters: Right, going to the second part of our demonstration. Now we're going to use parameters between activities. In this demonstration, we are going to get the metadata of the file stored on the data lake, and using that information, we are going to populate a table on a SQL Server. To do that, there are a few steps that we have to do here. I have a dummy table that will hold a few fields, for example, the item name and the last modified date of the file. And we have a stored procedure here. This procedure basically receives as parameters the same fields that we have available on our table. Then, with those parameters, it will populate the table; we have the insert here. And just down here we have a select statement to validate the data in our table. The table is not created yet. We can verify by doing this: I will make sure that the table doesn't exist, and I will create the table. So let's do that. Cool. Our table is created. We can check the number of rows, and the table is empty. Now let's make sure we also have our procedure created, because this is what's going to populate our table. I'm going to select everything here and create the procedure. Cool, our procedure is created. Now let's get an activity called Get Metadata. This activity is specially designed to get the metadata of files in different locations. In our case, we're going to get the metadata of the file that we uploaded in our previous demonstration. We can use the same dataset that we had before, DS data lake. There are a few arguments that we can get from the file, and we can specify them from the available list. If we click on New, we have the arguments, and from here we can add more. We're going to get item name. And I'm going to add a new one, item type.
And last modified. So that's three items in total. And just to check against our table: those are item name, item type, and last modified. Cool. So we have all the arguments we need, and they are going to be retrieved from our file located on our data lake. Let's go ahead and get another activity called Stored procedure. This activity allows us to call stored procedures on databases. So let's click here on the connector and drag and drop it onto the next activity, to make sure they are executed in a specific order. Then here, on our stored procedure, we can select our database. So let's go ahead and pick a linked service. Since I already have my linked service created, I've just selected it from here. Then I can pull up the available stored procedures in my database. Since I only have one, we can pick this one here. It's cool because here I can define manually the parameters that I want to pass. Or, if I want, I can import the parameters from the database automatically. So let's hit Import. Great. As you can see, the parameters that we have on our stored procedure are now available from here. Cool. Now that we have those two chained together, we're going to have to change our stored procedure activity to receive the parameters that we want. Those are the available parameters, and here are the types. If we wanted, we could pass a default value. But what we want is the result set coming from Get Metadata. If we click here, Add dynamic content appears automatically. Let's click on that. And we have the Get Metadata output. That's interesting, because it comes as an object. So for us to get the values from the result set of the metadata activity, we have to access them as properties of that object. So let's click on that. And I'm going to click on Finish here. We can come back here really quickly to see what information we have. We have item name, item type, and last modified. So here we can click on that. This will be last modified, typed all together as one word.
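To make the "access them as an object" step concrete, here is a small sketch of the expressions involved. Get Metadata exposes the selected fields as properties of its output, named `itemName`, `itemType`, and `lastModified`. The activity name `Get Metadata1`, the stored procedure parameter names, and the sample values are assumptions, since the transcript does not show them, and the `evaluate` function only imitates what Data Factory does when it resolves the expression.

```python
import re
from typing import Any, Dict

# Hypothetical sample of what the Get Metadata activity returns for the
# three fields selected in the lesson.
get_metadata_output: Dict[str, Any] = {
    "itemName": "data.json",
    "itemType": "File",
    "lastModified": "2021-01-01T00:00:00Z",
}

# Stored procedure parameter -> dynamic content expression.
# Parameter names are placeholders; the output property names are written
# as one camel-cased word.
stored_procedure_parameters: Dict[str, str] = {
    "ItemName": "@activity('Get Metadata1').output.itemName",
    "ItemType": "@activity('Get Metadata1').output.itemType",
    "LastModified": "@activity('Get Metadata1').output.lastModified",
}

_ACTIVITY_OUTPUT = re.compile(r"@activity\('([^']+)'\)\.output\.(\w+)")


def evaluate(expression: str, outputs: Dict[str, Dict[str, Any]]) -> Any:
    """Look up one property of a previous activity's output (illustration only)."""
    match = _ACTIVITY_OUTPUT.fullmatch(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression}")
    activity_name, field = match.groups()
    return outputs[activity_name][field]


# Values the stored procedure would receive for this sample output.
resolved = {
    name: evaluate(expression, {"Get Metadata1": get_metadata_output})
    for name, expression in stored_procedure_parameters.items()
}
```

Because the lookup is by exact property name, a misspelled or differently cased name such as `ItemName` on the output side would fail in this sketch, which is why the names are typed carefully in the lesson.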
So let's click OK. This one here will be item name. So we can click on that, then type item name, again all together. Then, finally, we have item type. Let's click on that, select our output, then type item type. Great. Now we are passing the values associated with this activity to the stored procedure execution. I'm going to save that. And let's run this. Let's hit Debug and see how it looks. It's queued. Succeeded. The first one succeeded. Let's check the second. Great. The values of our Get Metadata activity were passed successfully to the stored procedure. If we click here on the canvas, we can check the execution. So click here. We can see the arguments that we chose for our Get Metadata activity: item name, item type, and last modified. Those were the output of the activity. For the input, we can see we had a stored procedure, and those were the values that we passed to the stored procedure. Now, if we go back to SQL Server and check the table again, we now have a new record. That's great: we could see how to pass the values of one activity to another activity that is using parameters. We can reuse our code.

17. 4.5.3 ADF Components - Global Parameters: Now, moving to the third part of this demonstration, we're going to see how we can work with global parameters. I left the global parameters for last because I wanted to demonstrate how we can use them for either pipelines or activities. On the left side, we have Manage. And down here, under Author, we have Global parameters. Let's click on New. Let's keep the name relative URL, and the value will be the same as we provided before, which is speakers. Let's hit Save. Now, moving back to the pipelines, I have opened here both pipelines that we worked on before: the one where we passed the parameter within the pipeline, and this one, where we passed values between activities.
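Before swapping the parameter over, it may help to see the difference in scope. A pipeline parameter is referenced as `@pipeline().parameters.<name>` and exists only in the pipeline that declares it, while a global parameter is referenced as `@pipeline().globalParameters.<name>` and is visible from every pipeline in the factory. The sketch below models that with plain dictionaries; the pipeline names and the value are placeholders, and the `resolve` function is an illustration rather than Data Factory's own logic.

```python
import re
from typing import Any, Dict

# A toy model of one data factory (all names and values are assumptions).
factory: Dict[str, Any] = {
    "globalParameters": {"relativeURL": "/example-path"},
    "pipelines": {
        # First demo: declares its own pipeline-scoped parameter.
        "pl_rest_to_datalake": {"parameters": {"relativeURL": "/example-path"}},
        # Second demo: has no pipeline parameters at all.
        "pl_metadata_to_sql": {"parameters": {}},
    },
}

_PARAMETER = re.compile(r"@pipeline\(\)\.(parameters|globalParameters)\.(\w+)")


def resolve(expression: str, pipeline_name: str) -> Any:
    """Resolve a parameter expression as seen from one pipeline (illustration)."""
    match = _PARAMETER.fullmatch(expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression}")
    scope, name = match.groups()
    scopes = {
        "parameters": factory["pipelines"][pipeline_name]["parameters"],
        "globalParameters": factory["globalParameters"],
    }
    if name not in scopes[scope]:
        raise KeyError(f"{name} is not defined in {scope} for {pipeline_name}")
    return scopes[scope][name]


# The global parameter resolves from either pipeline.
from_first = resolve("@pipeline().globalParameters.relativeURL", "pl_rest_to_datalake")
from_second = resolve("@pipeline().globalParameters.relativeURL", "pl_metadata_to_sql")
```

Asking the second pipeline for `@pipeline().parameters.relativeURL` raises a `KeyError` in this model, which corresponds to the pipeline parameter not being offered in that pipeline's dynamic content list.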
So if we check the activity and then check Source, we can see that we have the parameter set. Let's click on that. Now, underneath the parameters that we had before, we have global parameters. Since we gave it the same name, that's the name that we see here. So we can delete this highlighted option here, which is the pipeline parameter. Then we can pick relative URL. We can see that now we have a fully qualified name pointing to the global parameters. Let's hit Finish. And here, if we move to the other pipeline, where we passed values between activities, and click here, we can see that the global parameter is also available, and we don't have the pipeline parameter available here. That's the beauty of global parameters. A global parameter is highly reusable, either between activities or among several pipelines, instead of being scoped to only one pipeline. Okay, let's cancel this one here. Let's go back to this one. Since we are now going to be using a global parameter, let's save and run this and see how it works. Let's hit Debug. Then again, I can input the parameter if I want. And it will kick off the pipeline. Now it's in progress. Great. It succeeded. And with that, we conclude one more demonstration. I hope you guys enjoyed it. Let's see in the next lesson how we can work with triggers.

18. ADF Components - Triggers: In this lesson, we're going to talk about Azure triggers. Triggers are a fundamental element in the ecosystem of Data Factory. Using triggers is how we can schedule our pipelines to meet our requirements. At the moment of this recording, there are four different types of triggers, and we're going to go through them now. There are mainly two ways of accessing existing or new triggers. When you want to associate a pipeline with a trigger, we have this button here, Add trigger, and we can click on New/Edit. Also, on the left side we have Manage, and under Author we have Triggers.
And here we can see all the triggers available within our Data Factory. So let's come back to the pipeline we created in our previous demonstration. Let's click on Add trigger, New/Edit. Here you can see that I could pick an existing trigger or create a new one. Since I don't have any, let's click on New. Okay. This is the main window where we're going to set up our triggers. There are mainly four different types of triggers. We have schedule, we have tumbling window, we have storage events, and we have custom events. When it comes to schedule, a schedule trigger is a trigger that invokes a pipeline on a wall-clock schedule. The relationship between a schedule and pipelines can be one-to-many. For example, I can have many pipelines using the same schedule. Could all my pipelines run at once? Yes, they just need to share the same schedule. Okay, now we have tumbling window. A tumbling window is a trigger that operates on a periodic interval while also retaining state. It could be described as a trigger that can be used for more complex scenarios. For example, you can create dependencies between triggers, between one tumbling window and another. You can rerun from the moment it failed. You can set up retries for pipelines. And the relationship between your tumbling window trigger and pipelines is always one-to-one. You can't use the same tumbling window for many pipelines, as we can with schedule. Another very good feature of the tumbling window is that you can run it for past and future data, whereas with schedule you can't. It's perfect when you want to run backfills for your data warehouse or your databases. And finally, another element of the tumbling window is the fact that it has a self-dependency property, which means that the trigger shouldn't proceed to the next window until the preceding window has successfully finished. So you've got to keep that in mind. Schedule is always in the future.
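The idea of a periodic interval that can also cover the past can be sketched by generating the windows themselves. This is only an illustration of how fixed-size, contiguous, non-overlapping windows tile a time range; the start time, end time, and 15-minute interval are example values, and the function is not part of any Azure SDK.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def tumbling_windows(start: datetime, end: datetime,
                     interval: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield contiguous, non-overlapping (window_start, window_end) pairs.

    Only complete windows that fit between start and end are produced.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    window_start = start
    while window_start + interval <= end:
        window_end = window_start + interval
        yield window_start, window_end
        window_start = window_end  # the next window begins where this one ended


# A start time in the past gives one window per interval to backfill.
windows = list(tumbling_windows(
    start=datetime(2021, 1, 1, 0, 0),
    end=datetime(2021, 1, 1, 1, 0),
    interval=timedelta(minutes=15),
))

for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each pair would correspond to one pipeline run that processes exactly that slice of data, which is what makes this trigger type suitable for backfills.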
And with a tumbling window, you can have a defined period of time. Storage events is a very cool type of trigger. There are many use cases where you want to react to events that happen on your storage account. For example, imagine that you are waiting for a file to land on your storage account so that you can load that file. In that case, you don't need to periodically check whether the file has been successfully loaded into the storage account. Using storage events is perfect, because whenever a new event is created on your storage account, you can use that trigger to fire a pipeline. Storage events use Event Grid behind the scenes. So whenever you create that trigger, a new topic is also generated for you behind the scenes. And if you go to Event Grid, you can see the event there. Now, recently added, we have custom events. So it's not only for storage accounts anymore; it can be any event. That's even greater, because you can define your own topics and create your own events. Imagine you have an application that has a very particular kind of event, and you want to react to those events. You could create a custom topic for event creation. Then, using custom events in Data Factory, you can subscribe to those events and react whenever a new event occurs. That's brilliant. Now we have much more flexibility when reacting to events. So here, as you can see, the schedule is pretty simple for a pipeline: you just define the date and time. For tumbling window, we have more options. You can set the recurrence of the events, like every 15 minutes, and you can also specify an end date. Let's say you just want to run for a period, or for a window, of 30 minutes. If you click on Advanced, you have a lot more options in here. You have max concurrency, which is great for Data Factory, so you can define the maximum number of simultaneous trigger runs. You can define a number of retries for the retry policy, and this is great as well.
So you can think of this as some sort of validation, where you can define the interval for your retries, which is great. Annotations are similar to what we have for pipelines. If you want to add some sort of description to your trigger, you can add a new annotation here. And this is to say whether it's activated or deactivated. Now let's go to storage events. For storage events, you've got to define a storage account. Then here is where you define the path that you want to listen to. And here are the events that you can react to: whenever a blob is created, or when a blob is deleted. Then, and this is really smart, you can ignore text files or, let's say, Parquet files that are empty. You don't need to run your pipeline for them, because it won't load anything. Okay? Then we have the same annotations and activated or deactivated. And then custom events. For custom events, we have the subscription here, and then we have the Event Grid topic name. This is created outside of Data Factory. Then you have your subject, which is similar to what we have for storage accounts: what you want to listen to and react to. And then the event types; those are all custom ones that I can define. And with that, we conclude one more lesson. I hope you liked it. If you have any questions, just send me a message or post in the comments, and I will get back to you as soon as possible. Thanks for watching.

19. ADF Components - Azure Integration Runtime: Now let's talk about the Azure integration runtime. Let's see how it works and why it is so important in the ecosystem of Azure Data Factory. To see where it's located and where we can configure it, go to Manage, and you should be able to see Integration runtimes, as you can see here. By default, when you create a new Data Factory, it comes with a default integration runtime. In a nutshell, an integration runtime is the compute infrastructure that's running all your activities behind the scenes.
There are mainly three types of integration runtimes, and what we are seeing here is the Azure type. We can also have the self-hosted type and the SSIS integration runtime. So basically, what are the differences between them? Costs and network restrictions. The Azure integration runtime is basically meant to be used between resources within Azure, or, for instance, when you want to connect to Amazon S3 or other cloud resources. The beauty of the Azure integration runtime is that it provides fully managed, serverless compute. In other words, it provides the whole infrastructure for you. You don't have to think about software installation and patching or capacity scaling. Now, let's see how the self-hosted integration runtime works. 20. ADF Components - Self-Hosted Integration Runtime: Now, if you want to work with a self-hosted integration runtime, it has to be created manually. To create a new self-hosted integration runtime, you click New, and we have the option Azure, Self-Hosted. So let's go ahead and click Continue. As you can see, we have a couple of options to choose from. Here we can see more information about the self-hosted type, and here we have a linked self-hosted. So let's start with the self-hosted. You use the self-hosted type if you want to perform some sort of data integration securely in a private network that doesn't have a direct line of sight from the public environment. Self-hosted means that you have compute on your own network. You install the integration runtime on the specific compute resource that you want, and it's validated using a key that's provided at creation time. So let's go ahead and continue. You can give it a name, and we can create. Now my self-hosted integration runtime is created. As you can see, I've got a key 1 and a key 2. Then it gives me a link to download the integration runtime.
Then, once it is installed, it shows me how many nodes are installed, whether it's being updated or not, and whether it's shared. So let's click on Download. It's going to redirect me to the Microsoft page, and we can see that we have this Download button. Let's download it. I will get the first one. It's 700 MB, so you've got to have some space on your computer. I'm going to pause this video and return when it's downloaded. I have clicked on the installer, so let's go ahead and see what it looks like. I will just say Next, Accept, Next. It's quite an easy installation. I will install it now. When you finally finish the installation, this is the page that you will see. This is what's going to validate your self-hosted integration runtime with Azure. So let's come back here to the one that we just created. We're going to get the key. You can choose either of them. The reason that we have two keys is for security purposes. Imagine that you want to renew the first key because you're not sure who has it. You should be able to use one of the two and recycle the other. Okay, so let's get the first one. Then I will paste it in here, and it's validated. I can show you the authentication key, which is the same. Don't worry, this is just for demonstration and it will be deleted. I will register. Now it's validated, and you have this tooltip here. This is something that you want to use from your intranet. Let's say you have more than one integration runtime running on the same network and you want them to communicate with each other, to use them as a failover or something. You could enable this, and then there is a way for you to validate it, but let's leave it as it is at the moment. So let's finish. Cool, it's now validated. We have a node that has been registered successfully. So let's launch the Configuration Manager. This is what it looks like. It's pretty simple. There is not much for us to do here; it's just for the case where you want to troubleshoot some connection.
For example, you have Settings in here. If you have some proxy settings on your computer, there's usually nothing to do here, because it picks them up from the environment variables. You have Diagnostics, if you want to view the logs or if you want to test connectivity as it would work from ADF. You can pick the data source and you can pick the server, and what kind of authentication you want. So let's say I'm selecting SQL Server. I can pick Windows authentication or Basic, which is a SQL login. And I would pass the credentials here and test my connection. Okay, so this is the Update tab. It updates automatically, so it's just for information purposes in here. And in Help you'll find some links with more information. So let's close that. I will come back to this. And as you can see, it has turned green. I don't have any VPN set up. It's a pretty basic setup here, and my connection relies on the connectivity from the integration runtime using my key authentication. This little one here monitors the activity on my integration runtime. It shows the utilization of my CPU, what's going on with the version, and the available memory for me to run. Basic information, and the throughput as well, plus how many concurrent jobs can run at the same time. If you provision more resources, you can have more concurrent jobs at the same time. And if you have lots of things in the queue and not much resource available, pipelines will be queued, and they will keep waiting until there is an available resource for them to run. So here are the activities. You can find all the pipelines that this integration runtime has ever run by looking at this page. Okay, here are the filters. Feel free to filter for how long you want to check. By default it's set to the last 24 hours. That's it. Thanks for watching. Stay tuned. 21. ADF Components - Linked Self-Hosted Integration Runtime: Hi, and welcome to another lesson. Now we will talk about the linked self-hosted integration runtime.
If you come to Manage, you must have noticed that when we selected Self-Hosted, there is another option here, Linked Self-Hosted IR. You can click on Learn more, and we can see there is comprehensive documentation here. The linked self-hosted integration runtime is quite useful when you cannot afford to have more than one self-hosted integration runtime in your enterprise network, or you just don't want to spin up another compute resource to be dedicated to a self-hosted integration runtime. In that case, you can link two data factories to use only one self-hosted integration runtime. On top of that, a self-hosted integration runtime can be registered to only one data factory, which is why linking is needed. This is also useful in scenarios when you want to create a CI/CD pipeline for your data factories: you reference only one self-hosted integration runtime instead of having to manage multiple self-hosted integration runtimes. Now, let's have a look at how it works in practice and how we can set that up. To do that, we will first need to create another Data Factory, because at this point we only have one. So go back to the portal and search for data factories. We already have one here, so we create another one. I will just copy this name. Create. We're going to select the same resource group we created before. My region will be North Europe. I will just add another number here for my name. I will skip the Git configuration, I will skip networking, and I am not enabling any encryption. I will leave the tags empty, then Review and create. Okay, so it asks for the Git configuration. I will just select Configure Git later. Okay, so go ahead and create. Okay, it's been created. Now, let's go to the resource group where it's deployed. Great. We can see the data factory we just created. It's this one with 02 in the name. Okay, this is the new Data Factory we just created. As you can see, it's empty: no pipelines, no datasets. In Manage, we only have one integration runtime.
This is the Azure one that comes with every new data factory you create. Now let's suppose I want to use the same self-hosted integration runtime that we created for the other Data Factory. To do that, we have to go through a couple of steps. First, I want to allow my new Data Factory to access my old Data Factory. To do that, we go to the previously created integration runtime. We come to this page. We have to open the self-hosted integration runtime that we already have, and there is a tab called Sharing. So let's click on that. You can see that this is the location where you grant permissions to another Data Factory. So this is a one-to-one relationship. Once I select this, you can see that it brings me all the data factories that exist in my subscription or in the tenant I'm connected to. So I will just select this one here. This is the one that we just created, and it will grant the permissions. It's all set. Now, the second step that I need to do is to take the resource ID of my existing self-hosted integration runtime. Then, on my new Data Factory, I want to create a new self-hosted integration runtime, but this time it will be a linked self-hosted integration runtime. I will click on Continue. As you can see, there's not much for us to do here. We basically just need to tell Data Factory which resource I want to link. So this will be my existing self-hosted integration runtime, and we can create. And that's it. As you can see, there is nothing to download here. There is no key to register a self-hosted runtime on your computer, because that was already done before. The main point here is that it's associated with something that was already set up before. So the type is still self-hosted, and it comes up as running because the one in here is also running. Okay, so you can see self-hosted and self-hosted. And if I click here, I can see there's only the resource ID.
On the other one, if I click on Manage, I have the keys, the nodes, the updates and the sharing of my self-hosted integration runtime. And now I can see that I have my permissions granted. That's all for now. This is how we set up a linked self-hosted integration runtime, or the sharing feature. Thanks for watching. See you soon. 22. ADF Components - Azure-SSIS Integration Runtime: Now let's take a look at our third option, which is the Azure-SSIS. Let's go to New, and here we can see we have a second option for Azure-SSIS. So let's click on that. So why do we need this third option? Many customers sometimes find it difficult to migrate all their SSIS packages into native ADF pipelines. So the reason why we have this Azure-SSIS is pretty much for lifting and shifting packages from your on-premises resources to Azure. The Azure-SSIS integration runtime is also a fully managed service. Behind the scenes, those are VMs dedicated to running SSIS packages. Here you can observe as well that a license is required, because you're running SSIS components, which means that you need a license for SQL Server. So you can leverage a feature called bring your own license, and this way you can have a discount, meaning that you are reusing your on-premises licenses in the cloud. Similarly to your own virtual machines on-prem, you have the ability to choose the resources. Here, you can also pick the number of nodes available to your integration runtime, and you can pick the kind of machine you want. And of course, the more resources you provision, the more it's going to cost you at the end of the month. If you wish to save some cost, you can pause this integration runtime as you wish. Here you have the location where you want to deploy your cluster of virtual machines.
Ideally, you want to pick a region that's close to your physical location, or to the physical location where your data is stored. Selecting the right location for your SSIS integration runtime is essential for performance and throughput. This way, your data travels less to get to the final destination. Now, let's give it a name and continue with the creation of our integration runtime. So I'm going to say my SSIS integration runtime. It's going to validate the name. The name is okay. I can give it any description that I want. I will just go ahead and continue. One interesting thing to notice here is the fact that we need a location to store the SSIS packages. In that case, you need to have a SQL Managed Instance or a SQL database that's going to serve as the SSISDB, and you're going to store those packages in there. In my case, I don't have any instance created yet. So I will just stop here, because the purpose of this demonstration is just to show you how SSIS is used and what its purpose is. With that, we conclude one more lesson. Thank you and stay tuned. See you soon. 23. Quiz - Module 3: Good. 24. How to Ingest Data using Copy Activity into Azure Data Lake Gen2: In this lesson, we are going to be using the copy activity to move data from one place to another. We will use an Excel file located on my computer as a source and the Azure Data Lake that we've created as a destination. As a second example, we're going to be using an S3 bucket on AWS as a source and a SQL Server database as a destination. I'm not going into the details of how to set up an Amazon S3 bucket, because that's out of scope, and the same goes for the SQL Server database. Let's just assume that everything is created for us and, as data engineers, we just need to move data from one place to another. So without further ado, let's get started. Hi. Let's have a look through the objects that we have created and that we will need to move forward with this lesson.
Let us start with Manage, and let's take a look at the linked services. You have to make sure that one of the options that you have here is the data lake that we've created before. Cool. Now let's check the integration runtimes. You must have the self-hosted integration runtime, and it must be up and running. If you have two objects in here and they are all green, you're good, you're ready to go. Now, let's get started. Coming back to the home page, let's create a linked service to our Excel file. So go to Manage, Linked services, and click on New. Now we will search for File system. This is because the file is stored on our local computer, and we will need a file system linked service to get that file and populate a storage account with it. This file will be available as a resource, so you can download it from the lesson section. The file is something like this. It's just sample data that mimics an order kind of system, and you have IDs and names of people. I will just save this file in the Temp folder on my C drive, and then we will take it from there. Let's go ahead and create our linked service. So, continue. What is quite important for your scenario to work is to select the self-hosted integration runtime. And now I will just say the location where my file is stored. And here is the name of the user on your computer. This is what's going to authorize Azure Data Factory to connect to your file. You can use your own user. Here, just as an example, I'm not going to create a new user. I will just use my own account and test my connection. What do you know? It has been successfully created. We can then go ahead and finish the creation. And here we should see a new file system linked service created. This is my local CSV one. As you can see, we have our source and our target already created. Now, we have to go ahead and create a dataset. So on the left side, go to Author, then go to Datasets.
And here on the actions menu, select New dataset. Here we have to find the kind of dataset that we're going to be dealing with. Remember, we created a linked service as a file system. So let's search for File system. In our case, it's a CSV file, so it's delimited text. Okay, so let's select that. That's my dataset name. Remember, it is not the file name. And I will pick my local connection in here. I have to tell it where my file is located. I could have subfolders inside of my path, but since I don't have any, I can just leave this part out and say what the file is. So there you go. It's now selected. I have a couple of options to choose from here. I could import the schema from connection/store or from a sample file, or I could choose None here if I wanted to do that later. Go ahead. Now let's see. Brilliant. We have our first dataset. We can test the connection, and it should come back as successful as well. And here I have lots of things to manipulate my file if I want. For example, I might want to quote the fields of the file, because there might be commas inside of a cell. So there are a lot of things that I can do here. We have Schema, where I can import my schema, and we have Parameters as well. Everything is already set. So let's go ahead and publish to save our dataset. Perfect, we have our source. Now we have to create another dataset for our destination. New dataset. I'm going to select Azure Data Lake, then I select DelimitedText. And I will just say OK, because I don't have any location or any file in my storage account. I could create a folder structure inside of it to grant specific permissions on specific folders, but let's just stick with the default, and the file should be dropped at the root level of the storage account. Okay, perfect, we have our second dataset. It's created. So let's go ahead and publish again. Perfect, publishing is completed. Now we have to create our pipeline.
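Before moving on, the two datasets created above can be sketched as plain Python dictionaries. This is only an illustration under assumptions: the dataset names, linked service names, the file name orders.csv and the container name raw are all hypothetical, and the property names (DelimitedText, FileServerLocation, AzureBlobFSLocation, firstRowAsHeader) reflect the dataset JSON as I understand it, so verify them against the official reference.

```python
import json


def delimited_text_dataset(name, linked_service, location, first_row_as_header=False):
    """Build a delimited text (CSV) dataset definition as a plain dict (sketch)."""
    return {
        "name": name,
        "properties": {
            "type": "DelimitedText",
            "linkedServiceName": {
                "referenceName": linked_service,
                "type": "LinkedServiceReference",
            },
            "typeProperties": {
                "location": location,
                "columnDelimiter": ",",
                "quoteChar": "\"",  # quoting protects commas inside a cell
                "firstRowAsHeader": first_row_as_header,
            },
        },
    }


# Source: a file reached through the file system linked service (hypothetical names).
source_dataset = delimited_text_dataset(
    "MyCsvFileTest",
    "MyLocalFileSystem",
    {"type": "FileServerLocation", "fileName": "orders.csv"},
)

# Sink: the data lake; no folder path is given, so files land at the container root.
sink_dataset = delimited_text_dataset(
    "TargetCsv",
    "MyDataLake",
    {"type": "AzureBlobFSLocation", "fileSystem": "raw"},
)

print(json.dumps(source_dataset, indent=2))
print(json.dumps(sink_dataset, indent=2))
```

Note that the dataset name and the file name are separate fields, which matches the point made above that the dataset name is not the file name.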
Our pipeline, remember, is what's going to encapsulate all the objects that we have just created, which are the linked services and the datasets. Here on the right side, we have the option to hide this for a little bit. For this example, we're going to be using the Copy data activity in Data Factory. We have only one task, which contains a source and a sink, which is your destination. Here is where we are going to select the datasets that we have just created. My source will be on my local computer, my CSV file test, and the destination will be the Data Lake Gen2. So let's click on that. Great. They are both selected. If I hadn't selected a specific file in my dataset, I could use a wildcard file path, and everything underneath that folder would be copied over. I could use a list of files as well if I wanted to. You also have a very cool feature here, which is Preview data. Let's click on that and see what comes back to us. Brilliant. If you remember, this is the file that we saved on our local computer. Here, as you can see, it's treating our header as an actual record inside of the file, and we don't want that. So let's fix this. The header should be the first row: order date, region and so on. To do that, we have to go back to our dataset to fix the problem. Let's expand this again to see our datasets. In this case it will be my CSV file test. And here I can configure what the very first line of my file is. You can select First row as header. Okay, it's all set. Let's come back to our pipeline. And here we can preview the data again, and we should have the proper header. Nice, it's all perfectly configured. Okay, now let's take a look at our sink. There's not much for us to do here. It's just for the case where we're dealing with lots of information to be transferred. You can configure the file extension that you want. You can configure the copy behavior.
If, for example, you're dealing with JSON files, you may want to flatten the hierarchy of the file. And here you can say what extension I will be writing in my file. You can change that if you want. In my case, I will save it as a CSV file. In my mapping, I don't need to do anything here, because the schema is already imported. So just go to Settings. Here is how my activity gets parallelized, with all the default options. And here are the user properties, which is something where I don't want to make any changes right now. So our first pipeline is properly configured. Let's publish. Brilliant. We're all set. We have our datasets created, we have our pipeline, and we have our linked services created. That's pretty cool. Now, let's run this and see how it's going to work. However, before we go ahead, there's one little modification that we need to make on our dataset. Go to the target CSV dataset, to the directory. If you remember correctly, I had left it empty. We need to have a folder to drop the file into. If I browse here, I now have a folder inside of the data lake. But you don't need to create this beforehand. You can just give any name that you want here, and if it doesn't exist, Data Factory will create it for you. My CSV location one. And then I will publish this. Okay, the dataset is now published. That was the only modification that I made. Everything else is the same. Okay, so let's now run our pipeline. To run the pipeline, we have the Debug option, which means I want to run this right now, and it doesn't depend on any trigger. And then I also have Add trigger, which means that I can trigger now, or I have New/Edit to create a new trigger. So let's just hit Debug and go ahead and see how it's going to work. Okay, our pipeline is now queued. It's in the queue of Data Factory. And this is the view that Data Factory created for our activity. So remember, it's combining everything together here, putting all the pieces together to transfer our data. And what do you know?
Our file has been transferred. Remember, we did not have a folder called my CSV location, and it was created for us. So you can see it was queued successfully and it was transferred successfully, as well as the duration that it took for each one of those steps. Happy days, we have everything set and our file should be in the data lake. But now, how do we make sure that it's in the data lake? Let's have a look. I will go to the portal. I will just type the portal address. Here, on the data lake side of things, I will search for storage accounts. And I have my ADF data lake that we created for this project. We have Storage Explorer (preview) and we have Containers. So here, inside of my container, there go my files. Just coming back to my dataset, this is how the path is split. Whenever you see File system here, it's talking about a container. You've got to have a container. I will just name a different folder in here: folder a one. As you could see, coming back to the Data Lake Storage Explorer, there is no such folder yet defined at the root level, so you don't have to worry about creating a new folder. You can just come back, give a folder name inside of your dataset, and publish. Okay, great, it's now published. Now we can rerun our pipeline. It's queuing our pipeline now, and it should run in a few seconds. So our file has been created. We can check in two ways. You can check again from the file system here inside of the dataset. If I come back to the storage account, I see my new locations. My two containers are there. I will select this one. And now I have a new folder and the previous file that I had dropped. So there must be a new file in here. Great, our file is in here. That's perfect. From our Storage Explorer view, we can come back and refresh. You can see that we have a new folder and the file in here. Cool. That's all for now. I hope you have liked it.
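As a recap of this lesson, the pipeline we built by clicking through the portal can be sketched as a definition like the one below. Treat it as an illustration rather than the exact JSON that Data Factory generates: the pipeline, activity and dataset names are hypothetical, and the source and sink type names (DelimitedTextSource, FileServerReadSettings, DelimitedTextSink, AzureBlobFSWriteSettings) are the ones I believe apply to a file system source and a Data Lake Gen2 sink.

```python
import json


def dataset_ref(name):
    """Reference an existing dataset by name."""
    return {"referenceName": name, "type": "DatasetReference"}


def csv_copy_pipeline(pipeline_name, source_dataset, sink_dataset, file_extension=".csv"):
    """Build a pipeline with a single Copy activity: local CSV -> data lake (sketch)."""
    copy_activity = {
        "name": "CopyCsvToDataLake",  # hypothetical activity name
        "type": "Copy",
        "inputs": [dataset_ref(source_dataset)],
        "outputs": [dataset_ref(sink_dataset)],
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {"type": "FileServerReadSettings", "recursive": False},
            },
            "sink": {
                "type": "DelimitedTextSink",
                "storeSettings": {"type": "AzureBlobFSWriteSettings"},
                "formatSettings": {
                    "type": "DelimitedTextWriteSettings",
                    "fileExtension": file_extension,  # extension written at the sink
                },
            },
        },
    }
    return {"name": pipeline_name, "properties": {"activities": [copy_activity]}}


pipeline = csv_copy_pipeline("CopyLocalCsv", "MyCsvFileTest", "TargetCsv")
print(json.dumps(pipeline, indent=2))
```

The pipeline itself only holds references: the linked services and datasets stay separate objects, which is why changing the folder in the sink dataset was enough to redirect the output.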
Our next example is going to be transferring a Parquet file from Amazon's blob storage and then dumping that file inside of a database. Thank you. Stay tuned. 25. How to Copy Parquet Files from AWS S3 to Azure SQL Database: Hi folks, welcome to another lesson. I hope you have enjoyed our last demonstration. Now we are going to look at how we can load data from Amazon S3 into SQL Server using the copy activity. This is the same activity that we used before. The difference now is the linked service, which will have a different format. Just as a prerequisite, you must have set up your own S3 bucket. Setting up S3 buckets on AWS is out of scope, and the same goes for setting up a SQL Server. Okay, so let's get our hands dirty and see how that works. I have the AWS console open here. I have my S3 bucket that I had named Ozzie test training 2020 000. And as you can see, the bucket is empty. Before we do anything, I will upload the data into Amazon S3. I will make the file that I'm using here available. It's just a simple file. Before we go ahead, I want to show you what the file looks like. This is a project that's on GitHub. You can preview your Parquet files in here. If you have some files on your computer, it also works, okay? My Parquet file is called user data. So I just selected this, and this Show table option here is selected. And I will submit the file, because I just want to quickly see what it looks like. So you see here it's just simple data, a bunch of random stuff, okay? There is enough data here for us to mess around with. So take a look at this website when you get a chance. It's called your rate. So we know what our file looks like. Now, let's upload that file into Amazon S3. It's pretty simple: just Upload here, or Upload up here, your choice. So, Add files. I will take this file, user data one. It's quite a small file, just 100 kilobytes. I will upload the file. It shouldn't take long. It succeeded. Okay, it's finished.
So we have our data sitting on Amazon S3. Okay, let's come back to Azure Data Factory. I will set up our first linked service, which will be pointing at Amazon S3. So let's go ahead and click on Manage and Linked services, then New. Amazon S3 will be here on top. I think it's a very common requirement to ingest data from Amazon S3, and I know for a fact that Microsoft has invested a huge amount of money to make that connectivity reliable and fast. So let's go ahead and click on Amazon S3 and name it My first Amazon S3. Now, if you remember, one of the use cases of the AutoResolve integration runtime, or Azure integration runtime, is to move data or to access resources in different clouds. In this case, we are using the Microsoft backbone to talk to AWS. Here we have two options. We have the access key. When you create a user on AWS, you are given an access key and a secret key. This is the first option here to authenticate to the Amazon service. Or you have temporary security credentials. Basically, this is a token that's granted to you, and it expires after a while. We're going to be using the access key in this case. I will just paste it in here, and this key will be temporary. So I will paste my access key, and now I will get my secret access key. Okay, now that we have our secret access key, there is an option here called Service URL. Basically, this is for when you want to access S3 via a different endpoint. You only want to change this to something different if you really know what you're doing. In most cases, it will be empty. So now you might want to test your connection to see if your S3 bucket exists and if your user is working correctly. Okay, so it's a good idea to select To file path. Here we pass the name of our bucket, which will be ADF training. I'll just copy it from the bucket so we don't waste time. And here we don't have any directory inside of the bucket, although we could create one if we wanted to.
The file is at the root level. Okay, so let's paste this and test the connection. Yep, it's all working. Now we are all set to start getting data from Amazon S3. Let's hit Create here. And it's been created. Now we have to look at our SQL database. 26. Creating ADF Linked Service for Azure SQL Database: I will move to the Azure portal now. I will just search for the SQL database that I created. I created a pretty small SQL database, so let's see how it goes. You have a bunch of configurations. As you can see, I only have 20 MB of space, but since our file occupies only about 100 KB, we should be fine. All right, so what we need to do here now is the following. We need to get the server name so we can create our linked service. So I'm going to copy this, come back to Data Factory and click on New linked service. Then I will search for SQL. I will just select from Azure here, Azure SQL Database. And again, we're going to be using the AutoResolve integration runtime, okay, because we are within the Azure backbone. There is no point in going out to the self-hosted one and coming back. We can, if we want to save money for any reason, avoid using compute resources from Azure. That is possible, but there is a performance penalty for that. So let's just stick with this option. I've just selected From Azure subscription. I will get my server, because it gets populated. I didn't actually need to do anything there. So, okay. As you can see, I've created a database here before. That's ADF training SQL one. And what I'm going to do here now is create a credential for my SQL database. I don't have my credential set up yet. As you can see, I have three methods to authenticate to my database. I have SQL authentication. It's just a user and password that get created inside of the database. This is the default credential. You also have the managed identity. The managed identity is the user that gets created with Data Factory.
It holds the data factory name, and you cannot get the password. It's all behind the scenes. You also get the opportunity to use a service principal. A service principal is like a service account credential. You have the tenant, and you have the service principal ID and the service principal key. Those are things that you will need to configure. But what I'm going to use here, since it's less hassle to create and also more secure, is the managed identity. I will just stick with this option. I just need to make sure that my database has authorized access for this object. To do that, I will open up my Management Studio and see how it's configured. 27. How to Grant Permissions on Azure SQL DB to Data Factory Managed Identity : Okay, I opened up my Management Studio and I have logged in to my SQL Server, the virtual server. And as you can see here, I have created a simple database. This database is empty. There are no tables whatsoever, and no logins either. So what we need to do here now is to create a user with the name of the data factory. This way, SQL Server will understand that this is a managed identity. The syntax to create the user is the following. We're going to open up a new query editor, okay? And then we'll type CREATE USER. I need to get the name of my Data Factory, so I will just come back to Data Factory and copy it again. That's the name of the user. Go back to Management Studio and type the name of the user in here. And this is how it works: FROM EXTERNAL PROVIDER. As you can see, there is no password in here, just the external provider, which indicates that this user exists within Active Directory. However, there is something else that we need to do first. If you go and click on Execute, you will see there is an error message here. It says that only connections established with an Active Directory account can do this. Okay? As I showed you in the beginning, I am connected using an SA account, and this is a SQL login.
For me to be able to create a user from Active Directory, I must be connected with an AD account. To do that, I will quickly show you how it works on the portal inside of Azure AD so you can follow along. Okay, this is slightly out of the scope of this training, but so that you can follow along, we'll see how it works. So search for Azure Active Directory, if you have rights to do so, of course. On the left side, search for Groups and create a new group. I will just name it as an admin group. And then here I will pick the members. I will search for my user, which is the user that I'm logged in with at the moment. Now, we need to come back to the portal. I will go to my SQL servers and select the server we created, then Active Directory admin. As you can see, no Active Directory admin has been set. I will click Set admin, and then I'll pick the admin group that we just created. Select, then Save. Great, it's now saved. Coming back to my Management Studio, I should be able to log in with my account. Okay, so let's open up a new connection to the same server. The difference now will be the authentication: Universal with MFA. You must be using one of the latest versions of Management Studio for this option to be available to you. Okay, I think it's from version 18 onwards. So that's my email address. I will see if it works. Okay, it will prompt me for my password. Okay, great. I am now connected. I am now the admin of this account, because I just set my own account, which is within the group, as the Active Directory admin. So I can do anything here inside of this account. Now, let's come back and open a new query editor under my new account, okay, with my email. I will copy the same statement from a little before, then paste it in here and hit Execute. Brilliant. Okay, it's now created. We have a new account under the database. Okay, this is a database contained user which references an object within Active Directory. And you can see it being created here.
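The statement typed in the query editor above can be sketched with a small Python helper that just produces the T-SQL text. The factory name below is hypothetical. The second helper is for the permission step that follows; db_owner is my reading of the role used in the lesson, so treat that default as an assumption and pick whichever database role suits your case.

```python
def quote_identifier(name):
    """Wrap a name in square brackets, doubling any closing bracket (T-SQL quoting rule)."""
    return "[" + name.replace("]", "]]") + "]"


def create_managed_identity_user_sql(factory_name):
    """T-SQL that creates a contained user for the data factory's managed identity.

    There is no password: FROM EXTERNAL PROVIDER tells the database that the
    identity lives in Azure Active Directory.
    """
    return f"CREATE USER {quote_identifier(factory_name)} FROM EXTERNAL PROVIDER;"


def grant_role_sql(factory_name, role="db_owner"):
    """T-SQL that adds the user to a database role (role name is an assumption)."""
    return f"ALTER ROLE {quote_identifier(role)} ADD MEMBER {quote_identifier(factory_name)};"


# Hypothetical data factory name, used only for illustration.
factory = "adf-training-01"
print(create_managed_identity_user_sql(factory))
print(grant_role_sql(factory))
```

Both statements have to be run while connected with an Azure AD account, as the error message in the lesson shows.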
We now must grant permissions to this user, because at the moment it only has permission to connect. So we need to go and grant permissions to it. I will grant db_owner to the user, because I don't want to be restricted in anything, and I will hit Execute. Okay, it's now granted db_owner. Let's come back to Data Factory and test. Okay, it's here, the same thing you saw before. Test connection. And the connection fails. Let's see why it fails. If we click More, we should get a message. Okay, so the client IP address is not authorized to access the server. Great, so this is an issue with the firewall. We can quickly fix this. Let's go back to the server. In here, you should have Firewalls and virtual networks. Here I have my client IP address. What I need to do here is make sure that "Allow Azure services and resources to access this server" is set to Yes. That means all Azure services will be able to reach my database server. Hit Save and it will update the firewall rules. Okay, it's now updated. Let's go back to Data Factory and test it one more time. Great, it has successfully connected. I will just keep it as it is at the moment. Cool. We've set up our linked services, our source and our target, the source being AWS and the target being the SQL database. Now we need to move on and create the datasets.

28. How to Grant Permissions on Azure SQL DB to Data Factory Managed Identity: Hi everyone. Now that we have our linked services created, we have to create our datasets. Let's go to Author. Then, on the left side, let's click on Actions and New dataset. Let's start with Amazon S3, which is our source. Click on that, and then Continue. Here we have to pick our data type, which is a Parquet file. Let's select it. My linked service will be my first Amazon S3. Then I'll click OK. Now we have our dataset created for S3. Let's create one for SQL Server. Let's go and click New dataset.
Let's search for Azure, and down here you can find Azure SQL Database. Let's click on that, then Continue. Then I will pick my first SQL database linked service. As you can see, I don't have any tables created. Okay, it's pretty straightforward. We've created the pointers to our data. One interesting thing to note is that we have to pick something here, because Data Factory needs to know where the data will be sent. If you select None here, you'll get an error message when you create the pipeline. Since we don't have a table created, our goal is to create a table at run time with the same schema as our Parquet file. To do that, we can click Edit and give it any name we want. You have to pass the schema name first, and then the table name. Okay, this is how I'm naming my table. Cool, we have created both of our datasets. Now we have to create our pipeline, which will trigger our activities. So let's go ahead and click on Actions here, and New pipeline. We will use the Copy activity here as well. Let's go through the options really quickly and see what we need. Let's give it a name; that's the name of our activity. Our source will be the S3 dataset that we just created, so we select the Parquet one. Here we have a couple of options to choose from. In this case, I am passing the information that I have in my dataset. But if I wanted something more dynamic, I could use a file prefix, a wildcard, or even a list of files. I also get the chance to explore the S3 bucket from here. Let's just stick with this option. Let's go ahead and go to Sink. The sink will be our target, which is the SQL database dataset we created. I want to select Auto create table, okay? Because, as you can see, this will automatically create the table at my sink if the table doesn't exist, using the name that I set on my dataset.
This is the pre-copy script, if I want to run some sort of automation or preparation script here. This is the write batch timeout, and this is the write batch size, if I want to hand-pick that, let's say 1,000 records per batch or a different number. Usually you're going to stick with the default option. Mapping: we can leave this empty, because Data Factory will automatically create the table based on the Parquet file. Then this is the data integration unit setting. It relates to the auto-resolve integration runtime. Here you can specify the power of your auto-resolve integration runtime. If you select Auto, Data Factory will allocate more resources based on its own rules. Data consistency verification basically checks whether your row count matches at the source and the sink. And if you're moving files across, it compares the size or the checksum of the files. This will add extra duration to your total load, because at the very end it still needs to process the data consistency verification. Here is fault tolerance. We don't need to worry about that. Then enable logging: let's say you have some fault tolerance and you want to skip records, and you want to see what errors happened. You can enable logging and specify a data lake or Blob storage to dump the log messages. Okay, so I will not select this. And enable staging means that before you load your data into your database, you can stage that data on your Blob storage or data lake. Depending on the size of your load, this can be helpful, because as you're loading into a staging area, you are loading your data in a parallel manner, so you can get more throughput. Your data should go faster from the data lake to SQL Server because of the network distribution. Again, it depends on the size of your dataset. You'll have to experiment and see how it goes. We are not going to enable staging.
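The settings walked through above end up in the JSON definition of the Copy activity. Here is a rough Python sketch that assembles such a definition as a dictionary. The property names (`tableOption`, `writeBatchSize`, `preCopyScript`, `enableStaging`, `validateDataConsistency`) follow my reading of the published copy activity schema and should be checked against the JSON code view of your own pipeline. The activity and dataset names are placeholders, not the ones used in the video.

```python
import json


def build_copy_activity(name, source_dataset, sink_dataset,
                        write_batch_size=None, pre_copy_script=None):
    """Assemble a Copy activity definition similar to the one built in the UI.

    Assumed setup: Parquet files on Amazon S3 as the source, Azure SQL
    Database as the sink, with the table created automatically if missing.
    """
    sink = {"type": "AzureSqlSink", "tableOption": "autoCreate"}
    if write_batch_size is not None:
        sink["writeBatchSize"] = write_batch_size  # omit to keep the default
    if pre_copy_script:
        sink["preCopyScript"] = pre_copy_script

    return {
        "name": name,
        "type": "Copy",
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            "source": {
                "type": "ParquetSource",
                "storeSettings": {"type": "AmazonS3ReadSettings"},
            },
            "sink": sink,
            "enableStaging": False,            # staging left off, as in the lesson
            "validateDataConsistency": False,  # consistency check left off
        },
    }


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    activity = build_copy_activity("CopyS3ToSql", "S3ParquetDataset", "AzureSqlDataset")
    print(json.dumps(activity, indent=2))
```

Leaving `write_batch_size` unset mirrors the advice to stay with the default unless you have measured a reason to change it.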
And here at our source, you can open your dataset. You just need to pass the name of the bucket we created. Here you can test the connection, okay? And you can explore the S3 bucket if you have enough permissions to do so. Let's go and explore. Here you can see the name of my S3 bucket, and I can click on it and see all the files that I have inside that bucket. I will just select the userdata1 Parquet file. As you can see, it's at the root level, so I don't have to specify the directory. I can also preview the data and see what it looks like. That's pretty good. It's a good sign that we can connect and reach the data. Let's go back and Publish all. Good. Let's run it and see what happens. First, let's have a quick look at our SQL Server instance and see if we have any tables created. As you can see, Tables is empty. Let's go back to Data Factory, to our pipeline, and see if we get a new table with the dataset inside. Okay, so let's hit Debug. It's queued at the moment. It's in progress. And what do you know? It has succeeded. That's a good sign. Let's take a quick look at our SQL Server instance. Let's refresh that. Great, our table is here. Now let's select the rows of this table and see if the data is there. A simple SELECT statement, and our data is here. You can see the table got created automatically. Just one thing that you have to pay attention to is the data types of your table. Since it's creating your table on the fly, it doesn't really know exactly how much space it should give to your data, your varchar data; it will guess based on your Parquet file. As you can see, if you check the data types of your columns, everything came as nvarchar(max), nullable. And, let's see, salary came as float. You have to make your own judgment call. Okay, let's come back to our Data Factory and add a little bit of complexity to our pipeline.

29.
Copy Parquet Files from AWS S3 into Data Lake and Azure SQL Database: Hi and welcome back. In this use case, we're going to get the same dataset from Amazon S3. But instead of sending the data directly to the database, we are going to use Azure Data Lake Gen2 as our landing zone. You might want to have a data lake to place all your sources as Parquet files before sending them to your final destination. This is a common use case, because people sometimes want to do some treatment on the data using Databricks or HDInsight. So let's try to mimic that use case and use the same Parquet file, transferring it to our database while making sure the file exists on the data lake. For this example, let's start by publishing what we have from our last example. Cool. Let's go back to Author. Let's expand that and create a new dataset to transfer our Parquet file. I will create a new dataset. It's going to be Amazon S3. But instead of Parquet, I'm going to select Binary. This way I don't have to parse the Parquet file. It's on Amazon S3, and my linked service will be the same. For the file path, you provide the name of the bucket, and the directory is empty. We can explore the bucket from here, and we will select the same file. So you can do this when you are creating the dataset or once it's created; there is really no difference between the two. Let's click OK. Great, we have our source created. Now we have to create our landing zone dataset, which will be Data Lake Gen2. Let's select this one. This will be binary as well. Continue. It will use my first linked service. This is the one I created in our first example. I need to select the file system, which is a container. You can see the two that we created previously. I don't want to touch those. Even if the container doesn't exist, I can give it a name here and it will get created automatically by Azure Data Factory. Inside it, I will name the file directory AWS. All right, so that's created.
We want to publish now, just to make sure our new datasets are created. What we need to do now is create the pipeline that will transfer our Parquet file from AWS to Azure Data Lake. That's the name of my pipeline. We're going to use Copy data. My source will be the binary transfer dataset for AWS, and I will stick with the defaults. My sink is the landing zone on Data Lake Gen2, right? So let's publish this. Right, let's run this pipeline now and see if we can transfer the data across the clouds. Cool, let's go and Debug. It will start running. It's in progress. Okay, it failed. We can check the failure message here and take a quick look at the reason. It says the file system has a name that is invalid. It says it contains invalid characters, which here is the underscore. That's pretty simple to solve. I did this on purpose, so you can see what kind of errors might happen sometimes. Let's go back to our dataset and remove that. Let's keep it as landingraw, without the underscore, and the same directory. Let's go to our pipeline, which moves data to the landing zone, and go ahead and Debug. Queued, it's in progress. Let's see. Great, it succeeded. One way to explore the result is to come back to our landing zone dataset and click Browse. If you go back here, remember that before there were only two containers; now we have a third one. Here we have the AWS folder, the folder we created. And here we have our Parquet file. Okay, that's great. Now we have to move that file from the data lake to the database. But before doing that, let's make sure the file exists before moving it into the database, so we can avoid errors, or so we can create some sort of validation here to send a message to the provider of the data saying the file doesn't exist. All right, so let's go back to our pipeline and add a general activity, which will be Lookup.
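The failure in this lesson comes from the storage naming rules: a container (file system) name may only contain lowercase letters, digits, and hyphens. A small Python check along those lines can catch the problem before the pipeline runs. This is a sketch under my understanding of the published rules (3 to 63 characters, starting and ending with a letter or digit, no consecutive hyphens), and the names `landing_raw` and `landingraw` are my reading of what is typed in the video.

```python
import re

# One leading letter or digit, then groups of "optional hyphen + letter or digit".
# This rules out underscores, uppercase, leading/trailing and doubled hyphens.
_CONTAINER_NAME = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_container_name(name: str) -> bool:
    """Return True if the name satisfies the assumed container naming rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["landing_raw", "landingraw", "landing-raw", "Landing"]:
        print(candidate, is_valid_container_name(candidate))
```

Running a check like this when you define dataset names is cheaper than waiting for a debug run to fail.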
A Lookup activity can query any dataset that is supported as a data source inside Data Factory. It can also query tables, and it can execute a stored procedure. So let's use this Lookup activity here. We will go to Settings, and we have to select our source dataset, which in this case will be the Parquet file that we transferred to our landing zone. Let's go ahead and create a new dataset, because as you can see, one is not available to us yet. Let's click New and select Data Lake Gen2, then Continue. It will be a Parquet file. Then it will use my first linked service, right? Let's explore that linked service. Let's go to landingraw, which is our container, then the AWS folder, and this is our data. Let's just stick with the first option and click OK. All right, First row only: that's okay for our sample. Under the general properties, we can rename this activity to Lookup. The Lookup will make sure the file exists. Now that we've created our Lookup to make sure the file exists, let's come back here and add Copy data again. This is the step that will transfer the data from the data lake into the database. All right, let's chain these two together and set our source. The source here can be the same dataset we just created for the Lookup activity, since it will be looking at the same location. We can open that dataset and validate that this is what we want. Let's come back to the pipeline and pick our sink, which will be the Azure SQL DB sink, our target. Here again we have to specify a table option: None, if we were using a stored procedure, or Auto create table, to create the table if it doesn't exist. It relies on the table we specified in the dataset. Since the table already exists, it will skip the creation. All right, let's come back to our pipeline. Yeah, we're ready to go. So we have our scenario all set up here. We transfer the data from the Amazon S3 bucket into the data lake, which is our landing zone.
We have a final check to make sure the file has been transferred correctly. Once this is validated, we move the contents of the file into the database. If you want to, you can rename this. So let's publish everything. Publishing completed. Let's go ahead and click Debug. But first, let's do a quick count of the data in our table. We have 1,000 records. Let's see how many records we have after we execute our pipeline. So let's hit Debug. The first step has succeeded. The check, the second step, also succeeded. And it will now move the data into the database. It's in progress. It also succeeded. Great. Now let's check how many records we have in the database. I will hit Execute. Two thousand records, since this is our second load and our Parquet file has 1,000 records. So here we can confirm that everything has been successfully executed. One interesting thing to look at now is how we monitor everything that we just did. In our next lesson, we will check how to monitor the executions we've just done, and also the ones that ran in the past. So stay tuned and see you soon.

30. Monitoring ADF Pipeline Execution: Hi, and welcome to another lesson. In this lesson, we will explore how to monitor the execution of pipelines. You may remember that in our last demonstrations we were using the Debug option. The Debug option triggers the current pipeline within your own session. This option is for developers working on a feature branch, or for something that is just for the moment and shouldn't be logged. If someone else wants to check the execution of my pipelines at a later stage, these executions wouldn't appear. So how do we make sure that everything is logged, and how do we understand the throughput of something that ran when we were not present? We have the option Trigger now. Trigger now will run whatever has been published and will log the execution of the pipeline.
So let's hit Trigger now for the same pipeline we ran before. You can see "No records found". We can go ahead and click OK. It will re-execute my pipeline the same way we did before, but using the Trigger now option. Our pipeline has been kicked off. We can come back to the left pane and click on Monitor. You can see that it has already succeeded. Here is where you can see all the executions of your pipelines. If I click on that, you can see that the same activities have been successfully executed. And using this little glasses icon you can check the details of your transfer. Click on that. You can see that my transfer came from Data Lake Gen2 and went to my Azure SQL database. Here is the throughput of my transfer, and how many records were transferred. The peak connections, which is something controlled by ADF, and the size of the data written into my database. The number of files read, the number of rows, and the connections. And here you can see a breakdown of the steps. Again, this is something executed behind the scenes that we have no control over. This is the queue time, the pre-copy script, and the transfer. As you can see, since we don't have any pre-copy script, it was 0. And then the transfer itself. This is quite useful, and anyone with access to my data factory would be able to see it. We can see the Gantt chart as well. Those are the executions of my pipeline, showing when each one started and finished: the first one, the second one, which is my data check, and the third one, which inserts into the final destination. We can run it again and see the whole process. Trigger now. Okay? If you go straight to Monitor, you can see that it's in progress. You can click on that, and you can see the first activity has succeeded. Refresh. The second one is queued. Now it's in progress. It has succeeded. The next one is queued now, then in progress, and finished. Okay, so we should have 4,000 records in our table, right? So it's all good.
So if we come back to our pipeline runs, you can see it's now logged. Any time I open up my Data Factory, I should be able to see this in the monitoring part. Okay, now let's suppose your pipeline has failed. You have the option to re-execute the pipeline, and it will re-execute from the activity where it stopped. You don't need to rerun your entire pipeline. Imagine you're working on a very big transfer overnight. You had to go home and wait, it finished the next day, and one of the steps failed, but all your massive data was already transferred. Then you could just click on Rerun, and it would restart from the point where it stopped, which is pretty cool. Here you have the option to click on Consumption and see how many DIU-hours the run has consumed, which is for the data movement activities. All right. If we had more pipelines, you would see them overlapping in the chart here. You have the option to select the last seven days, or select a custom date range, to see the executions of your pipelines. By default, it's always the last 24 hours. Yeah. So it's very comprehensive and quite simple for you to investigate the executions of your pipelines. All of them have a run ID, a unique pipeline run ID. And here you also have the option to see the details, which come as JSON generated by Data Factory. This is a very good tool to investigate and make sure that everything is working correctly. Yeah, that's all for monitoring. See you in the next lesson. Thank you very much. See you soon.

31. Mapping Data Flow Walk-through: To get started with data flows, we can come to the left panel. As you can see here, just under Datasets, we have Data flows. Data flows can be created just like any pipeline or dataset. You can click on the three dots here and you'll see New data flow. Let's click on that.
It's cool because, if it's the first time you are opening a data flow, you get a walk-through here, which is quite handy. As you can see, we can start by adding a source to the data flow. Let's click on Finish. This is our canvas. The data flow panes are divided into three main parts. The first part is the top bar. The top bar is quite important because it's where we can validate our workflows, for example the JSON or the logic that we're working on. We can also enable data flow debug, which means that we're going to start our clusters in order to run Spark code behind the scenes. The second part is the graph. The graph is where you place your transformations. You can create a transformation stream that shows the lineage of source data as it flows into one or more sinks. To add a new data source, just select Add source here, and you'll see that it brings up a new tooltip for you to start adding more things. For example, the left side of the node shows the type of transformation. The right side of the node shows the name and description of the data stream. Let's click on Next. Here on the node, you can open the configuration by right-clicking. And if we click Next again: the little plus sign here is where you add a new transformation. Let's click on Finish. Here we would have our first transformation, and of course you click here to add another one. You can see we have lots of transformations. If you are familiar with Integration Services, you will recognize some of the transformations we have here. For example, aggregate, pivot, and unpivot are things we can find in Integration Services as well. We will go through each one of those transformations, what they do and how they work. And finally, we have our configuration panel. Notice that if you select one of the tasks, it shows you the options and configurations related to the selected task.
If you just click outside and don't select anything, you get the settings and parameters specific to the overall execution of your logic. You could add more parameters here, or different settings, and so on. If you select the task again, you can see that we have several options. Those options may vary according to the task that you are working on. But one thing you will find in common is Optimize. The Optimize tab contains settings to configure partitioning schemes. For example, you could use the current partitioning, which is the default, or a single partition, or you could set the partitioning yourself. We're going to go through each one of the options; they may be useful depending on the requirement you have. Inspect is the option where you get a view of the data stream that you're working on. Moving on, Data preview is where you get an interactive snapshot of the data at each transformation. That's quite cool, because before an inner join or, let's say, an aggregation, you would expect to see a certain type or state of the data. You can click on that and then go right back to your transformations and see the expected result. You can only see this when you have your cluster up and running, that is, with debug mode switched on.

32. Mapping Data Flows Transformations - Multiple Inputs/Outputs: Starting with multiple inputs and outputs, we have Join as our first option. With the Join transformation, you can combine data from different data sources or data streams, and the output will include all columns from both sources, matched based on a join condition. The join types are inner join, left outer join, right outer join, full outer join, and custom cross join. So we have five different types of joins. Following Join, we have Conditional split. With Conditional split, you can route rows to different data streams based on a matching condition.
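To make the Conditional split idea concrete, here is a plain Python sketch of the same routing logic. It is only a conceptual model, not how Data Factory implements it: each row goes to the first stream whose condition matches, and anything left over goes to a default stream. The `status` column, the stream names, and the sample rows are assumptions for illustration.

```python
def conditional_split(rows, conditions, default_stream="default"):
    """Route each row to the first stream whose condition matches.

    conditions is an ordered list of (stream_name, predicate) pairs.
    Rows matching no condition land in the default stream.
    """
    streams = {name: [] for name, _ in conditions}
    streams[default_stream] = []
    for row in rows:
        for name, predicate in conditions:
            if predicate(row):
                streams[name].append(row)
                break
        else:
            # No condition matched this row.
            streams[default_stream].append(row)
    return streams


if __name__ == "__main__":
    rows = [
        {"id": 1, "status": "activated"},
        {"id": 2, "status": "deactivated"},
        {"id": 3, "status": "pending"},
    ]
    result = conditional_split(rows, [
        ("active", lambda r: r["status"] == "activated"),
        ("inactive", lambda r: r["status"] == "deactivated"),
    ])
    for stream, stream_rows in result.items():
        print(stream, [r["id"] for r in stream_rows])
```

Each resulting stream can then feed its own sink, for example one table per status.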
So imagine you want to route your records to a specific destination based on a condition. Your condition could be whether the status is activated or deactivated. Based on that condition, you would send the data to one table or to a different dataset. Then we have Exists. The Exists transformation is a row-filtering transformation that checks whether your data exists in another source or stream. The output stream includes all rows in the left stream that either exist or don't exist in the right stream. Then we have Union. With Union, you can combine multiple data streams vertically. The output looks like one single dataset. Imagine that you put one dataset on top of the other, and together they look like one single dataset. Then, at the end of our category of multiple inputs and outputs, we have Lookup. This transformation is used to reference data from another source. For example, imagine you have a fact table and a dimension table. The Lookup transformation appends columns from the matched data, in that case the dimension, to your source data, in that case the fact. If you think about it, the Lookup is quite similar to a left join operation, where all the rows from your primary stream exist in the output stream, with additional columns from your lookup stream.

33. Mapping Data Flows Transformations - Schema Modifier: Our next group is the schema modifiers. As the first option, we have Derived column. With this transformation, you can generate new columns or modify existing ones using the data flow expression language. Then we have Select. With Select, you can rename, drop, or reorder columns. This transformation doesn't alter row data, but chooses which columns are propagated downstream. So if you have a very wide dataset and you just want to select a few fields from it, that is possible using this transformation. Then we have Aggregate, which lets you define different types of aggregation, such as SUM, MIN, MAX, and COUNT.
You can group by existing or computed columns. Then we have Surrogate key. Surrogate keys are quite useful when you are dealing with dimension and fact tables. You can use this transformation to add an incrementing key value to each row of data. This is useful when designing dimension tables, for example in a star schema that will be used in an analytical data model. Then you have Pivot. You can classify Pivot as an aggregation where one or more grouping columns have their distinct row values transformed into individual columns. Then, moving on to Unpivot: Unpivot does the opposite. You have columns in your dataset and you want to transform them so that they are shown as rows in a longer dataset. Here we have Window. That's an interesting one. The Window transformation is where you define window-based aggregations of columns in your data stream. In the expression builder of this transformation, you can define different types of aggregations that are based on that window. This is quite similar to the SQL OVER clause, where you have window functions in general. For example, you could think of LEAD, LAG, and NTILE; you can find similar functions here. A new field is generated in your output that includes those aggregations. This is quite useful when you want to work with different types of aggregations within the same dataset. Then we've got Rank. With Rank, you can generate an ordered ranking based on a condition that you specify. You can aggregate in a way that creates ranks within the dataset.

34. Mapping Data Flows Transformations - Formatters: Then we've got the formatters as well. Basically, using Flatten means that you can take an array that you have inside a hierarchical structure, such as a JSON file, and unroll it into individual rows. That's quite useful if you want to flatten your JSON file when you have lots of nested nodes inside the JSON. Then, following the sequence, we have Parse.
Parse is quite useful when you need to parse text columns in your data stream, for example delimited text from CSV files, or XML. So that's quite important when you have to deal with a lot of strings and you have to format and parse them.

35. Mapping Data Flows Transformations - Row Modifier: Then we have the row modifiers. These are important when you need to modify the rows of your data. First, with Filter, you filter rows based on a condition. Just like in Excel, you can filter by age or filter by timestamp; it's your choice. You also have Sort. You can sort the incoming rows on the current data stream. Let's say you want to sort by name, by age, by any field; that is possible as well. And then you have Alter row. Here you set insert, delete, update, and upsert policies on rows. You can add one-to-many conditions. These conditions should be specified in order of priority, as each row will be marked with the policy corresponding to the first matching expression. So you can define: if my expression matches, I want to insert this record; if it doesn't match my first condition, my second condition will update that row or delete that row. Alter row can produce both DDL and DML actions against the database that you're dealing with. So it does the same as the MERGE statement and a little bit more.

36. Mapping Data Flows Transformations - Destination: And then finally here we have the destination. This is the same as we've seen before with the pipelines. The sink is your destination, the place where you're going to insert your data. Okay? Those are all the transformations that we can work with. Remember that those transformations are visual transformations that, behind the scenes, are just running the code for you, doing all the optimizations to run in an optimal way on Spark clusters. In our next lesson, we're going to see how we can use data sources with some transformations. Stay tuned. See you soon.
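The "first matching expression wins" rule for Alter row, described in the row modifier lesson above, is easy to get wrong, so here is a small Python sketch of it. It only models the ordering logic; it is not Data Factory's implementation. The column names `is_deleted` and `exists_in_target` are made up, and returning `None` for a row that matches no condition is my own representation, not documented behavior.

```python
def mark_rows(rows, policies):
    """Tag each row with the first policy whose condition matches.

    policies is an ordered list of (policy_name, predicate) pairs,
    listed in priority order. Rows with no match are tagged None.
    """
    marked = []
    for row in rows:
        policy = next((name for name, predicate in policies if predicate(row)), None)
        marked.append((policy, row))
    return marked


if __name__ == "__main__":
    # Hypothetical conditions, highest priority first.
    policies = [
        ("delete", lambda r: r["is_deleted"]),
        ("update", lambda r: r["exists_in_target"]),
        ("insert", lambda r: True),  # catch-all for everything else
    ]
    rows = [
        {"id": 1, "is_deleted": True, "exists_in_target": True},
        {"id": 2, "is_deleted": False, "exists_in_target": True},
        {"id": 3, "is_deleted": False, "exists_in_target": False},
    ]
    for policy, row in mark_rows(rows, policies):
        print(row["id"], policy)
```

Row 1 matches both the delete and the update condition, but it is marked for delete because that condition comes first, which is why the order of the conditions matters.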
37. Defining Source Type; Dataset vs Inline: Hi guys and welcome to another lesson. In this lesson, we are going to see how we can create data sources using mapping data flows. To start, we're going to go here under Data flows, click on Actions, and click New data flow. This is empty because we're going to start from scratch. I will just skip this first tooltip. As you can see, we have a dotted box. Let's click on that. I'm also going to skip this first guide. Here we have a set of options in a few tabs, with different settings that we can choose from. So let's start with Source settings. The first and most important decision that you have to make is which source type to use. We have Dataset and we have Inline. A dataset is just an entity that can be reused across data flows and also pipelines. If you remember, we created all those datasets before for our other demonstrations. Those are just entities. Inline is something more dedicated to each data flow's logic or workflow. There are benefits in using both of them, but there are cases where only one of them is supported. For instance, with the inline source type we have a few out-of-the-box dataset types that we can choose from, and those work with a few connectors, not all of them. For example, with inline you wouldn't be able to use the Azure SQL Database connector. To get a more defined list of when to use the inline source type or the dataset source type, I would recommend taking a look at the documentation. For example, if I open the documentation here, you can see in a better view which connectors there are and which source types are supported. So we have Azure SQL Database; as you can see, only Dataset is supported. But for those types here, we have the option to use both. An important element of choosing dataset or inline is the fact that inline is a native source type for Spark while dataset is not.
So you will sometimes see better performance when dealing with files on a data lake rather than on a database, or rather than reading the same Parquet files using datasets. It's not the case that one is bad or that one should always be used; it's a question of which connector you're using, and you have to make a judgment call about which one performs better. In our demonstration here, we're going to stick with datasets, because we're going to ingest the data that we uploaded previously into our SQL database. To get started, let's select Azure SQL DB. It's called sink because in the other example we were using it as a sink, but now we can use it as the source.

38. Defining Source Options: Once we have selected the dataset, we have already made the important decision, which is Dataset as our source type. Then we have to configure a few settings here. We start with the source settings. Here, for example, we have the output stream name, which is the name of our task. You can give it any name you want, or just stick with this name and add "sample data". It doesn't accept special characters such as underscores; it has to be one single string. Here we have Allow schema drift. This option is the ability to let Data Factory handle flexible schemas, and it is useful if your schema changes quite often. The setting allows all incoming source fields to flow through the transformations to the sink automatically, so you don't have to handle the schema modification manually all the time. We also have Infer drifted column types. With this option, you can instruct Data Factory to detect and define data types for each newly discovered column. That's quite useful; it's like having a crawler. Data Factory has the ability to define and understand the data types as new columns appear. Then we have Validate schema. If Validate schema is selected, the data flow will fail if the incoming source data doesn't match the defined schema of the dataset.
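The interplay between schema drift and schema validation can be pictured with a small Python model. This is a conceptual sketch only; the function name and its behavior are my simplification of the two settings and are not taken from Data Factory itself. The column names are hypothetical.

```python
def resolve_columns(defined_columns, incoming_columns,
                    allow_schema_drift=True, validate_schema=False):
    """Decide which incoming columns flow downstream.

    validate_schema: fail if the incoming columns differ from the defined ones.
    allow_schema_drift: let columns that are not in the defined schema through.
    """
    defined = list(defined_columns)
    incoming = list(incoming_columns)

    if validate_schema and set(incoming) != set(defined):
        raise ValueError(
            f"incoming columns {incoming} do not match defined schema {defined}"
        )
    if allow_schema_drift:
        return incoming  # new columns are carried along automatically
    # Without drift, only the columns known to the dataset are kept.
    return [column for column in incoming if column in defined]


if __name__ == "__main__":
    defined = ["id", "first_name", "salary"]
    incoming = ["id", "first_name", "salary", "country"]  # "country" is new
    print(resolve_columns(defined, incoming, allow_schema_drift=True))
    print(resolve_columns(defined, incoming, allow_schema_drift=False))
```

With drift allowed, the new `country` column flows through; with drift off it is dropped; with validation on, the same input would stop the run instead.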
This is a way to make sure that you have a trustworthy schema. Then we have Sampling. This is for you to limit the number of rows that you're ingesting when you're debugging or testing your logic. It is useful when executing data flows in debug mode from a pipeline. 39. Spinning Up Data Flow Spark Cluster: Okay, so here, as you can see, the dataset is grayed out. It's grayed out because we don't have our Spark cluster running. And we do that by enabling this option here. It might take some time, up to five minutes, because it's spinning up a new Spark cluster for you behind the scenes. So be patient. Sit back, relax, and you will have your cluster up soon. Let's click on that. And as you can see here, I have the option to select an integration runtime. So this follows the same aspects that we've seen before. I could have another integration runtime here, and then I can pick the configuration for my AutoResolve integration runtime. We're going to stick with this option. This is the debug time to live, which means that if my debug session sits idle for one hour, it will shut itself down. This is a nice feature for you to save cost and not get a huge bill at the end of the month because you forgot to turn off your Spark cluster. So let's click OK. Once we click OK, we can see that the creation of the cluster will be initiated. And then if you pay attention to this bar here, this is when the cluster is actually being built behind the scenes. Once this is created, we will see a green check mark here. At this point, we know that we have the compute resources for us to work. This can take a few minutes. So you've got to be patient and just wait until it gets completed. I'm going to speed up this video and we'll come back in a sec. Cool, now we have our cluster up and running. And if you check here, in the notifications bell, you can see that our cluster took roughly six minutes to be ready, which is expected. It's usually about five minutes, so for us it was six. So let's click OK here and close.
And now we have our environment set up, and we can test our connection, ingest data, and so on and so forth. 40. Defining Data Source Input Type: We're going to move here to Source options. Source options may vary depending on the source that you're working with. So in our case here it will be SourceData, and here we are connecting to a SQL database. We can run a stored procedure, or we can run a query. So the projection is actually the data that you have. Now we're going to come back just for a sec to our database, the one that we've created, and we're going to check our table. As you can see, I have the database up and running, and the table we created, and it has just a few thousand records. So if you check here, we have 4,000 rows. So what am I going to do? I will just copy this code, the SELECT statement that we have. And we're going to go back to the Source options, and I will paste this statement here. 41. Defining Data Schema: The projection in a source is what defines the data columns, types, and formats. For most dataset types, such as SQL and Parquet, the projection in a source is fixed to reflect your schema. When your source files are not strongly typed, for example, if you have something that can change, like CSV files and text files, instead of Parquet files that have a predefined schema, you can define the data types in here. So I can either reset my schema or change it from here. In our case here, we won't be able to edit anything because it's already coming from a database, and Data Factory understands that it's a fixed schema. So whenever there is a change, if you enabled schema drift, it will be able to ingest any schema. But for example, if you're dealing with a text file and your text file does not have a defined schema, you could click on Detect. There would be an option here for you to click on Detect data type.
And it would take a sample of the data and infer the data types into your projection. Remember, if you're defining a new schema for your data or if you're changing the schema, if you click on Import projection, it will always override whatever you did. So you've got to be careful if you're updating a lot of things in here. 42. Optimizing Loads with Partitions: Now moving on to Optimize, we have three main options to select the type of partitioning that we want to use. In most cases, we will select Use current partitioning, because we want to use the predefined set of rules that Data Factory has to choose the best way to read our data. But for example, if you have a SQL database as a source, you might want to change it. It depends on the type of partitioning you have. So for example, you would have like six different types of partitioning to choose from. You could partition based on, let's say, a partition that you have on a specific column. Or if you want to use a query condition and then create a partition based on that, that is also possible. Not every time will you want to create a partition based on date fields. Sometimes you'll have an integer range that you want to define the partitioning on, so you can use a query condition to do that. If you click on this Source partitioning, which is a custom partitioning, it's likely that ADF will read the data the fastest, because here Data Factory can use multiple statements and make several connections to get your data with queries running in parallel. What will dictate the read performance here is the available resources that you have and how many concurrent statements you can run at the same time. In our case here, let's select Column. If you click here on the drop-down menu, you can see all the columns we have available in our table. So let's take the ID field, and this will be our partition column. 43. 
Data Preview from Source Transformation: Then moving to Inspect, we have a sample of our schema. So as you can see here, it shows all the columns we have, and then the data types that we have. Moving on to Data preview, it shows a preview of our data based on the partitioning that we chose. This is quite similar to the dataset, when you explore the data that you have. If you click here on Refresh, you can see that it's going to fetch the data from our table. Great, as you can see here, we have a sample of our data from the table that we're using from our dataset. That's quite cool because you can see here if everything looks correct. Data preview is also nice because, let's imagine you're doing transformations after that: you would be able to see the data at that point. Let's imagine you're doing an inner join and you want to see the columns that you've got from your left table and your right table. At this point, you would have only one table to select your data from. But looking at the next step, you would see from Data preview all the columns according to the transformations that you chose. So we have our source data created. We have gone through all the options that we have available for that. The next step is to create the sink side of our transformation. If I tried to save here at this point, it would complain that we don't have a sink, and it won't save your pipeline until we've got a sink. So let's create our sink and maybe create another table based on the source, on the same database. 44. How to add a Sink to a Mapping Data Flow: Okay, now we need to create our sink, meaning where we want to send our data. So click here on the plus sign. It's going to list all the available transformations that we have. In our case, we're going to go to Destination, and then the only one available is Sink. Click on that. And then again, we get a nice tooltip here. This time we're just going to close this.
And here we have similar options to the ones we have for the source; here we have Sink. In our case here we need to pick a dataset that will represent the point where we're going to send the data. Let's click on New to create a new dataset based on the Azure SQL database that we have. So let's click on New. Then we have the option to choose Azure SQL Database. So let's continue. Since we already have our linked service created, let's just click on that. And then I will name my new dataset. Okay. Here we have the option to select an existing table, and we only have one. And then we also have the option to create a new table. So let's create a new table. And then I just need to give it a name and a schema. So my schema will be dbo, and the table name will be SampleData. And we have a little button here for Advanced. I don't want to do anything with it at this point; it could be used for optimization and so on. I'll just click OK. At this point, we have a representation of that table, but it wasn't really created yet. Then we have Settings. Here we can control the type of operations that we want to allow on this dataset. Here we have Table action: we have Recreate table and Truncate table, meaning every time that we run this, we could truncate the table if we're loading the entire dataset, or, if we're dealing with some schema drift and you want to make sure that you have a clean table, you can go ahead and recreate the table. Or if we're dealing with incremental loads, you could just select None. Here is the batch size, if you want to control the number of rows inserted at a time, and whether it's going to use TempDB. Then, if you want to, for example, use SQL scripts to do something before the table is loaded, if you want to drop the table or do something with the table, create an index, drop an index, you can. For example, imagine you want to drop an existing index before the data is loaded, so you can get the data in quicker.
Once the data is loaded, you can create the index again. You could use these two options. Also, you have nice error handling here: you could stop on error, continue on error, and things like that. Then Mapping: we don't have to worry about mapping because this will be one-to-one. I want to get all of these columns that we have into my sink. So that's okay. Here we can again select the partitioning that we have. I'm going to set the partitioning again. I will maybe pick Key partitioning, and we're going to select ID again. And then Inspect and Data preview: we could refresh to see the data again. All right, so let's publish. It should be okay to publish at this time. The data flow has been published successfully. And at this point, we should be able to run the pipeline and see if we can get all the data across into a new table. 45. How to Execute a Mapping Data Flow: How do we run the mapping data flow? That's interesting, because from a standard pipeline you would have Debug and then also Trigger now, which allows you to execute your pipeline straight away. That's not the case for data flows. For a data flow to be executed, it has to be executed from within a pipeline. In other words, inside a pipeline. So let's expand Pipelines. And let's click here on New pipeline. And then we are going to get Data flow from the activity options. So as you can see, it shows up just like any other task that we have available. And because we already have our data flow created, we can call it from here. So we have Settings. Then we can pick dataflow1, which is our data flow. And then from here, we can again choose the compute type that we're going to run on. We can pick just General purpose and then the smallest cluster that we have available; we don't need a big cluster. Then we have the options for logging. We will just stick with Verbose. We've got to bear in mind that if you're promoting this to production, it might be a good idea to work with the different options.
The more verbose the option, the more logs you get and the more time it will take to perform all the activities. We have Sink properties: we can run the sinks in parallel if we want. I'll just stick with the defaults. Here is the staging. We don't need staging. This is something similar to what we did for the other pipelines: it's if you want to use, for example, Blob storage to stage your data before it gets to the destination. And here, Parameters: we don't need any parameters at this point because we didn't set any parameters for our data flow. We're just going to validate this. It has been validated. OK, no errors found, so we should be able to publish this, but before that let's give this a name. Then we can publish now. Cool, it has been published. Okay, if you notice, our cluster is still running. We can either run this from Debug or we can add a new trigger. I will just click on Debug at this point. Then our pipeline should start running straight away and our data flow will be kicked off. Cool. We have here our pipeline in progress, and it's now running the data flow. We can see that our pipeline has finished with success. Here is actually the time that it took to run the operations. You may have noticed that it took a little while until it started. But again, this is a cluster that gets spun up behind the scenes. So the first few minutes or seconds you should not really count, because this is a powerful tool meant to work with a massive amount of data. And then once this is running, everything should be really fast. So we can continue and check our database now and see if we have a table. Let's come back to our database, and we still have one table here. Let's refresh. Cool, we have another table in here, and then we can check the data to see if everything is there. Let's select. Great. As you can see, we have 4,000 records, just as we have in our source data.
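The manual check at the end of this run (4,000 rows in the source, 4,000 rows in the new table) can also be scripted. Here is a minimal sketch; it uses an in-memory SQLite database purely as a stand-in for the Azure SQL database, and the table names `source_data` and `sample_data` are assumptions for illustration.

```python
import sqlite3


def row_count(conn, table):
    """Count the rows in one table."""
    # Table names cannot be bound as query parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]


def tables_match(conn, source, sink):
    """True when the sink holds as many rows as the source."""
    return row_count(conn, source) == row_count(conn, sink)


# Stand-in database with a source table and the table the data flow created.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_data (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE sample_data (id INTEGER PRIMARY KEY)")
rows = [(i,) for i in range(4000)]
conn.executemany("INSERT INTO source_data VALUES (?)", rows)
conn.executemany("INSERT INTO sample_data VALUES (?)", rows)

print(row_count(conn, "sample_data"))
print(tables_match(conn, "source_data", "sample_data"))
```

A row count only tells you nothing was dropped; comparing a few column aggregates would be a natural next step if you need a stronger check.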
This could be any data. Just like we did for our datasets using pipelines, we could be working with a different type of connector. It could be Oracle, it could be Parquet files in the data lake. The nice thing here is to see and understand how we could use the pipelines and the mapping data flows to ingest data and transform the data using a Spark cluster without writing a single line of code. This is very powerful. Now, let's see how we can do some transformations and how it's going to perform. Stay tuned. Thanks for watching. 46. Quiz - Module 5: Good. 47. Project Walkthrough - Integrating Azure Data Factory with Databricks : Hi folks, and welcome to another lesson. Now, let's take a look at the document for our project. This document contains a step-by-step of what's going to be required from you if you want to integrate Databricks with Data Factory. Here we have the use case for this. So imagine that you're part of an analytics team that has recently received a huge assignment to analyze crime data for several metropolitan cities. Your team has decided to leverage the capabilities of Azure Data Factory and Databricks to ingest, transform, and aggregate the required data. As a result of this use case, you will understand how to orchestrate data transformations using Databricks and ADF. For this project, we need to have a few things already up and running. In our case, we already have steps one and two, so we can skip them and start directly from step three. If you wish to go ahead and create a new storage account, we have here a step-by-step on how to do that. And then we have to grab the account name and key to use from Databricks. Once we have that, we can create a new container inside of our storage account. So let's just start directly with the Azure Databricks workspace and see how we can create that from the portal. Let's get started. 48. How to Create Azure Databricks and Import Notebooks : I'm going to move here again to the portal and search for Databricks.
We can just click here on the button. Then you're going to select your subscription. I have only one available. Let's create a new resource group. Then I'm going to be required to give a name to my workspace. Then I have to select my region here; it's going to be North Europe. Then I have primarily two different types of pricing tiers. So I have Standard, with Apache Spark, and then I have Premium, which allows me to have more granularity around security and access inside my workspace. Let's stick with Standard. Here we have an option to stick with the default options, meaning we're going to be using public endpoints to connect to our environment. Or if you have a stricter infrastructure, you can select an existing virtual network. So let's just stick with No. Then Advanced: we don't have options to select at this point. Then, if you want to add some tags, this is important, because if you create a tag at this point, whenever a new resource is spun up from the Databricks workspace, you can easily identify it from the tags that get inherited from the workspace. Okay, so let's review. We can create. Great, our workspace is up and running. So let's go to the resource. And then as you can see, we have a button to actually launch the Databricks workspace. Databricks is a product not owned by Microsoft. It's deeply integrated, but not owned by Microsoft. For that reason, we have a service that allows us to connect to the actual Databricks workspace. Let's click on Launch Workspace. And then we can see at first glance what it looks like. It's going to sign me in with my Azure Active Directory account. Great, this is what it looks like. Let's come back here to our documentation. Once this is created, we have here a few other steps to go through. We launched the workspace, and then within the workspace, we can see the left bar, select Workspace, Users, and so on. Let's take a look here at what it looks like. Here. 
Home and Workspace should give you the same set of options. If you select Workspace and go here directly to your username, you have the option to import. You have the option to create a new notebook. You have the option to clone an existing notebook. And if you want to export an existing one, it's from this option here. In our case, we are not going to develop anything at this point. We will just import an existing notebook from the Microsoft GitHub page. So let's import. Going back to the documentation, we have clicked on Import, and then let's go to this URL and download the notebook. Click here, and I'll paste. And it will download for me straight away. Let's go back to the portal. I'm going to browse. Great. I went to my downloads folder and then I selected the file that I just downloaded. And now you can see that this is validated and we can import it. Once you import your notebooks, at first glance you may not see anything straight away, but actually, it has already been imported. You can either select from Workspace or from Home. If you're under Workspace, you can select your user. Then you will see you have a new folder in here. The folder contains the notebooks that we just imported, and it also includes another folder. Then you can see here we have different options. Clicking here on Getting Started. Going back to our documentation, there is a brief explanation for each one of the notebooks. The Getting Started notebook shows you how to set up a new storage account. As we've done it before, we can skip this one. Then the second notebook is the Data Ingestion. It's going to show you, step by step, how you can ingest data from a public account using Data Factory into a storage account. Then the third notebook contains instructions to create connectivity between Data Factory and the Databricks workspace. So let's get our hands dirty and see how this can be done. 49. 
How to Create Azure Databricks and Import Notebooks : Moving back to our workspace, as you can see here, this is just a step-by-step on how to create a storage account and a data factory. Let's skip this notebook because we have already done it before and we have both services up and running. Then if you click here, you have the notebook list. So click here on Workspace again, and then you have Data Ingestion. Data Ingestion shows how we can ingest the public data, and then it gives us a SAS token to connect to a storage account so we can ingest that data. So let's do that from Data Factory. From the homepage we have Copy data. Let's click on that. You can see here, we're following step by step. And then we have the copy wizard that we just clicked. Now let's give a name to our pipeline. Let's select this, and we're going to select Run once now. Next. Then this is the second step. We have to create a new connection. So if you click here on New connection, let's search for storage and click Blob storage. Again, moving back to here, we can select Blob storage and continue. Then we have to give a name to that. And then we have to select the authentication method that we're going to use, the SAS URI. So our name will be PublicDataset. If you click twice, don't worry, it will collapse the whole pane. But that's okay. If you want to see that information again, you can just click back here on the left pane and it will return. So let's copy this, PublicDataset. That's going to be the name of our linked service. Then we have the integration runtime; it's going to be the same. Then we have the authentication method. It's going to be SAS URI. Now we're going to copy this SAS URI, go back to our linked service, and paste it in here. You can see it's already filled in. Then we can test our connection. Testing the connection now. Great, it has been tested successfully. Let's come back to our notebook.
We've done the test of the connection and it finished, so we can finish the creation of our linked service. This is what we're going to select now, then hit Next. Now we have to select our folder, the location of the data inside of the linked service that we just created. Let's move back here to the notebook to get the right location. You can see the specified location is training, crime data 2016. So let's browse. Then we have training. The next part is crime data 2016; that's what we want. We're going to choose this folder. We have to select Binary copy and then go Next. We have to make sure that we have selected Recursively and also Binary copy. You don't need to worry about any compression type at the moment because the data is already Parquet, so it's compressed by default. And we can click on Next. Here you can see that this step is finished, which is the source. Now we have to concentrate on the destination. We have to create a new connection for our destination. Let's use an existing connection that we already have. Since we have created one before, we don't need to create this new storage account. Let's skip this part and come back to our creation here. And we have my first one; let's click on that. We're already connected. We don't need to get the key or anything because we're already connected using managed identity. So we can pick one of the folders, or I can type the folder name that I want, so it gets created at run time. Let's go to the notebook and take a look at the required destination name. Let's copy this. And here in the folder path, let's paste this. And then we don't need to worry about any of this. You can just stick with the default options. Now we just have to go Next, and we will stick with all the default options. Here is the summary page of everything that we just did. It's okay. We will just go Next. And then it's going to create our pipeline. It will do everything for us. Cool, it's finished. So we have a pipeline.
We ingest a public dataset. Let's finish. Cool. Let's go to the pipeline we just created. Select Pipelines, and then we have the lab pipeline. Here you will see that it has some weird names. So let's just go here really quickly and see if everything is fine. Now what we can do is load that data. We can run the pipeline to ingest the public dataset into our data lake. Let's click on Debug and it will start the pipeline. Cool, the data has been transferred. So let's take a look and see what it looks like. Great. We have information on how many files were read, the size of the data, the number of connections, and everything. Great, it looks fine. If we go to our dataset, we should have a new folder now in the data lake. 50. Validating Data Transfer in Databricks and Data Factory: If we go to our dataset, we should be able to validate all the transferred data. So let's highlight our copy task. Then go to our sink. We have our dataset. Let's open the dataset. And then from here we can browse the content of the dataset. So immediately we can see that we have in our dataset the folder that we just created. And then here we have the root level of the folder. Then we have all the data that we saw in the source when we authenticated using the SAS token. So all the files are here. That's great. Let's click OK. And now that we have our data, let's publish this pipeline to make sure it's saved. Great. Our pipeline has been saved. Now let's move back to our workspace and see what our notebook says. We have the Monitor part for the pipeline run that we just did; then we made sure the files had been transferred to the Blob storage. So everything's okay. We have the right folder name. Then we can examine the ingested data. Then we have this part of the notebook. So this is actually where we start working with Databricks to see what's going on with our storage account, data, and so on and so forth. Here, we have to fill in a few options.
Let's start with the storage account name, the one that we're using. So from here, from our dataset, we can check this. If you go to Edit, here we have the name of our storage account. Let's just select it and make sure that everything is set up correctly. Then we can copy the name of our storage account. Now we have to get the key for that storage account. This is also part of our documentation. So if you check it here, you should be able to find the right location where to get the storage account key. Let's go to the portal. Let's search for our storage account; it's the first option here for me. And we have Access keys. Let's click on that. Then I will show my keys, and then I will copy the first option that I have. And then we paste this value here. Then we can save that. But before we run this code, I have to create a cluster, which is actually what's going to run my code; it's the computing resource that is going to perform the operation. To do that, let's do the following steps. Here on the left pane, if you select Clusters, we can see that we have no clusters running at the moment. We haven't created any cluster yet. So we have to create a cluster and attach our notebook to the cluster. Let's do that. I'm just going to give it a name. I will pick Single Node. Then I will pick this node type that we have available, which has four cores and 14 GB of memory. Keep the defaults for most options. Make sure everything is fine, and let's click on Create cluster. Great. Our cluster is now being created. You're going to need a little bit of patience here because it may take a few minutes until it's running. But then, once it's running, you can start running the code using our cluster. You can see here we have Notebooks, which lists all the notebooks attached to this cluster. As you can see, no notebooks are attached yet. Then if you want to add some libraries and so on and so forth, you can go to the other tabs here.
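The storage account name and key entered in the notebook above are typically used to build two strings: the Spark configuration entry that holds the key, and the path to the container. Here is a minimal sketch, assuming the notebook reads from Blob storage over the `wasbs://` scheme; the helper names and the sample values are hypothetical, and the course notebook may wire this up differently.

```python
def account_key_conf(account_name):
    """Spark config entry that holds the access key for a Blob storage account."""
    return f"fs.azure.account.key.{account_name}.blob.core.windows.net"


def wasbs_path(container, account_name, folder=""):
    """Build a wasbs:// path to a container, optionally down to a folder."""
    base = f"wasbs://{container}@{account_name}.blob.core.windows.net/"
    return base + folder.strip("/")


# Hypothetical values; use your own account and container names.
account = "mystorageaccount"
container = "mycontainer"

print(account_key_conf(account))
print(wasbs_path(container, account, "/crime-data/"))

# Inside a Databricks notebook, where `spark` already exists, these would
# be used roughly like this (left commented out so the sketch runs anywhere):
#   spark.conf.set(account_key_conf(account), account_key)
#   df = spark.read.parquet(wasbs_path(container, account, "crime-data"))
```

Pasting the key straight into a notebook is fine for a lab, but for anything shared a secret scope is the safer place for it.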
So let's go back to our workspace and Data Ingestion. Here is our notebook. We need to attach this; as you can see here, it's detached. If we click here, we can now select the cluster that we already have, the one that we just created. So let's click here, and this will attach our notebook to that cluster. And then if you select this cell, you can run this cell and only this cell. You also have the option to run all the cells if you want. But at this point, we just want to assign to the variables the values that we have replaced. So make sure you remove those brackets here at the end and at the beginning, so we have only the values that we want. Then we can run this. Let's click here on the Play button and then Run cell. As you can see, it was pretty quick because it just assigned the values to the variables. We're going to skip this. This is just if you want to see the contents of the Blob storage. We already did this, so we will just skip this part. And then here is where we're going to read our Parquet files, the crime data for New York 2016 and the data for Boston, into our DataFrames. Okay, so let's highlight the cell and run the cell. So we can run this into our DataFrame. Okay, this looks good. Now we have the two DataFrames created and we can actually display the DataFrame. Let's run the cell. Great. We can see the contents of the file. Here again, we can run this as well to make sure that it's fine. Great. So here we have the DataFrames created, one for New York and another one for Boston. You can use different names for your DataFrames. And the next step is the transformation. So we're going to select the other notebook to do the transformation on this data. 51. How to Use ADF to Orchestrate Data Transformation Using a Databricks Notebook: The next step for us here is to start with the data transformations. So you can click right here and we are presented with the third notebook.
It's a descriptive notebook that shows us how we can achieve the data transformation. So the first step is to actually get an access token. And then, as a second step, we are going to create an activity in ADF to connect to the notebook deployed here on Databricks. Then we link the activities together. We are going to publish the pipeline and run the pipeline. Once it is executed, we can validate from Data Factory whether it has succeeded or not. We will then come back to the Databricks workspace and check the execution of our transformations. So let's come back up here and start by creating our access token. Let's click here on User Settings. Then we have Generate new token. Let's click on that. And you must give a name to this token, so I'll just call it lab token. And here's the lifetime of my token. You can change it to any number you want. I will set 30 days and then Generate. It's important that you copy this at this time, because this is the only moment that you will be seeing this token. So let's copy this. I'm pressing Control+C. And then I close this for now. And I'll save my token right here. Once my token is created, let's come back to our notebook. And then you can see that we have to go back to our data factory and create our linked service. Let's go back to Data Factory. Let's grab a Databricks activity here, and then we will select Notebook. Then from here you can see we have a few options. We can select a linked service, which will be the workspace that we are working on at the moment, and the Settings, where we can set the location of our notebook. So let's start by creating a new linked service. You can give it any name you want. I will just keep it as AzureDatabricks. I'm going to be using the Azure integration runtime. Here you can grab the workspace from the subscription. I'm going to select my subscription, and then I will select my workspace. At this point, you should see your workspace right here.
We have three options of clusters. We have the option to select a new job cluster, or we can use an existing interactive cluster or an existing instance pool. Since we have already created our cluster, let's select Existing interactive cluster. Our authentication type will be Access token. We will use the token we just created from the Databricks workspace. We'll just copy this token in. Then we should be able to see the cluster that we created. And it's right here. So let's select this and let's test our connection. Let's click on Create. Great, it's been created now. So we can check here the step-by-step of what we just did for the linked service. And then we have the location of our notebook. So this is the Settings tab. We have to select Users, and then your user account. And then we have to actually point to the notebook instead of only the folder. So what you can do here is go back to Workspace and click on includes. From here, if you highlight this option, you have the whole directory where you are at the moment. So if you just highlight this and copy, you can get the path as requested. Let's come back. Okay, since we have this information, let's go back to Data Factory and paste this information right here. We have the location of our notebook. Let's go to Base parameters. At this point, we will need to add a few parameters that will be passed to our notebook as arguments. So this is quite important. As you can see, we have this information here as well. So we're going to need the account name, account key, and container name. We can get this information from my storage account, from Properties in the portal. Then here I will add a parameter, and I will add a second and a third. Now, our second parameter will be the account key, and the third will be the container name. Now let's get the values for those parameters. Here in the portal, I've searched for the storage account. We want the Access keys. From here, you have two pieces of information that we need. Show keys, copy the key.
Then we can paste the account key right here. Now we can get the account name and paste it right here. And then we need the container. Our container is DWT. Let's paste it here. Great, we have all the information that we need. What we're going to do now is chain those two activities together, so we make sure that we have the correct order, meaning we will first import the data and then process it. Once this is done, we can finally publish our pipeline. So let's click on Publish to make sure everything is saved. Our pipeline is now published. Now we will execute this pipeline. Let's click Trigger now and follow the execution of our pipeline. If we go to Monitor, we should see a new run in progress. Let's click on the pipeline. Okay, our pipeline failed. So let's take a look and see if we can troubleshoot that. Let's click on the error. We can see that there is an error with our path: it must start with a forward slash. If we go back and pay attention to the documentation, the path should start with /Users. If you copy it straight from the notebook itself, you're going to get Workspace at the front, and that should not be included. Let's go back to our pipeline. If we check what we have, we see that Workspace is included, so let's delete that. Let's publish our pipeline again. Great, it's been published. Now let's run it again and see how it looks. Let's move to Monitor and click on the run. As designed, we first copy the data and then execute our notebook. It has succeeded, and we can check the output of our notebook. Let's click here. We have a bit of information here, such as the run ID, and then the destination of our data and what was executed. That's pretty cool. Another interesting thing to notice is that if you hover over the little glasses icon here, we get a bit more information. 
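To make the path fix and the base parameters from this step concrete, here is a minimal Python sketch. It is not the exact definition used in the lesson: the activity names (`TransformData`, `CopyData`), the parameter keys (`account_name`, `account_key`, `container_name`), and the example path are assumptions for illustration, and the linked service reference is left out. It cleans up a path copied from the Databricks workspace and builds a dictionary shaped like a Databricks Notebook activity that runs after the copy activity succeeds.

```python
import json


def normalize_notebook_path(copied_path):
    """Return an absolute notebook path without a leading 'Workspace' segment."""
    segments = [part for part in copied_path.strip().split("/") if part]
    if segments and segments[0] == "Workspace":
        segments = segments[1:]  # the segment that should not be included
    return "/" + "/".join(segments)


def build_notebook_activity(name, copied_path, base_parameters, depends_on=None):
    """Build a dict shaped like an ADF Databricks Notebook activity definition."""
    activity = {
        "name": name,
        "type": "DatabricksNotebook",
        "typeProperties": {
            "notebookPath": normalize_notebook_path(copied_path),
            "baseParameters": dict(base_parameters),
        },
    }
    if depends_on:
        # Chain the activities: run the notebook only after the copy succeeds.
        activity["dependsOn"] = [
            {"activity": depends_on, "dependencyConditions": ["Succeeded"]}
        ]
    return activity


# Hypothetical names and placeholder values; replace them with your own.
activity = build_notebook_activity(
    name="TransformData",
    copied_path="Workspace/Users/someone@example.com/transform",
    base_parameters={
        "account_name": "<storage account name>",
        "account_key": "<storage account key>",
        "container_name": "<container name>",
    },
    depends_on="CopyData",
)
print(json.dumps(activity, indent=2))
```

The printed JSON can be compared with the code view of your own activity in Data Factory; the point is only that the notebook path starts with a forward slash followed by Users, and that the three base parameters travel with the activity.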
If you select the notebook activity, we can go to the run page URL, which helps by directing us to the workspace. Let's click on that, and we can check the activity that was executed from the pipeline. Here we can see how long each of the cells took to finish. We are creating the DataFrames, normalizing the DataFrames for each city, applying the other transformations, creating a single DataFrame, and finally exporting the prepared data to a persistent table. From here we already have a table. As you can see, the status came back as OK, and the data persisted as a table. So let's come back to our documentation. As you can see, we have just verified the execution of our notebook. As a sixth step, we can take a look at the data using this notebook. If you notice, this notebook is detached, so you have to attach it to the cluster before you run it. Let's click here, and I'm going to attach to my first cluster; it's now attached. What I can do is query the table homicides, the one that I just created from the transformation. We can run this cell. Great, we have the output of the table. This is just a little exploration; we could also check the other tables if we want, for example New York or any other table. Let's come back here to homicides again. Since we are running Spark, this is quite easy for us, because you can just use the programming language that you're most familiar with. In this case, we are running SQL to get the same output as the cell above. Let's click on Run cell. Great, we have the output of our data. In this case we are limiting the output to the first 20 records. That's pretty cool. Then, if you come back here, we can do some aggregation with that data. This is the transformed data, not the raw data. Let's click in here again and run this cell. We are getting it into a DataFrame again. Let's display this DataFrame: let's click here and run the cell. And what do you know? 
We have a visual of our aggregation. That's the power of Databricks and Data Factory together. Once you load the data with Data Factory, you can create all these transformations with Databricks and integrate the two together quite easily. Congratulations, we've just completed the lab in our project. In our next module, we will see how we can deploy to production all that cool stuff that we just did, using a CI/CD pipeline from Azure DevOps. Thanks for watching. I'll see you soon.

52. Quiz - Module 6: Good.

53. DevOps - How to Create an Azure DevOps Organization and Project: Hi folks, and welcome to another lesson. In this lesson, we will see how we can create an Azure DevOps organization and a project, so we can start creating our repositories and pipelines. Here on the portal, you can search for Azure DevOps, then select Azure DevOps organizations. Down here you can see My Azure DevOps Organizations. Let's click on that. You will be prompted with this page. I'm going to blur some of the information here. Then let's click on Create new organization. You can give a name to your organization, and select a region close to your physical location. In my case, I will select UK South. All right, folks, now that you have created your organization, it's time to create a project. In simple words, the organization is the parent level of a project, and when you create repositories and pipelines, those items belong to a project. So let's start by creating our first project and name it ADF. It will be private; that's OK. Then click Create project. This is the first view of our project. As you can see, it comes empty, and most of our tasks will be done from the left pane here, under Repos and Pipelines. In our next lesson, we will see how we can create a repository that will hold all the JSON files from our Data Factory. See you soon.

54. 
DevOps - How to Create a Git Repository in Azure DevOps: All right folks, this is the time to create our first repository to store the code from our data factory. On the left pane, let's click on Repos. Here, as you can see, it comes with a default repository with the same name as our project. In this case, it's not initialized yet; there are no files at all, so that's why we don't see anything. But let's create a new repo and see how we can initialize it. Let's click here on the drop-down menu and click on New repository. Let's give it a name; I will call it ADF repo one. Then we will add a README file to initialize the repository. Let's click on Create. Great, our first repo is created. If we go to Branches on the left side, we can see that we have our default branch created. Until last year, we would have had master as the default branch name. Microsoft has started to change those default names to main instead of master. That's okay; let's work with main and then see how we can connect our Data Factory to this repository. This is all for now. See you in the next lesson.

55. DevOps - How to Link Data Factory to Azure DevOps Repository: Hi folks, welcome to another lesson. In our last lesson, we saw how we can create a repo. Now, let's connect our Data Factory to this repository. For that, let's come back to the portal and go back to our data factory. Now that we're here on the home page of our Data Factory, you may have noticed that we have something called Set up code repository. Or, if you go to Manage, we also have Git configuration. So let's come back to the home page and click Set up code repository. From here we have the types of repositories available. Since we created a Git repository in Azure DevOps, let's click on Azure DevOps Git. Then we have our directory, and then the DevOps account; this is the organization that we created. Then the project name, which will be ADF. 
Then we have to either create a new repository or use an existing one. Since we already created ours, let's use the existing one and select ADF repo one. Then we have the collaboration branch. In our case, we want to select main, which is our default branch. Then we have what's called the publish branch. This is a branch used by Data Factory; we will go into more detail on that in our next lessons. Then this is our root folder. This is the location where Data Factory will put folders and try to read folders from, so let's just stick with the root level. And this option here is quite important at this point. Since we've been doing a lot of demonstrations and creating pipelines, we already have items in our Data Factory, and this option imports those items into our repository, so let's make sure that it is selected. Then this is the location where we will import the existing objects from our data factory: the branch. We will use the collaboration branch, which in our case is main. That's all that's required. Let's click Apply and see if we can link our Data Factory to the repository we just created. Great, our repo is now connected. If we select Author on the left pane, we can see that we have a new item here, which is the working branch. From here, you can work in different feature branches. You can create new branches for you and your peers, so no one overwrites anyone else's work and you avoid a lot of mistakes. I will click on Create new, put my name, and save. As you can see, I'm creating a new branch from main, which contains the items that we imported. From here I can work on any task that I want and it won't affect the main branch. That's all for now; we've seen how to connect our data factory to the repo. In our next lesson, we will see how we can work with branches. Stay tuned. See you soon. Thanks for watching.

56. DevOps - How to version Azure Data Factory with Branches: Hi folks, welcome to another lesson. 
Azure Data Factory can be considered the primary ETL service to push, pull, and transform data in Azure. The objective of this lesson is to explain how releases between environments, for example development and production, occur, and how we can leverage the repository that we created to version our code. The image that we have here depicts the life cycle of ETL pipelines deployed in Azure Data Factory with a Git repository. The first steps, one and two, are for the sandbox, because this is the moment when the Data Factory is associated with a Git repository. Then the developers can start working in feature branches. Let's suppose that John, a developer, is working on a new feature. He creates a new branch. Once he's happy with all the changes, he creates a pull request to merge this code into the master branch. The master branch, in our case, as we saw before, is the main branch; the naming convention has changed. For this code to be merged from his branch into the main branch, it has to be approved. Usually, there are policies that prevent developers from merging directly without an approval. This is the moment when the code is reviewed, and if everything is okay, the code gets merged. Finally, as you can see here, we have our code in our main branch. When we associate the data factory with a Git repository, a new branch gets created behind the scenes, but it's not visible yet. As you might remember from our last lesson, we only had one branch, the main branch. When we click Publish in Data Factory (we are going to go through this visually in our factory), the code from the main branch goes to a branch called adf_publish. At this point we can see the branch in the repository, as you can see here. This is something that's already in Data Factory. 
Some of the steps are done manually by the developer, and others happen as the result of an action, for example when you click the Publish button. Once the code is on the adf_publish branch, it is ready to be sent to production. On the Azure DevOps side of things, this is where we create the release pipeline to pick up this code, wrap it, and package it up to be released to production. All the elements that we have inside Data Factory, for example pipelines, linked services, and data flows, every single element, are created as JSON. So this is our code, and this is what gets packaged up here to be deployed. Then we have the release. A release can be manual or automatic; it depends on how you create your release in Azure DevOps. And this is when the code gets sent to the production data factory. It might sometimes look very complicated, with a lot of steps, but looking at the diagram, it's quite easy to understand once you have done it once or twice. So let's move back to Data Factory and see how this works visually from the Data Factory portal and also from Azure DevOps.

57. DevOps - Merging Data Factory Code to Collaboration Branch: Okay guys, now I'm here on the portal. When it comes to CI/CD deployments, the only environment that we associate with a Git repository is the development one; all the other environments just receive the code that we publish to the Git repository. Okay, so now moving to our data factory, we have the Data Factory that we've been working on so far. We have our pipelines in here and I have my branch; this is my feature branch. Okay, let's have a look at our diagram and see how we do this step by step. In our case, we have already done steps one and two, because we already associated the Git repository, and we have our pipelines created in our feature branch. Now we have to create a pull request to merge our code into the master branch. 
Let's have a look at our repository on Azure DevOps to see if we have code. Here I have my Azure DevOps open. I'm going to open the project that we created. Then I will check my repository here on the left side. This is the repository we created. As you can see, I have a few items here; these are from the commits that we did before. We have our main branch. I don't have a branch called adf_publish yet. If you remember, this is the branch that gets created automatically. So at this point we have some code here in our main branch. Let's create a pull request to move what we have in our feature branch, my own branch, into the main branch. From here, you can click on the drop-down menu and click on Create pull request. It's going to redirect me to Azure DevOps, and from there I can create a pull request. But as you can see, since we already did the initial import, there is no change for me to merge into the main branch. So we're going to add something in Data Factory just for a change to be picked up here, and then we can see how to create a pull request and merge. Let's come back to the Data Factory. We have our pipelines, so I'm just going to clone one of them. Just click here on the actions menu and we can clone. So we have a clone of our pipeline, with "copy" at the end of the name. Now a change has been picked up. What we can do is Save all, so I'm saving against my branch. Publish is grayed out here; I cannot publish anything. This is one of the reasons that we can only run new pipelines in debug mode, because my code hasn't been published yet. Now let's create a pull request to merge my copy into the master branch. If I change from my branch to the main branch, you can see I don't have the pipeline that I just cloned, because it hasn't been merged yet. So my feature branch is ahead of my main branch. 
Well, let's click here again and create a pull request. It's going to redirect me again. As you can see, it brings me to this page for creating a pull request, so I can add a description or anything else I need, but usually it picks up the changes that I made. So I will go ahead and click Create. I have a new pull request here, and it is active; this is the status. If you click here again on Pull requests on the left side, you can see that you have Mine, Active, Completed, and Abandoned. So I will go ahead and approve my request. Usually, you wouldn't have this permission in an enterprise environment. Let's suppose someone else has just approved it; then it's my duty to come here and complete it. So I'll go and complete the merge. It's merging the pull request. Great, it's now merged. Let's come back here and check if we now have this in the main branch, and here it is. This is how we create a pull request, the third step of our workflow. Now that we have our code here, it's time to publish, so we can actually publish this code to Data Factory. If you want to run the pipeline with a trigger, or if you want to deploy this version of the code to the production environment, this is the moment; this is where you can do that. So let's go ahead and click on Publish. Remember, you have to be on your main branch. One interesting thing to notice before I go ahead and publish is that my branch has disappeared. This is because of the type of merge I did. During the merge, there was a set of check boxes, and one of them asked whether you wanted to delete your branch. It's good practice, so you always start clean. You can see I'm on the main branch and Publish is not grayed out anymore, so I can go ahead and publish my code. It's going to go ahead and publish the code. 
And then it's going to give me a summary. I have only one change here, so I'll go ahead and click on Publish. Okay, my publish has succeeded. As you can see, it now pops up "Generating ARM templates". This is the moment when my adf_publish branch gets created, and all my pipelines and all the other JSON files that make up my data factory get packaged up. Okay, now moving back to the repository, we have main. Let's refresh this page. Great. Now, as you can see, we have adf_publish. With that, we can finish our workflow here, because our code is ready to go. We are happy, and this is the version that we want to go live. Next, we will see how we can create a release pipeline to publish this code to the production Data Factory. Stay tuned and see you soon. Thanks for watching.

58. DevOps - How to Create a CICD pipeline for Data Factory in Azure DevOps: Hi folks, welcome to another lesson. In this lesson, we'll see how we can use Azure DevOps to create release pipelines. Releases on Azure DevOps is the primary tool used by various departments to deploy Data Factory objects from development to a QA environment, a production environment, and so on and so forth. Now, let's move to our Azure DevOps account. I am on the home page of my ADF project. To begin creating our release pipelines, let's come back here to the left pane. As you can see, we have what's called Releases. Let's click on that. Great. It is empty because we haven't created anything yet. So let's click on New pipeline. Great. As a first step, it shows me my artifact and stage. It is all grayed out because I have to select where I'm getting the code from and what tasks I want to use. Here you have a number of templates out of the box to choose from. So if you're working with application development, it's very likely that there is an option here for you out of the box. 
We will just go ahead and click here at the very top on Empty job, and then we will do it ourselves. I'm going to close this one here as well. We have Add an artifact; this will be our adf_publish branch from the repository. Let's click on that. Here you have different options for getting your code. It could be a build. Then you have repositories: if you work with GitHub or other integrations, you also have those options. In our case it will be Azure Repos Git. Let's select our project, which will be ADF. Then this is the source. The source is the repository; this is the one that we created and saved our code in. Then we have the default branch. This is the final location of our code, so we have main and adf_publish. In this case, we will select adf_publish. Once again, we are at this stage here and we are getting the code for our artifact from adf_publish. Okay, we select adf_publish, and then we want the latest version, the latest from the default branch. From here, you can change the source alias if you want; I usually just stick with this option. Just click on Add. Great. This is the location of our code. This is what we will publish to our environment. Here are the stages. You can have multiple stages, and each stage is an environment. For example, if I want to change this to production, I can give it any name and keep adding more and more. Once you have done one, you can clone it and reuse the tasks that you created. In our case, we're going to just stick with prod, because we only have one production data factory created. Now is the time to create the tasks. Azure DevOps gives you an agent for free, meaning you have a compute resource behind the scenes that will process your code. We will stick with all these options. As you can see, we have the Azure Pipelines agent pool, and we're given an agent running Visual Studio 2017 on Windows 2016. We can stick with the other default options. 
Then we're going to click here on the plus sign to add a new task. The task we want is called ARM template deployment. Let's click here twice, one, two. Why two? Because we want to validate our JSON first; we don't want to just deploy straight away and get a failure. One of the options here is to do a validation instead of a deployment. Let's start filling in all the options. This is the connection: we have to create a connection, meaning we have to connect to our subscription. This will be my default subscription. Then I have to authorize. Great, it authorizes my Azure DevOps to connect to my subscription. Then I have the subscription. Then I have the action, Create or update resource group; we don't want to delete a resource group. Now we have to pick the resource group we want; this is the second option. And we have the location. Again, this needs to be a location close to you or where you developed your factory. Then we have the template. This is what we want to publish. So let's browse here, and then we have the repo, and then a folder with the name of the Data Factory. Then we have the ARM template for the factory; this is what we want. This is the code that goes to production, and it's already picked up. Then the template parameters. Let's select the parameters file for the factory. Once we have selected our parameter file, this is the deployment mode. This is what I mentioned to you before. We have Complete and Incremental. Complete deletes anything that is not part of your ARM template. This is our ARM template and it only contains the data factory, so Complete can delete all the other contents of your resource group. So this is dangerous; be careful to select Incremental when you want to deploy. Since this is our first task, we are going to select Validation only. So that's great. 
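To see why Complete mode is the dangerous one, here is a small Python model of the three deployment modes. It is a simplified sketch and not how Azure Resource Manager is implemented: it only tracks resource names, and the resource names themselves (`adf-prod`, `storage-account`, `databricks-workspace`) are hypothetical.

```python
def resources_after_deployment(existing, template, mode):
    """Model which resources remain in a resource group after a deployment.

    existing: names of resources already in the resource group
    template: names of resources declared in the ARM template
    mode:     "Validation", "Incremental", or "Complete"
    """
    existing, template = set(existing), set(template)
    if mode == "Validation":
        # Only checks the template; nothing in the resource group changes.
        return existing
    if mode == "Incremental":
        # Adds or updates what the template declares and leaves the rest alone.
        return existing | template
    if mode == "Complete":
        # Anything that is not part of the template is deleted.
        return template
    raise ValueError(f"unknown deployment mode: {mode}")


# Hypothetical resource group that holds more than just the data factory.
resource_group = {"adf-prod", "storage-account", "databricks-workspace"}
arm_template = {"adf-prod"}

for mode in ("Validation", "Incremental", "Complete"):
    print(mode, sorted(resources_after_deployment(resource_group, arm_template, mode)))
```

In this model, Complete leaves only the data factory behind, which is the situation the lesson warns about when the template contains nothing but the factory.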
This is our first task completed. Then we can go to the second one and give it a name. Here we're going to stick with the same options that we already chose. Since we already authorized, we can pick from the service connections. Then we select our subscription. Our action will be Create or update resource group. Then we select the same resource group we selected before, and we have to pick the location here. Then we have to select the template again, and now the parameters. This one is going to be Incremental. And that's all. We're going to save this and run our release. Let's see how this can be done.

59. DevOps - How to Execute a Release Pipeline in Azure DevOps for ADF: Great. Now that our pipeline is ready, let's do a quick review. In the pipeline we have, again, our artifact, which is our code. This is the stage; this is the location where we want to deploy our code. Then we have this little sign here. As you can see, this is to tell whether we want to use the continuous deployment trigger. By enabling this, we say that whenever there's a new push to our branch, I mean the adf_publish branch, it will trigger this pipeline and deploy the code. Okay, let's stick with continuous deployment. Then, if you click here, this is the place where you add approvers for the stage. Let's say you want pre-deployment approvals from specific people. You toggle this option and type the names of the people here, and then the code is only deployed if it's approved. Okay, let's close this; we don't want that for this example. Now let's run our pipeline and see if our pipelines arrive in production. Before we run, we have to save the pipeline, so let's save. Now we can click on Create release. Let's go and click here on this hyperlink. 
Okay, so if this is a new Azure DevOps organization, you might get an error message at this step. Let's explore it. If you click here, you can see that we don't have any hosted parallelism. This means that we don't have any free compute resource available to our organization. To get that, you can request one for free from Microsoft by sending an email to this address here with the name of your organization. Your organization is just the name of the root level of your Azure DevOps, so you can come here, copy that, and send the email to that address. You should then get free compute for running your pipelines. Okay, at this point I have already sent an email to Microsoft, and my organization is fully set up, so I can continue and run the pipelines that I need. Actually, I didn't get any answer from Microsoft; I just came back here to the portal and tested to see if it was working, and it was all set up. It took roughly two days before I came back and tested. So let's go ahead and click on ADF. Okay, just so we have clear in mind what we're trying to achieve: we're going to move our pipelines from the dev environment to the production environment. You may remember, this is the data factory where we created the linked self-hosted integration runtime, and it's empty. It's important that you use the same data factory, because there is a permission associated with this Data Factory. If you just pick another Data Factory, you will probably get an error because the permissions are not configured correctly. Okay, let's go back to our Azure DevOps and click on Pipelines and Releases. Here we still have the same status. Let's click on Edit, because we still need to make sure we publish our pipelines to the right target. Here we have the task, which is the deployment. 
Here we are validating the ARM template, and here we are using the ARM template to publish somewhere else. However, if we don't change the parameters that come with the JSON, it will just try to deploy to the same data factory, because the exported ARM template contains all the parameter values from the dev environment. So it's important that you change them. Let's click here on Override template parameters first. You can see that Azure DevOps brings us all the parameters and values already filled in, which is pretty cool. Here goes the target of our deployment, the new data factory. It's the same name, but with a different number at the end. This is the location where we're going to deploy our pipelines. Okay, so let's click OK. You can see it automatically fills in all the parameters here. If you want, you can just select everything and copy. Since the validation task and the deployment task are replicas of each other, we can just paste into here, or you can click on the ellipsis here and repeat the steps. It's up to you really; it's just to save time. You just have to make sure that this one is Validation only, and that here you are deploying incrementally to your target resource group. So let's click Save and run our release. Let's click on Create release. Okay, let's click on the hyperlink here. Now our release is queued. Let's click on Logs so we can follow what's going on. The first two steps are usually fairly quick, because it's the compute starting up behind the scenes and defining some environment variables. Then it downloads the code locally to the compute. Then it's going to validate the ARM template; we're just checking the JSON, not deploying anything. And here it's deploying the code to our target. The deployment has completed successfully, and we can validate the data factory. Let's click here on 02 and check the pipelines. Let's refresh. We have our pipelines here in our target environment. 
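The override step above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's own logic: the factory names (`adf-demo-01`, `adf-demo-02`) and the second parameter (`exampleParameter`) are made up, and the sample JSON only mimics the usual layout of an ARM parameters file, with a `parameters` object whose entries each hold a `value`. The function reads the dev values and produces an override string in the `-name "value"` form that the override box is filled with.

```python
import json

# Hypothetical parameters file as exported from the dev data factory.
DEV_PARAMETERS = """
{
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": {"value": "adf-demo-01"},
    "exampleParameter": {"value": "unchanged"}
  }
}
"""


def build_override_string(parameters_json, overrides):
    """Return '-name "value"' pairs, replacing dev values where overrides exist."""
    parameters = json.loads(parameters_json)["parameters"]
    parts = []
    for name, entry in parameters.items():
        # Keep the dev value unless the target environment needs another one.
        value = overrides.get(name, entry.get("value"))
        parts.append(f'-{name} "{value}"')
    return " ".join(parts)


# Point the deployment at the target factory instead of the dev one.
print(build_override_string(DEV_PARAMETERS, {"factoryName": "adf-demo-02"}))
```

Without the override for `factoryName`, the string would still carry the dev factory name, which is why the release would try to deploy back to the same data factory.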
We can double-check between the two. Remember, 02 was empty. And now in this one here, 01, which was our source, the pipelines and datasets all match. We have a complete copy of our dev environment, so we can be assured that the code was validated and deployed to the production environment. That's OK. I hope you have enjoyed this demonstration. Thanks for watching. See you next time.

60. Quiz - Module 7: Good.

61. Wrap-up: Congratulations on completing this course. I'm very happy to see you made it to the end. There was a lot to go through, but you made it. I hope you have learned a little something that can help make a difference in your work life. My recommendation for the next steps would be to read the docs. There's a lot of good content out there, especially published by Microsoft on GitHub and on the Microsoft Learn platform. Also, remember, you can only fix what you learn by practicing. So try stuff out and experiment. This way you can get better at authoring ADF pipelines. I really hope you have enjoyed this course, and please leave a review if you can. It means a lot to me to see you guys liked it, and I hope to see you soon in another course of mine. All the best. See you soon. Thanks for watching.