Azure HDInsight for Beginners | Eshant Garg | Skillshare


24 Lessons (1h 32m)
    • 1. Hadoop Overview

    • 2. Why we need Distributed Computing

    • 3. Two Ways to Build a System

    • 4. Introducing Hadoop

    • 5. Hadoop vs RDBMS

    • 6. Hadoop Ecosystem

    • 7. Hadoop Summary

    • 8. HDInsight Overview

    • 9. Why Hadoop is Hard

    • 10. HDInsight Makes Hadoop Easy

    • 11. Important aspects of HDInsight

    • 12. HDInsight Cluster Types

    • 13. HDInsight Architecture

    • 14. HDInsight Demo Overview

    • 15. Create Azure Data Lake Gen 2 Storage

    • 16. What is Managed Identity

    • 17. Add Managed Identity to Gen2 and Database accounts

    • 18. Create HDInsight Interactive Query Cluster

    • 19. Ambari overview and UI

    • 20. Ingest dataset into Data Lake storage

    • 21. Data Extraction with Hive

    • 22. Data transformation with Hive

    • 23. Data Export with Sqoop

    • 24. Summary






About This Class

In the first section, I will cover basic concepts and understanding of Hadoop.

I will only cover what is essential to understand Azure HDInsight. If you are already familiar with Hadoop, please feel free to skip this section.

We will also build a fundamental understanding of the three main building blocks of Hadoop: HDFS, the Hadoop Distributed File System, for storage; the MapReduce programming model for processing; and the resource negotiator YARN for cluster management.

We will also take a quick look at other components of the Hadoop ecosystem.

In this module, we'll try to understand why Hadoop is so difficult to implement even though it is so useful in big data analytics, and what common challenges we face with it.

In section 2, we will discuss how HDInsight easily overcomes those challenges. We will discuss some very interesting aspects of HDInsight, understand its architecture, and go through a simple demo where we fetch data from Data Lake, process it with Hive, and later store the data in SQL Server.

By the end of this module, you'll understand HDInsight basics, how it relates to Hadoop, and specifically how we can use HDInsight to perform data processing tasks.

Meet Your Teacher


Eshant Garg






1. Hadoop Overview: Hello, friends. My name is Eshant Garg. This section covers basic concepts and an understanding of Hadoop. Hadoop is a big topic and requires a full-length course in itself, so in this module I will only cover what is essential to understand Azure HDInsight. If you're already familiar with Hadoop, please feel free to skip this section. We will discuss why our traditional systems can't handle the data requirements of today's world and how Hadoop is different from relational database management systems. We will also build a fundamental understanding of the three main building blocks of Hadoop: HDFS, the Hadoop Distributed File System; the MapReduce programming model for processing; and the resource negotiator YARN for cluster management. We will also take a quick look at other components of the Hadoop ecosystem. Thank you. 2. Why we need Distributed Computing: In this video, we will try to understand why we need a distributed computing environment. The amount and scale of data that we are dealing with in today's world requires one. We will see the role that Hadoop plays in a distributed computing setup, and also take a brief look at technologies which work along with Hadoop. It's often helpful to have numbers in order to get a sense of how big "big data" really is, so let's look at Facebook and Google and see how much data they're dealing with. Facebook has 2.4 billion monthly active users and ingests an enormous volume of data every single day. Facebook now sees over 100 million hours of daily video watch time, and users generate four million likes every single minute. Now, what about Google? As you can imagine, the amount of data stored by Google beats Facebook's. Google stores exabytes of data; Google's core job, after all, is to download and index the entire Internet, so this shouldn't be very surprising.
There are more than four million searches happening every minute on Google. For comparison, there were only 10,000 searches per second in 2006, and now there are close to 65,000 searches per second — more than six times as many. This clearly shows how the Internet continues to change the way we live. Google has almost four million apps on the Google Play store, and 300 hours of video are uploaded to YouTube every single minute. Google not only constantly downloads new web pages but also crawls the updated pages for search results. These are the huge datasets both of these organizations deal with — datasets which would have been unthinkable even 10 years ago. Now, if you think about a dataset like this, do you really think a single machine would be able to service it? However powerful that single computer is, it just cannot deal with data at this scale. Let's see what it is that we require from a big data system. It should be able to store massive amounts of data, because the raw data itself is huge — so storage is the first thing. After storage, we should be able to extract useful information from these millions and billions of rows of data in a timely manner — so we need processing power. And finally, there is a third requirement. Look at the trend in data usage by companies: as we just discussed, Google searches increased from 10,000 to 65,000 per second over the last decade. Let's take one more example, LinkedIn: LinkedIn grew from 37 million users in the first quarter of 2009 to over 600 million users today. To keep up with data that grows at this rate, you need a special kind of system. The infrastructure itself should be flexible, so that as the data grows you can increase storage and computing capacity without completely reworking your system. So the third requirement is scale: storage and processing capacity are useless without the ability to grow them.
As the data grows, big data system requirements cannot be met by traditional technologies — they don't cut it anymore. In order to get all the characteristics we require from our system, you need a distributed computing framework, and this is where Hadoop comes in. 3. Two Ways to Build a System: Any system which performs computations for us can be built in one of two ways: as a monolithic system, where everything is on a single machine in one big chunk, or as a distributed system, where there are multiple machines and multiple processes which run on those machines. In exactly the same way, there are two ways to build a team — for example, a football team. Your team can have one star who does all the heavy lifting, or an entire team of good players, none of whom are stars, but all of whom work together really well and are really good at passing the ball. This is exactly the difference between a supercomputer and a cluster of machines. The monolithic approach is the equivalent of one star player that you rely on to do everything. A monolithic system is typically a single, powerful server, and if you want more computing resources and capacity from that server, you might be tempted to pay twice the expense to improve the performance of the machine — but you will get less than two times the performance. The performance of a single machine does not scale linearly with cost. A distributed system, on the other hand, is the equivalent of a team of good players who know how to pass the ball and work together. These distributed systems are typically made up of commodity hardware which can be used to parallelize the task. Here, the onus of improving performance does not lie with one single machine; it lies with all the machines working together. The players on this team — the individual machines — are called nodes, and the entire system is called a cluster.
No individual node in this cluster of machines is a supercomputer. They all tend to be cheap machines, or commodity hardware, but they come together as a team to solve your data processing needs. Something that is axiomatic today, but was not really well known 15 or 20 years ago, is the fact that a cluster of machines can scale linearly with the amount of data you have to process: the capacity of such a cluster increases with the number of machines in it. If you double the number of nodes in the system, you get twice the storage. That's pretty obvious — each machine comes with its own hard disk, and doubling the number of machines doubles the storage capacity. What makes such a distributed system cool is that you can also achieve computational capacity, or performance, that is twice what you originally had: when you double the number of machines, you get nearly twice the performance — twice the speed. A distributed system therefore satisfies the third critical requirement in big data processing: keeping up with the ever-increasing amount of data we have to process at any point in time. This is the reason that companies such as Google, Facebook, Amazon, and Microsoft build vast data centers populated with hundreds of thousands of machines. In these server farms, the actual processing of data happens in parallel across all these machines. Now we need something to make all these machines work together to solve a single problem. These servers need to be coordinated, and it is software which performs this coordination — a special piece of software which takes care of the needs of a distributed system: things like partitioning data, so that data is stored across multiple nodes in the system; protecting data against loss by replicating it on more than one node; and coordinating compute tasks, where a task runs in parallel on multiple machines.
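Two of those coordination duties — partitioning data across nodes and replicating it for fault tolerance — can be sketched in a few lines of Python. This is only a toy model; real systems such as HDFS use far more sophisticated placement policies, and the node count and block name below are invented for illustration (a replication factor of 3 happens to match HDFS's default):

```python
import hashlib

NUM_NODES = 4           # size of our toy cluster (arbitrary)
REPLICATION_FACTOR = 3  # HDFS's default block replication is also 3

def place_block(block_id):
    """Pick REPLICATION_FACTOR distinct nodes for one data block.

    A hash spreads blocks evenly over the cluster; replicas go to the
    next nodes in line, so losing any single node never loses the block.
    """
    primary = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % NUM_NODES
    return [(primary + i) % NUM_NODES for i in range(REPLICATION_FACTOR)]

placement = place_block("file1-block0")
print(placement)  # three distinct node indices, e.g. [2, 3, 0]
```

A client that needs the block can then read it from any of the returned nodes, which is exactly the availability argument made above.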
This software ensures that all these processes run through to completion successfully and store their results in the right place. It also needs to handle fault tolerance and recovery. Each individual machine is not a great system by itself — it's commodity hardware, which means it's prone to failure. Disk failures, node failures: all of these need to be handled in order to run a process. The software also needs to ensure that a particular node has the resources required for the process to run, in terms of memory, compute capacity, disk space, et cetera. Distributed computing — coordinating so many machines to achieve one objective — adds huge overhead and complexity to any processing task. Individual developers who work on a distributed system cannot focus on the nitty-gritty of actual cluster management; that's where the software comes in, to abstract developers away from those details. 4. Introducing Hadoop: As we discussed in the previous lesson, working with distributed systems requires special software whose explicit purpose is to coordinate and manage all the processes and machines within the distributed system. Back in the early 2000s, as Google web search was getting more and more powerful, Google realized that traditional software and traditional methodologies would not work at the scale of web search. That's when they realized they would have to build something of their own to manage all these processes running on hundreds of thousands of machines. Google developed proprietary software to run on these distributed systems. What were the objectives of this proprietary software? The first was to store millions of records across multiple machines — keeping track of which record existed on which node within the data center was one part of the problem.
The second part was how to run processes across all these machines and coordinate them in such a way as to achieve a single objective. This was the most important problem that Google engineers worked on back in the day, and they came up with two systems to solve it. One was called the Google File System and the other was called MapReduce. The Google File System was responsible for storing data in a distributed manner — a file system which existed not on a single machine, but on the multiple machines which made up a cluster. This solved the distributed storage problem. MapReduce was used to solve the second problem, distributed computing: running processes in parallel across multiple machines to take advantage of the inherent parallelism in data crunching tasks, and bringing the results of those processes together to achieve something useful. Google engineers developed these technologies and then published papers describing how they help in distributed computing. These papers were of course available to other engineers who had been working on similar systems. Engineers working on a distributed computing framework adopted the principles in Google's published papers and developed the core of what today we call Hadoop. The equivalent of the Google File System is HDFS, and the equivalent of Google's MapReduce programming framework is Hadoop's MapReduce. HDFS is the Hadoop Distributed File System, and MapReduce is the programming model for parallel processing. HDFS plus MapReduce is just the Hadoop that we know today: HDFS is the file system which manages the storage of data across machines in a cluster, and MapReduce is the framework to process data across multiple machines.
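The two-phase model can be illustrated with a toy word count in plain Python — the canonical MapReduce example. This is only a single-process sketch of the programming model, not Hadoop's actual API; on a real cluster the framework runs the map and reduce functions on many machines in parallel and performs the shuffle for you:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (key, value) pair for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key (the framework's job)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine the values collected for each key
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big clusters", "clusters need coordination"]
pairs = [p for line in lines for p in map_phase(line)]
result = reduce_phase(shuffle(pairs))
print(result["big"])       # 2
print(result["clusters"])  # 2
```

The key point is that `map_phase` and `reduce_phase` know nothing about machines or failures — that separation is exactly what lets the framework distribute them.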
Hadoop is distributed by the Apache Software Foundation, and it is open source, which means there are thousands of engineers who have actively contributed to Hadoop — and that happens even today. Hadoop constantly undergoes improvements, and in 2013 Apache released a brand new version, Hadoop 2.0. This was a fundamental change in which the MapReduce framework, which was responsible for processing data across machines, was split into two logical components: MapReduce and YARN. The responsibility of each of these components was made narrower and more focused. The MapReduce framework was now only for defining the data processing task — it was focused on what logical operations you want to perform on the raw data you have. YARN became the framework to run the processing task across multiple machines and to manage resources: memory, storage requirements, and so on. YARN did not care what the data processing task was — that was up to MapReduce. YARN was only responsible for running it and seeing it through to completion. These three components — HDFS, MapReduce, and YARN — have corresponding configuration files. They are the basic building blocks of Hadoop, so whenever anyone talks about the block diagram of Hadoop, these are the logical components. You have a processing task which you want to run on a cluster which has Hadoop installed. What exactly is the series of steps that happens when you submit a job to Hadoop? The first thing is for you to define the map and reduce tasks using the MapReduce API, which is available in Java and also other programming languages. This defines what computation you want to perform on the data. MapReduce programs are packaged into jobs, and a Hadoop cluster deals with jobs as a whole. A job is triggered on the cluster using YARN, and YARN checks whether the cluster, or the nodes in the cluster, have resources.
In order to run this job, YARN figures out where the job is supposed to run, then runs the job and stores its result in HDFS, the storage system. This is the typical life cycle of a MapReduce job, which runs in parallel across multiple nodes in a cluster. 5. Hadoop vs RDBMS: Let's see some differences between Hadoop and an RDBMS. The first one is structured versus unstructured data. With an RDBMS, the data is absolutely structured; with Hadoop, it's all unstructured data across the HDFS file system, so you have essentially no schema. You can project structure onto that data in a number of ways — using HBase, for instance, or writing Hive queries. The idea with Hadoop is what's called schema-on-read: you develop the schema as you're performing processing operations on the data, whereas in SQL databases and data warehouses the schema is always present as a skeleton. The next difference is CAP versus ACID properties. Relational database management systems ensure data integrity for transactions through the ACID properties — atomicity, consistency, isolation, and durability. In contrast, with big data systems like Hadoop we talk about the CAP theorem: consistency, availability, and partition tolerance. Now, consistency is not Hadoop's strongest point; it's not, generally speaking, an interactive environment, because again you're dealing with potentially huge scales of completely unstructured data — data with different contexts, different properties, different use cases. But availability certainly is a huge point for Hadoop, because you have spread your data across a number of nodes. And as we will see, HDInsight adds another layer of availability, because you are separating storage from the cluster.
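The schema-on-read idea is easy to demonstrate: the raw data sits as untyped text, and a schema is projected onto it only at query time. This is a minimal Python sketch of the concept — Hive does the equivalent when you define an external table over files in HDFS — and the field names and sample rows here are invented:

```python
import csv
import io

# Raw, untyped lines as they might sit in a distributed file system — just text.
raw = "1,alice,34\n2,bob,29\n"

def read_with_schema(text, schema):
    """Apply a (name, caster) schema while READING, not while writing."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, fields)})
    return rows

# The same bytes could be read with a different schema tomorrow.
users = read_with_schema(raw, [("id", int), ("name", str), ("age", int)])
print(users[0])  # {'id': 1, 'name': 'alice', 'age': 34}
```

Contrast this with an RDBMS, where the table's schema must exist before a single row can be inserted.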
Next is data throughput. Normally, in an on-premises Hadoop environment, HDFS is actually part of your cluster, but HDInsight can decouple the storage and compute layers, which means you can bring Hadoop clusters online or destroy them — a good thing from a cost-saving perspective. Then there's query performance. Generally, relational systems offer lower data throughput but faster query performance. Throughput is lower because, especially in the case of symmetric multiprocessing systems like an Azure SQL database, you have got just a single virtual server, so compute is bottlenecked. In contrast, because Hadoop's data is distributed across the nodes in a cluster, you have got potentially much higher data throughput — but because MapReduce has to map across all the nodes to round up the data you are requesting in the query, and then distill it down and reduce the results, it is going to give you potentially much slower query performance. Then there's scale: RDBMSs are vertically scaled, and Hadoop scales horizontally — similar to Azure SQL Database versus Azure SQL Data Warehouse. An RDBMS, as we know, is an online transaction processing platform, whereas Hadoop leans more toward analytics. And finally, cost: the SQL Server products are licensed, closed-source enterprise software. By contrast, Hadoop is mostly free and open source — I say mostly because companies like Cloudera and Microsoft offer managed versions of these open-source tools, and you pay for the premium experience of having not only a hosted environment but, in the case of HDInsight, integration with the entire Azure ecosystem. 6. Hadoop Ecosystem: As we discussed in the previous lessons, Hadoop and the equivalent Google proprietary systems completely revolutionized distributed computing, and distributed computing became more the norm than the exception.
People realized that the use of Hadoop and the distributed computing environment should not be restricted to hard-core Java developers; it should be made more democratic. People with different skill sets ought to be able to use the power of distributed computing. This is what led to the rise of a whole ecosystem of tools which work with Hadoop. Some of the most popular technologies you might have heard of are Hive, which gives a SQL-like query interface to data stored in a big data system; HBase, a NoSQL database; Pig, a way to convert unstructured data into a structured form; Oozie, a workflow management system; Flume and Sqoop, tools which allow us to put data into Hadoop and get data out of it; and finally Spark, a way to perform really complex transformations in a functional way on big data. Let's take a brief look at each of these to understand where they fit in and how they help make data processing more democratic. Hive: Hive is a SQL interface to Hadoop. Traditionally, people from analytics backgrounds are very familiar with relational databases and SQL, but in a big data system they would need to program in Java, which they may not be comfortable with. Hive provides a bridge to Hadoop for folks who do not have exposure to object-oriented programming in Java, by converting their SQL queries to MapReduce behind the scenes: the user writes queries, which run as MapReduce programs in the background. HBase: As big data systems became more prevalent, people realized that they needed a way to perform low-latency operations on their data. HDFS is a batch processing system, so HBase was built on top of HDFS to allow low-latency operations on key-value pairs. Data in HBase can be looked up and scanned very quickly; it's essentially a database built on Hadoop. Pig: Pig is a scripting language which allows us to take data in its extremely raw and unstructured form.
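Before we go further with Pig, the HBase idea just described — fast point lookups and ordered scans over sorted keys — can be sketched with a tiny in-memory store. This is a toy model only: real HBase persists sorted key-value files distributed over HDFS, and the row keys below are invented:

```python
import bisect

class ToyKeyValueStore:
    """Keys kept sorted, so both get() and range scans are fast."""

    def __init__(self):
        self._keys, self._values = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value            # overwrite existing row
        else:
            self._keys.insert(i, key)
            self._values.insert(i, value)

    def get(self, key):
        # Low-latency point lookup by key
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

    def scan(self, start, stop):
        # Ordered range scan, like HBase's Scan with start/stop rows
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))

store = ToyKeyValueStore()
store.put("user#002", "bob")
store.put("user#001", "alice")
store.put("user#003", "carol")
print(store.get("user#002"))  # bob
```

Notice the contrast with HDFS: there is no batch job here — a single key is answered immediately, which is exactly the low-latency gap HBase fills.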
Think log files, where there is absolutely no structure to the data — just lines of messages, warnings, text, et cetera. Pig takes this data and converts it into a structured format. This structured format can then be stored in HDFS or in Hive, and queried by people looking to extract information from it. Pig can essentially take data which looks unusable and convert it to a form which is actually useful. Now Spark — one of the most important ones. Spark is a technology that is very much in vogue nowadays because of how easy and simple it makes working on big data sets. You can chain and transform huge data sets and extract information from them in a very interactive and creative way. It's essentially a distributed computing engine built on top of Hadoop. It has an interactive shell, in either Scala or Python, which allows you to quickly process data sets: you simply type in a bunch of transformations and see what the result looks like. Spark is very fast, and it has a whole bunch of built-in libraries for machine learning, stream processing, and graph processing — a framework that includes all the useful tools you will need in a big data system. Oozie: Organizations like Google, Yahoo, and Facebook don't just run processes of one type; they have chains of processes constantly crunching data. Managing all these processes requires a workflow management system, and that is what Oozie is all about: a workflow management system which works with all the Hadoop technologies. 7. Hadoop Summary: And with this we come to the end of this module on the Hadoop overview. Hopefully you have understood why we need a distributed computing system and why it is superior to a single-computer setup. You have also understood what role Hadoop plays in a distributed computing setup and why technologies like Hadoop were developed in the first place.
You also now have a big-picture understanding of the various technologies which live within the Hadoop ecosystem and are built on top of Hadoop. Thank you. 8. HDInsight Overview: Hello and welcome, everyone. In this module we will try to understand why, even though Hadoop is so useful in big data analytics, it is so difficult to implement; what common challenges we face with Hadoop; and how HDInsight easily overcomes those challenges. We will not only discuss some very interesting aspects of HDInsight, but we will also go through a simple demo where we will fetch data from Data Lake, process it through Hive in the cluster, and later store this data in SQL Server. By the end of this module, you will understand HDInsight basics, how it relates to Hadoop, and specifically how we can use HDInsight to perform data processing tasks. 9. Why Hadoop is Hard: In the previous module we discussed how traditional systems are failing while dealing with big data, and how Hadoop is a promising technology to process such a massive scale of data. With all its hype, it's no wonder that organizations are getting caught up in the idea of having their own big data initiatives. But as promising as that idea sounds, the reality is that the majority of big data projects are unsuccessful. Clearly, there's a disparity between the idea of big data and the successful execution of a big data initiative in the enterprise. The reason for that disparity is simple: implementing big data is challenging on a number of levels. Let's discuss three main challenges to implementing big data solutions. The first one is upfront hardware cost.
While the benefits of Hadoop adoption are many and varied, the reality is that implementing on-premises Hadoop is extremely difficult. The first challenge is to estimate and forecast how much hardware you would need: the number of machines to purchase will depend on the volume of data to store and analyze. You have to analyze and visualize the hardware needed for accurate insights, plan for new projects and services, and predict the impact of rising demand on systems, apps, and services. They say that Hadoop runs on commodity hardware, but these are still not cheap machines — they are server-grade hardware, and they cost a lot. In addition to hardware cost, a large Hadoop cluster will also require support from various teams, like network admins, security admins, and system admins, and this puts a lot of overhead on a big data project. And there are many other costs: data center expenses, cooling, electricity, et cetera. So in general, hardware infrastructure is a big challenge when implementing Hadoop. Second is the scalability challenge. Big data projects can grow and evolve rapidly. On-premises Hadoop analytics platforms rely on commodity servers, and that physical environment results in scalability problems and storage limitations. To solve these problems, more physical servers must be added, and that can be very expensive, time-consuming, and disruptive to the project. In addition, big data workloads tend to be inconsistent, which makes it a challenge to predict where resources should be allocated. And the third one is the data talent shortage. You need to find experts who know how to deploy Hadoop, and this is not easy. Successfully implementing big data is largely dependent upon getting the right people with the right skill set, but as big data adoption accelerates, those people are getting harder to find.
This is especially true for organizations that have adopted an on-premises big data solution. These functions typically require sophisticated teams of developers, data engineers, and Hadoop admins who have the knowledge and skill required to manage and maintain Hadoop clusters. Putting such a team together can be a painstaking and expensive process, and failing to do so can doom a project before it even gets off the ground. 10. HDInsight Makes Hadoop Easy: So what is HDInsight? In very simple terms, HDInsight is Hadoop on the cloud. I can say HDInsight is a cloud distribution of Hadoop components, and Azure HDInsight makes it very easy, fast, cost-effective, scalable, and secure to process massive amounts of data. Now, what does it mean when I say HDInsight is a completely managed service? It means Azure will take care of the operation of the application for you — they will run Hadoop for us. We don't need to hire someone who knows the care and feeding of a Hadoop cluster. Azure provides an SLA on the Hadoop cluster: if something breaks at two a.m., Microsoft support will get the call, not us, and they will fix it before we even realize there is a problem. Microsoft is continuously investing more and more to make it easier to use: they are integrating with other Azure services, putting Hadoop inside Visual Studio, and doing a great deal to make it easy to succeed with Hadoop. All of the challenges we discussed in the previous lesson are taken care of by HDInsight. Let's see that now. First of all, there is no upfront hardware cost. HDInsight deploys and runs the Hadoop cluster for us. All we have to do is log in to the cloud, provide billing information, and let them know what we need, and HDInsight will deploy it for us. And now we have almost unlimited scale to play around with.
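The pay-only-for-what-you-use point can be made concrete with simple arithmetic. The prices, node counts, and hours below are invented for illustration — real HDInsight pricing varies by node size and region:

```python
def cluster_cost(nodes, price_per_node_hour, hours_active):
    """Elastic billing: you pay only for the hours the cluster exists."""
    return nodes * price_per_node_hour * hours_active

# On-demand: spin up 8 nodes for a 6-hour nightly batch job, then destroy them.
on_demand = cluster_cost(8, 0.50, 6 * 30)   # 30 nights in a month
# Always-on equivalent: the same 8 nodes running 24x7 all month.
always_on = cluster_cost(8, 0.50, 24 * 30)

print(on_demand)  # 720.0
print(always_on)  # 2880.0 -- four times the cost for the same nightly job
```

Because the data lives in decoupled storage rather than on the cluster, destroying the cluster between runs loses nothing — which is why this billing model works at all.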
One of the biggest advantages of Azure HDInsight is that it enables developers to scale big data clusters up or down very easily and efficiently. They can now process a massive amount of data without being worried about scalability. The third advantage is paying only for what you use. Azure provides dynamic machines that are billed only when active. This enables elastic computing, where you add machines for a particular workload or project and then remove them when they are not needed, and you pay only for what you used — whereas on premises we don't have this flexibility: we have to bear hardware expenses whether we use the hardware or not. Besides, if you have worked on big data technologies before, you know that even after getting all the cost approvals from higher management, it can take months to set up hardware infrastructure; using Azure HDInsight services, we can get the same infrastructure in just a matter of minutes. 11. Important aspects of HDInsight: Let's discuss some important aspects of Azure HDInsight. Microsoft designed Azure HDInsight based on the Hortonworks Data Platform (HDP). Even though Hadoop is open source, very few companies compile their own ecosystem. It is like Linux: very few companies compile their own kernel — most take a distribution from companies like Red Hat. Similarly, companies like Hortonworks or Cloudera sell their distributions; they each take different slices of the projects, package them as a distribution, and sell services and support. Microsoft does not have its own distribution, but partnered with Hortonworks in order to use their distribution of Hadoop. Microsoft has a very deep relationship with Hortonworks, and they are committed to each other; Microsoft even makes changes and develops new projects with Hortonworks. Azure HDInsight enables us to use a wide variety of other Apache products.
For example, developers can do batch processing using Apache Pig, Apache Spark, or, let's say, Apache Hive. Likewise, they can access NoSQL data using Apache HBase, and process millions of streaming events using Apache Storm, Apache Spark, or Apache Kafka. In addition to using Apache products, users also have the option to seamlessly integrate HDInsight with a wide range of Azure data storage and other native services. For instance, they can integrate the cloud-based analytics service with Data Lake Storage, Azure Cosmos DB, SQL Data Warehouse, Data Factory, Blob storage, and Event Hubs. And beyond Apache products and Azure native services, we can further leverage a set of business intelligence tools, including Microsoft Power BI, SQL Server Analysis Services and Reporting Services, Apache Zeppelin, Visual Studio, and so on. This is facilitated via special ODBC drivers. These tools make it easier for the user to build big data applications and process massive amounts of data by extending HDInsight. HDInsight meets the most popular industry and government compliance standards. It can also capitalize on the security and management features of Azure by integrating with Azure Active Directory and Log Analytics, and Azure also provides monitoring and log tools through which we can monitor all of our clusters. 12. HDInsight Cluster Types: When we try to create an HDInsight cluster, Azure gives us the option to choose a cluster type. For example, let me go to All services and search for HDInsight, and add one here. Now here we have the option to select the cluster type. Let's discuss what these are. As we have been discussing, one of the main challenges with Hadoop is infrastructure scale and cost, and HDInsight largely solves this problem by offering different types of clusters.
That means you can literally create different kinds of clusters for different business needs and scenarios, and these clusters are billed on an hourly basis. And in a decoupled architecture, that means you can process the data you want and afterwards destroy the cluster, keeping all the data inside Azure Blob storage or Azure Data Lake storage. The data will remain there, neither removed nor changed, once the processing is over, unlike an on-premises environment where we are not allowed to turn off the compute part. Most companies that use HDInsight adopt this approach to achieve performance and at the same time reduce their infrastructure cost. HDInsight offers cluster types like Hadoop, Spark, Kafka, HBase, Interactive Query, Storm, and R Server. By the way, HDInsight is the only PaaS platform that offers this many fully managed cluster types in a cloud environment. Now let's discuss the different cluster types and review the best-fit solution and an everyday use case scenario for each. Apache Hadoop: a framework that uses HDFS, YARN resource management, and a simple MapReduce programming model to process and analyze batch data in parallel. Common use case scenarios are batch processing, local storage, and parallel processing. Apache Spark: an open-source parallel processing framework that supports in-memory processing to boost the performance of big data analysis applications. Common use case scenarios are data streaming, machine learning, and interactive analysis. Apache Kafka: an open-source platform that is used for building streaming data pipelines and applications. Kafka also provides message-queue functionality that allows you to publish and subscribe to data streams.
Common use case scenarios are messaging exchange, website activity tracking, metrics, data monitoring, log aggregation, event sourcing, and stream processing. Apache HBase is a NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data, potentially billions of rows and millions of columns. A common use case is handling huge volumes of messages, with automatic sharding and failover. Interactive Query provides in-memory caching for interactive and faster Hive queries. Common use case scenarios are data analysis with HiveQL and other data warehouse or data mart kinds of scenarios. Apache Storm is a distributed real-time computation system for processing large streams of data; Storm is offered as a managed cluster in HDInsight. Common use cases are real-time data normalization, Twitter analysis, and event log monitoring. R Server is a server for hosting and managing parallel, distributed R processes. It provides data scientists and R programmers with on-demand access to scalable, distributed methods of analytics on HDInsight. Common use case scenarios are scalable and distributed R services, R-based analytics processes, and distributed sets of algorithms like Microsoft ML to process R models. 13. HDInsight Architecture: In this lesson we will try to understand a very high-level view of the HDInsight architecture, and again I'll try to keep it as simple as possible. The high-level architecture of HDInsight is very similar to Hadoop's. We have the same concepts here that you would have in any other cloud or on-premises Hadoop cluster, like massively parallel processing. So we have these two types of nodes: the head node, also called the resource manager, and the worker node or slave node, also called the node manager. The resource manager is the master.
It knows where the worker or slave nodes are located and how many resources they have. It runs several services; the most important is the resource scheduler, which decides how to assign the resources. The node manager is the slave of the infrastructure. When it starts, it announces itself to the resource manager, and periodically it sends a heartbeat to the resource manager. Each node manager offers some resources to the cluster; its resource capacity is the amount of memory and the number of virtual cores. At runtime the resource scheduler decides how to use this capacity. Then, at the bottom of the slide, we notice that we have decoupled storage where the structured or unstructured data can be stored. HDInsight can be configured to store the data either on HDFS within the HDInsight cluster nodes or on Azure Blob storage. The most common approach is to use Azure storage to store the input data, intermediate results, and output data, and not to store that data on individual nodes. There are many advantages to this approach. For example, a cluster can be provisioned and destroyed as and when required, and the data is still available on the Blob storage even after the cluster is destroyed. This is highly cost-effective, as we don't need to keep the cluster active to access the data. Also, storing the data on Blob storage, which is common storage, allows other tools or processes to access and use this data for other processing and reporting purposes. Now someone might ask: what about speed? Is it okay to move data between the blob and the clusters? Yes, as long as we create the Azure Blob storage account and the HDInsight cluster in the same data center, because Microsoft has implemented Azure Flat Network Storage technology, which offers very high speed connectivity between the Blob storage and the compute nodes. 14. HDInsight Demo Overview: Welcome to the demo part of this module.
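From inside the cluster, that decoupled Data Lake Gen2 storage is addressed with an abfs:// URI; a minimal sketch, where the file-system and account names are placeholders:

```shell
# List files on the attached Data Lake Gen2 file system from a cluster node.
# 'filesystem' and 'mystorageacct' are hypothetical names.
hdfs dfs -ls abfs://filesystem@mystorageacct.dfs.core.windows.net/demo/
```

Because the path points at the storage account rather than local disks, the same data remains listable from a brand-new cluster after the old one is destroyed.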
In this demo I'll try to illustrate the use of HDInsight. We're going to combine HDInsight with Hive, or HiveQL, the Hive variety of the SQL query language, to do a batch processing job. We will upload a CSV file to Azure Data Lake Gen2, which will act as the source in our data flow cycle. Then we will access this file in Azure Data Lake Gen2 storage using Hive and ingest it into the HDInsight cluster. Then, using HiveQL, we will transform the data into aggregated form and save it back to the Gen2 storage. And finally, using Sqoop, we will export this processed data to SQL Server. All right, let's get into it. 15. Create Azure Data Lake Gen 2 Storage: In this lesson we will create our source and destination, that is, the storage Gen2 account and the SQL Server account. We already created Data Lake Gen2 and SQL Server accounts in a previous module and explained them in great detail there, so I'll be quick here and will not take a lot of your time explaining the details again; please go through those lectures if you want more detail. Let's click to create our Data Lake Gen2 account first. Let's again go to All services, Storage, click the Add button, and let me choose my subscription and my resource group, give the storage account name, and choose my home region. I'm going to select Standard performance, general-purpose version 2, locally redundant storage, and the Hot access tier. The next page takes us to the connectivity method, and I leave it at the default. We'll go next to the Advanced page. Let's leave the default settings for security as enabled, and we will enable Data Lake Storage Gen2. And as you know, when we enable Gen2 it automatically disables the data protection options. That's it; review and create. It may take a few minutes. Meanwhile, while Azure creates this account, let's go and create our SQL Server account as well. Let's again go to All services, Databases, SQL Database, and add a new instance here.
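The portal clicks for the Gen2 account can also be sketched from the Azure CLI; the account name, resource group, and region below are placeholders, and the key part is the hierarchical-namespace flag that makes a StorageV2 account a Data Lake Gen2 account:

```shell
# CLI sketch of the Gen2 storage account created above (all names are hypothetical).
az storage account create \
  --name mydatalakegen2acct \
  --resource-group my-rg \
  --location eastus \
  --sku Standard_LRS \
  --kind StorageV2 \
  --access-tier Hot \
  --hns true
```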
If you're following this course, by now you must be very familiar with this information, so without wasting time let me quickly fill in these boxes. I will give the database name and create a new server here, giving the server name and strong credentials, and choose my nearest location. That's it. Won't we use a SQL elastic pool? No. But if you have multiple databases, you can choose this option and group multiple databases together, and all those databases will share the compute resources, like storage and DTUs, database transaction units. Next we configure the database. We have different levels of DTUs to choose from here, depending on what kind of workload we have. I'll choose the lowest possible, since I don't need a whole lot of processing for a demo; 100 MB is also enough for me, but it doesn't make a difference in price. So let's choose 10 DTUs and apply. Next on to networking; choose the default and go next. We don't need the sample database here, we will create our own, so I will choose None and go next. I don't need tags; review and create. Now it will take a few minutes, and it is still processing, so let me pause the video here for a few minutes while both these deployments complete. Now we can see both our deployments completed successfully. I can go to the database here, and I can use the query editor to create a table which will store our final output. Let's log in using the database server credentials which we created a few minutes back. But before that we have to add our IP address in the firewall settings. Let's go back to the SQL server, to the firewall and virtual network settings. It is showing our IP address here; all we have to do is add the client IP and save. Okay, now go back to the SQL server and try to log in again. Now here we will create a table called delays, which will store the final output of our flight delay data. It will store the origin city name and the delay time for that city.
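A table along these lines would do the job; note that the exact column names and types here are my assumptions for illustration, not taken verbatim from the course files:

```sql
-- Hypothetical destination table for the aggregated flight-delay output:
-- one row per origin city, with its average weather delay.
CREATE TABLE dbo.delays (
    origin_city_name NVARCHAR(50) NOT NULL,
    weather_delay    FLOAT,
    CONSTRAINT PK_delays PRIMARY KEY (origin_city_name)
);
```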
I will explain this table in detail in a few minutes, when we discuss our input data, its processing, and the final output in the upcoming demo. Thank you. 16. What is Managed Identity: As you deploy your mission-critical, next-generation, cloud-native workloads on Microsoft Azure that consume multiple Azure services, one of the key things to consider is ensuring that these instances of different services in Azure talk to each other securely, and that access to sensitive data within the code does not leave any exposed credentials. So we can use managed identities for Azure resources to securely connect to other instances of Azure services without the need for storing any credentials in your code. So what is a managed identity? In simple words, managed identities are used by Azure services to authenticate to other Azure services that support Azure Active Directory authentication. One of the common challenges in building cloud-native applications is managing how credentials are stored to authenticate to other cloud services. Keeping sensitive information like credentials secure is a critical task. In an ideal world, credentials should never appear on a developer workstation and should never be checked into source control. Let us consider a simple scenario to better understand managed identities. Let's say you have an application currently hosted on an Azure virtual machine that needs to upload some files to Azure Blob storage. We're talking about two different Azure services here: the Azure virtual machine, which is offered by the Microsoft.Compute resource provider, and Azure Storage, which is from the Microsoft.Storage resource provider. To upload files, the Azure Storage service needs to authorize your virtual machine to perform actions like blob uploads. Now, when you create a storage account, Azure generates two 512-bit storage account access keys.
These keys can be used to authorize access to your storage account via shared key. So what we could do is have other Azure services consume those keys so that they can perform actions against the storage account. Also, just enabling a managed identity on an Azure resource does not magically give it access to an Azure service. When you enable a managed identity on an Azure resource, by default it has access to nothing. Enabling a managed identity only enables authentication; you still need to authorize the identity for it to perform any task you intend. Step two of the process is always to make sure that the identity has the necessary access to an Azure resource. In a second, I will show you both how to authenticate and how to authorize the identity. By the way, managed identities for Azure resources is a free feature with your Azure AD subscription; there is no additional cost. There are two types of managed identities: system-assigned and user-assigned. A system-assigned managed identity is enabled directly on an Azure service instance. You can do this using the portal, the CLI, PowerShell, or ARM templates. The life cycle of a system-assigned identity is directly tied to the Azure service instance it is enabled on. If the instance is deleted, Azure automatically cleans up the credentials and the identity in Azure AD. If you completed the previous module, in the permissions lecture we did an exercise where we added an Azure Data Factory system-generated identity to Data Lake Gen2 storage. A user-assigned managed identity is created as a standalone Azure resource through a create process. Azure will create an identity in the Azure Active Directory tenant that is trusted by the subscription in use.
After the identity is created, it can be assigned to one or more Azure service instances. Whereas a system-assigned managed identity belongs to one instance of an Azure service, you can assign a user-assigned identity to multiple instances of a service. For example, in our case you'll see in a few minutes that we will use one user-assigned managed identity with the SQL server, the Data Lake Gen2 storage, and HDInsight. Also, after the demo, even if we delete all three resources, the user-assigned identity is not deleted. The life cycle of a user-assigned identity is managed separately from the life cycle of the Azure service instance to which it is assigned. 17. Add Managed Identity to Gen2 and Database accounts: Now let's go into the Azure portal, and I'm going to search for managed identities. Once again, we are going to create an identity that will represent the HDInsight cluster, so that we can interlink our HDInsight cluster with Data Lake storage, Azure SQL databases, and other products. So let's click Add, give the name and the resource group, and finish. Now that we have our managed identity created, we're going to give this identity access to both our Gen2 storage account and our SQL server account. So basically we will use this managed identity and assign it a role in both the Gen2 storage and Azure SQL server accounts we created in the previous lesson. Now let's start with the Gen2 storage account. What we want to do is go to Access control and add a role assignment, and we will choose one of the built-in storage blob roles. If you do a search for blob, there's a built-in role called Storage Blob Data Contributor; that's just what we need, in the name of least-privilege security. For Assign access to, we will choose user-assigned managed identity, and we have got our managed identity right here.
It's wonderful how much easier role-based access control is now, thanks to system- and user-created managed identities. Let's repeat the same steps with the SQL server: go to the SQL server, Access control, add a role assignment, and choose SQL DB Contributor. Choose the managed identity. Fantastic. Now this managed identity has roles in both the source and the destination accounts, and in the next lesson you will see that we will choose this managed identity when we create the HDInsight cluster. Thank you. 18. Create HDInsight Interactive Query Cluster: In this demonstration we are going to create our first HDInsight Interactive Query cluster. All right, without any delay, let's search for HDInsight cluster, click Add, and select our resource group. Give the cluster a name and make it unique. For the location I'll choose my home location. Now the important part: the cluster type. Select the cluster type, and this is where you have those different models. As I said before: traditional Hadoop, Spark, Kafka, HBase, and other types. We're going to choose Interactive Query. This cluster type, of course, is optimized for memory, which gives you low-latency analytical processing, LLAP, using Hive SQL on Hadoop. So let's click Select to choose that, and then we can choose which specific version we want; I'm just going to use the default. For the cluster credentials, we have got the cluster login username, and we have to have a separate username if we want to manage the cluster via secure shell. I'm going to keep the default SSH username and, of course, we need a strong password, strong in terms of alphanumeric and non-alphanumeric characters in addition to traditional requirements like length. So we have got our credentials here. Let's go next to Storage, and here we need to choose where the primary storage for the cluster is going to be.
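The identity-plus-role-assignment steps above have a straightforward CLI equivalent; everything below, including the names, principal ID, and scope, is a placeholder sketch rather than the course's actual values:

```shell
# Create a user-assigned managed identity (hypothetical names).
az identity create --name hdi-identity --resource-group my-rg

# Grant it Storage Blob Data Contributor on the Gen2 account;
# <identity-principal-id> and <sub-id> must be filled in from your environment.
az role assignment create \
  --assignee <identity-principal-id> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mydatalakegen2acct"
```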
We can choose traditional Azure Blob storage or Data Lake Gen1 or Gen2, and there are some additional options given below for additional Azure storage, so you can really lay out your HDFS file systems right here. Of course, we're going to choose Data Lake Storage Gen2 here and choose the name of our storage account, which we created in the last lesson, and the file system that our cluster will use. I will give the file system a name, and it will be created in the Gen2 storage while this cluster is being created. And I want to stress once again that this is really important, because these HDInsight clusters may or may not last long, so it's crucial that your HDFS storage layer be separate. That's why we are going to use a Gen2 Data Lake storage account for that purpose. We created a managed identity earlier because it has to exist before we come here; we're going to choose our identity from the list. We don't need Gen1 or additional storage. We can also preserve our Hive metadata outside of the cluster, by storing it in an Azure SQL database, for example. This allows us to keep our Hive settings, queries, or other artifacts outside the cluster so that we can reuse them later. If we want to use this facility, we need to create those accounts before creating the cluster and authenticate them here. Let's go to the next step, leave the defaults for security and networking, and go next. And this is where we need to customize the nodes. Notice that we have to have two head nodes for high availability, and it looks like we have to have three ZooKeeper nodes, but we can choose the number of worker nodes here. I'm going to cut that down to two. Now, if you open up the node size list, there's a recommendation, and then there are other options.
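For completeness, a one-shot CLI sketch of the cluster creation described above; every value here is a placeholder, I am assuming current `az hdinsight create` flag names, and a real Gen2-backed deployment may require additional parameters beyond this minimal form:

```shell
# CLI sketch of an Interactive Query cluster on a Gen2 account (all values hypothetical).
az hdinsight create \
  --name my-hdi-cluster \
  --resource-group my-rg \
  --type interactivehive \
  --storage-account mydatalakegen2acct \
  --storage-account-managed-identity "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/hdi-identity" \
  --http-user admin \
  --http-password '<StrongP@ssw0rd!>' \
  --workernode-count 2
```

Note how the user-assigned identity from the previous lesson is passed in so the cluster can authenticate to the Gen2 file system.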
Be careful about the other options, because they may actually affect your deployment. Remember that depending upon the type of HDInsight cluster we're deploying, it's going to be optimized for a particular use case. Because we're doing Interactive Query, we have a heavy emphasis on memory; we can see here that the worker nodes have 112 GiB of RAM. So this gives us a total estimated cost per hour, and now we go to review and create. It will take quite a bit of time to create the cluster, so let me pause this video here and come back once it is done. So, once the deployment is complete, we can go to the resource. A couple of things I want to show you here. Take a look at the URL field: this is the address of our cluster, and it goes under azurehdinsight.net, so the cluster name needs to be globally unique. The other thing I want to show you, under Essentials, is the cluster dashboards; specifically, we're going to look at the Ambari home portal. Let me scroll down to the settings. You can look at your SSH and cluster login credentials by dialing in your host name here, and you get a quick link if you want to SSH in. But as I said, in our case we're going to use Ambari, which is the web UI. There are some other options here; the specific options you get depend upon which variety of HDInsight cluster you provisioned. I like Ambari, so let's go there. We are asked to authenticate, which we will do with the credentials we just created while creating the cluster. And here we are in Ambari. We will pick up with this very screen in the next demo. See you then. Thank you. 19. Ambari overview and UI: Now suppose we have hundreds of clusters, a system that scales to hundreds of petabytes of data and supports mission-critical business applications. How are you going to manage that? Well, that's where Ambari comes in.
Ambari is what we're going to use to manage our HDInsight environment. Ambari is a management platform that handles everything from HDInsight administration and management all the way to configuration. So Ambari is the Hadoop management platform responsible for cluster administration, monitoring, and configuration. Now, every HDInsight cluster comes with Ambari, but Ambari does not only run on HDP; there are other platforms and other uses where Ambari can be used outside of HDP. It is an open-source tool, written in Java, built and supported by the community. Let's look at Ambari from a different perspective. Imagine a flight tower. What is a flight tower responsible for? Well, a flight tower is responsible for knowing where all the planes are, where the planes are in relation to the airport, and where they are in relation to each other. It's also the communicator for all these airplanes and all the flights going around in its own area. The tower also has access to real-time weather reports, so they know what the weather is going to be like as the planes come in and out of the airport area. And they also know the schedule: they know when the planes are supposed to land, and which planes are not even landing but just passing through the airport area, so they can coordinate all this with the planes. And lastly, they have access to all the maintenance reports, so they also know the maintenance status of each individual airplane, when it is scheduled and what it is due for. They also know how much fuel each plane has as well. So the flight tower in an airport is really what runs the airport: it not only handles all the communication, it handles all the configuration and all the scheduling of everything that goes on around the planes. Now let's compare that to Ambari. So, Ambari is responsible for adding and removing nodes. Ambari also has visualized reports.
These give you visualized reporting on user access, data storage totals, and even CPU utilization, so you can see how your system is performing and know where you need more growth. Maybe you need more CPU provisioned, or maybe you need some nodes that are a little more dense so that you can take on more data and more capacity. Ambari gives you tools where you can change these configurations, update packages, and even manage things like system quotas, snapshots, and authentication providers too. So in a Hadoop environment, Ambari is like that flight tower at an airport: it controls everything, and it keeps everything running smoothly. This is the graphical front end used with Hadoop, one of many interfaces which Microsoft uses for HDInsight. At the top left we have the Ambari logo. Here we can see the number of ongoing background operations. Any warning or critical alerts can be seen here on the dashboard. Here we can configure the settings for the services in the cluster. Here we can configure the settings of the nodes in the cluster. Here we can find not only critical alerts but also information- and warning-level alerts. Here we can see all the services that are installed on the cluster, plus account information and other security-related information, and here we can find the UI settings and, of course, sign out. Now if we go to the Alerts page, you can see we have statuses of OK, warning, critical, and unknown. Also, alerts can be organized into several default groups. Here we can manage these groups by using the Actions menu and selecting Manage Alert Groups. You can also manage alerting methods and create alert notifications from the Actions menu by selecting Manage Alert Notifications. Now let's go to the dashboard. In the dashboard, we can see three tabs: Metrics, Heatmaps, and Config History.
In Metrics, we can see the status of our cluster at a glance, like CPU usage. The services sidebar on the dashboard provides the quick status of the services running on the cluster, and these services are based on the HDInsight cluster type; the services displayed here may be different from the services displayed for your cluster. If you select any service here, it will display more information on that service. The Hosts page lists all the hosts in the cluster; from the Actions menu you can start, stop, or restart all components. We can also stop or start all the services from the dashboard here or from the service page. So this was a quick intro to Ambari, just so that you feel comfortable when we work on the HDInsight Hive cluster. 20. Ingest dataset in to DataLake storage: In this demo I'm going to show you how to run Hive queries against your HDInsight cluster. We are going to do an extract, transform, and load using Interactive Query on the HDInsight cluster. Now, we also have the option to do all of this using command-line tools, but I'm going to go the way of Ambari, because we are already in a graphical environment in our browser, and also I want to make sure that you understand the main flow we're dealing with. I have downloaded a CSV dataset from the U.S. government's Bureau of Transportation Statistics. As you can see, it's a traditional comma-separated value file, and if I select a cell and press Ctrl+End, we see there are nearly 600,000 rows in this dataset. Our goal is to ingest this into our HDInsight cluster. Actually, we're going to put it into the Data Lake Storage Gen2, which is kept separate from the cluster, so that when we delete the cluster, the Gen2 storage will still be there and we won't lose the data. And once we have ingested that CSV file, we are then going to use the Hive query language, HiveQL, to run SQL-like queries against that data and perform other kinds of operations.
Specifically, we are going to take the query results and put them in a new table of the same Azure SQL database which we created earlier in the demo. Now, to upload this file, I'm going into the storage account. In the previous module on Data Lake, we used the external Azure Storage Explorer tool, but if you go back, we see that Azure is integrating this tool right here in the portal, and it is already in preview, so let's use that. Under File systems, we have the file system that we specified when we created the HDInsight cluster, and you can see that HDInsight has already populated the file system with what it needs in terms of binaries, executables, tables, logs, and so on. There's a robust collection of samples in the HDInsight samples folder, but what I'm going to do is create another folder, and I'll call it demo, and we are going to ingest our data inside that demo folder. So we will come down into demo and create another folder; let's call it flight-delays, because this transportation data specifically is air travel data, and by the way, it's real data. Then, under demo/flight-delays, finally, we will create a folder called input-data. If I refresh the view, in this input-data folder we are going to place our CSV, so let's upload it now. Interestingly, Azure has given an Upload button, but to actually upload the file, again we have to go to the Storage Explorer. Now we are in the Storage Explorer tool; let's go to the blob container, which is actually the file system, and browse down to the folder input-data. We are going to upload the file here. The destination directory given is fine, but we're going to browse to our desktop where I have that CSV file. If you notice, I have enabled AzCopy; if you go into the Preview menu here, you can opt in to use the AzCopy tool for improved blob upload and download, and that actually makes quite a difference. All right, it's completed.
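The same upload can be done directly with the standalone AzCopy tool; the local path, account, and file-system names below are placeholders, and AzCopy would first need credentials via `azcopy login` or a SAS token appended to the URL:

```shell
# Upload the CSV to the Gen2 file system over the DFS endpoint
# (all names and paths are hypothetical; authenticate first with 'azcopy login').
azcopy copy 'C:\Users\me\Desktop\flightdata.csv' \
  'https://mydatalakegen2acct.dfs.core.windows.net/filesystem/demo/flight-delays/input-data/flightdata.csv'
```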
Let's go to the next part of the demo, where we will be fetching this data from the cluster using Hive queries. Thank you. 21. Data Extraction with Hive: All right, so now we have got our storage layer set, so we can come back to our browser, and here we are in Ambari again. To get into the Hive interactive query editor, we can go multiple different ways: back in the portal, under the cluster dashboard, there's a direct link to the Hive editor, the Interactive Query editor, or in Ambari we can use the applications menu and browse to the Hive views. Here are the three checks: HDFS, HBase, and Hive. If those pass, we will be taken into the interactive query editor. Remember, under the hood we are dealing with the HDInsight cluster, and we are going to use HiveQL now to run interactive queries. What this query will be doing is: first, it will drop the objects if they already exist, and then it is going to create two tables, delays_raw and delays. Remember that we're dealing with about half a million rows, around 600,000, of airline traffic data; that's our dataset. So we are going to create the first table using CREATE EXTERNAL TABLE. We are building this delays_raw table, and if we come down to the bottom part, after we have defined the columns, stored as a text file, we want to look at the format and location of the file. It's a comma-separated value file, the lines are separated by newlines, and the location is demo/flight-delays/input-data; that is, the underlying data sits in our Gen2 account where we uploaded the file in the last demo. And then we are going to create a table called delays that is based on delays_raw. So we are doing some data ingestion from CSV right now. Let's click Execute to run this query. Remember, again, in the background Hadoop, or HDInsight, is taking that SQL syntax and creating a MapReduce job out of it.
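The extraction query described above can be sketched in HiveQL like this; the column list is abbreviated and the column names and storage path are my assumptions for illustration, not the course's exact script:

```sql
-- Sketch of the two-table extraction: an external table over the raw CSV,
-- then a managed table built from it. Columns and path are hypothetical.
DROP TABLE IF EXISTS delays_raw;
CREATE EXTERNAL TABLE delays_raw (
    year INT,
    month INT,
    origin_city_name STRING,
    dest_city_name STRING,
    weather_delay FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/demo/flight-delays/input-data';

DROP TABLE IF EXISTS delays;
CREATE TABLE delays AS SELECT * FROM delays_raw;
```

Because delays_raw is EXTERNAL, dropping it later would leave the CSV in Gen2 storage untouched, which fits the decoupled-storage approach from earlier lessons.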
It's pretty impressive that we are able to do all of this interactively. Normally, the Hadoop ecosystem is about batch jobs that you just launch and let run, and it may potentially take hours until they complete; MapReduce is not known for its speed. While we're waiting here, why don't we scroll up to the top of the browser and double-click the worksheet tab, and let's call this one "create tables". Click Save. If you want to save your Hive queries so that you can reuse them later, you have that option here: Save As, "create tables". We can also check on the job status in a couple of different spots: we can go over to the Jobs page, and we can see that this operation succeeded. The operations and alerts are also available up here. Let's come over to Tables, and we now see we have both delays and delays_raw. We have a column list with the data types, the data definition language (the actual statement that we used), and other metadata, and the same for the other table. So let's go back to the query editor, open up a new tab, and look at our data. We'll type in the SELECT and add LIMIT 5; that should only give us five rows back instead of half a million rows. We'll click Execute here; it shouldn't take very long. And here we have the data. So in this demo, we fetched data from Data Lake Gen2 storage and ingested it into our HDInsight cluster using Hive queries into two tables: delays_raw, and the final processed data into delays. In the next demo, using Hive, we will further transform this data and save it back into Data Lake Storage Gen2.

22. Data Transformation with Hive: So now we have done the extraction from CSV; we have loaded that data, modeled it, and projected it into tables. It's important to understand that the data we're dealing with on the HDFS file system is just simply unstructured data chunks.
But the beauty of Hive is that we're able to project a tabular structure onto it and use a SQL-like query language on top of that bag of unstructured data. So now it's time to take the data and create an output file that we will then put into the Azure SQL database for permanent storage. Here we have another query, where we're going to specify /demo/flightdelays/output as an output directory, and this will be an overwrite: if there's already a file there with the same name, we want to overwrite it. We are also going to do a little transformation; this is the transformation phase of the demo. Specifically, this query retrieves the list of cities that experienced weather delays, along with the average delay time, and it's going to save that to the output folder. Let's execute it. So again, the transformation is fairly basic: we are making a regular-expression-based replacement, and we are using the AVG function to do a minor transformation on the existing data. But it gives us, I think, a good idea of what's possible running Hive jobs on HDInsight. Let's go over to the job, and we can see that the operation succeeded. We can, of course, verify that as well by coming into our storage account: under flightdelays/output there's the data, and this is just the raw output, written out the way that Hadoop normally writes it. In order to make this raw data intelligible, we are going to create a final home for it in the SQL database, which we will finish in our next demo, where we are going to use Sqoop to read the data from this Gen2 storage location and export it into the Azure SQL database.

23. Data Export with Sqoop: All right, now let's go back to our SQL server. If you remember, we created this final destination table, called delays, at the start of the demo. This table has two columns. Of course, there's nothing in this table yet, but it's ready to receive data coming out of HDInsight.
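The destination table described above might have been created with DDL along these lines. This is a hedged sketch, not the course's exact statement: the schema name, column names, and types are assumptions inferred from the two columns mentioned (city and average weather delay).

```sql
-- Hypothetical sketch of the Azure SQL destination table for the Sqoop export.
-- Column names and types are illustrative, inferred from the demo's description.
CREATE TABLE dbo.delays (
    origin_city_name NVARCHAR(100) NOT NULL,
    weather_delay    FLOAT
);
```

Whatever the exact names, the column order and types here must line up with the fields in the Hive output files, because Sqoop maps the delimited fields to table columns positionally by default.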
Now, you're probably wondering how we will authenticate to Azure SQL from Sqoop. That's a good question indeed. We are going to need to SSH in. So let's come back to our cluster, go to SSH + Cluster login, choose the host name, and copy the SSH statement to the clipboard. I'll connect to PowerShell here and connect using SSH; I'll accept the public key and authenticate with my SSH password, which I created while creating the cluster. You know, Sqoop is our excellent import/export engine used in Hadoop. All right, it's authenticated. Let me clear the screen and paste in our next statement to view the databases on our Azure SQL database server. I will run sqoop list-databases, connecting with jdbc:sqlserver:// followed by the name of our virtual SQL server, which we can find in the SQL server pane, listening on TCP 1433, with the username and -P (uppercase P), which lets you provide your password interactively. That's a more secure way to go than putting the raw, naked password on your command line, which you usually never want to do. I'll type the password and press Enter. It seems there is some error here; let's see what it is. Okay: "client IP does not have access". This should be easy to fix; by now we have done it many times: add the client IP in the Azure SQL server firewall settings. Let's do that: copy this IP address, go back to the SQL server, look under Security for the firewall and virtual network settings, add the client IP, and save it. Come back to PowerShell and let's try the same command one more time. So we have authenticated to the Azure SQL server, and we can see the master database and our own database. So far, so good. Now we have one more statement. By the way, you can download the file where I have posted all of these commands, so you don't have to type them yourself. This command takes the data from Gen2 storage and ingests it into the SQL Server table; again, authenticate with the SQL Server password.
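The two Sqoop commands described in this demo would look roughly like the following. This is a hedged sketch: the server name, database name, username, and field delimiter are placeholders, not the course's actual values, and the commands assume they are run from an SSH session on the cluster head node.

```bash
# Hypothetical sketch of the Sqoop commands from this demo.
# Server, database, and user names are placeholders.

# 1) List the databases on the Azure SQL server (-P prompts for the password
#    interactively instead of putting it on the command line):
sqoop list-databases \
  --connect "jdbc:sqlserver://myserver.database.windows.net:1433" \
  --username sqladmin -P

# 2) Export the transformed Hive output from Gen2 storage into the delays table.
#    The delimiter must match whatever the Hive query wrote; tab is an assumption.
sqoop export \
  --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb" \
  --username sqladmin -P \
  --table delays \
  --export-dir '/demo/flightdelays/output' \
  --fields-terminated-by '\t' \
  -m 1
```

The -m 1 flag runs the export as a single map task; for a data set this small, that avoids the overhead of splitting the job, and it also sidesteps ordering and connection-limit issues against a small Azure SQL tier.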
Now, this may take 5 to 10 minutes to complete, so let me pause the video here. As a final sanity check, let's go back to the SQL server and select the top 1000 rows from the delays table. And here we have, as expected, 330 rows of final processed data, where we have each city and the average weather delay time for that city. Thank you.

24. Summary: So in this demo, we looked at how HDInsight provides a first-class Hadoop experience. Regardless of whether you like Azure or not — and in a sense it doesn't matter, because you are using the same native Hadoop tools — in Azure, at least, you can use either the command line or the graphical tools, or both. And in this demo, we saw how easily we can integrate our HDInsight cluster with Data Lake storage, and, using SSH, how we can authenticate to SQL Server from Sqoop. Thank you.