Introduction to Apache Spark for Big Data Analytics | Irfan Elahi | Skillshare

Introduction to Apache Spark for Big Data Analytics

Irfan Elahi


5 Lessons (1h 7m)
    • 1. History of Apache Spark in the context of Big Data Evolution

      16:43
    • 2. Introduction of Apache Spark and its advanced capabilities

      14:43
    • 3. Introduction of Apache Spark Libraries

      14:06
    • 4. How Apache Spark simplifies Enterprises' Big Data Architecture

      8:56
    • 5. How Apache Spark integrates with Hadoop Ecosystem

      12:15

67 Students · -- Projects

About This Class


Apache Spark is becoming the standard for a number of big data use cases around the world. Businesses everywhere are leveraging Apache Spark in their big data stacks, courtesy of its advanced capabilities like in-memory computation, and this trend has sharply increased the demand for individuals with skills in this domain.

This class, taught by a renowned data scientist working in the world's largest consultancy firm, is intended as an introduction to Apache Spark. The class covers the following topics:

  • History of Apache Spark
  • Overview of Apache Spark unique and advanced capabilities
  • Introduction of Apache Spark libraries (SparkSQL, SparkML, GraphX, Spark Streaming)
  • How Apache Spark fits in Enterprises' Big Data Architecture
  • Understanding Apache Spark in the context of Hadoop Ecosystem

These concepts are crucial for developing a strong foundation in Apache Spark. The class doesn't require any prior programming experience and is well suited to a broad audience, including developers and business analysts who want to get to know this amazing technology.

Meet Your Teacher


Irfan Elahi

Teacher



Transcripts

1. History of Apache Spark in the context of Big Data Evolution: Hello, everyone, and welcome to the next lecture of this course, which is about Apache Spark for big data and analytics. My emphasis is always that if you want to adopt a technology, or develop your skills in it, you need to have the context of that technology. It is equally important to understand where the technology came from, why there was a requirement for it, and what problems it addressed that were not addressed before, so that you know, and are confident, that what you are learning is truly substantial and important, when you would actually use it, and where the trend is heading. That brings me to this section of the course: we're going to have a sneak peek, an insight, into the history of Hadoop and Spark. It's not going to be a boring lecture; history can appeal to you or not, but this will be interesting, so just stay with me.

So where does Apache Spark come from? This takes us back to the old days. Since the dawn of time, I should say (it's a bit dramatic, but still), we humans have been trying to improve and enhance technology to address a number of problems, and one of the issues we have been trying to solve is scalability. Scalability is a very generic word, but when it comes to big data specifically, there are two dimensions to it: one is compute and the other is storage. Over the years, hardware has evolved and data sources have evolved as well. Technology has become cheaper: memory has become cheaper, computation capability has become cheaper, and processors have become so powerful, which prompted people and professionals to process more and more data from a number of different sources.

Back in the old days, the best way to attack the scalability problem was to scale up, which is synonymous with vertical scaling. Let me give you an example. Suppose you are working at a company, a fictitious company, and that company has data which is, for now, small, and for it you have a server. You have some software installed on that server, you are using that server to do different types of computation (maybe you're using Excel for the purpose), and you are building some predictive models and visualisations from it. This has been serving your purpose successfully. Now the company grows: its customer base increases, its data sources increase, the volume of the data increases, and you are in the position where you need to process, say, 10 TB or 100 TB. How will you attack this problem? The natural thought is to take your existing system, the machine or server that had, let's say, X amount of RAM, and upgrade it.
You just double it, triple it, or quadruple it, so that, for example, if you had 4 GB of RAM, you increase it to 8 GB, 16 GB, 32 GB, or 64 GB. And as you increase the RAM, in conjunction you upgrade the CPUs being used in that particular server: initially maybe you were on a dual-core processor, then you move to quad-core, and then you go octa-core. You are basically increasing the compute and memory resources of that one particular machine, and trying to make it capable enough to process bigger volumes of data. Now this approach is good, and it works to a great extent. A lot of leading technology vendors like Oracle and SAP adhered to this model for a lot of industry-standard solutions, like data-warehousing solutions and SQL-based solutions, and it served use cases well for a long time.

But it has problems. The first is that there is a limit to how far you can scale: you can increase the compute or memory of a server only to some extent. Yes, you can go to 100 GB of RAM, 120 GB, 250 GB, but you will always find a limit beyond which you cannot increase the RAM or the processors. If you have some hardware experience or knowledge, you will know that there are limited slots in which you can place RAM, and the same holds for the processors. So there is a limit: you always hit a threshold, a saturation point, beyond which you cannot scale up, and when you hit that point, you are stuck.

And this brings me to the second chapter of addressing the scalability problem. When people hit this bottleneck, they started to look for alternative solutions, and the resulting evolution was to change the way we look at scalability: to scale out, which is synonymous with another term, horizontal scaling. What it basically means is that if you have one machine being used for compute or, let's say, for storage, and your computation or storage requirements increase, then instead of making that one particular machine more powerful or resource-rich, you add more machines to the environment. You create a cluster, you create a distributed environment, and you provision ten machines, a hundred machines, a thousand machines in that particular environment to address the increased volumes of data and computation. That is an entirely different way from the previous approach of scaling up: you are not investing your efforts or resources in making one machine powerful; rather, you are adding more resources in a horizontal manner. This approach has proven to be truly successful, and it formed the basis of a lot of the big data technologies that are out there, like Hadoop and Spark. The different terms you hear about all work on this premise: if you want to process large volumes of data, scale out. Make your environment scalable, add more machines, add more virtual machines, add more nodes to the environment, and increase your compute and storage capability. Now, this approach to addressing the computation and scalability problem has worked and still works, but it is not without challenges.
There are a lot of challenges associated with scaling out, and one of them is that when you introduce different machines working in conjunction to address a particular data-processing problem, you are in a distributed paradigm. Whenever there are different machines processing data in a distributed manner, a whole set of challenges appears. You need to coordinate between these machines: how will you distribute your data-processing job onto the different machines? For example, if you want to compute the average of a particular column, let's say the price column of a table with one billion rows, how will you chunk, segregate, or divide that task onto the, let's say, ten machines you have in this environment? How will they do the computation, how will the results be reported back to you, how will the shuffling of the data happen, and how will state be maintained? Also, when you have these distributed machines working, there is always the probability that a machine becomes unusable, becomes damaged, or crashes for any number of reasons, so a lot of failure problems arise as well. And once you have converted your problem into a distributed computation problem, your programming complexity increases too. Previously you were working on one machine; all you had to do was, say, store everything in an array (if you were using a programming language like Java) or a list in Python, and just process it. But now you have to distribute that work, and a whole new set of challenges appears.

As I mentioned, the current big data technologies like Hadoop and Spark are all based on this horizontal scale-out approach, and these technologies address the different issues I highlighted. Back in the day, Google published MapReduce, one of the most disruptive and renowned papers, which basically introduced a novel way to process data in a distributed manner, and it solved a lot of these problems. In the map step, you distribute, assign, or delegate a task to different nodes, where the computation or processing happens locally; then, when the local computation is done, there is a reduce phase, where the results from all the mappers are brought back together. This is a very high-level view, and there are a lot of intricacies in it, but we will leave it at that.

In addition, Hadoop solved a lot of other problems as well, and one of those problems was fault tolerance: it introduced the idea that if a machine goes down, your processing and your storage will not be affected. Hadoop basically has two layers, computation and storage. For storage, Hadoop introduced the Hadoop Distributed File System (HDFS), which is still being used, and in it, it introduced the concept of data blocks.
Basically, if you store a file in Hadoop, it gets chunked into different blocks. Let me draw this out a bit; we're deviating slightly, but it's important. For example, if you store a file in a Hadoop cluster, let's say a 1 GB file, it gets divided into a number of chunks, which are called blocks, of a certain size. Typically the block size is 128 MB, and it is configurable; at 128 MB per block, that 1 GB file becomes eight blocks. These blocks are then stored across the machines, and the mapping of the file to its blocks and their locations is the responsibility of one of the Hadoop components (the NameNode). The result of all this innovation was that the earlier problems, like what happens when machines go down and how to coordinate computation in a distributed manner, were addressed to some extent. People started adopting it, the Hadoop revolution came into place, and Hadoop started penetrating businesses and organisations at a rapid pace.

But after a while, people started finding that there were still a lot of issues with Hadoop, and one of them was that it was based on the MapReduce programming model, which is heavily centred on disk: it is disk-I/O bound. What do I mean by that? At a high level, I always give the example of the word-count problem. Suppose you have a file with different lines of text, and you want to find the count of each word. The natural algorithm is: you load the file, you convert each word into its own element, and you make a tuple out of each one. If one word is "France", you emit ("France", 1); if another word is "Pakistan", you convert it into the same kind of data type, the word together with the number one. Next, you combine all the elements which have the same first element (which we call a key) and add them up. Maybe in this text "France" appears 30,000 times and some other word appears, say, 100 times, and then you have your word counts. So what's happening is that you are doing one type of transformation, and then some sort of aggregation, in different phases. In MapReduce terms, the first is called map and the second is called reduce; again, I'm keeping it at a very high level. Now, in the case of Hadoop MapReduce, each of these phases involves writing back to disk. If the map phase generates a result, it writes it to disk; then the next phase reads from disk, and writes back to disk again; and then it reads from disk again, and writes back to disk again. If you Google this, you will find that disk I/O is among the most expensive operations in a whole computation process, because disk, in comparison with RAM, is super slow. That is the reason that when you run a MapReduce job, even today if you still use MapReduce, you find it quite slow, even for computing relatively small tasks.
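To make the map and reduce phases concrete, here is a minimal word count in Spark's Scala API. This is a sketch, not code from the lecture: the input path is a placeholder, and the point is simply that the pair-and-sum logic described above fits in a few lines, with no per-phase disk writes of the kind Hadoop MapReduce performs.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Map phase: split lines into words, pair each word with a count of 1.
    val pairs = sc.textFile("input.txt") // placeholder path
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))

    // Reduce phase: sum the counts for each key (word).
    val counts = pairs.reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```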
Part of why it takes so much time is all that writing to disk and reading back from disk, and how much depends on how your MapReduce job is designed. There were a lot of other limitations as well: MapReduce was rather static and rigid in how jobs had to be expressed (that's a bit advanced, so I'll leave it there), but there were a lot of challenges. This prompted some intelligent people at UC Berkeley to launch another tool in the Hadoop ecosystem, known as Apache Spark, and that's where Apache Spark comes into play. Apache Spark was introduced to solve a number of the problems that existed in Hadoop, and one of those problems was the disk-I/O-bound nature that made Hadoop MapReduce slow. Spark, among a lot of other innovations, provided an innovative model to process data, based on in-memory computation. As I mentioned before, in comparison with disk, main memory is significantly faster, and Spark leverages this so-called capability in a very, very efficient way. It performs the computation in memory: for a number of operations, the intermediate results remain in memory without being written back to disk each time. It still writes to disk at some point, but that doesn't dominate the performance; the bulk of the processing happens in memory, and thanks to that you can get up to around 100 times the performance of MapReduce.

The other side of it is that Hadoop MapReduce was previously used for offline, batch computation. "Batch" is a word you will see a lot: a Hadoop MapReduce job used to run at a specific date and time, take its time, and then produce its output. It is offline, it takes time, and it's not always optimal. Spark is being used for offline plus online processing; another word for online processing is streaming, or stream processing, and you will hear the term "streams" a lot. Stream processing is a way to process data as it appears, as it happens, in near real time, instead of waiting for it, and all of this is possible because of the in-memory computation innovation that Spark introduced. But is Spark the ultimate solution? I would say no: there are still challenges, and Spark's community is actively working to address them. Still, Spark is the de facto tool when it comes to distributed, performant computation in big data environments.

So this brings me to the conclusion of this whole lecture. We started off with the approach of scaling up and identified its problems; that brought us to another way to address computation problems, scaling out; we saw how Hadoop addressed some of the resulting challenges, and then how Spark addressed a lot of the remaining computation problems; and we saw why Spark delivers a lot of performance benefits, which is mainly because of its in-memory computation capabilities. That is the real reason Spark is heavily used and preferred for a lot of data-processing use cases. I can't wait to see you in the next lecture, where we will look at a lot of other aspects of Apache Spark to develop your understanding of this powerful technology. Thank you.

2. Introduction of Apache Spark and its advanced capabilities: Hello, everyone, Apache Spark enthusiasts, and welcome to the next lecture.
In the previous lecture we had a sneak peek into the history of Hadoop: how it evolved, and why Spark came into the picture. In this particular lecture we're going to be formally introduced to Apache Spark, one of the most powerful technologies out there, and we're going to see different aspects of it. We won't cover the architectural concepts here, because those will be covered at a later stage. So let's see what Apache Spark actually is.

The first question that comes to mind is: where did it come into existence? Apache Spark was developed by brilliant students at UC Berkeley, and one of the main motivations behind developing it was to come up with a model, a computation framework, that could compute on data in memory, without requiring it to be written to disk. That is one of the reasons that catapulted Apache Spark's success in the big data community and ecosystem. Apache Spark is open source, and it is a general-purpose computation framework; that is one of its striking differences from MapReduce. MapReduce was a bit rigid and brittle: you had to use the map-and-reduce-based computation model to do your computation. Apache Spark, by contrast, is highly general purpose. You may want to do different types of data processing: on structured data, let's say delimited files; or process text data and do natural language processing; or run SQL queries against your data; or do some stream processing; or some graph analytics; or some statistical analysis; or train machine learning models. You can see that the possibilities of using Apache Spark are quite general purpose. It's not meant for just one purpose; its applications and its possibilities are almost limitless.

Another thing about Apache Spark is that it is distributed. It's not something that runs on one machine or one node, restricting you from processing data at scale; rather, it works in a distributed manner. You typically run Apache Spark on a cluster of machines, of some number of nodes: three nodes, five nodes, ten nodes, depending on your compute requirements, and Spark runs in a distributed manner across some or all of those nodes, depending on a number of factors. At production scale, and I talk from industry background and experience, Apache Spark is used in a distributed manner: it runs on a cluster that is provisioned for it, and it usually runs in conjunction with a Hadoop environment. It's not that you require dedicated machines for running Apache Spark: if you have a system where you have a distributed file system and other Hadoop services running as well, Apache Spark will run right alongside them. That said, you can still run Spark in local mode within your own laptop, and you can even configure the number of cores that you assign to it, which is handy for learning and experimenting.
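As a quick aside on that local mode: here is a minimal sketch, assuming the SparkSession entry point from Spark 2.x, of starting Spark on a laptop with a chosen number of cores. The application name is arbitrary.

```scala
import org.apache.spark.sql.SparkSession

// "local[4]" runs the driver and executor threads in one JVM using four cores;
// "local[*]" would use every core the laptop has.
val spark = SparkSession.builder()
  .appName("local-experiment")
  .master("local[4]")
  .getOrCreate()

println(spark.sparkContext.defaultParallelism) // typically 4 here
spark.stop()
```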
Back to cluster deployments: it's not like Spark commandeers the whole node; it runs alongside the other services on it, which basically improves its compatibility, because there are other execution engines in the market which require a dedicated deployment. Spark doesn't: it runs as part of the whole ecosystem, and it is scalable. For instance, say that while using Apache Spark you find you are able to process a certain volume of data, let's say 100 GB, and then your requirements increase: you find that now you need to process 1 TB. How will you absorb that demand? Spark is based on the horizontal scaling model, the scale-out way: if you want to process more data, you add more machines, and as you add more machines to the cluster, the compute capability available to Spark increases automatically. Again, there are a lot of things that go on at the back end (you have to provision those machines, and if you are running Spark on top of a cluster manager, YARN for instance, Spark will run on top of it if you have configured it that way), but the crux is this: if you have more compute requirements, add more machines to your cluster; it increases your compute capability, and you will be able to process more volumes of data in a much more efficient way.

Another thing about Spark is that it is fault tolerant. As you will see when we study the resilient distributed datasets concept of Apache Spark, if Spark is running on a node and that node goes down for any number of reasons (and trust me, nodes do go down), Spark is not going to simply say, "Sorry, a node failed, I cannot proceed, I cannot do my computation." It will simply continue to work. That's an important concept; and again, this capability is also there in Hadoop, but it is well managed and provisioned in Spark too.

Then there is data-locality-based computation, a powerful concept. Let me explain it with an instance. Say you have a cluster in place with, let's say, hundreds of machines, and your data is distributed across it. You want to process some files, say about 1 TB in five files, and there is a probability that the data is located on three of the nodes but is not available on the other two. If you are using Apache Spark, the way it works is that it brings the compute to the data, not the data to the compute. This is a big concept. Previously, in traditional systems, if you had data in one place and compute in another, you had to bring your data all the way over to the compute side, and that hurts: you get performance issues, network bandwidth issues, a lot of bottlenecks. Spark, and MapReduce in Hadoop before it, attacked this problem by leveraging data locality: process the data at the point where the data is residing. So Spark's jobs, its worker processes, will prefer to run on the nodes that hold the data instead of the nodes that don't, because if they ran on the other nodes, the data would have to be moved over to them, which involves network movement.
Network movement is always costly, because network bandwidth is always limited. So the data-locality-based computation principle is that Spark initiates your processes where the data resides, so that it reads the data locally and processes it locally instead of shuffling it around. It's a most powerful concept, and it significantly improves Apache Spark's computation capabilities. Other tools, like Impala, also leverage this concept.

I mentioned before that one of the unique propositions of Spark in the market is that it provides in-memory computation. If you have data that you want to process, what Spark ultimately does is initiate a number of processes across your machines: say there is your machine one and your machine two; Spark will launch executor processes on them. An executor process is exactly what it sounds like: a worker process that actually performs tasks. Ultimately it's a JVM, a Java Virtual Machine, the environment in which Java code actually runs; so an executor is essentially a JVM process. It performs the computation in memory (by "in memory" I mean within the JVM's on-heap memory), and it doesn't keep relying on the disk underlying the node; rather, it brings the data into JVM memory and performs the computation there. That is why it is strikingly fast and gives you a lot of performance benefits.

Another unique characteristic of Apache Spark is that it provides rich and expressive APIs. If you go through the documentation of Apache Spark, you will find that the APIs are intuitive, expressive, and truly powerful. If you have experience of writing MapReduce programs and jobs, you will know that those were really cumbersome: you had to write detailed mapper classes, then you had to write reducer classes, you had to manage the context, you had to manage a lot of other things, and then you had to create a job definition where you declare "this is my mapper class, this is my reducer class." It was really cumbersome: even for solving a small problem like a word count, you had to write lengthy boilerplate code, and Java itself is quite verbose as well. Scala, on the other hand, is much less verbose; it provides a sort of syntactic sugar on top of Java, and with it your developer efficiency increases. If you want to solve a big data problem, instead of writing plenty of classes and functions, you just write quick functions, quick expressions, leveraging the functional programming capabilities of Scala. It is amazing for writing short code to solve big data problems.

Another aspect is that Spark provides both low-level and high-level APIs. By low-level APIs I mean RDDs: Spark provides them for when you want more control, more flexibility, and more compact manipulation of your data. And Spark also provides high-level APIs. For instance, there is Spark SQL, one of the libraries, which allows you to run SQL queries to process structured data, and which provides the DataFrame as another data structure, where you have rows and columns just like Excel.
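To make the low-level versus high-level contrast concrete, here is a hedged sketch of the same filter written twice, once against the RDD API and once against the DataFrame API. The file name and column layout are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("api-contrast").master("local[*]").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// Low-level RDD API: you handle parsing and column positions yourself
// (header handling omitted for brevity).
val highValueRdd = sc.textFile("sales.csv")
  .map(_.split(","))
  .filter(fields => fields(2).toDouble > 100.0)

// High-level DataFrame API (Spark SQL): declarative, rows-and-columns,
// and eligible for optimization under the hood.
val sales = spark.read.option("header", "true").option("inferSchema", "true").csv("sales.csv")
val highValue = sales.filter($"amount" > 100.0).select("id", "amount")
highValue.show()
```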
Similarly, if you are using Spark Streaming, it provides DStreams (discretized streams, which actually consist of RDDs) as another high-level data structure. And if you are using Spark's machine learning library, you work with DataFrames there for sure, though you have to get the DataFrame into the proper format. The point I'm trying to make is that Spark provides you a lot of high-level APIs: in the case of DataFrames, you can use SQL-like functions such as select and filter, and you can apply conditions much more conveniently and much more quickly. So you get a convenience that was simply not found in the previous solutions.

Spark was written in Scala, and Scala is one of the fastest JVM languages out there. Scala itself is quite powerful as well: it supports object-oriented and functional programming in a truly powerful way, and it's one of the highest-paid programming languages too. Because Spark is written in Scala, a few things follow from that. Spark's community is quite active; there are a lot of contributors and committers to Apache Spark working on new releases, new features, and new capabilities, and those new features and capabilities are released first in Scala, and then in all the bindings, like Python, R, and Java. That is the first reason I am biased towards choosing Scala, and I find that industry people prefer using Scala too. Secondly, there is a lot of documentation and literature available arguing that if you use Apache Spark with Scala, you get performance benefits compared to Python. So, from a production point of view, you need to know that difference. And if you are one of those people who want to go down to the source-code level to develop a deeper understanding of Apache Spark, or who want to write something for Apache Spark or make some changes in it, you will have to have knowledge of Scala as well.

Spark also provides strong integrations with storage systems. If you want to read data from file systems, like Hadoop HDFS, Azure Blob storage, Amazon S3 for instance, or even the local file system, Spark can do it. If you want to bring data from traditional relational databases, like SQL Server or Oracle, it supports that as well. If you want to read data from NoSQL databases like HBase, Cassandra, or MongoDB, or write data to destinations like Solr or Elasticsearch, you can do that from Spark too. So it provides strong integration with storage systems, and more and more integrations are being developed. When you work in Spark, you don't work in isolation; it gives you a lot of strong integration capabilities.

And another interesting aspect of Spark shows up when you are working on a lot of iterative algorithms. If you have worked on machine learning, you may have heard of gradient descent. I won't go into the details of it, but the crux is that it performs a lot of computations again and again to address a minimization or maximization problem. For a lot of machine learning algorithms, iterative computation exists: you need to do the same computation, or a similar computation, again and again. For such iterative computations, Spark is superbly suited.
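Here is a small sketch of why caching matters for that iterative work. The loop below is a toy update rule, not a real gradient descent implementation, and the input path is a placeholder. The key call is cache(), which keeps the dataset in executor memory so each pass reads from RAM instead of recomputing from storage.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("iterative-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// One number per line; cache() pins the parsed RDD in memory.
val points = sc.textFile("points.txt").map(_.toDouble).cache()

var estimate = 0.0
for (_ <- 1 to 10) { // ten passes over the same cached data
  val error = points.map(p => p - estimate).mean()
  estimate += 0.5 * error
}
println(s"estimate = $estimate")
spark.stop()
```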
Try the same iterative pattern in MapReduce (there is a machine learning library for Hadoop, named Mahout) and you will find that it does not perform well, because of that same disk-I/O-bound cycle: do something, write back to disk, read from disk, do something, write back to disk again. It's slow. In Spark, as you know, it's in memory. And beyond that, the way Spark partitions a distributed computation problem, into jobs, within jobs into stages, and then into tasks, is a concept that gives it a lot of advantage over any other distributed computation platform out there.

So this concludes this lecture. You saw what Apache Spark is, where it came into existence, and its different characteristics, like fault tolerance and in-memory computation, and also a lot of other aspects, like which languages it uses. I would really recommend that you go through this lecture again if you need to, because I know I threw a lot of information at you, and also try to follow some of the books available online. There is the Learning Spark book by O'Reilly: do follow that if you can, and develop as much knowledge of Apache Spark as possible; it will really be helpful. I can't wait to resume our amazing journey of learning Apache Spark in the next lecture. Till then, thank you.

3. Introduction of Apache Spark Libraries: Hello, everyone, and welcome to the next lecture in this course. In this particular lecture we're going to go a bit deeper in our understanding of Apache Spark. In the previous lectures we first saw the history of Apache Spark: how compute and storage requirements evolved over the years, and what types of scalability models were previously used, that is, scaling up and then scaling out; we saw what MapReduce is, and then how Apache Spark addresses a number of problems that were intrinsic to MapReduce. Then we saw a foundational view of Apache Spark: what capabilities are there, and what different features and characteristics Apache Spark provides.

Continuing the same narrative, in this particular lecture we're going to see some more interesting features related to Apache Spark. Now, a question: even though you have been introduced to Apache Spark to some extent by now, if someone asked you what Apache Spark actually is, how would you answer? The best-made answer is: yes, it's a computation framework that is used to process data at scale; it allows you to do distributed computation on data, and it also allows you to do in-memory computation on data. But another view, another dimension from which to perceive Apache Spark, is that it is actually a growing ecosystem of libraries. It's a platform, and on top of it a lot of libraries are being built and have already been built. This is a very interesting concept regarding Apache Spark: it's not just a plain, horizontal platform that provides you a couple of APIs which you then use to solve your distributed computation problems.
It's not actually like that at all. The way Apache Spark has been developed is that there is one particular section, or component, of Apache Spark known as Spark Core, and that Spark Core library, or module you could say, provides its own abstraction, its own data structure: the resilient distributed datasets (RDDs) that we've seen before. On top of that underlying, foundational platform, a number of higher-level libraries have been built, and they have been built in a way that addresses specific use cases and specific processing problems. This particular approach has really catapulted, really revolutionized, the way distributed computation is done, and it has super-optimized developer efficiency as well, because at times, when you're working on certain problems, you don't need to go down to very low-level details; rather, you work at the level of the high-level problem.

A similar analogy I can give, though it's not technically 100% correct, will allow you to understand it better: programming languages. In the world of computing, we have certain languages which are very low level. We have assembly languages, and everything at the machine level is ultimately binary. Then you have languages like C, which is still considered quite low level, that allow you to interact with computer resources like memory and the processor at quite a low level. The trade-off is that the code you write in C is relatively complex, because you have to take care of a lot of things, a lot of moving parts, in order to make the program work, though the upside is that you still get a lot of good performance (that is another discussion). On the other hand, you have a lot of other languages, like Python, Java, and Scala as well, which are called high level, and what they actually do is abstract away many of the low-level concerns from the programmer, providing you high-level abstractions, high-level data structures, and high-level functions that you can use to do your tasks.

A similar concept can be seen in the context of Apache Spark. Yes, it's a platform, but on top of that, from a software engineering perspective, it provides you a number of domain-specific, high-level modules. A similar diagram has been drawn on the Apache Spark website, from which I have borrowed here: Apache Spark sits there with its Spark Core API, providing certain functions and certain data abstractions like RDDs, and on top of that there are certain modules that have been developed. Let me walk through them very quickly. The first of these is the Spark SQL API. In Java terms, it's basically a package, a library, and you have to import it in your program if you want to use its particular functions. Spark SQL specifically gives you a particular high-level data structure, which is called the DataFrame.
If you have used Excel, which provides you a data structure to work with (you have columns and rows in Excel), you know the idea; if you have used R, it has the concept of a data frame as well; and if you have used Python, specifically pandas, you get DataFrames there too. Similarly, in the case of Apache Spark, if you want to work at the DataFrame level, you use the Apache Spark SQL library, and when you use it, the primary abstraction becomes the DataFrame. Under the hood, the DataFrame operations that you issue, that you initiate against DataFrames (for example through the select API, the filter API, and all of those things), are ultimately converted into RDD operations. And there are a lot of benefits of using Spark SQL: under the hood, a lot of optimization happens as well. Spark SQL uses a particular optimizer known as the Catalyst optimizer, along with the Tungsten execution engine, which ultimately improve the efficiency of your operations. For example, if you're doing a particular filter operation and then doing some transformation as well, the execution plan that it is converted into is quite optimized; it's cost-based optimization. I'm just sharing some buzzwords and tips with you here that you can Google further.

So where do you use Spark SQL? Specifically when you have relational data, when you have tabular data, or when you are interacting with databases. That's where Spark SQL shines, because it gives you the DataFrame as the high-level data structure, which has a structure similar to a table, with rows and columns, that you can use to interact with your data in a much more convenient way. Still, there are a lot of places where I have used RDDs as well: a lot of the time you want more control, more manipulation capability, and that's where you drop down to RDDs, because RDDs give you more control. It's a bit more complex, but ultimately it gives you a degree of data-processing control that at times DataFrames cannot provide. It's a really powerful capability.
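Here is a minimal DataFrame sketch of the kind of select-and-filter work described above; the file and column names are invented for illustration. The same table can also be queried with plain SQL once it is registered as a view, and either way Catalyst plans the execution.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()

// A DataFrame: rows and columns, like a table or spreadsheet.
val orders = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("orders.csv") // placeholder file

// DataFrame API: select and filter, optimized by Catalyst under the hood.
orders.select("customer", "amount").filter(col("amount") > 500).show()

// Or register the DataFrame and use plain SQL against it.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer").show()
```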
Next is the Spark Streaming API. Before we get to it, note that there are different patterns of processing data out there. The first type of pattern is what we typically call the batch pattern: you have a set of jobs that are scheduled to run at particular times, so that when your data is made available, you execute those jobs; they start, they get the data, they do the transformations they are supposed to do, and then the jobs terminate, because they have done their work. The second pattern of execution is, of course, the streaming model, also known as stream processing, and what it connotes is a stream of data that is coming in all the time, where you process the data as it appears. It is a newer paradigm for processing data, because it reduces latency (you don't have to wait for the data, and you avoid all the scheduling complexity), and it is heavily being used; a lot of new tools and technologies are coming up in this space as well.

Now, in the case of Spark: if you have stream-processing use cases, if you have certain events or certain data arriving continuously and you want to ingest and process it in real time, that's where you use Spark Streaming. When you use Spark Streaming, you get a different data abstraction: you get something called discretized streams, or DStreams. The idea is that you have a continuous stream of data coming in: this chunk of data is one piece of the stream, then another piece arrives, then another, and so on, and you process it in that fashion. You get a DStream, which consists of a sequence of RDDs, one per chunk. You don't need to understand all the details at this point; what I want to highlight is the mapping of use cases. We saw that when you have a particular use case that is about processing relational data, you use Spark SQL; and typically, when you have a particular use case where you want to process data in real time, you use Spark Streaming. One disclaimer at this point: Spark Streaming is actually not a purely real-time stream-processing engine, because it ultimately works in micro-batches. I will try to create a course on that as well, because I worked on a huge project which employed Spark Streaming and its interaction with a NoSQL database, and I'm looking forward to developing that course. But in short, it's the library that allows you to cater for (near) real-time stream-processing use cases.
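A minimal DStream sketch of that idea, assuming text lines arriving on a local socket (you could feed one with `nc -lk 9999` while testing). Each batch interval produces one RDD inside the DStream, which is exactly the micro-batch behaviour mentioned above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the socket receiver, one for processing.
val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print() // prints each micro-batch's word counts

ssc.start()
ssc.awaitTermination()
```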
Then comes the machine learning capability. Ultimately, if you have a lot of data and you are doing transformations on it, generally at some point you will want to train some machine learning model on it. If you have some experience here, you may know that when you work on a machine learning pipeline, you follow certain steps: you acquire data, you transform it, and then you develop a model. As part of developing the model, you train it, you select certain features out of the data, and you apply certain techniques, like cross-validation, to minimize effects like bias and variance on your data set; then you make some predictions over your data, you validate your model, and then you iteratively repeat all of these steps. Now, of the steps I mentioned: acquiring data, Spark supports that; transforming data, Spark supports that. And when it comes to actually doing the specific machine learning tasks (feature engineering, training the model, making predictions, and also computing some statistical measures, like means and averages and some inferential statistics, all the different capabilities and functions that go into actually developing a model), all of these are provided in Spark as well.

If you want to develop machine learning models at production scale, where you have a lot of data, which requires a lot of distributed computation, you can conveniently use the Spark MLlib library, which also provides implementations of a lot of algorithms. For example, if you want to develop a classifier, or you want to develop a regression model, then instead of doing your transformations in Spark and then moving away to another tool, you can stay on one platform, which is Spark, and do the processing as well as the development of your machine learning models, using the same language and similar APIs. So MLlib provides a lot of algorithmic implementations of machine learning models, plus a lot of the tools and utilities that are required for feature engineering and model evaluation. If I compare it with other libraries, like scikit-learn, I must say that it is not as mature as scikit-learn or R's machine learning capabilities, but it is still heavily being worked on, and it's catching up with those libraries; the time will come when it is really at that standard. For now, it's one of the best tools out there for scalable machine learning, and if you have use cases where you want to develop machine learning models on big data, it is really the go-to library on the Spark platform.
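As an illustration of staying on one platform end to end, here is a hedged MLlib pipeline sketch: assemble features, train a classifier, score. The file, the column names, and the choice of logistic regression are assumptions for the example, and in practice you would score a held-out set rather than the training data.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-demo").master("local[*]").getOrCreate()

// Hypothetical training data: numeric feature columns plus a 0/1 label column.
val training = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("training.csv")

// Feature engineering: combine raw columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")

val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// fit() trains the whole pipeline; transform() scores a DataFrame.
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)
model.transform(training).select("label", "prediction").show()
```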
Then comes the graph library. You may know the concept of a graph: at a high level, it is a structure for studying the relationships that exist between objects, built from the concepts of vertices (nodes) and edges. Social networks are the classic example: you have a network of friends, and there is a certain degree of relationship you have with certain friends; there are certain friends with whom you interact the most, and there are certain friends who have more followers, so there are different levels of interaction between different entities. If you want to capture such relationships, Spark's GraphX library provides you the constructs and the data structures that you can use to model, simulate, and work on such graph-theoretic problems. Truly powerful stuff.

So the key takeaway from this particular lecture is that if you want to use Spark, there are a number of ways to approach it. One is that you can use the Spark Core API to do all of the things I just mentioned; but there are also certain high-level modules, or libraries, that are already there in Spark, are heavily being worked on, and that you can leverage to work on those particular problems with more efficiency. Use Spark SQL mostly for relational data, Spark Streaming for stream processing, MLlib for working on machine learning problems, and GraphX for working on graph and analytics problems. When you combine all these things, what you get, ultimately, is one platform that caters for a whole lot of use cases, and as a result your efficiency improves, because you don't have to chase different tools, technologies, and platforms, and you don't have to look for different development or technology skills in your developers. So it really simplifies a lot of things and improves your ability to operate at scale. I hope this has been insightful and helpful. I'm really looking forward to continuing this lecture series, and in the next lecture we're going to see more aspects of this amazing technology that we know as Apache Spark. Thank you.

4. How Apache Spark simplifies Enterprises' Big Data Architecture: Hi, everyone, and welcome to the next lecture in this course. In this particular lecture we're going to take an enterprise view of big data, because this particular concept is also crucially important: as part of your day-to-day role, if you're working as an Apache Spark developer, you need to develop your understanding and skills with a holistic view of how Apache Spark fits into the overall enterprise big data environment, into the overall enterprise ecosystem, to serve certain computation requirements.

I have touched before on the two main types of processing pattern, batch and streaming, so let me restate them very quickly. Batch is where you have a certain amount of data that becomes available at certain times during the day, and you want to process it depending on a certain schedule: you do batch processing on it when the data is made available. For example, files land at particular times, and you launch jobs that are scheduled to run at, say, 9 AM every day, or 12 PM every day; usually they are executed outside business hours. Those jobs have certain defined logic: they acquire the data, they apply certain types of transformations (which can be cleansing the data, handling missing values, or applying certain business logic as well), and they ultimately generate output. That output can feed the data into certain data-warehousing units, for instance an OLAP store or a data warehouse; these are somewhat different things, but you can say they cater for reporting requirements. So you have data initially in raw form, you apply certain transformations on it so as to make it conducive for reporting requirements, and ultimately the dashboards and reports are driven from there. That's one way of processing.

The other way: you have certain sources that are emitting data all the time. For example, social networks, specifically Twitter or Facebook; similarly, web data, and RSS feeds from certain blogs. Or say your organization is a retail organisation which has certain sensors placed in different locations; or it has some logistics operation, with certain sensors placed on its vehicles, and it wants to analyze that; or your organisation has industrial factories where it wants to monitor certain measures, like the temperature and pressure of different machinery; or the organization has a field workforce, and there are sensors associated with the workforce as well (in the field-safety kind of scenario, sensors are placed on the hat or the cap of the worker, to analyze how they're working). So there are a lot of scenarios like this.
But the thing is, if you have such sources that emit data all the time, or data that can arrive at any time, and there is a business requirement to analyze, process, or persist that data in real time, as it happens, then you basically fall into the speed layer. The previous case, where a certain amount of data arrives and you have jobs that run on a schedule at a specific time no matter when the data became available (your job will always run at, for instance, 9 PM every day), falls into the batch layer. In the case of the speed layer, you process the data as it arrives, in small increments: for example, in every three seconds you might get, let's say, megabytes of data, or kilobytes. In the batch case, you are generally processing data in gigabytes, because you process it all at once.

Previously, the situation was that you had different data sources, and those sources fed into the technologies in your organization which were specifically optimized for batch processing: the traditional data-warehousing and database technologies, like SQL Server or Teradata or Oracle, which were built for batch processing. They had their respective constructs: for SQL Server you used stored procedures, configured to do certain types of processing; for Teradata you used BTEQ scripts. And all of those were programmed in such a way that you had to rely on a third-party scheduler; there are different ways, some people use cron, some people use Control-M, and there are different tools out there that are used to schedule the execution of the stored procedures or BTEQ scripts that do the transformations and ultimately generate data into the serving layer, from which you do the reporting. So that's one layer, one branch of your computation. The other branch of computation is the newer one, where there are certain data sources which are sending data continuously and you have to analyze them as it happens, and there are certain technologies in this space, usually open-source technologies: in the Hadoop world you have Spark Streaming, Flink, Samza; different players are coming up in this space. And if you go to, for instance, Azure, you also get Azure Event Hubs and Azure Stream Analytics, so these sorts of technologies are also being provisioned and developed by the cloud environments as well. You use those technologies to process data as it appears, and ultimately it goes into the serving layer, and those two data assets are then joined together, which gives you a more holistic view of your data.

Now, the challenge with this was that, previously, different technologies were being used for the different types of pattern: for batch there were certain platforms and technologies, as I mentioned, and for the speed layer different technologies were being used. One of the things that really accelerated the adoption of Spark, or you can simply say revolutionized the processing paradigm specifically via Spark, is that Spark is perfectly suitable, the perfect candidate, to be used in both the speed layer and the batch layer. This is the most important thing: you don't have to use different technologies to serve your batch-processing requirements on one side and your speed, or real-time processing, requirements on the other.
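A sketch of what that reuse looks like in practice: one transformation function written once in Scala, applied in a batch job over archived files and again inside a streaming job. The paths, host, and port are placeholders; the point is simply that both layers can share the same code and the same skills.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One transformation, written once...
def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
  lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

val conf = new SparkConf().setAppName("lambda-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
val sc = ssc.sparkContext

// ...used in the batch layer over archived files...
wordCounts(sc.textFile("archive/*.log")).saveAsTextFile("batch-output")

// ...and reused in the speed layer over a live socket stream.
ssc.socketTextStream("localhost", 9999).foreachRDD { rdd =>
  wordCounts(rdd).take(10).foreach(println)
}
ssc.start()
ssc.awaitTermination()
```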
If you're using Spark, you can use one technology, and you can use one language, which is Scala (or, if you're using PySpark, Python; it's up to you), to basically cater for, personalize, and orchestrate your batch processing and stream processing requirements. This is a huge gain, because you don't have to invest in licensing all the different types of technologies, or the different infrastructure that each of them requires; you use one system, one technology, and one language to do all that processing. That has a lot of impact as well: for example, you don't have to hunt for people with different types of skills, because different technologies require different skill sets, one for SQL Server, another for Teradata, and so on for all of those things. In the case of Spark, it is perfectly suited to cater for both your enterprise batch processing needs and your enterprise stream processing needs, and it is widely used for both, so this is one of the most important selling points of Spark.

If you have batch requirements, you can use Spark, for example the Apache Spark Core, Spark SQL, and Spark MLlib libraries, to cover your batch processing needs. Similarly, for the speed layer you can use Spark's stream processing API, that is, Apache Spark Streaming, to develop applications that process data in near real time. This is one of the most amazing and attractive features of Apache Spark, one of its most unique selling points: you have one platform to cater for all these things.

So that is how Apache Spark fits into the whole picture. It's being used in two ways: people develop Apache Spark applications that are scheduled to run at certain times, and those applications generally use Spark APIs, like the RDD or Spark SQL APIs, to do the type of transformations they want; and similarly, there are certain long-running jobs, jobs that are being run all the time, that actually leverage the Apache Spark Streaming API to cater for real-time processing use cases. Your DevOps is simplified as well, because you have one technology to maintain and to operate; your development life cycle is improved as well; your costs go down as well; everything is simplified and optimized. So this was the view that I wanted to give of Apache Spark's use cases in current enterprises. I hope it has been insightful. In the next lecture, we're going to take our understanding of Apache Spark to the next level. Thank you.
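To illustrate the one-platform point from this lecture, here is a small sketch of how the same transformation logic can serve both layers. It uses Structured Streaming, which is available in newer Spark versions; the paths, schema, and in-memory sink are hypothetical and for illustration only:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object UnifiedSketch {
  // One transformation, reused by both the batch and the speed layer
  def summarize(events: DataFrame): DataFrame =
    events.groupBy(col("event_type")).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unified-sketch").getOrCreate()

    // Batch layer: process a full day's worth of landed files at once
    val batch = spark.read.json("/data/landing/events/2024-01-01/")
    summarize(batch).write.mode("overwrite").parquet("/serving/batch_view")

    // Speed layer: apply the same logic to new files as they arrive
    val stream = spark.readStream.schema(batch.schema).json("/data/incoming/events/")
    val query = summarize(stream).writeStream
      .outputMode("complete")
      .format("memory") // illustrative sink; production jobs write elsewhere
      .queryName("speed_view")
      .start()
    query.awaitTermination()
  }
}
```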
5. How Apache Spark integrates with Hadoop Ecosystem: Hello, everyone, and welcome to the next lecture. In the previous lecture, we talked briefly about how Apache Spark fits into an organization's processing requirements, and how Apache Spark simplifies an architecture that generally caters to two processing patterns, batch and streaming. What I didn't mention, interestingly, is that there is actually a name for this architecture. If you talk to big data folks, then you'll often hear this term, "we are using the Lambda architecture"; you will really hear it a lot. What it actually means is that you have an architecture, and within that architecture you have, fundamentally, mainly three layers. One of the layers is the batch layer, which is used for batch processing of your data. One of the layers is your streaming layer, or speed layer, capturing the stream processing capabilities of your organization. And the last layer is the serving layer, where all that data is actually persisted, joined, or enriched, and where people then use a query interface, which can be Impala, or you can create Hive tables on top of HDFS, which provide an interface to actually interact with the data that has been transformed and processed by those two layers. So it actually has a name, which is the Lambda architecture, L-A-M-B-D-A; let me write it here as well, like this. So keep it in mind; it may come up in your interviews.

Now, moving to the next part, which is a really interesting discussion, and that discussion entails this: yes, we saw how Apache Spark fits into an overall organization's processing landscape, but there is still the Hadoop ecosystem. When you talk about the Hadoop ecosystem, you will have to understand how Apache Spark fits into the whole growing ecosystem of Hadoop as well. Now, Hadoop is not the name of one technology. There is a lot of misconception that Hadoop is the name of one particular technology, but that's actually not the case. When Hadoop started, it actually consisted of two major technologies: one was MapReduce, and the other was its scalable storage system, the Hadoop Distributed File System (HDFS). It then grew, essentially exponentially, and it evolved to cater for a whole set of functional and non-functional capabilities. Now Hadoop is actually an ecosystem, or you could say a planet, a phenomenon, consisting of different types of technologies that serve certain different types of capabilities. So this is a holistic view of Hadoop as a whole: it lists different types of capabilities and functions, and it also does some mapping of technologies as well.

Now, in a typical Hadoop deployment, you have certain layers. For example, you have an integration layer, storage, resource management, and so on, and Apache Spark fits in a specific place here. Let's have a brief overview of what this looks like. The first thing is the integration layer. When you have a Hadoop cluster in place, you have to integrate it with your external systems, with your existing IT estate, because, yes, you provision a cluster, you stand up a certain set of machines that are used for storage and compute, but you need to find a way to connect the cluster with your source systems, with your data sources, so that the data can get inside the cluster. This integration can take different forms. You can integrate with databases: you have source systems, which can be transactional systems, or mainframe systems, or certain other types of systems that are generating data, and that integration can be in the form of database-level integration, which can be at the JDBC level.
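Here is a minimal sketch of the serving-layer idea above: registering processed data as a Hive table so that SQL interfaces such as Hive or Impala can query it. The database name, table, columns, and HDFS location are all hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object ServingLayerSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport registers tables in the Hive metastore, making them
    // visible to other SQL-on-Hadoop tools as well
    val spark = SparkSession.builder()
      .appName("serving-layer-ddl")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS serving_db")
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS serving_db.batch_view (
        customer_id STRING,
        total_amount DOUBLE
      )
      STORED AS PARQUET
      LOCATION '/serving/batch_view'
    """)

    // Consumers can now interact with the serving layer through plain SQL
    spark.sql("SELECT * FROM serving_db.batch_view LIMIT 10").show()
  }
}
```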
You also have to integrate with different types of file systems; you have to integrate with different API-based sources, like Twitter, which is another form of API that you may have to consume; and you need to integrate with different message queuing applications like RabbitMQ and JMS-based systems, and there are different sources like that. And there is an integration layer, an integration component of Hadoop, that actually caters for this. Tools like Sqoop fall into this particular category, integrating with your databases, and tools like Flume also fall into this category, providing you the capability to ingest data from file systems or message queuing applications. Flume also exposes an HTTP endpoint as well: it creates an HTTP endpoint, and you can send your data in the form of REST calls and then ingest it into your cluster. So tools like Flume and Sqoop mainly fall into this particular category. And now there's a new paradigm shift, a new trend happening: tools like Apache NiFi are coming up in this space as well. Let me just write these names here: Apache NiFi, and StreamSets. These are the tools that are coming up and falling into this space as well. You are more than welcome to Google these technologies; they are really powerful, and they allow you to take your integration capabilities to the next level.

Now, then comes the storage part. When you have data, and when you have integrated with your sources, the data ultimately needs to be stored, and in this case a lot of tools and technologies come into the space. Mainly, they are divided into two parts: one is file systems, and the other is databases. In the case of file systems, the first thing, the first technology that comes to mind, is the Hadoop Distributed File System. It's like a virtual, scalable file system that is relatively limitless; it is designed to store data at very large scale. And there are a lot of databases as well. When I say databases, there is a further distinction between SQL and NoSQL. NoSQL basically means that there are no relational constructs: things like tables, relationships between tables such as foreign key constraints, and joins. There are a lot of players in this space, like HBase, Cassandra, MongoDB, and Couchbase, and the list goes on and on. On the SQL side, there are very few such tools out there, but still, Kudu, when it comes to actually persisting data, is the new player in this particular space; Cloudera is putting a lot of weight behind it. It is a really powerful tool that gives you unique capabilities that are not available even in HDFS, rivaling a lot of technologies out there as well. So Kudu is the new player in this space, and the unique ability of Kudu is that it gives you row-level updates and deletes, and indexing capability as well, which is truly powerful; I will be developing a course on that technology as well. Similarly, you must have heard the names of Hive and Impala. Now, personally, my view is that Hive and Impala are not storage capabilities; rather, they give you a SQL interface on top of data that is stored in Hadoop. So I'll just clarify them here.
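As a sketch of the database-level (JDBC) integration described above, here is how Spark itself can pull a table from a relational source and land it on HDFS. Note that this uses Spark's own JDBC reader rather than Sqoop, purely to illustrate the same style of JDBC import; the connection details are hypothetical, and the appropriate JDBC driver must be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object JdbcIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-ingest-sketch").getOrCreate()

    // Hypothetical source database; Sqoop performs this kind of import as a
    // dedicated integration tool
    val customers = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/sales")
      .option("dbtable", "customers")
      .option("user", "etl_user")
      .option("password", "etl_password")
      .load()

    // Land the imported table on HDFS, the usual target of the integration layer
    customers.write.mode("overwrite").parquet("/data/landing/customers")

    spark.stop()
  }
}
```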
So there are a lot of components that go here. Similarly, a lot of components fall into the resource management space. By resource management, I mean this: you have a cluster, let's say 10 machines, or 50 machines, or 100 machines, and those machines ultimately give you resources; they give you compute and they give you storage, and the processors and the memory are actually pooled together. For example, if you have 10 nodes, and each node has, let's say, 100 GB of memory, of RAM, then you get 1000 GB of RAM that you can use for your distributed computation requirements. Similarly, if each node has one terabyte of disk space, then in total you get 10 terabytes of space. Now, you have to manage those resources as well. For example, if you are running 10 jobs on the cluster, how will you assign different amounts of resources to those different jobs? If you have five Spark jobs running, will one job grab all the processing power and the memory that is available, or will you architect the system in such a way that each will get a fair portion of it, or a certain percentage of it? Similarly, if you are running Impala queries, for instance, you definitely have to come up with workload management constructs: whether one query that is running will take up all the resources, or will take a certain fraction of the resources, and how many queries can run in the system at a particular time. All of these discussions belong to resource management, and there are certain technologies, particularly YARN, that come into this space.

Spark actually runs on top of YARN and claims the resources that it requires to do its processing. So Spark is a processing framework that runs on top of YARN as the cluster resource manager: for its processing capability, Spark requires RAM, and it requires processors, specifically cores and threads, to do the actual computation, and it gets those resources from the cluster by working with YARN. So mainly YARN is there. There are other resource managers as well, like Mesos, which is not of much use here; interestingly, Spark was initially developed alongside Mesos, but now Mesos is not that prevalent in the industry, and you will almost always find YARN. Impala, if you use it, you will come to know that it does not run on top of YARN; it has its own workload management constructs. So I'll still write it here: technically, it's a processing framework.

Now, then come the applications, or other systems, that run on top of this. As I mentioned, there are certain NoSQL databases; HBase comes into that category. If you want to run SQL queries to interact with the data in your system, you use Impala. If you have certain unstructured or semi-structured data in your Hadoop cluster and you want the capability to search that data, you use Cloudera Search, which actually uses Cloudera's Solr, which has a search engine at the back end. And, as you've seen, there is MapReduce, which is still being used as well.
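To ground the resource discussion, here is a minimal sketch of a Spark application asking YARN for an explicit slice of the pooled cluster resources. The figures are illustrative (10 executors with 4 GB and 2 cores each, i.e. a 40 GB / 20-core share), and on a real cluster the master is usually supplied via spark-submit rather than hard-coded:

```scala
import org.apache.spark.sql.SparkSession

object YarnResourceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("yarn-resource-sketch")
      // Request a bounded share of the cluster instead of grabbing everything
      .config("spark.executor.instances", "10")
      .config("spark.executor.memory", "4g")
      .config("spark.executor.cores", "2")
      .getOrCreate()

    // ... the job's actual processing would go here ...

    spark.stop()
  }
}
```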
And you may have heard of Hive, one of the powerful technologies in this space, which gives you SQL-on-Hadoop capability: that is, you use the SQL language to interact with data in your cluster. So you write Hive queries that look just like SQL queries, and those queries actually use MapReduce to do the processing at the back end. Now, when it comes to in-memory processing, it's a bit misleading, because Impala is also an in-memory processing engine, but still, when it comes to in-memory, Spark is one of the strongest players in the Hadoop stack. So when you want to process data in a distributed form and in memory, you use Spark, because it processes in memory and therefore with really low latency; that's why it's the right fit for stream processing use cases as well. And when you want to use machine learning, you can use tools like R and SAS, but you can also use Spark and its MLlib APIs.

Then there are tools that fall into the security space as well; that's where you will hear terms like Sentry, which is used for role-based access control, and similarly Ranger as well, and Atlas as well, which is mostly used for metadata and lineage in this space. And for metadata, there are a lot of tools: for example, when you use Hive, it comes with the Hive metastore, which gives you metadata about the data that is there in your cluster, and which is typically stored in a relational database.

So, in the whole scenario, you have layers for integration, storage, and resource management, and when you want to process data in memory, you use Spark, which actually uses a resource manager, in Hadoop typically YARN, to process data that is there in your storage layer, be it HDFS or HBase or whatever, and that data can come from different sources that are integrated with your Hadoop cluster. That's how you need to understand the whole stack. I hope this has been insightful. Do try to Google and learn more about these technologies, because the more you know, the better. In the next lecture, we're going to take our understanding of Apache Spark, and other views of it, further. If you have any questions, you are more than welcome to ask them. Thank you.
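As a final illustration of the MLlib point above, here is a minimal sketch of training a model with Spark's machine learning APIs. The input path, feature columns, and label column are hypothetical:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MllibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mllib-sketch").getOrCreate()

    // Hypothetical training data: numeric feature columns plus a binary label
    val training = spark.read.parquet("/data/curated/training")

    // MLlib expects features assembled into a single vector column
    val assembler = new VectorAssembler()
      .setInputCols(Array("feature1", "feature2"))
      .setOutputCol("features")

    val model = new LogisticRegression()
      .setLabelCol("label")
      .fit(assembler.transform(training))

    println(model.coefficients) // inspect the fitted model
    spark.stop()
  }
}
```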