Hadoop for Beginners | Nitesh Jay | Skillshare

Hadoop for Beginners

Nitesh Jay, Teacher

47 Lessons (8h 33m)
  • 1. 000 Intro & Course Overview (1:26)
  • 2. 001 Big Data Big value (5:46)
  • 3. 002 Understanding Big Data (5:16)
  • 4. 003 Hadoop and other Solutions (7:25)
  • 5. 004 Distributed Architecture A Brief Overview (2:54)
  • 6. 005 Hadoop Releases (5:16)
  • 7. 006 Setup Hadoop (28:57)
  • 8. 007 Linux Ubuntu Tips and Tricks (4:34)
  • 9. 008 HDFS commands (10:32)
  • 10. 009 Running a MapRed Program (7:48)
  • 11. 010 HDFS Concepts (4:35)
  • 12. 011 HDFS Architecture (6:35)
  • 13. 012 HDFS Read and Write (4:54)
  • 14. 013 HDFS Concepts II (4:04)
  • 15. 014 Special Commands (6:34)
  • 16. 015 MapReduce Introduction (6:05)
  • 17. 016 Understanding MapReduce Part 1 (5:12)
  • 18. 017 Understanding MapReduce Part 2 (5:19)
  • 19. 018 Running First MapReduce Program (10:31)
  • 20. 019 Combiner And Tool Runner (11:05)
  • 21. 020 Recap Map, Reduce and Combiner Part 1 (7:27)
  • 22. 021 Recap Map, Reduce and Combiner Part 2 (7:45)
  • 23. 022 MapReduce Types and Formats (5:37)
  • 24. 023 Experiments with Defaults (7:11)
  • 25. 024 IO Format Classes (6:16)
  • 26. 025 Experiments with File Output Advanced Concept (3:38)
  • 27. 026 Anatomy of MapReduce job run (4:22)
  • 28. 027 Job Run Classic MapReduce (7:54)
  • 29. 028 Failure Scenarios Classic Map Reduce (3:45)
  • 30. 029 Job Run YARN (9:45)
  • 31. 030 Failure Scenario YARN (5:18)
  • 32. 031 Job Scheduling in MapReduce (5:06)
  • 33. 032 Shuffle and Sort (4:32)
  • 34. 033 Performance Tuning Features (7:10)
  • 35. 034 Looking at Counters (6:21)
  • 36. 035 Hands on Counters (3:32)
  • 37. 036 Sorting Ideas with Partitioner Part 1 (7:19)
  • 38. 037 Sorting Ideas with Partitioner Part 2 (5:31)
  • 39. 038 Map Side Join Operation (4:42)
  • 40. 039 Reduce Side Join Operation (4:29)
  • 41. 040 Side Distribution of Data (3:47)
  • 42. 041 Hadoop Streaming and Hadoop Pipes (2:24)
  • 43. 042 Introduction to Pig (9:24)
  • 44. 043 Introduction to Hive (10:07)
  • 45. 044 Introduction to Sqoop (8:43)
  • 46. 045 Knowing Sqoop (4:05)
  • 47. 046 Advanced Hadoop (211:53)

About This Class

Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications.

The following modules are explained in this class:

  • Introduction to Hadoop
  • Hadoop Setup
  • HDFS Architecture and Concepts
  • Understanding MapReduce
  • MapReduce Types and Formats
  • Classic MapReduce and YARN
  • Advanced MapReduce Concepts
  • Introduction to Hadoop Ecosystem

Transcripts

1. 000 Intro & Course Overview: Are you excited and want to learn big data technologies? Do you feel that the Internet is overloaded with free materials, but that it is complicated for a newbie to navigate? Free learning materials can be a can of worms for a beginner looking for a jump start, while classroom trainings would cost you an arm and a leg. And when you look at webinars from other institutes, you find them of poor quality and dodgy, with no warranty; basically a huge risk. The world can turn upside down while learning a new, complicated technology. This course covers everything you need to know to start your career in this new technology and achieve expertise to a level where you can pass certification exams like Cloudera and Hortonworks with confidence. You can start as a beginner, and this course will help you become a certified professional. This course will take you through the need and advent of big data technologies, how to set up Hadoop, details of the HDFS mechanism, how a MapReduce program works in classic MapReduce and YARN, important considerations you need to take to write MapReduce programs, and an introduction to the Hadoop ecosystem. Get on track for Hadoop certifications, get flooded with job offers, and land the coolest IT jobs of the current times. Weigh your options and make the right decision. See you in the course. 2. 001 Big Data Big Value: Welcome to the first session, Big Data, Big Value. This session is designed to help you understand why data is so important in modern times and to establish the need for big data technologies. As you begin this course, I expect you will have already heard many people say that companies like Facebook, Twitter, and Google are generating and working on petabytes of data every day. The Large Hadron Collider in Geneva is producing 15
petabytes of data every day, so much so that they are throwing away most of the data, hoping that there should not be anything valuable left unanalyzed in it. While these facts are interesting, they fail to show the importance of big data to a normal organization. So I will begin with a classic problem: an organization trying to find the price of a new product, and the importance of data in getting to the optimal price. In this case, the organization is a bank launching a new PC insurance product. This is a very unsophisticated example meant only to show the value of data to an organization, so please don't mind the attributes I have taken into consideration. The greatest priority for any organization, in this case the bank, is to find the optimal price for the new product, one that will generate maximum revenue and be equally welcomed by the market. To calculate the optimal value, it has lots of internally owned data which might be of help. First is the mainframe repository, which may contain all the customer information and the account logs generated over so many years. Second, they would be hosting websites, and the daily activities on those websites can be valuable for understanding the market range and the interests of the customers; these can be derived from clicks and from people showing interest in a particular product on a web page. Third, they have the spending patterns of all the customers, which can yield important information to understand and categorize each customer. Along with this internal data, there are external sources also available which will be important for the analysis. First of all, the all-important competitor pricing. Second, the social media buzz, which would be gathered by market research firms analyzing trends from activities on social media.
Lastly, third-party statistics, which would give an idea of things like the recent trends in medical problems and the expenditure on them, or how many accidents are happening and how safe the housing is for people in a locality. The bank would collect all this information, weigh the inputs against each other, and run statistical algorithms to find the optimal price. In this example, we see how the data acts as a decision support system. The more the factors taken into consideration, the better the decision support system. So the more the data, the more accurate the predictions will be. At this point, let us peek into the future and see how big data technology is going to change the method of decision making. In the future, data would be the foundation of a digital nervous system. What this means is that, based on changes in any of the input attributes, the output would automatically change, something like Skynet. Let us understand this with an example. Suppose you have posted on social media that you are planning a trip abroad. The bank's software gets this feedback from its sources, which keep an eye on social media updates, and the software automatically sends an offer for travel insurance suited to your trip. Or suppose the competition changes its price; the price of our product automatically changes to a new optimum value so as to maximize profits. This is a futuristic vision of a computer network that imitates the biological nervous system in four main characteristics: first, deciding what bit of information is important and what is not; second, learning from experience; third, adapting to changes in its external environment; fourth, reacting swiftly to dangerous or threatening situations. So that was about the future. Let's see how data is used as a decision support system at present in organizations. At present, we use data warehousing, so let us look at the whole view of its architecture. There will be multiple sources of data.
These would be sampled, cleaned, and put into a database known as a data warehouse. On top of this data warehouse, statistical algorithms would run, which would create reports helpful in business decisions. In this architecture, there are two limitations. First, the data is sampled on the basis of relevance; the whole data is never seen, so we would be looking at a partial picture only. This sampling is necessary for the data warehouse to function, as if the complete data were considered, the data analytics would take days to give results. For this reason, only a sample of the data with the most important attributes is considered for the data analytics. It is like looking through a keyhole and trying to gauge the size of the room. Second, the data from various sources was cleaned and processed just to get it ready for the analysis, so by the point in time the analysis was run, the data was already stale; the decision has not taken into consideration the current situation, which is the most important. 3. 002 Understanding Big Data: Welcome to the lesson on understanding big data. In the previous lesson, we understood the value of data to analytics, which plays an important role as a decision support system. In this lesson, we will understand big data a little more deeply. Let us look at the definition. By definition, big data is a collection of datasets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. Let us break this definition into parts. First, big data is a collection of datasets; as we saw in the previous example, the bank had so many sources of data that the bank's application had to get datasets from each source and stitch them together, and then run the data analytics. The second part of the definition is large and complex.
The size of the data considered in analytics should be thought of as a window through which we try to look and get a picture of the outer world. The bigger the window, the better the picture, and the better the analysis and the decision. So it is important that the size of the data is large. As well, in the case of any organization, there is a variety of sources, which adds to the complexity. The third part of the definition is that it becomes difficult for traditional tools to process: when the complexity and the size increase, the efficiency of traditional tools decreases, and the decrease in performance grows steeply with the increase in size. We will compare and understand the reason for this in a little more detail in the next lesson. For now, let's look at the attributes that describe big data and understand them with the banking problem, so you will be able to relate to them. The big data attributes are: first, volume. Because there will be so many sources, the data put together will be large, and as we have already discussed, the size of the data should be as large as possible so that the data analytics algorithms can produce meaningful results. Second is variety. Each source of data will have its own schema and manner of storage. For example, the internal data repository would be a mainframe database, while the market research firms may share the social media updates as a flat file. Third is velocity. Fresh new data will be flowing into the organization on a weekly, daily, or maybe even hourly basis; this rate of flow of data over time is described as velocity. So there are these three attributes, volume, variety, and velocity, which are to be kept in mind when thinking of big data. Let's look at a few key points that are vital for data analytics to generate accurate results. Data analytics would give us cues as to whether a decision is worthy or not.
It is important for an analytical algorithm to be trained on a large dataset so as to predict correctly; the larger the dataset, the better the accuracy of the analytical algorithm. It has been researched and shown that a simple algorithm on a large dataset gives more accurate results than a sophisticated algorithm on a small dataset. This shows the importance of the largeness of the data. Considering hundreds of parameters rather than just five would increase the accuracy of the analytical model, so the more the parameters, the better our analytical model. For statistical analysis, the data need not be relational; once it has been gathered and put in the warehouse on which we run the data analytics, it would hardly be changed. So the pattern of the architecture would be write once and read many times. Next, let us look at an industry study done in the field of data size and its growth, so as to gauge what is ahead of us in the future. International Data Corporation is a market research firm which carries out a measure of all the digital data created, replicated, and consumed in a single year. It also predicts the trends in various subjects relating to data. Here are a few exciting points taken from the most recent survey. The data from 2005 to 2020 would increase by a staggering 300 times; that implies a whopping 5,200 GB for every human being. The data would double every two years from now until 2020. 33% of this data would be valuable if analyzed. There will be a lot of expenditure on big data technologies in the time to come, so if you are stepping into the field of big data, I congratulate you. As an exercise, I would suggest you do a Google search on the IDC Digital Universe. Also learn a little more about IDC and EMC, as they are important companies in the field of big data. 4. 003 Hadoop and other Solutions: Welcome to lesson three.
By now, we have gone through the importance of data analytics and its importance to business. Also, we have learned that data has grown in recent times and will continue to grow. In this lesson, we will understand how this big data can be analyzed and processed for use. Grace Murray Hopper, the famous American computer scientist who developed the first compiler and conceptualized the idea of a machine-independent programming language, gave a really nice example for this. She explained: historically, oxen were used to carry the load. When the load increased, we didn't consider growing the ox larger; instead we used several oxen put together to pull the heavy load. The same idea applies to analyzing big data. When this concept is applied to the computing world, it is termed distributed computing, and this is exactly the core concept of Hadoop. Let us see this problem in the computing world. We had a computing resource and data to process. As the data grew, we had the option to grow the computing capacity as well, so we did. The data grew at great speed, and with the solution of upgrading a single computing device, the expenses swelled because of three primary reasons: first, the hardware cost; second, the software license cost; third, the high risk of failure. Further, it had an upper limit to the capacity of data that could be processed, but the data is always increasing. In this case, the distributed computing concept comes to save us: instead of one powerful machine, the task is distributed among a cluster of machines. Its advantages: first, huge savings on hardware cost, as commodity hardware is used. The term commodity hardware is often used to refer to the node specification in a Hadoop cluster; it means commonly available hardware, available from many vendors. Don't confuse it with cheap hardware or low-grade hardware. Second, the licensed software is free. Third, reduced risk of a single point of failure in a cluster.
If a node fails, the performance degrades, but the system does not halt as it would in the case of a single machine. Fourth, studies have shown that in certain situations a Hadoop distributed cluster can process 10 times the data in one-tenth of the time at one-tenth of the price. Interesting, isn't it? On this slide, we compare a traditional database management system with Hadoop MapReduce. I haven't described MapReduce yet, but I want you to think of it as a framework that works in a distributed fashion on a cluster of machines. Don't worry, we look at MapReduce in detail in the next section. Circling back to the comparison: it is in many ways similar to a comparison between a sports car and a train engine. The car is expensive but fast, and carries a small number of people. The train, on the other side, produces a higher throughput by carrying a lot of load. Each has its own benefits and should be applied cleverly in accordance with the need of the situation. Let's look at this table. An RDBMS is a good option for data sizes in the range of gigabytes, while MapReduce starts to shine in its performance for data sizes in the range of petabytes and up. An RDBMS offers both interactive and batch access options on the data, while MapReduce offers only batch data access. The pattern in an RDBMS is read and write many times, while in the Hadoop file system we cannot edit a file; we would rather copy it to the local file system, delete the original in the Hadoop file system, and copy it in again with the modifications done. In an RDBMS, the schema should be present at the time of loading the data itself, while in Hadoop the schema binding is delayed till the time of processing. This is one of the major advantages of Hadoop. Let us understand this with an example. Let us consider that from a market research firm we get data about the activities done on social media in relation to the bank. Let us say column one is the source, like Facebook or
Twitter, column two is the timestamp, and column three is the comments. In the RDBMS version of the solution, we would have to store it in a table for which the schema and other constraints have to be decided beforehand. In Hadoop, we just need to copy it to the Hadoop file system, and at the time of read we can decide on the schema. Say we can combine the first two columns into one and consider column three as column two. Now, if we do a sort on column one, all the data would be sorted by the source, and every row from the same source would be sorted by the timestamp. This gives us great flexibility in programming. Next, in an RDBMS we keep the data normalized, while in Hadoop the data is not normalized; this saves us from complex joins. Next, scaling: as the data increases, the processing time of a relational database system increases exponentially, while in Hadoop it is linear. In this slide, we see an interesting analysis of seek time. Seek time is improving a lot more slowly than the transfer rate. Typically, in the nineties, a disk drive would be 1 GB and the transfer speed around 4.5 MBps, so the time taken to read the whole drive would come out to roughly four minutes. Nowadays, the typical scenario is a 1 TB disk and a transfer speed of 100 MBps, so the time taken to read the whole disk is close to three hours. This is like the oxen-and-load example: the load has increased, and the oxen have grown stronger, but the increase in load is a lot more than the increase in the oxen's strength. This gap can be closed by parallelism. Suppose the same 1 TB is distributed equally over a cluster of 50 nodes: the complete read time would reduce to one-fiftieth, that is, around 3.5 minutes. This is another advantage of Hadoop, as it employs parallelism. One more advantage is that Hadoop maintains replicas of the data, so the failure of one node doesn't affect the integrity of the whole data. We will see in depth how Hadoop maintains the replicas in a separate lesson.
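The full-drive read times quoted in this lesson are easy to verify with a back-of-the-envelope calculation. The sketch below uses the lesson's round numbers (1 GB at 4.5 MBps, 1 TB at 100 MBps, a 50-node cluster), not measured values:

```python
# Back-of-the-envelope check of the full-drive read times quoted in the lesson.

def full_read_minutes(size_mb, speed_mbps):
    """Minutes needed to stream an entire drive at a sustained transfer rate."""
    return size_mb / speed_mbps / 60

nineties = full_read_minutes(1 * 1024, 4.5)         # 1 GB drive at 4.5 MB/s
today    = full_read_minutes(1 * 1024 * 1024, 100)  # 1 TB drive at 100 MB/s
cluster  = today / 50                               # same 1 TB split over 50 nodes

print(f"1990s drive : {nineties:.1f} minutes")      # ~3.8, i.e. roughly four minutes
print(f"modern drive: {today / 60:.1f} hours")      # ~2.9, i.e. close to three hours
print(f"50-node read: {cluster:.1f} minutes")       # ~3.5 minutes
```

Running it confirms the lesson's figures: reading a modern 1 TB disk serially takes close to three hours, while 50 nodes reading their shares in parallel finish in about 3.5 minutes.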
5. 004 Distributed Architecture A Brief Overview: Welcome to lesson four. In the previous lesson, we learned that the single-server architecture was expensive and less efficient when compared with distributed architectures. Here, we will look at a few other distributed architectures, understand their limitations, and see the advantages of Hadoop over them. One of the distributed solutions which has served this field until now is the high-performance computing and grid architecture. In a typical high-performance grid architecture, there are a number of processors communicating through a message passing interface, MPI, and shared memory. This serves very well for compute-intensive jobs in situations where a large amount of data, say hundreds of GB, is not needed. As the data size increases, the network traffic increases, and hence bandwidth becomes a bottleneck. Hadoop's architecture is a little different. Hadoop has nodes which are just like personal computers: there is a hard disk with every CPU, thus every node has its own storage area as well. When assigning a task, the master node considers data locality, and hence the network is used only for small payload messages; the scalability is high in Hadoop. We will deep-dive into all the terminologies like job tracker, task tracker, and data locality in the next segment of the course. Another distributed computing model is volunteer computing. SETI, which stands for Search for Extraterrestrial Intelligence, is a project which aims to analyze the radio waves received from the universe. They try to find a pattern, any trace of intelligent communication in the radio waves, in case there is any extraterrestrial intelligence trying to communicate with us. This project asks people like you and me to download an application, which pops up as a screensaver on our computers, and when we are not doing anything on the computer, this program uses the idle
CPU cycles to analyze a work unit for patterns, applying various advanced algorithms like Fourier transforms, et cetera. When the work is finished, the application on our computer sends the results and asks for the next work unit. The central server has to distribute each work unit to three or more nodes so as to combat failure and dishonest results. This architecture is suitable only for CPU-intensive work which can tolerate a variable throughput time, and it can be applied only in situations where the data can be shared across the network, so it is not a viable option for most business problems. 6. 005 Hadoop Releases: Welcome to lesson five. In the previous lesson, we learned a little more about other distributed solutions and their limitations in solving common business problems. In this lesson, we will explore a slightly tricky topic with Hadoop: its versions. Hadoop follows the standard release nomenclature denoted in the form x.y.z, where x signifies the major release, y signifies the minor release, and z signifies the point release, which may fix some bugs. A change in the major release can break backward compatibility. That means there can be a few features that are discontinued or implemented in a different fashion in the next release; in those cases, the code needs to be changed and recompiled. For example, code written on Hadoop x.0.0 may not be compatible with the next major version. This breaking of compatibility is not certain but can happen; in those cases, we have to refer to the release notes of that release. Hadoop can never break compatibility between the minor and point releases, which means the code written on Hadoop x.0.0 would be compatible with the minor release x.1.0 and the point release x.1.1. Hadoop is best known for its MapReduce and its distributed file system, HDFS, but it has a family of projects around it, which work well together.
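The x.y.z compatibility rule described in this lesson can be sketched in a few lines of Python. This is a toy illustration of the rule only, not part of Hadoop itself, and the version strings below are hypothetical examples:

```python
# Toy illustration of the x.y.z rule from the lesson: code built against one
# release is guaranteed compatible only while the major version (x) is unchanged.

def parse(version):
    """Split an 'x.y.z' release string into integer (major, minor, point)."""
    major, minor, point = (int(part) for part in version.split("."))
    return major, minor, point

def may_break_compatibility(built_against, running_on):
    """Only a change in the major release may break backward compatibility."""
    return parse(built_against)[0] != parse(running_on)[0]

print(may_break_compatibility("1.0.0", "1.1.0"))  # False: minor release only
print(may_break_compatibility("1.1.0", "1.1.1"))  # False: point release only
print(may_break_compatibility("1.2.1", "2.0.0"))  # True: major release changed
```

Minor and point upgrades stay compatible; only a major bump may require changing and recompiling the code, and even then only the release notes can say for sure.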
This is referred to as the Hadoop ecosystem. The other projects are Pig, Hive, HBase, ZooKeeper, Sqoop, Flume, et cetera, which we will look at later in the course. These projects have their own releases, and a particular version is compatible only with a few versions of Hadoop, so it gets highly complicated to deploy an ecosystem whose parts are compatible with each other. Apache BigTop is a project which deals with the development and packaging of a compatible ecosystem. This is where other vendors, like Cloudera and Hortonworks, score over Apache: their releases are easier to understand and are compatible within the ecosystem. Let us look at the recent releases of Hadoop, which are important to know. Hadoop's versioning complexity starts after 0.20: Hadoop 0.20 got extended to 0.21, 0.22, and 0.23. The 0.21 release changed the APIs to increase programming efficiency, and later 0.23 improved the architecture to implement the new MapReduce runtime YARN, HDFS federation, and high availability. On the other hand, Hadoop 0.20 moved towards a stable release, 0.20.205, which added Kerberos authentication. This release is stable, popular, and has been implemented in businesses. This 0.20.205 became Hadoop 1.0. The version 0.23 is not Hadoop 2.0 officially yet as I make this video, but it has been speculated to be 2.0. There is even a possibility that it may be named Hadoop 3.0, and in that case 0.22 would become 2.0. In this slide, we will see the difference in features between the releases. To understand these differences easily, I would suggest you think of 1.x as an early implementation of Hadoop which could not accommodate all the architectural features from the Google paper; 0.23 is the one which is closer to the original Google paper.
Think of version 0.22 as a bridge between the two, which improved a few programming efficiencies by introducing new APIs, and so its features are easier to understand. 1.x uses the old configuration names; 0.22 upgraded to new configuration names, and so did 0.23. The old configuration names are supported but are deprecated in 0.22 and 0.23. Exactly the same is the case with the APIs: 1.x uses the old APIs, while 0.22 uses the new APIs, and so does 0.23. Both 0.22 and 0.23 support the old APIs to maintain backward compatibility. The architectural change to get closer to the Google paper has been made to the MapReduce runtime in the 0.23 release, while 0.20 and 1.x work on the old classic MapReduce. Another architectural upgrade, to HDFS federation and HDFS high availability, has been done in the 0.23 release. Hadoop 1.x had improved and secure Kerberos authentication, which is not in the 0.22 release but has been covered in the 0.23 release. This wraps up the discussion in regard to Apache Hadoop releases. As an exercise, I would suggest you search a little more on the Apache BigTop project. 7. 006 Setup Hadoop: Welcome. This is a video guide to setting up Hadoop. I am going to use this document throughout my video, which you will find in the support material along with this lesson. I have made this document to be a step-by-step guide on how the whole installation can be done. In this video, just following all the steps in the document should ensure it works for you. There is always a possibility that you may get stuck with a new problem when you try to do it, as there are so many different conditions. In that case, I would advise that you search the Internet to seek help, resolve the problem, and carry on from the point where you left off in the document. Although I have considered all the problems which I know of while making the document, there can be many.
Give yourself a little time if you are starting with the installation right now, considering that you are new. If you are lucky and you do not get stuck anywhere, it would take somewhere around 2 to 3 hours to set it up, so spare some time. If you get stuck, the time taken to resolve it depends upon the problem you are stuck with. It is common for a newbie to get stuck for days with the installation, but in a way that is good, as you can learn so much. So don't be disappointed if you get stuck; that is why I made this document, which should help you go ahead and avoid the common mistakes. Here are the versions of the components I am using: I am installing Hadoop 1.2.1 on Ubuntu 12.04, and Ubuntu 12.04 will be running on a virtual machine. All the components used are license-free except for Windows, and I find this the easiest way to set up Hadoop. I have tried Cygwin on Windows, and all the examples you will see in the course are me running Hadoop on Ubuntu. When I installed Cygwin and tried to set it up, I ran into some problems with OpenSSH, and I am still in the process of resolving them. So this is pretty much why I recommend running Ubuntu on top of Windows and Hadoop on it; it is the simplest way to go forward. Running on Linux gives you a feel of the environment and a practical setup, rather than Cygwin on Windows, so I highly recommend this approach. At this point, I would recommend you download Oracle VirtualBox and the Ubuntu ISO image, 64-bit. So step one is to install Oracle VirtualBox. I have already installed it; it is fairly simple. If you get stuck somewhere, search for a solution on the net. I will start with the Ubuntu installation on the VM. Just type in Ubuntu so that it picks it up, and then I type the name. I bumped the RAM up to 2 GB. I will be constantly referring to the screenshots just to make sure that the document is complete and ready.
So let's pick up where we are. Okay, I have selected a dynamic drive and will bump it up to 20 GB. Then you need to go to Settings, click on Storage, and add a CD drive which points to the Ubuntu ISO you have downloaded. Next, let's see which screenshot I have moved up to. Okay, this is where we are. Everything in the settings is good; I'll click OK. I will now start the virtual machine. You will get a few prompts regarding keyboard and mouse capture; just read them for information and click OK, and it will start. Let me just check that everything has been shown in the document. One more thing: if you are installing VirtualBox for the first time, it may happen that it throws some error with keywords like 64-bit support, or VT-x or AMD-V support. If so, it means that the BIOS configuration does not allow a virtual machine to run. In that case, you just need to do these simple steps: restart the computer, go into the BIOS, and do the following steps. Make sure you write these steps on a piece of paper, as you won't be able to access this document while you do so. Next, if it doesn't give you that problem, or you have already resolved the problem, you will get this screen. Go ahead and install Ubuntu; click Yes, then click Continue. Then comes this screen asking to erase the disk and install Ubuntu. Go ahead, click Continue, and all your data will be formatted. Don't worry, it will just erase and reformat the dynamic disk we allocated. Now you come to this screen: select where you live, put in the details; I pick my keyboard layout and put in my password. Okay, now let's wait for this to get finished. In the meantime, I will send this document to myself over email so that I can access it on Ubuntu as well. It will take a little time to install, so I will just speed up the video. All right. Okay, now the installation is complete, and you can click on Restart the computer.
The first thing I do is download the document. You can, of course, download it from the site, but I need to fetch it through my mail. So here's my document; I'll just open it and lock it to the launcher. Now I download the Hadoop installation package from Apache. I've taken my steps from the documentation of Hadoop itself, so if you get stuck somewhere, refer to it. Go to the stable releases and look for hadoop-1.2.1; download the one ending with .tar.gz. The next step is to make a folder 'hadoop' in the home directory. You can make your own, but I would suggest you stick to this one, so that the remainder of the document is really easy for you to use; you would just need to copy and paste most of the stuff. We'll have to wait till this downloads; I have sped up the video. Okay. I just remembered that we need to download the JDK and the JRE as well, so I'll start their downloads too. Accept the license agreement and download the one ending with .tar.gz for Linux 64-bit. Next download the JRE as well; look for the 64-bit one ending with .tar.gz, and make sure that you agree to the license here as well. So all these are getting downloaded, and at this point I have sped up the video. Okay, now the Hadoop installation files have been downloaded. The next step is to copy the .tar.gz file to the new folder that we have created. Now I move to that folder and I see the tar file. Now untar it using the command in the document; just copy and paste it. We look in the folder and see many folders inside it. Now let us get to the next step in the document. You can check if Java is already installed on Ubuntu; mine is a fresh install, so it didn't have it. If you already have some Java 7 installed and you want to get rid of it, you can do so by following the commands I have mentioned in the document.
If you do not have Java and are doing a fresh install like me, just follow the steps in the document. We make the folder where we will install Java; just copy the command and paste it, then put in your password. The next step is to copy the tar files to the newly created folder. We move to the Downloads folder; we are waiting for the JDK and the JRE to finish downloading. Okay, now the JRE is done. I just copy and paste the command to move the JRE to the folder. Okay. Now we'll wait for the JDK to download. Okay, now the JDK is almost done. Then I copy and paste the command to move the JDK. Now we move to the Java folder, copy and paste, then untar the JDK, then untar the JRE. Now we will edit the profile, put the JAVA_HOME and HADOOP_HOME variables there, and add them to the PATH. Next we do the following steps to let Linux know where we have stored Java. You can copy-paste the commands if you have the same Java version and Java folder as I have created; if not, you can edit them in a notepad and then fire them one by one. Coming to the second one, we just refresh the profile to pick up the changes we have made. Now 'java -version' should work, and we should be able to echo the JAVA_HOME variable. And here we see it. Now let's move to the next step. Oh yes, congratulations! You have now installed Hadoop in standalone mode. This mode is a good way to learn; you can do all your programming here in this mode and practice programming on Hadoop. Let us try to run an example. I'll stick to the document and show how it works. We make the directory first, then I run an example and see how it works. We just look at the output; it shows that it has run successfully. Standalone mode is a good mode to practice programming and learn Hadoop, but in this video I will go ahead and show the installation in pseudo-distributed mode as well. For that we need to install SSH first.
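Before we walk through pseudo-distributed mode on screen, here is the whole flow so far, plus the upcoming steps, as one command sketch. All paths, version numbers and ports here are illustrative examples, not the exact values from the attached document; substitute your own JDK and Hadoop locations.

```shell
# --- Environment (lines appended to /etc/profile; example paths) ---
export JAVA_HOME=/usr/local/java/jdk1.7.0_25     # wherever the JDK was untarred
export HADOOP_HOME=$HOME/hadoop/hadoop-1.2.1     # wherever Hadoop was untarred
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

# Reload the profile and verify:
# . /etc/profile
# java -version
# echo $JAVA_HOME

# --- Pseudo-distributed configuration (Hadoop 1.x property names) ---
# conf/core-site.xml    : fs.default.name    -> hdfs://localhost:9000
# conf/hdfs-site.xml    : dfs.replication    -> 1
# conf/mapred-site.xml  : mapred.job.tracker -> localhost:9001
# conf/hadoop-env.sh    : uncomment and set JAVA_HOME

# --- Passwordless SSH (as in the Hadoop docs) ---
# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# ssh localhost          # should no longer prompt for a password

# --- Format, start, verify, stop ---
# bin/hadoop namenode -format
# bin/start-all.sh
# jps                    # NameNode, DataNode, SecondaryNameNode,
#                        # JobTracker and TaskTracker should be listed
# bin/stop-all.sh
```

Only the three export lines run anything; the rest is the sequence as comments, so you can paste the pieces one at a time as the video does them.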
Okay, now we need to edit the configuration files: first core-site.xml, then hdfs-site.xml, then mapred-site.xml. And now we will change hadoop-env.sh, which has all the environment variables, and set up JAVA_HOME there. In this file there are a number of entries that Hadoop uses; we just put in the path of Java at the JAVA_HOME variable, but make sure that the hash symbol is removed, to uncomment the line. As the next step, we need to set up passwordless SSH. The following steps are taken straight from the Hadoop documentation. Then, when you do an 'ssh localhost', it should not prompt you for a password. I do it one more time, and see, it doesn't ask for any password. And so we are getting close to getting things done. Next we format the namenode, then run start-all.sh. This starts the namenode, the datanode, then the secondary namenode, then the JobTracker, and lastly the TaskTracker. Let us run an example to see if it has been successfully set up. And so it has started successfully. Let me launch the UIs, where you can see: this is the namenode's UI, and this is the JobTracker's UI, and you can see the progress here as well. So congratulations if you have reached this point. We can print the output. Now we just do a stop-all.sh. Hope this video was of help. See you next; happy coding. 8. 007 Linux Ubuntu Tips and Tricks: Welcome to a new lesson. In this lesson I'll share a few tips and tricks with you. If you are a beginner or a little new to Linux, this will help you work around Linux with a little more ease and make you work a little more like a professional than an amateur. For the people with experience in Linux, this will be elementary. First of all, I'll start with copy and paste. On many occasions you will be required to copy and paste in the terminal; for that you can use Ctrl+Insert and Shift+Insert.
For example, I open the text editor and type in "This is a test". I select this and copy it using Ctrl+C. Note that outside the terminal, the normal Ctrl+C and Ctrl+V work as usual. Now I'll go to the terminal and paste it using Shift+Insert. I can copy something from the terminal screen as well, using Ctrl+Insert, and paste it using Shift+Insert. The next tip or trick we discuss is using profile or bash.bashrc. If you want to set up a variable globally, you can do it by setting it up in /etc/profile or /etc/bash.bashrc. Profile is the one which runs once per session, on login, while bash.bashrc picks up the fresh changes every time you close and restart the terminal. That is why we set up the variables in profile while setting up Hadoop. '. /etc/profile' is the command to refresh the profile changes and make the newer changes effective. The next tip or trick is tab completion. You can 'sudo gedit /etc/bash.bashrc' and you will find these lines; uncommenting them activates tab completion. So now I do an ls. Now, if I want to go into 'workspace', I just type 'cd w' and then the Tab key, and I do not need to type anything else. The next tip or trick is to clear the screen; I use this very often in my video lessons. Just press Ctrl+L and the screen gets cleared. The next tip or trick is to customize the command prompt. Normally I do not prefer doing it, but if you like, you can shorten the command prompt by typing export PS1 equal to a dollar and a space inside quotation marks, and Enter. And so now the command prompt looks like this. If you want to make these changes permanent across logins, copy this line into /etc/profile. You can even make your command prompt colorful and play around with it; the Internet is loaded with ideas about it. The next tip or trick is that you can continue a command across multiple lines.
For example, if you want to edit the profile and you're typing 'sudo gedit /etc/profile' and you run out of space, you can add a backslash, press Enter, and continue with the command on the next line. This will be a continual lesson, and I will keep on adding tips and tricks to it. Meanwhile, if you come across some tip, please share it with everyone by typing it into the questions window. I'm sure there will be many good tips from you. See you in the next lesson. 9. 008 HDFS commands: Welcome to a new lesson: HDFS commands. In this lesson we will learn about the HDFS commands. First let us understand the terminology: FsShell and URIs. The Hadoop FsShell is nothing but an interface between the user and the Hadoop Distributed File System, that is, HDFS. So if we want to perform any action on HDFS, we have to use the Hadoop FsShell in order to do so. The Hadoop FsShell mainly takes URIs, that is, Uniform Resource Identifiers, as input arguments. URIs are paths of files in the following format: scheme://authority/path. The scheme can be of various types, depending upon the file system it accesses: it can be 'hdfs' for files on HDFS, 'file' for files on the local machine, 'ftp' for a file system backed by an FTP server, 'har', also known as Hadoop Archive, which is a file system layered on top of HDFS, and so on. So, in short, the Hadoop FsShell can access files from various file systems, and the scheme and authority have to be put in accordingly. We will look in depth at Hadoop archives later, but right now I want you to remember that there are Hadoop archive files, which are multiple Hadoop files put together and accessed in a special manner, like a RAR or zip file, except that these do not compress the files. What exactly they do will come later. One would imagine that the storage media of any node which has Hadoop installed has two worlds.
One is HDFS and the other is your local file system. In the HDFS world the scheme used is 'hdfs', and the authority is 'localhost' in our case. Scheme and authority are optional parameters; if they're not given, the defaults are picked up, and these are mentioned in core-site.xml. Let us have a look at what we have set in pseudo-distributed mode. Here we see that fs.default.name has been set to the hdfs scheme with localhost as the authority, so these will be the defaults. And then there is the path, which is the location of the file or directory. So the URI for a 'child' file in a 'parent' directory would look like hdfs://localhost/parent/child. In the local file system, the URI would look like 'file:' followed by three forward slashes and the path. If you are familiar with UNIX commands, the HDFS commands will not be new to you. And in case you are new to UNIX commands, don't worry: there are only a handful, and I have attached a document with this lesson which tells you everything about them, and you will be able to understand them pretty easily. Moreover, I have marked the commands with a star so that you can specifically remember at least those offhand, as they are the most commonly used. I'll just demonstrate a few HDFS commands next, especially the ones which are not present in UNIX or Linux systems. First I do a 'jps'. This command returns all the Java programs running. Here I see all the daemons are running, so I do not start any; if they had not been running, I would have started them with bin/start-all.sh. Also, an interesting thing to notice here is that all the daemons, the datanode, JobTracker and namenode, are Java programs with the main classes as listed here. So the namenode is nothing but a Java program with the main class NameNode. Let's first do an 'ls', that is, list all the files that are present in HDFS.
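One more note before the demo: here is the URI composition from above spelled out as strings (localhost, parent and child are just this lesson's example names):

```shell
scheme="hdfs"
authority="localhost"
path="/parent/child"

# A full HDFS URI follows scheme://authority/path
hdfs_uri="${scheme}://${authority}${path}"
echo "$hdfs_uri"     # hdfs://localhost/parent/child

# A local-filesystem URI is "file:" plus three slashes and the path,
# i.e. an empty authority between the second and third slash
local_uri="file://${path}"
echo "$local_uri"    # file:///parent/child
```

When scheme and authority are omitted in a command, Hadoop falls back to the defaults from fs.default.name, which is why a bare path like /parent/child usually works.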
So what I do is type 'bin/hadoop fs -ls' and then Enter. There are a couple of things to note. 'bin/hadoop fs' will be at the start of every command we write; remember, the Hadoop FsShell is a shell, or an interface, we interact with in order to perform command-line operations on HDFS. Also, an important and interesting thing to note is that when we list the files, it shows output similar to what we see with 'ls -l' in Linux. Before recording this video I had already created a directory and a file, so you see them in the listing. If you observe closely, you'll see that 'd' stands for a directory, while a hyphen signifies a file. The remaining characters are the access controls of the owner, then the group, and then the others. 'r' is for read, 'w' is for write, and 'x' has no significance in HDFS: there is nothing that is executable in HDFS, so it carries no meaning. The second column shows the replication factor; this file has been stored with 1 as the replication factor, as we have set the property dfs.replication to 1 in hdfs-site.xml. The third and fourth columns show the owner and the group, the fifth column shows the number of bytes the file occupies, the sixth and seventh columns show the date and time, and lastly comes the path. Next I'll remove the file with the command 'bin/hadoop fs -rm' and the name of the file, and the file gets deleted. You'll observe that we haven't explicitly written the complete URI, as the defaults of the hdfs scheme and the localhost authority have been taken up. Now let's try to do an ls on the local file system: 'bin/hadoop fs -ls file:///'. In this case it lists the files and directories in the root of the local file system. Now let us look at what is in the home directory; it shows 'nj'. Let's look at what is inside that.
So it lists the documents in nj's home. Then let us create a file in the local file system and copy it to HDFS. I'll go to home; here is the file which I had created. Now I'll create one more file, and I've written in it: "You all are rock stars". Now I'll again do an ls, and here we see the file. Now let's type in 'bin/hadoop fs -copyFromLocal /home/nj/file in', where 'in' is the destination file in HDFS. Observe closely and you will see that we haven't specified the complete URIs; still, this works. The copyFromLocal command assumes that the last argument is an HDFS path and all the previous ones refer to the local file system, and hence this command works. And this is the difference between the copyFromLocal command and the put command, which are similar in all other respects; it's just that copyFromLocal implies that all the arguments except the last one come from the local file system. And yes, you can copy multiple files using this command. Now let us do an ls: we see our file. Let's print the file, and here you see the message we typed in, so the copy has worked perfectly. Now let us try to do the reverse of this: let us copy this file from HDFS to the local file system. So we use 'bin/hadoop fs -copyToLocal in hfile', with 'hfile' as the new file name. Now let us check if we have received the file from Hadoop. So we see hfile, and Hadoop says that you all are rock stars. Please play around a little with the commands in the document; it will be fairly simple now. Just observe closely how and where to mention the URIs, and everything will be simpler. See you in the next lesson. 10. 009 Running a MapRed Program: Welcome to a new lesson. In this lesson you will learn how to compile and run a MapReduce program. We will be working on Ubuntu, which we installed on our VM.
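Recapping the previous lesson's round trip, local file into HDFS and back, as one command sketch (the file names follow the video; a running pseudo-distributed cluster and the defaults from core-site.xml are assumed):

```shell
# Create a small local file
echo "You all are rock stars" > ~/file

# Local -> HDFS: every argument except the last is a LOCAL path
bin/hadoop fs -copyFromLocal ~/file in

# List and print it from HDFS
bin/hadoop fs -ls
bin/hadoop fs -cat in

# HDFS -> local: the reverse direction
bin/hadoop fs -copyToLocal in ~/hfile
cat ~/hfile
```

The same pair of transfers can be done with 'put' and 'get'; copyFromLocal and copyToLocal just restrict one side to the local file system explicitly.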
First we download Eclipse. Do a Google search for "download Eclipse" and click on the first link. Then we click on the link for Linux 64-bit, then the next link, and then we save the file. The download will take some time, so I'll fast-forward the video. Now the Eclipse package has been downloaded. I'll just go to the Downloads folder, copy it, and paste it in the home folder. Now I'll extract the Eclipse archive by right-clicking and choosing Extract Here. Now we see the Eclipse folder in the home directory. Then I'll go inside and click on the Eclipse icon; this launches the Eclipse IDE. Then we get this pop-up window asking for the location of the workspace. Stick to the default and click OK. Then I go to File > New and click on Java Project. I'll name my project 'HadoopExperiments' and click on Finish. Now, I have downloaded the source code into one folder; you can download it from the site. So I select these .java programs: WordCount.java, WordCountMapper.java and WordCountReducer.java, and copy-paste them into the workspace, into the folder which we have created just now. I go to HadoopExperiments and then the 'src' folder. So now, in my Eclipse IDE, I see the source folder. I just refresh it, and now, under the default package, I see all the Java source code which I copied. At this point you will see a lot of errors in these programs, as we haven't included the Hadoop packages in the build path. To clear the errors you just need to right-click on the project, 'HadoopExperiments' in this case, then go to Properties, then Java Build Path, then Libraries, then click on Add External JARs, and then go to the Hadoop folder and select the hadoop-core jar. Click OK, and then you will see that the external jar hadoop-core has been included. Click OK, and all your errors will go away. The next step is to create a jar file. Again we right-click on the project and go to the Export option.
Then, under Java, you will see the JAR file option; select that and click Next. Then browse for the path. I will put the jar file in the bin folder itself; you can, of course, select any path. Then I just type in the name 'wordcount', click OK, and then click Finish. Then let us look at the jar file. I'm right now in the bin folder itself, where I have created the jar file, so I'll just do an ls, and here we see wordcount.jar. Let's do an ls on the Hadoop file system: we see the 'in' file I had created just before this video. I'll print the content of that file, and here you see the content. So, being in the folder where the jar file is, I run it using the command 'hadoop jar wordcount.jar WordCount in out', where 'out' is the output directory and 'in' is the input file. You'll learn about all of this later in the course. The program should run, as you see on the screen. Now let's do an ls on the Hadoop file system and see whether the output directory has been created or not: we see the 'out' directory. And now let us do an ls on 'out' and see the associated files; the file starting with 'part' contains the output. Let us cat that file and print its contents, and here we see the output. This course covers every detail of how this complete process is done: what the objective of the program was, what the output is, how it has been processed, and how to increase its efficiency. It will all be covered in the course. So if you were able to run the program, that's great: you've completed the hard part of this course. Everything after this is going to be simpler, and my heartiest congratulations on finishing the hard part of it. See you in the next class. 11. 010 HDFS Concepts: Welcome to a new lesson, HDFS concepts. In this section we'll look in depth at HDFS. Let's start with the terminology used in HDFS. HDFS is a distributed file system.
That means the files are stored across a cluster of computers, and not on just one. The cluster is nothing but multiple racks put together, and a single rack is nothing but a lot of computers put together, which are individually called nodes. It is these nodes storing data which are known as datanodes; they act as worker, or slave, nodes. The namenode, which is the master node, is responsible for the management of the file storage distributed across the cluster. Let's see a simulation of how a file is stored in HDFS. A file is broken up into smaller chunks, also known as blocks. These blocks are then replicated; in this case they are replicated by a factor of three, which is the default replication factor of HDFS. These blocks are then distributed over the cluster, and this process of replication and distribution is managed by the namenode. The namenode keeps track of the complete file system and the block locations. If you notice, the distribution done by the namenode is smartly done, so as to provide resilience if a failure happens. In this case, suppose one datanode fails: the namenode would still be able to put together the complete file with the help of the replicas. And suppose a complete rack fails: even then, the namenode would be able to put the file together. We will learn later what considerations the namenode takes into account to distribute the file blocks. Let us understand the ideas behind HDFS. HDFS is designed to handle large files of hundreds of GBs and TBs and more. Data access is not quick with random reads and writes; it follows that the data access pattern of write once, read many times suits it best, and so it fits big data analytics. HDFS is designed to use commodity hardware, but that is definitely not cheap hardware: a typical unit would cost somewhere around 1K to 5K dollars and would be available from many vendors. A typical installation of an RDBMS server can take up to a 50K expense on the hardware itself, and it still has an upper limit of processing.
But using commodity hardware also means that hardware failures are not a special case but the norm in HDFS. As the cluster size increases to thousands of nodes, hardware failures may happen every other day, or might even happen every other hour. As we study the HDFS concepts, we will see that it is equally important to learn about the failure scenarios as it is to study the stable processing states. Next, let's look at what HDFS is not designed to do. It is not designed for quick reads of data; it cannot function as an OLTP database. For that we definitely need an RDBMS, at least in the present scenario. HDFS also doesn't work well with a lot of small files. HDFS doesn't support arbitrary file modifications either; only append is supported. Let us understand the most important phenomenon of any file structure, that is, its blocks. The block size is the minimum amount of data that can be read or written in a file system, but block sizing in Hadoop is a little different. First, it is big: while it is common to have a block size of 512 bytes on a storage medium, the default size is 64 MB in HDFS, larger by several orders of magnitude. Second, if a file stored in HDFS is smaller than the HDFS block size, then only the amount of space that is needed is used, and not the complete block. There is a reason for the large block size. We discussed earlier how seek time becomes a bottleneck while processing large files, so the idea is to keep the seek time around one percent of the transfer time. Considering 100 MBps as the transfer rate and 10 milliseconds as the additional seek-time overhead, the block size has to be of the order of 64 MB to keep the seek time around one percent of the transfer time. In the next section we will learn in depth about the HDFS architecture. 12. 011 HDFS Architecture: Welcome to a new lesson. In the previous lesson we studied HDFS blocks; in this lesson we'll deep dive into the HDFS architecture.
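As a quick check of the block-size reasoning from the previous lesson: with a 10 ms seek and a 100 MB/s transfer rate, a block of about 100 MB makes the seek exactly 1% of the transfer time, and the 64 MB default keeps it around 1.5%, the same order of magnitude. The numbers below are the ones stated in the lesson; this is just arithmetic.

```shell
seek_ms=10          # seek-time overhead per block (from the lesson)
rate_mb_per_s=100   # sustained transfer rate

# Block size (in MB) at which the seek costs exactly 1% of transfer time:
# the transfer time must be seek_ms * 100 = 1000 ms, i.e. 1 s at 100 MB/s
target_mb=$(( seek_ms * 100 * rate_mb_per_s / 1000 ))
echo "$target_mb"       # 100

# Overhead of the 10 ms seek for the 64 MB default block:
block_mb=64
transfer_ms=$(( block_mb * 1000 / rate_mb_per_s ))   # 640 ms to stream the block
overhead_pct=$(awk -v s="$seek_ms" -v t="$transfer_ms" \
  'BEGIN { printf "%.2f", s / t * 100 }')
echo "$overhead_pct"    # 1.56
```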
HDFS works on a master-slave architecture: the namenode is the master node, and the datanodes are the worker nodes. That means the namenode is responsible for all the management of the storage space on the datanodes, and the datanodes do the actual groundwork of storing the data blocks. The namenode performs the function of keeping track of the complete file system by managing two things: first, the namespace image, and second, the edit log. The namespace is the metadata about the files and directories which are stored in HDFS. It contains data about all the blocks: which files they are associated with, and on which datanode they reside. The edit log is nothing but the log of activities performed on HDFS by the clients. Edit logs just keep piling up and grow as the activity on HDFS keeps happening, so of the two, the edit log is the one which keeps growing at the faster pace. These two combined form the complete file system image, giving details of all the files and blocks in HDFS. The block information is updated by the namenode as and when datanodes join the network. That means, as soon as a datanode boots up and connects to the network, it sends the namenode the information about the blocks it has, and the namenode updates the namespace image with that data. Both the edit logs and the namespace are maintained in the main memory of the namenode; this helps the namenode quickly look up the blocks as and when required. Now let us take a look at the case when the namenode fails. As you can guess, the complete file system would go down and be unavailable, as the complete namespace image and data block information are lost. For this reason the namenode is also referred to as the single point of failure, or SPOF, of HDFS. That is why it is important for the namenode to be resilient to hardware failures, and it is highly advisable to spend more on the namenode's hardware. Still, even with upgraded hardware, failures can happen, and to counter those situations, additional resilience measures are taken.
As a resilience addition, the namespace image and edit logs are transferred from time to time by the namenode to a highly available remote NFS mount. Additionally, a secondary namenode is also added. Do not confuse it with being another namenode; the name is considered to be one of the naming blunders in Hadoop. The secondary namenode doesn't function like the namenode. Its main and sole purpose is to merge the namespace image and the edit logs, so that the namenode's main memory doesn't fill up because of the ever-increasing edit logs. The secondary namenode creates checkpoints of the namespace image and edit logs merged together and writes them to a file. This helps the namenode release the main memory occupied by the edit logs up to the point of the last checkpoint, and this is the only purpose of having the secondary namenode. The secondary namenode is a Java program which just combines the edit logs and the namespace and creates a checkpoint; that's it. This operation of combining the edit logs and the namespace is itself complex and CPU- and memory-intensive, so the secondary namenode needs to run on a good hardware configuration, as the job of combining the edit logs and the namespace requires good computing resources. At this point I just want to remind you that the namenode and the secondary namenode are nothing but Java programs that run with the main classes NameNode and SecondaryNameNode. So, in case of failure of the namenode, the Hadoop administrator needs to boot up a new namenode. This was the case in the earlier releases; Hadoop has since moved on to the 0.23 release, and that release and CDH4 have high-availability features available in them. In those cases the situation is a little improved; we will look into them later in the course. So in the releases previous to Hadoop 0.23, and in the case of CDH3, on failure of the namenode the administrator would have to bring up another machine as the namenode.
This machine has to be of good configuration, as the namenode's system requirements are high. So in that case, most often in a small cluster, the machine that ran the secondary namenode is reconfigured as the new namenode. Again, please do not conclude that it is the secondary namenode's function to take over as the primary namenode; it is not. It is just that the machine which ran the secondary namenode is most often the best choice for the new namenode in case of failure. So, in case of failure, the latest information from the NFS mount is retrieved manually by the administrator onto the machine which will take over as the new namenode, and the machine is then reconfigured as the namenode. This process can take around 30 minutes to return to the stable state. Next, let's look at the guidelines for the namenode's main memory. As the cluster size increases, the number of storage blocks that the namenode has to take care of also increases, and each block in the storage pool consumes some amount of the namenode's main memory. So it is important for the namenode to have enough main memory so that it can properly manage the pool of data blocks. As a rule of thumb, 1000 MB per million storage blocks is recommended. Let us take an example of a 100-node cluster with 4 TB disks, and let the block size be 64 MB. Then the number of storage blocks comes out to be around two million. That means the namenode should have around 2 GB of main memory. On the next slide are a few key points from the last two lessons. Please pause the video if you would like more time to read. 13. 012 HDFS Read and Write: Welcome to a new lesson. In this lesson we will look behind the scenes at what happens when you read or write into HDFS. Let us first deep dive into the HDFS write process. An HDFS client is a JVM that runs on the node which interacts with HDFS.
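Returning for a moment to the sizing rule of thumb from the previous lesson, the arithmetic works out like this: 100 nodes with 4 TB each, 64 MB blocks, and dividing by the replication factor of 3 gives the distinct blocks the namenode must track (integer shell arithmetic, so the values are approximate).

```shell
nodes=100
tb_per_node=4
block_mb=64
replication=3
mb_per_million_blocks=1000   # rule of thumb from the lesson

total_mb=$(( nodes * tb_per_node * 1024 * 1024 ))  # raw capacity in MB
raw_blocks=$(( total_mb / block_mb ))              # block slots incl. replicas
unique_blocks=$(( raw_blocks / replication ))      # distinct blocks to track

echo "$unique_blocks"    # 2184533, i.e. roughly two million

namenode_mb=$(( unique_blocks * mb_per_million_blocks / 1000000 ))
echo "$namenode_mb"      # 2184 MB, i.e. roughly 2 GB
```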
Note that dfs.replication is the property which holds the replication factor of the blocks. This property can be customized in any setup; in the pseudo-distributed mode of deployment it is overridden and set to one in the configuration file hdfs-site.xml, but its default value is three. So, as a first step, the client communicates to the namenode that it wants to write into HDFS. At this point the namenode performs various checks on the request, like whether the file already exists, or whether the client has the correct permission levels to perform the activity. If all is fine, the namenode returns to the HDFS client the list of nodes where the blocks should be copied. At this point the client connects to the first datanode and asks it to form a pipeline to the subsequent datanodes. The datanodes acknowledge as they successfully copy the blocks. Steps three, four and five are repeated until the whole file gets written to HDFS. After that, the client ends with a completion message. In case of a datanode failure, the erroneous node is skipped, and the blocks are written to the remaining nodes. The namenode observes the under-replication and arranges for the replication of the under-replicated blocks. The same happens when there are multiple node failures: the data needs to be written to at least one node, and the under-replicated blocks are then taken care of by the namenode. Now let us look at how the datanodes are selected by the namenode. If the client node itself is part of the cluster, the namenode considers it to be the first node where the replication should happen. If it is not part of the cluster, a node within the cluster is chosen, keeping in mind that the node is not too busy or loaded. The second node is chosen off the rack on which the first one was chosen, and the third one is chosen to be on the same rack as the second one. This forms the pipeline. Now let us look at the simulation which we saw in the earlier lesson.
The file is broken into blocks, which are then replicated and distributed across the file system. Now, if you observe, even if one of the nodes or even a whole rack fails, all the blocks of the file are still available. Failure of multiple racks is the worst case, and less probable to happen. Also, it is to be noted that the whole process of selection and replication happens behind the curtain; the developer or client doesn't need to worry about what happens in the background. Before we look at how reading happens, let us look at how distance is calculated in HDFS. In a distributed network, bandwidth is a scarce commodity; hence the idea of distance is based on bandwidth. A block to be fetched on the same datanode is said to be at distance zero. If the block resides on a different datanode but on the same rack, the distance is counted as two. If the block resides on a node on a different rack, the distance is considered to be four. And lastly, if a block resides on a node in a different data center, the distance is taken to be six. And these are the only possible cases. Now let us look at the anatomy of a read. The HDFS client sends a request to the namenode; in response, the namenode returns the datanodes containing the first few blocks. The namenode returns each list starting from the closest node containing that block to the furthest. So the client connects to the first node and reads the blocks one by one. Let us again look at the failure cases that can happen during a read. There can be two failures. First, the block that was read is corrupt: in that case, the next datanode containing the block is contacted. Second, the datanode itself fails: let's say datanode D7 fails while block B1 was being read; then the next node in the list would be contacted. In this case, the client makes a note that D7 is a bad datanode and would not consider it later if it appears in another list. Please go through the key points for this lesson. 14.
14. 013 HDFS Concepts II: Welcome to a new lesson on HDFS concepts. In this lesson we look at the new features added in the Hadoop 0.23 release, that is, HDFS Federation and high availability. Let us start with HDFS Federation. This feature was added in order to balance the load on the name node as the cluster size increases. Let us understand this with an example. Let us say there is a directory tree structure with root, and under it are two folders, folder1 and folder2, and let us assume that there are a lot of files under them. As the cluster size increases, the name node has to store more information pertaining to blocks in its main memory. So for a cluster with a high number of nodes, in the range of 2000, the name node's memory becomes a limiting factor for scaling. Under Federation, a new name node can be added, and the file tree structure and the block pool can be divided between the name nodes. Thus each name node has to manage only the pool of blocks it is associated with and not the complete pool, thereby reducing the load on a name node. It is to be observed that the same data node can be associated with different name nodes at the same time, and failure of one name node won't affect the other name node. For example, if name node 2 goes down, the files in folder1 would still be accessible. Let us quickly look at the key points we have discussed. HDFS Federation addresses the limitation that the name node's memory puts on scalability. Each name node is responsible for a namespace volume and a block pool. Data nodes can be associated with many different name nodes. Name nodes don't communicate with each other, and failure of one won't affect the other. Let us look at the next feature, high availability.
This feature addresses the time taken to come back to a stable state in case of name node failure. As we have already seen, the name node is a single point of failure, and it takes around 30 minutes to come back to a stable state after a failure. So to address this, a second name node is always running on standby. The primary name node and the standby name node share the namespace and edit logs via a highly available shared storage mount. In future releases, ZooKeeper will be used to transition from the primary to the standby one. In this setup, the data nodes are configured to send reports to both the name nodes. In this case, if the primary name node fails, the standby can take over very quickly; in practice it takes around a few minutes for this failover transition to happen. In this setup it is important for the standby to wait to confirm that the primary has gone down. There can be a situation where the primary might not have been completely down, but just a little slow to respond. In that case there can be two active name nodes, and this causes corruption and chaos. So to avoid such a scenario, the standby node fences the primary node when it takes over. Fencing means that the standby would kill the name node process, revoke shared storage access and disable the network port of the previously active name node. In certain situations it goes to the extent of cutting off the previously active name node from the power supply itself. This is often called STONITH: shoot the other node in the head. As you can imagine, naming this standby node the secondary name node would have been apt, but a naming mistake has already happened there. This wraps up our discussion of high availability. For a quick revision of the key points on the slide, please pause the video. 15. 014 Special Commands: Hello and welcome to the lesson. Here we discuss some of the special HDFS commands which we haven't discussed so far in the course.
First we look at HAR, also known as Hadoop archives. As we have already discussed, lots of small files is not a good case for HDFS, mainly because it eats up the name node's main memory. Although, it is to be understood that the small files do not actually take up the complete block size on the disk; that is, if a file is 10 MB and the block size is 64 MB, then the file would just occupy 10 MB of the storage space. So the issue with small files is that they occupy the name node's main memory, as the name node has to maintain metadata for each file. The more the number of files, the more the metadata which the name node has to take care of, and so the name node's main memory becomes a limiting factor. Hadoop archive is a tool which helps in such situations. In addition to this, Hadoop archive files can be used as input to MapReduce programs as well. Let's see an example of Hadoop archives and understand how it works. Just before recording this video, I created a small-files folder on my local system in the home folder, and in it I created two documents. I'll just do a jps to check whether everything is running or not. Yes, everything is running, so next I copied this file structure to HDFS using the command copyFromLocal. Now I do an ls to see if the files have been created. So there we see the directory. Now we'll archive this file structure. The command is hadoop archive -archiveName; at this point I'll hit enter, and there we get the usage of this command. The syntax says that the command is archive -archiveName followed by the name of the HAR file, followed by -p, followed by the parent path, followed by the source and then the destination. So I type in hadoop archive -archiveName and name the Hadoop archive file file1.har. Please note that here we need .har as the extension, which indicates Hadoop archive files.
These are handled differently: they are read and written in a different manner, as we'll see, and to differentiate them we use the .har extension. The -p and the parent path would be /user/<username>/, followed by the name of the directory structure that needs to be archived, then followed by the destination path, which would be /user/<username>/ as well. I press enter at this point, and a MapReduce program is invoked. I'll again do an ls on the Hadoop file system and see whether the Hadoop archive file has been created or not. So that's the Hadoop archive file. I do an ls on the Hadoop archive file, and as you can see, there are four files that get created for a Hadoop archive file. First, the _SUCCESS file, which marks the successful completion of the archive command. The part file is the one which has the contents of all the files concatenated together. The remaining two files, _index and _masterindex, contain the indexes used to look up the content. Next I do a recursive ls on our Hadoop archive file; in our ls we put the har:// scheme so as to specify that a Hadoop archive is being read. So it displays the files. The tilde files are the temporary backup files that were made when we copied the small-files directory structure from the local file system; they were created because we had opened the files in a text editor. Next, we understand the limitations of Hadoop archive files. First, to create an archive file you need as much disk space as the original; Hadoop archives currently do not support compression, so it is like a duplicate copy. Second, Hadoop archives are immutable: to add or remove files from a Hadoop archive, you must recreate the archive. Third, if you're reaching the limits of the name node's memory, using HDFS Federation would give you better scope for scalability than using Hadoop archives. Next we look at another command, distcp. This command is used to copy files from one Hadoop file system to another.
The copying process is done in a parallel fashion. The syntax of distcp is as follows: hadoop distcp, followed by the source folder, and after that the destination folder. namenode1 and namenode2 would specify the name nodes of the different HDFS deployments. This command would usually be used when you're using HDFS Federation on your cluster and have two or more name nodes on the same cluster, and you want to copy from one HDFS to another. I'll close this lesson at this point, and see you in the next lesson. 16. 015 MapReduce Introduction: Welcome to a new lesson. From this section we look at the most important and core topic, MapReduce. Let us start by looking at the terminologies that are used in MapReduce. First is the split. A split is nothing but a fixed chunk of data which serves as input to a map. I want you to remember that blocks and splits are two different concepts. Keep in mind that blocks are HDFS's responsibility and belong to the HDFS world, while splits belong to the MapReduce world. A block is the unit of storage for data stored in HDFS, and a split is the data that is input to the map tasks. A map task processes the split and produces an output. In the diagram I have shown the map output to be smaller than the map input size. This is the general and the good case, but I do not want to paint it as a restriction; it can be equal to or even larger than the input size as well, but that is not a good case. It is advantageous if it is as small as possible; we will see why later. The problem is divided into two portions: first is the map part, and second is the reduce. All map jobs run in parallel and produce output; all the results are stored as key-values and merged together into a file, which serves as input to the reduce phase. The reduce job takes this as input and produces the result. The whole job execution is controlled by two kinds of nodes: the job tracker and the task trackers.
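As a quick illustration of the split terminology above: the number of map tasks equals the number of splits. Here is a minimal Python sketch, assuming the split size is set equal to the HDFS block size as this course recommends:

```python
import math

def num_map_tasks(file_size_bytes, split_size_bytes):
    # One map task per split: ceil(file size / split size).
    return math.ceil(file_size_bytes / split_size_bytes)

BLOCK = 64 * 1024 * 1024                        # 64 MB block, split size = block size
print(num_map_tasks(200 * 1024 * 1024, BLOCK))  # a 200 MB file -> 4 splits -> 4 maps
print(num_map_tasks(BLOCK, BLOCK))              # exactly one block -> 1 map task
```

The number of reducers, by contrast, is not derived from the input size at all; it is chosen independently by the programmer, as the lesson explains shortly.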
You can draw a parallel between the name node in HDFS and the job tracker in MapReduce. Extending this parallel to the details, there are task trackers in the MapReduce world like the data nodes in the HDFS world. The task trackers run on each data node, facilitating the running of map and reduce tasks. The job tracker's job is to manage the scheduling of tasks and keep an eye on the task trackers, while the task trackers' duty is to run the map and reduce tasks and send progress reports to the job tracker. Okay, I want you to imagine the job tracker and task trackers as Java processes, that is, daemons that are running on the machines; they are not the hardware. One more parallel between the HDFS and MapReduce worlds we can draw is that, just as the name node's failure is the most serious one in the HDFS world, here in the MapReduce world the job tracker's failure is the most serious one, as all the jobs in progress and the task trackers' statuses would be lost. That is why it is wise to spend more on the hardware that runs the job tracker node. Let's run through this animation again and try to understand the phases in a little more depth. The MapReduce approach has to break the problem into two portions: first is the map phase, and second is the reduce phase. Map tasks work on the data which is located on the node itself. This principle is known as data locality. It is important that map tasks get input which is local; if it's not local, it would need to be fetched over the network, and so latency would be added in network I/O and the performance would degrade. Hence the optimal value of the split size is equal to the block size, as one complete block will be present on one node. Thus every map would have its split located on the local storage itself, and so we keep the split equal to the block. Then the map task would carry out its processing and write the output to the local disk, and not to HDFS with replication. It is to be noted that the map output is written to the local disk as it is an intermediate result and is of no importance after the final result has been calculated.
Hence, it is stored only till the time the reducer has picked it up and processed it successfully. It may happen that the reducer fails; in that case, the job tracker would reuse the map output. So the job tracker cleans up the map output only after the successful completion of the reduce job. It is to be noted that the map output would be written to HDFS only in the case when zero reducers are specified; in that case the map output is the final result, and the final result has to be stored on HDFS so that it is not susceptible to loss because of hardware failure. The next phase is shuffle and sort: all the maps' output is merged, sorted and partitioned. So there are three steps that happen. First is the merge, which is nothing but combining the output of all the map tasks. Second is the sort, which is sorting the map output based on keys. Third is partitioning, which means that the output is divided based on the key values. Then comes the reduce phase. As you can see, the reducer won't get the data locally; it would be fetched from the network. The second thing to be noted is that the number of reducers is not decided based on the input size, unlike in the case of maps, where it is dependent on input size and split size; the number of reducers is decided independently. The reducers' output is written to HDFS with replication for reliability; after a long process, results cannot be afforded to be lost because of hardware failure, so they are stored on HDFS, which is more resilient to hardware failures. If this is getting a little too much, just relax; we'll run our first MapReduce job, and things will be a lot more clear then. 17. 016 Understanding MapReduce Part 1: Hello and welcome to a new lesson on understanding the MapReduce mechanism. In this lesson we understand how MapReduce works and how to break a problem into a MapReduce solution. The input to a map is a split, and a split further contains records. Each record would go through the same map
operation, one by one. The map function has input in the form of keys and values, and output in the form of keys and values as well. At the time of input, Hadoop supplies the key, which is unique to every record; by default it is the byte offset from the start of the file, and it can be the record number or line number as well. The programmer does have some control over the input keys, which we learn about later in the course. The map processes keys and values one after the other to produce zero, one or more output key-value pairs. So the important thing is that the map outputs key and value pairs. Please note that these key and value pairs would be the same as the input in case the default mapper is used, which is also known as the identity mapper; it does nothing but copy the key and value pairs from input to output with no processing in between. Another thing to note is that while the input to the map function generally has unique keys, the output of the map would generally have non-unique keys. We design the map function this way because it is helpful for us later in the reduce phase: we'll sort the data on the basis of keys and would like to make sense of the values with the same keys. So the main idea of the map function is to divide the input into keys and values in such a way that the values, when put together for the same key, start to make sense. We'll understand this later in the simulation, so let us move ahead with the simulation. So the whole input is processed and output is created. The map output is shuffled and sorted on the basis of keys, so now all the values for the same key are put together. Now these values are fed to the reducer. The intermediate key-value output would be created by many different maps. It is critical for the MapReduce framework that one particular reducer gets all the values for a particular key, else we won't be able to make sense of any value.
This whole mechanism of sorting the data and sending the data over the network is all managed by Hadoop itself, and the programmer need not program anything for this. That is the beauty of the MapReduce framework. I want you to notice here that the reduce input is in the form of a key and a list of values associated with the key, and not just key-value pairs. Note that the values for a key are not sorted: they are taken from many map tasks and are put together, and maps may finish at different times, so the data would be gathered randomly. So every run of the job may result in a different sequence of values for a key. The sequence of values is not important here. The reduce function is called for each key; it processes each value one by one, and the reducer may choose to output zero, one or more key-value pairs. Please note that the output of the reducer would be sorted, as it is receiving the input in sorted manner. Now let's see how to decompose a problem into a MapReduce solution. The basic trick is to reverse engineer it: identify how the final output should be; then you should be able to find out how the input to the reducer should be, which in turn would help you find the key. Identifying the key solves half the problem. Then you can find how the input data should be broken into keys and values by the map, and you've just found a solution. So now I give you a challenge to solve. We look at the "hello world" equivalent example in Hadoop: the word count problem. In this, if your job is fed with the line "to be or not to be", your job should turn it into, word by word, the occurrences of each word in the record. So the output here is "to, 2", as the word "to" appears twice in the input, then "be, 2", as "be" appears twice in the input, and so on. This is a common technique which search engines apply to the content of a website to find the relevant keywords for the website: the most frequent words are taken as the relevant keywords of the website.
So the next challenge here is to find out how your map should break the input record into keys and values so that the reducer will be able to produce the output as shown. In the next lecture we'll discuss the solution. 18. 017 Understanding MapReduce Part 2: Hello and welcome. In this lesson we discuss the solution of the challenge problem which we talked about in the previous lesson. If you haven't taken time out to find the solution, I would suggest you give a strong thought to what the solution should be; this will help you understand the concepts and the design of the MapReduce framework in a better way. So the map receives the input record "1, to be or not to be", where 1 signifies the byte offset, which is supplied by Hadoop. What our map algorithm will do is tokenize the input line into words and, for every word, emit "word, 1" as a key-value pair. So the output would be "to, 1", "be, 1", "or, 1", "not, 1", "to, 1", "be, 1". The 1 signifies that the word has been encountered once. This will be sorted on the basis of the key, that is, the word in this case. So now keys and values arrange themselves in alphabetical order. For reducer processing, the key-value pairs would be changed to key and list of values, so now it would look something like this. Now you can see that each word has its counts following it. The shuffle and sort step, which is provided by Hadoop, has put together the keys and their values, and so the values put together have started to make sense. So now the reducer would call the reduce method once per key, and in the method we can iterate over the values of each key and sum them up to produce the result. Please note that you will see the same structure of the reducer every time: it has initialization, iteration over the values of a key,
and the function ends with the output of a key and value. We can design the reducer to emit zero, one or more key-value pairs each time it is called for a key. Let's look again and see through a simulation how this would look in the case of many maps working in parallel. Let us consider the case of two maps running in parallel, having input as "1, to be or" and "1, not to be", with the 1s denoting the byte offsets from the start of the file. Please note that in the real world there would be many maps and the inputs would be huge; what you're seeing here is a simulation, so we're taking very small inputs to understand the concepts. As we have already seen, the maps tokenize the line record into words and emit 1 as the value; the output would be as shown. These would be merged and sorted, and then fed into the reducer to produce the output. The power of parallelism can and should be harnessed at the reduce phase as well. Let's take a look at a case with two reducers. In this case, the reducer input would be partitioned, keeping two things in mind: first, that the values of the same key go to the same reducer; second, that the distribution is almost equal. So now the reducers would produce the result as shown. Please note that a single reducer outputs a single sorted file, while two reducers output two individually sorted files. Another thing I want you to notice is that the word "be" has been processed by two different map functions, yet it is processed by the same reducer to produce the result. This has been possible only because of the shuffle and sort step in between. This is critical to any MapReduce solution. It is important to understand that the keys are processed in a distributed fashion at the map phase, and at the reduce phase the data is brought together so that the processing of all the values for a particular key can be done by the same reducer. And all of this is possible because of the shuffle and sort steps.
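The whole simulation just walked through, two maps, shuffle and sort, and two reducers with partitioning by key, can be sketched in a few lines of Python. The course's real examples are in Java; this toy pipeline only mimics the data flow:

```python
from collections import defaultdict

def map_fn(offset, line):
    # Tokenize the line and emit a (word, 1) pair for every word.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Sum all the counts gathered for one word.
    return (word, sum(counts))

def run_job(splits, num_reducers=2):
    # Map phase: each split is processed independently (in parallel on a cluster).
    map_output = [pair for offset, line in splits for pair in map_fn(offset, line)]
    # Shuffle & sort: partition by key so ALL values of a key reach one reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, one in map_output:
        partitions[hash(word) % num_reducers][word].append(one)
    # Reduce phase: each reducer emits one individually sorted output "file".
    return [sorted(reduce_fn(w, c) for w, c in part.items()) for part in partitions]

outputs = run_job([(1, "to be or"), (1, "not to be")])
print(outputs)  # both "be" counts land in the same partition; each partition is sorted
```

Just as in the lecture, "be" is emitted by both maps but summed by exactly one reducer, because the hash partitioner sends every occurrence of a key to the same partition.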
If you can understand this concept and can break a problem to write the map algorithm and the reduce algorithm, then you can design MapReduce solutions. MapReduce can be written in many languages; in this course you will see mostly Java examples, but the point is to understand the concept, and you will be able to apply it to any language. In Java, we have to write three classes: first, the map class, which has the map-side logic; second, the reduce class, which has the reduce-side programming logic; third, the driver program, with which we control and decide the configuration and how the job would read and write the data. The functions like distribution of the code to multiple machines so that the map gets the data locally, and moving the map output to the correct reduce machine, along with the shuffle and sort step in between, are always taken care of by Hadoop itself, and the programmer need not code anything for this. That is what makes Hadoop special. In the next lesson, let us look at the Java programs and see how it works. 19. 018 Running First MapReduce Program: Welcome to a new lesson. In the previous lesson we discussed the algorithm and logic of the program, and here we discuss the actual code of the program. Let's start with the map class, that is, the word count mapper. It starts with the import statements. This section of import statements imports Hadoop-specific data types for key and value; in Hadoop, the key and value data types can only be of Hadoop-specific types which are tailor-made for Hadoop systems. Why this was needed when we could have used the already present Java types for key and value will be understood later in the course. For now, just understand that LongWritable is something similar to Long in Java, which is used to hold a long number, and Text is something similar to String in Java, which is used to carry a sequence of characters.
Next, every map class would extend my upper class, and we'll read the map function these year are the type parameters which specify the Hudood Data types. This would have input key on input value data types, which Hadoop supplies to map, followed by our put key on value data types. So here the data types for input key is long credible on input. Value is text. Andi today for all bookie is fixed on leader date for output value is incredible. We would declare the two fears which we require in the processing logic we need to write the malfunction. My function has the parameters as input key on value on context. The bigger type off input key on value should master of all mention it a tie perimeters rule of context is to cast all put key and value pair after this is the processing logic off my function. We don't nice string in the words and write it into the context with a ski on one as value as we had discussed earlier in the garden. So the idea is to understand the basic structure off mark class and so you can customize the same for a different logic and everything would be easy. First point is to declare the type parameters which are data types, off input and output key values. Second is to read the map function With the processing logic you require, make sure that the 1st 2 parameters are the input key values, and their data types should match with a bow tie parameters Declaration. Third is to write the logic you need on end it with context. All right, metal to write the output key and value pair. Next, let us look at the release of class. If you got the mean team off math class release, it last would be easier to relate. Toe. Every reduce class needs to exchange reduce surplus falling. It would be the type parameters, which would specify the produce specific data types for input key on value, followed by the reader types off all cookie and value. Then we need to write the reduce function. 
The parameters of the reduce function are the key, followed by an Iterable of values. As we discussed earlier, the input to the reduce function is a key and a list of values, and so here you see that the values parameter is specified as an Iterable. The third parameter of the reduce function is the context, which collects the output key and value pairs. After that comes the logic which we have already discussed. I want you to note that in the processing logic of almost every solution you'll have the exact same loop which iterates over the values; in this case we just add the values into the sum field, and after all the values of the key are processed, we output the key and value pair through the context.write method. So let's summarize the structure of the reducer class, which you can apply to any solution. First, we specify the Hadoop-specific data types for the input key-value and output key-value. Please note that it is important that the data types of the input key and value of the reducer match the output key and value of the map function. Second, we need to override the reduce function; the first two parameters are the input key and values, and the third one is the context. It is important that the data types mentioned above match the data types mentioned in the function. Code-wise, we just need to initialize properly, change the logic in the for loop as per the solution, and use the context.write method at the right points to output the key and value pair. Next, let us look at the driver class. The structure and the flow of the driver class are absolutely simple if you understand the Job class and its functions. You can imagine the Job object as a dashboard with the levers to control the execution of the job, and the idea of the driver class is to set the job parameters so that Hadoop can take it from that point and execute the job as specified by the programmer. And so you will see that that is what we're doing in the whole driver class. We first declare the Job object.
Then we use the setJarByClass method and pass the name of the driver class; this helps identify the job's jar file when it is distributed across the cluster. Then we set the job name, which will be visible on the UIs. We set the mapper class and the reducer class by passing the names of the map class and reduce class we just designed. Finally, we set the output key-value data types by using the methods setOutputKeyClass and setOutputValueClass. Now, this output key and value mean the output key-value data types of the job, which in turn means the output data types of the key-value pair of the reducer, so we need to make sure that these values and the ones we declared in the reducer class are the same. We use the methods FileInputFormat.addInputPath and FileOutputFormat.setOutputPath to set the input and output paths for the job; these are passed as command-line arguments. job.waitForCompletion is the method which actually triggers the submission of the job to Hadoop. And that is all a programmer needs to code. If you notice, there is nothing a programmer needs to do to distribute this over the cluster and manage the network input/output; all of it is managed by Hadoop, and that is what makes Hadoop special. Another wonderful thing is that this code is scalable: if it works on a single machine, it can be scaled to thousands of machines without changing a line in the code. Now let us try to run this program, which is basically exactly the same as we did in the lesson on how to compile and run a program in the section on setting up Hadoop. First of all, we'll create the jar file. For that I go to the hadoop experiments project, right-click on it and go to Export, then click on JAR file, and then click Next. The name is already present, and so I'll click on Finish. I've created the jar file. Now, at this point I'm already in the bin folder where I created the file, so I do an ls. This shows the wordcount
jar file. Let me just do an ls on the Hadoop file system; make sure you have run the start-all.sh script and all your daemons are up and running before you do so. Here we see the in file which we had already created in the lesson on compiling and running a program. Let me just do a cat on that file: it has the contents "to be or not to be". Now let's run our program on this. For that, I type in hadoop jar, then the jar file, the driver class, and the input and output paths. So here it would be hadoop jar wordcount.jar, the driver class, in, and I've given the output as out1, as the out directory is already present. Here we see the run of the program. Now let us again do an ls on the Hadoop file system, and there we see the output directory. Let us do an ls on the output directory, and there we see the part file which contains the output. Let us cat that, and there you see the final result. For now, I would suggest you experiment a little with the input of the program and see how the result changes. 20. 019 Combiner And Tool Runner: Welcome to a new lesson. In this lesson we learn about combiner functions and make a little enhancement to the driver class which we wrote in the previous lesson. Let's again look at the solution which we discussed in the previous lesson, with a simulation of parallel maps running. Let map 1 get the input "1, to be or not to be" and the second map get the input as shown. After being fed to the maps, they produce their respective outputs. When this is fed into a combiner function, it produces the outputs as shown. It is recommended to use a combiner function in your solution if possible. The function of the combiner is to process the map output locally so that there are fewer results to transfer to the reducer. So here, what we can do is add up the occurrences of the words on the map machines locally, and this can reduce the map output.
And so you can see that the combiner function has compressed the first map's result in this example; the second map's data didn't have repeated words, and hence the combiner didn't reduce its output. So we can see with this example that the idea behind the combining step is to reduce the load on a valuable asset in Hadoop processing, that is, network bandwidth. It is recommended to have as small a map output as possible, so that it is easier to transfer the map output. In this case, the combiner is doing nothing but the same thing which we were doing in the reduce phase: it is adding up all the values of the keys; it is just that the combiner performs this locally on the map machine, while the reducer applies it to the global data which is collected from the various maps. The rest of the steps are the same as we have seen earlier: the combiner's output would be sorted, shuffled and partitioned, and fed to the reducer, which processes it and produces the output. Let's look at the key points about combiners. If you write combiner classes, they extend the Reducer class; when thinking of combiners, think of reducers which happen locally on the map machines. So, program-structure-wise, they are exactly the same and extend the Reducer class like the reducers do, and the combiner's logic is written in a reduce method in exactly the same manner as we discussed for the reduce class. The second key point is that combiners can be applied only in cases where the nature of the operation is commutative and associative. That is just a complicated way of saying that the operation done by a combiner should not depend on the order of the values that are iterated over in the combine operation. Let me explain this. First I'll brush up on the associative and commutative laws. The commutative law says a plus b is equal to b plus a; this means we can swap the operands and yet get the same result of the same operation. Next, the associative law:
See, with a place bigger together would be equal to a plus B plus C with people see group together. This means that even if the grouping off Prince has changed, the result is the same. The reason we need these laws to apply is because combining step Candra more than once on maps output. We would learn about this indie days later, but the key point is that the miners camera and multiple times so as to reduce maps open in case off readers the manner in which the values is processed is often random, So the operation performed by combining and have the same values in different order with everyone. This this change in order should not change the wood or desert. And so the combine and function should have the operation which for those associative and communicative law, what s we would get a rainiest results. In our case, the operation is off. Simple addition and hence it is fine. Something making admitted mean one for this rule. Third and the most important point is that implementation of combine er's reduces the transfer off the data between maps and reducers. It is the most important underlying idea off combine. Er, if Combined doesn't perform this, there's no point off its design. Let us look a program which implements combine our function on At the same time, we would learn a new and better be to implement our driver class. First of all, sitting our combined function is as simple as writing a single line. Of course, were the positive minor class to the function job, Dot said, Combining class, we would reuse reduce a class in the program as it is performing the same function so If you want to use your combining class, you just need to write the processing logic in a class on Pass it through job got sick. Combine it. Last function. The combining class hold would extended, You said Class on will be cool it in the same way as reduced class as we have discussed in the previous listen. Now we look at one more change we have done to the quarter in the previous driver. 
Plus, we had returned our logic in the main function. Hear, hear, extended configured class on implemented tool interface In the mean function we have just used to learn object regard Iran function, which has all the logic in exactly the same manner. What this does is that it gives a beauty to set properties at their own time, and we need not write a single eye off cold 200 them. I explained this with an example leader. Firstly, let's try to run this program in the usual way. I would just export the job. Fine. I would do what I did. The first thing I do is to check if all the demons are have been running. I do this by GPS Come on. In this case, all of them are running. If not, you can started with star hyphen. Although a search command let me doing a less on her new fire system, I would just get the in fire right now. So it has only 19 We are not to be. I would suggest you to put more lines in the file and try to experiment a little. When you finish this. Listen, I'm in the being fuller itself where I've exported the jar file. Let me do it. A list on the local fire system so as to check if the finest there. Okay, there it is. Now I run the program with Come on her new job. Well, condo jar then the driver function, which is world conflict. Combine er then in and I'll do out and out one are already present. So I choose outdo directly. They receive the program running. It is doing a list on her. Do find system to see if the open directly has been created. No, it is doing a listen onto territory. So there we see the part. If I let me just get it. So there's the desert now Let's see the magic to run are being stored court. Now I run the same program and set up the job to run with all the producer. 
Just after I've mentioned the driver class, I add -D (a hyphen followed by a capital D), a space, and mapred.reduce.tasks=0, and then give the input and output directories. Notice that we need not explicitly code for handling these parameters: as we have used ToolRunner and a run object in the code, we can specify as many properties as we want with -D, followed by the property name, and the Tool would be able to handle it. Let's first let the job run to completion with reducers set to zero. Let me see if out3 got created. There it is; let's see the contents of out3. So there we see the part file whose name indicates a map output file; your reducer output files would always have an 'r' in the name. Let us cat the part file, and there we see the output. Here we get to see the map output, exactly the same as we had in the discussion simulation. Now I would suggest you add a few lines to the input and play with the properties, such as setting reducers to two, or setting maps to two and reducers to zero and seeing the two maps' outputs, and so on. 21. 020 Recap Map, Reduce and Combiner Part 1: Welcome to a new lesson, a quick recap on map, reduce and combiners. In this lesson we do a quick recap of the things we have learned so far. First is the theory, which we need to keep in mind while designing a solution; in the next lesson we will do a recap on the code side of what we have learned. Now, the first thing that we need to keep in mind when designing a MapReduce solution is to divide the solution into two phases: the map phase and the reduce phase. It is always to be remembered that the map would take its input as a split, which would have multiple records. For each record (line), the map function would be called, and it would break the input record line into keys and values. We should smartly design the map logic so that at the reduce phase, when we look at the values for the same key, we should be able to reach the objective which we wanted to bring about with the execution of the job.
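The -D name=value convenience that ToolRunner gives us can be imitated with a small hand-rolled parser. This is a simplified, hypothetical sketch of the idea only (in real Hadoop the parsing is done for us by ToolRunner via GenericOptionsParser); the class and method names here are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical sketch of what ToolRunner does with "-D property=value"
// arguments: pull the properties out, and leave the rest (the paths) alone.
public class DashDParser {

    static Map<String, String> props = new LinkedHashMap<>();

    // Returns the remaining (non -D) arguments, filling `props` with the -D pairs.
    static List<String> parse(String[] args) {
        props.clear();
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);   // "name=value"
                props.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        List<String> rest = parse(new String[] {"-D", "mapred.reduce.tasks=0", "in", "out3"});
        System.out.println(props + " remaining=" + rest);
    }
}
```

The point is that the driver's own logic never mentions the property: whatever arrives via -D is absorbed into the configuration, and only the leftover arguments (the input and output paths) reach the run method.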
The next thing to be noted is that the input to the reducer is in the form of a key and a list of values, and the result is in the form of pairs of keys and values. Also, we should keep in mind that the map logic may be executed on one machine and the reduce on another machine on the network. This transfer of keys and values from all the map machines to the reduce machines is all taken care of by Hadoop itself; we need not write anything in our program to do it. We just need to smartly design the map logic and the reduce logic, which turn the record into key and value pairs, and all the values of the same key are processed in the reduce phase to produce the result. The whole process of smartly transferring the data is all managed by Hadoop, and that is done through the shuffle, sort and partition steps. We learn the details about these steps later in the course. Then we learned about combiners. The idea of the combiner is simple: its only objective is to reduce the map output, so that there is less map output to be transferred to the reducer. In jobs which produce a large amount of data, this step is critical to the performance efficiency of the job. If there is a lot of map output that needs to be transferred to the reduce machine, it is a good idea to design a combiner function which reduces the map output, so that there is less data to be transferred. The combiner will have its input in the form of a key and a list of values, and its output in the form of key and value pairs. Again, let us go through a simulation of how things look in MapReduce, and there we would also peek into what we're going to learn in the later stages of this course. First of all, maps in most cases get their input split locally. Remember, Hadoop does its best to locate map tasks on the machines where their split is locally present. This would not always be possible, but Hadoop tries its best to do so.
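Why the combiner's operation must be commutative and associative can be checked numerically. The sketch below is plain Java with made-up names, not course code: it shows that summing locally first and then globally gives the same total, while averaging the local averages does not give the global average.

```java
// Plain-Java check (hypothetical, not Hadoop code) of the combiner rule:
// a sum can safely be combined locally first; a mean cannot.
public class CombinerLaws {

    static int sum(int[] xs) {
        int s = 0;
        for (int x : xs) s += x;
        return s;
    }

    static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    public static void main(String[] args) {
        // Two "map machines" hold parts of the values for one key.
        int[] map1 = {1, 1, 1};   // three occurrences seen by map 1
        int[] map2 = {1};         // one occurrence seen by map 2

        // Sum: combining locally then reducing equals summing everything at once.
        int combinedThenReduced = sum(new int[] {sum(map1), sum(map2)});   // 3 + 1
        int reducedDirectly = sum(new int[] {1, 1, 1, 1});
        System.out.println(combinedThenReduced == reducedDirectly);        // true

        // Mean: the average of the local averages is NOT the global average.
        double meanOfMeans = mean(new double[] {
                mean(new double[] {2, 4, 6}), mean(new double[] {10})});
        double globalMean = mean(new double[] {2, 4, 6, 10});
        System.out.println(meanOfMeans + " vs " + globalMean);             // 7.0 vs 5.5
    }
}
```

This is exactly why word-count addition is safe to hand to a combiner, while a mean would have to be reworked (for example, by combining sums and counts separately) before it could be combined locally.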
This split is processed by the map logic to produce its output. Hadoop sorts and groups this map output by key, and the programmer need not code anything for this. Now, in case there is a combiner function defined, this map output would be fed into the combiner function. Remember that the map output can go through the combiner multiple times, and so the nature of the operation done by the combiner on the data should be associative and commutative. Why the map output is fed to the combiner multiple times requires a detailed understanding, which we would go through later in the course. The combiner produces its output, and its idea is to reduce the size of the original map output. This map output has multiple partitions. Partitions are nothing but the portions of data that need to go to the same reducer. These partitions are made by the partition function. We would, as we learn the details about the partition function, also see how we can use it in our own solutions. This partitioning and combining is done on the map machines locally. Like this map task, there are many map tasks that would be running across the network. These partitions are sent to their respective reducers by Hadoop; again, the programmer need not code anything for this. On the reduce machine, Hadoop combines all the partitions and feeds the merged output to the reducer, and the reducer performs its logic to output the result. So I would again reiterate the things which we have covered along the course. First, we need to only design the map logic and the reduce logic, and depending on the case, the combiner logic. The sorting of data and the transfer of data are all taken care of by Hadoop itself, and we need not worry about that. Second, it is not mandatory, but it would be great if we design a combiner function, whose idea is to reduce the map output so that there is less data to be sent across the network. Now, the most important thing to note is that the combiner function behaves just like a reducer, as it has as input keys and lists of values, and, just like the reducer, it outputs key and value pairs. So the combiners and reducers both inherit the Reducer class programmatically. But it is to be very well understood that they are logically very different. There come many situations where we can use the reducer class as the combiner class, but it is not always true, as we can see in this diagram. The whole and sole objective of the combiner is to reduce the amount of map output, and the objective of the reducer is to find the logical meaning behind the key and its values, which will help us reach the ultimate result. So their logical meaning and importance, the stages at which they are executed, and their design objectives are a lot different from each other, and they should never be confused with each other. Next, we learned that there is partitioning, which happens because of the partition function. Partitioning is a step in MapReduce to identify which data goes to which reducer. The same logic to identify the partition is applied across all the mappers individually on the map machines, and these partitions are then sent across to their respective reducers. We're going to learn the details about partitioners in the coming lessons. The partitions are merged into a file and fed to the reducer to produce the ultimate result. I hope by the end of this you are absolutely clear about the role, importance and order of the phases (map, combine, partition and reduce) when executing a job. As well, you would have gotten a little idea of the MapReduce framework's workings, and we're going to look in depth at a few steps, like the partitioner, later in the course. Please do remember the phases and the order in which they come about when a job is executed in the MapReduce framework. This would help you better understand the phases and their importance to the solution. In the next lesson, let's just do a quick recap on the code side of things. 22. 021 Recap Map, Reduce and Combiner Part 2: Welcome to the second part of the recap. In this lesson we review some of what we have learned so far from the coding point of view. To write a job, we just need to design and code three classes: first the map class, second the reduce class, and third the driver class. If you understand the objectives behind these classes, the code would be relatively simple to understand. Let us first start with the objective of the map. Simply, its function is to bring the input record into key-value pairs. The objective of the reducer class is to process each key and its associated values to produce the ultimate result. Last is the driver class. As you know, Hadoop takes care of distributing the map code and the reduce code on the network. The programmer communicates to Hadoop what the input files are, what the output directory is, which class is the map class and which class is the reduce class, and so on, using this driver class. So all this information, which is related to the execution of the job, is communicated by the driver class. This is the fundamental structure, and if you remember this, the code is simple Java code which should not be very difficult to understand. Let's look at the code, starting with the map class. Now, if you have Java exposure, this would be elementary for you; this lesson is specifically designed for someone relatively new to Java. Every map class inherits the Mapper class. The Mapper class is specifically designed for Hadoop: when we inherit this class, we write the map function, which has all the map logic. The fundamental idea is that Hadoop would already know where the user-defined map logic exists, and to call it, it just needs to call the map function. This helps Hadoop to distribute and execute the map tasks in a distributed manner. Then the map logic is designed to handle different data types, the Hadoop data types, which we would study in the next segment.
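The shape of a Mapper subclass (four type parameters, a map method, a context to write into) can be mimicked in self-contained Java. This is a toy imitation for illustrating the structure only: the real class is org.apache.hadoop.mapreduce.Mapper, and the real types would be Hadoop Writables, not String and Integer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy imitation of Hadoop's Mapper structure (NOT the real API): the four type
// parameters are <input key, input value, output key, output value>, and the
// framework calls map() once per input record.
public class MapperShape {

    // Stand-in for Hadoop's Context: collects whatever map() writes.
    static class Context<K, V> {
        List<Object[]> written = new ArrayList<>();
        void write(K key, V value) { written.add(new Object[] {key, value}); }
    }

    // Stand-in for the Mapper base class with its type parameters.
    static abstract class Mapper<KIN, VIN, KOUT, VOUT> {
        abstract void map(KIN key, VIN value, Context<KOUT, VOUT> context);
    }

    // A word-count style mapper: break the record line into (word, 1) pairs.
    static class TokenMapper extends Mapper<Long, String, String, Integer> {
        void map(Long byteOffset, String line, Context<String, Integer> context) {
            for (String word : line.split("\\s+")) {
                context.write(word, 1);   // everything else is boilerplate
            }
        }
    }

    public static void main(String[] args) {
        Context<String, Integer> ctx = new Context<>();
        new TokenMapper().map(0L, "to be or not to be", ctx);
        System.out.println(ctx.written.size() + " pairs emitted");
    }
}
```

Notice how little of the class is actual logic: the type parameters and the body of map() change from problem to problem, while the surrounding structure stays the same, which is exactly the point made in the lesson.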
So the first pair of type parameters specifies the input key and value data types, and the second pair specifies the output key and value data types. These are termed in Java as type parameters, and here we would see only Hadoop-specific data types. As you can see, we have put in a few variables here which would be required for the map logic. We declare these final and static so that they are not created again with every call to the map function. Then the map function has three parameters: the input key, the value, and the context. Here the data types of the key and value should match the above-defined type parameters. Context is a parameter into which we write our output key and value pair. After we write into the context, Hadoop takes care of sorting, partitioning, and sending the pair across to the correct reducer machine. The return value of the map function is always void. As well, it throws IOException and InterruptedException. These are necessary as they are defined in the Mapper class, and so the inheriting class gets these exceptions carried over from the parent class. These exceptions are just there to handle the unexpected scenarios that can occur during the operation, or in case the task is interrupted for some reason. Then comes the logic, which is simply core Java. There's nothing special to mention here, just that there is the logic to break the input record line into key and value pairs and write them to the context object. Also, the data types of the arguments passed to the context.write method should match the type parameters mentioned above. This is it; this is the fundamental map structure. To summarize, you just need to change the type parameters and the types of the arguments, and design the map logic, which would invoke the context.write method, and that's it: the rest of the structure will always be the same. Of course, as we move to advanced programming, we would see a few more functions, but the main theme would remain the same. At the beginning, just think that the map's function is to take the input record and break it into a set of keys and values, and that's it. Then let's look at the reducer code. Quite similarly, the reduce class inherits the Reducer class, and the reason is the same: this gives Hadoop the ease to find and execute the user-defined reduce logic. Like the mapper class, the reducer class also has four type parameters: the first two specify the input key and value data types, and the last two specify the output key and value data types. Then there is the reduce method, which takes as its arguments the key, the list of values, and the context, which is used to write the ultimate result. The data types should match the above-mentioned type parameters. The reduce function also returns void, just like the map function; the idea is to write through the context's write method. Then comes the throws exception line, which is present for the graceful termination of the code in case of an error. Then comes the reduce logic. This for loop will be a common feature across the solutions you create: it iterates through all the values of the key, in almost all the solutions. Then, through the context's write method, you pass the key and value, which is ultimately passed to Hadoop, and Hadoop in turn writes it into the output directory which we specified while running the program. So this is it: in the reducer you would see the same structure. Lastly, we look at the driver class. Remember, the whole and sole objective of the driver class is to tell Hadoop which is the map class, which is the reduce class, what are the input and the output, and the way to execute the job. All this is done by setting up the Job object, and that is all we have seen in the driver class. We extend the Configured class and implement the Tool interface, which helps our driver class handle parameters passed to the program at run time. At this point of time, I would request you to go through all the lessons of this section one more time, if you haven't already done so. MapReduce and combiners are absolutely new topics, they require a new way of thinking, and it takes a little time to build an understanding of them. One more iteration of the material would help you understand it and absorb it. And now I would give you a little quiz. Here we have used the reducer as the combiner in our word count program. But it is always to be remembered that the function of the reducer is different from the function of the combiner. The idea of the combiner is to reduce the map output, whereas the reducer's main objective is to look at all the values associated with a key collectively to produce the output result. So my question is this: here in the reducer class, if I had changed this line and instead written sum += 1, that is, I had incremented sum by one instead of adding the value, would this logic fail in the combiner as well? My second question is: would writing the reducer like this have limited the scope for combiners? If yes, what kind of a combiner would we have used? Please give a thought to this, and contact me in case you're confused. 23. 022 MapReduce Types and Formats: Welcome to a new lesson. In this lesson you would learn the fundamental idea of why Hadoop data types were needed, and why we didn't use the already present Java data types in the MapReduce framework. To understand this, let us first understand what serialization is. When two processes communicate, for example when a map communicates to a reduce, then in that case the data is transferred in terms of objects. Serialization is the process of turning a structured object into a byte stream for transmission over a network, or for writing to persistent storage, which eventually would be read by another process. Deserialization, on the other hand, is the process which the receiving process applies to the byte stream it reads. It is the process of turning the byte stream back into the series of structured objects.
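A Writable-style round trip (an explicit write of the fields to a byte stream, then a read back in the same order) can be sketched with plain Java streams. The class and field names here are made up for illustration; this is the idea behind Writable's write/readFields pair, not Hadoop's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Writable-style serialization sketch (hypothetical names, not Hadoop's API):
// writer and reader agree on the field order, so no metadata travels with the data.
public class RoundTrip {

    // Serialize: turn the structured record (word, count) into a byte stream.
    static byte[] serialize(String word, int count) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(word);    // field 1
            out.writeInt(count);   // field 2
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize: the receiver reads the fields back in the agreed order.
    static Object[] deserialize(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            return new Object[] {in.readUTF(), in.readInt()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Object[] record = deserialize(serialize("hadoop", 3));
        System.out.println(record[0] + " " + record[1]);  // hadoop 3
    }
}
```

Because both sides already know the record layout, nothing but the raw field bytes crosses the wire, which is the assumption Hadoop's serialization framework is built on.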
Interprocess communication happens by RPCs, remote procedure calls, in Hadoop. The features that are needed in serialization for it to be effective with remote procedure calls are: first, compact. The messages that are transmitted over the network should be as small as possible; the smaller the data transfer, the better the efficiency. Second, fast: serialization and deserialization should happen quickly. This is in many ways related to the first point: if the serialized data is smaller, the processes of serialization and deserialization would be faster as well. Third, extensible: the protocol changes over time, and it should be able to meet the new requirements. And lastly, interoperable: it is desired that a process written in one language can communicate with a process written in another language. For example, the map might be written in Java and the reduce in some other language, say Python; in that scenario as well, the serialization framework should be effective. So now we understand that Hadoop focuses on remote procedure calls, and serialization is an important underlying concept for its efficiency. But why did Hadoop create new data types, and why could it not use the Java serialization framework itself? The answer to the question is that Java's built-in serialization had a few shortcomings. First and most importantly, it wasn't compact: it had overheads when the data was serialized. Java serialization would send metadata, like the class definition, along with the data sent. This considerably increased the serialization size, and it increased the processing time as well. It was basically designed as a general-purpose interprocess communication mechanism. The Hadoop serialization framework assumes that the client already knows about the data that is to be expected from the sender. This removes a lot of overhead, and thus the Writable serialization framework was designed. Let us take a look at the framework. Here, Writable is an interface.
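The compactness point can be seen by encoding the same long value twice: once raw with DataOutputStream (the style Hadoop's Writables use) and once with Java's built-in ObjectOutputStream, which attaches class metadata. The exact sizes depend on the JVM, so the demo below only claims that the raw form is 8 bytes and the Java-serialized form is larger.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Illustration of why Hadoop avoided Java's built-in serialization for its data
// types: a raw 8-byte encoding versus ObjectOutputStream's metadata-laden one.
public class CompactDemo {

    static int rawSize(long value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeLong(value);  // the bytes a LongWritable carries
            return buf.size();                           // always 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static int javaSerializedSize(long value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(Long.valueOf(value));    // ships class metadata too
            }
            return buf.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("raw: " + rawSize(42L) + " bytes, java serialization: "
                + javaSerializedSize(42L) + " bytes");
    }
}
```

Multiply that per-record overhead by the billions of key-value pairs a job shuffles, and the motivation for a compact, schema-assumed format becomes obvious.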
WritableComparable is an interface which extends Writable. And then we have the data types which we use as keys and values in the MapReduce framework. Next we see the table, which shows all the Hadoop data types and their corresponding Java data types, so that we can map them in our heads and understand and relate to them better. I have put them in the notes for this lecture, so you can have a look at them in detail. Further, a custom Writable implementation can be done by implementing the WritableComparable interface. In that case, the following functions should be overridden, majorly because they are inherited from the interface and are being used in the sort and shuffle stages. I've put an example of a custom Writable along with this lesson; please go through it and have a look at it after the lesson. But as you notice, the Writable framework only supports Java data types and is language dependent. So Avro, a language-neutral serialization system, was conceptualized. It is a project by Doug Cutting to build a serialization framework that supports many languages. Another advantage with Avro is that it future-proofs the data, allowing it to outlive the language used to read and write it. Again, the important principle is the same: Avro assumes that the schema is present both at the time of reading and at the time of writing. Avro schemas are written in JSON. This is an example of how a schema is declared in Avro. It contains the fields, with the name and the type of the fields. The schema needs to be declared in the reading and the writing programs. Avro is an advanced topic, so we'll stop here itself. I would recommend the exercise given after the lesson to build more knowledge in this field. 24. 023 Experiments with Defaults: Hello and welcome to a new lesson. In this lesson we would experiment with the default settings, and thus explore and learn more about the MapReduce framework. Let's just revisit the map, combine and reduce functions and see them in a notation form. Map takes keys and values as input, and outputs a list of keys and values. The combiner takes as input a key and the list of values corresponding to the key, and produces a list of keys and values, and exactly the same is the case with the reduce function. Hence, without implementation changes, the combiner extends the Reducer class. Another thing to be noted here is that for a single input key-value pair, the map or combiner or reducer function can emit multiple key-value pairs. And now let us be introduced to a new function: partition. It takes a key-value pair as input and produces an integer as its result. This integer is used to decide to which reducer the key-value pair would go. We'll see later in the lesson the default partition mechanism, so that we can override it in case we require. For now, let us perform an experiment. Let us try to run our driver program with minimal job configuration and see what it does. So here is our driver class. As you can see, there is no job configuration put in this class: we haven't specified the map class, the reduce class, nor the combiner class, and neither have we specified the input data types nor the output data types. We have just set the input path and the output path. Let us try to run this and see. Let me first print the input. So the input file has two lines of input. Now let us run the program, and let us see the output. And so this is how the output looks. You can see here that each output line has a numeric integer, which signifies the byte offset from the start of the file, followed by the line itself. So 21 specifies that the line in question starts from the 21st byte position in the file. This is from the default run of MapReduce. Let us understand how the default mapper and reducer look. The Mapper class, as we already know, has key-value input and key-value pair output. This is where we put the data types. This is the map function, which we otherwise override. As we already know, it has three parameters: key, value and context. In the processing, it just plainly emits the key and value pairs which it received. The default input key data type is LongWritable; it is so because it can handle large numbers. The default input value type is Text, and the default output key and value data types are the same as the input. After the map has produced its output, the partitioner is responsible for dividing the result and distributing the values to the reducers. By default, there is no combiner class. The default partitioner is hash partitioning, and this is how it looks. The getPartition function takes the key, the value, and the number of reducers as input. It simply produces the hash code of the key, performs an AND operation with the integer max value, and takes it modulo the number of reducers to get to which reducer it should go. Suppose the reducers are three: then the result for all the keys would come out to be 0, 1 or 2. Depending on the result, the key-value pair would be sent to a particular reducer. It is to be noted that only the key is considered in deciding which reducer the key-value pair should go to, and that is how it should be, as we want all the values of a particular key to go to a single reducer. But this can be changed as per the requirements of the situation. It can happen that you would need certain key-values to be processed by certain reducers only. Let's take an example of this. Let us assume that we have a single file with people's first names, second names, and the colors they like. Suppose we're looking to search for a crazy pattern: whether there is a correlation between the names and the preference of color. So a record looks like Mary, which is the first name, Fisher, which is the second name, and the color preferences that follow. The file would be filled with these kinds of records. In that case, we have decided to set the key as the second name in the map. This helps us to group the records by the second name, and so the map output has the second name as the key and the whole record as the value, and this would be fed to the partitioner.
Now suppose we want that people with the same first name should go to the same reducer. In that case, we would add the first name as well into the hash partitioner's calculation of the hash code. As you can see, "Fisher James" has been sent to a different reducer even though its key is the same as that of "Fisher Mary". This is because of the custom partitioner we defined. Okay, after a little detour, let us now again jump back to our main discussion, which was to understand the defaults. Let us look at the default reducer. Again, like the Mapper class, the default Reducer would specify the data types for the input key-value pair and the data types for the output key-value pair. This is the reduce method, which we generally override. It has three input parameters: first the key, second the iterable of values, and third the context. In the processing portion, it just outputs the values it has received. The default data types are LongWritable and Text. Let us look at the default driver, with everything explicitly specified. As discussed, the default mapper class is Mapper, and the key is LongWritable and the value Text. The default partitioner is the hash partitioner, the default number of reducers is one, the default reducer is the Reducer class, and the output key is LongWritable and the value is Text. setOutputFormat and setInputFormat are not what we have discussed so far, and they are what we discuss in the next lesson. 25. 024 IO Format Classes: Welcome to a new lesson. In this lesson we would explore and understand input and output formats. In the last lesson we looked at the setInputFormat class and setOutputFormat class functions. Let us understand the main idea behind these functions. Map gets its input in the form of keys and values; the data types of the keys and values are defined in the Mapper class definition. So Hadoop has to supply the key-value pairs as it reads the data from the file. The programmer controls this reading mechanism and the key-value parsing by using the setInputFormat class function.
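The byte-offset keys we saw in the defaults experiment, where the default TextInputFormat gives each line's starting byte offset as its key, can be reproduced by scanning a string. This is a simplified sketch with made-up names, assuming '\n' line endings and one byte per character.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified sketch of the (byte offset -> line) keys TextInputFormat supplies.
// Assumes '\n' line endings and one byte per character, for illustration only.
public class OffsetKeys {

    static Map<Long, String> keyByOffset(String fileContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            if (!line.isEmpty()) {
                records.put(offset, line);   // key = where this line starts in the file
            }
            offset += line.length() + 1;     // +1 for the line terminator
        }
        return records;
    }

    public static void main(String[] args) {
        Map<Long, String> records = keyByOffset("to be or not to be\nthat is the question\n");
        System.out.println(records);
    }
}
```

Here the first line is 18 characters plus its terminator, so the second line's key comes out as 19: the same kind of numbers we saw printed by the default identity mapper and reducer.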
Similarly, when the reducer emits the key-value pairs, it is the setOutputFormat class function which gives the programmer control over how they are to be written to the output file. Let us see the various input formats and understand the basic mechanisms, so we can put them to use when required. The first is the CombineFileInputFormat class. This is used in cases where many small files need to be combined as input. The problem with many files as input is that the advantage of data locality is lost. Using the CombineFileInputFormat class preserves the locality advantage: it has an internal, built-in mechanism of considering data locality, so it still performs well with many files, although the case of many small input files is not a good one for MapReduce from a performance perspective and should always be avoided. CombineFileInputFormat is an abstract class and would need to be customized as per the scenario. Next we look at TextInputFormat. TextInputFormat supplies the map with the key as a LongWritable, which is the byte offset from the start of the file, and the value as the text of the line, which excludes any line terminator. This is the default input format. Next is KeyValueTextInputFormat. It is used in the case where the key is already present in the input file, and the key and value are separated by a delimiter. The delimiter by default is a tab character, but it can be customized through the property mapreduce.input.keyvaluelinerecordreader.key.value.separator. Next is NLineInputFormat. NLineInputFormat divides the input into splits with a fixed number of lines. So if N is five, then every map would be given five lines as input. Before we continue and look at the sequence file input formats, let us look at an interesting case that can occur. It can happen that the input splits cross over the HDFS block boundaries. For example:
For example, suppose the split ends at record 50, and the value of that record crosses over the boundary of the block, with the next block present on another machine. In those cases, the portion of the record which is not locally present would be fetched over the network. This loss in data locality causes less than one percent of overhead. Now let us return to our main topic and understand the input formats. Next we look at SequenceFileInputFormat. Let us first understand what sequence files are. A sequence file is a special flat file which contains binary-encoded key-value pairs. It would look as shown, with both keys and values binary encoded. So these are special files and cannot be processed directly as text. These files have a sync-point mechanism and are compressible. Sequence files are mostly used in the scenario where the output of one MapReduce job is fed as input to another MapReduce job. They are good with sorting as well, so the intermediate map results are written in sequence file format, which provides ease of sorting before the results are fed to the reducer. There are the following three format classes to process sequence files. First is SequenceFileInputFormat; in this, the mapper's key and value types should match the types stored in the file. Second is SequenceFileAsTextInputFormat; it converts the keys and values into Text objects, and so both keys and values will be treated as text. Third is SequenceFileAsBinaryInputFormat, which passes the whole record as a binary object and retains the binary encoding; the mappers should be written to process these accordingly. Now let us look at the output formats. The output formats decide how the data would be written to the output files. First is TextOutputFormat. This is the default output format. In this, the keys and values are converted to strings. The keys and the values are separated by a delimiter, which can be controlled using the property mapreduce.output.textoutputformat.separator. By default it is a tab character, and this is why we see keys and values separated by tabs in the outputs of the various runs we have done. This can be set to any value by using Configuration.set. SequenceFileOutputFormat, as we have already discussed, writes binary-encoded special files. These are helpful if the output of one job is to be fed to another MapReduce job. Map files are special sequence files with index lookups: first is the map file, which contains the data, and second is the index file, which is used for the lookup. Finally, MultipleOutputs is used in scenarios where multiple output files are needed; it provides greater control over the output file names. We will take a look at it and understand it with an example. In case you need detailed documentation for these classes, you can find it on hadoop.apache.org/docs, which would generally be the first site when you search on Google.

26. 025 Experiments with File Output Advanced Concept: Welcome to a new lesson. In this lesson we would add a little more twist to our word count problem. We have so far seen only one output file per reducer. In this variation, let us create multiple files per reducer. At this point, we have a reducer which outputs the words and their occurrences. Now let us have the reducer segregate the output alphabetically, that is, output all the words starting with A in one file, those starting with B in another file, and so on. In this case we would use the MultipleOutputs class. In the reducer class, we would declare a private object of type MultipleOutputs. Then we would override setup and initialize it with the context object. Then we would use the write function of this object with the parameters as key, value, and the base output path, which gives the file name.
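The reducer lifecycle just described (create the object in setup, route each record in reduce, close in cleanup) can be sketched in plain Python, with a dict standing in for Hadoop's MultipleOutputs object; the file names follow Hadoop's name-r-00000 pattern, described next.

```python
class AlphabetReducer:
    """Sketch of the lesson's reducer, with a plain dict standing in
    for Hadoop's MultipleOutputs object. Not Hadoop API code."""

    def setup(self):
        # In Hadoop: create the MultipleOutputs object from the context.
        self.outputs = {}

    def reduce(self, key, counts):
        # Route each (word, total) to a file named after the first
        # letter, following Hadoop's name-r-00000 pattern.
        name = f"{key[0].lower()}-r-00000"
        self.outputs.setdefault(name, []).append((key, sum(counts)))

    def cleanup(self):
        # In Hadoop: close the MultipleOutputs object here.
        return self.outputs

r = AlphabetReducer()
r.setup()
r.reduce("apple", [1, 1])
r.reduce("ant", [1])
print(r.cleanup())  # {'a-r-00000': [('apple', 2), ('ant', 1)]}
```

In real Hadoop code the routing happens through MultipleOutputs.write(key, value, baseOutputPath) rather than a dict, but the control flow is the same.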
The name of the file is in the form name-r-00000, where "name" is the part we can control, "r" represents reducer, and 00000 represents the reducer number, or the partition number. So in our case, the name would be an alphabet. Then we would just override the cleanup method as well. Let us look at the mapper class. I have made just two changes here. First, there is a line to convert everything to lower case, so that a capital "To" is not treated differently from the lower-case "to". Second, I am checking that every word that is passed on starts with an alphabetic character. There can be a lot of checks that can be made to clean the data, and there is a lot of scope for that, but that is not the focus of the lesson, and so I haven't put in all those changes. Then comes the driver. The driver is the same as we have seen so far: we mention the map output key class and the value class, and the mapper class, that is the word count mapper. Then we have declared the reducer class, that is the multiple-outputs reducer class, and the output key class and output value class. We use ToolRunner as in the previous examples. Let us try to run this and see. Let me just output the input file first. So this is a long file. Let us run this. OK, it's done. Now let's see the output. As I told, we see a lot of files in the format alphabet-r-00000. Let us try to print one file. I print the one starting with "n", and so we can see all the words printed with their number of occurrences. As you can see, I haven't got a perfect mapper: "numbers" and "numbers," are treated as separate words. There is a little cleaning of data required in the map function. This example is just to explain the concept, and so I didn't put in a lot of additional code, but surely there is room for improvement, and we can clean the data on the map side. Hope you learned new things. See you in the next lesson.

27. 026 Anatomy of MapReduce job run: Welcome to a new lesson. In this lesson,
we learn how Hadoop carries out the process of job execution, and what happens from the time we have submitted the job to the time the job gets completed. What we have seen so far is that when we submit the job, a detailed description comes out related to the job execution, and then the job completes. The job gets submitted by the waitForCompletion function, which is the last statement of every program, be it the last statement in the run method which you write in the driver class when we are using ToolRunner, or the last line in the main function if we are coding our driver logic in the main function. The waitForCompletion method causes the job to be submitted for processing. The job execution depends upon a couple of properties. In the 0.20.x releases, the property name is mapred.job.tracker. This is preset in the configuration file mapred-site.xml. Its default value is local. If it is in pseudo-distributed or fully distributed mode, it would hold a colon-separated host-and-port pair. In case of local mode, the JobTracker and TaskTracker would all run in a single JVM. Pseudo-distributed mode would completely emulate the fully distributed mode by running the JobTracker and TaskTrackers in separate JVMs on a single node. In case of the 0.23 or later releases, that is 1.x or 2.x, there is a new MapReduce framework implementation. The new implementation is called MapReduce 2 and is built on a system called YARN. YARN stands for Yet Another Resource Negotiator. We will be looking deeper into it later in the course, but the important thing to be noted is that in case of the new releases, the property mapreduce.framework.name decides the framework of execution. It can be set to local, which is as good as running in local mode; it can be set to classic, which is what we study next; or it can be set to yarn, which we would study later in the course.
What we see next is the anatomy of a job that runs in fully distributed mode. So let us see the job run in the classic MapReduce framework. We see that the client node has the JobClient running. The JobClient is the part of the MapReduce setup which is responsible for interaction with Hadoop. It is important that the JobClient runs on a machine that can access and interact with Hadoop; otherwise the machine won't be able to interact at all. It is the JobClient, which is a Java program, that carries out the whole process of interaction with Hadoop. It interacts with the JobTracker, which is again a Java program named JobTracker, and the JobTracker in turn communicates with multiple TaskTrackers, which again are Java programs named TaskTracker. The JobTracker runs on a different node, and TaskTrackers run on many nodes in parallel. Here we would consider only one TaskTracker for ease of understanding. So as the fourth step, the JobClient submits a job to the JobTracker by placing it in the JobTracker's queue. There are many setups and checks done in this phase, like whether the output directory is already present or not, or whether the input file exists or not. After these verifications, the JobTracker picks up the next job from its queue and assigns it to TaskTrackers. A single TaskTracker node has multiple slots for running map tasks and reduce tasks. It constantly interacts with the JobTracker about its free slots, and accordingly the JobTracker queues up tasks for the TaskTracker. On assignment, the TaskTracker takes up the task and sends regular reports to the JobTracker, which in turn combines the reports generated from all the TaskTrackers and updates the client. In the next lesson, we dive deeper into the steps which we have discussed here.

28. 027 Job Run Classic MapReduce: Welcome to a new lesson. In this lesson, we detail how job execution is carried out in classic MapReduce. So we return to the diagram which we saw in the last lesson. As soon as the last line, that is job.waitForCompletion,
executes, it triggers the JobClient to start the job submission process. As the first step, the JobClient contacts the JobTracker and asks for a new job ID. It connects to the JobTracker using the entries from the mapred-site.xml configuration file. After the new job ID is assigned, the JobClient performs a few checks on HDFS. It first checks whether the output directory exists or not; if the output directory already exists, the job stops there itself. This is an error-proofing technique applied in Hadoop, so as to avoid any loss of effort by overwriting the results. After that, it computes the input splits. In fact, it checks whether the input file exists or not: it throws an error in case it doesn't find any input file, saying it cannot compute the splits. If it finds the input file, it proceeds and copies the jar and the needed files to HDFS with a very high replication factor (the default is ten). Then, after all the distribution of the jar and input files has been taken care of, the JobClient submits the job. All this process is taken care of by an object of the job submitter class. After the JobClient has done the setup, it puts the job onto the job queue of the JobTracker. The job scheduler will pick it up from the queue and initialize it. Initialization involves creating an object to represent the job being run; the object encapsulates its tasks and bookkeeping information to keep track of status and progress. After that, the scheduler retrieves the input splits from HDFS and creates one map task per split. The number of reducers is decided by the property mapred.reduce.tasks, which can be set by the job.setNumReduceTasks function in the driver program. It has a default value of one, but it is advised that it is customized to a higher value depending upon the size of the cluster, to draw the advantage of parallelism in the reduce phase as well. The JobTracker also creates setup and cleanup tasks that need to be run before and after the map and reduce tasks.
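One detail above, one map task per input split, can be written down directly. Here is a tiny Python sketch (my own names; the 128 MB split size is an assumption, and older Hadoop releases defaulted to 64 MB):

```python
import math

def num_map_tasks(file_size, split_size=128 * 1024 * 1024):
    """One map task per input split: the number of splits is the
    file size divided by the split size, rounded up, with a minimum
    of one task for a non-empty file."""
    return max(1, math.ceil(file_size / split_size))

print(num_map_tasks(300 * 1024 * 1024))  # 3
print(num_map_tasks(1024))               # 1
```

The real split computation also considers block boundaries and per-file splits, so treat this as the first-order approximation only.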
The setup and cleanup tasks run on the TaskTracker nodes. After this phase comes the task assignment phase. At this point, the JobTracker should know which TaskTrackers have free slots and which ones are busy. The TaskTrackers simply run a loop that periodically sends a heartbeat. This helps the JobTracker to understand whether a TaskTracker is active or not. As a part of the heartbeat, the TaskTracker sends information regarding the status of the tasks running on it. This helps the JobTracker to assess the load and, on that basis, assign a new job to the TaskTracker. A single TaskTracker can run more than one MapReduce task at a time, since a single TaskTracker machine can have multiple slots to run the tasks. The number of slots depends upon the computing capacity of the machine; the main deciding factors are the RAM and the cores of the CPU. Now the JobTracker knows which TaskTrackers to assign, and it assigns them the tasks. After this comes the task execution phase. The TaskTrackers, as a part of setup, retrieve the jar which was put on HDFS by the JobClient. This is where we see that the code moves to the data for processing, which is very different from the traditional architecture. After that, the TaskTracker launches a new JVM to run each task it is assigned. Remember, it can run many at a time. The TaskTrackers send regular updates about the percentage of completion of the tasks through heartbeats, and then the JobTracker combines the progress of all the TaskTrackers to update the client. The calculation of progress is simple for a map task, but a little tricky in the reduce phase; we would look into it on the next slide. Then, after the last reduce task has finished, the TaskTracker cleans up the intermediate data that was created while running the tasks. In the end, the job is finished, and the waitForCompletion function, which is the last line of the program that started all this chain, receives its return value.
On this slide, we look at how the progress is calculated, that is, the percentage shown on the user console. For map tasks, the percentage is simple to calculate, as the input size is known, and the data that has been processed is known through the internal counters which Hadoop maintains. So at any given point, the total amount of data and the amount of data that has been processed are known, and hence the percentage of work done is easy to calculate. Reduce is a little tricky, as three things contribute to the total amount of work: sort, shuffle and reduce. So for the calculation, the contribution of each of sort, shuffle and reduce is considered to be one third. That means, in case the reduce phase hasn't even started, the completion status would be one third contributed by sort, plus another one third contributed by shuffle, which comes to two thirds, that is 67%. If the reducer has processed half its input, the completion would be one third contributed by sort, plus another one third contributed by shuffle, plus one sixth contributed by reduce. It is one sixth because half of one third is one sixth. When these are all summed up, it gives five sixths, that is 83%. Let us just get a quick recap of the lesson. We have seen how the job is carried out in the classic MapReduce framework. The waitForCompletion function causes the job to be submitted. As a part of the job submission phase, the JobClient gets a new job ID from the JobTracker. Next, it copies all the relevant files and code to HDFS with high replication. Next, it submits the job by placing it on the JobTracker's queue. Then comes the job initialization phase, where the JobTracker creates an object for the job, which encapsulates the tasks being run and has bookkeeping methods. It fetches the splits from HDFS and creates one mapper per split. Then comes the task assignment phase, where the JobTracker looks for the free slots on the TaskTrackers; the TaskTrackers communicate this information through heartbeats.
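The reduce-progress arithmetic from earlier in this lesson can be checked with a small Python sketch, under the lecture's assumption that sort, shuffle and reduce each weigh one third:

```python
def reduce_progress(sort_done, shuffle_done, reduce_fraction):
    """Reduce-task progress under the 1/3 + 1/3 + 1/3 weighting:
    sort and shuffle each count fully once complete, and the reduce
    phase contributes its processed fraction of the final third."""
    progress = 0.0
    if sort_done:
        progress += 1 / 3
    if shuffle_done:
        progress += 1 / 3
    progress += reduce_fraction / 3
    return progress

print(round(reduce_progress(True, True, 0.0), 2))  # 0.67
print(round(reduce_progress(True, True, 0.5), 2))  # 0.83
```

The two printed values match the 67% and 83% figures worked out in the lesson.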
Then, after that, comes the task execution phase, where the TaskTracker copies the code from HDFS to the local machine and launches the tasks it has been assigned. It sends regular updates through heartbeats to the JobTracker, which combines all the results and updates the client's console. At the end of the last reduce task, the intermediate data would be cleaned up by the TaskTracker. In the job completion phase, the JobClient senses that the job is done through the waitForCompletion function, which completes the job.

29. 028 Failure Scenarios Classic Map Reduce: Welcome to a new lesson. In this lesson, we look at the failure scenarios that can occur and how they are handled in classic MapReduce. In classic MapReduce, there can be three types of failure scenarios: first, failure of a map or reduce task; second, failure of a TaskTracker; and third, failure of the JobTracker. We look at all the cases one by one. Let's start with the failure of a task due to the user code. It can be a scenario where the user code runs into an infinite loop. In those cases, the TaskTracker would observe that there hasn't been any progress on the task for a period of time, and then it would mark the task as failed. The observation time is set by the property mapred.task.timeout. It can be set to zero as well; in that case, the TaskTracker would never fail a long-running task. This is not suggested, as the slots will not free up quickly in case the task is stuck, and this would bring down the overall performance of the cluster. Another reason for failure of user tasks can be runtime errors. In that case, the error is reported back to the TaskTracker, and the TaskTracker would put it in the user logs. In rare scenarios, there can be the case that the JVM has been exposed to a bug while the MapReduce code runs. In that case, the child JVM can crash. In those cases, the TaskTracker notices that the child JVM has exited, and it marks the task attempt as failed.
Failed task attempts are notified to the JobTracker, and the JobTracker reschedules the execution of the failed task on a different TaskTracker. This is done so as to rule out that the reason for failure is in the underlying hardware. The number of retries that would be made for a map task is governed by the property mapred.map.max.attempts, and similarly for the reduce tasks it is governed by mapred.reduce.max.attempts; both default to four. The next failure scenario can be the failure of a TaskTracker. In that case, the JobTracker stops receiving the heartbeats from the TaskTracker, and from this the JobTracker concludes that the TaskTracker is dead. In this case, it reschedules the tasks on another TaskTracker. The JobTracker reschedules not only the tasks which didn't complete, but also the tasks which did complete while their job was incomplete. Even the completed tasks are rerun, as their results would have been written to a local disk and would have been lost because of the crash of the TaskTracker. As soon as the JobTracker realizes that the heartbeats from a TaskTracker have stopped, the JobTracker removes the TaskTracker from its pool of available TaskTrackers. But that is not the only criterion on which a TaskTracker can be removed from the available pool. If the number of tasks failed on a TaskTracker crosses a threshold, it gets blacklisted and removed from the available pool of TaskTrackers. The threshold is set by the property mapred.max.tracker.blacklists. A blacklisted TaskTracker joins back on a restart, or after a certain period of time. The final failure case can be the case of JobTracker failure. It is the most serious failure scenario in classic MapReduce, and nothing much can be done in that case. The JobTracker is a single point of failure in classic MapReduce, and so it is recommended to run it on better hardware, so as to avoid this scenario as much as possible. We need to resubmit all the jobs in progress once the JobTracker is brought up again.
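The retry bookkeeping described above can be sketched in a few lines of Python; the limit of four mirrors the default of mapred.map.max.attempts, and the function name is my own:

```python
def task_outcome(failed_attempts, max_attempts=4):
    """Return 'retry' while the number of failed attempts is below
    the limit (each retry goes to a different TaskTracker), and
    'job failed' once the limit is reached. max_attempts mirrors the
    default of mapred.map.max.attempts / mapred.reduce.max.attempts."""
    return "retry" if failed_attempts < max_attempts else "job failed"

print(task_outcome(3))  # retry
print(task_outcome(4))  # job failed
```

This also shows why the default matters: with max_attempts set to four, a flaky task gets three more chances before it takes the whole job down.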
In YARN, this situation is a little improved.

30. 029 Job Run YARN: Welcome to a new lesson. In this lesson, we learn why there was a need for a new MapReduce framework, and how the job is carried out in YARN. YARN is an abbreviation for Yet Another Resource Negotiator. It is also known as MapReduce 2, or next generation MapReduce. While using MapReduce 1, it was observed that scalability got saturated when the cluster size increased to 4000-plus nodes, mainly because of the load on the JobTracker. In 2010, Yahoo started the project to create the next generation MapReduce, with more features: to increase performance through smarter memory utilization, and to enhance scalability and flexibility so it could accommodate and run many versions of distributed frameworks in parallel on the same cluster. Of all the changes, the main idea was to split the JobTracker's responsibility into portions. So the JobTracker got split into two: first, the Resource Manager, which dealt with the job scheduling portion of the workload; and second, the Application Master, which dealt with the task monitoring portion of the workload. Programs written in MapReduce 1 or the older APIs still run well on YARN: with the introduction of YARN, only the framework, that is the way of execution of the MapReduce program, changed, and so YARN supports both the programs written with the older APIs and the new APIs. On this slide, we look at the advantages YARN brings over classic MapReduce. First and foremost is that scalability increases dramatically with the splitting of the responsibilities of the JobTracker into two. Second, more than one framework can co-exist on the same cluster: along with MapReduce, there can be another distributed framework running alongside it on the same cluster. Third is a better utilization of memory with the introduction of the containers concept.
Containers conceptually are similar to the slots in classic MapReduce, just that in classic MapReduce the slots are fixed in nature, while containers are more flexible. In classic MapReduce, a single TaskTracker would have a fixed number of slots, specific for map tasks and reduce tasks. However, the containers in YARN can run map tasks, reduce tasks or any other task, and are flexible. This results in better memory utilization. Next we look at the entities in YARN. First is the client, which is the same as we saw in classic MapReduce; it is responsible for submitting the job and interacting with MapReduce and HDFS. Second is the Resource Manager, which is responsible for allocating the computing resources that are required by the jobs. Even within the Resource Manager, the responsibilities can be classified into two: one is the Scheduler, which only deals with the scheduling of jobs and doesn't perform any monitoring or tracking of application status; and the other portion is the Applications Manager, which monitors the application statuses. Third is the Node Manager. This is present on all the slave nodes and is responsible for launching and managing containers. Fourth is the Application Master. Please note that earlier I mentioned two portions of the Resource Manager, the Scheduler and the Applications Manager; the Application Master is a completely different entity. The Application Master is responsible for carrying out the execution of the job it is associated with. It is the one which coordinates the tasks running, monitors the progress, aggregates it and sends reports to its client. It is spawned on a Node Manager on the instruction of the Resource Manager; it is spawned one per job and terminates after completion. You can think of it like an officer the Resource Manager hires to execute the job and fires after it has done its duties. The fifth entity is the YarnChild. This manages the run of the map and reduce tasks and is responsible for sending progress updates to the Application Master.
The last entity is the distributed file system, which contains all the necessary input files and where the output files are written to. So let us see the steps of how a job runs in the YARN framework. The first few steps are exactly the same as we have discussed in classic MapReduce. The job gets submitted through the JobClient, and the JobClient requests a new application ID. After that, it checks if the output directory is already created; if it finds the output directory, it would throw an error and stop there itself. It then verifies the input directory as well. Then, after that, it copies the resources to HDFS with a very high replication, and then it finally submits the application to the Resource Manager. Then comes the job initialization phase. So, as we have discussed earlier, the Resource Manager has two parts: first is the Scheduler, which will just do the scheduling and allocate the resources, and the other one is the Applications Manager, which monitors the statuses and progress of the jobs. As soon as the Scheduler picks up a job, it contacts a Node Manager to start a new container and launch a new Application Master for the job. The Application Master creates an object for bookkeeping purposes and task management purposes. It retrieves the splits from HDFS and creates one task per split. Next, the Application Master decides how to run the MapReduce tasks. If the job is small, the Application Master decides to run it on the same JVM as itself, since the overhead of allocating new containers and running the tasks on them would cost a lot more than running the whole job on one node. These kinds of jobs, which the Application Master decides to run on a single JVM, are known as uber tasks. Then comes the task assignment phase. If the task is not uber, the Application Master requests the Resource Manager to allocate the resources needed. The Scheduler at this time knows where the splits are located: it gathers this information from the heartbeats of the Node Managers, and thus uses this information to consider data locality while allocating the resources.
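Stepping back to the uber decision above, here is a hedged Python sketch of the "small job" test. The real knobs live in the mapreduce.job.ubertask.* properties; the threshold values below are assumptions for illustration, not authoritative defaults:

```python
def is_uber_job(num_maps, num_reduces, input_bytes,
                max_maps=9, max_reduces=1, max_bytes=128 * 1024 * 1024):
    """Sketch of the Application Master's uber-task decision: run the
    whole job inside the AM's own JVM only when every dimension of
    the job is small. Threshold values are assumed, standing in for
    the mapreduce.job.ubertask.* configuration properties."""
    return (num_maps <= max_maps
            and num_reduces <= max_reduces
            and input_bytes <= max_bytes)

print(is_uber_job(3, 1, 10 * 1024 * 1024))   # True: small enough to uber
print(is_uber_job(50, 1, 10 * 1024 * 1024))  # False: too many maps
```

The design point is that all conditions must hold at once: one large dimension is enough to make container allocation worthwhile.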
It tries as far as possible to allocate a node so that data locality is present, but if that cannot be the case, it considers rack-local nodes. If it fails to even find such a node which is rack-local, it allocates any node randomly from the available nodes. Next is the task execution phase. The Application Master contacts the Node Manager, which launches a container. Then the YarnChild is launched. YarnChild is nothing but a Java program named YarnChild, with a main class as YarnChild. YarnChild runs in a separate JVM, to isolate the long-running system daemons from the user code. This step was taken in classic MapReduce as well, to separate the TaskTracker from the user code, but one difference is that in classic MapReduce, reuse of the TaskTracker's JVM was possible, while in YARN, reuse of the same JVM as YarnChild is not supported. As a next step, YarnChild retrieves all the job resources from HDFS and localizes them, and then runs the map or reduce tasks. For the next phase, let us clean up the diagram and drop all the arrows. So the next phase is the progress and update phase. Here, YarnChild sends the Application Master the progress reports every three seconds, and the Application Master aggregates the progress and updates the client directly. In the job completion phase, the Application Master and the task containers clean up the intermediate data, and the Application Master terminates itself on job completion. Let us just have a quick recap of the steps. First, the program triggers the JobClient, and the JobClient contacts the Resource Manager for the new job ID. Then the JobClient copies the job resources to HDFS with high replication and then submits the job. These steps are the same as we saw in classic MapReduce. Then the Resource Manager picks up the job from the job queue, contacts a Node Manager to spawn a new container, and launches an Application Master for the job. The Application Master creates a new object.
It retrieves the input splits from HDFS and then creates one task per input split. The Application Master then decides if the job is uber or not. If it is an uber job, it runs it on its own JVM on a single node. If it is not an uber job, the Application Master contacts the Resource Manager to allocate computing resources. The Resource Manager considers data locality while assigning the resources. The Application Master then communicates with the Node Managers, which launch the YarnChild. The YarnChild retrieves the code and other resources from HDFS and in turn runs the task. The YarnChild sends its progress to the Application Master, which creates the report and sends the report to the client. On job completion, the YarnChild and the Application Master terminate themselves and release the computing resources for the next job. This covers the execution of a job in YARN. In the next lesson, we would look at the failure scenarios.

31. 030 Failure Scenario YARN: Welcome to a new lesson. In this lesson, we look at the failure scenarios in the YARN framework. There can be the following failure scenarios in the YARN framework: task failure, Application Master failure, Node Manager failure, and lastly Resource Manager failure. The task failure scenarios are handled in basically the same manner as the task failures in classic MapReduce. There can be code-related issues like an infinite loop; in that case, the Application Master stops getting the progress updates, and the Application Master would wait for some time decided by the property mapreduce.task.timeout. It is the same as we saw in classic MapReduce. After this time period, the Application Master would mark the task as failed. Then there can be the cases of runtime errors and JVM failures; as we have seen in classic MapReduce, the action taken in YARN is also the same. In case a map or reduce task fails, the failed task is rescheduled on another machine, and the number of retry attempts made on the task would be decided by the properties mapreduce.map.maxattempts and mapreduce.reduce.maxattempts.
These properties have a default value of four. One thing to notice is that in classic MapReduce, the property names end with max.attempts, while in YARN they end with maxattempts, without a dot in between. After that many failed attempts, with the default property settings, the complete job would be marked as failed. In some jobs which process huge amounts of data with hundreds of tasks, failure of some tasks is acceptable, and so the failure of one or two tasks must not mark the complete job as a failure. For those cases, mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent would be the properties which set the acceptable failure percentage of the map and reduce tasks respectively, before declaring a job to be failed. Then comes the failure scenario of the Application Master. If the Application Master fails, the tasks that have run under it need not be resubmitted: there can be recovery, but by default recovery is not switched on. The property yarn.app.mapreduce.am.job.recovery.enable would need to be set for this feature to be turned on. The steps taken by Hadoop in case of failure of the Application Master are on similar lines to the steps taken in case of task failures. When the Application Master fails, the Resource Manager stops getting the heartbeats from the Application Master, and so the Resource Manager recognizes that the Application Master has failed, just as in the case of task failure. The Resource Manager starts the Application Master on a new container. If the recovery option is set, the statuses of the tasks are recovered, and the execution of the job is continued. The number of retry attempts on the Application Master is determined by the property yarn.resourcemanager.am.max-retries. Next we look at the case where the Node Manager fails. If the Node Manager fails, it stops sending the heartbeats.
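Returning to the failures.maxpercent rule above, the job-level test can be expressed as a small Python sketch; the max_percent parameter plays the role of mapreduce.map.failures.maxpercent, and the function name is my own:

```python
def job_failed(failed_tasks, total_tasks, max_percent=0):
    """Mark the whole job as failed only when the percentage of
    permanently failed tasks exceeds the allowed threshold. With the
    threshold at 0 (the assumed default), any permanently failed
    task fails the job."""
    return 100.0 * failed_tasks / total_tasks > max_percent

print(job_failed(1, 200))                 # True: one failure, no tolerance
print(job_failed(1, 200, max_percent=5))  # False: 0.5% is within 5%
```

This is exactly the scenario described above: a job with hundreds of tasks can tolerate one or two failures once the percentage threshold is raised.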
The Resource Manager waits for the Node Manager's heartbeat for a while, considering the case that it might just have slowed down. If the Resource Manager doesn't receive a heartbeat for a certain period of time, it assumes that the Node Manager has crashed. If an Application Master was running under the failed Node Manager, the steps described under Application Master failures are followed. All the running tasks are respawned on other Node Managers. If the tasks under a specific Node Manager fail often, and the failures cross a threshold, the node is taken off from the available pool and is blacklisted. Blacklisting is a process to retire the poorly performing nodes. This procedure is the same as we have discussed for TaskTrackers in classic MapReduce. Resource Manager failure is the final failure case, and the most serious failure that can occur: without it, neither the jobs nor the task containers can be launched. But in YARN there is a real improvement: there is a mechanism put in place to recover from the crash. A checkpointing mechanism is in place, which is an improvement over the classic MapReduce JobTracker, which had none. After the crash, a new Resource Manager instance is brought up by the administrator, and it recovers from the last saved state, so rerunning all the jobs is not required. With that, I would just like to mention that YARN is an attempt to get closer to Google's original MapReduce paper. Google have released papers but never shared the code, so they are much more advanced than the rest of the world in distributed computing technology like Hadoop.

32. 031 Job Scheduling in MapReduce: Welcome to a new lesson. In this lesson, we look at how the jobs are scheduled in the MapReduce framework. The general scenario would be that there will be multiple users issuing jobs on the Hadoop distributed network.
The scheduling scheme would be employed at the JobTracker in case of MapReduce 1, and at the Resource Manager in case of MapReduce 2. The following schemes can be set up in the MapReduce framework: first-in-first-out, also known as the FIFO scheduler; the Fair Scheduler; and lastly the Capacity Scheduler. MapReduce 1 comes with the choice of all three, with first-in-first-out as the default, and MapReduce 2 comes with just the Fair Scheduler and the Capacity Scheduler, with the Capacity Scheduler as the default. Let us understand what these schedulers are and how they improve on each other in concept. Let's start with the first-in-first-out scheduling scheme. Very early versions of Hadoop employed this scheduling scheme in its simplest form. This diagram shows the nodes, the slots on the nodes, and the job queue and the jobs inside it on the JobTracker. The slots can further be divided into map slots and reduce slots, as there are supposed to be slots of both types in classic MapReduce; I have not shown that detailing in this diagram, as it is not related to the main idea that needs to be shown. So the job that was submitted first would take up all the resources and would be executed first. In this kind of scheduling, if a large job is submitted just before a small but high-priority job, the user of the small job would have to wait for an unreasonably long time. This situation was a little improved by deploying a priority scheme along with it. So now the jobs could be prioritized as very high, high, normal, low and very low. The situation improved a little, as the smaller high-priority jobs moved up the order, but still, in this scheme preemption was impossible, and so the smaller job had to wait for a long time if a longer process had already been taken up and was in progress. This scheme gives clients an unequal share of the cluster and random turnaround times. Next we look at the Capacity Scheduler. This is the default scheduler which comes with the MapReduce 2, or YARN, setup.
The capacity scheduler takes a slightly different stance on scheduling. In this case, queues are divided on the basis of users, or groups of users, which are termed organizations. This scheduler is designed with the idea that the same cluster can be rented out to multiple organizations, and the resources divided to serve the specific capacity requirements of each organization. Such an organization need not maintain its own cluster; it can rent a portion of a shared cluster, which would be up for its services. So, in the case of the capacity scheduler, there are multiple queues, each specific to an organization, and each queue is given a portion of the resources of the cluster. These allocations are generally soft and elastic, but they can be configured to be hard limits in many different ways, on the basis of requirements. Let us see this with a simulation. Suppose a job enters organization A's queue. It would be picked up, as there is no other job running, and it would take up as many resources as are available; this effectively utilizes the cluster. When a job in organization B's queue appears, the resources of the first job are pared back to free slots for the new job. There are many features available in this scheduler, like capacity guarantees, elasticity, security, and so on, which can be customized by the administrator for the situation. Next, we look at the fair scheduler. Conceptually, this is very similar to the capacity scheduler, with minor differences. Like the capacity scheduler's queues, the jobs are divided into groups, here termed pools. Jobs are picked up from a pool and given their portions of the resources. Suppose another job comes to the pool: the capacity scheduler would process it first in, first out, or first in, first out with priority, and in that case a small, high-priority job would have to wait for a long time. The fair scheduler improves this situation a little.
The fair scheduler ensures that the jobs which have waited in the queue are picked up and processed in parallel, so as to give a better user experience. This scheduler is indeed a work in progress, and development on it is still going on as I make this video. This completes our discussion in regard to schedulers. See you in the next lesson. 33. 032 Shuffle and Sort: Welcome to a new lesson. In the previous lesson, we learned about job scheduling. In this lesson, we will look at the shuffle and sort steps, which are core to every MapReduce job. Every MapReduce job goes through the shuffle and sort phase: the map program processes the input keys and values, then the map output is sorted and transferred to the reducers, and this transfer is known as the shuffle. We'll see through a simulation run how things happen. The map processes the input, and its output is not directly written to the disk; it is rather written to an in-memory buffer first. The size of this buffer is decided by the property io.sort.mb; its default size is 100 MB. As the map writes, the in-memory buffer fills up, and when the buffer reaches a threshold (the threshold limit is by default 80%), a background thread starts writing the buffer contents to the local disk. The map's output continues to be written to the buffer while this spill takes place. If the map produces output quickly, it may fill up the buffer, and in that case the map is paused for a while until the spill empties the buffer. After the spill is complete, the map may again reach the threshold, and in that case another spill is written. Spills are written in round-robin fashion to the directories specified in the property mapred.local.dir, so there can be many spills before the last key-value pair has been written by the map task. Each spill is partitioned and sorted by key, and it is run through a combiner if a combiner function has been defined for the job. All of this is done by a background thread.
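The buffer-and-spill mechanics can be sketched as a small simulation. This is a deliberately simplified toy model (the real spill runs concurrently on a background thread and the buffer does not empty atomically); the function and parameter names are my own, mirroring io.sort.mb and the 80% threshold.

```java
import java.util.Arrays;

// Toy model of the map-side spill buffer: count how many spill files a given
// stream of map output produces for a buffer of `bufferMb` megabytes and a
// spill threshold fraction (Hadoop defaults: 100 MB and 0.80).
public class SpillBufferSim {
    static int countSpills(int bufferMb, double threshold, int[] recordSizesMb) {
        double used = 0;
        int spills = 0;
        for (int size : recordSizesMb) {
            used += size;
            if (used >= bufferMb * threshold) { // background thread starts spilling
                spills++;
                used = 0;                       // simplification: spill empties buffer
            }
        }
        if (used > 0) spills++;                 // final flush when the map finishes
        return spills;
    }

    public static void main(String[] args) {
        int[] out = new int[50];                // 50 records of 5 MB each = 250 MB
        Arrays.fill(out, 5);
        // Three threshold-triggered spills (at 80 MB each) plus one final flush.
        System.out.println(countSpills(100, 0.80, out)); // 4
    }
}
```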
Once the map has finished processing all the records, all the spills are then merged into one output file, which is partitioned and sorted. If more than three spills are merged together, the combiner function is again run over the final output. Remember that the combiner function can run many times without changing the final result. The combiner function reduces the size of the output, which is advantageous, as there will be less data that needs to be transferred to the reducer machine. If the map's output is going to be really large, it is recommended to compress the map's output to reduce the amount of data. This can be done by setting the property mapred.compress.map.output to true, and the compression scheme can be specified by the property mapred.map.output.compression.codec. After this comes the copy phase. There will be many map tasks running, and they finish at different times. As soon as they finish, they notify the jobtracker or the application master, which asks the reducers to copy the results to their local disks, and so the partitions are copied by the reducers over the network. After this comes the sort phase. In this phase, the reducer merges the map outputs, which are then fed into the reduce function to create the final result. What goes on in the sort phase is a little more involved, so let us look at it in detail. A property which plays an important role here is the merge factor, set by the property io.sort.factor. Its default value is 10, and it signifies how many files can be merged in one go. Let's understand this with a simulation run. Suppose the reducer receives 35 files from different maps. These would be merged in batches of 10: in three rounds it would create the intermediate merged files, and in the final round the result would be fed directly to the reducer.
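The naive batching arithmetic just described can be sketched in a few lines of plain Java. This is a toy model of the round counting only, with names of my own; as the lesson goes on to note, the real merger chooses its batch sizes a little more cleverly.

```java
// Toy arithmetic for reduce-side merging with io.sort.factor (default 10):
// repeatedly merge one batch of `factor` files into a single intermediate
// file, until the remaining files can be fed into the reduce function in one
// final pass.
public class MergeRoundsSim {
    // Returns the number of intermediate merge rounds before the final feed.
    static int intermediateRounds(int files, int factor) {
        int rounds = 0;
        while (files > factor) {          // can't feed the reducer in one go yet
            files = files - factor + 1;   // one batch of `factor` -> 1 merged file
            rounds++;
        }
        return rounds;                    // the last <= factor files go straight to reduce
    }

    public static void main(String[] args) {
        // 35 map outputs, factor 10: 35 -> 26 -> 17 -> 8, i.e. three intermediate
        // rounds, then the final files are merged directly into the reducer.
        System.out.println(intermediateRounds(35, 10)); // 3
    }
}
```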
Just note that the merged files need to be sorted by their keys as well. To increase disk I/O efficiency, the actual algorithm works slightly differently: it picks the first five files and merges them into one, then picks up the subsequent batches of ten. In the final round, it takes the remaining six files, merges them, and feeds them directly into the reducer. Doing it like this increases disk I/O efficiency. This wraps up our discussion in regard to shuffle and sort. See you in the next lesson. 34. 033 Performance Tuning Features: Welcome to a new lesson. In the previous lesson, we learned the details of the shuffle and sort execution. In this lesson, we will learn about a few performance-tuning features in Hadoop. First, we look at speculative execution. One advantage that Hadoop has is that it breaks a job into smaller tasks and processes them in parallel. This parallel processing gives Hadoop an advantage over conventional single-node processing, and so Hadoop can produce higher throughput. But if one of the tasks performs poorly, the performance of the whole job goes down. In this simulation, you can see that task three has fallen behind the rest. This can be because of hardware degradation, or a faulty configuration as well. In those cases, Hadoop relaunches the task on another machine. The one that finishes first is taken for the result, and the other one is killed. There are two important points that need to be kept in mind when thinking of speculative execution. First, speculative tasks are only launched after all the tasks of the job have been launched. The jobtracker then monitors whether there are tasks falling behind, and only those would be speculatively executed. Second, it is an optimization feature and not a reliability feature. What this implies is that if the task is running slowly because of a bug in the code, speculative execution won't be able to fix it, diagnose it, or pinpoint the error in the code.
It just ensures that the underlying hardware and software configurations are not the reason for the slow progress of the task, and so it will try to run the task on a different node so that the whole job finishes as quickly as possible. Either the original task or the speculative task can finish first; as soon as one finishes, the other one is killed. The properties in regard to speculative execution are mapred.map.tasks.speculative.execution, which is the property for map tasks, and mapred.reduce.tasks.speculative.execution, which is for reduce tasks. These are boolean properties which are by default set to true, which implies that speculative execution is enabled by default. It is recommended to leave these properties set to true, but they can be set to false as well, disabling speculative execution. This would be done only in the case where the cluster is already overloaded and we need not load the resources further with speculative tasks. Some installations turn off speculative execution on the reduce side. This is done because, to start another copy of a reducer, the map outputs would need to be fetched again over the network, which would considerably increase the load on the network. Next, we look at another feature, which is called JVM reuse. This feature can be used for performance gains in case there are a lot of small jobs; for smaller jobs, the overhead of launching a new JVM is significant. Tasks run in a separate JVM so as to segregate them from the long-running system daemons. The reason behind this is that the user code has a high probability of being erroneous, and in that case it might disrupt the system daemons, hence the JVMs are separated.
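For reference, the speculative-execution switches mentioned above live in mapred-site.xml. A hedged fragment is shown below, using the MRv1-era property names as they appear in this course; the `false` value for the reduce side reflects the network-load consideration just discussed, not a universal recommendation.

```xml
<configuration>
  <!-- Speculative attempts for map tasks (default: true) -->
  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>
  </property>
  <!-- Often disabled for reduce tasks to avoid re-fetching map output
       over the network for the duplicate attempt -->
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
</configuration>
```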
The tasks which qualify as small are known as uber tasks in YARN, and they are launched and run on the same JVM as the application master. In the case of MapReduce 1, the tasktracker shares its JVM with the map and reduce tasks. The property mapred.job.reuse.jvm.num.tasks decides how many tasks can run on the same JVM. 1 is the default, and -1 can be set to indicate that there is no limit on reusing the JVM. The next feature we look at is skipping bad records. There can be a situation where a task keeps failing because of bad data, such as a corrupt record. When the input data is large, that situation is quite likely to happen. For this reason, your program should be designed in such a way that, in case it receives a bad record, it does not process it, but rather handles it with exceptions, and a counter should be maintained to keep track of how many such records there have been. Counters will be discussed in a little more depth in the next segment of this course. So you have designed your code to handle the unexpected situations, but still there can be a record with data not handled by the code. It has been analyzed and observed that there won't be many of such kinds of records; there will be only a few, and these would cause the task to fail, and thus the whole job to fail. To handle such situations, the MapReduce framework has a feature of skipping the bad records. Let us understand how it works. Let the lines signify the input records in an input split, with blue lines as good records and the red line as a bad record. The first task attempt would process all the good records to produce key-value pairs; as it reaches the bad record, it would fail. Observing this, the jobtracker would launch the task on another machine, to ensure that the underlying hardware or a faulty configuration is not what is causing the issue.
The task would process all the good records and would fail at the bad record again, and again the jobtracker responds to this with a new task attempt. This attempt once again processes the good records, and when it fails, it reports the record on which it has failed. Skipping mode is then enabled by the jobtracker. Now the fourth attempt of the task processes the good records, and when it reaches the bad record, it skips it, tries the next records, and continues processing. So there are three failures before this skipping mode is enabled. It is designed so because, if on every failure the task attempt started to communicate the record on which it has failed, it would cause two potential issues. First, valuable network bandwidth would be wasted communicating the record information. Second, the jobtracker would be loaded with a lot of failure data, and it would get difficult for the jobtracker to keep track of all the records. So if you want to use this feature effectively, you may want to increase the values of mapred.map.max.attempts and mapred.reduce.max.attempts, which control the maximum number of retries for map and reduce tasks, respectively. As we have discussed in a previous lesson, the default value for these is 4. This wraps up our discussion regarding these topics. See you in the next lesson. 35. 034 Looking at Counters: Hello, and welcome to a new lesson. In this lesson, we will learn about counters in MapReduce programming. Counters can be categorized into two sub-categories. The first category is task counters, which hold counters pertaining to a task, and the second category is job counters, which hold the counters associated with the whole job. Task counters are passed to the tasktracker, and then they are sent over to the jobtracker, which aggregates the counters from all the tasks that are running. When these task counters are passed, the complete image of the counters is sent, and not just the changes, or the delta updates as we call them.
This is done in order to avoid errors in case of the loss of a message in transmission. Task counters can be further subdivided into user-defined and built-in counters. User-defined counters are generally designed to help the user understand the nature of the data that is being processed. Job counters, on the other hand, measure job-level statistics. They are maintained by the jobtracker in classic MapReduce, or the application master in YARN. They hold data like the number of reduce tasks and map tasks, and so on. Let us look at the output of one of the MapReduce jobs and understand it in a little depth. This is the output of a job. It starts with the total number of input paths, which is one in this case. Then it shows the progress of the job as it is happening. Remember that jobs can run for a long time, and so this feedback mechanism is required so that the user knows that the job hasn't hung. As we have already seen, the last 33% of the reduce progress is divided between shuffle, sort, and the actual reduce method, so at this point it has probably just completed the shuffle step. Next come the counters and their details. It says "Counters: 29"; that means that there are 29 counters in all that will be displayed. Now, as we have discussed, the counters can be divided into two portions: first the job counters, and second the rest of them, which are the task counters. The job counters show the number of reduce and map tasks that were launched, and the time spent on running the reduce and map tasks. They also show how many maps got the advantage of data locality. SLOTS_MILLIS_REDUCES shows the time it took to run the reduce tasks in milliseconds, so it is 9350 here. Then come the task counters, and these are all built-in counters, which we can further divide. The File Output Format counters contain the number of bytes written. The FileSystemCounters have the details of the bytes written to and read from the file system; in this case it is HDFS, but it can be the local file system as well.
In the case of standalone mode, it would be the local file system. Here you see a high value, as this is the net number of bytes written to and read from the file system; it is not just the number of bytes written to a file, so there is a little metadata that is being transferred as well. Then come the File Input Format counters, which show the number of bytes that were read by the map tasks. Then come the counters for the Map-Reduce framework. "Map output materialized bytes" shows the number of bytes that have been written to the disk by the map tasks. Then comes "Map input records", the number of records the maps have processed. "Reduce shuffle bytes" shows the number of bytes that were shuffled across the network. "Spilled Records" shows the number of records that were spilled to disk. "Map output bytes" shows the number of bytes output by the maps. "Total committed heap usage" is the number of bytes of heap that were used by the job; it is an important metric, particularly when you want to know how much main memory is being utilized by your job. "CPU time spent" gives an indication of CPU usage. "Combine input records" shows the number of values that were iterated over by the combiner as its input. Remember, the keys won't give the real count of input records to the combiner, but the values will, as the input to the combiner is in the form of keys and lists of values. "SPLIT_RAW_BYTES" represents the split metadata, rather than the split data itself. "Reduce input records" shows the number of input records to the reducers. "Combine output records" shows the number of output records from the combiner. The physical and virtual memory counters show the amount of physical and virtual memory that has been used. "Reduce output records" and "Map output records" show the number of records that the map and reduce functions output. So these are the various indicators which can give an understanding of the input and output data and the processing mechanism. Next, let's learn about the user-defined counters.
The general idea behind designing user-defined counters is that they should bring out meaningful insight about the data that is being processed. Furthermore, as a good programming practice, it is recommended to have counters which help the user understand the data that is processed. So, in general, the map-side code would look like a map function with its processing logic. It is always recommended to safeguard the processing logic with an if clause and check whether the data record is in the proper format. If it is not, the code should increment a counter. So, at the end of the execution, the user would be able to see what percentage of the records have fallen into the bad category, and whether the result produced actually depicts a large portion of the data. Counters are employed through the Context object in the recent versions of Hadoop; in the earlier versions of Hadoop, the Reporter object was used, although the programming structure is exactly the same, as we will see here. 36. 035 Hands on Counters: Hello. In this lesson, we will describe how to implement counters in our word count program. I will attach the source code along with this lesson. Here is the driver class, and it is the same as what we have discussed so far in the course. The only change made here is that it refers to a new mapper class, the "map with counters" class. The reducer is pretty much the same. Then the mapper class: the only change I have made is that I have put in the logic to implement the counter. The statement used is context.getCounter, and then come the arguments: the first argument is the counter group, and the second argument is the name of the counter. The increment method is used to increment the counter value by one. And this is how you can design a simple counter. This counter would be incremented only when the first letter of the word is non-alphabetic; as you can see from the code, the counter-incrementing code is in the else part of the if-else clause.
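The same counter logic can be shown in plain Java, so it can run without a cluster. This is a sketch under stated assumptions: in the real mapper the increment is `context.getCounter(<group>, <name>).increment(1)`; here a `Map` stands in for Hadoop's counter framework, and the counter name is my own.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the lesson's user-defined counter: count words whose
// first character is not a letter, incrementing in the else branch, exactly
// as the mapper described above does.
public class WordCounterSketch {
    static Map<String, Long> counters = new HashMap<>();

    static void map(String word) {
        if (Character.isLetter(word.charAt(0))) {
            // normal word-count processing would go here (emit word, 1)
        } else {
            // the else branch increments the user-defined counter
            counters.merge("NON_ALPHABETIC_FIRST_CHAR", 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        for (String w : new String[] {"hadoop", "42nd", "map", "3dots"}) map(w);
        System.out.println(counters.get("NON_ALPHABETIC_FIRST_CHAR")); // 2
    }
}
```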
Remember, the idea of the counters is to gauge the quality and nature of the data that is being processed. So here, it gives us an idea of how many words in the input start with numbers, and whether the files actually contain proper words or not. The reducer is exactly the same as we have discussed. Let me just export the job's JAR file; I will deselect all the classes except for these three classes. Now, moving to the terminal, there I see the word count JAR file. First, I just do a jps and see if all the daemons are up and running; everything is up. So now I do a listing on the Hadoop file system, and here you see all the files. I have already created an input file which has words starting with numbers; let me just cat that. Now let me run the program. And here we see the group and the counter which we have mentioned in the code. I should have named the group in capitals, so that it would have looked better, but that is something you can do when you are running this program. I will just cat the output result here, and you can see the complete output. As an assignment, I would suggest that you run it on a large dataset and experiment a little bit with counters. You can search Google for large text datasets, and with a little effort you will get a large file to play with. Design a few more counters as per your needs and see how the output changes. 37. 036 Sorting Ideas with Partitioner Part 1: Welcome to a new lesson: sorting using the partitioner. Let's take a deeper dive into the partitioning function, so that we understand its usage; that will help us smartly apply it to other problems as well. Now, we have seen that the data flows through the map logic, where it is processed; then it goes to the shuffle and sort and partitioning phase, which is all provided by Hadoop; and then it goes to the reduce phase, which again is designed by the user, and ultimately it produces the output result.
In this lesson, we will learn how to tweak these Hadoop-provided steps to suit our solutions. For this, let us split apart the sort and shuffle and the partitioning steps, so that we can emphasize the importance of the partitioner and its importance to the solution. So, when you design a MapReduce solution, you should always visualize the data flowing through map, shuffle and sort, partitioner, and the reduce function. This represents the logical flow of the data. Remember that the actual flow of the data is a little different: the map results are sorted and partitioned on the map machines themselves, then shuffled across the network to the reduce machines, where they are merged and sorted again. That flow is the actual flow, but while designing the solution, you would like to divide your solution into the logical phases as shown. Now, one thing to note is that your solution will use the sort phase in almost every case. If you do not want to use the sort phase in your solution, you may think of putting all the logic in the map phase and running it without the reducers. Although those kinds of scenarios do exist, you will use the sort phase in your solutions almost all the time. Now, when you are working on a large dataset, there will be many input splits, and many maps will be working in parallel. This gives a higher throughput. Parallel processing is the strength of the MapReduce framework, and it should be used at all times. But the catch which a newbie tends to fall into is that he or she often forgets to use this parallel processing in the reduce phase, which reduces the overall efficiency of the job. Remember that the number of reducers that are being used needs to be explicitly set by the user, while the number of maps is automatically and intelligently decided by the Hadoop job, keeping in consideration the split size and data locality.
So, in the case of a single reducer, what happens is that the output of all the maps is shuffled across to the same reducer, and that reducer is running on commodity hardware. And so all the advantage that has been generated in the map phase is lost during the reduce phase. So it is recommended to increase the number of reducers. Now consider the behavior of the default partitioner. In this case, what happens is that the two reducers each output an individually sorted file, but these two sorted files are not easy to merge together into one sorted output file. By "easy", I mean that we won't be able to produce one large sorted file just by concatenating them. Let's see this with an example. Let me just show the input file; I have just put random lines in it. I set the property mapred.reduce.tasks to two and run the program, and here you can see that two reducers ran the job. Now, when I do a listing on the output directory, you see exactly two reducer files. Let me just cat the part-00000 file, and now let me just cat the part-00001 file. So here you see that the two files are individually sorted, but when they are concatenated, they won't produce a completely sorted file. So, coming back to the presentation, I will put up the important points and observations. First of all, the default partitioner is the hash partitioner, and what we are learning about here is hash partitioning. The logic behind this partitioner is that it computes the hash code of the key, performs an AND operation with Integer.MAX_VALUE, and then takes a modulo with the number of reducers specified. So, in this case, what happens is that the distribution of the data is such that the result files are individually sorted, but they need some extra effort if they are to be combined into one large sorted file. This type of scenario, where the result is in this form, is known as a case of partial sort. Another thing to be kept in mind is that all the key-value pairs that the reducer emits for a key group will be present in the same result file.
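The hash-partitioner dispatch rule just described can be written out in plain Java. This is a sketch, not Hadoop's class itself; the key list and file naming are illustrative, and it shows why each part file ends up sorted on its own while the concatenation is not globally sorted.

```java
import java.util.*;

// Plain-Java sketch of the default hash partitioner's rule:
// partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
// The AND with Integer.MAX_VALUE clears the sign bit, so the result of the
// modulo is never negative.
public class HashPartitionSketch {
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "mango", "banana", "peach", "grape");
        Map<Integer, List<String>> parts = new TreeMap<>();
        for (String k : keys)
            parts.computeIfAbsent(partition(k, 2), p -> new ArrayList<>()).add(k);
        parts.values().forEach(Collections::sort);  // each reducer sorts its own keys
        // Each part is sorted; concatenating them is generally NOT globally sorted.
        parts.forEach((p, ks) -> System.out.println("part-0000" + p + ": " + ks));
    }
}
```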
This is a particularly important point to be kept in mind if you are designing a solution with a chain of MapReduce jobs, where the output of this job acts as an input to another job. In case you want your solution to output files which can be concatenated to produce one big sorted file, that is termed a case of total sort. To help design such solutions, Hadoop expressly provides a partitioner known as the total order partitioner. Writing a custom partitioner which can partition the map output into sorted and almost equal distributions of partitions is a little difficult. I would request you to give a thought to writing such a custom partitioner, which is capable of sorting and partitioning the map output into equal portions. It would be very difficult, and through this you would be able to understand the beauty of the total order partitioner. The biggest challenge in designing such a solution is to divide the map output into distributions that are almost equal. It would be inefficient if one reducer gets most of the portion of the work and the other one doesn't get any. It can happen that the shape of the key distribution is in the form of a bell curve. In those cases, dividing the key space evenly by the number of reducers wouldn't give a uniform distribution of workload. So, along with the total order partitioner, Hadoop provides an input sampler, which samples the input space to find out the distribution and helps the total order partitioner divide the key space into somewhat equal portions. So you will always see an implementation of the input sampler along with the total order partitioner. In the next section, we will understand another technique we can apply to our solutions, that is, the secondary sort technique. 38. 037 Sorting Ideas with Partitioner Part 2: Welcome to a new lesson: secondary sort. In this lesson, we will learn another technique that can be applied to problem scenarios. There can be a problem scenario where you would like the output of a key group's results to be sorted by value, and not just by the key.
The idea is that you want the order of the values to be stable across consecutive runs, which in general doesn't happen. This technique, which is required to have the values in an ordered form, is technically known as a case of secondary sort. Let's understand this with an example. Suppose we have input records with the year, month, and maximum temperature recorded in that month. The ultimate objective of our problem is to feed the reducer with the data in such a form that the records are arranged in descending order on the basis of the temperature value. We can use this input form at the reduce phase and design the reducer to just emit the first of its inputs, which makes the reducer output the maximum temperature for that year. Now, this is not the recommended approach to find the maximum temperature, but to understand the concept of secondary sort, we will design a solution with this approach. The more recommended approach is to simply treat the year as the key, not worry about the sorting of the values, and have the logic for finding the maximum temperature in the reduce phase. But of course, we're here to understand a new concept. So the challenge is: what should be the key-value pairs of the map phase, and how are you going to emit the map output so that you get the values ordered by temperature in the input to the reducer? This input to the reducer is shown just as a guideline; you can, of course, change it in your solution. Although the complete approach hasn't been covered here, I would still ask you to give it a little thought. Pause the video, and do think for a minute. Now, let's look at the approach. I'm pretty sure that you would have considered the combination of year and temperature as the key, and the whole record as the value. Even if you were close to this approach but not certain, please accept my congratulations.
You were on the right path. So now the map output would look as shown here: the keys would be the combination of year and temperature, and the value would be the record. But what happens in this case is that for another record, 1900 comma 9, the key would have a different hash code from the previous record with the key 1900 comma 11, and so these two records would go to two different reducers instead of the same reducer. This would not be a good case for us. So here we learn a new concept of composite keys. A composite key is composed of two portions: the natural key and the natural value. The natural key is the portion of the composite key which should be considered for partitioning and grouping, whereas the natural value is the portion of the composite key which is considered while sorting. So, in this case, the solution would require implementing the following. The first step is to make a custom Writable class in order to handle the composite key. A composite key would always be built on top of two or more Hadoop data types; in this case, it would be a pair of IntWritables. While writing a custom Writable, you need to override a few basic sets of functions, which are used by the MapReduce framework to read, write, compare, hash, and convert the object to strings. The second thing we need to do is tell Hadoop how to compare the custom Writables while performing the sort. You do this by using the function job.setSortComparatorClass. To this function, you pass a custom comparator implementation and override its compare method, to help Hadoop understand which custom key is smaller than the other when compared. For example, in this case, 1900 comma 9 would have to come earlier than the 1900 comma 11 record, in case we want to arrange the records in ascending order of temperature.
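The composite-key ordering can be sketched in plain Java, leaving out Hadoop's Writable machinery. This is an illustrative toy (names are my own; a real job would express the same comparison inside a custom comparator registered via job.setSortComparatorClass). Here the ordering is year ascending, then temperature descending, matching the problem's objective.

```java
import java.util.*;

// Pure-Java sketch of the secondary-sort ordering over composite keys
// (year, temperature): compare the natural key (year) first, then order the
// natural value (temperature) in descending order within each year.
public class SecondarySortSketch {
    static final Comparator<int[]> SORT_COMPARATOR = (a, b) -> {
        if (a[0] != b[0]) return Integer.compare(a[0], b[0]); // year, ascending
        return Integer.compare(b[1], a[1]);                   // temperature, descending
    };

    public static void main(String[] args) {
        List<int[]> keys = new ArrayList<>(Arrays.asList(
            new int[] {1900, 9}, new int[] {1901, 22},
            new int[] {1900, 11}, new int[] {1901, 15}));
        keys.sort(SORT_COMPARATOR);
        // A reducer grouping on the year alone would now see each year's records
        // with the maximum temperature first: (1900,11) (1900,9) (1901,22) (1901,15)
        keys.forEach(k -> System.out.println(k[0] + "," + k[1]));
    }
}
```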
So, in the compare function of the sort comparator class, we will have to write logic that compares the first part of the composite key, the year, and then considers the second part of the composite key, the temperature, to find the order. Then comes the custom partitioner. This is required by Hadoop to correctly identify which partition a record belongs to. We will have to override the getPartition function in this, and it is always the natural key portion of the composite key which should decide the partition. Lastly, we need to tell Hadoop which field it needs to group on to feed the input to the reducer. For this as well, the natural key portion of the composite key would be the grouping field; in this case, it is the year. So, in short, to design such a solution, you need to decide on the composite key and value and perform these four steps, which may vary slightly across different versions of Hadoop, but the core idea remains the same: it is just to tell Hadoop about the sort, partitioning, and grouping of the composite keys. 39. 038 Map Side Join Operation: Welcome to a new lesson. In this lesson and the next one, we will learn how joins can be applied in the MapReduce framework. A join is an operation where we combine two or more datasets based on a column or a set of columns. At this point of time, I'm assuming that you are aware of the different types of joins, that is, outer join, inner join, equi join, etc. Joins can be applied in two ways in the MapReduce framework: the first is the map-side join, and the second is the reduce-side join. Both of them have their own pros and cons, and should be applied only to the specific scenarios in which they fit. While the map-side join is more efficient in terms of speed, on the other hand it has a lot of constraints on the scenarios where it can be applied. The reduce-side join is the more flexible of the two and can be applied to almost all situations, but it is comparatively slower than the map-side join.
we will learn about map-side joins. Well, before we start, I would like to mention that joins are fairly complex to design in the MapReduce framework in Java. It could easily take you hundreds of lines of code with a complex design, and you can do the same thing in high-level frameworks such as Pig and Hive in just 5 to 7 lines. The one advantage of a Java solution is that it is highly optimized in terms of the processing speed of the data, but that optimization is not really significant when weighed against the ease and speed of developing the solution. So it is highly recommended to use Pig or Hive for join operations, and you would see that being practiced widely. But let us still have a look at how such solutions are designed and what a map-side join is; this will build a deeper understanding of the MapReduce framework. So let's take an example. Of course, we are constrained by the size of the slide; the datasets here are tiny, just to illustrate concepts that can be applied to very large datasets. Say we have a dataset which shows the billing details of employees on projects, with project number, employee ID, rate in dollars per hour, and the hours billed by the employee on that project. Now let's say we have another dataset which has project details like project ID, project name, and budget. Let us assume that we want to combine these two datasets on the basis of project ID and see all the project details and employee billing details together. Now, in a map-side join, the map operation produces the joined dataset. We can use the reduce phase to sort the result, or we can just as well choose not to use the reduce phase at all and end with the map output. So, looking at the diagram, you can guess that the map will have to receive the input data in the following form so as to produce the shown output.
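Assuming both inputs arrive sorted by project ID (the precondition discussed next), the single pass a map-side join conceptually performs is an ordinary sorted merge. The following plain-Java sketch illustrates it; the field layouts and names are made up for illustration.

```java
import java.util.*;

// Sketch of the merge a map-side join performs over two inputs that are both
// sorted by the join key (column 0 = project id). No Hadoop dependencies.
public class MapSideJoinDemo {
    static List<String> join(List<String[]> projects, List<String[]> billing) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < projects.size() && j < billing.size()) {
            int cmp = projects.get(i)[0].compareTo(billing.get(j)[0]);
            if (cmp < 0) i++;          // project with no billing rows
            else if (cmp > 0) j++;     // billing row with no project (skipped here)
            else {
                // Emit one joined row per billing record for this project.
                out.add(String.join(",", projects.get(i)) + ","
                        + billing.get(j)[1] + "," + billing.get(j)[2]);
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> projects = List.of(
                new String[]{"PRJ001", "Payroll"}, new String[]{"PRJ002", "Billing"});
        List<String[]> billing = List.of(
                new String[]{"PRJ001", "EMP1", "40"}, new String[]{"PRJ001", "EMP2", "25"},
                new String[]{"PRJ002", "EMP3", "10"});
        join(projects, billing).forEach(System.out::println);
    }
}
```

Because both sides are sorted and partitioned identically, each mapper only ever needs the matching slices of the two inputs, which is what makes the join possible without a shuffle.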
So now, looking at this, we can infer the strict requirements which should be considered while thinking of a map-side join as an option. First and most importantly, all the input datasets must be sorted by the same key, and that key must be the one on which the join is going to be performed. Furthermore, each input dataset must be divided into the same number of partitions, and all the records for a particular key must reside in the same partition. As you can see, here the key is the project number, and it is important for the map to have all the data records from both the input datasets for a particular project number present at once; only then is it possible to correctly join the two datasets. Now, all these may seem to be very strict requirements, very difficult to attain. But they exactly describe the output of a MapReduce job: if both input datasets have gone through MapReduce jobs with the same key used at the time of partitioning, and the number of reducers used is the same, then the output will be produced with an equal number of partitions. Secondly, each of the datasets will be sorted by the key. And lastly, all the records pertaining to a key will be present in a single partition. So whenever you see map-side join logic applied, it will be within a MapReduce job chain. In the next segment, we will learn about the reduce-side join, which is the more flexible of the two.

40. 039 Reduce Side Join Operation:

Welcome to a new lesson. In this lesson, we will understand how a reduce-side join is designed in the MapReduce framework. To understand the join, we will again take the same two input datasets as we discussed in the last lesson and try to achieve the same result. We will try to understand the main ideas and design concepts and see how to break the problem into map
and reduce phases. First we will look at the design aspects, and at the end we will see how it all comes together while designing a solution. In a reduce-side join, in the map phase we just tag each data record with its source, and that is all we do in the map phase. Here, the two input datasets are directed to two different map classes, which simply output a composite key, which is a combination of the project number and a tag number, while the complete record is emitted as the value. In this case, the key on which the join operation is to be performed acts as the natural key, and the tag number acts as the natural value portion of the composite key. The whole idea of tagging is that at the reduce phase we would like the input to the reducer to be in the form shown. You can observe that all the records with the same project ID from both the datasets go to the same reducer, because we design the partitioner to act on the natural key portion of the composite key only, and here the natural key is the project ID. Another thing to observe is that the tags act as the natural value portion of the composite key, which decides the sorting of the records; and because of these tags, the order of the records will be such that the record which holds the details of the project comes before all the records which hold billing details. Here, we want to expand the project details into the billing dataset, so the relationship is one-to-many. And so the idea is that the one record which needs to be expanded across the rest of the records should come at the top, and all the other records should follow it. Once we manage this, at the reduce phase we just need to store the first record of the group and append it to the rest of the value iterations for that key to produce the final dataset. Now, coming to the technical aspects of the solution: first, we need to direct the two different input datasets to two different map logics.
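The tag, sort, and expand flow just described can be simulated in plain Java, with a sort standing in for the shuffle. The tag values, field layouts, and names are all assumed for illustration; a real job would use a custom WritableComparable, partitioner, and grouping comparator as discussed.

```java
import java.util.*;

// Plain-Java simulation of a reduce-side join: tag records by source,
// sort by (join key, tag), then expand the one-side record over the many side.
public class ReduceSideJoinDemo {
    static class Tagged {
        final String key; final int tag; final String value; // tag 0 = project, 1 = billing
        Tagged(String key, int tag, String value) { this.key = key; this.tag = tag; this.value = value; }
    }

    static List<String> reduceJoin(List<Tagged> records) {
        // Shuffle stand-in: sort by natural key, then by tag (the natural value),
        // so each group's single project record arrives before its billing records.
        records.sort(Comparator.comparing((Tagged t) -> t.key).thenComparingInt(t -> t.tag));
        List<String> out = new ArrayList<>();
        String currentKey = null, projectInfo = null;
        for (Tagged t : records) {
            if (!t.key.equals(currentKey)) { currentKey = t.key; projectInfo = null; }
            if (t.tag == 0) projectInfo = t.value;                    // remember the one-side record
            else out.add(t.key + "," + projectInfo + "," + t.value);  // expand it over the many side
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> recs = new ArrayList<>(Arrays.asList(
                new Tagged("PRJ001", 1, "EMP1,40"),
                new Tagged("PRJ001", 0, "Payroll,100000"),
                new Tagged("PRJ002", 0, "Billing,50000"),
                new Tagged("PRJ002", 1, "EMP3,10")));
        reduceJoin(recs).forEach(System.out::println);
    }
}
```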
This can be done by using the MultipleInputs class in the driver. This is possible through the method MultipleInputs.addInputPath(), where you specify the job, the input path as an argument, the input format, and the mapper class through which you want that dataset to be processed. So here you can take multiple files as input to the job and direct each input file to a different map logic. This idea is useful in all scenarios where a job has multiple inputs, each dataset has a different format, and there cannot be a single logic to process all the different datasets; in all such cases, MultipleInputs is used. The rest of the design ideas are the same as we discussed for the secondary sort. There has to be a custom data type that extends WritableComparable, designed to handle the composite key, and all the necessary functions need to be overridden for that custom Writable. A custom partitioner should be designed which considers only the natural key portion of the composite key, and then a custom comparator class needs to be declared to tell Hadoop how to compare two records and sort on the basis of the natural value portion of the composite key. The reduce logic is then simply to store the first record of the group and append it to the subsequent occurrences of the values to produce the result. This is the main idea when you are applying a reduce-side join in the MapReduce framework.

41. 040 Side Distribution of Data:

Welcome to a new lesson. In this lesson, we learn another technique that can be used in a MapReduce solution: side distribution of data. Side distribution of data can be done through the distributed cache mechanism. A dataset can be distributed to the task nodes, and mappers and reducers can read the local copies present with them at the time they are performing their map and reduce tasks. This mechanism is known as the distributed cache mechanism.
This kind of solution is generally applied when there is an operation on two or more datasets in which one is a much smaller dataset. It can be the case that a small amount of information needs to be looked up at map or reduce time, yet the data is not small enough to fit in the memory of the program, nor is it a good idea to make it go through the shuffle and sort stages. Let us understand this with an example. Let's consider the one we have already discussed in the last lesson: there is a billing dataset and a projects dataset, and we need to expand the project information into the billing dataset. In this case, it can happen that there are only a limited number of projects, so a better idea would be to distribute the smaller dataset using the distributed cache. So the idea is to use this mechanism when the dataset you want to look up is small, but not so small that you can hard-code it into the map or reduce program itself. In all those cases where we need to refer to a related, smaller dataset at map or reduce time, we use this concept of distributed cache. The file that needs to be distributed is specified at run time by using the -files option; -files takes the path of the file which needs to be distributed. You can as well distribute Hadoop archives using -archives, and in case your job needs to access other utility jar files and you want to include them in your classpath, you can do that as well by using the -libjars option. This is useful when you are using external jar files for operations in your code. On the programming side of things, you need not make any changes to the driver class if it is run using ToolRunner; GenericOptionsParser extracts all these standard arguments, and the programmer need not code anything in the driver class for this. At map and reduce time, the distributed dataset
is connected to in the setup() function. setup() is a function which is called once per mapper or reducer, so anything relating to setup that needs to be performed once per mapper or reducer object can be done in this setup function. The connection to the dataset established in the setup function can then be used during the map function as required. Let's see what happens and how the distribution takes place. This is the diagram which we saw in the earlier lessons. At step 3, when the job client copies the job resources, it copies the distributed cache files as well, with a very high replication factor, so that nearly every data node has a copy. Then, at step 10, the YARN child retrieves the job resources, like the jar file and the distributed cache files, and copies them to the local machine. It is in this manner that the distributed cache is transferred to the local machine, where a mapper or reducer can refer to it and use it.

42. 041 Hadoop Streaming and Hadoop Pipes:

Welcome to a new lesson. In this lesson, we will talk about a few miscellaneous features of Hadoop; specifically, we will be talking about Hadoop Streaming and Hadoop Pipes. It is to be understood that a main idea of Hadoop's design is that data processing should be independent of the language: it should be flexible enough that programs can be designed in many languages to do the processing. The idea is that the data should be able to potentially outlive any programming language. When you keep this core idea of Hadoop in mind, you will be able to better understand the concepts of Hadoop Streaming and Hadoop Pipes. The core idea is to keep data processing independent of the language in use. Hadoop Streaming is the ability of Hadoop to interface with map and reduce programs written in languages such as Ruby and Python. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program.
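The stream-based contract can be illustrated in plain Java: the "mapper" and "reducer" below are ordinary functions over lines of text, standing in for the line-oriented executables (in any language) that Streaming would launch via -mapper and -reducer. The word count shown is a common example, not code from the course.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the Hadoop Streaming contract: programs consume lines and emit
// key<TAB>value lines; the framework sorts the map output before the reducer.
public class StreamingContractDemo {
    // Mapper: one "word<TAB>1" line per word.
    static List<String> map(List<String> lines) {
        return lines.stream()
                .flatMap(l -> Arrays.stream(l.trim().split("\\s+")))
                .filter(w -> !w.isEmpty())
                .map(w -> w + "\t1")
                .collect(Collectors.toList());
    }

    // Reducer: input arrives sorted by key; sum the counts per key.
    static List<String> reduce(List<String> sortedKv) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String kv : sortedKv) {
            String[] parts = kv.split("\t");
            counts.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
        }
        return counts.entrySet().stream()
                .map(e -> e.getKey() + "\t" + e.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> kv = map(Arrays.asList("hadoop streaming", "hadoop pipes"));
        Collections.sort(kv);                    // what the framework's shuffle does
        reduce(kv).forEach(System.out::println); // hadoop 2, pipes 1, streaming 1
    }
}
```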
In simpler terms, you can write map and reduce programs in Ruby and Python that use standard input and output streams. I am not a Ruby or Python expert, so I will not write and show you a program in those languages here; but if you are interested, I would suggest you search for the word count problem implemented with Hadoop Streaming in Ruby or Python, and you will find good material on it. To run a program written in these scripting languages, the command line will look as shown: the map and reduce scripts need to be specified with -mapper and -reducer. So in streaming, there is no driver class as such. Then comes another feature, Hadoop Pipes. It is simply the name of Hadoop's interface to C++. Unlike Hadoop Streaming, which uses the standard I/O streams, C++ Pipes uses sockets as the channel to communicate with the tasktracker. If you are looking to explore the possibilities of writing C++ code for Hadoop, you might like to search for Hadoop Pipes, and you will find good material. All in all, Hadoop supports these languages, but the best supported one is Java.

43. 042 Introduction to Pig:

Welcome to a new lesson. In this lesson, we will learn about Pig, which is a part of the Hadoop ecosystem. Whenever you are starting to learn about any ecosystem component, it is of real importance that you know the origin of that component and the main idea and reason behind its need. Pig was developed at Yahoo around the same time period in which Facebook was developing Hive. So you would see that the ecosystem components were not initially conceptualized to work alongside each other, and thus there is an overlap in capabilities, and solutions may be possible in a variety of ways. Another issue you would observe is compatibility between the Hadoop ecosystem components. So, coming to Pig: it was initially developed and conceptualized at Yahoo, and the idea was to give data scientists the ability to write
MapReduce programs quickly and easily. As you have seen, join operations in the MapReduce framework can easily take a hundred lines of complicated code. Firstly, that takes a lot of time to develop, and secondly, it is very difficult for data scientists to put that kind of time into development and to have the skills for that kind of complex programming. This challenge gave birth to a higher-level language framework, Pig, at Yahoo. The idea behind Pig was to provide a simpler alternative to MapReduce. So let's compare Pig and MapReduce in terms of their capabilities to understand which one should be chosen over the other, and in which situations one performs better than the other. First of all, it is recommended that Pig be used for complex join operations; as you will see later in the slides, the development effort is vastly reduced. However, one thing that should be kept in mind is that solutions written directly in MapReduce are highly optimized and give a lower turnaround time. So if you are thinking of writing a reporting job which might be executed frequently on a large dataset, you might consider writing it in MapReduce. Pig scripts are parsed and converted into MapReduce programs, so it should be understood that Pig solutions are not fully optimized solutions; but with time, the optimization is being worked upon and the gap is being closed. Having said that, Pig solutions are likely to remain somewhat slower than MapReduce solutions for some time to come. Pig is a highly capable language, and most of the operations that can be done in MapReduce can easily be done in Pig. However, it does not have the ability to touch only a small portion of a dataset; it scans the whole dataset with each operation. So, capability-wise, it is almost as strong as MapReduce, just a little less so. Now let us take a look at Pig in a little more detail. Pig has two components.
First is Pig Latin, which is the programming language, and second is the environment which is required to run Pig programs. The environment is nothing but a tar file that needs to be installed on the client node; it translates Pig queries into MapReduce jobs. The environment can have the following two types of setup: first, the local mode of execution, and second, the MapReduce mode. In local mode of execution, Pig runs in a single JVM, whereas in MapReduce mode it translates the Pig program into MapReduce programs and connects to Hadoop to run them on the Hadoop cluster. At this point, it is to be very well understood that there are a lot of compatibility issues between the Hadoop ecosystem elements, and hence it is advised that compatibility be cross-checked against the release notes; this applies to every ecosystem component. There are three ways in which Pig Latin can be coded. Firstly, it can be coded as a script, where a bunch of commands are recorded to perform the functionality; Pig script files end with the .pig extension. Then there is Grunt mode, which acts as an interactive shell for entering Pig commands. Then there is an embedded mode, where Pig commands can be embedded into a Java program; in that case, you would have to use the PigServer class, just like you use JDBC to run SQL code from Java. Next, let us look at an example of how things work in Pig. On this slide, we will go through a set of commands as if they were run in the interactive mode, that is, Grunt mode. Just remember that Pig Latin is a data flow language. We will take roughly the same example dataset which we considered in the previous lesson: let there be billing details having project number, employee ID, number of hours billed on the project, and the billing rate. First, we see the Pig LOAD command: A = LOAD, then comes the URI of the file that is to be loaded, followed by the schema which should be used to read it.
Here, the columns are delimited by commas, and parsing happens based on that; Pig has many features to read files with different kinds of delimiters. In the second portion of the statement, we specify the schema, which has a column name and a data type for each field. Pig has its own data types, and they can be used to form composite, complex data types as well. So here the first column is the project number, declared as a chararray, and the employee ID, hours, and billing rate as integers. This LOAD command loads the dataset into A. It is to be observed that Pig Latin is a data flow language: here you see the assignment of a dataset to a variable, and then operations performed on that variable to get the desired result. Next, we see the FILTER command. In FILTER, we can specify conditions; the complete dataset behind A is scanned, and the records that pass the condition are taken into the resulting dataset. The resulting dataset can be seen through the DUMP command. For example, FILTER A BY prjnum == 'PRJ001' results in the records whose project number is PRJ001. The result of any command is termed a relation, every record is termed a tuple, and every variable is technically termed an alias; so filter_A acts as an alias of the resulting relation. The DUMP command is used to display a dataset on the screen, so DUMP filter_A would produce the shown relation. Then, another operation that can be done on a dataset is the GROUP operation, for example group_A = GROUP A BY prjnum. This means that we group the relation in the specified alias by the project number field. DUMP group_A would produce a dataset as shown: the first field of each tuple is the field on which the grouping operation is performed; the second element is called a bag, which is an unordered collection of the tuples which have the corresponding project number. Each element in the bag is separated by a comma.
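Put together, the Grunt-shell session walked through above might look like the following sketch; the file paths, alias names, and field names are assumed for illustration, not taken from the course slides.

```pig
-- Load the billing dataset with an explicit schema (comma-delimited fields)
A = LOAD 'billing.csv' USING PigStorage(',')
        AS (prjnum:chararray, empid:int, hrs:int, rate:int);

-- Keep only records for one project, and show them
filter_A = FILTER A BY prjnum == 'PRJ001';
DUMP filter_A;

-- Group by project number: each result tuple is (prjnum, {bag of tuples})
group_A = GROUP A BY prjnum;
DUMP group_A;
```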
In this case, there would be two tuples in each bag. Next, let us look at how easy it is to write the join operation. The first statements are LOAD commands to load the datasets into aliases; then, with a simple command as shown, the join operation is performed. $0 represents the first column in the respective alias, based on which the join needs to be performed. So writing complex operations is really simple in Pig. To wrap up Pig, always remember these points. First, it is a data flow language. It was designed for data scientists who did not have a Java or complex-programming background; hence, it is a high-level language which is easy to use. It was developed for fast-paced development of solutions and is ideally suited for complex operations like joins. It is almost as capable as MapReduce, but not quite as strong. Pig scans the complete dataset, so it is not suited if you are looking at small portions of the data, and it is a little slower to execute than hand-written Java jobs, which are highly optimized; however, with every Pig release that gap is closing. Pig runs a series of MapReduce programs under the hood. This finishes the introduction to Pig.

44. 043 Introduction to Hive:

Welcome to a new lesson: an introduction to Hive. Let's first start with the need for and advent of Hive. Hive was developed at Facebook for reasons similar to those behind Pig. It was developed for data scientists without strong Java skills, to give them the ability to work on data in Hadoop. To help them, Facebook designed Hive, which is an SQL-like language; so if you are familiar with SQL, you will find yourself at home with Hive. Although it suits data analysis very well, one of the major limitations of Hive is that machine learning algorithms cannot be designed in it. Hive was designed to perform operations on data, such as slicing and dicing, and not to process data with advanced logical operations; for that, MapReduce with the Java language is still the best fit.
A more fundamental idea in Hive, which is in line with Hadoop's philosophy at large, is that the schema can be changed and is bound to the data at the time of reading, not at the time of writing. This core idea sets Hive apart from traditional relational database systems, where the data must comply with the schema at write time; in Hive, the compliance of the data to the schema is verified at the time a read query is issued. Let's look at the basic syntax and a few basic commands in Hive to get a feel for it and for how you can perform operations using Hive. The first thing we see here is a CREATE TABLE command executed in the interactive shell. Like Pig, Hive can be written and executed in script mode, interactive mode, and embedded mode; the hive> prompt signifies that the command is run in the interactive mode. The command is CREATE TABLE, then the table name records, followed by the column names and data types, which is syntactically the same as in SQL. Then we specify the row format delimiters, which is a change from SQL: here we specify a comma as the delimiter, and any symbol can explicitly be specified there. In Hive, the delimiter information plays a vital role when the data is stored or read. At the time of creation of the table, the schema information about the table is stored in a database which is known as the metastore. The metastore is a relational database which is used to store Hive's metadata, that is, information pertaining to the tables. Popular choices for this relational database are Apache Derby and MySQL. At the time of creation of the table, an entry specific to that table is put into that database. Then there is the LOAD command, which loads the data into the table. We specify the input path; the keyword OVERWRITE specifies that any data previously in the table should be overwritten. Then, through the SELECT command, we can perform analysis on the dataset.
For example, in this case, SELECT prjid, SUM(hrs) FROM records WHERE prjid <> 'PRJ001' GROUP BY prjid would yield the total hours for all the projects except PRJ001. Seen this way, Hive can play an important role in the preparation of data which would eventually be used by a MapReduce job. Let's look at a few important Hive concepts which build a foundational idea of how things work in Hive. There are two ways in which a table can be created in Hive: the first is the managed table, and the other one is the external table. The managed table, as the name signifies, implies that the data file is managed by Hive itself. When a managed table is created with the command as shown (this is the default way of creating a table) and we then perform the LOAD action, the input data file is moved from its original location in HDFS to a new location in HDFS, which is the Hive warehouse; Hive now manages the file completely in its warehouse. The Hive warehouse is nothing but a specific directory in HDFS which is managed by Hive itself. Later, when you issue a DROP command, Hive deletes the data from its warehouse and the related metadata from the metastore, and hence the table ceases to exist completely. However, there is an option to declare the table as an external table as well. In this declaration, you see an additional EXTERNAL keyword being used. Now, when the same LOAD command is executed, Hive only records a link to the original dataset and does not even check whether the data is there; it just makes a related metadata entry in the metastore. It does not check whether the data exists at the location or whether the data complies with the schema. This gives the programmer the ability to design a job which puts the data at the location just in the nick of time, before Hive reads it for processing. This lazy binding of the schema is characteristic of Hive, and the other commands behave accordingly.
So in this case, when the DROP command is issued, only the metadata entry in the metastore gets deleted, and the data itself still remains where it is. So there are these two modes in which Hive tables can be declared: a table can either be managed by Hive itself, or it can be declared as an external table. Another interesting feature of Hive is that data can be divided into partitions and buckets. At the time of creation of the table, you can choose to divide the table on a column of the data. For example, if you partition by project ID, then at the time of the load there would be files created for all the different project IDs, as shown. So if you want to perform analysis on a certain range of values of that column, the operation will be performed quickly; hence, it is always a good idea to partition on the column on which the data will be sliced most often. Then there is another way of dividing the data, which is known as bucketing. For that, the syntax ends with the CLUSTERED BY clause, specifying the column name on which the bucketing needs to be performed and the number of buckets the data needs to be divided into. The bucketing operation processes the specified column's data in much the same manner as a partitioner treats keys: it hashes the column and performs a modulo operation using the number of buckets to determine the bucket number, corresponding to which there is a file, and the data is put into it. On a huge dataset, this is a great help, as it makes sampling the data much easier. Another benefit is that if there are two tables which share a similar column and we want to perform a join operation between them, then if we perform clustering with the same number of buckets on the column on which the join needs to be done, the outputs will satisfy all the criteria for a map-side join. And so this mechanism is sometimes used to prepare the data
sets for a map-side join. So, many a time, data will be processed with Hive and then a MapReduce job will run on it. For that, let's look at how the data is stored when processed by Hive. There are two important dimensions to be understood here: the first is the row format, and the second is the file format. The row format deals with how the data fields are stored in the Hive table: how the fields are delimited, how the rows are delimited, how keys and values are delimited, and how collections (complex objects made of several data types) are delimited. This is important when you are writing a MapReduce job which reads data that was previously processed by Hive. The default explicit declaration of such storage would be as shown. The terminology used to describe the row format is SerDe, which is a short form of Serializer/Deserializer. The main types of SerDes store the data object as text, as binary, in a column-based format, via a regular expression, and so on. Then, coming to the file formats: a file can be stored either as sequence files or as RCFiles. A row-oriented layout is stored as sequence files, and if these are the files involved while designing the MapReduce job, we would be using sequence file formats. The other method is to store a column-oriented layout, which is known as the Record Columnar File, in short RCFile. Files in this format are stored as shown; this kind of storage gives an advantage only if a portion of the columns are to be read repeatedly and the others need to be discarded. So you should be aware of the data format in which the data is stored in Hive before you write a MapReduce job which processes that data.

45. 044 Introduction to Sqoop:

Welcome to a new lesson: an introduction to Sqoop. In this lesson, we will learn how data is imported into and exported out of Hadoop.
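Before diving into Sqoop, here is a consolidated sketch of the Hive statements from the previous lesson; all table names, columns, and paths are assumed for illustration, not taken from the course slides.

```sql
-- Managed table with explicit row format (comma-delimited fields)
CREATE TABLE records (prjid STRING, empid INT, hrs INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- External table: Hive only records the location in the metastore;
-- DROP removes the metadata, the data file stays where it is
CREATE EXTERNAL TABLE records_ext (prjid STRING, empid INT, hrs INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/billing';

-- Partitioned and bucketed variant: partition column is not repeated
-- in the column list; bucketing hashes empid into 4 files
CREATE TABLE records_bkt (empid INT, hrs INT)
    PARTITIONED BY (prjid STRING)
    CLUSTERED BY (empid) INTO 4 BUCKETS;

-- Load data into the managed table, replacing whatever was there
LOAD DATA INPATH '/data/billing.csv' OVERWRITE INTO TABLE records;

-- Total hours per project, excluding PRJ001
SELECT prjid, SUM(hrs) FROM records
WHERE prjid <> 'PRJ001' GROUP BY prjid;
```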
Sqoop is a tool designed by Apache to efficiently ingest data into Hadoop and export it out of Hadoop. A more appropriate description: Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases. Let's look at this definition from a closer point of view. Sqoop does it efficiently by doing the copy process in parallel; as we see with every Hadoop ecosystem component, it makes use of parallelism by effectively utilizing the MapReduce framework. Sqoop can transfer data from databases to HDFS, which is called import, and from HDFS to external storage, which is known as export. The data sources are generally relational databases, but they can be other kinds of structured stores as well; for example, the records in a flat file can just as well be imported through Sqoop. The only condition is that the data should be structured in the form of a table-like arrangement of rows; hence we see that the term "structured data store" is used in the definition. Now, when the data is being imported into HDFS, we can set the destination to Hive's or HBase's warehouse directory, or we can place it in HDFS itself. We have the option of controlling the format of the data import as well, choosing between delimited text, Avro, and sequence files. To Sqoop, we just pass a command giving the direction of the movement of the data, the source of the data, the destination of the data, and the format in which it is to be copied. So if you understand this diagram here on your screen, you automatically understand all the commands possible in Sqoop and the functions you can perform with it. One more thing Sqoop performs, apart from the data transfer: while transferring the data from the database store to HDFS, it reads records one by one, and behind the scenes it internally creates a class which maps to the records in the table.
For example, if a table has a numeric column id and a string column name, it would create a record class as shown. This class is the vehicle for the data transfer which Sqoop performs. It is produced by Sqoop and can be reused if you are going to perform MapReduce operations on the transferred data. Hence, you see that it is important that the data is structured, or else Sqoop will not be able to perform any import. So, all in all, if you understand this diagram on the screen, then you understand the whole idea of Sqoop and the functions you can perform with it. Next, let us look at an import statement and understand how the syntax and parameters are specified in Sqoop. Here you see a command to import data from the table test: sqoop import, the --connect option followed by a complete JDBC URL, --username and its value, --table and the table name, and -m with a single hyphen to set the number of map tasks to one. In Sqoop, double hyphens are used for the tool-specific arguments, which communicate to Sqoop the source of the data, the destination of the data, and how to handle the data. A single hyphen is used for the generic options, such as specifying the number of map tasks to carry out the operation, or setting property values explicitly with the -D option, which we have already seen in the course, and so on. By the way, the JDBC URL string looks as follows: it has the driver information, the server hosting your database, and the database name; the user name and table information are given through the other arguments. Because we did not specify a target location, the data will be copied to the default location in HDFS. All these parameters can be specified using a file as well: you can use an options file and specify a file which contains all the arguments and their values, as shown. Next, let's look at a few parameters that can be used in Sqoop.
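An import command of the kind just described might look like the following sketch; the host, database, and credential values are placeholders, and the command cannot be run without a reachable database and a Hadoop cluster.

```sh
# Hypothetical basic import: one map task copies the table "test" to HDFS
sqoop import \
    --connect jdbc:mysql://dbserver.example.com/corp \
    --username dbuser \
    --table test \
    -m 1
```

Note the convention described above: double-hyphen options are Sqoop tool arguments, while the single-hyphen -m is a generic option.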
I've categorized the options according to their functionality, just to make them easier to remember. In this lesson we will see the options, and in the next we will see the usage of those commands. First, let's look at the basic options, which will almost always be there in an import command. We have the connect argument, where we can specify the JDBC URL, an -m option for the number of mappers, and the table argument for the table name. Then, when the data is getting imported from the source database, it can happen that we may not need the complete data in the table. We can filter the data by using the following options. With the query argument we can specify a SQL query, and only the result of the query will be imported. Similarly, we can use the where argument to specify a where clause, and the columns argument can be used to specify the columns which we want to retrieve. The combination of columns and where can be put together in a query clause: in a query we can specify both the column names as well as the where clause, so a query is equal to both columns and where put together. Now let's look at the import formats. You can import data as delimited text (the default), as SequenceFiles, or as Avro. With delimited text you can control the delimiters as well, by using the arguments fields-terminated-by and lines-terminated-by. We can also direct the data to a particular place in HDFS using the target-dir option, to the Hive data warehouse using hive-import, and we can even create a table in HBase using the HBase options. When using Sqoop in practical scenarios, a lot of the time you'll see that incremental imports are needed: there will be a data source which keeps accumulating data, and we need to transfer only the newly accumulated data to a specific location in HDFS. There are two modes in which incremental imports can be done, and they are specified using the incremental argument. The first one is the append mode, and the second
mode is the lastmodified mode. Append mode is used when importing a table which has a column that is constantly incrementing with every row added — for example, a sale id, which might increase by one with every sale that happens in a store, each value representing a record. In that case, you specify the column containing the row id with the check-column argument, and Sqoop will import rows where the check column has a value greater than the one which is specified with last-value. An alternate strategy supported by Sqoop is the lastmodified mode. You should use this when rows of the source table may be updated, and each update sets the value of the last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with last-value are imported. Then there are the special options related to Hive imports, which control the various Hive settings: specifying the Hive installation, specifying the Hive warehouse, control over the delimiters, control over partitions, and so on. I'll attach a document along with this lesson with a table of these options; please go through it once. In the next lesson we will see a few commands in Sqoop to build a little more understanding about the import commands and Sqoop. 46. 045 Knowing Sqoop: Welcome to a new lesson on Sqoop. In this lesson we will see a few commands in Sqoop and understand their functionality. Let's start with the first basic command. Feel free to pause the video when a new command comes up, try to think of the function it performs, and then listen to the explanation after you have tried to decipher it in your mind — this would be a fun way to learn the commands. So here's the first command. OK, this is the simplest Sqoop command: in this we connect to the database corp and import the data from the table employees.
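As a minimal sketch of that first command (the host name in the connect string is a hypothetical placeholder):

```shell
# Simplest import: pull the whole employees table from the corp database.
# "dbserver" is a placeholder host name.
sqoop import --connect jdbc:mysql://dbserver/corp --table employees
```

With no mapper count, columns, or filters given, Sqoop falls back to its defaults and imports the complete table into the default HDFS location.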
Let's see the next command. In this command we are dropping the salary column and taking the rest of the columns; the column names are specified in double quotes and separated by commas. Let's see the next one. In this command we are just controlling the number of map tasks, increasing the number to eight using the -m option. The next couple of commands are new, and they need some explanation. First we see the command with the direct option. In this command we are importing data from a MySQL database. Sqoop imports data in two ways: the first is the default way, over JDBC, and the second way is the direct import. Direct imports exist only for a few vendors, which provide additional functionality for a quicker import. In this command we are importing data from MySQL, which has this functionality of direct import. So just remember that direct is for higher efficiency, and it is an option available only with a few database vendors. Then let's see the next command. In this command we have put in class-name, which is used to name the class definition of the record in the table; it is the by-product of the data transfer that happens, which, if you remember, we have seen in the previous lesson. Sqoop is able to create this class by utilizing the metadata information from the database: it just maps the database data types to the corresponding Java data types, and this creates the class. The as-sequencefile option ensures that the file will be imported in a binary-encoded SequenceFile format. Then let's look at the next command. Here we are using the fields-terminated-by and lines-terminated-by options to control the delimiters of the fields and the lines in the imported files. Let's go to the next command. Here we use the double-hyphen hive-import option to specify that the data should be transferred directly to the Hive warehouse. Let's see the next one.
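The variations discussed so far might be sketched like this; the host, database, and column names are hypothetical placeholders:

```shell
# Drop the salary column by listing only the columns we want:
sqoop import --connect jdbc:mysql://dbserver/corp --table employees \
             --columns "id,name,dept"

# Use eight map tasks instead of the default:
sqoop import --connect jdbc:mysql://dbserver/corp --table employees -m 8

# Direct mode (vendor-specific, e.g. MySQL) for a quicker import:
sqoop import --connect jdbc:mysql://dbserver/corp --table employees --direct

# Name the generated record class and store the data as a SequenceFile:
sqoop import --connect jdbc:mysql://dbserver/corp --table employees \
             --class-name Employee --as-sequencefile

# Control the field and line delimiters of the imported text:
sqoop import --connect jdbc:mysql://dbserver/corp --table employees \
             --fields-terminated-by '\t' --lines-terminated-by '\n'

# Send the imported data straight to the Hive warehouse:
sqoop import --connect jdbc:mysql://dbserver/corp --table employees --hive-import
```

Each command is independent; in practice you would combine the flags you need into one import.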
In this one we see the use of a where clause, with the condition that rows with the column value greater than 2010-01-01 should be picked up. Then look at one last command. In this we are filtering the data to be imported with a where clause, setting the target directory, and using the append option to append the filtered data to the destination directory. I have placed all these commands in a document for quick reference. I hope you can see that Sqoop is a very simple tool to import data, and the commands are very simple in form. Knowledge of these commands is necessary for the certification exams. 47. 046 Advanced Hadoop: Welcome to a new lesson. In this lesson I'll share a few tips and tricks with you. If you are a beginner or a little new to Linux, this will help you to work around Linux with a little more ease and make you work a little more like a professional than an amateur. For people with experience in Linux, this will be elementary. First of all, I'll start with copy and paste. On many occasions you will be required to copy and paste on the terminal, and for that you can use Ctrl+Insert and Shift+Insert. For example, I open the text editor and type in "This is a test", select it, and copy it using Ctrl+C; note that outside the terminal the normal Ctrl+C and Ctrl+V work as usual. Now I'll go to the terminal and paste it using Shift+Insert. I can copy something on the screen as well, using Ctrl+Insert, and paste it using Shift+Insert. The next tip or trick we discuss is using profile or bashrc. If you want to set up a variable globally, you can do it by setting it up in /etc/profile or /etc/bash.bashrc. Profile is the one which runs once per session on login, while bash.bashrc will pick up the fresh changes every time you close and restart the terminal.
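As a small sketch of the idea — shown here against a scratch file so nothing system-wide is touched; the variable name and path are hypothetical:

```shell
# Setting a variable "globally" means appending an export line to a
# profile file and then sourcing it. We use a scratch file instead of
# /etc/profile so this sketch is safe to run anywhere.
profile=/tmp/demo_profile
echo 'export HADOOP_HOME=/home/user/hadoop' >> "$profile"
. "$profile"            # source the file so the change takes effect now
echo "$HADOOP_HOME"     # prints /home/user/hadoop
```

On a real system you would append the same line to /etc/profile (login shells) or /etc/bash.bashrc (every new terminal), as the lesson describes.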
Because of how we set up the variables in profile, after adding a new one, ". /etc/profile" is the command to refresh the profile changes and make the newer changes effective. The next tip or trick is tab completion. You can sudo-edit /etc/bash.bashrc, and you will find a few commented lines for completion; uncomment them and tab completion will be activated. So now I do an ls; if I want to go into workspace, I just type cd w and then the Tab character, and I don't need to type anything else. The next tip is to clear the screen — I use this very often in my video lessons — it is just to press Ctrl+L and the screen is cleared. The next tip is to customize the command prompt. Normally I do not prefer doing it, but if you like, you can shorten the command prompt by typing export PS1 equal to a dollar sign and a space in quotes and pressing Enter; now the command prompt looks like this. If you want to make these changes permanent across logins, you copy this line into /etc/profile. You can even make your command prompt colorful and play around with it; you can check the internet for loads of ideas about it. The next tip is that you can have a command span across lines. For example, if you want to edit the profile and you are typing sudo gedit /etc/profile and you run out of space, you can put a backslash, press Enter, and continue with the command line. This will be a continuing lesson, and I will keep on adding tips and tricks to it. Meanwhile, if you come across some tip, please share it with everyone by typing it in the questions window; I'm sure there will be many good tips from you. See you in the next lesson. Welcome to a new lesson, HDFS shell commands. In this lesson we will learn about the HDFS commands. First, let us understand the terminology: the FS shell and URIs.
The Hadoop FS shell is nothing but an interface between the user and the Hadoop Distributed File System, that is, HDFS. So if you want to perform any action on HDFS, you have to use the Hadoop FS shell in order to do so. The Hadoop FS shell takes URIs, that is, Uniform Resource Identifiers, as input arguments. Uniform Resource Identifiers are paths of files in the following format: scheme, authority, and the actual path. The scheme can be of various types depending upon the file system it accesses: it can be hdfs for files on HDFS, file for the files on the local machine, ftp for a file system backed by an FTP server, or har, also known as Hadoop archive, which is a file system layered on top of HDFS, and so on. So in short, the Hadoop FS shell can access files from various file systems, and the scheme and authority have to be put in accordingly. We will look in depth at Hadoop archives later, but right now I want you to remember that there are Hadoop archive files, which are multiple Hadoop files put together and accessed in a special manner, like any archive or zip file — but they do not compress the files; what they exactly do will come later. One can imagine that the storage media of any node which has Hadoop installed has two worlds: one is the HDFS world and another is the local file system world. In the HDFS world, the scheme used is hdfs, and the authority is localhost in our case. Scheme and authority are optional parameters; if they are not given, the defaults are picked up, as mentioned in core-site.xml. Let us have a look at what we have set in our pseudo-distributed setup. Here we see that fs.default.name has been set to the hdfs scheme with localhost as the authority, so these would be the defaults. Then there is the path, which is the location of the file or directory. So the URI for a child file in a parent directory would look like the following:
hdfs://localhost/parent/child. In the local file system, the URI would look like file: followed by three forward slashes and the path. If you are familiar with UNIX commands, the HDFS commands will not be new to you. And in case you are new to UNIX commands, don't worry: there are only a handful, and I have attached a document with this lesson which will tell you everything about them, and you will be able to understand them pretty easily. Moreover, I have marked the commands with a star so that you can specifically remember at least those offhand, as they are the most commonly used. I'll just demonstrate a few HDFS commands next, especially the ones which are not present in UNIX or Linux systems. First of all, I do a jps. This command returns all the Java programs running. So here I see all the daemons running, and so I do not start any; if they had not been running, I would have started them with bin/start-all.sh. Also, an interesting thing here to notice is that all the datanodes, the job tracker, and the namenode are Java programs with the main classes as listed here. So the namenode is nothing but a Java program with the main class NameNode. Let me first do a listing, that is, list all the files that are present in HDFS. So what I do is type bin/hadoop fs -ls and press Enter. There are a couple of things to note. bin/hadoop fs will be at the start of every command we write; remember, the Hadoop FS shell is the interface we go through in order to perform command-line operations on HDFS. Also, an important and interesting thing to note is that when we list the files, it shows output similar to what we see with ls -l in Linux. Before recording this video, I had already created a few directories and files, so you see them in the listing.
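A sketch of that session — the commands are as narrated, while the fully qualified forms show where scheme and authority would go:

```shell
# Check which Hadoop daemons (Java programs) are running:
jps

# List the files in HDFS; scheme and authority fall back to the
# defaults from core-site.xml, so no full URI is needed:
bin/hadoop fs -ls

# The same listing with fully qualified URIs would be:
bin/hadoop fs -ls hdfs://localhost/
bin/hadoop fs -ls file:///        # the local file system instead
```

These commands assume a running pseudo-distributed cluster like the one set up earlier in the course.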
If you observe closely, you'll see that the d is for a directory and a hyphen signifies a file. The remaining characters are the access controls for the owner, then the group, and then the others: r is for read, w is for write, and x has no significance in HDFS — there is nothing that is executable in HDFS, so it is of no significance. The second column shows the replication factor, so this means that this file has been stored with one as the replication factor, as we have set the property dfs.replication to one in hdfs-site.xml. The third and the fourth columns show the owner and the group, and the fifth column shows the number of bytes the file occupies. The next columns show the modification date and time, and lastly it shows the path. Next, I'll remove the file with the command bin/hadoop fs -rm and the name of the file, and so the file gets deleted. You'll observe that we haven't explicitly written the complete URI, as the defaults of the hdfs scheme and the localhost authority have been taken up. Now let us try an ls on the local file system by typing bin/hadoop fs -ls file:/// with the three slashes; in this case it lists the complete files and directories in the root of the local system. Now let us look at what is in the home directory: it shows the user's directory, and looking at what is inside that, it lists the documents within it. Now let us create a file in the local file system and copy it to HDFS. I'll go to home and list the files which I've created. Now I'll create one more file — let me name it f1 — and I'm typing here "You all are rock stars". Now again I do an ls, and here we see that file at the end. Now let's type bin/hadoop fs -copyFromLocal, the home path of f1, and then in, which will be the destination for it in HDFS. Observe closely and you will see that we haven't specified the complete URIs.
Still, this works: the copyFromLocal command assumes that the last argument is an HDFS path and all the previous ones refer to the local file system, and hence this command works. And this is the difference between the copyFromLocal command and the put command, which are similar in all other respects; it's just that copyFromLocal implies that every argument except for the last one is a path from the local file system. So you can copy multiple files using this command. Now let us do an ls: we see our file. Let's print the file, and here you see the message we typed in, so the copy has worked perfectly. Now let us try to do the reverse of this: let us copy this file from HDFS to the local file system. For that we use bin/hadoop fs -copyToLocal in and a new file name, hfile. Now let us check if we have received the file from Hadoop: we see hfile, and Hadoop says that you all are rock stars. Please play around a little with the commands in the document; it will be fairly simple now. Just observe closely how and where to mention the URIs and everything will be simpler. See you in the next session. Welcome to a new lesson. In this lesson you'll learn how to compile and run a Hadoop program. We will be working on Ubuntu, which means toiling in our VM. First we download the Eclipse setup: do a Google search for "download eclipse" and click on the first link. Then we click on the link for Linux 64-bit, then the next link, and then we save the file. The download will take some time, so I'll fast-forward the video. Now the Eclipse setup has been downloaded. I'll just go to the Downloads section, copy it, and paste it in the home folder. Now I'll extract the Eclipse setup by right-clicking and clicking on Extract Here. Now we see the Eclipse folder in the home directory. Then I'll go inside and click on the Eclipse item, which launches the Eclipse IDE. Then we'll get this pop-up window asking for the location of the workspace.
I'll stick to the default and click OK. Then I'll go to File, New, and click on Java Project. I'll name my project HadoopExperiments and click on Finish. Now, I already have the source code in one folder — you can download it from the site — so I'll select these three Java programs, WordCount.java along with its mapper and reducer classes, and copy-paste them into the workspace, into the folder which we have created just now. I'll go to HadoopExperiments and then the src folder. Now in my Eclipse IDE I see the source folder; I'll just refresh it, and now under the default package I see all the Java source code which I copied. At this time you will see a lot of errors in these programs, as we haven't included the Hadoop packages in the build path. So to clear the errors, you just need to right-click on the project — HadoopExperiments in this case — then go to Properties, then go to Java Build Path, then to Libraries, then click on Add External JARs, then go to the Hadoop folder and select the hadoop-core jar, and click OK; then you'll see that the external jar hadoop-core has been included. Click OK and all your errors will go away. The next step is to create a jar file. Again we right-click on the project, go to the Export option, and under Java you'll see the JAR file option; select that and click Next, then browse through to the path. I'll put the jar file in the bin folder itself; you can, of course, select any path. Then I just type in the name wordcount, click OK, and then click Finish. Then let us look at the jar file. I'm right now in the bin folder itself, where I have created the jar file, so I'll just do an ls, and you see wordcount.jar. Next let's do an ls on the Hadoop file system: we see the in file I had created before this video. I'll just output the content of that file, and so here you see the content.
So, being in this folder where the jar file is, I'll run the jar by using the command bin/hadoop jar wordcount.jar WordCount in out, where out will be the output directory and in is the input file. You'll learn about all this later in the course. And the program runs, as you see on the screen. Now we'll do an ls on the Hadoop file system and see whether the out directory has been created or not. We see the out directory, and now let us just do an ls on out: we see all the associated files, and the file starting with part contains the output. Let us cat that file and print its contents, and so here we see the output. This course will cover every detail of how this complete process is done: what the objective of the program was, what the output is, how it has been processed, and how to increase its efficiency — it will all be covered in the course. So if you were able to run the program, that's great: you've completed the hard part of this course. Everything after this is going to be simpler, and my hearty congratulations on finishing the hard part of it. See you in the next class. Welcome to the new lesson, HDFS concepts. In this section we'll look in depth at HDFS. Let's start with the terminology used in HDFS. HDFS is a distributed file system; that means the files are stored across a cluster of computers and not just one. A cluster is nothing but multiple racks put together, and a single rack is nothing but a lot of computers put together, which are individually known as nodes in HDFS. The nodes which store data are known as datanodes; they act as worker or slave nodes. The namenode, which is the master node, is responsible for the management of the file data distributed across the cluster. Let us see a simulation of how a file is stored in HDFS. A file is broken up into smaller chunks, also known as blocks. These blocks are then replicated.
In this case they are replicated by a factor of three, which is the default replication factor of HDFS. These blocks are then distributed over the cluster, and this process of replication and distribution is managed by the namenode. The namenode keeps track of the complete file system and the block locations. If you notice, the distribution done by the namenode is smartly done so as to provide resilience if a failure happens. In this case, suppose one datanode fails: the namenode would still be able to put together the complete file with the help of the replicas. Suppose a complete rack fails: even then the namenode would be able to put the file together. We'll learn later what considerations the namenode takes into account to distribute the file blocks. Let us understand the ideas behind HDFS. HDFS is designed to handle large files of hundreds of GBs and TBs and more. Data access is not quick with random reads and writes; it is found that a data access pattern of write once and read many times is the best, and so it is for data analytics. HDFS is designed to use commodity hardware, but it is definitely not cheap hardware: each node would cost on the order of a few thousand dollars and would be available from many vendors. A typical installation of an RDBMS server can take up to a 50K expense on hardware itself, and it has an upper limit on processing. This as well means that hardware failures would not be a special case but the norm in HDFS: as the cluster size increases to thousands of nodes, hardware failures may happen every other day, or might happen every other hour. As we study the HDFS concepts, we will see that it is equally important to learn about the failure scenarios as it is to study the stable processing states. Next, let us look at what HDFS is not designed to do. It is not designed for quick reads of data; it cannot function as a database — for that we definitely need an RDBMS.
At least in the present scenario, that is. HDFS also doesn't work well with a lot of small files, and HDFS doesn't support arbitrary file modifications either: only append is supported. Let us understand the most important phenomenon of any file structure, that is, its blocks. The block size is the minimum amount of data that can be read or written in a file system, but block sizing in Hadoop is a little different. First, it is big: while the block size on an ordinary storage medium is measured in bytes or kilobytes, the default size is 64 MB in HDFS — many thousands of times larger. Second, if a file stored in HDFS is smaller than the size of its block, then only the amount of space that is actually needed is utilized, and not the complete block. There is a reason for the large block size. We discussed earlier how seek time becomes a bottleneck while processing large files, so the idea is to keep the seek time around one percent of the transfer time. Considering a 100 MBps transfer rate and 10 milliseconds of additional seek time overhead, the block size would have to be 64 MB or upwards to keep the seek time around one percent of the transfer time. In the next section we will learn in depth about the HDFS architecture. Welcome to a new lesson. In the previous lesson we studied HDFS blocks; in this lesson we'll deep dive into the HDFS architecture. HDFS works on a master-slave architecture. The namenode is the master node, and the datanodes are the worker nodes. That means the namenode is responsible for all the management of the storage space on the datanodes, and the datanodes do the actual groundwork of storing the data blocks. The namenode performs the function of keeping track of the complete file system by managing two things: first, the namespace image, and second, the edit log. The namespace is the metadata about the files and directories which are stored in HDFS.
It contains data about all the blocks, which files they are associated with, and on which datanodes they reside. The edit log is nothing but the log of activities performed on HDFS by the clients. The edit log just keeps on piling up and grows as the activity on HDFS keeps on happening. So out of the two, the edit log is the one which keeps growing at a faster pace. These two combined form the complete file system image, giving details of all the files and blocks in HDFS. The block information is updated by the namenode as and when datanodes join the network: that means as soon as a datanode boots up and connects to the network, it sends the namenode the information about the blocks it has, and the namenode updates the namespace image with that data. Both the edit log and the namespace are maintained in the main memory of the namenode; this helps the namenode to quickly look up the blocks as and when required. Now let us look at the case when the namenode fails. As you can guess, the complete file system would go down and would be unavailable, as the complete namespace image and data block information would be lost. For this reason, the namenode is also referred to as the single point of failure, or SPOF, of HDFS. That is why it is important for the namenode to be resilient to hardware failures, and it is highly advisable to spend more on the namenode's hardware. Still, even with upgraded hardware, failures can happen. To counter those situations, the following resilience addition is done: the namespace image and edit logs are transferred to a highly available remote NFS mount by the namenode from time to time. Additionally, a secondary namenode is also added. Do not confuse it to be like another namenode — this is considered to be one of the naming blunders in Hadoop. The secondary namenode doesn't function like the namenode. Its main and sole purpose is to combine the namespace image and the edit logs, so that the namenode's main
memory doesn't fill up because of the ever-increasing edit logs. The secondary namenode creates checkpoints of the namespace image and the edit logs merged together and writes them to a file. This helps the namenode release the main memory occupied by the edit logs up to the point of the last checkpoint, and this is the only purpose of keeping the secondary namenode. The secondary namenode is a Java program which just combines the edit logs and the namespace and creates a checkpoint — that's it. This operation of combining the edit logs and the namespace is itself complex and CPU- and memory-intensive, so the secondary namenode needs to be running on a good hardware configuration, as the job of combining the edit logs and the namespace requires good computing resources. At this point of time, I just want to remind you that the namenode and the secondary namenode are nothing but Java programs that run with the main classes NameNode and SecondaryNameNode. So, in case of a failure of the namenode, the Hadoop administrator needs to boot up a new namenode. This is the case up to a point: later releases of Hadoop, from the 0.23 release, and CDH4 have high availability features available in them, and in those cases this situation is a little improved; we will look into them later in the course. So in the previous releases, before Hadoop 0.23 and in the case of CDH3, on a failure of the namenode the administrator would have to bring up another machine as the namenode. But this machine had to be of good configuration, as the namenode's system requirements are that high. So, in that case, most often in a small cluster, the machine that ran the secondary namenode is reconfigured as the new namenode. Again, please do not confuse this: it is not the secondary namenode's function to take over as the primary namenode. It is just that the machine which ran the secondary namenode is most often the best choice for the new namenode in case of failure.
So, in case of failure, the latest information from the NFS mount is retrieved manually by the administrator onto the machine which will take over as the new namenode, and the machine is then reconfigured as the namenode. This process can take around 30 minutes to return to the stable state. Next, let us look at the guidelines for the namenode's main memory. As the cluster size increases, the number of storage blocks that the namenode has to take care of also increases, and every block in the storage pool consumes some amount of the namenode's main memory. So it is important for the namenode to have enough main memory so that it can properly manage the pool of data blocks. As a rule of thumb, 1000 MB per million storage blocks is recommended. Let us take an example of a 100-node cluster with 4 TB of disk each, and let the block size be 64 MB. Then the number of storage blocks would come out to be around two million; that means the namenode should have around 2 GB of main memory. On the next slide are the few key points from the last two lessons; please pause if you would like more time to read. Welcome to a new lesson. In this lesson we will look behind the scenes at what happens when you read or write to HDFS. Let us first deep dive into the HDFS write process. The HDFS client is a JVM that runs on the node which interacts with HDFS. Note that dfs.replication is the property which contains the replication factor of the blocks. This property can be customized in any setup; in our pseudo-distributed mode of deployment of HDFS it is overridden and set to one in the configuration file hdfs-site.xml, but its default value is three. So, as a first step, the client communicates to the namenode that it wants to write into HDFS. At this point the namenode performs various checks on the request, like whether the file already exists or not, or whether the client has the right permission levels or not to perform the activity.
If all is fine, the namenode returns to the HDFS client a list of the nodes the blocks are to be copied on. At this point the client connects to the first datanode and asks it to form a pipeline to the subsequent datanodes. The datanodes acknowledge as they successfully copy the blocks. Steps three, four, and five are repeated until the whole file gets written to HDFS. After that, the client ends with a completion message. In case of a datanode failure, the erroneous node is skipped and the blocks are written on the remaining nodes; the namenode observes the under-replication and arranges for the replication of the under-replicated blocks. The same happens when there are multiple node failures: the data needs to be written to at least one node, and the under-replicated blocks will be taken care of by the namenode. Now let us look at how the datanodes are selected by the namenode. If the client node is itself part of the cluster, the namenode considers it to be the first node where the replication should happen. If it is not part of the cluster, any node within the cluster is chosen, keeping in mind that the node is not too busy or loaded. The second node is chosen off the rack on which the first one was chosen. The third one is chosen to be on the same rack as the second one. This forms the pipeline. Now let us look again at the simulation which we have seen in the earlier lesson: the file is broken into blocks, then replicated, and then distributed across the file system. Now, if you observe, if one of the nodes or even a rack fails, all the blocks of the file are still available. Failure of multiple racks is the most serious case, and less probable to happen. Also, it is to be noted that this whole process of selection and replication happens behind the curtain; the developer or client doesn't need to worry about what happens in the background. Before we look at how the read happens, let us look at how distance is calculated in HDFS.
In a distributed network, bandwidth is a scarce commodity. Hence, the idea of distance is based on bandwidth. A block to be read on the same DataNode is said to have a distance of zero. If the block resides on a different DataNode but on the same rack, the distance would be counted as 2. If the block resides on a node on a different rack, the distance is considered to be 4. And lastly, if a block resides on a node in a different data center, the distance is taken to be 6. These are the only possible cases.

Now let us look at the anatomy of a read. The HDFS client sends a request to the NameNode. In response, the NameNode returns the DataNodes containing the first few blocks. The NameNode returns the list starting from the closest node containing that block to the furthest, so the client would connect to the first node and read the blocks one by one.

Let us again look at the failure cases that can happen during a read. There can be two failures. First, the read block is corrupt. In that case, the next DataNode containing the block is contacted. Second, if the DataNode itself fails, say DataNode 7 fails while the block B1 was being read, then the next node in the list would be contacted. In this case, the client would make a note that DataNode 7 is a bad DataNode and would not consider it later if it appears in another list. Please go through the key points for this lesson.

Welcome to a new lesson on HDFS concepts. In this lesson, we would look at the new features added in the Hadoop 0.23 release, that is, HDFS Federation and High Availability. Let us start with HDFS Federation. This feature is added in order to balance the load on the NameNode as the cluster size increases. Let us understand this with an example. Let us say there is a directory tree structure rooted at root, and under it are two folders, folder1 and folder2, and let us assume that there are a lot of files under them.
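The distance rule above can be written as a small function. This is a minimal sketch, assuming each node's location is written as a "/datacenter/rack/node" path (the tree model the lesson describes):

```python
# A small sketch of HDFS network distance, assuming each node's location is
# written as "/datacenter/rack/node" (the tree model described in the lesson).

def distance(a, b):
    """0 = same node, 2 = same rack, 4 = same data center, 6 = different DC."""
    dc_a, rack_a, node_a = a.strip("/").split("/")
    dc_b, rack_b, node_b = b.strip("/").split("/")
    if dc_a != dc_b:
        return 6
    if rack_a != rack_b:
        return 4
    if node_a != node_b:
        return 2
    return 0

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: block on the same DataNode
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: different node, same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: different rack, same data center
print(distance("/d1/r1/n1", "/d2/r4/n7"))  # 6: different data center
```

The NameNode uses exactly this kind of ordering when it sorts the DataNode list it returns to a reading client, closest first.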
As the cluster size increases, the NameNode has to store more information pertaining to blocks in its main memory. So for a cluster with a high number of nodes, in the range of 2000, the NameNode's memory becomes a limiting factor for scaling. Under Federation, a new NameNode can be added, and the file directory structure and the block pool can be divided between the NameNodes. Thus each NameNode has to manage only the pool of blocks it is associated with, and not the complete pool, thereby reducing the load on a NameNode. It is to be observed that the same DataNode can be associated with different NameNodes at the same time, and failure of one NameNode would not affect the other NameNode. For example, if NameNode 2 goes down, the files in folder1 would still be accessible.

Let us just look at the key points we have discussed. HDFS Federation addresses the limitation that the NameNode's memory places on scalability. Each NameNode would be responsible for a namespace volume and a block pool. DataNodes can be associated with many different NameNodes. NameNodes don't communicate with each other, and failure of one would not affect the other.

Let us look at the next feature, High Availability. This feature is to address the time taken to come back to the stable state in case of NameNode failure. As we have already seen, the NameNode is a single point of failure, and it takes around 30 minutes to come back to the stable state after its failure. So to address this, a new node is always running on standby. The primary NameNode and the standby NameNode share the namespace edit logs via a highly available NFS storage mount. In future releases, ZooKeeper will be used to transition from the primary to the standby one. In this setup, the DataNodes are configured to send reports to both the NameNodes. In this case, if the primary NameNode fails, the standby can take over very quickly.
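The way Federation partitions the namespace can be illustrated with a toy mount table. The folder and NameNode names here are the made-up ones from the lesson's example, not a real configuration:

```python
# Toy illustration of how Federation splits the namespace. The layout assumed
# here is the lesson's example: two NameNodes, each owning one top-level folder.

mount_table = {
    "/folder1": "namenode1",   # NN1 owns folder1's namespace volume + block pool
    "/folder2": "namenode2",   # NN2 owns folder2's
}

def owning_namenode(path):
    """Route a path to the NameNode responsible for its namespace volume."""
    for prefix, nn in mount_table.items():
        if path == prefix or path.startswith(prefix + "/"):
            return nn
    raise KeyError(f"no NameNode mounted for {path}")

live_namenodes = {"namenode1"}          # simulate namenode2 going down

def is_accessible(path):
    return owning_namenode(path) in live_namenodes

print(is_accessible("/folder1/a.txt"))  # True:  folder1 is still served by NN1
print(is_accessible("/folder2/b.txt"))  # False: only folder2 is affected
```

This is the key point of the example: losing one NameNode takes out only its own namespace volume, not the whole file system.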
In practice, it takes around a few minutes for this failover transition to happen. In this setup, it is important for the standby to wait to confirm that the primary has gone down. There can be a situation where the primary might not have been completely down, but just a little slow to respond. In that case, there can be two active NameNodes, and this causes corruption and chaos. So to avoid such a scenario, the standby node fences the primary node when it takes over. Fencing means that the standby would kill the NameNode process, revoke shared storage access, and disable the network port of the previously active NameNode. In certain situations, it goes to the extent that it cuts off the previously active NameNode from the power supply itself. This is often called STONITH: shoot the other node in the head. As you can imagine, given what the standby does to the previously active NameNode, the name is quite apt. This wraps up our discussion of High Availability. For a quick revision of the key points on the slide, please pause the video.

Hello and welcome to the lesson. Here we would discuss some of the special HDFS commands which we haven't discussed so far in the course. First we look at har, also known as Hadoop Archives. As we have already discussed, lots of small files is not a good case for HDFS, mainly because it eats up the NameNode's main memory. Although, it is to be understood that the small files do not actually take up the complete block size on the disk; that is, if a file is 1 MB and the block size is 64 MB, then the file would just occupy 1 MB of the storage space. So the issue with small files is that they occupy the NameNode's main memory, as the NameNode has to maintain metadata for each file. The more the number of files, the more the metadata which the NameNode has to take care of. So the NameNode's main memory becomes a limiting factor.
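Some rough arithmetic makes the small-files problem concrete. The 150-bytes-per-object figure used here is a commonly cited ballpark for NameNode metadata, not an exact Hadoop constant, so treat the numbers as orders of magnitude:

```python
# Rough arithmetic for the small-files problem. Assumed figure: each file or
# block object costs on the order of 150 bytes of NameNode main memory
# (a commonly cited ballpark, not an exact Hadoop constant).

OBJECT_BYTES = 150

def namenode_heap_mb(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)   # one entry per file + per block
    return objects * OBJECT_BYTES / (1024 * 1024)

# Ten million 1 MB files vs. the same bytes packed into 64x fewer large files:
print(round(namenode_heap_mb(10_000_000)))        # ~2861 MB of NameNode memory
print(round(namenode_heap_mb(10_000_000 // 64)))  # ~45 MB for the same data
```

Same data on disk, roughly 64 times less NameNode memory, which is exactly the pressure Hadoop Archives are meant to relieve.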
Hadoop Archive is a tool which helps in such situations. In addition to this, Hadoop archive files can be used as input to MapReduce programs as well. Let us see an example of a Hadoop archive and understand how it works. Just before recording this video, I've created this smallfiles folder on my local system in the home folder. In this, I've created two documents, file A and file B. I'll just do a jps to check whether everything is running or not. Everything is running, so I'll just copy this file structure to HDFS using the command copyFromLocal. Now I do an ls to see if the files have been created. So there we see the directory. Now we archive this file structure. So the command is hadoop archive -archiveName, and at this point I'll hit enter, and there we get the usage of this command. So the syntax says that the command is archive -archiveName, followed by the name of the har file, followed by -p, followed by the parent path, followed by the source, and then the destination. So I type in hadoop archive -archiveName and name the Hadoop archive file file1.har. Please note that here we need .har as the extension, which indicates Hadoop archive files. These are handled differently; they are read and written in a different manner, as we'll see, and to differentiate them we use the .har extension. The -p and the parent path would be /user/hduser/, then followed by the name of the directory structure that needs to be archived, then followed by the destination path, which would be /user/hduser/. I press enter at this point, and a MapReduce program would be invoked. I'll again do an ls on the Hadoop file system and see if the Hadoop archive file has been created or not. So there is the Hadoop archive file. I do an ls on the Hadoop archive file. So as you can see, there are four files that have been created for the Hadoop archive file.
The first file is _SUCCESS, which marks the successful completion of the archive command. The part file is part-0, which has the contents of all the files concatenated together. The two index files, _index and _masterindex, contain the indexes used to look up the content. Next, I'm doing a recursive ls on our Hadoop archive file. In order to do so, we'll put the har:// scheme so as to specify that a Hadoop archive is being read. So it displays file A and file B. The tilde-sign files are the temporary files that were made when we copied the small-files directory structure from the local file system. They were created because we had opened the files in gedit.

Next, we understand the limitations of Hadoop archive files. First, to create an archive file, you need as much disk space as the original. Hadoop archives currently do not support compression, so it is like a duplicate of the files. Second, Hadoop archives are immutable. To add or remove files from a Hadoop archive, you must recreate the archive. Third, if you're reaching the limits of the NameNode's memory, using HDFS Federation would give you better scope for scalability than using Hadoop archives.

Next, we look at another command, distcp. This command is used to copy files from one file system to another. The copying process is done in a parallel fashion. The syntax of distcp is as follows: hadoop distcp, followed by the source folder, and after that the destination. Here, namenode1 and namenode2 would specify the NameNodes of the different HDFS deployments. This command would usually be used when you're using HDFS Federation on your cluster and have two or more NameNodes on the same cluster, and you want to copy from one HDFS to another. I'll close this lesson at this point, and see you in the next lesson.

Welcome to a new lesson. From this section, we would look at the most important and core topic, MapReduce. Let us start with looking at the terminologies.
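The part-0 plus index layout described above can be modelled in a few lines. This is only an illustration of the idea, assuming made-up file names; it is not the real har on-disk format:

```python
# A toy model of how a Hadoop archive lays data out, as described above:
# one big part file with everything concatenated, plus an index of offsets.
# This illustrates the idea only; it is not the real har on-disk format.

def make_archive(files):
    """files: {name: bytes}. Returns (part0_bytes, index of (offset, length))."""
    part0, index, offset = b"", {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))   # where this file lives inside part-0
        part0 += data
        offset += len(data)
    return part0, index

def read_file(part0, index, name):
    """Look up one file via the index instead of a per-file NameNode entry."""
    offset, length = index[name]
    return part0[offset:offset + length]

part0, index = make_archive({"fileA": b"alpha", "fileB": b"bravo"})
print(read_file(part0, index, "fileB"))  # b'bravo'
```

This is why an archive relieves the NameNode: it now tracks one part file and its index instead of metadata for every small file, while each original file remains individually readable.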
Let us look at the terminologies that are used in MapReduce. First is the split. A split is nothing but a fixed chunk of data that serves as input to a map task. I want you to remember that blocks and splits are two different concepts. Keep in mind that blocks are HDFS's responsibility and belong to the HDFS world, while splits belong to the MapReduce world. To note: a block is the smallest unit of data stored in HDFS, and a split is the data that is input to a map job. The map process takes the split and produces an output. In the diagram, I have shown that the map output is smaller than the map input. This is the general and the good case, but I do not want to paint it as if it were a restriction. It can be equal to or even larger than the input size as well, but that is not a good case. It would be advantageous if it is as small as possible; we'll see later why, when we look at the shuffling of map outputs.

The problem is divided into two portions: first is the map part, and second is the reduce part. In the first phase, all the map jobs run in parallel and produce output. All the results are stored and merged together into a file. It serves as input to the reducer. The reduce job takes this as input and produces the result. The whole flow of execution is controlled by two kinds of nodes: the JobTracker and the TaskTrackers. You can draw a parallel between the NameNode in HDFS and the JobTracker in MapReduce. Similar to the DataNodes, there will be TaskTrackers in the MapReduce world. Just as the DataNodes hold the actual data, the TaskTrackers run on each data node, facilitating the running of map and reduce tasks. The JobTracker's job is to manage the scheduling of tasks on the TaskTrackers, and the TaskTrackers' duty is to run the map and reduce tasks and send progress to the JobTracker. Okay, I want you to imagine the JobTracker and TaskTrackers as Java processes that are running on the machines; they are not the hardware.
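The two phases can be sketched with a tiny in-memory word count. This is a teaching sketch in plain Python, not an actual Hadoop MapReduce job; the splits and words are made up for illustration:

```python
# A tiny in-memory word count, sketching the two phases described above:
# map tasks run over each split, and their merged, sorted output feeds the
# reducer. A teaching sketch only, not an actual Hadoop MapReduce job.

def map_phase(split):
    """Map task: emit a (word, 1) pair for every word in its split."""
    return [(word, 1) for word in split.split()]

def reduce_phase(pairs):
    """Reduce task: sum the counts for each word."""
    counts = {}
    for word, c in sorted(pairs):   # sorting groups identical keys together
        counts[word] = counts.get(word, 0) + c
    return counts

splits = ["big data big value", "big data"]   # two input splits
map_outputs = []
for split in splits:                          # maps run independently, in parallel
    map_outputs.extend(map_phase(split))
print(reduce_phase(map_outputs))              # {'big': 3, 'data': 2, 'value': 1}
```

Note how each map task only sees its own split; it is the merge-and-sort step between the phases that brings all counts for a word to one place, which is why small map outputs are advantageous.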
One more parallel between HDFS and the MapReduce world we can draw is that, just as the NameNode's failure is the most serious one in HDFS, likewise in the MapReduce world the JobTracker's failure is the most serious one, as all the jobs in progress and the TaskTrackers' statuses would be lost. That is why it is wise to spend more on the hardware that runs the JobTracker node. Let us run through this again and try to understand the phases in a little more detail. MapReduce has to break the problem into two portions: first is the map phase, and second is the reduce phase. The map jobs would work on the split of data which is located on the node itself. This principle is known as data locality. It is important that map jobs get input splits which are local. If they are not local, they would need to be fetched over the network, and so latency would be added, network I/O load would increase, and the performance would be degraded. Hence, the optimal value of the split size is equal to the block size, as one complet