Learn Hadoop and Big Data | Rodney Rossi | Skillshare

Learn Hadoop and Big Data

Rodney Rossi

9 Lessons (9h 48m)
    • 1. 01 Introduction To Course

    • 2. 2 Add Value to Existing Data with Mapreduce

    • 3. 3 Hadoop Analytics and NoSQL

    • 4. 4 Kafka Streaming with Yarn and Zookeeper

    • 5. 5 Real Time Stream processing with Apache Kafka and Apache Storm

    • 6. 6 Big Data Applications

    • 7. 7 Log collection and analytics with the Hadoop Distributed

    • 8. 8 Data Science with Hadoop Predictive Analytics

    • 9. 9 Visual Analytics and Big Data Analytics for e commerce


About This Class

Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers. It's at the center of an ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning.

Hadoop systems can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.

Prerequisite knowledge required:

  • Some activities will require some prior programming experience
  • A basic familiarity with the Linux command line will be very helpful.
  • You will need access to a PC running 64-bit Windows, MacOS, or Linux with an Internet connection

This class covers the following topics:
1. Introduction
2. Add Value to Existing Data with Mapreduce
3. Hadoop Analytics and NoSQL
4. Kafka Streaming with Yarn and Zookeeper
5. Real Time Stream processing with Apache Kafka and Apache Storm
6. Big Data Applications
7. Log collection and analytics with the Hadoop Distributed
8. Data Science with Hadoop Predictive Analytics
9. Visual Analytics and Big Data Analytics for e-commerce


1. 01 Introduction To Course: This is the syllabus for the Hadoop course. You can find the syllabus in the top-level folder of the documentation package. It starts with the key driver for Hadoop's success, the ability to add value to existing data, which is our first topic. We then move on to see how Hadoop has grown into a wide ecosystem and how Hadoop adapts and integrates with other important data technologies such as NoSQL and streaming. We move on to see how Hadoop and big data applications are polyglot applications; that is, we need to be comfortable programming in a range of languages. The premier language of data science is Python, so we have focused to a great degree on integrating Python with Java and Hadoop to perform data science and predictive analytics with tools such as machine learning. This is the technology grid that you can find in the top level of the documentation, and it outlines the tools and technologies we use in the course. It is important to be aware of YARN and ZooKeeper for configuration purposes, and they work seamlessly behind the scenes in all the topics. MapReduce is still a very important algorithm, and we show in the first topic how you can use just pure MapReduce to do good things with algorithms, and clustering in particular. Then, in the following topics, we move on through the Hadoop ecosystem, focusing first on how you can get Hadoop and data in HDFS to work with other important technologies, whether they be streaming with Kafka or data analytics. We introduce the Python APIs for Hadoop, such as Pydoop and PySpark, and show how we can use them to directly interface Python with data in HDFS. Then we can use Python's mature and robust machine learning algorithms found in its package scikit-learn, and associated tools like Patsy and pandas for data matrices, and matplotlib for visualizations.
We look at the Mongo Hadoop bridge, which is a connector from HDFS to MongoDB for NoSQL integration with Hadoop, and we look at HDFS access and, in particular, the tools for mapping relational schemas to and from HDFS: Apache Hive, Apache Flume and Apache Sqoop. Finally, we look at the very important evolution of big data into the public cloud with Amazon Web Services. Let's move on to the first topic.

2. 2 Add Value to Existing Data with Mapreduce: Now we're going to build and run the code. The first thing to check is that, for this test, we're using a small version of the input data. We could build and run the code inside our IDE, in this case IntelliJ, or it could be Eclipse, but to make it more cross-IDE friendly we will just run it from the command line. In the terminal, inside the top-level source folder at the same level as the pom.xml, we issue the Maven build command, mvn clean package. If we scroll down here, we can see that the directory where the uber jar will end up is empty, so we're going to build a new uber jar to run the code. Let's run that build configuration. Now it's running mvn clean package, and it's important to remember that I've set the source code output level to 7, because when it runs in the Hadoop cluster it's going to run as Java 7. Here we can see we have build success. We look in the directory for the uber jar and we can see we have created it. What we need to do now is run the uber jar, and it should find the input files that it wants to use in the directory that it lives in. Let's do that now. We've used the user directory system property (user.dir), so the absolute location seen from the code is the directory the jar exists in. The code looks inside this directory for its input file. We're using a small input file, which has more duplicate key values because we want to check for them.
We want to confirm that we are picking up those duplicate key values when we run the code. From the terminal, we change to the directory that the jar is residing in, and we run it simply with java -jar and the name of the jar, uber.jar. We run the code and it quickly passes through the files. Now, if we check the same directory that the jar was in, we see its output directory with a timestamp, and we can open up the output. If we make it bigger, we can see what's happening. We're looking for duplicates. Here are our first duplicate values: we found some for this company. As we move on down, we should find a lot of duplicates, because I optimized this particular input data set to have many duplicates. So it's finding all the duplicates.
What we'll do now is rebuild the code, changing to the big version of the input data. As you can see, we have not deleted the uber jar, because when we build again it will copy over and overwrite the existing uber jar. If you're worried that it might not be doing that, you can always delete the jar first. We've now changed the name of our input data file to be the 10,000 line data file, so we rebuild the code and the jar will reflect the latest value. During this section of the build, it's downloading everything that goes inside the uber jar. Now we change into the terminal directory where the uber jar exists and run the uber jar again with the big data set, expecting it to take a bit longer this time. We notice it seems to take about the same amount of time. Whether we run the 200 line data set or the 10,000 line data set, a lot of the time we're seeing here is just initialization of our pseudo cluster. You can see it could easily handle very big data sets. So Hadoop is a great framework.
It's a great way to do this sort of clustering code. Now we look for the output that has the latest timestamp. Here we are: 922 is greater than 459, so this will be our latest result set. If we go through it, we can see that for the 10,000 line data set there are not many duplicates. When I set up my small data set, I deliberately chose sections of the data that had a lot of duplicates, because I wanted to check that I could find the duplicates easily. This larger 10,000 line data set is more faithful to the original database where this data came from, so there are far fewer duplicates. But if we go down, we will find some duplicates eventually. Here we are: we found three duplicates for AXA Advisors Limited Corporation.
So now we have successfully built and run the first phase of our code: MapReduce code that sets a key and then, using that key value, looks for duplicates via the shuffle-sort step. For each unique key, it aggregates the values for that key. We get this almost for nothing in MapReduce. If we were to write equivalent Java code that mined through all 10,000 lines, created a list of the keys, and then recursively mined through the data set again to find duplicate values for those keys, it would take much longer and be much, much more code than what we have produced here. So this is a very good way to use MapReduce to find duplicates and also to find initial seeds or centers for our later clustering algorithm.
Now we will go through the steps to run the code in Hadoop, highlighting the important steps as we go along. We need to boot up the Hortonworks virtual machine, which shows you the credentials on its landing page. Then, in a terminal, we can access the machine with this URL, and the password is hadoop. Now we are logged in.
If we look at what we have in the root directory, because we land in the root directory, we can see we have already copied over uber.jar and the data set. So we have the 10,000 line data set and we have our uber jar. We can confirm this by looking in an SSH file browser. Here I've also connected with an SSH file browser, so I can easily copy in my uber jar and my data set. They're all set up and ready to go. That's the first step.
The next step is to log in on the virtual machine as the hdfs user. We'll do that now. We're now logged in as the hdfs user, and we have all the correct permissions that we need to access the HDFS file system. The next step is to create a root directory. We need to run this command to create a root directory, and I have previously run this command, so we'll have a look at what's in our root directory. Here I am listing the root directory, and it should be empty. We're just waiting for HDFS to report back. I actually have an output directory left over from a previous run where I was checking that the code would work, and I want to get rid of it. This is a command that will get rid of that output directory for me, but its actual timestamp is a little bit different, so I make a copy of the command and paste it in here. Now I've got the correct timestamp, and I run this command to get rid of that output directory in the root directory. So I previously created a root directory off the top-level directory, and now I've run that command. Let's check what we've got remaining in our directory. I'm listing my root directory again, and I'm hoping that it will be empty. Yes, my root directory is empty, so I can start with a clean slate. My next step is to use HDFS to copy in the files that I need. A good place to get started with HDFS file system commands is this link here.
You can find many examples of HDFS transfer commands here. The first thing is to give permissions to the root directory. I need to give it read and write permissions before I can copy or run anything in my newly created root directory, so I need to make sure it has all permissions for read, write and execute. This is the command that does that. I execute this command, and all these commands are in the documentation here. The next step is to copy over the files that we need. I need to copy over the data file, and the syntax is put, the name of the file, then the destination the file will go to. Now, with everything in place, I'm ready to run the jar with hadoop jar and the name of the jar, which is uber.jar. We run that and hope that everything works out. It needs to be able to find the files. Recall that we set up the user directory system property as the root directory: all the paths in our application are relative to the location of the jar. You can see it's running the job. If it gets this far, that generally means it is working out okay. Now it's starting on the MapReduce phase. You can see map is at 0%. Now map is at 100%, and it's into the reduce phase. Reduce seems to be a little bit more work, so it's taking a bit longer. That's where we're aggregating the keys.
Did the job work? We look for errors, and we don't see any, so everything looks good. The Hadoop job ran. What we need to do now is get the output out into a place where it's useful. Where you will generally find errors with Hadoop code is in paths: resolving the path from where your code is running to where the other artifacts are, like the input files and output files that the code needs. If you get the paths right, Java is a fairly easy language to work in. Once you get the paths and the cluster set up correctly, everything is very solid and robust in Hadoop.
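The path handling described above, where everything is resolved relative to the directory the jar is launched from, can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class name JobPaths, the file name input.csv and the "output-" directory prefix are hypothetical and are not taken from the course code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: resolves job files relative to the launch directory. */
final class JobPaths {
    private JobPaths() {}

    /** The working directory, as reported by the standard "user.dir" system property. */
    static Path baseDir() {
        return Paths.get(System.getProperty("user.dir")).toAbsolutePath();
    }

    /** Resolves an input file name against the working directory. */
    static Path input(String fileName) {
        return baseDir().resolve(fileName);
    }

    /** Builds a timestamped output directory; the "output-" prefix is an assumption. */
    static Path timestampedOutput(long timestampMillis) {
        return baseDir().resolve("output-" + timestampMillis);
    }

    public static void main(String[] args) {
        // Print the resolved locations so a wrong path is visible before the job runs.
        System.out.println(input("input.csv"));
        System.out.println(timestampedOutput(System.currentTimeMillis()));
    }
}
```

Printing the resolved paths before launching a job is a cheap way to catch the path mistakes that the lesson identifies as the most common source of errors.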
Now we need to get the data out of HDFS. In the same way we put the data into HDFS, we can use the get command to get the data from HDFS. We know it's going to be in an output directory that looks something like this. It will have a timestamp, so we can just use the wildcard to grab output* and it will match the single output directory, whatever its timestamp. The destination for the get is our root directory. Let's try that. Now let's have a look in our file browser. Do we have an output directory there? We'll just refresh and change into the output directory. Now we're in the root directory of the virtual machine and you can see we have the output. This is where the SSH file browser makes life very easy. If we look in the output directory, it contains the output data, and we can see it's the same as when we ran it locally as a test: not that many matches. We just need to go down further. Here are our first duplicates. So we ran it in Hadoop on 10,000 records and the code worked. What we need to do now is upgrade our code so that it is useful for clustering: we need to output these duplicates and use them as seed clusters for further clustering.
Let's summarise the key steps involved in running the code in the Hadoop cluster. Firstly, we need to boot up the Hortonworks virtual machine and then log in using this URL. That's our username, and this is the URL for an SSH session in the terminal. This is the password. Then we need to work with HDFS, the Hadoop file system. Here's a link with lots of useful information about getting started with HDFS. The first thing we need to do is log in as the hdfs user in the virtual machine so that we have the HDFS permissions.
We need to resolve the paths correctly with respect to the local file system. We are running the jar from Hadoop, so our jar is running on the local file system in the VM. We also need to resolve the HDFS URLs correctly, which means we need to create a root directory. So in the Hadoop file system we create a root directory, and we give it all the permissions it needs, read and write. We need to put the data into that directory. In the local file system, in the root directory, we issue this command, which runs the uber jar in the Hadoop cluster. Once it's finished, we need to get the output, so we use the Hadoop get command. Finally, when we want to clean up, this is the command we can run to delete the output directories in the Hadoop file system. With the wildcard, it will pick up the output directories and delete them. That's a summary of the critical steps, and to repeat: generally, when you have problems, they are going to be with paths. Resolving the paths correctly in Hadoop can be a challenge. What we'll do now is move on to the next part and look at creating the add-on clustering code.
Now that we've found the duplicates, let's see how we can modify our MapReduce code so it outputs the duplicates, and we can use those duplicates as seeds for a clustering algorithm to cluster the data. We're going to extend our MapReduce code so that we output the duplicate values from the MapReduce side. We start by renaming our entry-point class for the MapReduce code to this name, the entity analysis MR job: a MapReduce job for entity analysis. The only difference is that we now have a new reducer class. We have created a new entity reducer cluster seeds class, and the job of this entity reducer is to output the duplicates to use as cluster seeds. Let's have a look at how that works. This is our new reducer class.
The only difference now is that we are not writing all the data; we are just writing when we find a duplicate. Let's have a look at the data to remember how things were working. Recall that the name we're using is the key. What I'm highlighting here are the entities. It's the name of the company that we are defining as the entity, and these address fields we're calling the value. So in our key-value pairs, the name of the company is the key and everything else is the value. When we find we have duplicate values for any one key, we will use these as the seeds for our clustering, so we only write out when we have these duplicate values. We look at the values in the iterator, and if there's more than one value, we write out to the output.
The next change we have made is that when the MapReduce phase has finished, it responds with zero if everything is correct. Then, if we are running on the cluster and not in testing mode, we migrate the data out of the cluster file system, HDFS, to the local file system for analysis. Here is how that works: we create some paths, including the mapped output path, and then we use the file system object we created to copy the data that we have MapReduced to the local file system, out to the output directory where we want to find it. Later, when we have our analysis code, we will run it off this MapReduce output data.
There we have some static final constant variables, which we'll quickly explain. Firstly, run on cluster: we set this to true so that we have the correct paths for the cluster, as opposed to just testing in the local file system. Then, in the output data, we have these flags defining when a block of data is the start of a cluster and when we're at the end of that cluster block. It's best to see in the output how they work. Then we have paths for our raw data input and data output.
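The reducer behaviour described above (write nothing for a key with a single value, and wrap the values of a duplicated key between a start flag and an end flag) can be sketched without any Hadoop dependency. This is a plain-Java stand-in, not the course's reducer: the class name, the flag strings and the tab-separated output format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for the duplicate-only reducer; not the Hadoop Reducer API. */
final class ClusterSeedReducerSketch {
    // Hypothetical flag text; the course's actual flag values are not shown in the transcript.
    static final String START_CLUSTER = "START_CLUSTER";
    static final String END_CLUSTER = "END_CLUSTER";

    private ClusterSeedReducerSketch() {}

    /** Returns the lines the reducer would write for one key and its grouped values. */
    static List<String> reduce(String key, Iterable<String> values) {
        // Collect the values first, because an iterator can only be walked once.
        List<String> collected = new ArrayList<>();
        for (String value : values) {
            collected.add(value);
        }
        List<String> out = new ArrayList<>();
        // Only keys with more than one value are treated as duplicates (cluster seeds).
        if (collected.size() > 1) {
            out.add(START_CLUSTER);
            for (String value : collected) {
                out.add(key + "\t" + value);
            }
            out.add(END_CLUSTER);
        }
        return out;
    }

    public static void main(String[] args) {
        // A key with two address values is written as one flagged block.
        for (String line : reduce("Example Bank", Arrays.asList("1 Main St", "1 Main Street"))) {
            System.out.println(line);
        }
        // A key with a single value produces no output at all.
        System.out.println(reduce("Other Bank", Arrays.asList("9 High Rd")).size());
    }
}
```

In a real Hadoop reducer the same decision would sit inside the reduce method, with the grouped values supplied by the shuffle-sort step.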
Before we demonstrate the new MapReduce code and its output, we need to understand how we will build for the different functions of the code. We now have a MapReduce function, but we will also have a data analysis function. So we've added these new dependencies for our data analysis: Apache Crunch and a Scala string metric library that we can use to create our own clustering algorithms. We need to build for the two functions, so we created these profiles. We have a MapReduce profile, which builds the source code to run as a MapReduce job; the main class in this profile is the MapReduce job. Then we have another profile, our analysis profile, which we're calling ETL. ETL is a standard analytical pattern in Hadoop that stands for extract, transform and load. When we build with this profile, we build with the main class for the jar that will do the analysis. We need to build the jar for the function we want to perform, and we name the jar according to that function. When we're building the MapReduce jar, we give it the suffix -mr. When we're building the analytical code, we give the jar the suffix -etl. So we have two separate build profiles: one for the MapReduce code, called mapreduce, and another for the analysis code, called etl.
Let's run the code locally for testing. We run the MapReduce job as the main class, and we will come to understand the role of these static final variables here. Firstly, because we're running locally, we set run on cluster to false. Now we just run the class and have a look at the data so we can better understand these variables. It runs, and finishes really quickly. When we are running locally on the local file system, we put the MapReduce data in output, so here's the output. Let's have a look.
Let's see what the data looks like, and make it bigger. Here we can see we have a start cluster flag and an end cluster flag. If we open the data up, we have key-value pairs. Our key is the name of the company, which is the name of an entity, and the value is everything else that's not the key. Looking at the MapReduce output, we can see that we had these duplicate keys: we had more than one value for this key. We're calling that a duplicate, and we're also calling that our cluster. That's what we're running the MapReduce for: we want to find these in the raw data. So we need to know where one of these clusters starts and where it ends. That's the role of the static final variables here.
When we are actually running on the cluster, this is true and we run the migration. This is the output data when we're mapping from the cluster. When we run locally as a test, we can just output to this directory. But when we are running on the cluster, we need this extra step where we migrate from HDFS, from the MapReduce job, to the local file system, and we need a slightly different path. Let's have a look at that code. Here we are creating a file system object. Then we're creating the absolute directory where we are working with our jar, and then we create a path for the MapReduce data to be copied from HDFS to the local file system. When we create that path, we need a slightly different structure, which is why we have this extra variable here. So that's how we run locally. When we run on the cluster, we need to be aware that we have this extra step where we migrate the data out of HDFS into the local file system of the node we're working on in the cluster, so we need this slightly different path for that to work.
Also, we are no longer appending the timestamp, so we need to make sure that this output directory does not already exist in HDFS. Before we run the MapReduce job, we need to delete it. The same applies when we migrate the data from HDFS to the local file system: we need to delete the artifacts that we're copying from HDFS. They must not exist in the local file system or we'll get a runtime error.
Let's look at how we can now build the MapReduce job. From a command line in the same directory as the pom.xml file, the top-level directory of the project, we issue the following command in a terminal: mvn clean package -P mapreduce. The -P mapreduce option tells Maven to build the code with the MapReduce profile. Let's try that. It's making the uber jar with the main class as our MapReduce job. Now we can see it has made the uber-mr jar. If we look at the profile, the antrun plugin has created the jar with the -mr suffix, so we know that this is our MapReduce job. What we'll do now is quickly copy it to the cluster, run it in Hadoop and look at its output.
Now let's go through the steps to run our freshly built jar on the Hortonworks virtual machine. With the Hortonworks virtual machine running, we connect with an SSH terminal, where the password is hadoop, and, if possible, an SSH file browser. We copy the jar into the root directory of the virtual machine, and we need to make sure that we don't have any old directories that might clash with Hadoop, so we delete any output directories from previous runs. In the terminal we log in as the hdfs user. Then we need to check inside the HDFS root directory that we don't have an output directory, because if we do, we'll get a runtime exception.
And so there's an output directory there. We're not appending a timestamp to it anymore, so we need to delete it, because Hadoop won't write to a directory if there's a previous directory there with the same name. We delete it with this command, which is a recursive remove. Let's check, by listing the HDFS root directory, that only the data file is there. The 10,000 line data file is there. Now we run the job with hadoop jar and the name of the jar, the -mr jar. While the job runs, we will pause the video briefly at different points.
The job ran successfully. Our MapReduce phase had no errors, and in our migration phase we've output the data into the local file system in the output directory. Let's check the output data now. We need to refresh the file browser. We can see we have an output directory, and inside it we have the MapReduce data, which we open up. We can see this is the same as when we tested locally. We have the key, so let's look where we have a few duplicates. This key had many values. In the output data, this key had three values. Here they are. We're calling this a cluster, so this is our start cluster flag and this is the end cluster flag. When we move on to write our analysis code, we'll go through and do a further cluster iteration on this data based on the address field. We will need these flags to identify the start and the end of any set of duplicates. When we write the analysis code, we will mine through this data, grab these sets of values, and recurse through these values to discover when they are the same, that is, when they are duplicates. That will be the job of the analysis code.
Now we'll step through the clustering code. This is our eventual project structure: one top-level package, hadoop part one.
We have two classes: the MapReduce job, the entity analysis MR job, and a main class that runs our clustering flow. The main class drives the entity analysis ETL, where ETL stands for extract, transform and load, which is a standard Hadoop analysis pattern. The entity struct and the duplicate struct just support the cluster flow, but the ETL class, the entity analysis class, is the core logic for the clustering code.
Let's briefly discuss what we mean by clustering for the algorithm we will develop. We are talking about hierarchical clustering, and this is a link on Wikipedia where you can read about it for yourself. It's a common technique in data mining and statistics. There are two approaches, agglomerative and divisive, and you can read about them at the link, along with more about cluster analysis. We're going to take what's called an agglomerative approach, and this is a diagrammatic view of it. We can think of these values here, A, B, C, D and E, as being the keys that we obtained in the MapReduce job. We produced this step here, the B and C step: that's when we decided with MapReduce that the keys B and C were equal. So we created one key for both B and C, but we had more than one value for that key. Now we want to go the next step up the hierarchy and aggregate our values. We want to find when we have identical addresses. We want to agglomerate our values into a single value, and when we have a single value, we consider that to be a duplicate address. So our cluster analysis code works on the values, where the MapReduce code worked at this level, on the keys. The technique we're using is agglomerative. It falls under cluster analysis, and you can read a very good explanation at this link. Before we step through the clustering code, we'll talk about the new dependencies and Hadoop packages we will use.
Firstly, we will use the Apache Crunch package, which you can find at this link. It uses what are called MapReduce pipelines. There's a lot of utility code in there that we can use to write more efficient MapReduce code, so it adds a lot of power to any Hadoop-related code. The other library we'll use is Rocky Madden's stringmetric. When we do our clustering, we need some concept of distance: we will look at two strings and have a concept of the distance between them. That's where "metric" comes from; metric is a mathematical term for distance or measuring distance. So stringmetric is a collection of Scala functions that allow you to apply different distance calculations to string values. We will use this in our clustering algorithm, and we use Crunch for the interface between our data and Hadoop.
We start with the dependencies. This is our Apache Crunch dependency, which we use for the input, the output and the data flow. And this is our stringmetric dependency, which we use for the clustering algorithm, where we measure the distance between different strings. Everything starts with our entry-point class, run cluster flow. We read the MapReduce data from the output of the MapReduce job, and when we process that data in our clustering algorithm, we output our final result to a folder called clustered. We create our paths: an input path and an output path. It's important to realize these are not Hadoop file system paths. These are local file system Java paths from the Java NIO package.
We then start our ETL logic. We're calling this flow an entity analysis ETL, where ETL stands for extract, transform and load. We have three distinct phases: an extract phase, a transform phase and a load phase. The mapped data that has come from the MapReduce job is, from the point of view of the clustering code, unstructured data, because we're going to aggregate that data.
So, for clustering purposes, we consider it to be unstructured. We treat reading it as extracting the data, extracting the raw data. Then we perform the clustering algorithm in the transform phase, and in the load phase we just write the transformed data to the file system. In the ETL pattern, the load data is transformed data that we load into some other system; it could be a visualization system. Traditionally, when an ETL process extracts data, it cleans and formats the data for some further analysis. But when the data set is too large, the ETL process can become a bottleneck, so we need to think about what tools can be used to avoid the bottleneck. The tools available in Apache Hadoop are these things called pipelines, and also serialization and compression of the data. So we use structured, optimized pipelines for the data to flow on, and then we look at serializing that data.
Let's step through our extract phase logic. This is our extract method in the entity analysis ETL class, the class that encapsulates the ETL flow. The first construct we find is a Crunch construct called a MemPipeline. The MemPipeline class from Apache Crunch has a getInstance method which returns a pipeline. This is the optimized channel for data flow that we were just discussing the need for, to avoid the ETL bottleneck. According to the Apache documentation for MemPipeline and pipelines, every Crunch job begins with a pipeline. So the ETL logic we are constructing starts with a pipeline for optimized data flow. We're using the MemPipeline in this instance because it's an easily available one that doesn't take much configuration, and the documentation says this is the sort of thing you can use when you're testing your logic. It's very convenient to use. You can use it anywhere outside of Hadoop. It's just a Java class, but it's optimized for data flow. A PCollection is a Crunch collection.
These are collections that are optimized for data flow down these optimized data pipelines. As we just saw in the documentation, we can use our pipeline type to create a PCollection, and we do this by reading input into the pipeline. So the output from the MapReduce job is now the input into our clustering code, and it comes in through this pipeline into a PCollection, which we just discussed. Then, using that PCollection, we can obtain this PObject. So first we start with a PCollection, and it's just a collection of strings, but now we can turn it into a list of strings. This is analogous to an array list of strings, and the reason we want this type is that it gives us an iterator. With that iterator, we can now go through our data and start to parse it. So we create our types for dealing with our data, and we create these flags for knowing when we have the start and the end of a list of possible duplicates. Then, using some typical parsing code, we parse through that data and we create for each line an object, which we're calling an entity struct. It's an entity type that is a bit more lightweight than a JavaBean; it's much more like a standard C struct type. So each line of our input is now going to be stored in this entity struct. Let's look at our entity struct. One domain that's very close to clustering is NLP, or natural language processing, and one domain inside NLP is called entity extraction. So this entity struct could be used for entity extraction as well as clustering. It's a structure to contain information about an entity, and the entity will be the name of the actual commercial business that we're looking at. In this case, with banks, it will be the name of the bank, and the entity ID will actually be the key value, that is, the name. Then we have these other structs: a distance struct and a dupe struct. This one holds information about distance.
The dupe struct will hold information about the entity's duplicate values. Each entity will have a set of distances to other entities, and we store them here. We're only considering entities that exist inside the same key-value pair set. So the extract data method creates a MemPipeline, and the pipeline creates an optimized flow where we obtain a list of these entity structs that will eventually have distance information for each key and its set of value data. That's returned to our main clustering flow logic. The next step is to actually do the clustering. So when we transform the data, we obtain the clusters. We've obtained our list of entities, and now these entities have a distance struct. Recall that an entity is a unique key and a set of values; it could have one, two, three or four values, so it has its set of distance structs. So let's look at the clustering. We get these distance structs, and once we have them, we're going to look at them as distances: we look at the distance between each string, that is, each value. We do that in this recurse entity method. Here we have our distance structs, and the action happens here on this line. This is where we use the stringmetric library. We create what's called an overlap metric. There is a range of different metrics we could use; this one just measures how much one string overlaps another string. If the overlap value is one, we treat the strings as identical. So we obtain that distance, d, and that will be the distance that we store in our distance struct. If d is equal to one, we then add that child entity to the dupe struct list in the distance struct. This is simple clustering code, but it soon becomes very algorithmic and very intense. We're not so much concentrating on the theory now as looking at the practical application of the code, so we will leave the rest of this clustering code with you.
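The overlap check just described can be sketched in Python. This is a minimal sketch under stated assumptions, not the course's Java code: the course calls the Scala stringmetric library, whereas this version computes an overlap coefficient over character bigrams by hand. The names `Entity`, `bigrams`, `overlap` and `find_dupes` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


def bigrams(text: str) -> Set[str]:
    """Return the set of character bigrams of a case- and space-normalised string."""
    cleaned = " ".join(text.lower().split())
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}


def overlap(a: str, b: str) -> float:
    """Overlap coefficient: shared bigrams divided by the size of the smaller set."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / min(len(grams_a), len(grams_b))


@dataclass
class Entity:
    """Hypothetical stand-in for the entity struct: one key and its aggregated values."""
    entity_id: str
    values: List[str]
    dupes: List[Tuple[str, str]] = field(default_factory=list)


def find_dupes(entity: Entity, threshold: float = 1.0) -> List[Tuple[str, str]]:
    """Compare every pair of values under one key and record pairs at or above the threshold."""
    for i, first in enumerate(entity.values):
        for second in entity.values[i + 1:]:
            if overlap(first, second) >= threshold:
                entity.dupes.append((first, second))
    return entity.dupes
```

With this set-based definition, a value of one means that every bigram of the shorter string also appears in the longer one, which is slightly weaker than the two strings being character-for-character identical. Only values that share a key are compared, which matches the restriction to entities inside the same key-value pair set.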
What we'll do now is look at how we can build the clustering code and how we can run the clustering code. First we will test our code locally. We have previously run a MapReduce job, so we have raw MapReduce data; this is the output of the MapReduce job that we're now going to run a further clustering iteration on. We just run the RunClusterFlow main class, and it produces this clustered output. If we look at our output, we can see we have our clusters. So we'll just go through and choose one. This one, Big Mac, Oliver and El Sisi. If we grab the file ID, then open up the raw data, press Control-F and enter the file ID, it takes us to the record. So there's a record for that entity. Now, this apparently is a duplicate. So if we take the second file ID and go back to our initial unstructured raw data, we can see we have a second record, and we can see that we have an address here. If we look in the actual clustering data and compare the addresses, we can see the addresses are very similar. The only difference is that we have a different telephone number. The actual address field is identical: the same postcode, same street address, same number. So these are indeed duplicates, and we can see that they would have been almost impossible to find in the raw data. This shows the utility and the value of the code. With that done, we'll go through the steps to build the code and run it on the cluster. We build the ETL by opening up a terminal and going to the directory that contains the project's Maven build file. We issue this command: mvn, space, clean, space, package, space, hyphen, uppercase P, then the profile, ETL. This is how we run the Maven build with the ETL profile. This will build the correct jar with the correct main class in the manifest of the jar. This is our empty staging directory; this is where the build will end up.
So we issue the build now, and it goes through and constructs the uber jar with the correct main class. The Ant run plugin will output the final jar into a staging directory. So we go to our staging directory, and here is our final jar. That's how we build with the Maven profile. Now, to run the code on the Hortonworks virtual machine in Hadoop, we simply copy the built jar into the root directory of the virtual machine. So there's our ETL jar. We must check that we don't try to output to a directory that exists. When we run our ETL code, it's going to write to a clustered directory, and if that directory exists, we will get a runtime exception, so we need to make sure it doesn't exist. Also, we need to make sure we've run our MapReduce job and we have our output data, which looks like this. This is just the same data, with the start cluster flag and end cluster flag, as when we ran it locally. So now all that remains is to run the jar in the virtual machine, which we do with java, space, hyphen jar, space, and the uber ETL jar. We run this, so we've run it on our MapReduce data. We have a look in our directory, and you see we have this clustered folder. We look in clustered, open it with the editor, and you see we've got our clustered data. We checked out our clusters before, so this concludes our analysis of the MapReduce data. Let's just summarize the main points. During the video, we have focused on the problem domain and on the many little issues you encounter when running Hadoop and MapReduce code. But now we just take a moment to summarize the main points in the high-level conceptual domain. The problem is: how do we add value to existing data? In this situation, we took advantage of the fact that MapReduce aggregates data in key-value pairs, and we could see one example of how this can be very useful.
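The aggregation into key-value pairs mentioned in this summary can be illustrated with a few lines of plain Python. This is only a sketch of the map, shuffle and reduce idea on one machine, not Hadoop code; the record layout (file ID, business name, address) and the function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, value) pairs: the normalised business name is the key."""
    for file_id, name, address in records:
        yield name.strip().lower(), (file_id, address)


def shuffle(pairs):
    """Group values by key and sort the keys, as the shuffle and sort step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Keep only keys with more than one value: these are candidate duplicates."""
    return {key: values for key, values in grouped.items() if len(values) > 1}


# Hypothetical sample records, for illustration only.
records = [
    ("f1", "Acme Bank", "1 Main St"),
    ("f2", "acme bank ", "1 Main St."),
    ("f3", "Other Bank", "2 Side Rd"),
]
candidates = reduce_phase(shuffle(map_phase(records)))
```

Each key ends up unique, with all of its values gathered beside it. That grouping is what the later clustering pass works on when it compares the values under a single key.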
The output from the MapReduce is a uniqueness in the keys and an aggregated set of values for each key. Clustering aggregates data based on a metric, that is, some idea of an equivalence: if the metric is one, they're equivalent. We just have some sort of way of understanding, using the difference between two objects, how close they are to being equal. Then we exploit the unique key and aggregated values as an initial clustering. When the MapReduce mines through the data and finds all the keys that are the same, we can regard that as being a clustering with a metric. We can regard the MapReduce as being an initial clustering of the data that we can exploit, and then we run another, further iteration to cluster the aggregated values. So now we're saying the keys are clustered and we just want to cluster the values. When we have two values that we think are equivalent, we can decide there is a duplicate in the values, and we saw that it worked really well. In this first part, we are really going to get going with MapReduce and learn Hadoop and MapReduce concepts in the context of problems that we will solve. The motivation for big data largely came from adding value to existing data sets. In terms of commercial success, it really gained a lot of momentum when big corporations discovered they could add value to their existing data. So that's what we're going to look at, in the context of a practical problem that we will solve. The concepts that we will get on top of while we solve this practical problem are MapReduce, shuffle and sort, and what's known as natural language processing, or NLP: entity extraction and clustering. So we're going to start from where it all begins, with MapReduce, and then move on to something that's quite advanced. In this first part, we're looking at how we can leverage data that previously might not have had any value. With the advent of this technology, we can now extract or harvest value from that data.
Previously, the sort of data that was just lying around and might not have seemed to be of value was things like access logs, error logs, comments on blogs and some types of database content. Now we can come back to this data and look at it as being valuable. Let's look at a problem scenario. Imagine we have a small business, a marketing business, and we're going to create a marketing campaign based on a database that we've been given access to. We want our campaign to be professional, and in particular, we do not want to send out two copies of our marketing materials to any one business entity.

3. 3 Hadoop Analytics and NoSQL: Firstly, big data means more than one language. When we work as big data programmers, we will be working with at least these three languages. Hadoop itself is written in Java. The later analytical engines, such as Apache Spark, are written in Scala. And Python is the lingua franca of the data scientist, and this will be their preferred language for analytics. Python is a very concise language. In this first segment, we will see how we can create a streaming Twitter data source using the Python language. So let's look at the motivation for moving out of our comfort zone with Java and coming to grips with languages like Scala and Python. Here is an article from the end of last year, and it's saying that job skills related to data processing, performing analytics on data, and working with NoSQL data stores, Apache Hadoop and Python will be the area where we will see the big growth in demand for developers. In this video segment, we're going to look at how we can use Python with big data, and we're going to get an introduction to how we can work with Hadoop and NoSQL data stores. Now, this is the Hortonworks sandbox virtual machine. Here I am logged into the virtual machine with an SSH file browser, so I can easily transfer files between my host and the virtual machine.
And here I'm logged in to an SSH session in the virtual machine. We're going to look at how we can access Twitter data with Python, and in particular at the Twitter REST API. This is an endpoint to obtain streaming Twitter data. We're going to see how easy it is to use the Twitter REST API by using this Python library called Tweepy. Here is the actual link to the Tweepy documentation, which shows you how you can use Python with the Twitter REST API. This has to be the easiest and quickest way to access the Twitter REST API, so that's why we're looking at using Python: it is the best way to get the Twitter data into the Hadoop cluster. In our terminal shell, if we type python, space, minus V, it returns Python version 2.6.6. This is an old but very stable version of Python. Both the Hortonworks sandbox VM and the Cloudera sandbox use CentOS 6, which comes with this old version, Python 2.6.6, and on it we can't use the later Python packages that we need for analytical work with Hadoop. So what we need to do is upgrade Python to 2.7 or above, but we have to do it as what's known as a parallel install. We're going to highlight the steps for installing Python 2.7 as an alternative install. First, what do we mean by alternative install? CentOS comes with Python 2.6, and it uses Python 2.6 for its build scripts. If we just run yum install for Python 2.7, we will break our system scripts. So we need to keep Python 2.6 as it is now, and install Python 2.7 as an alternative Python that we can use. I have run yum groupinstall for the development tools. You can just copy the command from here, and it will be the same for all these commands where we're installing the dependencies. So it's the development tools for the tool chain, then these libraries that we will need for the Python packages that we want to run, and the same for wget.
If you're new to this environment, quite often it'll ask you, is it OK? You answer yes and continue on with the install. I've installed all my dependencies with yum install. I downloaded the Python tarball, and I extracted it to this folder. So now I'm at this step, update the native tool chain, and this is where it becomes a little bit more difficult, so we have to be careful. If we go to the instructions that we have here, what we need to do is in the etc directory. There's a file there, ld.so.conf, and we need to update that file to include this line in red here. So what we'll do is copy this whole thing and put it into a notepad, because we want to be careful with the characters: we want to use clean character sets where we're working with system files. In the terminal, we type vi, space, then that path. So we just try that again. Then you press the i key to go into insert mode. Now we can scroll to the end of the line, copy this, and go edit, paste. Now we press Escape to leave insert mode, then Shift-colon, w to write and Shift-colon, q to exit. If we make a mistake there, everything will be broken. Now we run this command from the documentation to update the runtime linker. Next we must change into our source directory, so now we're in the source directory. And now we issue this command from the documentation to configure the tool chain. Now, inside our Python 2.7 source directory, we issue the make commands with altinstall. This is the important step, so that it is built as an alternative install. So we just run this command, and the build has completed. If I issue the command which python, I get the bin directory where the initial Python exists. With which python2.7, it's pointing to /usr/local/bin, our alternative Python. If we had installed Python 2.7 on top of the system Python, the system scripts would be broken. Now I change out of my source directory and type python2.7.
I'm attempting to call the Python interpreter, but I'm getting an error, and it's saying that it cannot open a shared object file, this .so. This is the Linux equivalent of the Windows DLL file. DLL stands for dynamic link library, and the Linux equivalent of that is a .so file: a shared object, or shared library. What's happening is that the linking is failing. What we need to do when that happens is just run the linker configuration again, and that will fix the error. So we type this and run this command. We've just run the linker configuration, and then we'll try python2.7 again, and you can see this time it works. With python2.7, hyphen, capital V, it should tell us the Python version, and that's Python 2.7.6. If I just go python without the 2.7, I'll get 2.6.6. So I can call my system Python, which is just python, and I can call my alternative install, which is python2.7. When I want to run my Tweepy code, I need to make sure I'm using Python 2.7.6. Now we can install and configure Tweepy. We wget some install scripts as outlined here, and using our new Python interpreter, we run the install scripts. Then we can install Tweepy. These are the steps as they are in the documentation, so I will leave you to do that and just show you the final result. I downloaded my setup script, set up easy_install, and ran easy_install to install pip. Running easy_install, it searched for pip and installed pip. So now I'm ready to do this step: I quickly run pip to install Tweepy, and you can see Tweepy is getting installed by the Python package manager. So now we have quite a good system on the Hortonworks sandbox: Java and the Hadoop ecosystem, and we also have a good version of Python. We have this Python package manager, which is like npm for Node.js, the Node package manager. So now we can install new Python packages that are great for working with Hadoop and big data.
Python is the preferred language for data scientists. As developers, we need to work with the data scientists, and we need to support them in their work. So we need to be good at setting up Python environments inside big data environments, and that's a learning objective. Now, with our new Tweepy package and our new Python, we're going to construct a Twitter stream processor, and we're going to do that with the StreamListener object from Tweepy. The code is at this link. If we go to the code, we can see it is here, and you can see we have a directory here, python scripts. We go inside python scripts, and you can see a lot of the scripts that we used are in here. We have stream JSON, stream text with a filter and stream text without a filter. The next step is to clone this repository into the Hortonworks virtual machine. Hortonworks have kindly installed Git for us, so we just go git clone and the URL from the GitHub repository, and it pulls down the source code. If we look in our SSH file browser, we can see we have the directory here, and we have the python scripts directory. Before we can run the Python source code, we need to go to the Twitter developer console and set up some credentials. I need to have a Twitter account so I can log in, and I need to go to this URL, register my app and get my credentials. This is what it looks like. Here I've already created a new app via this button, and it's a bit of a dance that goes on here: they might need you to put in your mobile phone number, get a token and authenticate, to prove you're a real person and not a robot. Once I have my app, I go to my app, to Keys and Access Tokens. This is my consumer secret and key. Then I need to do one more thing, and that is create an access token. So here's my access token, and I can regenerate these. Once I have my consumer key, my consumer secret, my access token and my access secret, I copy them. I'm going to use them later.
We will now step through the Python code at a reasonably high level. This line declares the encoding; that's fairly important. Then we import the Tweepy package and also the JSON package. Then we create a class that is our stream processor, which uses the Tweepy StreamListener object. In terms of authenticating, this is where we use the credentials. This entry here is the client key, and this is the client secret. This is the access token, and this field here is the access token secret. This is all in the Tweepy documentation if you want to check it out. One possible way to work with the Python files in Linux is to use the Vim editor, which I can install with yum install vim. Then I can work with a file by just going vim and the name of the file. The first file we'll work with is the stream text Python script. Our first job is to put our authentication credentials into the Python script. I navigate to where I want to work, then I just hit the i key. I copy that into the buffer, click where I want to paste, and I've inserted it. I hit the Escape key, and now I want to write my change. So I go Shift-colon, enter w to write, and then Shift-colon, q to quit. Now let's have a look: did it persist our change? Yes, it persisted the change. What you need to do now is just put in your keys and values. Let's summarize all the steps to get to this point. Firstly, we installed a newer version of Python so we can use the newer Python packages. We pulled the source code to our virtual machine, where we have a folder called python scripts. Then we went to the Twitter API and set up credentials so we can access the Twitter streaming endpoints. And we went on the virtual machine and put the credentials into the script. So this is the script with the credentials in place. We're now in a position to run the script against the Twitter REST API, and we do that by using our new Python interpreter like this.
We run this script, and it pulls this data from the Twitter API. Using the Tweepy API and its associated Python code is very easy. From the Tweepy API, we've imported the tweepy object. Here we create an OAuth handler, auth. It takes a client key and secret, and then on that object we can set the access token, and we can use it to set up an API object. We do that with our tweepy object. Then again, with our tweepy object, the API object and our auth object, we can create a Twitter stream, passing the authorization object and an instance of our stream processor class that we are defining here. This stream processor class wants an API object returned by Tweepy and also a number of tweets, that is, how many tweets we will process. The stream processor is a Tweepy StreamListener. In its constructor, we create some variables and pass into the superclass, the StreamListener. Then we have an interface method, and into this method will come these objects that are just JSON structures. This is an instance of our containing class, which we have fed an API object and some other parameters. Then, with the JSON package, we load the tweet. So this tweet data is the JSON structure of the tweet pulled in by our stream listener. Now we parse the JSON data in Python using an array-like syntax. If the text key is not null in this JSON structure, and the JSON object's user language field is equal to en, then that particular JSON passes our filter, and we just print it to the standard output, as we saw. What we can do now is pipe that out to a text file. We run this, and streaming Twitter data is now getting output to this text file. You can see it's the text, all the non-null text, where the user language field was equivalent to English. We have another Python script, which we can use just to get the raw JSON. We go to the sandbox and we run the JSON script to standard output. So that prints the output as JSON.
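The filter described here, keeping a tweet when its text is present and the embedded user's language is English, can be sketched with the standard library alone. This is a minimal sketch, not the course's script: it leaves out Tweepy and the stream listener entirely, the function name `english_text` is hypothetical, and the field names `text`, `user` and `lang` are assumed to follow the JSON layout shown in the video.

```python
import json


def english_text(raw_tweet):
    """Return the tweet text if it is present and the user's language is 'en', else None."""
    tweet = json.loads(raw_tweet)
    text = tweet.get("text")
    user = tweet.get("user") or {}
    if text is not None and user.get("lang") == "en":
        return text
    return None


# Hypothetical sample payloads, for illustration only.
samples = [
    json.dumps({"text": "hello world", "user": {"lang": "en"}}),
    json.dumps({"text": "hola mundo", "user": {"lang": "es"}}),
    json.dumps({"user": {"lang": "en"}}),
]
kept = [t for t in (english_text(s) for s in samples) if t is not None]
```

In a stream listener, a function like this would be called from the method that receives each raw JSON payload, and the returned text would be printed to standard output so it can be piped to a file.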
So we'll pipe this to a file called twitter.json. Here's our JSON. What we can do is look for these fields. Let's look for the user field. Here I have my user key and value: this is the key and this is the value. And here's our lang field. For this one, the language was Spanish, but for this one the language is English. So we look at that field of the embedded user JSON object inside the tweet object, and when it's English and the text value is not null, we process that data to the standard output, and we can pipe that standard output to wherever we want. In the first video, we looked at a gentle introduction to Python, and we saw how we could install Python in a difficult environment; that was one learning objective. The second learning objective was a gentle introduction to programming in Python, where we looked at creating a streaming processor with the Twitter API. Our motivation was the demand for particular types of job skills, which were big data skills, but they also included Python and also NoSQL. So that will be our focus now: NoSQL. We're going to look at a very popular NoSQL database. It's widely used, and it's called MongoDB. There are a number of very good NoSQL databases that you can use with Hadoop, but we've chosen Mongo because of the demand not only for Hadoop, Python and NoSQL but also for general cloud skills. The MEAN stack is a development stack you can use particularly well on the cloud. Here is the MEAN stack on the Google cloud, and they're saying the MEAN stack is MongoDB, Express, AngularJS and Node.js. So in this MEAN stack, the M stands for MongoDB, and the MEAN stack is cloud-oriented. There's MEAN.io, which is a JavaScript framework built on top of MongoDB that is extremely easy to set up. We are trying to stay focused on job-related skills as a developer.
Big data, NoSQL, Hadoop, Python and the cloud: we have a pathway from Hadoop to MongoDB to the cloud. Our first aim is to install MongoDB on a Hadoop cluster. This is a link to the MongoDB install on CentOS, and CentOS is akin to Red Hat. The link takes you to this documentation, and the first thing to do is to create this file in this directory. The easiest way to do this is just to copy this. Inside the terminal we go vim and the name, and paste this in. Then I press Escape, Shift-colon, w, and Shift-colon, q to quit, and I've just created the file that the install instructions are asking for. Now that I have my repository enabled in yum, I can use yum install. To install the latest stable Mongo, as it says in their official documentation, I issue this yum command, and it should pick up the Mongo repo. So the download completed, and it installed Mongo. Our work is not finished, though; it can be problematic. The CentOS box that we have installed Mongo on is very stripped down, and it doesn't have all the language character sets. This command gets around any language character set issues you might run into. The next thing is that we need to make sure we have the data path set up. We look, and we don't have a directory, so we need to make that directory. In the shell we can do that because we're root. So we go back one directory, and now we're going to make that data path. We have to use the minus p flag because we're trying to make two directories at the same time. The next thing to do is to tell Mongo to use that. To do that, we need to give all permissions: chown, minus R for recursive over directories, minus f, then we put the path, and we make it all permissions. So we've done that, and now we can tell Mongo to use that as its data path. We tell Mongo, and Mongo's going through now, so I'll just let that go through. The final thing that can go wrong, where Mongo can get a bit problematic, is ending up with a sort of orphan lock file.
If that happens, issuing this command will delete any orphan lock. If Mongo crashes, this can end up in an inconsistent state, so the easiest thing to do is just to remove it. We reboot the virtual machine and see whether we can get into the Mongo shell. Now we've rebooted the virtual machine. In the terminal, if we issue the command mongo, we will go to the Mongo shell if everything has worked out. It's worked out: we're in the Mongo shell. If we press Control-L, we get a nice clear screen. Now, if we type show dbs, that will show whatever databases we have. We only have a local database, which is the initial database that has the Mongo configuration data. If we go use and the name of the database, that's how we switch to a database. So with use local, we have now switched to the local database. Now, if we issue the command show collections, it will show all the collections we have in that database. We have two, some logs and some indexes, because this is just a system database. Now, Mongo is what we call a document store, and the documents in Mongo are JSON structures. In every database we have a set of documents, and a set of like documents is what we call a collection. A collection in Mongo is analogous to a table in a relational database. If we type exit, we can leave the Mongo shell. Everything's worked out: we've got Mongo set up on the same system as our Hadoop cluster. So now what we want to do is glue Hadoop to Mongo. That is, we want a bridge so we can get data out of Hadoop into Mongo. The Hadoop job that we will develop the code for now will demonstrate a Mongo-Hadoop bridge. This is a diagram to aid understanding of the data flow. We have our Twitter REST API endpoint, which we access with our Python script, which creates a Twitter streaming object. So we stream the unstructured data into the cluster via Python.
Then we analyse and process the unstructured data in a job. We're going to write that job in the Apache Pig scripting language, and it will run inside HDFS as a Hadoop job. The results of that job will be streamed to the Mongo database. So there'll be no intermediate step: we will go straight from a Hadoop job running in HDFS to input into the Mongo database. Once we have the data in the Mongo database, it will be structured JSON, and then we can use that as a basis for some other process. The Mongo-Hadoop bridge, or the Mongo Hadoop connector, is an advanced thing to set up. This is an official MongoDB blog where they're talking about it, and it says it's supported by the engineering team. So they're supporting the MongoDB connector for Hadoop, but it's quite advanced in terms of gluing together a real-time bridge between HDFS and MongoDB. There's quite a bit of work that needs to be done to get it to work, especially with the latest version of Hadoop. You have a fairly complex dependency setup to get it to work with the latest version of Hadoop, which is 2.7.0. I had to use these dependencies; these are the exact dependencies that I used. In particular, I needed to use the latest version of Pig, and I had to include a classifier to tell Pig to use Hadoop 2. By default, Pig uses Hadoop 1. To use the latest version of Hadoop, Hadoop 2, you've got to directly instruct Pig via a classifier to use Hadoop 2. These are the other dependencies that I needed to include in my Java class path. There was also a Pig dependency; later on we'll look in the code at how that works. Here we can see the documentation for Pig, if you want to read about Pig. I needed to include these jars directly inside Pig, and we'll look at how you do that. These are the same jars, except for these jars here. And this is the actual Mongo Hadoop connector.
To get these jars, I had to build the latest version of the Mongo Hadoop source code, and when you build the latest version, everything works. If you just use the downloadable dependencies that are available on the Internet, they're not going to work out of the box with the latest version of Hadoop. They might work with older versions of Hadoop, but they won't work with the latest version. We'll go through the steps; it's not hard. What we'll do now is go over the code and look at how these dependencies are set up. This code runs an Apache Hadoop job using Apache Pig, and this class, the PigRunner class, is the entry point class. It has a main method, which is the main method for the job, and this main method creates an instance of PigETL, where ETL once again stands for extract, transform and load. Let's look at our ETL class. This is where we have our Pig dependencies. Here we have what's known as a PigServer. This is an object that will run our Pig scripts. We can run Pig from a command line, we can run Pig from a GUI application like the one in Hortonworks, or we can run Pig as embedded scripts in a Java program, much like we run embedded SQL. What we're concerned with now is just looking at how we set up our dependencies. This is how we set up our jar dependencies. This is our Mongo Hadoop core dependency, and this is our Mongo Hadoop Pig dependency for the Mongo Hadoop connector. These dependencies, like JSON Simple and Jackson mapper, are just JSON dependencies we use. This is an important dependency: it's what's known as the piggybank jar. So at the moment we have these jars, and we declare them in this block of code here. Then we create an instance of our PigServer, and we register the jars with the PigServer. The key point is that we will have a path to each jar, and that's the project root path appended with pig jars and then the name of the jar.
For our Java classpath dependencies, we have a Hadoop version, which is the latest, 2.7.0. Then we declare our Hadoop client, Hadoop HDFS and our Mongo driver. Then we declare our version of Pig for Maven. This is where we use the classifier. If we don't have this h2 classifier declared in our Maven dependency for Pig, it won't run as a Hadoop 2 Pig job. It will run as a Hadoop 1 Pig job and fail on a Hadoop 2 cluster. So this is a key dependency attribute here. And we have other dependencies to support the JSON. For the Mongo Hadoop dependencies, we go to the GitHub repo and clone the repo onto a hard drive. Then, inside the directory where we have cloned mongo-hadoop, in the top-level directory at the same level as the Gradle build file, we issue the Gradle build command like this and build the project. It takes a while, especially the first time you build it, because it has to download all the dependencies. We'll just pause for a little while as it builds. And the build was successful. You can see I had previously built it, so all my dependencies were up to date, but for you it might take quite a bit longer. Then, once it's built, go into the mongo-hadoop directory. Inside each module there is a build directory. The top-level jars are in build/libs; that's the top-level jar for mongo-hadoop. Then core: we need core, so we go to the core directory, into build, into libs, and there's the mongo-hadoop core jar that we need for our Pig dependencies. And then, of course, we need pig, so we go into the pig directory, into build, into libs, and there is our mongo-hadoop pig jar. The two jars we need are core and pig. So that's how you obtain them, and we need to do it this way to get it to work with the latest version of Hadoop. Now we can build the source and locate the jars we need. We copy them into the pigjars directory in our source.
Then at run time, we have this project root path, and this will be the absolute path to wherever the uber jar is running. The Pig jars path will be relative to that path, followed by the name of the jar. So when we register a jar with the Pig server, it will be able to find it wherever the uber jar is running. The rest of the code is concerned with running the Pig scripts. In our PigETL class we have two methods: one is load data, and the other is our ETL transform. So we load the data and we transform the data. Here is our load data method. We load the raw data as a character array. Then we filter our raw data by a match: we're using a regular expression here to match the word "please". Our thesis for the sentiment analysis of the Twitter data is that manners are going out of fashion. So we're going to look in all that raw data for when people are using "please", because that's an expression of good manners. It's just a friendly, easy analysis to show the basic idea. This is a match expression, so it is like a match query. If you've done any SQL, it is like selecting everything from some table where something matches. So we load the raw data, then we run a match query, and we obtain the results of our match query here in C. Then we check isLocal. If we are doing local testing, this will be the expression, with a slightly different path for local. If we're running in HDFS, this will be the expression. Here we are storing our match query results as JSON data; we're writing them out using JSON storage. So the data comes in as raw, unstructured data, we run a match query on the word "please", we get some results, and we store them as JSON. This is very concise scripting code that saves a lot of Java boilerplate. In the transform method here, we're taking the data for our transform, as in any ETL flow. We are transforming the data from being HDFS data in an HDFS cluster to being documents in a Mongo document store.
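The load step described above (read raw lines, keep the ones that match "please", write the matches out as JSON) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Pig script from the course: the exact regular expression and the `text` field name are assumptions.

```python
import json
import re

# Assumed pattern: the whole word "please", in any letter case.
PLEASE = re.compile(r"\bplease\b", re.IGNORECASE)


def load_and_filter(lines):
    """Keep only the raw lines that contain the word 'please'."""
    return [line.strip() for line in lines if PLEASE.search(line)]


def to_json_records(matches):
    """Serialise each match as one JSON object per record.

    The 'text' key is a hypothetical field name chosen for this sketch.
    """
    return [json.dumps({"text": match}) for match in matches]


if __name__ == "__main__":
    raw = ["Could you please retweet this", "no manners here"]
    for record in to_json_records(load_and_filter(raw)):
        print(record)
```

The Pig version expresses the same three stages (load, filter by match, store as JSON) as script statements, which is why it stays so short.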
In a Mongo collection, every document in the collection is like a row in a relational database table. If we have 10 results, we will end up with 10 Mongo documents in a Mongo collection. We'll look at the syntax for that now. Here we are loading the data out of HDFS using the JSON loader. Once we have the data loaded out of HDFS into our Pig container, we store that container into MongoDB, and this is the URL to access MongoDB. Here, test is the name of the database, and tweets will be the name of the collection, which is analogous to a table in a relational database. We're able to push the data out of HDFS into Mongo because we are using the Mongo Hadoop connector, and it's running as embedded Pig scripts in a Java Pig server object. We don't always have nice GUIs where we can run our Pig scripts. To move on: first we will test it locally, and then we will run it in the Hadoop cluster. To summarize some important issues: firstly, to obtain the raw streaming data, we need to copy the streaming script to the root directory. Then we need to edit this file and set the number of tweets to what we want. For the examples in the video, we're setting the number of tweets to 1500. Then we run the streaming script with the Python interpreter that we installed, version 2.7, giving the name of the Python file and an output to the raw data text file. That's how we obtain the raw data. The next important point is that in the source code we do not have a pigjars directory, so we need to make this directory. In the GitHub repository for the source code there is no pigjars, so when you pull this code, you're not going to get the pigjars directory. We need to create the pigjars directory just underneath the top-level directory for the source code.
Then, on the Internet, we find and download the jars that are listed in the dependency section of the documentation, and we build the Mongo Hadoop connector and obtain its jars as outlined in the video. So that's the pigjars directory, and that's getting the raw data text file. The next important thing, if you're having errors, is to make sure that isLocal equals true is set for local testing, and we set this to false when it runs on the server. So to get ready for the work to come, we should have cloned the source code somewhere. Then we need to make the pigjars directory on Hortonworks and also inside wherever we've cloned the code for local development. We need to clone the mongo-hadoop project and build it with Gradle, as we outlined previously. Then you need to copy the jars to all the pigjars directories, on Hortonworks and wherever you're developing the code locally. We showed you how to find them in the mongo-hadoop project, and you can find the other ones on Maven Central. We've given you the name of the jar and the version of the jar, and that's what's really important. Here, for instance, for the Mongo Java driver, it's version 3. Previously, we pulled the source code onto the virtual machine by issuing this command in the virtual machine, and that clones the repo. This is the link to the GitHub repo for the source code. With the source code cloned onto the Hortonworks machine, we then build the source code with the pig profile. Hyphen capital P is for profile, and this is the pig profile, and we'll just quickly step through our pig profile now. The object of the pig profile is to produce two jars. The first jar is the Pig module jar, which we need to set up the Pig server. The second jar is the uber jar that will run in the cluster. This is the Shade plugin, which produces that uber jar. It needs to have defined the fully qualified name of the main class for the Hadoop job, and that's the PigRunner class.
Then there is the jar plugin, which creates the Pig module jar. And this Maven exec plugin we can use for testing, if we want to run a class inside the virtual machine to test something. Here, the MongoRunner class I have just runs a simple Mongo connection, so it is just a way to double-check that Mongo is running. So the goal of the profile is to produce these two jars: the jar for the Pig dependencies, and the uber jar for running the Pig Hadoop job. Inside the source directory, issue the following command to build the code with the pig profile, so we get a runtime jar that will execute our Pig scripts. So issue that command, and it's building now. If it's the first time you are building, it might take a while, while it pulls in all the dependencies it needs. This is where it's downloading the dependencies to run inside the jar. When we run code in Hadoop, we need to have all the dependencies assembled in a jar. And we get build success. Now we go to our root directory, then into our source directory, and look for the uber jar. We can see we have the uber Pig jar, so we copy it into the top-level directory. So we paste the uber Pig jar here, and once it has been copied into the top-level directory, we're in a position to run a job. Now we're ready to set up the cluster so that we can run the load phase of our Pig build in HDFS. These are the commands that we're going to need, because we have to create, in the cluster file system, the folder where we keep all the jars that Pig is going to be looking for. We'll go through the steps carefully, because if you make a mistake here, nothing is going to work inside the HDFS file system. Let's have a look at what we've got in the root directory. Inside the root directory, we only have our CSV data from part one. So the first thing to do is to make the directories where the Pig jars will live.
We have to make the top-level folder, which is this command. Then we make the pigjars folder the same way. Then we need to give the permissions: we need to give read and write permissions to the directories so that we can use them, read from them, and copy files to them. So we give read and write permissions to the top directory, and we give read and write permissions to the pigjars directory. And now we put all the jars. Keep in mind, first, that star dot jar will pick up every jar in the local file system, so in the local file system, not HDFS. Star dot jar will get all the jars in the local pigjars directory, and then we can copy all of them into the HDFS pigjars directory. So let's run this command. Now, we need to be a little bit careful when we copy the jars. We've set things up in parallel. When we run the code locally, Pig looks inside the pigjars directory in the source folder for the jars. That means when Pig is running on HDFS, we need a parallel directory in HDFS, and Pig will look in the equivalent HDFS directory for the jars it needs. So when we copy across, we don't want the slash here, because we're copying from inside that directory, so we don't need the slash. If you put the slash there, it's not going to work; you'll get an error message. So we need to be a little bit careful there. So this is the command: hyphen put, space, and no leading slash on the directory name. And now it's finding the jars and copying them across. What we do now is double-check that we've got all the jars we need, so we list the directory with ls and have a look. We want to see all the jars in there. So there's the Pig runtime, and all the jars are now in. So now we have the same parallel directory structure in the Hadoop cluster, inside HDFS, that we have when we run Pig locally.
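As a recap of the order of those steps, here is a small Python sketch that only builds the `hadoop fs` command lines (make the two directories, set permissions, put the jars, list the result) without running anything. The subcommands `-mkdir`, `-chmod`, `-put` and `-ls` are standard `hadoop fs` options; the directory names and the `a+rw` permission mode are assumptions for illustration, since the exact values used in the video are not in the transcript.

```python
from pathlib import Path


def hdfs_setup_commands(hdfs_top_dir, jars_subdir, local_jars_dir, mode="a+rw"):
    """Return the setup commands, in order, as argument lists.

    hdfs_top_dir, jars_subdir and mode are hypothetical values supplied by the
    caller. Stray slashes at the ends are stripped so the paths join cleanly.
    """
    top = hdfs_top_dir.rstrip("/")
    jars = f"{top}/{jars_subdir.strip('/')}"
    # Expand *.jar against the LOCAL file system, not HDFS.
    local_jars = sorted(str(p) for p in Path(local_jars_dir).glob("*.jar"))
    return [
        ["hadoop", "fs", "-mkdir", top],
        ["hadoop", "fs", "-mkdir", jars],
        ["hadoop", "fs", "-chmod", mode, top],
        ["hadoop", "fs", "-chmod", mode, jars],
        ["hadoop", "fs", "-put", *local_jars, jars],
        ["hadoop", "fs", "-ls", jars],
    ]
```

Each list could be handed to `subprocess.run` on a machine where the `hadoop` command is installed; printing them first is a cheap way to check the paths before touching the cluster.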
So to test our data locally, first we must check that isLocal is true; otherwise we will get errors. We'll go through the load data step, because we can only really test the transform step when we map the data to Mongo inside the cluster. In the file system, inside the source directory, we do not want to have a pigout directory, because we will get errors. We also have to make sure we have our raw data text file there for the test. Then we just run this as a main class. Run this now and look at the terminal output, and you can see it's running as a job. Now, there is one error here. If we go right up to the top of the terminal output, there is one error where it's complaining that it can't find Hadoop; here it's looking for Hadoop home. We don't need to worry about that when we are testing locally. What we do now is go and have a look at the data output that it made. If we go up here, we can see it has created a pigout directory where it has mapped to the JSON format. So let's have a look at our data, and there's our JSON data. We'll just make this bigger. So it has matched on the "please" match text, and it has created this nice JSON structure. What we need to do now is move over to the cluster and test and run on the cluster. We've already created the parallel directory structure for the Pig jars we need when the code runs in Hadoop. We also need to put our data, our 1500-line Twitter raw data text file, into Hadoop as well. So let's do that, copying that file, and then we'll check everything is correct. So we ls the root directory. We want to see the project directory with the pigjars directory, which contains all the jars for Pig when it runs on a cluster. And now we've got our raw data text file in there. So let's check.
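The two preconditions for a local run mentioned above (no leftover output directory, raw data file present) are easy to check before starting the job. Here is a minimal Python sketch of such a pre-flight check; the names `pigout` and `rawdata.txt` are assumed spellings of the directory and file named in the video.

```python
from pathlib import Path


def check_local_run(source_dir, raw_data_name="rawdata.txt", output_name="pigout"):
    """Return a list of problems; an empty list means the run can start.

    raw_data_name and output_name are assumed names for this sketch.
    """
    src = Path(source_dir)
    problems = []
    if (src / output_name).exists():
        problems.append(f"{output_name} already exists; delete it before running")
    if not (src / raw_data_name).is_file():
        problems.append(f"{raw_data_name} is missing; generate the raw data first")
    return problems
```

Running a check like this first gives a readable message instead of an exception from deep inside the job.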
We want the correct structure here, so we don't want to see any pigout directory, because when we run this code it's going to output the final data to the pigout directory. If that directory already exists, it's going to throw an exception. So now we have our Pig jar, and we built this Pig jar with only the load data step of the ETL running. So now we're just going to run the load data step. What we do now is run that job with Hadoop. So we type hadoop jar and the name of the jar, and hopefully everything will work. So now we're running the jar in Hadoop. It's going to run Pig, it's going to load the data, go line by line through it, and extract on a match value. And it did that very quickly. Here we can see it has successfully read all those records, it has matched 28 values on our match text, and then it has transformed the data to JSON. What we can do now is have a look in our root directory, and we need to refresh this, so just change somewhere else and back. You can see it has created a pigout directory, so it has mapped the data out of HDFS into the local file system of the Hortonworks machine. If we look in here, we'll see the results transformed to JSON data. This is the data that matched on our match text, which was "please", and it has nicely mapped that data to JSON for us. Our next step is to load this data into Mongo. To do that, we change our code so we only run the transform data step. So we comment out load data and uncomment transform data. And as always, we need to make sure isLocal is false when we run on the cluster. We get this code onto the virtual machine. Now, inside the project directory, we run our command to build with the pig profile. Then we go and delete our old Pig jar, go into our code, and get our freshly built Pig jar into the root directory. And now we change into the HDFS directory.
So now we're in HDFS. We just wait, because it takes a while to copy. The Pig jar is 55 megabytes because it has a lot of dependency jars inside it. So we wait a few seconds, and now we're ready to go. Now we're hoping that our code will read the JSON records that we transformed from unstructured data to structured JSON with Pig, and load them from HDFS into Mongo. So we run the job, and it looks like it worked. If we look in the output here, it seems to be connecting to Mongo. So our HDFS to Mongo bridge looks to be up and running.
4. 4 Kafka Streaming with Yarn and Zookeeper: In this section of the course, we're going to build a distributed streaming application. To do that, we will use Kafka, YARN and, behind the scenes, ZooKeeper. We can think of two different types of applications for Hadoop. We can think of simple batch jobs that are MapReduce jobs running in HDFS. Distributed parallel MapReduce applications are very useful, and we saw in the clustering section, topic one, how useful they can be. So Hadoop 1 was extremely successful at massively parallel MapReduce applications, but data scientists use a range of technologies and a range of algorithms, so we needed to expand the architecture beyond MapReduce. So we evolved to Hadoop 2, and the application that we're going to construct in this segment is a Hadoop 2 application. At the heart of the Hadoop 2 architecture is the technology known as YARN. Here is a simplification of the Hadoop 1 architecture. In this very simple outline, we would submit a job to a job tracker running on a central node called the name node. Then the job would be farmed out to the data nodes for the work to run in parallel. While the work is running, we have a task tracker looking at what's going on on the data node and reporting back to the name node and the job tracker. In Hadoop 2, the job tracker has been replaced by the YARN resource manager.
This is a very important object for us, because we'll need to know how to configure and submit jobs or applications to the YARN resource manager. Then, for every application we have running in the cluster, there is an application master running in a data node with a node manager, and the application master works with what are known as YARN containers. So where before we had a data node with a task tracker, now we have an application master with a YARN container. What happens inside a YARN container is something that the resource manager and the application master encapsulate away from us. But where YARN containers become very important for us is when our application fails, because the application will fail inside a container. So we need to be able to debug our code that fails inside the container to get the application to work. That's where the container becomes an important object. We need to know what's going on inside that container that's causing our application to fail, and configure our setup so that it doesn't happen anymore. So there are two high-level objects we are really concerned about as application developers with YARN and Hadoop 2: one, the YARN resource manager, and two, the YARN container. This diagram goes a bit deeper into Hadoop 2 and YARN. We have two pathways: we can submit to HDFS, or we can submit to YARN. An example of a Hadoop 2 application that we would submit to HDFS is in the second topic, where we looked at the Apache Pig to MongoDB connector. There we processed data in HDFS as an Apache Pig job and then streamed the data to a Mongo database. So we didn't interact with the YARN resource manager, even though it was running on a YARN cluster. In this segment, we will work directly with the YARN application manager. So we have two pathways to submit our application.
The HDFS system in Hadoop 2 is different because we have a secondary name node and a standby name node. These were developed because of concerns like a single point of failure and data recovery when the cluster goes down. So this is a more high-level view. ZooKeeper is something that works behind the scenes. When we have a distributed streaming application, we work with what are known as queues or messaging systems. These are quite well known: there's ActiveMQ, RabbitMQ, and many other distributed messaging systems that work with what are known as message brokers. When we work with a distributed messaging system, we have nodes, and we have these queues running on nodes. So we have this idea of what's known as a node queue or a queue node, and these queue nodes need to be managed. In the same way that the resource manager and the application master in YARN manage the process running inside the YARN container, ZooKeeper is going to manage the process running inside the distributed queue node. So we don't interact with ZooKeeper, but it has to be there, it has to be configured, and it has to be running. It works seamlessly behind the scenes, doing a sort of distributed management for our queuing system. Apache Kafka is the messaging system for big data. A single Kafka broker, and we can think of a Kafka broker in a simple sense as being like a message server, can handle hundreds of megabytes of reads and writes per second from thousands of clients. And look at the people that use Kafka: Kafka was invented at LinkedIn, and it's used by Yahoo, Twitter and Netflix. Some of the biggest companies in the world rely on Kafka for their streaming systems. There are many challenges to overcome when we build and design messaging systems and streaming architectures, and Apache Samza is a framework which encapsulates a lot of those issues.
Apache Samza is a framework that allows you to easily create Kafka streaming applications. If we look at Samza, we can see it's all about fault tolerance, scalability, reliability and being simple; that is, keeping it simple. It's a framework that integrates Kafka with YARN. What we'll do now is move on and look at creating some very simple Samza code, where we set up a Kafka stream from Twitter and integrate it as a YARN cluster job. Now that we have defined the systems that we'll be working with, and created a bit of a vocabulary for the objects we'll be talking about, let's have a look at the source code and see how simple it is to set up a distributed streaming application with Apache Samza. This is the link to the source code, and this is the source code for our Samza application on GitHub. We'll need to clone this source code into the environment where we are going to build and run the application. We will start by looking at this class, the Twitter parsing stream task. This class is a Samza class that acts on the underlying Kafka stream. The Kafka stream is set up and implemented in the YARN cluster, and it's streaming data from Twitter. In this class, by implementing the stream task interface, we can process the stream. When you get a message in the stream, it's inside an envelope, which is just a wrapper for the underlying stream object, and we can grab its contents like that. In this case, it's a string. Now, because the data is coming from Twitter, we know it's JSON data. So we use our JSON Simple package to extract the JSON object, that is, the Twitter JSON object. Then we just grab the text field out of the Twitter object.
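The unwrapping steps described above (take the message out of its envelope, parse it as JSON, pull out the text field) look like this in a minimal Python sketch. The real task is a Java class implementing Samza's stream task interface; the `Envelope` class here is only a hypothetical stand-in for the wrapper around an incoming message.

```python
import json
from dataclasses import dataclass


@dataclass
class Envelope:
    """Hypothetical stand-in for the wrapper around one incoming stream message."""
    message: str


def extract_tweet_text(envelope):
    """Unwrap the message, parse it as JSON, and return the tweet's text field.

    Returns None when the JSON object has no 'text' key.
    """
    tweet = json.loads(envelope.message)
    return tweet.get("text")


if __name__ == "__main__":
    sample = Envelope(message='{"id": 1, "text": "please retweet"}')
    print(extract_tweet_text(sample))
```

The method body in the Java task has the same shape: one call to get the message, one to parse it, one to read the field.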
So we go from the overall Twitter object to just a string of text, which is the text in the tweet. Using the same interface, we can process the data in the underlying stream, accessing it via the envelope object. So we have a stream; let's call this stream tweets. We have one Kafka stream called tweets, and it's pushing through this object, the Twitter JSON object, and we process that object. Now that we've processed it, what we're doing is creating a new Kafka stream by outputting the object that we processed. We started with one Kafka stream, but now we have two, and we didn't have to do very much to get that second Kafka stream. That's a really powerful thing about Samza: it lets us create multiple Kafka streams based on the processing that we do. Now, it's not as simple as just this code here. It works out to be a lot of work on Linux systems to get everything fine-tuned and working. But in terms of the code that you end up with, it really is very simple and powerful, and that's a good metric for code. If you have simple code that does a lot, that's good code. If you have complicated code that doesn't do very much, then you maybe need to start thinking about refactoring what you have. We started with the original stream, let's call it tweets, we processed it, and we're outputting to a new stream called tweets parsed. So it's really very simple. Let's look at what's involved now in setting up the underlying streams and building the code. To build the code and successfully deploy it to a YARN cluster, we need to get on top of the build artifacts involved: the files that exist and their dependencies. This is the Maven build file, pom.xml. We are using the latest version of Kafka and the latest version of Samza. We need to be using the latest dependencies to get it to work on the latest build of Hadoop. The key artifacts in our build file are, in order of importance, first the Maven Assembly plugin.
Previously, we've used the Maven Shade plugin to create the uber jar. For this project, we do not create an uber jar. What we do is create an uber compressed file, an uber tar.gz. When we build and package this code, the code is packaged into a compressed tar.gz file, and the Maven Assembly plugin has that task. It assembles all the artifacts. Many of the artifacts are jars, but there are other files as well. The Maven Assembly plugin is configured by this file, src.xml. The Maven Assembly plugin is tied to the package phase. That means when we run install or package, the Assembly plugin runs. It's also important to point out that we are building to Java source level 1.7; it will not work in 1.8. Then we have the Antrun plugin. This copies the built artifact, the tar.gz, to the staging directory, and the staging directory for this code is called deploy. It copies the compressed file, but it also decompresses the compressed file. So in the deploy directory, we have the compressed file, and we have an unpacked directory structure of that file, because we use some of the files inside the unpacked directory structure in the cluster file system as runtime descriptors. So we need the whole compressed set of build artifacts, but we also need some of the files in there decompressed and available at run time. That's the job of the Antrun plugin. Now, the code is very sensitive to the dependencies, and it needs to have the latest versions. But you also need to include these exclusions, because otherwise you get no class definition errors. Here is one dependency, the Samza API, that contains the Samza runtime exception class. The Samza API is also a dependency of Samza core, so that means you can have two copies of the Samza runtime exception class on the Java class loader's classpath at run time. Having two copies of the same class on the class loader's classpath at run time is a problem.
You can get what's called a no class definition error. We'll just go over that again, because it can be confusing. Here I have a dependency, the Samza API. Here I have another, different dependency, Samza core. Now, the Samza API is a dependency of Samza core. This is known as a transitive dependency. Transitive dependencies can trigger lots of different sorts of errors, so we need to manage them, and we manage them with exclusions. One of the errors they can trigger is a no class definition error. Here is a link to a good discussion of the no class definition error. It goes into some of the reasons why you get a no class definition error and how it's different from a class not found exception, and it points out that the causes mentioned there are just some of those that can lead to a no class definition error. The main point for us is that if we don't declare these exclusions, we will get those errors at run time. This is our package structure; I'll just make it a bit bigger. This is the package structure before we run the build. You can see we have a directory for our Python scripts. We will be pulling the Twitter data via the Tweepy package, as in the previous video. We have our assembly directory, src/main/assembly, where we have src.xml. This is the XML code that configures our Assembly plugin, which builds our compressed archive and packages all the jars and the runtime build artifacts. And we have our Java source directory. Then we have the resources directory, and here we have these properties files. This is where we configure the interaction between our source code, Kafka and the YARN cluster. What we'll do now is run a build script, look at the result of running it, and look more closely at these properties and how we configure the Kafka streaming with the YARN cluster. This is our project structure before we build. Here we have our main build file, pom.xml.
In resources, we have some properties files that are configuration files, and in src/main/assembly we have this src.xml, which is a major build artifact. In the IDE we can see the structure here, and here we can see, in our Maven build file pom.xml, our Assembly plugin and the Antrun plugin. The Assembly plugin will work with our src.xml to assemble the build artifacts. The Antrun plugin will copy them to a location which will be a deploy directory. We build the code in the same directory that contains the pom.xml by issuing the command mvn clean install. Now it's beginning the build process. When you do this on your box for the first time, it's going to take quite a while to download all the dependencies; it downloads a whole lot of Scala dependencies. So it has run the build process. Let's have a look now at how our directory structure has changed. You should see now that we have a deploy directory, and inside the deploy directory we have bin, config and lib. This is a screenshot of the deploy directory taken from the IDE, just so we can easily see what's going on for the video. You can see we have our deploy directory, and down here we have deploy.tar.gz. This is the archive that the Maven Assembly plugin assembled, and then we expanded that archive in this directory. So deploy contains the Assembly plugin archive plus the decompressed contents of it. Whatever is inside deploy.tar.gz is decompressed here, so you can see we have bin, config and lib. In the bin directory we have a whole bunch of scripts. When we submit the Samza job to YARN, we use this run-job.sh. There are some other useful scripts in here; the one to kill a YARN job could be useful. But run-job is what we will use in the first instance. Then, in our config subdirectory of deploy, we have our property values that configure our Kafka streams. Then there is lib.
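The deploy layout described above, one compressed archive plus the same contents unpacked next to it, can be imitated with Python's standard `tarfile` module. This is only a sketch of the end result that the two Maven plugins produce, not of how they work internally; the directory and file names passed in are placeholders.

```python
import tarfile
from pathlib import Path


def package_and_unpack(staging_dir, deploy_dir, archive_name="deploy.tar.gz"):
    """Pack staging_dir into deploy_dir/<archive_name>, then unpack it in place.

    Afterwards deploy_dir holds both the archive and its expanded contents.
    staging_dir and deploy_dir must be different directories.
    """
    staging = Path(staging_dir)
    deploy = Path(deploy_dir)
    deploy.mkdir(parents=True, exist_ok=True)
    archive = deploy / archive_name

    # Step 1 (the assembly role): bundle bin, config, lib, ... into one archive.
    with tarfile.open(archive, "w:gz") as tar:
        for child in sorted(staging.iterdir()):
            tar.add(child, arcname=child.name)

    # Step 2 (the copy-and-unpack role): expand the archive beside itself.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(deploy)
    return archive
```

Keeping both forms around matches the reasoning given above: the whole archive is what gets shipped, while individual unpacked files are read directly at run time.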
If we go back to the IDE and I open up lib, you can see that all the dependencies that would normally be in an uber jar are in lib. Everything that's in bin, config and lib is also packaged up in deploy.tar.gz. Now we'll just look at the XML code that makes that work. This is the XML code for the Maven Assembly plugin. It's in src/main/assembly, in src.xml, and this is how we drive the Assembly plugin flow. The Assembly plugin syntax we use here is very simple. We have an id for the distribution, and we have a format. We're using the Linux tar.gz, which makes sense if we're deploying to a Linux file system. Include base directory: we don't want to include the base directory, which is everything. Then we can include the readme if we want, and licenses also. Now we declare a list of files that we want to include in our archive. This is the syntax for including a file: we have the source of the file and the output directory where the file goes. Here's a log4j configuration file that we're putting in the lib directory, where all the jars go. And this file is a Kafka stream configuration file, the Twitter parser properties file, which we're putting in the config directory. So this syntax is very simple. When we work with the Maven dependencies, the syntax becomes slightly more complex. We have an output directory for a dependency set. What's happening is that we're pulling this out of the Maven dependencies, and this is one type of dependency. We declare a dependency set, we set an output directory, and now we look inside the Samza shell jar. If we go into the build file and go down, we will see we have a Samza shell dependency, and notice it has the distribution classifier and the type. Because it's set up like this, we can pull out these scripts here; they are all coming out of the Samza shell jar. So that's the syntax, and it's getting all of them. Then we set the permissions.
So the scripts have runtime execute permissions. And do we want to unpack them? Yes. These dependencies live inside this jar: Maven goes to Maven Central and downloads the jar for us, and then the Assembly plugin grabs that jar, understands that it's a distribution jar, pulls out of it what we want and sets it up just how we want. So that's one type of dependency, and that is the runtime scripts that end up in our bin directory. Then there's another type of dependency, which is a runtime classpath dependency. This goes in lib, and it's really pulling in all the jars. If we look inside the lib directory, we should be able to find the samza-api jar; here it is, samza-api. This dependency set will pick up the jars on our classpath that are pulled in by Maven and put them into the lib directory. So one dependency set is for the scripts that come out of the samza-shell jar. Then we gather all the jars, including our own jar that we make when we build our own code, and put them into the lib directory. But we have many, many more jars in the lib directory, and that's because these jars all have transitive dependencies. This is why we have these exclusions: we are pulling in so many jars as transitive dependencies. For example, samza-shell has samza-api as one of its dependencies, so the chances are that we could end up with another version of samza-api on our classpath, because it will be inside the lib directory. So we need to exclude all versions of samza-api that are pulled in transitively by samza-shell. Ditto for the other exclusions here; they are all there for the same reason, to keep a handle on the transitive dependencies. The other important thing to note is that when the code is running on the cluster, at run time we need to have all the jars that the runtime code needs on the classpath.
We do that by pulling them in at build time with Maven and then putting them into the lib directory with the Assembly plugin. The other plugin that plays a role here is the Antrun plugin. It grabs the archive that the Assembly plugin built, renames it to a shorter, more useful name (we've chosen deploy here) and copies it out of the target directory into the deploy directory. It also unpacks it, so it untars the archive file. So the Assembly plugin grabs those scripts and jars and packages everything as an archive, and then the Antrun plugin finds that archive, renames it to a friendly name, puts it in the deploy directory and then untars it, or decompresses it. So we have two versions of the artifacts: one compressed inside here, and one exploded, or decompressed, in this directory as well. We need to put a bit of work into our build configuration, but once we have our Kafka streaming properties set up and we run this build, we're ready to configure YARN. The YARN configuration is mainly in the bin directory, with the dependencies for YARN in the lib directory, and the Kafka streaming is configured in the conf directory in these property files. So when we submit a job to YARN and run the Kafka streams, everything will come out of here. We do a bit of work here, but then there's less work to do when we submit the job to YARN. Now we will step through the syntax of the twitter-parser.properties file and see how it configures Kafka to work with YARN and our source code. Firstly, we create these files in the resources directory, the Maven resources directory, and they get mapped across to the deploy directory. So when we make changes, we must make the changes in the resources directory and then build for the change to turn up in the deploy directory. We start by declaring our job factory, which in this case is the YARN job factory from Samza, and the name of the job.
The name of this job is twitter-parser. Then comes the path to our archive created by our Assembly plugin. Then we declare the task; this is the Twitter stream parsing task we went through before. This is where we are configuring the Kafka stream to come into our Samza code. We will create this kafka.tweets stream. Then there is the replication factor. If we were running on a normal cluster, we would have more than one. This is like your Hadoop replication: how many copies. Normally it would be three, but because we're running on a single node we're forced to set it to one. This is the address for ZooKeeper. ZooKeeper has a socket listening on this port, so we need to hook into that to run correctly. We're going to run on a system where we use the Samza build script to configure the cluster. If we were using a different system, for example Hortonworks, it listens on the same port but with a different host name. We also need to connect to what are known as Kafka producers and consumers. We have a producer here that's listening on this port. This is the syntax for configuring Kafka with YARN. Our Kafka streams will be streaming Twitter data using Python scripts, just as we did in the last module, and we can find those Python scripts in the Python scripts directory. The one that we use is rawstream.py. If we look at that now, the source code is the same; the only change is that we use the whole JSON object. We pull in the JSON data from Twitter and do a null check on the text field: if it's not null, we push the entire Twitter JSON object into the stream. We set it up with the same credentials that we used in the last module. In the deploy directory we have a bash script called grid, which we will look at now.
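Going back to the null check in the raw-stream script described above, it can be sketched roughly as follows. This is a minimal illustration, not the course's actual script: the function name forward_tweets is hypothetical, and it reads already-received JSON lines rather than calling the Twitter API, so no credentials or tweepy calls are shown.

```python
import json


def forward_tweets(lines, out):
    """Write every tweet whose 'text' field is not null to `out`.

    `lines` is any iterable of raw JSON strings (one object per line);
    `out` is any object with a write() method, for example sys.stdout.
    Returns the number of tweets forwarded.
    """
    forwarded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip anything that is not valid JSON
        if not isinstance(tweet, dict):
            continue  # only JSON objects can carry a 'text' field
        if tweet.get("text") is not None:
            # Push the entire JSON object, not just the text field.
            out.write(json.dumps(tweet) + "\n")
            forwarded += 1
    return forwarded
```

Called with sys.stdin and sys.stdout, a function like this could sit in a shell pipeline in front of a console producer, which is the shape of the pipeline used later in this topic.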
This comes from the Samza source, but we've modified it so that the code will work on a Debian box; by default it's looking for a Red Hat box. We are using a Java 7 Oracle build; we can't use Java 8. We are setting up some bash variables, then we are downloading Kafka, YARN and ZooKeeper, and then we install the binaries. So we have these bash functions. One is bootstrap, which will do the full install. Then we have start all and stop all. Start all will start ZooKeeper, YARN and Kafka. Stop all is very important: when we have finished, we stop Kafka, YARN and ZooKeeper, because if we just exit a box with Kafka, YARN and ZooKeeper running without shutting them down, then that will break all our stream configuration. So when we go on to a box to start running this code, we're going to have to use this script to download and install Kafka, YARN and ZooKeeper. We're going to work with a really basic Ubuntu box with Java 7 and Maven, and we're going to run this script to install Kafka, YARN and ZooKeeper on that Ubuntu box. The reason we have chosen Ubuntu is that you run into issues with permissions on the system temp directory, and it's easiest to get around those permissions in Ubuntu; trying to get around them in CentOS is much more difficult. So we want to use Ubuntu, and we've modified the Samza script that installs Kafka, YARN and ZooKeeper so it will work with Ubuntu. This is what my home directory on the Ubuntu virtual machine looks like. I have Java 7. I pulled the source code from the GitHub link. I've installed the tweepy package, so I still have the artifacts from the tweepy install. I didn't need to install Python 2.7.6, because that comes by default with Ubuntu 14. And I have my grid script that I've copied out of my source directory, as well as the Python script rawstream.py. So we'll have a look now at the system variables.
These are what I've got set up on the box before we create our first Kafka stream. Typing printenv in a terminal on the box, you see I have the standard Ubuntu 14 system variables. The only variable that I have set up that's different is JAVA_HOME: I have an explicit JAVA_HOME variable declared in my .bashrc. That is the only configuration for environment variables. Now we run the script that installs YARN, Kafka and ZooKeeper. Because it's an install script, we will need to run it as root. So in the home directory, in a terminal, we run this command. It's pulling down the tarballs, and here we can see ZooKeeper, Hadoop and Kafka. It installed quickly because I have previously run this script, so when you run the script it might take longer. We have modified the script from the original code so it will work with Ubuntu, because it uses different system variables when it runs on Red Hat. You can ignore these messages. So now it has pulled everything down. If we go to the file system, into home, you will see we have a directory called deploy, and in here we have Kafka, YARN and ZooKeeper. We also have a conf directory where we have a YARN configuration file called yarn-site.xml. When we see that, we know that we have the correct setup for the code to run. So now we have the dependencies we need to run YARN, Kafka and ZooKeeper. If you get any permission problems, then you need to make sure that when you are running the code it has permissions to write inside the home directory. It should have permissions to write inside the home directory when you run as root, but if not, you might want to change the permissions for the home directory if you get permission exceptions. Now that we have installed YARN, Kafka and ZooKeeper on the box, we can move on in the next video segment to create and run our first Kafka stream. Now we have installed YARN, Kafka and ZooKeeper.
We can start to create Kafka streams, and what we'll do is push data from Twitter into a Kafka stream using a Python script from the last video segment. What we'll do now is create a new topic, tweets. This will be the name of the stream: tweets. This is what we will put into the Linux terminal to create the stream tweets. If we look here, we can see we've got the top-level home directory, then deploy, then kafka, then there's a bin, and then there's a Kafka script. So let's have a look at that now. If we go to the virtual machine, this is the home directory. Here I have my user home directory. If I go in here, you can see I have my source code that I pulled from the GitHub link, I have some artifacts left over from when I installed the tweepy package into the Ubuntu Python 2.7, and I have my Python script that pulls the data from Twitter; it uses the credentials I set up on the Twitter console in the last video segment. If we go up one level, we're in the home directory, and now we have the directories created when we ran our bash install script. If we look inside the deploy directory, we have our home directories for Kafka, YARN and ZooKeeper. If we look inside the kafka directory and then inside the bin directory, we can see we have these scripts. The windows directory here holds the Windows batch-file versions of these scripts; it is not related to windowing functions, which are what allow you to look into a stream and take slices of a stream. We have the shell script that we are calling: we're calling the Kafka topics script. So if we look at that one now, here it is: this is the kafka-topics shell script. Now let's look at the syntax for how we create this stream tweets. We call the script, which is the topics script. Then we use the zookeeper flag: we bind to the ZooKeeper socket on this domain, on this port.
Then we create our topic and give a name to the topic. We set the partitions and the replication factor. Now let's break this down a bit. Firstly, why are we using ZooKeeper in this way? Remember that in our discussion we talked about having YARN containers running in a cluster. Here we have the resource manager and the application master for our application, and we have our data nodes with their YARN containers. These containers need to be managed because they are parallelizing the job, and that is the job of the resource manager and the application master. When we're thinking about the streams, which are queues, they are parallelized in the same way. We have many copies of a stream running on nodes, and they need to be managed; the queues, or the parallelized streams, are managed by ZooKeeper. The best way to think about a topic when you're beginning Kafka is to think of a topic as a type of data. Here we have data coming from tweets, which is JSON, so we can think of that as being a type: JSON Twitter data. The topic is a type of data. If we were to work with some different sort of data from Twitter, it might be XML, for instance, then it might be a good idea to create a new stream, because we've got a new type of data. The partitions and the replication factor are where we are parallelizing; this is to do with the parallel nature of the stream. Because we're running on a cluster that has a single node, we have to set them to one. So to break this down: when we have a particular type of data, we create a topic for it, and that's really a stream. That stream needs to be managed by something, and ZooKeeper is what manages it. So we need to plug it into ZooKeeper, we need to create it and give it a name, and then we need to set up its parallel runtime attributes. That's the syntax for creating a stream in Kafka. Now some basic issues before we run the scripts. The first issue is that we are creating our own environment.
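Going back to the topic-creation syntax walked through above, the pieces of the command can be laid out as an argument list. This is only a sketch: the helper names are hypothetical, and the script path and the ZooKeeper host and port shown in the usage note are placeholders, since the transcript does not give them. The flags are the ones the topics script accepts in ZooKeeper-era Kafka releases such as the one used here.

```python
def topic_create_args(script, zookeeper, topic, partitions=1, replication_factor=1):
    """Build the argument list for creating a topic with the Kafka topics script.

    script             -- path to the topics shell script (placeholder, set for your box)
    zookeeper          -- "host:port" of the ZooKeeper socket to bind to
    topic              -- name of the topic, i.e. the stream
    partitions         -- how many partitions the stream is split into
    replication_factor -- how many copies of each partition are kept
    """
    return [
        script,
        "--zookeeper", zookeeper,
        "--create",
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]


def topic_describe_args(script, zookeeper, topic):
    """Build the argument list for describing an existing topic."""
    return [script, "--zookeeper", zookeeper, "--describe", "--topic", topic]
```

For the single-node setup described here, topic_create_args("deploy/kafka/bin/kafka-topics.sh", "ZK_HOST:ZK_PORT", "tweets") keeps both partitions and replication factor at one; the resulting list could be handed to subprocess.run once the placeholders are filled in.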
Here we have a blank Ubuntu virtual machine, and we've run a bash install script that has configured Kafka, YARN and ZooKeeper. There's a lot of work in correctly setting up a YARN Hadoop cluster, and we haven't done anything like the amount of work that's needed to get a totally robust system in place. As a result, we could run into some issues. One issue that we can run into is permissions. If we are running into permission errors, then we can get around a lot of those permission errors by running as root. The second issue, which we will demonstrate now, is with the environment variables. We have JAVA_HOME configured as an environment variable, but when we first run this script, we're going to get a JAVA_HOME not set error. So we'll show that error and we'll show you the solution. This is typical of the sort of thing that you will need to do, because depending on your environment, you could have many different types of configuration errors that you need to solve before the code will run. We need to start the cluster daemons, YARN, ZooKeeper and Kafka, and we do that by running our bash script with start all. When we run this, we will see the Java environment variable error. You can see here: JAVA_HOME could not be found. If we get a clean terminal, try printenv and look to see whether we have JAVA_HOME configured, you can see that we do have JAVA_HOME configured as an environment variable. But our bash script is complaining that JAVA_HOME is not set and could not be found. One way to solve this is to run a file browser as root with sudo nautilus, then navigate to the top-level home directory, go into deploy, go into yarn, go into etc, go into hadoop, and then look for the Hadoop environment script. Now we open this up with root privileges in a text editor. Just scroll down and we can see where it's asking for JAVA_HOME.
Here it's looking for a system-defined JAVA_HOME directory, and for some reason it's not finding that JAVA_HOME environment variable, even though we do have it set in the system. The easiest solution is to just set it directly here. Rather than worrying about finding it from the environment, let's just set it to what it should be in this script. We save this and go back to our shell. Make sure everything is shut down before we try to run it again: we can issue the stop all command with sudo to shut down, and then start up with our new environment. So what we'll do is try to start our daemons again. Now we're running exactly the same command and everything should come up well. We're starting YARN, and you can see it's got further, and we don't see that JAVA_HOME not set error. So now we have our daemons running correctly, and we're in a position to create the tweets stream. Now that everything's running, we just copy this into the shell, because our daemon threads are running, and run this script, and you can see it has created the stream tweets. That time we didn't need to be root, because the script runs against a daemon thread that was started as root. Now we copy this command to describe the tweets stream: we copy that command and paste it into the terminal, and it will describe the tweets stream. Here we can see a description. It has a partition count of one, because we're on a single-node cluster, and a replication factor, which defaults to three; we needed to set it to one because we only have a single node. And the topic is tweets. So this information describing the tweets stream just confirms it was created successfully. With that done, let's move on to the next difficult thing, which is to get our stream streaming some actual data from Twitter. This is how we get data into the stream. First we pull the raw Twitter data in JSON format from Twitter via a Python script.
Then we pipe the output into a producer. Kafka uses a pattern where you have producers of data that push into a stream and consumers of data that read out of the stream. There are other pathways as well, but they're the two basic entry pathways: you produce data into the stream and you consume data from the stream. So here we are piping our raw Twitter data into a Kafka producer; in this case, it's a console producer. Recall that the broker is like a Kafka message server. Because we're running in a parallelized system, we can have more than one node acting as a server. So we have a broker list, and we bind to the broker list with this domain and port. Then we bind to a declared topic that we created previously, tweets. With our daemon threads running and our newly created stream in place, we copy this and paste it into a terminal in the machine, and now it's running. Sometimes we will get an error here. If the cluster daemon threads were not shut down properly, the stream can sometimes be corrupted, and we'll show you how to deal with that. This time, because I previously shut down the cluster correctly, the stream is working. So I'm pushing the data into the stream, and we can consume it out. We open up a new terminal, and now I consume the stream data with this command. When I issued the previous command, I pushed the data with Python into a console producer, so I was using the producer from the producer-consumer pattern. Now I'm going to use the consumer from that pattern, and I'm going to consume the data in this stream with a console consumer. This is the syntax. I copy this in, and we should see the streaming data piping through. There we go. Each one of these is a tweet; this is the JSON, the entire JSON, for a tweet. So now we are consuming the streaming data: in this terminal we are producing the streaming data, and here we are consuming it.
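The producer and consumer pattern just described can be pictured with a small in-memory model. This is a toy sketch with hypothetical class names, not the Kafka API: it only shows that producers append messages to a topic's log, and that each consumer reads from its own position in that log.

```python
class Topic:
    """A named, append-only log of messages (a stand-in for one stream)."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def produce(self, message):
        """Append a message and return its offset in the log."""
        self.log.append(message)
        return len(self.log) - 1


class Consumer:
    """Reads a topic from its own offset, independently of other consumers."""

    def __init__(self, topic, offset=0):
        self.topic = topic
        self.offset = offset

    def poll(self):
        """Return every message appended since the last poll."""
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return messages


tweets = Topic("tweets")
tweets.produce('{"text": "first tweet"}')
tweets.produce('{"text": "second tweet"}')

reader = Consumer(tweets)
for message in reader.poll():
    print(message)
```

Because the log is kept rather than emptied on read, a second consumer created later still sees both messages. In the real system that log lives on disk, which is why a corrupted log directory matters in the discussion that follows.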
So a Kafka stream is set up. Now, quickly, the issue with the corrupted stream. I shut down this consumer, and now I will shut down this producer. What we do next is go into a file browser, and this is an important issue. We go to the system temp directory, and you can see here I have these directories, and one of these will be a kafka-logs directory. Here it is. I go into here, and this is where the tweet data is being persisted. If I look at this log here and look at its properties, you can see it's 23.4 megabytes. That's where the Twitter data is ending up in the stream; it's getting persisted there. So if the stream becomes corrupted, what we need to do is delete this kafka-logs directory, and we need to have permissions to do that. This raises another issue. We will run into all sorts of issues if our YARN, ZooKeeper and Kafka daemon threads do not have all the permissions to read and write in this temp directory. It's easy to give them those permissions, though it will be slightly different on every build of Ubuntu. Basically, you might need to chown the temp directory so that you end up with the correct permissions, so your daemon threads can read and write in this system temp directory. If you get an error saying your stream is not working, you just delete this kafka-logs directory, recreate the topic just as we did before, and then you will be good to go. We'll just review now the syntax and the conceptual framework of the work we have just completed. Firstly, we have these three scripts we work with that can be found in the Kafka bin directory: the topics script, the console producer script and the console consumer script. So we work with those three scripts in the first instance. Now a stream is a topic, and this is all part of the messaging conceptual framework.
If you have worked with other messaging frameworks such as RabbitMQ, then you will understand that when you have brokers, you publish to topics. When you subscribe to a messaging system, you subscribe to topics, and when you publish to a messaging system, you push to topics. Kafka is a streaming API first, but it's inside this broker message architecture, so we need to understand that at a high level. A stream is a topic inside this messaging conceptual framework, where you have an architecture in which messages are coming via message brokers. Then, because we are in a parallel framework, we don't just have one broker, we have many. So we have a list of brokers when we're running on a real multi-node cluster, and we push our stream data, which are messages, to a list of these brokers. The daemon threads must be running; obviously it won't work unless they are running. If these threads crash or you do not shut them down correctly, then the chances are that your topic, or Kafka stream, can become corrupted. If that happens, all you need to do is delete this temp kafka-logs directory, recreate the topic and start again. One of the issues is that your daemon threads need to have full read and write permissions in this temp directory, so you may need to chown this directory if you are getting permission errors.

5. 5 Real Time Stream processing with Apache Kafka and Apache Storm: The topic, as described, is real-time stream processing. What do we mean by this term, real time? Think in this case of IoT, which stands for the Internet of Things. Here we have an industrial and logistics process, so we have a number of sensors. In the cars we could have GPS; in the industrial processes we could have all sorts of safety and other types of machine sensors. These sensors are sending data via the Internet in real time, so there's a continuous stream of this data.
One of the protocols used by these sensors is the messaging protocol we looked at in the last topic, where we have message brokers. Here we have a data broker; this is really a message broker. These sensors have topics, and they are subscribing and publishing to topics. So that's what we mean by real-time streaming data: a messaging system with topics, all these distributed sensors continually publishing data, and then analytical and other types of systems that are subscribing to the topics the data is being published on. Now we need some process for analyzing that data. The analytics, the queries and the analysis need to be happening as the data is streaming, so we need some way of interacting with and analyzing a stream of data when it initially comes in. In addition, we need some way of organizing the data. We need some sort of pluggable architecture to be able to group the streams together. We might want a subgroup of streams where we have the crane, the engineers in the car and the fuel truck grouped together for some resupply analysis. Then we might want to change that so that we have a different grouping of the sensors, a different grouping of streams. This is what we mean by topology. We have this network where we have streams and topics, and then groups of subscribers and publishers to different topics, and we want to be able to change the way those groups interact. This is what we mean when we talk about the topology of a stream. So not only do we want to work on analyzing the stream, we want to work on the topology of a stream. This is a key concept. If we go to this link here, we can see one leading company in the telematics industry and their streaming real-time analysis architecture.
They're talking about, in addition to the standard analytics tool set, which we can think of as being queries on sets of data, machine learning and real-time modeling analysis using Apache Storm. In the real-time modeling analysis they're using Apache Storm because of a unique feature of Apache Storm, and that unique feature is the ability to reconfigure the architecture, or topology, of a stream. When we're talking about reconfiguring the architecture, what we mean is changing the groupings of the streams, changing the topology of a stream. It's Apache Storm that gives us that particular ability. In the last topic, we looked at what we spoke of as the messaging architecture. This is a simple but highly effective architecture, where we have message producers and message consumers. The producers publish to a topic, and that topic is pushed to a message broker, so the message broker differentiates all the data coming in via topics. Then the consumers subscribe to the topic, so the consumers can pull a particular set of data via the topic. Then we saw that Kafka allowed us to use a Hadoop YARN-enabled cluster to parallelize this architecture, so we could have many producers, many consumers and many message brokers that maintain this particular architecture but parallelize it across many nodes. That allows us to scale. We also saw we could use the emerging technology Apache Samza, and the patterns inherent there, to build and deploy these distributed streaming applications. We will now create a topological conceptual framework, where the two artifacts we will concentrate on are the idea of a consumer group and the idea of a producer group. Going back to our original architecture, here we have isolated producers and isolated consumers working with topics and a message broker. Then we parallelized this, but we still have an isolated producer pushing to a parallelized message broker structure and isolated consumers consuming from a parallelized message broker.
Now we change the architecture, so that we have groups of producers. Here is one producer pushing to another producer, which then pushes to the topic, and here we have one consumer consuming a topic, and then a second consumer consuming the topic after it has been consumed. So now we have a group of producers and a group of consumers. This is the key concept that we need to take on. This is a common Kafka architecture. Our producers are interacting with ZooKeeper and the parallelized Kafka broker seamlessly behind the scenes, and what we end up working with are these consumer groups. Now Storm is an architecture for simplifying this, where we have the idea of a spout. A spout is a producer group, and we have these bolts, and bolts can hook into each other and feed back their messages based on the analytical behavior. So we can see the analogy here: where consumer groups work with ZooKeeper and the broker, in Storm we have these bolts interacting. The spout produces the stream, and the bolts are where we have the analysis. In Storm we have the idea of a Kafka spout: we can create a Storm spout, but we can also have a Kafka spout. That's how Kafka hooks into Storm. So our producer becomes a Kafka spout, the consumer groups are these bolts, and we can group the bolts. Let's get started with the code. Firstly, we have the source code at this GitHub link. If we go there now, this is the source code on GitHub, which we pull onto our development virtual machine. Don't try to build this code on Windows. Here I have the code, but I'm not going to try to build it on Windows; I need to build it on Linux in a virtual machine, and we'll go through all the steps. This is the package structure. Here we have a very simple package structure where we have a grid configuration class and a real-time event processing topology class, which is our Storm code, and the real-time event producer, which is our Kafka code.
In this code, the grid configuration class configures the runtime environment in terms of its coordinates, the real-time event producer creates the Kafka stream, which we pipe into a Kafka spout, and the real-time event processing topology shows how we can work with all the bolts. So our real-time event topology is where we will be working. Inside our real-time event producer, we produce a stream and pipe it into a spout. Inside the real-time event processing topology class, we take the Kafka stream from the event producer class, put it into a spout and then pipe it into our bolts, where we will do our analysis. Then we need to build the code. There's a complex dependency set and a complex build flow, which we'll go through. There are three distinct build flows. Firstly, we have a build for local mode, and this is for testing and development. When you actually run your Storm analytical code, it's quite hard to get good debugging information, so we need to test our Storm code before we deploy to the cluster. We do that in local mode, and we develop in local mode. Then there is a build flow to produce a Kafka producer artifact; we are firing off our analytics from a Kafka producer, which we pipe to a Storm spout. The final build flow is when we build our Storm analytic artifact. So we build for local mode to test and develop, we build a producer jar, and we build a Storm jar. That's what it breaks down to. So let's look at the build code. This is our build file, our pom.xml, and here we are declaring project dependency versions. The versions and the relationships between the dependencies are quite complex, so we need to hook into the Hortonworks repository and use these exact versions for the code to run. Now the build profiles. We have a producer profile, which builds the Kafka jar. We're compiling the source at the 1.7 level. We build our uber jar with our Maven Shade plugin. We declare our main class.
This is a CAFTA real time event producer and then we haven't Enron Plug in which copies and renames our producer jar to the staging directory. So produce a jar is p one because this is our first producer. Then we have a profile called Spout, and this builds to storm artifact again. We compile it the 1.7 source level usar shade plug in. We declare the main class with a shade plug in to be a real time event processing topology class reuse, Ian run plug in again and we called. It's sp one, cause this is our first spout artifact. Finally, we have a main build, which is foot local development and testing. And this bill has the exact plugging and we used the exact plug in to Dr Storm so we can run the code in storm local mode and that's where we'll start with the source. The real time event processing topology class is our storm curd, and we just walk through at a high level. So firstly, we have a run on cluster bullion which is currently set to false, because when we first start to write the card, we want to develop in storm local murder, because debugging on Acosta is very difficult and this is just a simple main method class. And if the run on cluster is false, it's going to set up a local configuration. It's going to set up a local storm cluster and run the code locally and these methods we use to configure the code for when it runs on the cluster. When we are running in local mode, we build out apology in this code block. Here in the main method, we have to ports. Here we have a real tom event, local boat, which we use to develop locally, and we have a real time event CAFTA Stream Boat, which is the boat that we use on the cross that So when we run on the cluster, we don't use the local boat. Then we have a scheme that works with her real time event. Locals spout. Now we have this local spout for real time events. But then we also need to be able to pipe in the calf. Could spell that's walked into the calf strain for a strong coat. 
Everything starts with a topology builder object, and then we must decide: are we going to run it on the cluster? If we run on the cluster, we use a Storm submitter to submit the topology we created with our topology builder object. If we are running in local mode, we create a local configuration and then we set up our local topology. Now, the idea of our local topology is to test and develop the topology we will eventually submit to the cluster. When the code runs on the cluster, we will have a real-time event Kafka stream bolt that will act on the real-time Kafka stream. We have a local spout; it creates the same data as the real-time Kafka stream. Then we have a local bolt, and that will log the output of the local spout so we can check it's the same data as the cluster Kafka stream. We output from the real-time event local bolt into the real-time event Kafka stream bolt. So we can log the data from the local spout, check it's the same data as the Kafka stream, and then output from that local bolt to the real-time event Kafka stream bolt to check that that bolt will work on the real-time data. So the idea of this local code is to reproduce the data streams that the cluster code will be working on. Let's look at how we do that. The spout has an ID, local spout, and that's going into a bolt, which is the local bolt. It's hooked into the local spout ID with a fields grouping on real time event. So this is the input field; the local bolt expects this field coming in. We've then called this bolt log. Then we have a second bolt, which is our cluster bolt that we want to test, and we are grouping it on log. So we're wiring up the input for the cluster bolt on the log ID of our local bolt, and the input fields that it wants are these fields here, which we declared in the real-time event scheme. Here is our local bolt. It's outputting these fields, the real-time event scheme fields. So this is what our Kafka bolt receives when its data comes in.
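To make that wiring a little more concrete, here is a rough sketch in plain Java. It is only a model of the chain just described and does not call the Storm API. The IDs local spout and log and the field real time event come from the walkthrough; the class name, the ID kafkaStreamBolt and the scheme field names eventKey and eventValue are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the spout-to-bolt wiring; it does not call the Storm API.
class LocalTopologyWiringSketch {

    // One node in the chain: the id it reads from (null for a spout),
    // the field it is grouped on, and the fields it emits.
    static final class Component {
        final String sourceId;
        final String groupingField;
        final List<String> outputFields;

        Component(String sourceId, String groupingField, List<String> outputFields) {
            this.sourceId = sourceId;
            this.groupingField = groupingField;
            this.outputFields = outputFields;
        }
    }

    private final Map<String, Component> components = new LinkedHashMap<>();

    void setSpout(String id, String... outputFields) {
        components.put(id, new Component(null, null, Arrays.asList(outputFields)));
    }

    void setBolt(String id, String sourceId, String groupingField, String... outputFields) {
        components.put(id, new Component(sourceId, groupingField, Arrays.asList(outputFields)));
    }

    // Collects wiring problems: a bolt that names a missing upstream id, or that
    // groups on a field the upstream component does not emit.
    List<String> findProblems() {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Component> entry : components.entrySet()) {
            Component c = entry.getValue();
            if (c.sourceId == null) {
                continue; // spouts have no upstream component
            }
            Component upstream = components.get(c.sourceId);
            if (upstream == null) {
                problems.add(entry.getKey() + ": unknown source id " + c.sourceId);
            } else if (!upstream.outputFields.contains(c.groupingField)) {
                problems.add(entry.getKey() + ": " + c.sourceId + " does not emit " + c.groupingField);
            }
        }
        return problems;
    }

    // The local chain: local spout -> log bolt -> cluster bolt under test.
    // "eventKey", "eventValue" and "kafkaStreamBolt" are hypothetical names.
    static LocalTopologyWiringSketch buildLocalChain() {
        LocalTopologyWiringSketch topology = new LocalTopologyWiringSketch();
        topology.setSpout("localSpout", "realTimeEvent");
        topology.setBolt("log", "localSpout", "realTimeEvent", "eventKey", "eventValue");
        topology.setBolt("kafkaStreamBolt", "log", "eventKey");
        return topology;
    }

    public static void main(String[] args) {
        System.out.println("Wiring problems: " + buildLocalChain().findProblems());
    }
}
```

The point of the model is only that each bolt is attached by the ID of the component upstream of it, so a mistyped ID or a field the upstream never emits shows up as a wiring problem before anything runs.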
It's looking for these fields. The local bolt is hooked up to the local spout, and its fields are indeed real time event. Everything is wired up. So we have an ID for an artifact, and when we have artifacts that chain, we chain them via their IDs. We declare the ID of the spout and hook in with the fields grouping, and the type of field that the local spout is outputting is our input. Similarly, for our local bolt, we give it an ID of log, and then for our cluster bolt, we set it to be grouped on the ID of the local bolt, and its input fields are the output fields of the local bolt. So that's how we can recreate locally the topology that will exist at run time when the code runs on the cluster. And that's what we'll move on to now: setting up the cluster environment, because we'll need to run the code locally on the local file system of the cluster, because it won't run on Windows. We use the Hortonworks virtual machine running in VirtualBox. It's a cluster environment. This is the machine. I've logged into the machine through an SSH file browser, and here I am in the top-level directory. Some important points to note. If we go to the usr directory, then find the hdp directory, then inside 2.2.4, this is the installation for all of Hadoop and its ecosystem. So here we have a kafka folder, and inside kafka, inside bin, are all the scripts that we need when we run Kafka. Inside hadoop, inside the etc directory, we can find a lot of the configuration files, so that's very important. Another important directory is the root directory. Inside the root directory, we will see that we have start Ambari. The start Ambari script is what we will use to configure Storm and Kafka. We will need to do that when we move on to running on the cluster. But first, we will pull our source code, and build and test locally in the file system. Here I have an SSH terminal.
So I've SSH'd into the machine using root, with hadoop as the password. Here we can see the start_ambari.sh script, so I run that with this command here, and now it's starting Ambari. It can take a couple of minutes, depending on how powerful your processor is. Once Ambari has booted up, we can log into Ambari with this URL, and the username and password are admin. So let's do that now. We type admin, admin, and we're logging into Ambari. You can see here I have Storm and Kafka. If I look at Kafka and I go here, I can see that I can start Kafka, so Kafka is not running, and we'll leave it like that for now. And the same with Storm: if I look at the service actions, I can start Storm, so Storm is not running. But everything else is good. Now, it's quite possible when you run the machine for a while that you will get red here, that there will be errors in HDFS and YARN, and we'll just look at one possible fix for that. Go to the system temp directory. You can see there's a lot of stuff here in the system temp directory. Quite often, if you clean out this directory, it will fix any errors you may have in these services, HDFS and YARN. But basically, HDFS and YARN are up, Storm and Kafka are stopped, and later, when we're ready to run everything on the cluster, we will start these services. For now, we will move on to local development and testing. First, we're going to pull the source code onto the virtual machine with this command here. So we paste that in and execute that command. If we look in the SSH file browser, we can see we now have the project directory in the root directory, and it contains our source. Before we can build, we need to install Maven on the virtual machine. On older CentOS, Maven version 3.0.5 works very well. Copy this into the virtual machine.
So to run through that, first we change into the directory and we paste the link in, and you can see it's downloading the version that we want, 3.0.5. In the file browser, going to the root directory, you can see we have this apache-maven-3.0.5 folder. What we're going to do now is just rename that to maven, because that makes our path simpler. Now, inside here we have the bin directory. So when we set our Maven home variable, we point to the directory that's one directory above bin. When we set our path, we set the path to that bin directory. So what we need to do now is, in our .bashrc file, first set our Maven home variable and then set our path variable. So .bashrc, and there it is. We just go down here, make some space, and now we're going to copy in these variables and paste. Now hit Escape so we're out of insert mode and can't make a mistake. Good. So now we just type shift-colon, w to write, and shift-colon, q to quit. Clear the terminal with Control-L, then type source .bashrc to reset the environment variables. Then we try printenv to make sure that we have a Maven home, and here's our Maven home. So we've successfully set up Maven, and off we go. Control-L, and we type the Maven version command. We should see Maven come up and report its version. So we've successfully installed Maven on the CentOS box. Change back into the source directory and issue the command Maven clean install. For now, when we just run Maven clean install, we are setting up our code to run in local mode. It's going to take a while, because it's got to pull a lot of dependencies. To build our code for local mode, we just issue Maven clean install. To run the code in Storm local mode, we need to issue this command, which has the fully qualified class name and the storm dot topology flag for Maven, and it runs the exec plugin. So in our build file, this is our exec plugin, and this will run our main class correctly in Storm.
Now, the fully qualified class name is always the package followed by the name of the class. So that's what we give to the -D storm dot topology flag. We drive the exec plugin with this flag for the Storm run time, with the fully qualified class name. And then, because we want to process a lot of the output, because this is a test, we pipe it into a file. We want to test that it will work before we go to all the trouble of running it on the cluster. Paste that in, and it's going to pipe the output to that test.txt file. So, running the command now, we just let that go through, and once it's finished we'll have a look at that text file. If we go to our source code, it's important to know that we only run the code for 10 seconds. So for 10 seconds we run a local cluster and then we shut it down. There's only a small amount of time that it runs, and we can see it has run. So let's have a look at our output now. In the root directory, in the SSH file browser, we go into the source code directory, and there's the output, and you can see it's 282 kilobytes. So we open that. What we have to discover here is: is our topology functioning? Is every link in our chain, every node in our topological graph of spouts and bolts, functioning and passing on what it should to the next node? Did our data flow through our topological chain? We'll go to our source code. Now, this is the end of our chain, the real-time event Kafka stream bolt. This is what will do the processing on the cluster when we run on the real cluster, and everything is there to support getting data to this bolt and then seeing whether this bolt will function. In particular, this bolt is going to look for these fields in the data tuple coming in. So if those fields aren't there, or they're not correctly set up with key-value pairs, you will get exceptions. Here we have this System.out.println. So what we can do is copy that string, go into this mass of test data and search for that string.
We found the first instance. I've started to copy the data from around here so we can examine the output, and it looks like we are getting through the chain to the real-time Kafka stream bolt. So our topology is functioning, at least in local mode. But we'll look at this data a little more closely. Now I've copied the data into a better text editor so we can get a handle on what's going on. Here I've found this local spout 3, and here I have this output: stream event legitimate 6. Now, if I go to my source code, this is my real-time event local spout, and this is emitting stream event legitimate and then a numerical count value. So that's what I have found here, stream event legitimate 6, and that's sourced from the local spout. We go down and find where it gets to the end of the chain, in the real-time event Kafka stream bolt. It's coming in here with these keys, using get string by field, and these are the keys to get a string from this tuple, because the tuple is like a map of key-value pairs, so I get a value from a tuple by giving it the key for that value. If we go back to the output, we can see we have made it to the real-time event Kafka stream bolt, and it is emitting the value that it received from the spout, so we can see our chain is operational. Now, you can learn a lot from this debugging output about the whole Storm flow and the topology structure, but that's sufficient for now. We can be sure that we have run the code successfully in Storm local mode in a local cluster. So we have successfully set up a spout, generated real-time events with that spout and then chained into a log bolt, which has set up the fields exactly like the real-time Kafka stream when it runs on the cluster, and it has emitted the fields into the real-time event Kafka stream bolt, which will run on the cluster.
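As a rough illustration of the point that a tuple behaves like a map of key-value pairs, here is a small plain-Java sketch. It is a stand-in for a Storm tuple, not the Storm class itself; the method name get string by field mirrors the call mentioned in the walkthrough, while the class names and the field name realTimeEvent are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a Storm tuple: values are looked up by field name.
// This is a plain-Java model, not the Storm Tuple class.
class TupleFieldLookupSketch {

    static final class SimpleTuple {
        private final Map<String, String> values = new HashMap<>();

        SimpleTuple with(String field, String value) {
            values.put(field, value);
            return this;
        }

        // Returns the string stored under the given field name, and fails loudly
        // when the field is missing, which is the kind of exception the bolt
        // would hit if the upstream fields were not set up.
        String getStringByField(String field) {
            if (!values.containsKey(field)) {
                throw new IllegalArgumentException("Field not found in tuple: " + field);
            }
            return values.get(field);
        }
    }

    // What the bolt at the end of the chain does in this model: read the field
    // it expects and pass the value on. "realTimeEvent" is a hypothetical field name.
    static String process(SimpleTuple tuple) {
        return "received " + tuple.getStringByField("realTimeEvent");
    }

    public static void main(String[] args) {
        SimpleTuple tuple = new SimpleTuple().with("realTimeEvent", "stream event legitimate 6");
        System.out.println(process(tuple));
    }
}
```

Searching the test output for the string the bolt prints is then just a way of confirming that this lookup succeeded for data that came all the way from the spout.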
So that's receiving the fields it will receive when it runs on the cluster, and it's functioning and processing those fields. So we know our topology is valid and will work. We ran a test in Storm local mode that ran our Storm topology, and we checked that every step in our topology was functioning. There are no exceptions, so we're confident our topology will work on Storm in cluster mode. So now we prepare to submit our code to Storm running on the cluster. There are five distinct steps, and each one of these steps must be completed correctly for the following step to work. Our first step is to build our Kafka and Storm artifacts. Now, in our local mode test, we simulated our Kafka stream with our local spout. For the cluster, we will create a Kafka stream, and we will create a Kafka producer jar, or artifact. We create that with our prod profile; prod stands for producer. With this profile, we are creating a jar that will be a Kafka producer, and that comes from the real-time event producer class. Here's the source code for the real-time event producer, which we will look at later. There are a lot of configuration steps that need to be done for this class to get it to work correctly as a Kafka producer. The Storm build artifact is the jar that we will submit to Storm running on the cluster, and we build this with a Maven profile. So, going to our source code, the name of the profile that builds the Storm artifact is spout, because spout and Storm go together. We are using the real-time event processing topology class and packaging that into a jar, which we call the uber SP1 jar. So we have a spout jar that we will submit to Storm, and we have a producer jar that we will run to make the Kafka stream. So those are our build artifacts for Kafka and Storm. Next, we need to log into Ambari and start Kafka and Storm on the cluster. Then, as we did in the last topic, we need to create a Kafka stream and a topic, real time event.
Then we run our Kafka producer jar to produce the real-time event stream; that's based on the real-time event producer Java class. Then, when we have the Kafka stream running and producing real-time events, we submit the Storm jar to Storm running on the cluster in cluster mode. So those are our five steps, and we'll start by looking at what's involved in setting up a Kafka producer with a Java class. Creating a Kafka producer with a Java class is fairly simple. You can see we're using the Kafka packages, and we create a broker list and a ZooKeeper host. This should be familiar from the last topic: we have ZooKeeper to manage all the parallel brokers in our broker list. As before, a stream is a message stream, so we must have a topic so that our brokers can differentiate the messages by their topics. Our topic is real time event: our producers create real-time events and our consumers consume real-time events. That's all managed by any broker in our broker list. Now things become different from when we worked with Samza. Firstly, we create an instance of a Java Properties object and we set up these key-value pairs for our properties. We have a key for our broker list URL, and we have a key for our ZooKeeper host. We have a key for a serializer, because these are byte streams and we need to turn them into strings, as we will see later. So we have a string encoder. These properties configure the stream. We then use these properties to create a producer configuration object, and then we create a Kafka producer object with that configuration object. And now we create some data to push onto the stream. This is the same data that we created with our Storm local spout for our Storm local mode test. Now we have this keyed message. This is the packet that goes onto the stream. It's a message with a key: the key is the topic for the stream, and the data is the string.
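The property setup just described might look roughly like the sketch below. It only builds the java.util.Properties object and a simple topic-plus-payload holder, so it runs without the Kafka libraries. The key names metadata.broker.list, zk.connect and serializer.class, and the value kafka.serializer.StringEncoder, are assumptions based on the older Kafka producer configuration rather than something shown in the lesson, and the host names and ports are placeholders.

```java
import java.util.Properties;

// Sketch of the producer configuration step using only the JDK.
// Key names and hosts are assumptions, not taken from the lesson's source.
class ProducerPropertiesSketch {

    // Builds the key-value pairs the producer configuration would be created from.
    static Properties buildProducerProperties(String brokerList, String zookeeperHost) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);   // assumed key for the broker list URL
        props.put("zk.connect", zookeeperHost);          // assumed key for the ZooKeeper host
        props.put("serializer.class", "kafka.serializer.StringEncoder"); // assumed string encoder
        return props;
    }

    // Stand-in for the keyed message: the topic identifies the stream and the
    // payload is the string data pushed onto it.
    static final class MessageSketch {
        final String topic;
        final String payload;

        MessageSketch(String topic, String payload) {
            this.topic = topic;
            this.payload = payload;
        }
    }

    public static void main(String[] args) {
        // Placeholder host names and ports; substitute the values for your own cluster.
        Properties props = buildProducerProperties("broker.example:6667", "zookeeper.example:2181");
        MessageSketch message = new MessageSketch("realtimeevent", "stream event legitimate 1");
        System.out.println(props.getProperty("serializer.class"));
        System.out.println(message.topic + " <- " + message.payload);
    }
}
```

In the real class these properties would be handed to the Kafka producer configuration object, and the message would be the Kafka keyed message type rather than this stand-in.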
We produce the payload for the message and then we push that onto the stream. So creating a Kafka producer is relatively simple for a Java class. Now, what's important to understand is that we're going to map this Kafka producer in Storm to a Kafka spout. In the next video segment, we will build our artifacts, start Ambari and move on with the extra steps to submit a Storm jar to run on the cluster. To run the code in Storm in cluster mode, we must first build our Kafka and Storm artifacts with our Maven profiles. First, we need to check that we've reconfigured the code from local mode to cluster mode. Inside our Storm class, the real-time event processing topology class, we must set our boolean run-on-cluster to true. In a terminal, issue a command like this. So go to an SSH terminal, where we've changed into the directory which contains our source code. Copy that command. Here you can see I've issued that command, and now I'm in the class. Scroll down to where my variable is. Now hit the i key to enter insert mode, go across and change it to true. Hit Escape, shift-colon w to write to file, shift-colon q to quit. I have now edited the file so that the variable for run-on-cluster is true inside the Storm jar. Everything's ready to go, because the Kafka code is ready to run on the cluster. We only needed to make that small change on the virtual machine to the Storm class. Now we're ready to build our Kafka and Storm artifacts on the machine with our Maven profiles. Recall that in our Maven build file we have two profiles. One profile is spout, the Storm profile. One profile is prod, the producer profile. The main class for the jar is inside the Shade plugin, in the main class attribute, and we can see the name of the jar in the AntRun plugin. So uber P1 will be the producer jar. Then uber SP1 will be the spout jar. The spout jar is the Storm jar and the producer jar is the Kafka jar. Now we're ready to build with our profiles.
To build with the producer profile, we just add at the end hyphen capital P and the name of the profile, which is prod, for producer. So copy that into the terminal. I'll just pause the video while it builds. We can see the build was successful and it copied one file into our staging directory on the machine. The name of the staging directory is uberjar, and in the build file we call that to deploy. So we deploy to a staging directory, and a staging directory can be one convenient location where you assemble all the build artifacts that you need for your application before deployment. When that's done, we'll clear the terminal. Now we copy our second build command. This will build the Storm jar. So the first one, with the prod profile, builds the Kafka jar. The spout profile builds the Storm jar, and we'll end up, if everything builds correctly, with two jars in our staging directory. I'll just pause while it builds, and we can see build success, and we've copied one new uber jar into our staging directory, which is uberjar. So if we go into the cluster file system in the SSH file browser, in the source directory, we go into our staging directory, uberjar, and we see we have two jars. Before we can copy the jars from our staging directory, uberjar, to the root directory of the virtual machine, we need to configure the read-write permissions. So we need to go, in the virtual machine, to the very top of the file system. We type ls, and we can see we are at the very top of the Linux file system. Then we issue this command: chmod, space, minus capital R, space, minus lowercase a, then the numbers 777 to give all permissions, and the name of the directory. Run that command. And now if we type ls, we can see that we have read-write permissions on our root directory. Now, for this to be effective, we need to update the SSH session. To do that, we need to completely log out and reboot the machine.
Now, when I log back into the machine, in the root directory, by issuing this command I can copy the jars from the staging directory, uberjar, to the root directory, where I can use them with Storm and Kafka. In the file browser, we go into uberjar, and we can see we have our two jars. In a terminal, I issue this move command to move the jars to the root directory. When I look in the root directory, I can see I now have my two artifacts in the root directory. Now that we've built our Kafka and Storm runtime artifacts, we need to start Kafka and Storm by using Ambari. So in the root directory, we issue this command. Once Ambari has booted up, we can open the Ambari URL in a browser. Now you can see I have the URL in a browser. The login is admin, admin, and I'm logging into Ambari. I need to see these green for HDFS, MapReduce and YARN. Let's just check YARN. YARN has started, it's green, and there are no errors. If you end up with red flags here for YARN, MapReduce and HDFS, clean out everything in the system temp directory and restart the cluster; that will often fix many problems. So we'll start with Kafka. We go here to service actions and we go start, confirm start, and while it's starting up we'll go to configs and check that it's writing to the correct logs. Change to the system temp directory, save that and restart. To summarize: I need to check, in the config settings in the configs tab, that I'm writing to the temp directory for the Kafka logs, because I'm going to be deleting those logs, and I set up my temp directory with read-write permissions. In my service actions I can start and stop. I create my configuration and then I can start or restart, and I check my port is correct. Check that Kafka is running correctly. Here I can see Kafka has started. I can also check by checking that only stop is available. Look at ZooKeeper and check that ZooKeeper has started. So I'm good to go.
Assuming ZooKeeper and Kafka are running, I'm good for the next step, which is to run the Kafka producer jar. Now we're going to create a Kafka stream by creating a topic called real time event. We need to find the Kafka topics script in the Kafka bin directory. This is the path to the Kafka bin directory on Hortonworks. Then we hook into the ZooKeeper host and port. We set our replication factor and partitions to one, because we're running on a single-node cluster, and then we give the name of our topic. So we copy this into the terminal inside the root directory. Copy that. Hopefully it looks like it worked: we created the topic real time event. Now we need to check, and we can check by listing the topics. So we copy this command into the terminal. We list the topics that exist in Kafka, and we see we do indeed have the Kafka real time event topic. So now we're ready to run our producer jar, which will produce the data for the Kafka stream real time event. We will run the Kafka producer jar so it will push data into the Kafka stream we just created, real time event. So we copy this command into the terminal, run the producer jar, and hopefully it'll work. Now it's writing the data into the stream; it's set to run for 5000 records. So it's just written the 5000 messages into the Kafka stream, and now we can check with this command. We can run the Kafka console consumer to replay our stream from the beginning. This is the path to the script in the Kafka bin directory. We plug it into ZooKeeper, declare the correct topic, which for us is real time event, and add the from-beginning flag. So we paste this command into the terminal. Note, we have two hyphens for each flag. If you're getting errors, check you've got two hyphens for each flag in the command. So we issue this command in a terminal and it should replay our stream from the beginning. There it goes. That's what our stream looks like. We've got 5000 records.
So we've successfully pushed data with our real-time event producer class into Kafka. Let's just quickly review that source code again. This is our real-time event producer, which pushed the data into the Kafka stream, as we just saw. It's only a very simple class, but it's here to demonstrate the key principles. We have to hook into a broker list. So this is our URL, and we checked that we had this port in Ambari. Then the ZooKeeper host, and we checked we had the correct port for ZooKeeper. So we need to set up these two artifacts, which we do here. These are the correct keys that it is going to look for at run time for these artifacts, the broker list and ZooKeeper. We create our producer with the correct generic types. Here we're using a string encoder for a string generic type, and we create that with our config object. Then we push the data onto the stream, so it's just a simple stream. When we push that data onto the stream, we use this keyed message class. The Kafka stream is a messaging stream, and in our keyed message we have a value that is the topic, to declare the topic for the stream, which in our case was real time event, and then the data for the message, which is just a string. So that's our Kafka producer. The key point now is that this Kafka message stream is really defined by this object, because this is the object that we push onto the stream: a key-value pair that is a message packet, with the stream topic as its key and its payload as its value. This is getting pushed into Kafka, and what we're going to do now in our Storm code, when we run our Storm jar, is convert that Kafka stream into a Kafka spout. Storm has a type which is, of course, Kafka spout. In our Storm code, we can see now that run-on-cluster is set to true, so we're going to go into the cluster configuration rather than the local configuration.
We now step through the configuration values with this class that pipes the Kafka data to the Kafka spout type so that we can put it into our Storm bolts. Just as we ran the code locally, we're now going to run the code off the Kafka stream. So it's going to be the same key-value pairs coming into our bolts. But the key thing is that we need to convert the Kafka stream we just created to a Storm Kafka spout. We start by creating a topology builder object, and this creates a configuration of our spouts and bolts: how everything chains, inputs and outputs, output from spouts into bolts, output from bolts into other bolts. That's our topology. So we start by piping Kafka to a Kafka spout, and we give it an instance of our topology builder, and here we create an instance of our Kafka spout. We need to configure that, so we'll go to the configuration for the Kafka spout, and here we can see, as before, it wants a broker host configuration value and a ZooKeeper value. We have some other values that we need to work with, because it's a different situation now. In particular, we need our Kafka topic and we need our spout ID. It's convenient to keep these things in an enum, which we will call grid config. When we are wiring up things like ZooKeeper and Kafka, this is what we call a grid. As we saw in the last Samza topic, we had a Samza grid, which was ZooKeeper, YARN and Kafka. There we have a grid. So we have a grid config enum. If we look at that, we can look at our values. We have a ZooKeeper host, which is the overall host for the cluster. Then we have what is like a local-node ZooKeeper host, which is just localhost. Then we have our declared Kafka topic for the Kafka stream that we created, and we have a Kafka spout ID. This one is important; we'll have a look at that now. In our config, we pull in our topic, a ZooKeeper root value, a broker host value and, very importantly, a spout ID. We will get our Kafka spout ID.
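A grid config enum along the lines just described could be sketched as follows. The four entries follow the walkthrough (cluster ZooKeeper host, local ZooKeeper host, Kafka topic, Kafka spout ID), but the constant names, the host and topic strings and the helper method are illustrative assumptions rather than the course's actual source.

```java
// Sketch of keeping the grid coordinates in one enum.
// All string values below are placeholders, not the lesson's real settings.
class GridConfigSketch {

    enum GridConfig {
        ZOOKEEPER_HOST("zookeeper.example:2181"), // overall ZooKeeper host for the cluster
        LOCAL_ZOOKEEPER_HOST("localhost"),        // local-node ZooKeeper host
        KAFKA_TOPIC("realtimeevent"),             // topic of the Kafka stream we created
        KAFKA_SPOUT_ID("kafkaSpout");             // id shared by the spout and the bolt grouping

        private final String value;

        GridConfig(String value) {
            this.value = value;
        }

        String value() {
            return value;
        }
    }

    // Describes the chain using the single shared spout id, so the id the spout
    // is registered under and the id the bolt groups on cannot drift apart.
    static String describeWiring() {
        String spoutId = GridConfig.KAFKA_SPOUT_ID.value();
        String boltGroupedOn = GridConfig.KAFKA_SPOUT_ID.value();
        return "spout " + spoutId + " -> bolt grouped on " + boltGroupedOn;
    }

    public static void main(String[] args) {
        System.out.println(describeWiring());
        System.out.println("topic: " + GridConfig.KAFKA_TOPIC.value());
    }
}
```

Keeping the spout ID in one place is the design choice that matters here, because the same value has to be used both when the spout is created and when the first bolt is attached to it.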
Recall how we use the Kafka spout ID. We can see it from when we looked at our local build: we give our local spout an ID of local spout, and when we chain our bolt, the first bolt in the chain that receives data from our local spout, it's grouped on the local spout ID. So our first bolt is grouped on the local spout ID. We are doing the same thing here. We are setting our spout with our spout ID, and we are hooking it into our bolt. Here we call builder set bolt. If we look at the grouping, we are using the Kafka spout ID. So this ID is doing the same job. Our Kafka spout ID is doing exactly the same job in our Kafka spout to bolt chaining that the local ID did here when we had our local spout. So here we are setting our first bolt, and its fields grouping is on local spout, and our local spout ID is local spout. If we go down to our cluster code, when we are chaining to our bolt, we're using for its grouping the Kafka spout ID, and when we create a Kafka spout, we are setting our Kafka spout ID to be that same ID. So that's a key point. If we don't have that chaining set up correctly, then our topology will fail on the cluster. We were able to check locally that the fields and the code would work. But if we don't chain our topology together correctly on the cluster, then our code will fail. So now, in Ambari, we go to Storm and check that our Storm configuration is correct. We want to check that our Storm UI server is set up correctly, because we're going to need to access that to check that everything is running. We won't get a lot of output from Storm in the terminal when we run the code on the cluster, so we're going to have to go to where Storm stores its logs, but we also need to use the Storm UI. Everything looks good. So we go to service actions and we start Storm. We have Kafka running; we need to make sure Kafka has started. For Storm, you can see here one operation is in progress.
Storm is in the process of starting up, and once Storm has started, we can submit our Storm jar and check that we can pipe the Kafka stream into a Kafka spout and chain that correctly into our Storm bolts for analytical processing on the Kafka stream. So we'll move on to that in the next video. Now we're ready to submit the Storm jar that we built with our Storm profile to Storm running on the cluster. The way we do that is with this command, where there's no hyphen on jar, then the name of the jar, and then the fully qualified class name of the main class. We have already checked in Ambari that Storm is running. So Storm has started, Kafka has started and ZooKeeper has started, so we can

6. 6 Big Data Applications: One of the biggest players in the healthcare industry is the Department of Veterans Affairs. It has a budget of nearly $90 billion a year, and one of its core responsibilities is to deliver healthcare. In 2014 there was a scandal known as the Veterans Health Administration scandal, and that was simply based around people having to wait too long for access to medical care. This organization is under huge pressure to provide healthcare quickly. They have developed many different tools, which they make freely available, and one of the tools they provide is known as HDD Access. The HDD here stands for Healthcare Data Dictionary. Its purpose: integrate medical terminology standards to make your application interoperable, for semantic information exchange and analytics. So remember, our goal is to be able to create medical healthcare applications with big data. How does HDD Access relate to that? Well, it relates through these two core statements. The first core statement is to integrate medical terminology standards.
So whether we have a big data healthcare application or just a little healthcare app on a mobile, it still has to use the correct medical terminology. The second way we can use HDD Access is to make our applications interoperable: the semantics of our data become critical. So those are the two key points: medical terminology standards and semantics. How does that translate from the high-level theory to real-world practicality? Healthcare applications, at some point, are going to have to interact with what's known as an electronic health record. Now, this is the National Library of Medicine for the United States, and this is a central repository for information about electronic health records and meaningful use. RxNorm is what's known as a code system. This is a system, a taxonomy of codes related to the use of chemicals in medicine. They have a system of interchanging information, and they use a code system, and that code system is what's known as RxNorm. This is only one part of the information here. HDD Access is important because it is a breakdown of the key information contained in this repository, in the sense of a database that you can query, if you like. HDD Access is a roadmap to all the branches of knowledge that you find within the National Library of Medicine as it relates to electronic health records. We mentioned the RxNorm code system. This is a taxonomy system to do with exchanging and breaking down information related to the way chemicals are used in medicine. SNOMED is a systemized way of talking about medicine from a clinical perspective. We have this other code system called LOINC, which is used when we are dealing with information from clinical or other medical science research. Medical science does not exist in a vacuum. It's a human science, and when we talk about humans, we talk about value judgments.
So, for example, if you belong to a certain religion, then you are forbidden from having a blood transfusion. So this is an example of a value judgment about the health benefits of a blood transfusion, and it's very important for the adherents of that religion. That's an example of the sort of information that doesn't fit well inside a code system like RxNorm, so we have what's known as a value set. So we have this breakdown of value sets and code systems, and HDD Access is a bridge to those value sets and code systems. Previously, this taxonomy of knowledge only applied to the United States. It has gone international. The overarching taxonomy for all those underlying code systems and value sets is what's known as ICD, which stands for International Classification of Diseases, and this is overseen by the World Health Organization. Healthcare is primarily there to treat and prevent diseases. So HDD Access is the bridge to these code systems and value sets for our application, and the electronic health records and other health care providers and applications that our application will have to interface with are all using this international system of knowledge. So that's what we mean when we talk about semantic interoperability: the ability of computer systems to exchange data with unambiguous, shared meaning. The goal is that when a patient anywhere in the world, for example Vietnam, moves somewhere else, say Iceland, a doctor in Iceland can access that patient's electronic health records and use the information in a meaningful way. That is the context of semantic interoperability. What we'll do now is move on and quickly demonstrate the HDD Access application so we can see in a hands-on way what it is, and talk about the technology stack that it uses. This is the application running on localhost, and it's a database with a search function.
Now we have two search functions, representation and NCID. NCID is just a code system ID, and representation is a representation of what's known as a concept. So here I am entering the medical term diabetic, and I have selected ICD-10. So I select the code system, do a representation search, and have a look at the results. This is the code system number for this particular diabetic concept. So we click on the NCID, and that takes us to a representation of the concept that this particular NCID points to. We have this relationship between the code and a concept. So now if we take the NCID into the search form, choose NCID and do an NCID search, we can see it takes us to the same concept. Now we will quickly review the technology stack that underpins HDD Access. First, we have Apache Lucene. Apache Lucene is a search engine technology, so it's a way to index text content for very fast lookups. It's generally used in web search engines, but it's used here to overcome some inherent limitations when we use relational databases for certain types of queries. HDD Access uses Lucene to speed up join queries. We will look at Apache Lucene, but we will move on from Apache Lucene to what is known as Apache Solr. Indexing is a way of processing text content so that a search on a particular match text is much quicker than if you didn't have that index. So that's the first technology that HDD Access uses, Apache Lucene. Then we have OSGi. Its database layer is a Spring JPA OSGi layer. This is the HDD Access source code, and I'm in its web module, and this is its Spring configuration XML file. If we look in here, we can see we have an OSGi tag defining an OSGi module. So its database layer is built out of these OSGi modules, and each of these OSGi modules represents a particular entity structure in the database model schema. This is a class referenced in that OSGi module structure, in its conceptual schema, as an RSForm. So this is an RSForm persistence layer implementation class, and you can see here it's using JPA for its persistence layer. If we go down, we can look at its queries. This is one of the queries that it's using, and it brings a lot of its technologies together. We've set up this query here and highlighted the key aspects of the query and how it works. If we look at the query and the objects it's using, we can see this defines the conceptual schema of the database. The objects that it has are concepts, relationships, and what it calls RSForms and RSContexts. So it's based around concepts and relationships. The RSForm stands for a representation form, so you represent the concept in a particular form in a particular context, and between the concepts you have relationships. So it's a very sophisticated data model, and it's set up to perform these sophisticated relational queries, structured into a persistence layer. It's very sophisticated, but it's using quite traditional technology. So that's our technology stack. We have Lucene. We have OSGi, which is a modular application tool. We have Spring for the DI, the dependency injection, and JPA for the database persistence, and it has an object-relational mapping where we have concepts, relationships and representation forms. We also have the idea of an NCID, which stands for numerical concept identifier. Code systems map to these NCIDs, and that allows us to store a reference to a large knowledge domain with a simple integer. And by using this numerical concept ID, we create a sophisticated data set that's highly optimized for queries of the form that we're going to look at.
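To make the relationship between a representation, an NCID and a concept concrete, here is a minimal sketch in Python. It is not the HDD Access code: the class names only borrow the vocabulary used above, and the NCID values, representations and contexts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    ncid: int   # numerical concept identifier (values here are invented)
    name: str


@dataclass(frozen=True)
class RSForm:
    ncid: int             # the concept this representation points to
    representation: str   # the text a user would search for
    context: str          # where the text comes from, e.g. a code system


# Toy data: one concept reachable through two representations.
CONCEPTS = {1001: Concept(1001, "Diabetes mellitus")}
RSFORMS = [
    RSForm(1001, "Diabetic", "ICD-10"),
    RSForm(1001, "Diabetes", "SNOMED"),
]


def search_by_representation(text, context):
    """Representation search: text prefix plus context, returns concepts."""
    wanted = text.lower()
    hits = []
    for form in RSFORMS:
        if form.context == context and form.representation.lower().startswith(wanted):
            concept = CONCEPTS[form.ncid]
            if concept not in hits:
                hits.append(concept)
    return hits


def search_by_ncid(ncid):
    """NCID search: a single integer lookup."""
    return CONCEPTS.get(ncid)


found = search_by_representation("diab", "ICD-10")
print(found[0].ncid, search_by_ncid(found[0].ncid).name)
```

Both searches end at the same concept object, which is the behaviour seen in the demo: the text search returns an NCID, and searching on that NCID leads back to the same concept.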
So if we're given inputs of a string match and NCIDs, a typical search will be of this form: first we search with a join on the tables for representation, relationship and context, and we're looking for a match for a concept NCID. So we have some source concept, which might be diabetes, and we might be trying to link that to blood pressure. We do this join and then we do a match, and that ties in Lucene, so we can perform these sophisticated join queries in smaller amounts of time. That has been the core focus of the work and the research in developing these sorts of applications, and that's where big data comes in. Because if we can take this data set and map it to a cluster and parallelize these searches, then we can move this technology forward. We're going to try and take the central components of HDD Access and map them to a big data stack. When we have this join match query, it's working with a conceptual schema. This is not really an object database. This is a conceptual database with a semantic data model, and you can read this information and take out of it what you want. We want to stay more focused on the practical skills. What we need to do is try and add value to things. So here we have an extremely sophisticated database. It's extremely useful in the health care world, and it's sitting in what could be thought of as a legacy stack, even on really powerful boxes. So if we can move this content and this schema to a Hadoop environment, then we can add tremendous value to this content. The value here exists at an abstract level. It's not so much in the database design as in the way they have stripped out everything and come up with a set of concepts with these numerical identifiers. You have a set of integers that break down vast domains of knowledge into extremely useful conceptual objects.
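The idea of parallelizing a match search can be sketched in a few lines of Python. This is only an illustration of the map and reduce steps on one machine using threads, not a Hadoop job, and the rows are invented sample strings rather than HDD Access content.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Invented sample rows standing in for representation strings.
ROWS = [
    "Fracture of femur", "Diabetic retinopathy", "Frequent headache",
    "Diabetes mellitus", "Fresh wound", "Hypertension",
]


def partition(rows, parts):
    """Split the rows into `parts` roughly equal chunks."""
    return [rows[i::parts] for i in range(parts)]


def map_count(chunk, prefix):
    """Map step: count the rows in one chunk that start with the prefix."""
    wanted = prefix.lower()
    return sum(1 for row in chunk if row.lower().startswith(wanted))


def parallel_match_count(rows, prefix, workers=3):
    """Run the map step on each chunk in parallel, then reduce by summing."""
    chunks = partition(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(lambda chunk: map_count(chunk, prefix), chunks))
    return reduce(lambda a, b: a + b, partial_counts, 0)


print(parallel_match_count(ROWS, "Fr"))
```

Each worker only sees its own chunk, and the final answer is the sum of the partial counts, so the result is the same as a sequential scan. On a cluster the chunks would live on different nodes.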
HDD Access definitely has a conceptual schema that's set up based on a semantic data model. Firstly, we have a vocabulary. Then we have an ontology, which allows us to define relationships. So when we have a hierarchy of terms, where some parent concept has child concepts, we can create semantic rules. We have some namespace, ICD-10, and its subset of code systems, and then we select that result set from our data set. We have some relationship that exists that is the basis of the query, and it could have some attributes for the query. This is essentially what we see here in a relational query, but there we are in a two-dimensional space of rows and columns. When we have semantic queries, it's essentially the same thing, but we break out of that two-dimensional realm, and so now we are querying on what are known as tuples, and the classic tuple in the semantic query world is what's known as an RDF triple. What we'll do now is move on and go through the practical steps to set up HDD Access on our box, then extract the data set and map it to a Hadoop cluster. In the title of this topic, we say that this is big data and healthcare applications. Now, one of the principal functions that companies found they could use Hadoop for very successfully was adding value to existing data. HDD Access is a sophisticated database whose content has had a tremendous amount of semantic modeling employed to reduce the data set to such a size that it becomes manageable for installing on a local box. But then it extends and bridges to a vast domain of knowledge and information, and it's all focused on healthcare and healthcare applications. So we're going to follow that process of adding value to existing data, and we're going to take the conceptual schema and semantic model that the HDD content provides and map that to big data technologies. So the first step in doing that is obtaining the content.
And so what we need to do is install HDD Access, set up its database and its Lucene index, and then map that content to the big data stack. There are some requirements to consider before we start the install. The first relates to time management, so it's outside of the basic requirements. My box has an eight-core CPU at four gigahertz and a solid state drive, and with that setup it took me about an hour and a half to install, import the content into MySQL and create the Lucene index. So those steps took me 1.5 hours with a reasonably powerful box. If your box is less than this, for instance you might have just an i5 at three gigahertz, then you may want to double this. Maybe you could do better than that if you set things up reasonably, but you might expect to double it. And if you don't have an SSD, well, this can really take a long time. You might want to just follow along with the course and not actually do the install yourself, or you might want to follow along with the video and do the install at some time when you've got plenty of time. The point I'm trying to make here is that unless you have a good box, this can take a bit of time. Now the requirements. It is possible to set up HDD Access on Windows, but because of the unique nature of Windows, I'm not going to support that. If you want to try and set up on Windows, it definitely works and there is a Windows installer. We need to install MySQL, and we need to download and locate the MySQL JDBC driver JAR. Then we need to download the HDD Access binary, that's the install program and some other files, and then do a separate download for the HDD Access content. We need to have Java 7 or later and, as we mentioned before, we need time for the process to complete. So let's start with the download step. The download is fairly straightforward, but there are a few different downloads on the HDD Access site.
So we'll just go through the steps to make sure you don't make a mistake. The first step is that you need to register for the site to download. So this is the register page at the register link. Once you're registered, you need to log in with your user name at this link. After you've logged in, you need to go and download the correct installer. For the correct installer, we want to download the Linux installer, and you'll notice here there's also the Windows installer. I have installed HDD Access many times on Windows, but it's a bit harder installing on Windows. Installing on Linux is much easier, and it works much better for new users. So in this video we are only supporting the Linux path, but if you want to work on Windows, it works really well. And then once you've installed HDD Access, you need to download the content at this link, and this is where you can make a mistake, so you need to make sure you're downloading the correct content. This one, you need to download that one. This is the one you want, so double check that you're downloading these ones. After that, you need to install MySQL and create the database that will be used for the content, and we'll go through that step now. Now I have downloaded my HDD Access binary installer and my terminology content. I also need to make sure I have my MySQL connector JAR, and this is the download link for the MySQL connector JAR. If we go to the link, we choose platform independent when we're not running on Windows. If you're doing the Windows install, of course, you would choose Windows. Then it's a good idea to have everything in one location. Now I have here the sources. If you want to go into the source code for HDD Access, it's really good value because it's excellent code. But I need to have my three main resources, my binary installer, my content and my connector JAR, all in a location where I know I can find them. Extract the binary installer into the home directory.
And that creates this directory here, Linux GUI, and inside this Linux GUI directory is the actual install script. Before we can run the installer, we must first have MySQL installed and also have a user, which we will call HDD. Now there are many resources on the Internet for installing MySQL, and on Ubuntu it's particularly easy. I would like to point out it's best to install MySQL Server 5.5, because that works better with other MySQL tools like MySQL Workbench. We run this command and check that MySQL has started, and then we log in from a terminal with the root user and password. So here I am already in a terminal. I clear the screen, and I'm going to run these commands. Now I'm going to create my user with this command. So I created a user HDD. It's very important in MySQL that you give the actual host name as well, so we must make sure we supply the host. I give the privileges, and then I must flush the privileges and create the database. Now I can check everything is good. Let's check that my user worked out. I'm going to run this query in MySQL to check that I have correctly created the user, and the important thing here is that I have the host as well. So that's really important, it needs to have the host. There's my user, there's my host, so that worked out and I have created a user. Now let's check that we have the actual database created correctly, and that we didn't make a spelling mistake or some other little error that would break things. And there is my database. Because we have MySQL set up correctly, we can exit MySQL with this exit command, and we're ready to run the HDD Access install script. When we extracted the HDD Access binary, we ended up with this folder, Linux GUI, with a space in the name, and in there we have a script. This is the script that we run to install HDD Access. Now I have my MySQL user name and password and my database set up.
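The MySQL setup steps described above (create the user with a host, grant privileges, flush, create the database, then check) can be written down as a short list of statements. The sketch below only assembles the SQL text in Python so the order is explicit. The user name hdd and host localhost follow the video, while the password and database name are placeholders, since the exact values are not legible in the transcript.

```python
def mysql_setup_statements(user, host, password, database):
    """Return the SQL statements for the setup, in the order used in the video."""
    return [
        # The host part matters: MySQL identifies an account by user AND host.
        f"CREATE USER '{user}'@'{host}' IDENTIFIED BY '{password}';",
        f"GRANT ALL PRIVILEGES ON {database}.* TO '{user}'@'{host}';",
        "FLUSH PRIVILEGES;",
        f"CREATE DATABASE {database};",
        # Two checks: the user/host pair exists, and the database exists.
        "SELECT User, Host FROM mysql.user;",
        "SHOW DATABASES;",
    ]


# 'change_me' and 'hdd_db' are placeholders, not values from the course.
for statement in mysql_setup_statements("hdd", "localhost", "change_me", "hdd_db"):
    print(statement)
```

The statements are meant to be pasted into a MySQL shell opened as the root user. Printing them first makes it easy to spot a spelling mistake in the database name before anything is created.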
I'm ready to run the install script. Before I run the install script, I make sure I have the MySQL connector JAR in a place that's easy to point to, because I need to give the path to that JAR when I run the install script. I must also give read and write permissions to the install script. Because it's got that space in the path, we can handle it this way. So now the install script has read and write permissions, and I just run the installer script. If I run it from the home directory, the wildcard in this shell command will handle that space in the Linux GUI directory name. So I can issue this command in a terminal to run the installer. I issue this command and it's running the installer, and it's running it in graphical mode. First I accept the agreement. Then it's going to install, so I go next. It's going to give me messages, which I don't need to worry about. I just accept the defaults, and now it's extracting the data. Now, this is the first tricky thing: we need to choose custom installation, then next, and now we need to select the type of the database, which will be MySQL. Next. Then, do we have the driver? Yes, so we can select next. Now it says it can't find a driver, please help it to locate one. We select OK, and it wants us to find the file. So we go browse, and because we put it in our home directory, it's going to be very easy to find. I select it, and I have now found the connector JAR. I go next, and I can leave the defaults here. Next, I add my user name, which I created in that step with MySQL, HDD, then the password, and then the database URL. This is what we want on a standard MySQL install. Make sure there are no spaces. Now I can test my connection, and the connection is a success, and that's the end of the hard part of the install. When you select next, it will go through and do the rest of the install, and you just choose the defaults. So now we've installed HDD Access. We will go through the steps to run HDD Access. Before we do that, it's worth spending a moment to look at the configuration files for HDD Access. You can find the HDD Access configuration files at this path here. So here we have the HDD Access install directory. We go to the HDD server, then webapps, ROOT, WEB-INF, classes, META-INF. We'll just review that path again. Starting at HDD Access, we go to the server, then webapps, ROOT, WEB-INF, classes, META-INF, and once we're there, we can't really get lost. Now we go to spring, and here we have the application context. We open this up with gedit, and if we go down, we're looking for the data source. And here we have the data source. It's not pointing to the MySQL URL, it's pointing to this JNDI name. If you have database access problems, this is the place to start debugging. So this is the system variable that it's using to access the database, and it's doing that through Tomcat. To see that, we go back up to the server directory, we go to conf, and we're looking for context.xml. We open this in gedit and go down, and this is where we can see we have the MySQL URL. If you have a mistake or your MySQL configuration changes and you don't want to go through all the steps of reinstalling HDD Access, then this is where you can reconfigure your MySQL connection string. It's worth knowing these two configuration files for setting up the access from the HDD Access web application to its database back end. Let's go through the steps to run HDD Access. We start and stop basically in different terminals. Now, if I was to run this with start instead of run, then I could shut it down in the same terminal that I started it in. I don't want to do that. I want to see the debugging output, so what I'm going to do is use run instead of start. So I copy this into a terminal.
So in the home directory, one directory above where I installed HDD Access, I'm now running this command, and you can see Tomcat is starting up. It's important that I use run here, because if I get errors, then the debug information in this run terminal will help me solve those errors. But I'm not seeing any errors yet. What's important is that the port is here. This can all be configured: when I change the port setting, I use this context XML file to change the default port. In a browser, we use this URL. So here we are. This is the application after a fresh install, and I don't see any search domains here because I don't have any content in the database. So the very first thing to do is to log in. The user name is admin and the password is admin. So I log in, and now I'm logged in. I go back to tools and go to content management, then choose import. Then I go to browse and go to where I have my content, which I extracted to this directory, and inside that directory will be a zip directory, the one I have highlighted here. I open that, and you can see it's a zip directory. So when you download the content, it comes down as a zip directory, you extract that, and inside is another zip directory, and that's what you load in at this step here. Now when I select import, this is when the big time-consuming process occurs, to import the content into MySQL. Once that's complete, it will create the Lucene index. We need to make sure we have a fair amount of time for this process to run. These screenshots summarize the import process. After you select the import button, you will see this screen in the browser. When this finishes successfully, you get a message that the import finished successfully, and you can see for me it took an hour and 10 minutes, and this was just importing the data into MySQL. Now, the next phase, which it goes into automatically, is to create the Lucene index, and that can also take a large amount of time.
So here you can see in the terminal the output you would expect if everything is working out correctly. The index finished successfully, and now I will get this screen. Then it's best to shut down HDD Access and restart HDD Access, and you will see all the search domains. We've now successfully created our content, so we can now go through the steps of extracting that content, mapping it over to the Hadoop cluster and importing it, at first into the Hive database by using Apache Sqoop. Now we've set up HDD Access on our local box, we've imported its content, and we've looked at how we can use it as a traditional stack application. Now we're going to move the data across to a big data stack. Our big data stack is going to consist of the following core technologies. We have Apache Sqoop, a tool that's designed for efficiently transferring bulk data between Hadoop and structured relational databases. That is exactly the database that we have set up, a relational database, and so Apache Sqoop is going to take that and put it into Hadoop. The next element of our stack is Apache Hive, the data warehouse of Hadoop, and Apache Hive is improving all the time. So we use Sqoop to get the data out of MySQL into Apache Hive. For the Lucene index, we will use Apache Solr. HDD Access as it stands now is indexing with core Lucene. What we will do is create an index in Solr, which is built on top of Apache Lucene, so we'll see how we can create indexes for the structured data that we push out of HDD Access into Hive using Apache Sqoop. Then we'll look at the data in Hive and run some Hive queries. We'll see how Hive can parallelize the queries, so they are no longer run as a sequential query process. They can now be run as a MapReduce process, and that greatly speeds up the queries. And then we'll look at how we can use Solr to create the index.
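Since the index comes up again here, a toy version may help show why indexed text is quicker to search. The sketch below builds a very small inverted index in plain Python. It illustrates the general idea behind Lucene and Solr only, it is not how either library is implemented or called, and the documents are invented.

```python
from collections import defaultdict

# Invented sample documents keyed by an integer ID.
DOCS = {
    1: "Diabetes mellitus type 2",
    2: "Diabetic retinopathy",
    3: "Essential hypertension",
}


def build_index(docs):
    """Map every lower-cased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, term):
    """Look the term up directly instead of scanning every document."""
    return sorted(index.get(term.lower(), set()))


INDEX = build_index(DOCS)
print(search(INDEX, "Diabetes"), search(INDEX, "hypertension"))
```

The work of reading every document is done once, when the index is built. After that, each search is a single dictionary lookup, which is why building the index takes a long time up front while the searches come back quickly.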
So we will have Sqoop for data migration, Hive for our data warehouse and then Solr for searching. What we need to do is get the data into MySQL on the cluster so that we can access that data via Apache Sqoop. So these are the steps that we will follow to migrate that data to the cluster before we use Apache Sqoop to push it into Apache Hive. First, on the local box, we must export the data. This is a pretty standard thing, so you just use the mysqldump utility. This will dump the entire database into this file. You then copy the data dump, that's this file here, to the cluster, and we're going to use the latest Hortonworks virtual machine. Inside MySQL on the Hortonworks virtual machine, as before, you just run the same commands to create the user, give the privileges and create the database. And then we import: we just run this command and we import the data dump into the database. Once that's done, we will run some queries on the HDD Access database, and we'll first look at the speed of MySQL on some different sorts of count queries. Then we will look at some match queries, and then we will look at what are known as join queries, joining with matches. And then we will look at the sort of query that definitely would benefit from mapping the data set to Hive and running as a MapReduce job. Once we've done that, we will then use Sqoop to migrate the data. So I simply copy this command into a terminal. No need to be logged into MySQL. Now you must give your password here for this command to work, and then execute the command. This command will take a while. So now this command is running. It's dumping the entire contents of the database, and it's quite a lot of data, so this could take quite a while. Now I have the latest version of the Hortonworks virtual machine running on the box. I've SSH'd in with a terminal to the root directory, and I also have an SSH file browser set up.
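The export and import steps above come down to two shell commands, one on the local box and one on the virtual machine. The sketch below builds those command lines in Python so that the shape is clear. The user name follows the video, while the database name and dump file name are placeholders, because the exact names are not legible in the transcript.

```python
import shlex


def dump_command(user, database, dump_file):
    """Shell command that dumps a whole database to a file (prompts for the password)."""
    return (
        f"mysqldump -u {shlex.quote(user)} -p {shlex.quote(database)}"
        f" > {shlex.quote(dump_file)}"
    )


def import_command(user, database, dump_file):
    """Shell command that loads a dump file into an existing, empty database."""
    return (
        f"mysql -u {shlex.quote(user)} -p {shlex.quote(database)}"
        f" < {shlex.quote(dump_file)}"
    )


# 'hdd_db' and 'hdd_dump.sql' are placeholders, not names from the course.
print(dump_command("hdd", "hdd_db", "hdd_dump.sql"))
print(import_command("hdd", "hdd_db", "hdd_dump.sql"))
```

Both commands are run from an ordinary terminal, not from inside the MySQL shell, and the `-p` option with no value makes the client prompt for the password. The target database has to exist on the virtual machine before the import is run.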
And I've logged into the root directory. I have copied the data dump, the HDD Access SQL file, across from my home directory into the root directory. I log into MySQL by just typing mysql and hitting return. I mean, Hortonworks have set up MySQL on the virtual machine so that you don't need a user name or password, but we will need to give user names and passwords when we run the Sqoop script. So we need to create the same environment we have locally. We need to run these commands: create a user, grant the privileges, flush privileges and then create the database. So we'll just check everything is good. We run this query, and it'll show the users. Here you can see I have HDD set up on localhost. And now if I type show databases, you can see I have the HDD Access database created. So I've created my user HDD and granted them all the privileges, and then I created the database, just as before. I can now import the data dump into MySQL on the virtual machine. I exit out of MySQL here. No need to be in the MySQL shell for this step. So now I just issue my import. Here's my import statement, and this will import the data that I dumped out of the local MySQL into the database that I have created here. Now that we've imported the data, let's run some queries to look at the performance of MySQL and see if we can discover a form of query that would benefit from running as a MapReduce job. Let's use the database. So now we're using HDD Access, and we can enter the command show tables. We have a table here, concept HA, and another table, RSForm HA. RSForm HA is much bigger than concept HA. So let's run a count on concept HA. That was really quick, and it's 629,000 rows. Let's change that now to RSForm, and that's 3,700,000 rows, so you can see the order of magnitude. So it still counted nearly four million rows in a few seconds.
But I will be able to show you an example with this data set where it would make sense to use a MapReduce job for the query. Let's do a match on text. What we can do is describe the table concept HA. Describe concept HA, and you can see it has these fields. I have seen that the comments field is empty, so let's do a count. So I run that, it ran through, and you can see that it returns zero. So we have found all the nulls. Now what we'll do is describe RSForm and look at what RSForm looks like. We can see it's got this field, representation. So what I'm going to do now is run a match text query. When I use match text, I put in a string, and this percent sign here is a wildcard. So it matches anything that begins with Fre. Let's run the same query on RSForm, which is now nearly four million rows, and you can see that with the limit on, the query was very quick. We'll take the limit off, and it found 470 results in a search of nearly four million rows, with almost no wait. So now let's try joining two tables together. When we join tables, this can slow MySQL down. This is the argument for not using MySQL. I'm claiming here that when we do join queries on a relatively big database like this, the join query will be too slow. Now, this is a particular type of join. This is what's known as an inner join. So let's run an inner join. The concept definition is in the concept HA table, and the representation is in the RSForm table. I did have a limit on the query, so let's take the limit off. See, it's taking longer. This is the longest we've had to wait. There are a lot of matches there because the data set is so big, but it didn't take very long before MySQL was finding results. So what I want to do now is give an example of a query that would benefit from a parallel process. This is an example of a query that would benefit from running as a MapReduce job.
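The three kinds of query run above (a count, a wildcard match with a limit, and an inner join with a match) can be tried on a toy data set. The sketch below uses Python's built-in sqlite3 module instead of MySQL so that it runs anywhere. The table and column names are simplified stand-ins rather than the real HDD Access schema, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (ncid INTEGER PRIMARY KEY, definition TEXT);
CREATE TABLE rsform  (id INTEGER PRIMARY KEY, ncid INTEGER, representation TEXT);
INSERT INTO concept VALUES (1, 'first concept'), (2, 'second concept');
INSERT INTO rsform VALUES
    (10, 1, 'Frequent'),
    (11, 1, 'Diabetes'),
    (12, 2, 'Fresh'),
    (13, 99, 'Freckle');  -- points at an NCID with no concept row
""")

# 1. Count query.
row_count = conn.execute("SELECT COUNT(*) FROM rsform").fetchone()[0]

# 2. Match query: % is the wildcard, so this finds text beginning with Fre.
matches = conn.execute(
    "SELECT representation FROM rsform "
    "WHERE representation LIKE 'Fre%' ORDER BY id LIMIT 5"
).fetchall()

# 3. Inner join with a match: only rows that have a concept on both sides.
joined = conn.execute(
    "SELECT c.definition, r.representation "
    "FROM concept c INNER JOIN rsform r ON r.ncid = c.ncid "
    "WHERE r.representation LIKE 'Fre%' ORDER BY r.id"
).fetchall()

print(row_count, matches, joined)
```

The inner join drops the row whose NCID has no matching concept, which is the difference from the full join discussed next. On four rows everything is instant, so the sketch shows only the shape of the queries, not their cost.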
It's what's known as a full join in other databases. MySQL doesn't implement a full join, but we can simulate it with the union of a right join and a left join. You know, the designers of MySQL knew that this was an extremely inefficient use of MySQL, so they didn't bother implementing it. So let's have a look at it with a limit of five. Even with a limit of five, it's taking a long time. The reason is that it has to match all the records. We're forcing it to match all the records, and if no match exists for one of the cases, it just shows no record for that case. Outer joins are not often useful, but there are times when they definitely are needed, and so this is an example of such a query. As you can see, it's not returning. Every other query that we hit MySQL with came back quickly. Even when we did the inner join on a table with four million rows and a table with 600 thousand rows, which is a reasonably big join data set, MySQL was excellent. It returned really quickly. Now, when it's doing this full join, you can see that it is not returning. If we're running full joins like this on the current setup with MySQL on this data set, it's a long way from satisfactory. So this is the big issue. Big data is such a buzzword. MySQL is a solid performer, so there has to be a real reason why you would want to run the query outside of MySQL. Well, here's the reason. As you can see, this full join is just not working. So let's move on now, map the data out of MySQL into Hive and look at what we can do with queries on Hive. If you're new to SQL, all the SQL commands that we just issued are in the documentation. We also use SQL in Hive, as Hive has an SQL interface. SQL also turns up in other surprising places in the data world, so it's a good language to learn. So we'll move on now. We're going to move the data out of MySQL into Apache Hive using Sqoop. First we must check that Sqoop is up and running.
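The simulated full join can be shown on a toy data set. The sketch below again uses sqlite3 so that it is runnable without a MySQL server. One deliberate difference from the video: instead of a left join unioned with a right join, it unions two left joins with the tables swapped, which gives the same rows and does not depend on the database supporting right joins. The table names and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (ncid INTEGER, definition TEXT);
CREATE TABLE rsform  (ncid INTEGER, representation TEXT);
INSERT INTO concept VALUES (1, 'concept one'), (2, 'concept two');
INSERT INTO rsform  VALUES (2, 'form for two'), (3, 'form for three');
""")

# Every concept with its form if any, UNION every form with its concept if any.
# UNION (not UNION ALL) removes the duplicate copy of the rows matched on both sides.
FULL_JOIN = """
SELECT c.ncid, c.definition, r.ncid, r.representation
FROM concept c LEFT JOIN rsform r ON r.ncid = c.ncid
UNION
SELECT c.ncid, c.definition, r.ncid, r.representation
FROM rsform r LEFT JOIN concept c ON c.ncid = r.ncid
"""

rows = conn.execute(FULL_JOIN).fetchall()
for row in rows:
    print(row)
```

The result has one row per match plus one row for every unmatched record on either side, padded with NULLs (None in Python). That is why the database has to consider every record in both tables, and why this form of query is the expensive one on a large data set.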
To do that, we need to copy this command into a terminal on the virtual machine, but it needs to go in as one line. Run that. Now we're getting this warning about Accumulo home. We can ignore that, and you can see it's found the databases in MySQL. So Sqoop to MySQL is good. The next thing is to check the state of Hive on the cluster, so now we need to start Ambari. This is the shell script, so we run that in a terminal, and it's starting Ambari. Ambari has started up, so now we go to the login URL for Ambari, which is this URL here, and the user name and password are admin, admin. Now we want to check that everything is good with the cluster, and we can see everything is good and, most importantly, Hive is green. We'll also have a look at Hue, so we go to the Hue URL. The password here is 1111. We go to Beeswax, and Hive looks good. We can run queries in Hue. We could also run queries from the command line. So now what we can do is run our Sqoop import command. This command must all be in one line, so I've copied it into this text editor here, and we can see it's going to go in as one line. You can see we use the URL for the database, and we use the user name and password that we created in MySQL. But here, for the import, where we are migrating from MySQL to Hive, you can see we have this strange syntax. That's what you need. We declare the table, so we're going to import one table at a time. The first table we're going to map over is concept HA. Let me run this. It's going to kick off a MapReduce job. We run the command, and you can see down the bottom here it's now doing a MapReduce: map zero percent, reduce zero percent. This is going to actually go through surprisingly quickly. So now it's moving into the reduce phase.
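The exact Sqoop commands typed in the video are not captured in the transcript, so the following is only a sketch of what such commands commonly look like, assembled in Python to guarantee that each one ends up on a single line. The options used (`list-databases`, `import`, `--connect`, `--username`, `--password`, `--table`, `--hive-import`) are standard Sqoop options. The JDBC URL, password, database name and table name are placeholders.

```python
def sqoop_list_databases(jdbc_url, user, password):
    """One-line command to check that Sqoop can reach MySQL."""
    parts = [
        "sqoop", "list-databases",
        "--connect", jdbc_url,
        "--username", user,
        "--password", password,
    ]
    return " ".join(parts)


def sqoop_hive_import(jdbc_url, user, password, table):
    """One-line command to copy a single MySQL table into Hive."""
    parts = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "--password", password,
        "--table", table,     # one table per import run
        "--hive-import",      # load the result into Hive, not just HDFS files
    ]
    return " ".join(parts)


# Placeholder values, not the ones used in the course.
print(sqoop_list_databases("jdbc:mysql://localhost/", "hdd", "change_me"))
print(sqoop_hive_import("jdbc:mysql://localhost/hdd_db", "hdd", "change_me", "concept_table"))
```

Running the import once per table matches the flow above, where the concept table is migrated first and the larger representation table second. Putting a password on the command line is convenient on a sandbox virtual machine but is not something to do on a shared system.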
And here we're getting some messages about column NCID having had to be cast to a less precise type, because the precision in MySQL was higher than the precision in Hive. And it's finished. What we'll do now is the table for iris form. We run the same command, but now change this to iris form. We expect this one to take significantly longer: iris form is nearly four million records, so this will take quite a bit longer. While this one runs, we'll go to Hue, to its File Browser. We were in Beeswax before; now we go to the Hue File Browser, we go up to user, and we go to root. Here you can see it's creating the directory for iris form. This is just a staging directory for the MapReduce job, and once this MapReduce job has finished, this structure here will be deleted. Now it's finished. It took a bit longer, but again it's surprisingly quick. The MapReduce parallel processing is really speeding things up here. Moving from MySQL into Hive, the migration phase definitely benefits from parallel processing. Type hive, and that will take us into the Hive command-line interface. Now we will use the SQL that we used in MySQL. Hive is very faithful to the original SQL, so many vanilla SQL commands will work inside Hive. Here we can see it listing the tables. We have our two tables, concept and iris form. So now we run one of our select match queries. Here we're selecting from iris form with the iris form representation field like 'Fre', with a capital F, followed by the wildcard, and a limit of five. This is the same vanilla SQL we used in MySQL, and it should work in Hive. It's not quite as fast for this sort of query as MySQL, and you can see it's finding the same number of fields. So now what happens if we take out the limit? We'll take out the limit and look at what's happening. Okay, so it's very quick.
It's almost as quick as MySQL, even for a big search, and that iris form table has four million rows. So you can see that Hive has had no trouble with the size of the data set, even on this small cluster of one node. You can see straight away that by being able to go to a parallel, multi-node environment, we've already vastly improved our technology stack as far as the persistence layer goes. So what we'll do now is move straight over to that complex join, the join that MySQL was unable to deal with. What has to happen is that it has to go in as one command, so we put this into a text editor. So we want this to g 7. 7 Log collection and analytics with the Hadoop Distributed: This will be our learning flow. First, we will see how, using the latest Hortonworks virtual machine, we can install what's known as IPython, an interactive Python which has evolved into the Jupyter project and which gives us a remote-access shell for a Hadoop cluster. For the Jupyter project, you can see this link, where you can find out more information about it. It has evolved from IPython, where IPython was this interactive Python shell, and we're going to install it and set it up with all the dependencies for advanced data science using Python. Then we'll move on to create Python code that generates simulated data. We can find that at this link here, so the project source code can be downloaded from GitHub onto the virtual machine. Then we will move on to install Apache Flume. Apache Flume is a streaming technology. What we can do with Apache Flume is take some sort of server log, or some sort of log process, and use it as a stream data source: put it into a stream and then output it for some analytical function as the log grows, as the log data increases in real time.
It goes into the stream, where we set up some sort of analytical function, and we can use the Hadoop Distributed File System to deal with the size of the files. Also, when we run analytics, we can run them in parallel as MapReduce or other YARN-related jobs. So we'll set up Apache Flume to do that, and we will output from Apache Flume to what's known as Apache HCatalog. HCatalog is the technology that allows us to interface data that's relational to other layers, such as Apache Pig, MapReduce jobs and Apache Hive. HCatalog is a very good way to set up the data in HDFS so that we can then process it using Hive, Pig and even, if we want to, Spark. What we'll be using as the glue for all of this is IPython. So one of the key concepts we want to get across in this learning module is what a great tool IPython is. Why is it worth it to learn IPython? A significant part of the work in Hadoop and Hadoop-related projects is analytical. This part of the work is mainly performed by data scientists, and as developers we need to support them and write code that they can use. Python is a go-to language for data science, and that's because of the tremendous package support in Python for data science, packages such as scikit-learn. So we go to scikit-learn. Here it is; this is scikit-learn. It has a really wide range of excellent tools for machine learning, and this is just one of the many packages that are available for Python. If we look at the Java options for machine learning, we have MLlib from Spark, and we have Mahout. If you've worked with those two, they're very, very good, but in terms of the diversity of the algorithms that you might want to use, a package like scikit-learn really extends the options, and that's what data scientists will be looking for in the applications that we develop. Python interfaces exist for most of the major
Hadoop packages that are analytically related, such as Spark and Pig. And the third thing, which I think is really important, is that with IPython you get a remote Python shell in a browser. That means that if you have some big cluster in a cluster farm somewhere, it might be on Amazon, you can be at a completely different location anywhere in the world and log into that cluster with a browser. Then you have this interactive shell where you have all these packages, all the wonderful scientific and data science packages that Python brings to the table. You have all of those available in an interactive shell that will then allow you to interface with Spark, Pig and other major Hadoop components. So that's incredibly convenient. You could be on a Windows box, a Mac box or a Linux box anywhere, log in with a browser to a running Hadoop cluster, and be writing Python code. And the IPython remote shell does a lot of things for you. It manages the code that you write really well, so it's a very good tool. It's a free tool, and it's one we're going to learn how to set up and use. So the first thing we're going to do is install IPython, and we're going to install the latest version from Jupyter, so we'll move on with that. Now, these are the shell commands for installing IPython from Project Jupyter. This first command downloads all the dependencies we will need for Python 2.7. The latest Hortonworks box is still CentOS 6, so it has Python 2.6, and the first step is to set up Python 2.7. This downloads the dependencies onto the virtual machine. In the root directory, I just copy that command in. If your box is a little bit older, this will take longer. Type yes to accept the downloads. This could take a while, and you should see it finish with Complete and success.
Our next command makes sure you have all the latest compilers, like GCC and make. This could take a while. Once again, you're going to have to accept the download and install. When that's through, you should see Complete. Now our next command installs Python 2.7. This one can take longer, so we'll just let it go through; it can take considerably longer. This next command enables the 2.7 interpreter. We will often have to use this one: every time we log into a new shell, we're going to have to use that command to pull the 2.7 interpreter into the path. Now we download the easy_install setup script, so copy this one. And now we can use that to install pip, and pip is the Python package manager, like npm for Node.js. Now this step is possibly the longest step. This is installing the scientific and other packages: this is the machine learning package, this is a numerical package. Many of these packages take quite a long time to install, so when you run this step you need to be aware of time management issues. The dependencies for machine learning, scientific work and data analysis have now been installed, and you can see them there: NumPy, numerical Python; SciPy, scientific Python; and scikit-learn. In this topic and the scientific data analysis topic to come, we will use many of the functions in these packages. Everything's installed now. Our next step is to move on and install IPython Notebook, which is this command here. This one goes through relatively quickly. We just let that go through, and you can see it's getting all the Jupyter dependencies, and that's completed successfully. That's the dependency phase over. All of those steps take quite a while. These are the shell commands, and they install IPython Notebook with all the data analysis and scientific Python tools that you will ever need. We'll move on now to configuring IPython, and we will configure it so that it's set up naturally for working with the Python interface for Pig.
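After a long package install like the one described above, a quick check that the packages can actually be found by the interpreter may save time later. The following is a minimal sketch, not part of the course material. It assumes Python 3 (the course itself installs Python 2.7, where importlib.util.find_spec is not available), and the package list is an assumption based on the packages named in this lesson; sklearn is the import name of scikit-learn.

```python
import importlib.util

# Import names of the packages mentioned in the lesson (assumed list).
PACKAGES = ["numpy", "scipy", "sklearn", "pandas", "matplotlib"]


def installed(packages):
    """Return {name: True/False} depending on whether each package is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


status = installed(PACKAGES)
for name, present in sorted(status.items()):
    print(name, "ok" if present else "MISSING")
```

The check only looks the packages up; it does not import them, so it stays fast even for large libraries.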
In this step, we create what's known as an IPython profile. There are other interactive packages for Python, and they share this idea of a profile. In a profile, you can configure some properties, like the IP that the interactive shell binds to when you access it with the browser. You can configure the port and the actual directory that you land in, and we're going to get it to land in the Pig directory. So when we write our Python scripts inside the IPython Jupyter remote shell, those scripts will point by default to the Pig home directory. This will create a profile that's set up for working with Pig. In particular, on the Hortonworks VirtualBox machine, the default port for IPython is used by another process, so we need to configure a different port from the default so we can make a remote connection. Our first step is to issue this command in the terminal, and that's going to create this hidden directory, .jupyter. In there will be this file, jupyter_notebook_config.py, and we need to change its contents. We need to include these lines here to set up our configuration. In a terminal, we'll issue this command. I'm in the root directory, and I issue this command. If everything worked out, you will see what I see here. Now we go to a file browser, and we have that hidden directory. If we go in here, this is the file that we need to change. So now I've changed into that .jupyter directory, and I'm going to vi the file. Now I go and copy. I go down, I hit i for insert, and I can paste those lines in, with maybe one line after. Now hit Escape. So I've put the content in the file. Now Shift-colon w to write, Shift-colon q to quit, and I have successfully updated the file. So now I'm ready to run my interactive IPython shell, provided by the Jupyter server. But before I do that, I need to open up this port on VirtualBox. So I go to the VirtualBox interface, I go to Settings, I go to Network, I go to
Port Forwarding, and I create a rule where I have exactly what you see here: 8889 for the host port and 8889 for the guest port. So I create a rule that looks like all the other rules, and it's just got that port there. And now I'm ready to run. I just copy my command, my IPython Notebook command, and we look carefully at the output of this command. So run this command. Now, if we look here, we can see it's serving from this directory, which is the directory we configured, the Pig home directory. And if you look here, it's saying the notebook is running at all IP addresses. It's binding to every IP address that's open on the box, it's on the port that we opened up for it, and that's the same port as in our rule. So now in a web browser we go to this URL, which is just localhost and the port that we opened up, the port we configured IPython to listen on. If we go to that URL, it will take us to the interactive shell, and you can see the Pig JAR and the piggybank JAR. When we work with the Pig interface for Python, we will use the piggybank JAR, so it's there, ready to go. So our interactive shell has opened up inside the Pig home directory. Now we'll go through the steps to run the Python script which generates the simulated log data. We go here to New, we go down to Notebooks and select Python 2. This is our notebook interface, and this is where we put our Python code. Now, if we go over to GitHub and go to the log data generator script, we can just copy it from here. Then we go to our new notebook and paste it in. This is called a cell. The key idea now is to learn how to use the IPython shell. We're going to write to our root directory. We want to write into this directory because we have write permissions there. As for the home directory of our IPython shell, if we go to /usr/hdp, then this numbered directory, then on down to pig, here's pig, and this is our home directory.
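The profile settings discussed in this step (the address the shell binds to, the port, and the landing directory) might look like the lines produced by the sketch below. This is a hedged illustration, not the course's actual file: the c.NotebookApp option names are those of the classic Jupyter Notebook configuration, port 8889 is taken from the port forwarding rule above, and the notebook directory is a hypothetical placeholder because the exact Pig home path is not given in the transcript. The snippet only renders the text; it does not write to ~/.jupyter.

```python
def render_notebook_config(ip, port, notebook_dir):
    """Render lines that could be pasted into jupyter_notebook_config.py."""
    lines = [
        "c = get_config()",
        "c.NotebookApp.ip = {!r}".format(ip),            # address(es) to bind to
        "c.NotebookApp.port = {}".format(port),          # non-default port
        "c.NotebookApp.open_browser = False",            # headless virtual machine
        "c.NotebookApp.notebook_dir = {!r}".format(notebook_dir),
    ]
    return "\n".join(lines) + "\n"


# '*' binds to every interface; the directory below is a placeholder.
config_text = render_notebook_config("*", 8889, "/path/to/pig/home")
print(config_text)
```

Binding to every interface is what lets the forwarded port on the host reach the notebook inside the guest, and the same port number has to appear in both the configuration and the forwarding rule.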
And if you look here, we have this file here, Untitled.ipynb. We go to rename and give it a name, like gendata1. So we've named it gendata1. And if we go back to the file browser, we need to change directory to see the update. So now my IPython notebook has been renamed to gendata1, and the file extension is .ipynb, which stands for IPython Notebook. The code is saved here. Now, for opening this file here, we go to root. We're giving the path from the top of the file system, the top of our file system tree: root, and then the name of the file. If we go back to our file browser, you can see we have the top of the file system, we have root, and it's in this directory here. What we'll do now is run this script. We've pointed it at the output file, so it's going to write to that location. This code is going to simulate a server log. Each record is going to have an IP, a location, a time and a process. So what we need to do is go to Cell, then Run All. It writes really quickly. It's going to write a thousand records, so it writes that very quickly indeed. Now we go back to our file browser and refresh our root directory. We need to change directory to see the update, and we want the event log. There it is. So we open this, and there's the log data. To demonstrate that again, let's delete this. I've just deleted that file, so it no longer exists, and we just repeat this: Run All. It's running, and it takes a second to run. Go back, refresh the root directory, and the file's back. So it didn't take very long to generate a thousand records. If we set that to a million, say, or some large number, we can generate a very large log file. So now we have set up IPython, and we've demonstrated how we can rename and save, how we can put Python code into a cell, and how we can choose Run All to run the cell. I hope you can see how convenient this remote shell is, because we could be writing this code from anywhere.
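A generator along the lines described above, writing records with an IP, a location, a time and a process, could be sketched like this. It is not the course's script from GitHub: the field values, the country codes other than US, the process names, the start date and the file name are all hypothetical, and the sketch is written for Python 3 and writes to a temporary directory rather than the root directory.

```python
import os
import random
import tempfile
from datetime import datetime, timedelta

COUNTRIES = ["US", "DE", "FR", "AU"]          # hypothetical: four locations
PROCESSES = ["login", "search", "checkout"]   # hypothetical process names


def make_records(count, seed=0):
    """Return `count` comma-separated records: ip, country, time, process."""
    rng = random.Random(seed)                 # seeded so runs are repeatable
    start = datetime(2016, 1, 1)              # arbitrary start time
    records = []
    for i in range(count):
        ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
        stamp = (start + timedelta(seconds=i)).strftime("%Y-%m-%d %H:%M:%S")
        records.append(",".join([ip, rng.choice(COUNTRIES), stamp, rng.choice(PROCESSES)]))
    return records


def write_log(path, records):
    """Append each record on its own line, as a tailing process would expect."""
    with open(path, "a") as handle:
        for record in records:
            handle.write(record + "\n")


demo_path = os.path.join(tempfile.mkdtemp(), "eventlog-demo.log")
write_log(demo_path, make_records(1000))
with open(demo_path) as handle:
    line_count = sum(1 for _ in handle)
print(demo_path, line_count)
```

Changing the count passed to make_records is all it takes to produce a much larger file, which is the point made above about generating a million records.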
So what we'll do now is set up Apache Flume to output data from this file. This file will be the generated log data, and it will be the source for the Flume pipe. Using Apache Flume, we will put it into HDFS, then we will use HCatalog and process it with Hive and Pig, and in particular we'll be writing more Python code to work with Pig, so we can write some Pig UDFs. Now that we've installed our IPython remote shell for writing Python scripts and running them on the Hadoop cluster, we will go through the steps to install and configure Apache Flume. This is very quick, unlike installing IPython. Firstly, we copy this command, yum install -y flume, where the -y just means automatically answer yes. This should go through quickly. The Flume install is particularly simple. There you go, it's done. So Flume is very quick, and now we have Flume on the system. Our second step is to configure the Flume conf. If we go into an SSH file browser, we go to this directory, /etc/flume/conf, and there will be a template flume.conf file there. We will have to change the values for our setup, so we'll just go through that now, and then we'll talk about it. I've highlighted here the critical values: this is the input to Flume, and this is the output in the HDFS file system. I've logged into the top level of the file system of the Hortonworks virtual machine with an SSH file browser. So now I locate the etc directory and go into it, and I'm looking for flume. Here's flume. I go into flume, then I want the conf directory, and you can see I have a flume.conf. Now I'm going to have to open that in vi. So I copy my command into the terminal, and you can see everything is commented out here. Go down to the bottom, hit i for insert, and make a bit of space to paste into. Go into the documentation, copy across this configuration, and back here, paste. Now hit Escape. Now Shift-colon w to write, then Shift-colon q to quit.
And now we've configured Flume. The configuration values that we set up in this Flume configuration are very clear and simple. Firstly, we have an event source. Consider this as being like a real-time event source; this would be some logging process which we're going to tail. Then we have a channel, and this channel is telling Flume that this will be from a file, so from a file to a file, and the sink is in the Hadoop file system. So the data is going to the Hadoop file system. It's coming from a file on some local file system somewhere, and the output is to the Hadoop file system, and it's like a stream with a channel. And then this is our event source, so we're going to tail this file. Think of a web server running on the internet: it would be continually updating its logs. These are like events that get pushed into a stream and get output to HDFS. This is the path in HDFS, and what's really good about this is that we don't need to create it. We don't need to go to all the trouble of creating this in HDFS; Flume is going to create it in HDFS for us, which is really good. And then this is a buffer, if you like. If we overflow the memory, there's a temporary file that it writes to, so if it runs out of memory it can back up its events to the file system. It's a very simple configuration, and it's very effective and works very well. Our next step is to start Flume, so we copy this command into the terminal on the virtual machine, and you can see Flume is booting up. Once it comes up, you will basically see the command that we just put in. The last part of this command here gets tailed to the screen, so we'll just look for that. Here we can see it. When we see this, we know that Flume is up and running. So the next thing to do is to generate the log data with the IPython notebook so that Flume will pick it up.
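The source, channel and sink arrangement described above can be sketched as a set of Flume properties. The snippet below renders such a configuration as text from Python 3. It is an assumption-laden illustration rather than the course's flume.conf: the property keys are standard Flume ones (an exec source running tail, a file channel, an HDFS sink), but the choice of a file channel for the buffer, the agent and component names, and every path are hypothetical.

```python
def render_flume_conf(agent, log_path, hdfs_path, checkpoint_dir, data_dir):
    """Render a minimal tail-to-HDFS Flume agent configuration as text."""
    props = [
        ("sources", "tail_src"),
        ("channels", "file_ch"),
        ("sinks", "hdfs_sink"),
        # Event source: tail a growing log file.
        ("sources.tail_src.type", "exec"),
        ("sources.tail_src.command", "tail -F " + log_path),
        ("sources.tail_src.channels", "file_ch"),
        # Channel: buffer events on the local file system.
        ("channels.file_ch.type", "file"),
        ("channels.file_ch.checkpointDir", checkpoint_dir),
        ("channels.file_ch.dataDirs", data_dir),
        # Sink: write events into HDFS.
        ("sinks.hdfs_sink.type", "hdfs"),
        ("sinks.hdfs_sink.channel", "file_ch"),
        ("sinks.hdfs_sink.hdfs.path", hdfs_path),
        ("sinks.hdfs_sink.hdfs.fileType", "DataStream"),
    ]
    return "\n".join("{}.{} = {}".format(agent, key, value) for key, value in props) + "\n"


conf_text = render_flume_conf(
    agent="agent",                              # hypothetical agent name
    log_path="/var/log/eventlog-demo.log",      # hypothetical input file
    hdfs_path="/path/in/hdfs/flume/events",     # hypothetical HDFS directory
    checkpoint_dir="/tmp/flume/checkpoint",
    data_dir="/tmp/flume/data",
)
print(conf_text)
```

Note the asymmetry in the property names: a source lists its channels in the plural, while a sink reads from a single channel.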
What we need to do is change the output path of our Python code in IPython to where Flume is expecting to see it as an input. We put it into the root directory at first just to check that we could write the data. We're going to need to make some slight changes to our Python script, so we need another terminal open to the machine. One terminal is running Flume, and now we start our IPython Notebook. Before we start it, we must run this command to enable the Python 2.7 interpreter. So now we can start IPython Notebook, and you can see it's giving us the output on port 8889. So now we just start a browser and go to localhost on that port, and IPython has come up. I'll just pull that down so you can see: we went to localhost 8889 and IPython has come up. Now we load up our notebook. The notebook has come up, and we're going to make some small changes. First we're going to change the output, so it needs to go to /var/log. We need to change this line; just copy that, and make sure it's on one line. Here we're writing to the file system. Make sure we have that newline appended there, so it writes each new record on a new line. So that's good. We can save this with Save and Checkpoint, then we go to Cell, Run All, and it's running and generating the data. Now we go to the terminal where Flume is running, and Flume is now telling us there is output going into HDFS. So our next step is to create the HCatalog table. By now our Python process will have finished, so we can go to our IPython terminal and hit Control-C twice, and that's shut down. Now what we need to do is run our HCatalog command, and this will create an interface, within the Hue interface, to the data in HDFS if everything has worked out. So we run this command. It's now booting up. It's a YARN job, so just let that go through. The YARN job completed successfully.
So now we need to check that everything worked out, and we can do that with Hue. We go to our Hue URL; I'll just pull this down so we can see. We go to the Hue interface, go to HCat, and we should see a firewall logs table. So this was our output. If we look at the HCatalog command, we were creating this table, firewall logs, and so in Hue we find that table. Here's firewall logs. We select it and go to Browse Data, and if everything worked, we will see the data. So now we basically have this as a relational schema, and we can use this data with Hive and with Pig. Our next task will be to work on using and learning the Python interface to Pig so we can start to run some analytics on this data. But I hope you can see from this exercise what an excellent tool Apache Flume is, and also Apache HCatalog and IPython. IPython is a very good, very convenient way to write code to drive data, and we will see it's very convenient for writing Pig. Now, before we can start to work with the data in Python and Pig, we must understand its structure on the file system. A convenient way to do that is to go to the Hive interface, Beeswax, then go to Tables, select firewall logs, and view it. We go to View, and here you can see View File Location. Go to View File Location and we can see what the data looks like. We're actually here in the Flume events directory, and you can see we have all these little log files. These are the Flume events as it's writing to the HDFS file system. If we look at one of them, we can see it's a little subset of the data. That creates a problem in the first instance, because we can't read from all those subsets; there would obviously be too many. We're working with a very small set of data here, only a thousand records, so things go quickly for the video. But in the real world, with a real log of maybe two gigabytes, it would obviously not be feasible to read from all these files.
That's where HCatalog comes in, because it assembles them all as one table in Hive. So if we go back to Hive and back to our tables, we have this firewall logs table. If we go to Browse Data, we can see HCatalog has assembled a Hive table view of the data from all those small log files; it has assembled them into a view in Hive. So now what we will do is perform some analytics, and we will look at two simple, very common forms of analytics. One is visualization, so we'll look at how we can use the IPython Notebook and Python to do some simple visualizations on the data. Then we will also look at how we can use Pig to do some filtering. When we do the visualization in Python, we're going to need subsets of the data in the local file system. So our first task will be to use Hive to export the results of Hive queries to the local file system. We're going to try to get a query like this to work, where we select from the Hive view from HCatalog and output into the local file system as a CSV file. HCatalog has assembled a Hive view of all the data in all these event log files. What we want to do first is collect them into a single file, and the easiest way to do that is to create an external table. External tables are useful when you want to work with the data from outside HDFS. So this will be our first task: to create this second, external table. Then we want to map the data from the HCatalog Hive view into this single external table in the HDFS file system, in the root directory, so we'll do that now. We go to our Beeswax Query Editor and copy in our external table definition. We're using the same schema as firewall logs, and we're calling it firewall logs 2. So we create this table: execute this query. And now if we go to Tables, we see that we have firewall logs 2. We can view the table, so we can see its schema.
And now what we want to do is insert the data into this table from our original firewall logs table, which is really just a view of all those different files in the Flume events directory. But first we need to understand these two inserts. In Hive, if we use insert overwrite, we will overwrite any existing data, which is useful when we want to refine queries, whereas insert into will add data to the existing table. So it's important to understand the difference. Now when we run this query, it's going to populate our new external table and overwrite any existing data. So we execute this query. While this query is running, one directory above the root directory, issue this command so you won't have any permission issues when it tries to export the data. Go back to our query, and we can see it's finished. If we go to our tables, now we go to firewall logs 2, and if we look at a sample of the data, we can see the data is there. So now we have this data. What we want to try to do now is put it out into the local file system. But just before we do that, let's view this file, because it's useful. Here it is. It's just got that name made of zeros, so that's not the most useful name, and it's important to know that. So now we're going to run our export statement, and we're careful that we have given the permissions on the location we're going to export to. We need to make sure that we will have write permissions in that directory. We're going to export it as a CSV file; we're going to try to export all the data from this firewall logs 2 external table that we've just created. Go back to Beeswax and copy this in. Before we do this, we need to make sure that the directory or file we're outputting to doesn't already exist at the location we're outputting to. So execute this. It's running. It's a MapReduce job, and it went fairly quickly.
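The difference between the two inserts explained above can be illustrated with a small analogy. The sketch below uses Python 3 and the standard-library sqlite3 module, not Hive. SQLite has no INSERT OVERWRITE, so the overwrite is approximated by clearing the table and reloading it in one transaction; the table name logs_copy and the sample rows are hypothetical.

```python
import sqlite3


def insert_into(conn, rows):
    """Append rows to whatever is already there, like Hive's INSERT INTO."""
    conn.executemany("INSERT INTO logs_copy VALUES (?, ?)", rows)


def insert_overwrite(conn, rows):
    """Replace the table contents, approximating Hive's INSERT OVERWRITE."""
    with conn:  # one transaction: clear the table, then load the new rows
        conn.execute("DELETE FROM logs_copy")
        conn.executemany("INSERT INTO logs_copy VALUES (?, ?)", rows)


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM logs_copy").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs_copy (ip TEXT, country TEXT)")
batch = [("10.0.0.1", "US"), ("10.0.0.2", "DE")]

insert_into(conn, batch)
insert_into(conn, batch)
after_append = row_count(conn)       # rows accumulate with each insert
insert_overwrite(conn, batch)
after_overwrite = row_count(conn)    # only the latest batch remains
print(after_append, after_overwrite)
```

This is why the overwrite form suits a query that is being refined and re-run: each run leaves exactly one copy of the result in the table.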
Now, if we go to our local file system, we can see we have our new directory, which we were writing to. If we go there, we can see the name of the file. It's not that helpful. If we want to open this file, we're going to have to rename it, so we'll rename it to something.csv. Now it's something that the file system can recognize. We open it, and we can see there's an issue here: it's all on one line. Delete this. That's now deleted, because otherwise we'd hit an exception when we run the query again. We can add this lines terminated by '\n' clause, so let's add that on to our Hive query. We're hoping that now, with the lines terminated by the newline character, when we open up the CSV file we will see one record to a line. Otherwise our further flow-on work will fail. So we'll just have a look on the file system. We refresh the file browser and look here. Let's rename it to this .csv name. So what we did here was rename the file to a CSV file. We wanted to check it has the correct delimiter, so we copied it to the local file system, off the guest box, and opened it in WordPad, because Notepad wasn't respecting the delimiter. We can see it is indeed correct: this is our data, correctly comma-delimited. Now we've successfully exported the data from Hive. This query exported the data to the root directory of the virtual machine with the correct CSV delimiter, and it terminated the lines. So now we have the data on the local file system of the virtual machine, and we can access it via IPython. What we can do now is look at some of the visualization tools in IPython, and we can also look at how we can access the data in Hive and HDFS with Pig for filtering. Now that we have successfully mapped the data into the local file system of the virtual machine, we can perform some analytics, in particular visualization of the data.
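The rename and the delimiter check described above can also be done in code rather than by opening the file in an editor. The following Python 3 sketch works on a throwaway file in a temporary directory. The exported file name 000000_0, the target name firewall_logs.csv, the four-field layout and the sample records are assumptions for illustration only.

```python
import csv
import os
import tempfile


def rename_to_csv(export_dir, exported_name, csv_name):
    """Give the exported file a .csv name so other tools recognize it."""
    source = os.path.join(export_dir, exported_name)
    target = os.path.join(export_dir, csv_name)
    os.rename(source, target)
    return target


def check_records(csv_path, expected_fields):
    """Return (row count, line numbers whose field count is wrong)."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle, delimiter=","))
    bad = [number for number, row in enumerate(rows, start=1)
           if len(row) != expected_fields]
    return len(rows), bad


# Stand-in for the export directory, with a file named the way the export names it.
export_dir = tempfile.mkdtemp()
with open(os.path.join(export_dir, "000000_0"), "w") as handle:
    handle.write("10.0.0.1,US,2016-01-01 00:00:00,login\n")
    handle.write("10.0.0.2,DE,2016-01-01 00:00:01,search\n")

csv_path = rename_to_csv(export_dir, "000000_0", "firewall_logs.csv")
row_count, bad_lines = check_records(csv_path, expected_fields=4)
print(csv_path, row_count, bad_lines)
```

If the export had come out all on one line, the check would report a single row with the wrong number of fields, which is the problem the lines terminated by clause fixes.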
We'll look at the ways we can set up an IPython notebook so we can visualize our data set. To help with this, especially for people who are new to IPython, we have carefully laid out every step. We always make sure that we run this command before we start up our IPython Notebook, and once the IPython Notebook is up and running, we log into it with this URL. So let's go to a terminal on the virtual machine and issue this command. Now we issue this command to start the notebook, so the notebook is coming up. Now we can access the notebook on this URL, and we will create a new notebook. We go to New and open Notebooks, and we can now rename this notebook to something that we like, like data vis. We can save it any time with Save and Checkpoint. So now we're ready. Copy over what we will think of as a header. This is all the imports and some other commands, so we copy this into the header cell, which is our first cell, and we'll just talk about what these imports are. NumPy as np: this is numerical Python, which has a whole lot of modules for numerical code. csv is an import and export module for CSV files. Now, this is the important one, pandas as pd; we'll talk in a minute about what pandas is. And for the visualization we have Matplotlib, which is similar to MATLAB: MATLAB-style visualizations. So we have the data structure module, because pandas is a data structure framework, the CSV file module and a numerical module; they're the important ones. And this line tells IPython to embed the plots inside this browser. That's why this is a really good tool for quickly visualizing the data. When we're working with big data sets, it's incredibly helpful to be able to quickly visualize the data. Now, in our next block, we're going to define a function that will read the file from the file system. So we look on the virtual machine in the SSH
file browser, in the root directory, and we can see we have the directory we exported from Hive. If we go in there, we have the file. So here we have the path. Whatever path we end up with when we export to the local file system, we need to import the data with that path. This is just standard Python. We won't go too much into the Python; we'll just highlight the important parts. This is just a file in Python, and this is just an array. In this next block of code, we populate our array using the pandas object, pd. You can see here that pd has a read_csv function that takes a file. Now, pandas has this idea of an index, and an index in pandas can often be just column names. And here at the end we give it the delimiter. By default it will use the comma delimiter; we're just putting it here for clarity. So we populate the array using the pandas object pd, we close our file, and then we return our pandas object. Pandas is a data structure library, so it's returning this object. You can think of it as a matrix if you want to, and it's going to have as rows the contents of our CSV file. So we copy the code, go to the IPython notebook, create a cell and paste it in. Now we can use this function; we've declared our read CSV function. In the next block of code we use the function. We create what we can think of in pandas as an index, but is really just the schema, the names of the columns in our data. We use the function we just created, and it returns this pandas object, which we're going to think of as being like a matrix. Then there's the handle's shape: when we ask for the handle's .shape, it's going to return the number of rows and columns, the shape of the matrix. We create a cell for this and paste it in there, and if we haven't made any mistakes, when we go to Cell, Run All, we should see the shape of the matrix. And there it is. It has this many rows and this many columns. So now we have our data inside a Python data
That data structure is a pandas data structure. So let's just quickly have a look at what pandas is. This is pandas: it's a Python data analysis library, and it has these data structures. There are two basic ones: one is called a DataFrame and the other is called a Series. DataFrames and Series have their own embedded functions, and you can learn how to use them from the documentation here, which is extremely useful. One of the points I'd like to point out is this comment: pandas allows us to focus more on research and less on programming. And this is the key thing about Python in general, and why it's so popular with data scientists: the speed with which you can get something up and running. They don't want to spend any time learning syntax. They're not interested in frameworks. They just want to be able to run their functions. As developers, we need to be able to come up with tools and development environments to support data scientists, so they can just walk in, pull up a terminal, and type in the functions that they want to visualize, grind through, and generate the data for. And that's why we are looking at IPython in the context of this module. Let's look at how we can use pandas to do a simple filter on the data. We're going to run this code block. This line takes our pandas object and returns a view as a DataFrame. And here, this is the filter code. It aggregates a subset of the data on this field, so this will return all the rows where the country field is equal to US. So this is a simple filter, and this line is just like a data dump. It's easier to see when we actually do it in the code. I go to the notebook and create a cell. We will think of this as our data visualization and data analysis cell, so we'll change what's in here. Let's run this code. We go to Cell, Run All, and you can see it's printing out, and it's going to truncate the data.
It won't print it all out, which is very good. It prints the first rows and it prints the last rows. This is a good way to get a quick view of the data, and of course we could change our field value. So that's a very convenient function, and it is returning a view of the data as a DataFrame. A DataFrame is like a matrix, and the rows of the matrix are our data. There are a lot of operations we can perform on DataFrames. Not all of the functions that are embedded in the DataFrame objects are available in IPython, but many of them are. This is just to give you a view of what you can achieve quickly and easily with pandas. What we will do now is look at what we can achieve easily and quickly with pandas and visualization. To plot the data with matplotlib, we need to be able to get some numerical values. So what we will do is use the pandas object's grouping function and group on our schema's country field. We have four countries, so we will have four blocks of data, and then we will count the number in each block. That way we obtain a numerical value, and then we can plot that grouping with a simple bar plot. So we copy this code, go to our IPython notebook, and paste in the code. Cell, Run All, and there we can see it quickly provided us with the plot. Whether or not it's a good plot, the key point here is that we can do it so quickly. So it's very easy to obtain some sort of useful visualization of data using matplotlib. So those are our two quick analytic illustrations. We looked at filtering: using pandas, it was very easy to construct a filter and do a data dump of the filtered results. And then we saw how easy it was to set up some visualization.
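The two illustrations recapped above, a filter on the country field and a group-and-count that produces plottable numbers, might look like this in pandas. It is a sketch under assumptions: the rows are invented, and the `country` and `action` column names are hypothetical stand-ins for the course's schema.

```python
import pandas as pd

# Hypothetical rows with four distinct countries, as in the walkthrough.
frame = pd.DataFrame(
    {
        "country": ["US", "DE", "US", "FR", "AU", "US"],
        "action": ["allow", "deny", "deny", "allow", "allow", "deny"],
    }
)

# Filter: a boolean mask keeps only the rows whose country field equals "US".
us_rows = frame[frame["country"] == "US"]

# Group on the country field and count the rows in each block,
# which gives one numerical value per country.
counts = frame.groupby("country").size()

print(us_rows)
print(counts)
```

In a notebook with matplotlib enabled, `counts.plot(kind="bar")` would draw the simple bar plot from these counts.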
Once we had some way of creating a numerical value from the data, we could quickly create a plot, and the key word was quickly. Now, exporting the data to the local file system so we could perform these analytical techniques of filtering and visualization was quite difficult using Hive. HCatalog very easily set up a view of the Flume output. The Flume output was scores of log files, and HCatalog made dealing with that number of files easy. Getting the data out of Hive into the local file system was not so easy. HCatalog and Flume are all about doing things quickly and easily, and IPython notebook is the same: it's all about doing things easily and quickly. Using HCatalog and Pig is a much easier and quicker way to get the data out of HDFS into the local file system. Having said that, it was still worth it to learn how to do it with Hive, because here we are on the Hortonworks sandbox, where everything is provided for us. In industry, when we work, we will not always have such good tools, so it was good to learn how to do it with Hive. What we'll do now is look at the easy way, with HCatalog and its interface to Pig. When we ran HCatalog on the Flume output, we created an HCatalog table in Hive called firewall logs. And if we look at the location for firewall logs, we don't see a nice firewall logs set of data. We see a lot of these log files that are the Flume events. The table that HCatalog creates is just a view of this data. Then, inside Hive, we worked through a series of tables. First we created an external table that we called firewall logs 2. Then we populated that table with the view data from the HCatalog table, firewall logs. And then we ran a further SQL query to export it to the local file system. And when it was on the local file system, it was not user friendly: the file had a strange name.
Now let's look at how we can achieve exactly the same thing with much less work using the Pig interface to HCatalog. This is our code for the Pig script, and we don't need to create all these tables. We just run these lines. Here we're using the HCatalog loader, which is the HCatalog interface to Pig. You know it's the Pig interface to Hive because you can see this is in the hive package. So we load it in from HCatalog, then we store it as a single set of data in HDFS. This line stores it in the HDFS file system, and then we use Pig's copyToLocal command to copy it out of HDFS into the local file system here. To make all this work, we must tell Pig to use HCatalog. We go to the Pig screen, and we must give the script a name. Now we have a name, and down here we must tell it to use HCatalog, so we type in -useHCatalog. I put that in there, and this is a key step: you have to hit return, so it has to look like this. Now let's check our local file system. It's going to output data1 into the root directory, so we need to check that it doesn't exist, because if it exists here we will get a MapReduce error. We also need to check in the Hue file browser, so we go to root. If our data exists here, we will also get a MapReduce exception. So we need to make sure that neither of our output directories exists. We have two, one in HDFS and one in the local file system, and they must not exist, because if they exist the job will throw an exception. So we're ready to run this now. We execute, and it takes a little while to execute, so we'll pause the video while it executes. Now we can see our Pig script, which kicked off a MapReduce job, has executed and completed, and the green says we have no errors. If we go down and look at our logs, we can see all the output of the MapReduce jobs, and there is a summary here at the bottom.
So it has taken all the records from the HCatalog view and stored them in HDFS here. Now if we go and look in HDFS, in the file browser, and go to root, we can see we have data1 there. These are the files. And now if we go into our root file system and refresh, you can see we have data1, and you can see it has output. So we open it, and we have the same data. In Notepad it's not displaying that well, but you can see we have the output data. So working with Pig was a lot easier and much less work than when we ran the Hive scripts. With Hive, we had to create an external table, then we had to populate our external table with the data from the HCatalog view of the Flume output, and then we had to run an export query to write it out to the local file system. With Pig, we just ran this one script and it did everything for us, so it was much quicker and much easier. Now it turns out that Pig and Python go together really well. So in the next video segment will move on to writ

8. 8 Data Science with Hadoop Predictive Analytics: Data science is an iterative activity. This means it follows a clear sequence of steps. When we formulate our question or hypothesis, our first step is to acquire our data, usually through some sort of scientific process or some data acquisition process. Then the data is unstructured, so we need to structure the data or clean it. This is where we will have a Hadoop MapReduce step. Then we need to visualize and explore the data. Then we can refine our hypothesis or evaluate our model: we measure or evaluate the data against our model, and we get some metric or some measure of how successfully our model fits the data. And then we can refine our hypothesis. If the measure on our model shows a positive result, we then acquire more data based on the model that we have evaluated, and we repeat the sequence of events. It's an iterative process.
Now machine learning is a part of artificial intelligence that is used to train computers to do things that are impossible to program in advance, so the computer is learning as the process runs. Handwriting recognition is a classic example of a problem that can only be solved through machine learning. Search engines like Google and Bing, Facebook's and Apple's photo tagging applications, and Gmail's spam filtering are all examples of machine learning at work. We can break machine learning down by the algorithms used, and it falls into two categories: supervised and unsupervised. Algorithms for supervised machine learning include regression and classification. Unsupervised learning algorithms include clustering and recommenders. When a machine learns, the data set it learns from is often referred to as a training set. The training set of data can often be seen as a table of values, and in that table, the names of the columns would be called the features of the training set. In this topic, we will look at an application that we will develop for machine learning, where we use supervised learning and regression algorithms to produce a solid application based on predictive analytics. We obtain training data and we evaluate our model through a process of continuous training. During the training, we could change the labels of the column headers in our feature matrix. So we continually evaluate our model, and while we're evaluating our model, we could change our feature matrix extraction process, where we're extracting from the mass of raw data just those fields, just those columns, that we need to best support our model. When we've been through this iterative process a few times, we have a model. We can use that model to make predictions, and our predictions will be accurate to the extent that we have evaluated our model successfully via our training data. Let's just take a moment to look at the context for big data analytics.
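As a small illustration of the train-then-predict loop just described, here is a one-feature least-squares regression in plain Python. It is only a sketch: the training values are invented, the function names are hypothetical, and the course itself relies on a library regression algorithm rather than this hand-written fit.

```python
# Hypothetical training set: each row pairs one feature value with the
# observed target value the model should learn to predict.
training_set = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def fit_line(rows):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new feature value."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(training_set)
print(model, predict(model, 5.0))
```

Evaluating such a model would mean comparing its predictions against held-back rows, then revisiting the chosen features if the fit is poor.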
Big data analytics is a major growth area for professional development and for information technology careers and jobs. So let's support this claim. Let's look at this first link, which is back in 2014. It's saying Cloudera has just raised nearly $1 billion by selling an 18% stake of the company to Intel, and that raised their market capitalization to just over $4 billion. We can look at this link, which is at the start of the year: Cloudera has just won a cloud contract with the CIA, where Amazon is providing the infrastructure and Cloudera is providing the software and the analytics engine. The major telecommunications carrier for Indonesia has just chosen Cloudera as its big data analytics platform. So this shows huge growth for just one company in big data. The big growth area for IT and programming careers and jobs, now and in the future, will be big data. Cloudera has experienced such phenomenal growth because they consistently show they can provide solutions that other people struggle to provide. One example is accessing data sets in HDFS with Python. To do this we use the package Pydoop. I was unable to get it to work on Hortonworks, but Pydoop is supported on Cloudera. This is a blog from a Cloudera engineer, and he's saying that Python has this fantastic scientific stack and that Python is the language of choice for data science. So on Cloudera, you have a bridge from HDFS to Python known as Pydoop, and this will be one of the technologies that we use in this application. So let's go through the steps now to construct such an application. For the preparation phase, we are following a short sequence where we prepare the data. We can obtain our data set from the source link. This is what our data set looks like: it's related to temperatures and soil moisture. We have a schema up here with the data set, and we're going to extract a features matrix from this data set. And this is a link to download the Cloudera QuickStart VM.
We have looked at Cloudera, and its stack is definitely worth learning. Then we have the latest Hadoop code changes. We will be using the very latest Hadoop API, and we will set up a MapReduce job using no deprecated methods. Once you've downloaded the Cloudera virtual machine from that link and unpacked the download, you import it as an appliance. So, with VirtualBox running, you go to Import Appliance. Once you've imported the OVF file, in the settings you need to make sure you've given it plenty of RAM. It's really memory hungry, and here you can see I have given it 12 gigabytes, because I'm in a position to give it lots of memory. I've got four processors and 12 gigabytes of RAM devoted to the machine. When the machine first boots up, you need to run Launch Cloudera Express, so you just double-click on this icon and it pulls up a terminal. If you have run the machine previously, you don't need to do this. You only need to do this once. If you've done it previously, you will get this message. But if you haven't, the script will run and you will see some output in the terminal. Once it's complete, you just hit Enter to exit the terminal. When the machine boots up, it also boots up Firefox, and you see the following browser screen. It's similar to Hortonworks: you have a Hue interface and some other interface screens. None of them will be available yet, because the Cloudera Hadoop stack is not running at the moment. To start the Hadoop stack, we need to access the Cloudera Manager screen. And to do that, we need to wait quite a few moments, possibly five or six minutes, for the machine to boot up and initialize. After five minutes or so, you're ready to log in. So you agree and you log in: it's cloudera, cloudera. The username and the password on the box are both cloudera, and then you log in. Now the important point to realize here is that this Cloudera QuickStart is the actual cluster. Over here I can add a cluster. So what we're looking at here is the Cloudera Manager.
Cloudera Manager is software that can manage more than one cluster, but we want to use the Cloudera QuickStart cluster, so we need to click on this here. This could be a bit confusing, so I'm pointing it out. Now we've gone to the Cloudera QuickStart VM cluster, so we're in the Cloudera Manager for the cluster that's running on the box. Let's go back and just double-check that. When you first log into the manager, you see this screen, and you need to click on the left here, where it says Cloudera QuickStart. Then you need to go to Actions and click Start or Restart. So we'll click Start, and click Start again to confirm. It's running through and starting all the services. We will pause the video while all the services boot up. This latest version of the Cloudera QuickStart VM wants a lot of memory, so you may see some errors here if you don't have enough memory. Now we've started the services. In the first instance, we really only care about Hue and HDFS, and we don't see any errors next to those services, so it looks like we're good to go and we can close out of the QuickStart manager. Now we will get our source code out of GitHub onto the machine, load it into Eclipse, and run our first MapReduce job. One of the things to note is that MapReduce on Cloudera is a little bit different to MapReduce on Hortonworks. On Hortonworks, we had an hdfs user. On Cloudera, we don't do that step; we just work as the cloudera user. Also, the file system tree for Hadoop is a little bit different in the way it's set up. Initially, we will use HDFS to run a MapReduce job. There we set up the raw data, we clean it, and we map it into the HDFS file system. To do that, we will use Eclipse, and we will need to pull our source code for the MapReduce job from GitHub. Here is the link to the source code on GitHub. Now, in the virtual machine, we go here: System Tools, File Browser.
We can see I have the source repository pulled onto the machine. Now I launch Eclipse. Eclipse is starting up, and it's going to ask me for a workspace. I go to File, Switch Workspace, Other. I go to the Cloudera home directory, choose where I downloaded the source code from GitHub, and then I choose the Java source folder. So I'm using the Java source folder inside the downloaded repository as my workspace. Eclipse will switch to the new workspace, and now what we need to do is configure the dependencies inside Eclipse. Inside my new workspace, I need to create a new project. There are many ways to do this. If you are not sure, just select New, Java Project. Give it a name, predictive analytics, or any name you like, and click Finish. And here is our project, which is set up as just a simple Java project. Now, clicking on the source directory, I right-click and go to Import, then I go to General and choose File System, then Next. I go to Browse, and I go to my downloaded source, into the Java source folder, and inside the source I go to the simple package. I select OK, and it's saying there are no resources currently selected for import. That's because we just need to select this little box here. I hope you can see that. So now we can import the resources, and we click Finish. Now we have the files for the Java MapReduce job, but they don't have their dependencies. So our next job is to configure the build path, and we need to point to the system dependencies that are available inside the Hadoop client directory in the Cloudera virtual machine. So we right-click and go to Build Path, Configure Build Path. What we want to do is add external JARs. Here we have Add External JARs, and the JARs are all inside usr/lib/hadoop/client. We want all these JARs, so we go to the bottom and, pressing the Shift key, select all the JARs in one go. Select OK, OK, and now we have all the dependencies that we need. Let's look at the referenced libraries.
Now we can see we're referencing all these libraries in usr/lib/hadoop/client. So these are all the dependencies we need. We have three errors, and if we look at the errors, they're saying the declared package does not exist. So if we go to our source code and open up that class, we can select the quick fix. It's saying move the class to the package, and we can repeat that for each of the classes. And now we can delete that redundant package. If we look at our errors again, we can see we only have warnings. Look at the warning: it's saying that the method is deprecated, so we'll look at fixing that next. We can fix that warning by adding a Boolean argument, true. Now we save, and we should have no warnings. So we're not using any deprecated code. That's the key point I want to get across: we've set up this MapReduce job using the latest Hadoop API, and we have no deprecation warnings. Now, in my main class for my MapReduce job, I'm going to set a Boolean that allows me to run the code locally, to test locally. It's on line 70, and I set it to be false. Now that I've set my Boolean to be false, it will run with the correct paths pointing to the home directory in the local file system. So now I will export my uber JAR. That's my JAR with all my code that will run as my Hadoop job. With this setup on Cloudera and Eclipse, it's quite simple. I just go down to Export and select Java, JAR file, then Next, and I click to include my source code. Then I must give it a name, so I go to Browse and I can call it uber jar. I select the home folder, where I'm going to run it, and I select OK, Finish. And if I go to the Cloudera home directory, I can see I have this JAR. Now if I open the JAR with the archive manager, I can see it has all my source code, but it doesn't have the Hadoop dependency code. So it's a different kind of uber JAR.
And the reason that this will work is that all the dependencies my code needs exist inside Cloudera at run time. Cloudera has set it up so that all these dependencies are available at run time on all the nodes in the cluster. If that wasn't the case, our code would fail. So now we will move on. In the next video segment we will test the code locally and we will test it on HDFS. First, let's review the source code for our MapReduce job. If we go to the online GitHub repository, we can see we have these packages: the Java package, log data analysis, and scientific data analysis. In this topic we are concerned with the scientific data analysis and Java packages. Our MapReduce code exists inside the Java package, and we can see its classes. This is our main class for the MapReduce job. Then we have a training data mapper and a simple reducer. So these are the three classes that we run in our MapReduce job. If we look at our main class, we are using the latest Hadoop API, and the best practice for that API is for the main class to extend the Hadoop API class Configured and implement the Hadoop API interface Tool. Tool and ToolRunner come in from the Hadoop util package, and Configuration and Configured come in from the Hadoop conf package. So that's how we set up our class, by extending Configured and implementing Tool. The main point is that we use the ToolRunner run method, and we feed it a configuration object and an instance of our main class itself. Then, in the Tool interface method that we implement, we create the configuration this way: using the instance of our main class and calling its instance method getConf. Our main class is of type Configured because it extends Configured, and it has the superclass method getConf, which is an instance method. So we call it with an instance of the class, and this allows us to get the configuration at run time on the cluster, so we can change it dynamically at run time.
So this is the new configuration flow with the new Hadoop API. To use this flow correctly, we have to create the job the new Hadoop way, by using the static getInstance method. So in the Job class we call the static getInstance method to get an instance of Job. So these are the new constructs in the new Hadoop API: we have this Configured superclass and we implement Tool, we have a ToolRunner class with a static run method and an interface run method, and we have this new way of creating the configuration and this new way of creating jobs. If we do that, we won't have any deprecated methods in our MapReduce code. So now we have seen how you configure the main class in Java with the new Hadoop API and how to configure and create jobs. We have seen how you can pull the code into Eclipse and set up the build path for the dependencies in Eclipse for the Cloudera system, and we've seen how you can export the JAR. Now we will look at how we can perform a local test of a Hadoop MapReduce job on Cloudera. We'll copy the data set to the home directory, and then we will run the exported JAR with these flags. I'll highlight the flags. With these flags, it will run not on the cluster but in the local file system. This is convenient for checking that our MapReduce flow is set up correctly. Here we are in the Cloudera home directory on the Cloudera virtual machine. This is our uber JAR that we previously exported from Eclipse. Inside our source code that we pulled from GitHub, we have the data set text file, and we copy that into the Cloudera home directory. We copy the command from the documentation, with the correct flags to run as a local job or local test. So now we're in a position to run a local test. When we submit this hadoop command, the flags that I've highlighted here mean it will only run in the local file system, not parallelized on the cluster. So, having set up correctly, we run this command in the terminal, and it should just run very quickly as a local job. And indeed, it's finished already.
It ran as a local job. If we look at the output, you can see it completed successfully, so the flags here ran it as a local job. This doesn't work on Hortonworks. Let's go into the home directory, and now you can see we have our output directory. We go into the output directory and open up the output file, and now you can see we've cleaned our data. We used a MapReduce job to turn this into a training set. We've extracted just the fields that we want from the corpus, or main body, of data. In the raw data you can see it's tab delimited, but it's not faithful to the tab: sometimes it's just one space, sometimes it's many more spaces. Now we have made it pipe delimited, everything is very clean, and we have an ID. So this is our first training set, and these columns are our feature set for now. We'll just go through these steps very carefully. We created an uber JAR from Eclipse, we copied over our raw corpus of data, we ran our MapReduce job, and we created a training set with our selected features matrix. Now we'll step through the code that performs the MapReduce job we just tested, using the new Hadoop API. We can set the output delimiter for the reducer, so for convenience we've created a method. Here we have created a method called set text output format separator, and we pass it an instance of a job and the actual delimiter that we want to use. If you look at that method, we just get the configuration from the job, and then we use this key here for the configuration map and give it the value of the separator. This is standard Hadoop code, where we're setting the main class for the JAR. Here we're setting the output type for the mapper key, and here we're setting the output type for the mapper value. So the input for the key/value pairs coming into the reducer is the Text class, the MapReduce version of String. Then we set our mapper and reducer classes. Our mapper class is the training data mapper.
And here we do most of the work for the MapReduce job. Our reducer class is just a very simple key/value reducer. So let's look now at the training data mapper class. In the training data mapper, we have an input key/value pair with the MapReduce types LongWritable and Text. Then we have our output to the reducer, key/value pairs that are just Text. Now we grab a line from the value coming in, we convert it to a string, and then we split on the spaces. We saw that our input data had a variable number of spaces, and we can handle that with that expression there. Then we check that we don't have an empty line. We simply write, from our array of string splits, the columns that we want to set up in our feature set. Finally, we set up our pipe delimiter in a string format, so we put the pipe between the feature set column values. Now, for output to the reducer, we send the first split value, which we will consider to be the key, and then we send the values, and the values will be the columns of our feature set, pipe delimited. Our reducer class, the simple reducer, just passes through the key/value pairs it receives from the mapper. Back in our MapReduce main class, we can now see how our mapping phase and our reducing phase are set up. Here we set the input for the MapReduce phase, and here we set the output for the reducer phase. The input type will be a text input and the output type will be a text output. We configure the different paths that we need, depending on whether we're running a local test or running on HDFS. Now the way Cloudera is set up is a bit different to Hortonworks, and you can see that here with the output paths. On Cloudera, by default, we will be in the Cloudera home directory on HDFS, and that makes our output paths simpler, but we must be aware of it. Then we get the output file system and we delete the output, so that when we run the job a second time, an existing output won't break the job. The output must not already exist in HDFS.
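The mapper step described above (split on runs of whitespace, skip empty lines, keep selected columns, join them with a pipe, emit the first column as the key) can be sketched in Python. The course's mapper is a Java class; this stand-alone function only mirrors its logic, and the column indices and sample line are assumptions.

```python
import re

# Hypothetical indices of the columns kept as features.
FEATURE_COLUMNS = [1, 3]


def map_line(line, feature_columns=FEATURE_COLUMNS):
    """Turn one raw line into a (key, pipe-delimited features) pair.

    Returns None for empty lines or lines with too few columns.
    """
    # \s+ treats one space, many spaces, or tabs as a single separator.
    fields = re.split(r"\s+", line.strip())
    if fields == [""] or len(fields) <= max(feature_columns):
        return None
    key = fields[0]
    value = "|".join(fields[i] for i in feature_columns)
    return key, value


print(map_line("101   12.5\t0.31  77"))
print(map_line("   "))
```

A pass-through reducer would then write each key and value unchanged, separated by whatever output delimiter the job configures.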
If it does exist when we run the job, the job will throw an exception. So by deleting the output here we avoid that exception. Then we set up our input and output paths and we set our job name, and, using the ToolRunner, we submit the job and wait for completion. So that's our MapReduce code, and that is the preparation phase in our scientific data analysis. In our iteration of phases, this is the prepare data, clean data phase. What we'll do now is run this on HDFS, which on Cloudera is very simple. By default, Cloudera will default to the Cloudera home directory. So with this command, where we just put the raw data text file into Cloudera HDFS, we don't give the path to the Cloudera home directory. We just do a put and it goes into the home directory by default. Now, we can always clean that home directory with this useful command, and we can look at what's in there with these commands. I have previously issued this command, and you can see I'm giving the path in Hadoop to the Cloudera home directory and removing everything in there. So if I issue this command, I should see an empty Cloudera directory. I paste this command in carefully, and it returns: there's nothing in there except what I put into the trash. And now we'll try this. Notice that we're not giving a path to the Cloudera home directory; we issue this command using the default location. So we're issuing this put command, but we're not specifying the final location of where we're putting the file, and by default Cloudera will put it into the Cloudera home directory. This is really useful, because if we work with that Cloudera home directory, it really simplifies our paths, and it works in the Java code as well. So now we'll look at what's in the user Cloudera directory, and we should see our data text file. So there it is.
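The delete-the-output-before-running pattern mentioned above can be illustrated on the local file system. This is only an analogue: the course code deletes the output through the Hadoop file system API in Java, whereas the sketch below uses Python's standard library on a temporary directory, and the helper name and file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_output_dir(output_dir):
    """Remove a previous run's output so the job can recreate it.

    Returns True if an old output directory was found and deleted.
    """
    path = Path(output_dir)
    existed = path.exists()
    if existed:
        shutil.rmtree(path)
    return existed


# Simulate a leftover output directory from an earlier run.
workdir = Path(tempfile.mkdtemp())
out = workdir / "output"
out.mkdir()
(out / "part-0").write_text("old result\n")

first = prepare_output_dir(out)   # old output found and removed
second = prepare_output_dir(out)  # nothing left to remove
print(first, second)
```

Running the cleanup unconditionally before each submission is what lets the same job be run a second time without failing on an existing output path.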
So we're just doing a put to HDFS with no path to its final output location, and Cloudera knew to put it in the Cloudera home directory. That simplifies our paths, and it also simplifies the paths in the Java code. As far as I'm aware, this is something that only works in the Cloudera implementation of HDFS. Before we can run the JAR on the cluster, we must rebuild and export the JAR from Eclipse, because we need to set that Boolean so it changes our paths from local to cluster mode. So we do that now: we go into Eclipse and change the path. In Eclipse we have run on HDFS set to false, and we now need to set that to be true. So we set run on HDFS to be true and save everything. We go into the file system with the file system browser and delete the JAR that existed. Now in Eclipse we go to File, Export, JAR, and it should be set up. We need to make sure we include the source: if you don't tick that box here, it won't work. We've set it up with the correct name from previous times, so we just click Finish. Now if we look in the file browser, the JAR is there, and the Boolean will be set correctly. So now we just copy the command, go into the terminal, clear the terminal, carefully paste the command in, and run the command. It should run in HDFS, and it should have its input path set up so it will find the input file. So we just wait while the job completes. The job has completed, so we check that it completed successfully. Wait. Yes, it completed successfully. Now, in the Cloudera browser, we want to go to Hue. It takes a little while for Hue to reconfigure itself. Now that it's configured, we go to the file browser, and we can see in the Cloudera home directory we have our raw data, and now we have our output directory, just as we had with the local test. And if we have a look at the data in Hue, we can see we have our training set. So this is the feature matrix that we will work with. What remains to do now is to install Python and Pydoop. Pydoop will be our interface to HDFS.
And we will now perform the predictive analytics using the scientific data tools that are available in Python, and there is a range of truly excellent scientific data analysis tools available in Python. So we will learn about them, and we'll learn how to configure Python with Cloudera and Pydoop so we don't have to copy from local to HDFS with Java. All we do is run Pydoop to interface with HDFS, and we can run all our Python data analysis code directly on the files, on the training sets, that exist in HDFS. So that's what we'll do. In the next video segment, we will install the dependencies we need to support our analytic code for predictions, and these will be Python dependencies. We will need to set up Python 2.7, as we have done in previous topics. Then we will install the HDFS-to-Python bridge, Pydoop, and it only works on Cloudera. Pydoop won't run on Hortonworks at the moment. Then we will install Python modules for our analytical functions. We'll install modules for visualization and predictive analysis. For our visual analytics, we will rely on matplotlib, which is a MATLAB-style plotting library for Python. Here is the documentation you can read about it, and for us it allows us to embed a useful graphics engine inside an IPython shell. Then we have our bread-and-butter tool that we use all the time in scientific Python; it's known as pandas. Now, pandas is a tool whose main object you can think of as a matrix or a relational table, and it's extremely useful; it has lots of powerful embedded functions. Then for our predictive analytics we will use sklearn, and the online documentation for sklearn is here. The scikit-learn package has the regression algorithm that we will use, but it has many other supervised and unsupervised machine learning algorithms. We will rely on this library for the regression algorithm. Finally, an extremely useful module for Python: Patsy.
Now, Patsy works really well when we are going to run some sort of statistical modelling engine in Python. What Patsy does is trivialize the task involved in setting up the inputs to the model. So when we submit our data to our modelling engine in Python, Patsy makes it easy. So those are our core dependencies. What we'll do now is go through the steps to install them. As in the previous module, we follow these steps. First we install the shell and the native-level dependencies for Python. We install the build tools that Python needs. Then we install Python 2.7 itself. Then we enable the native linking. Then we download the tools to install pip. And once we've got pip installed, we can install modules. I refer you to the previous module; we won't go through steps 1 to 7 now. What we'll do is start at step eight, because there's an extra module we're including here. So everything we have done up to step eight is the same as in topic seven. But now, here in step eight, we are including Pydoop, which only runs on Cloudera. Some important points to note. Firstly, you must be logged in as root, and the root password is cloudera. Then, we won't go through all the issuing of these commands, because we've covered this in previous modules. So this is the first part where there is a new component. Previously we did not install Pydoop. So we will start at this step, in the virtual machine, in a terminal where we are logged in as root. If we are in the terminal and we haven't previously issued the command in this terminal to enable the Python 2.7 interpreter, we must run that command now. If you have been in this flow of installing, you have previously run that command. But if it's a new terminal, you need to make sure you run that command. Now we copy our machine learning dependencies, including Pydoop, into the terminal. This part will take a while, so we run our command. So it's going through and downloading all the dependencies it requires.
We'll pause the video while this goes on, and we'll show you the final part so you can see what the install success message looks like. When the install for all the dependencies is finished, if it was successful, you will see this part here. So these are our key libraries; they all installed successfully and there are no red error flags. Now, the next step is, just as in the previous module, to install the IPython notebook and generate a Jupyter configuration file. We also need to have a different home directory. In the last module we pointed the home directory of the IPython notebook to the Pig install directory. But now we will point it to the Cloudera home directory. We install IPython, still logged in as root. It's a quick install. Now we exit root, because we want to configure the IPython Jupyter configuration with cloudera permissions. So we go to our documentation. We will now generate our configuration file for the Jupyter IPython notebook. Because we're changing shells, we need to make sure that we issue our command to select our interpreter, the Python 2.7 interpreter. Because we exited out of root into a fresh cloudera user shell, we defaulted to the Python 2.6 interpreter. So now we run this command again. It will work, as it's running with the correct interpreter. So the important point is that we often need to reissue this command to reset the Python interpreter to 2.7. So now we've generated our notebook file, and we will open it in vi. This is the location; we go into that directory location and we go vi into it. I copy the configuration and we carefully paste it in. Everything was correct. We'll just double-check. It's worth a quick look, because it's easy to make a typo here, and that breaks the configuration. So everything has to be correct. Double-check everything is correct. Shift-colon, w to write; colon, q to quit; and we have configured IPython. But before we run IPython we need to open up the port.
So if we look in our documentation, we want to access IPython on this port, 8889. So we go to the main VirtualBox manager. Right-click, Settings, go down to Network, Port Forwarding, and we need to open up this port. So we find a typical configuration, this one will do, and we go Copy Selected Rule, and we will call this rule ipy and give it the port for both the guest and the host. OK, OK. And you can see it updated in real time. So now we go to the machine and run the command ipython notebook. Now IPython is running and we've opened up the port, so in a browser we should be able to access IPython on this URL. So enter that URL, and IPython is up. Now we're ready to run Python and access the data in HDFS. Now we go to the source code on GitHub, and we want the top-level data analysis folder. So we go in there and find the notebook file with the .ipynb extension. This is the IPython notebook extension. If we click on this, we can see the notebook is readable on GitHub just by default. Copy the top cell. We go to IPython in the browser, running on Cloudera, go to New, go to Python 2, and now we can rename this to whatever we want. So we will call it data_analysis. OK. And we enter those imports. Now below this, just to keep things clean, we enter another cell, write print test, and we will run this notebook as a test. So we go to Cell, Run All, and it's printed out test. But most importantly, we didn't get any errors saying it couldn't find these dependencies. So all our dependencies are installed. So let's step through the IPython code now to perform some analytics. Now that we have run our tests to confirm that our IPython notebook appears to be picking up all the dependencies, let's create a function that will read the data from HDFS directly into our Python code using Pydoop. So we copy this block of code from the GitHub repository, go to our browser and paste it in.
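The import test described above can be sketched as a small helper. This is only an illustration, not the course's notebook cell: the module list is an assumption based on the packages named in the lecture, and the helper name `missing_modules` is made up for this example.

```python
import importlib

# Module names assumed from the packages the lecture mentions
REQUIRED_MODULES = ["pydoop", "pandas", "sklearn", "patsy", "matplotlib"]


def missing_modules(names):
    """Return the names that cannot be imported by the current interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Cheap demonstration with standard-library names only
print(missing_modules(["json", "os"]))
```

In the notebook, calling `missing_modules(REQUIRED_MODULES)` and getting an empty list back plays the same role as the cell that prints test without errors.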
So where we had the test, now we have our function. The first line in our function imports our Pydoop module with an hdfs alias, and we can use that alias to open up files in HDFS. So in the same way we open up files in the local file system, we just use the hdfs alias to open files in the HDFS file system. So it's very simple and very convenient. So we create an array, and we create a file handle to the HDFS file system. We just check that the file handle actually worked, that we didn't get some sort of exception there. And we create a schema, key and value, because we create a pandas data frame eventually. Now this is an array. So we now construct the elements of this array using a pandas object. It has a CSV function, and we pass the HDFS file into pandas read_csv. So Pydoop is a really good interface. We can just use the HDFS file system like any file system in our Python code. Once we construct our pandas object, we return it and we close our HDFS file. So that's our method. From the notebook, we copy these next two lines; we go to our current notebook and paste in these lines from GitHub. So we have our path. This is our path to our data in HDFS. We know it's in cloudera. We know the output directory because we looked at it before, and that's the MapReduce file. And so we call the read-CSV-from-HDFS function we just set up on that file, get some value from it with some Python, and then we print out that value so we can look at what we have. So this will allow us to see what we obtained from HDFS with Pydoop. Before we do that, we need to make sure on the virtual machine that we have started Hadoop correctly. So in the manager we go Actions, Start, so that HDFS is running, because we shut it down before when we were going to install the dependencies.
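A minimal sketch of a reader like the one just described, not the course's actual function. It assumes the MapReduce output is tab-separated key and value with no header row, and it takes an optional `opener` argument (an addition for this example) so it can be tried without a cluster. `pydoop.hdfs.open` is the Pydoop call for opening an HDFS file; the sketch is written for Python 3, whereas the course uses Python 2.7.

```python
import io

import pandas as pd


def read_csv_from_hdfs(path, opener=None, sep="\t"):
    """Read a delimited text file into a pandas DataFrame.

    opener is any callable that takes a path and returns a readable file
    object. When omitted, pydoop.hdfs.open is used, which needs Pydoop
    and a running HDFS.
    """
    if opener is None:
        import pydoop.hdfs as hdfs  # lazy import: only needed on the cluster
        opener = hdfs.open
    handle = opener(path)
    try:
        # Assumed layout: key<TAB>value per line, no header row
        return pd.read_csv(handle, sep=sep, header=None)
    finally:
        handle.close()


# Local stand-in for an HDFS file, with made-up rows
sample = "key1\t35.1|-97.2|21.5|30.0|1\nkey2\t20.4|-80.0|25.0|41.2|0\n"
result = read_csv_from_hdfs("unused-path", opener=lambda _path: io.StringIO(sample))
print(result[1])  # column 1 holds the pipe-delimited values
```

With `header=None`, pandas labels the columns 0 and 1, which matches the lecture's indexing of the key column as zero and the value column as one.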
So if we go back to our notebook and now we go Cell, Run All, it's running; just wait now, and it's actually worked, so you can see we've got the data. So this is our feature matrix. So now we've successfully installed Pydoop, we've created a function where we use Pydoop to access HDFS, and then we created a pandas data frame object. So we have pandas. This object that we're returning, result, is actually a data frame. This is indexing to get the second column. The first column is zero and the second column is one. So if we were to replace this with zero and run this again, we would see the keys; there are the keys. So this is just indexing on the columns. We want to look at the value column, since MapReduce outputs key-value pairs. So this just says: from this data frame object, retrieve just the second column. So run that again. And there we have our feature matrix. Our next job now is to split that string. We want to extract all of the features. So now we have a string, a pipe-delimited string of our feature values. We want to create a data frame object where we have a column for each feature, and we also want the headers for those columns to be declared. So we have a feature training set. Now this line will create the split; we can see this is a feature set. So we want to split this string on this delimiter. So we copy this, and whatever is the last object in a cell, if there's nothing after it, it will print the values. So let's run that and just see what we get from our split. You can see our split worked out. And so now what we need to do is construct a data frame, because this has returned a list. So data value is a data frame; data value 2 is a list. Now that we have this list of values, we will use that list of values to construct a new data frame. But when we construct the data frame, we will also construct a schema from the GitHub code.
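The column selection and split described above can be sketched as follows. The rows are made up, and the variable names `data_value` and `data_value2` are guesses at the names heard in the lecture. Note that in pandas, selecting one column gives a Series, and splitting it gives a Series whose elements are Python lists.

```python
import pandas as pd

# Made-up rows in the shape the lecture describes: column 0 is the key,
# column 1 is a pipe-delimited feature string
result = pd.DataFrame(
    {
        0: ["key1", "key2"],
        1: ["35.1|-97.2|21.5|30.0|1", "20.4|-80.0|25.0|41.2|0"],
    }
)

data_value = result[1]                   # second column, returned as a Series
data_value2 = data_value.str.split("|")  # each element becomes a list of strings

print(data_value2.iloc[0])
```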
We will now copy extract-as-data-frame. Now, this is the schema, and we're going to construct a pandas data frame with that schema. When we construct a pandas data frame object, we start with an array. So here's an array we'll use to construct our data frame. And now we're going to do some filtering on the data. Some of the temperature values for the surface temperature and the daily mean temperature are incorrect; they are too small. If the temperature is less than minus thirty, we're not going to consider it, because that's like Antarctica. So we filter out those extremely negative temperatures, and now we create a dictionary object. This is where we are creating a schema. A dictionary is like a set of key-value pairs. So here we're letting our latitude be our first value, our longitude our second, and so on. And we are also constructing a schema with a columns value, where we put the same names in. So that's our method. So we copy that, we click in there, hit the plus, and paste in our data frame method. Now we can extract our data as a data frame. So we apply that to our data value 2, and we will call the returned object data-as-frame. And then we check that that worked out by just printing it out to standard output in the notebook. So we go Cell, Run All, and inside our data frame everything is working nicely. We have a schema and we have our data frame. In the next video, we will move on and use this data frame for some visualization, and we will run a logistic regression on this data.
9. 9 Visual Analytics and Big Data Analytics for e commerce: Let's just quickly review what we've done. We pulled in all our machine learning dependencies. So scikit-learn gives us our machine learning algorithms, and pandas gives us our basic data structure that we can work with. We looked at Pydoop and how we could create an HDFS handle to the Hadoop file system.
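The conversion from split values to a data frame, with the cold readings filtered out, might look like the sketch below. The column names are hypothetical (the lecture only names the fields as latitude, longitude, day temperature, surface temperature and moisture), and the cutoff of minus thirty is taken from the spoken description.

```python
import pandas as pd

# Hypothetical column names for the five fields named in the lecture
COLUMNS = ["latitude", "longitude", "day", "surface", "moisture"]
MIN_TEMP = -30.0  # readings colder than this are treated as bad data


def extract_as_dataframe(rows):
    """Build a typed DataFrame from an iterable of split string lists."""
    records = []
    for fields in rows:
        if len(fields) != len(COLUMNS):
            continue  # skip malformed rows
        record = dict(zip(COLUMNS, (float(f) for f in fields)))
        if record["day"] < MIN_TEMP or record["surface"] < MIN_TEMP:
            continue  # drop implausibly cold readings
        records.append(record)
    # Passing columns= fixes the column order, i.e. the schema
    return pd.DataFrame(records, columns=COLUMNS)


rows = [
    ["35.1", "-97.2", "21.5", "30.0", "0.2"],
    ["20.0", "-80.0", "-45.0", "10.0", "0.1"],  # day temperature too cold
    ["1", "2"],                                 # malformed
]
data_as_frame = extract_as_dataframe(rows)
print(data_as_frame)
```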
And we could create a function that would read an HDFS file and return a pandas object. We looked at using that function and extracted a features matrix. We needed to split the features matrix, and that returned a Series; a Series is not a data frame. So then we created a method to translate Series data to a data frame. If we take a first glance at our data frame object, we can see we have this column that has no name, and it's counting the rows. So it's like an index, if you like. And then we have these column headers, and we can think of those as being like a schema. This is very much like a database table, and it does indeed have an index method. So let's explore its index method now. We create a new cell, and in there we call its index. So we run this, Run and Select Below. You can see it's returned an Int64Index type. It's a 64-bit integer, and it's the row numbers. So this data structure is indexed by this column. Now let's have a look at what is returned by its columns function, or its columns attribute, and you can see it's the names of the columns. Let's have a look at this last function. We have this ix function. So let's have a look at the effect of taking this part out. Run and Select Below, and now you can see we're getting from the second to the fourth inclusive, so we can see this is doing a row slice, and it's inclusive. Now we put back what we took out, and this is like a slice, if you like, of the schema: we can slice on its row space and we can slice on its column space. So we'll run that again, and you can see it's sliced on the column space and it's inclusive, and it's sliced on the row space and it's inclusive. So a data frame is a very convenient object. We can think of it like a relational table, and we can slice sections out of it. Let's have a look now at some of its other functions. Let's look at a simple aggregation and then a visualization of our aggregation. So we'll just print out a couple of rows of the feature set.
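A short sketch of the index, columns and slicing behaviour described above, on made-up data with hypothetical column names. The `ix` indexer used in the lecture has since been removed from pandas; `loc` is the label-based replacement and includes the end of the slice, while `iloc` is position-based and excludes it.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "latitude": [10.0, 20.0, 30.0, 40.0, 50.0],
        "longitude": [1.0, 2.0, 3.0, 4.0, 5.0],
        "day": [15.0, 18.0, 21.0, 24.0, 27.0],
        "surface": [20.0, 25.0, 30.0, 35.0, 40.0],
        "moisture": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)

print(frame.index)    # the row index
print(frame.columns)  # the column labels, i.e. the schema

rows = frame.loc[2:4]                     # label slice: rows 2, 3 and 4
block = frame.loc[2:4, "day":"moisture"]  # rows and columns, both ends included
by_position = frame.iloc[2:4]             # positional slice: rows 2 and 3 only
```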
We can see we have latitude and longitude, so that's location, then a day temperature, a surface temperature, and moisture. So let's run an aggregation where we are grouping on moisture. If we were farmers, it would make sense: this is the critical parameter, and we want to know how this parameter is modulated by these other fields, and the obvious one, in the first instance, is the temperature. What is the effect of my daytime temperature as opposed to the effect of my surface temperature? So what is the critical parameter here: the daytime temperature or the soil surface temperature? We group on moisture, and what we're doing is taking the first 40 rows. We are setting up our plot as a bar plot, where here we're looking at the last three values. So remember it's inclusive. We are using a slightly different index notation here. This one is inclusive; this one is exclusive. Written this way, it would be from 0 to 2 inclusive. But if we use it like this, where we don't have a first value, this is saying exclude everything before 2, so exclude latitude and longitude and include day, surface and moisture. Let's run this. So we go Cell, Run Selection, and there's a plot, so you can see it's very easy to work with pandas and visualization. So we have got a lot of good visualization tools here. We used Pydoop to pull our data out of HDFS, and we got our key-value pair, where our value was our training set, but it was pipe-delimited. So then we needed to split the string that was our pipe-delimited feature set, and unfortunately that split returns a Series, or a list in Python, not a data frame. So then we created another method to construct a data frame from a list, and we actually used some filtering in there to clean up the data. And so we obtained our training set, if you like, and then we looked at the basic, first-instance attributes and functions that this data frame gives us, and we saw we can do very good slicing of the data.
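The group-by aggregation described above can be sketched like this, on made-up rows with hypothetical column names. Moisture is given only two values here so the groups are easy to read; the real feature set will differ.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "latitude": [25.0, 30.0, 35.0, 40.0],
        "longitude": [1.0, 2.0, 3.0, 4.0],
        "day": [10.0, 20.0, 30.0, 40.0],
        "surface": [100.0, 200.0, 300.0, 400.0],
        "moisture": [0, 0, 1, 1],
    }
)

# First 40 rows, dropping latitude and longitude with a label slice
subset = frame.head(40).loc[:, "day":"moisture"]

# Mean day and surface temperature for each moisture value
means = subset.groupby("moisture").mean()
print(means)

# With matplotlib installed, means.plot(kind="bar") draws the bar chart
```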
And we saw it's got very sensible indexing and it's got a schema. And then we looked at an aggregation function. We can do aggregation; it has other aggregations you can look up in the documentation. We looked at a group-by aggregation, and we needed to collect numerical data for a plot. So we were able, on the subset of the aggregation, to extract the mean of a group, the mean of the moisture for the group. And then we could use that to make a sensible plot, so you can just see the blue here. So clearly the critical parameter is the surface temperature. The surface temperature is a parameter that overwhelms the daytime temperature. What we'll do now is move on and look at our machine learning algorithms. So this first part was to look at what a data frame is, how we can use it, and how we can do some quick visualizations of the data with a data frame, its attributes and its aggregation functions. Can we explore and extract meaningful patterns quickly from the data with those tools? We can see we clearly can. So now we move on to our machine learning code. Now that we've gone through data frames and we understand how to work with data frames, how to take slices of data frames and how to visualize data frames to look for patterns in the data, in what is essentially our feature set, we can move on to using scikit-learn to run a logistic regression, a supervised learning algorithm. And to do that is easy in Python because of its tools. The first tool that makes life very easy for this sort of algorithmic work is the Patsy module we spoke about earlier. Patsy allows you to take some sort of data structure, a useful data structure like a data frame, and convert it to an input that is friendly for a machine learning algorithm such as logistic regression. And then we need to import our machine learning algorithm itself. So we import this data structure type from Patsy called dmatrices.
We also import a learning model called logistic regression from scikit-learn. Logistic regression is a machine learning algorithm for supervised learning. We can think of it as a simple algorithm that will find a relationship, or a model, between a set of independent variables and a set of dependent variables. So for a simple example, here we have a dependent variable, moisture, and two independent variables. Now, we need to run a machine learning algorithm on a training set. So this is our first task: to create a training set. Here we're showing some data frame syntax. We're running a filter on a data frame, where we are looking at all the rows where the latitude value is greater than 23.8 degrees. And we're thinking of those as being the north values, so the data for the north section of the training data set. The return type of this syntax is a filtered data set that is a data frame. So this value is a data frame. We're going to set it up using the Patsy type dmatrices, which will set up everything nicely for a logistic regression type from scikit-learn. So we have our expression that defines the relationship between our independent variables and our dependent variable. We have our input data frame, and here we are taking a slice. We're slicing out the latitude and the longitude. We filtered on the latitude; we don't care about the longitude. So to keep things simple, we take a slice that excludes those values, and we ask it to return a type of data frame. So that's our dmatrices, and it returns our range and our column space, so we can look at our columns. We print out our columns. Here are our columns: it has an intercept value, an expression for the day temperature and an expression for the surface temperature. We need to flatten our y value, because we are putting in a data frame and we're getting back a data frame, and when we run the regression it wants to run on a list. So in this line here we flatten. We don't use the flatten function.
We use this function because it creates a copy of our data. So we pass in our y value and it returns a copy of our y value. We create an instance of our model, and then we run the regression. We fit our model, where we are asking for the moisture as a function of day temperature and surface temperature. We model that relationship here, and then we can look at the output. So in the first instance, what we do is just run our model to check that it's valid. We go Run and Select Below, and you can see we didn't get any error messages. So let's look at the mean value for our model. These y values are a set of y values over a range of x values, so we can look at the mean value. Let's run that, and we can see the mean value over its range is 0.1765. Now remember, this represents a moisture value, so that's quite reasonable. We can look at the model in more detail. This syntax here will print out the coefficients of the model, since this model is a mathematical function. In this sort of analysis, it's useful to remember to just look at the coefficients in the polynomial structure of the model. So the interesting information is found in the coefficients. Let's run this now and have a look at our model. So Run and Select Below, and it's nicely presented the output of our model in terms of its coefficients. Linear regression is always linear. Logistic regression does not have to be linear; in fact, in Python, logistic regression is a step up from a simple linear regression. We are importing it from linear_model, so it's a linearized version of a logistic regression. That's why we talk about things like intercepts. This is like a line function of two variables, day and surface. Note that the day coefficient is zero. This could be concerning. But recall when we did a plot of the data; we uncomment our plotting function and run that, and we can look and see that the day temperature is simply overwhelmed by the surface temperature.
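The Patsy and scikit-learn steps described above can be sketched as follows. This is an illustration on made-up rows, not the course's notebook: the column names are hypothetical, moisture is assumed to be a 0/1 label (logistic regression predicts a class), and `np.ravel` is assumed to be the flattening call. `dmatrices` and `LogisticRegression` are the real Patsy and scikit-learn names.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression

# Made-up training rows; moisture is assumed to be a 0/1 label
frame = pd.DataFrame(
    {
        "latitude": [25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 10.0],
        "day": [20.0, 22.0, 19.0, 25.0, 21.0, 23.0, 30.0],
        "surface": [15.0, 18.0, 40.0, 45.0, 16.0, 42.0, 50.0],
        "moisture": [1, 1, 0, 0, 1, 0, 0],
    }
)

north = frame[frame["latitude"] > 23.8]  # boolean filter returns a DataFrame

# Formula: dependent variable on the left, independent variables on the right
y, X = dmatrices("moisture ~ day + surface", north, return_type="dataframe")
print(list(X.columns))  # Patsy adds an Intercept column

y = np.ravel(y)  # fit() expects a one-dimensional target
model = LogisticRegression()
model.fit(X, y)

print(y.mean())  # share of rows labelled 1
print(pd.DataFrame({"term": X.columns, "coef": model.coef_[0]}))
```

The fitted model is linear in the log-odds of the label, which is why it reports an intercept and one coefficient per input column.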
Here we can see some very small day temperatures down here. Recall that we also filtered our data. So with the combination of filtering the data and the many orders of magnitude by which the surface temperature exceeds the day temperature as an effect, as a predictor, as the critical parameter, it's only to be expected that the model has correctly assumed the day temperature has no effect. So that's a confirmation that our model is working correctly. We've only scratched the surface of what is available in scikit-learn and Patsy. We can do much more. However, let's just review what we have done. We used a Hadoop MapReduce job with Java code to create a key-value pair where the value was a training set; the value was a pipe-delimited set of data that we would use as a training set. And then, using the scientific learning dependencies, the data structure dependencies, the visualization dependencies and the Pydoop bridge between HDFS and Python, we created a function where we imported the output of our MapReduce job, which is our key-value pair where the value was the training set. We imported that from HDFS with Pydoop and we converted it to a Python pandas data structure. Now, when we work with the data set, we need to call a split function, and we can convert our value, which is our pipe-delimited training set data, into a data structure, but it's not a data frame, and the data frame is what we want to input to our scientific learning and our machine learning. So we created a function that takes a list or Series and creates a data frame. And we used the opportunity of creating that function to do some cleaning on the data. So now we have our training set in a data collection object, a pandas object, in Python, and we can use the indexing functions of the data frame.
We can look at just a small subset of our training set, and we can look at an aggregation of the data structure. We could look at an aggregation on the data frame, get a handle on a visualization of what's going on in our data, and we were able to pick up that one of the variables, the surface temperature, is a critical parameter because it's overwhelming the daytime temperature. So then, using Patsy, we imported a dmatrices type, which is a bridge between our data frames and a logistic regression, or machine learning. We were able to successfully use the Patsy data structure and the machine learning algorithm to fit a model for the data, which confirmed what we observed in the visualization. So we've only scratched the surface of what is possible using Hadoop and the Python scientific machine learning data dependencies. What we'll do now is move on into the next topic. We will look at the Python interface to Spark, PySpark, and we'll look at how we can do powerful visualizations using Python on Spark machine learning data. In this topic, we will create application code for performing visual analytics on big data, and we will use a key element of the modern Hadoop stack, Apache Spark, and Apache Spark runs on a YARN cluster. Just to talk briefly about Apache Spark: machine learning algorithms, as we saw in the last topic, are iterative. We don't just run the algorithm once. What we need to do is train the algorithm, so we need to run it over and over again on enhanced data sets or training sets. So if we're dependent on disk writes between each iteration, it becomes a very slow process, and if we are dependent on MapReduce disk writes, that slows it down even more. So Spark was developed to solve that bottleneck. Spark allows us to perform machine learning algorithms as an iterative process because it does so much in memory. Spark has a number of APIs: it has a SQL API, a machine learning library and a graph library.
Now, this graph library is not a graphing or plotting library. It's a graph data structure, so we can use it like the Neo4j graph database. Its graph is a big list that we can use for persisting and processing data. Then it has a streaming API, where we can plug Kafka streaming into Spark and perform analytics in real time on Kafka streams. Spark is a very powerful tool. We will use it in this application via its SQL interface. The basic data structure of Spark is known as a resilient distributed dataset, or RDD. So we will work with Spark RDDs, and they're immutable. An immutable pattern is a way of designing an object that will work in parallel code where you have many threads. Immutable means it can't be changed, so it's thread-safe for a parallel environment. But we can perform operations on RDDs that return new RDDs, and it has map, filter and persistence operations. We can persist RDDs in different file formats, such as Parquet, via the SQL API. And if we serialize, we will serialize with Spark. Spark is an important tool, and we will learn how to work with it in this topic. We will also use Python analytics. We will use the Python interface to Spark, PySpark, and when we're running on a cluster operating system like Hortonworks, which doesn't support Pydoop, PySpark itself is a powerful bridge between HDFS and Python. So we will use PySpark for analytics but also as a bridge between HDFS and Python. And when we have our data in Python, we will introduce a powerful Python visualization module known as Seaborn, developed at Stanford University. Here is the link for Seaborn, and we're just briefly looking at it now. It's a powerful data visualization tool, so we can take snapshots of our data and run different sorts of visualizations to gain insights into the data. This is an important technique in data science. Now, we will have some preparation to do, so we'll start with the preparation. We need to prepare for two technologies.
We will have a MapReduce phase where we prepare the data, and for that we will have a Java Gradle application. Gradle is a powerful build tool, so we'll look at how we can set up a Java MapReduce application using Gradle as the build tool. Then we will have an analytics phase. We will have to set up our Python and PySpark modules and dependencies for scientific data, in particular Seaborn. So we'll move on to that now. We will install Python 2.7 and the dependencies that support Python 2.7, such as pip. We will follow exactly the same procedure as we did in previous modules. So we follow all the steps from the previous modules all the way up to step seven. Step eight is where it's different, so we are starting this video segment from step eight. Now, our step eight looks like this. The first thing that's important to note is that Pydoop is only available on Cloudera. Now we're working on Hortonworks again, so we will not install Pydoop. Instead we will use PySpark as our bridge to HDFS. The dependencies look like this. You'll notice there is no PySpark. That's because PySpark is already installed as part of the Spark client installation. We copy our install command from our documentation and paste it into a terminal. Now, there is a time management step here. This is a time-expensive operation, so you need to make sure that you have time for these commands to complete. We'll pause the video, but we will show you the final result. If your final terminal output matches what you see in the video, then you know it's correct. So run this command, and we'll pause the video while it completes, and we will just show you the final result. The final output in the terminal, if everything has worked out successfully, looks like this. Where I have highlighted, you need to see the running setup for the critical dependencies like NumPy, scikit-learn, Seaborn and pandas.
You need to see the running install for these dependencies, and after that you need to see this "Successfully installed" message. If you see this in the terminal and you don't see any red error flags, then you know everything worked out correctly, and now you've installed all the dependencies for Python. Now we will go through the steps to install and configure the IPython Jupyter notebook to work with Apache Spark. Every time we come into a terminal for the first time, we're going to need to run this command to enable the Python 2.7 interpreter. I've already issued this command in the terminal, so we don't need to worry. Now we will install the IPython dependencies. This is a quick install, so we issue that command. If you had an error at this step, it would be because you hadn't enabled the Python 2.7 interpreter. When IPython has installed successfully, this is the output that you see. Now, to configure the Jupyter notebook, issue this command, and this will generate the notebook configuration. So we vi on this file; the command is vi. To configure this file we scroll down, hit I, add some space, hit Escape and go back up. And now we copy the settings and paste them in. Now, there's a little mistake there: it's missing the "c.Notebook" prefix, c dot Notebook. We're setting it to listen on any IP, on port 8889, and we're pointing it to the Spark client home directory. So when we have successfully pasted that in, hit Escape, then Shift-colon, w to write the file, and Shift-colon, q to quit. And now we have configured the IPython notebook. So just to review: we are pointing the home directory for the IPython notebook to the Spark client directory. Here I've logged into the virtual machine with an SSH file browser and I'm at the top of the directory tree. Now go to the usr directory, hdp, into current, and I'm going to go down and find spark-client. Here's my spark-client directory. This is the path to the Spark client directory, and this is the path that I set here.
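A sketch of the notebook configuration settings described above. The transcript does not show the actual file, so this is a reconstruction under assumptions: the setting names are the standard `c.NotebookApp` options, and the directory path is inferred from the folders named in the walkthrough. The `try`/`except` stand-in is only there so the snippet can also be run on its own; Jupyter itself supplies `get_config`.

```python
# Sketch of a jupyter_notebook_config.py fragment
try:
    c = get_config()  # supplied by Jupyter when it loads the config file
except NameError:
    # Stand-in object so the fragment can be run directly as a check
    from types import SimpleNamespace
    c = SimpleNamespace(NotebookApp=SimpleNamespace())

c.NotebookApp.ip = "*"      # listen on any IP
c.NotebookApp.port = 8889   # the port forwarded in VirtualBox
# Assumed path, read off the folders named in the walkthrough
c.NotebookApp.notebook_dir = "/usr/hdp/current/spark-client"
```

Every setting needs the full `c.NotebookApp.` prefix, which is the mistake corrected in the video.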
So my notebook default directory must be the Spark client directory. Go back to the Spark client directory. Here we have a python folder, and inside the python folder is where we have our PySpark dependencies. So when we installed all our Python dependencies, we did not install PySpark, because PySpark is already installed in the Spark client. What comes next is a bit tricky. We need to configure the necessary Spark environment, so we need to have these environment variables. We're running CentOS, so we could put these in the .bashrc for ease of use, but for now we can just export them as variables. So we'll do that: export them in the shell. Later we will put them in the .bashrc; for now we will just export them from the command line. Then we need to create this PySpark configuration file. The documentation for IPython and Jupyter is in transition, because the IPython developers have moved over to Jupyter, so the documentation is not clear about exactly where the file should go in terms of Jupyter. In terms of IPython, it's clear: it needs to go in this directory here. The easiest way for now is just to make this directory inside the SSH file browser. We create the .ipython directory. We create the profile_default directory: new folder, with profile_default as the name. We create the startup directory: new folder, startup. And in here we are going to create this file. So now we've created the correct directory tree in the root directory of the virtual machine. In a terminal, we issue this vi command to create the empty file. Now we have the empty file, so we just copy the contents of the file and paste them in. There are some errors here: you can see the import statement is incorrect. So we scroll up, put in the missing i, and hit Escape. Now check that the file is complete. The file is complete, so Shift-colon w to write, Shift-colon q to quit. What we'll do now is test this file by running it in Python.
We want the full path to the file, which is here. Paste it into a terminal after typing python: we type python, a space, then the full path to the file. Now the file is running, and this should boot up Spark. If we have a problem with the environment variables, we will get errors. We didn't get any errors, so that means it was finding the Spark environment variables correctly. This file needs to run behind the scenes every time we boot up the IPython notebook; it needs to run as part of the IPython configuration. So we've successfully configured Spark: the PySpark setup file ran and found the environment variables. If you look into the code, you can see here that to the SPARK_HOME environment variable it appends the location of python in the Spark client. SPARK_HOME is the Spark client, so in this line we're pointing to the python directory in the Spark client. It's calling a shell script inside pyspark that will configure Spark to work with IPython. We need this file to run every time we load up IPython notebooks, so that Spark will be connected. This is how PySpark is configured inside IPython at the IPython runtime. So this was tricky. We need to make sure we have our environment variables set up correctly. We need to make sure this file is in the right place, where the IPython notebook will look for it. And we need to make sure the file itself is correct. Now we're ready to start up IPython to test the IPython-Spark bridge. We will run IPython, but we must open up the port so we can connect to IPython. If we are in a new terminal, we must issue this command to enable the Python 2.7 interpreter. Then we start IPython just as before, with ipython notebook, and now IPython is running on port 8889. We are in VirtualBox, in the settings of the running virtual machine. We go to Network, Port Forwarding. We copy a rule: right-click, Copy Selected Rule. We will call this ipy, so we know this is our IPython rule.
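The startup file itself is only shown on screen, so the following is a sketch of the logic described above rather than the course's file: read `SPARK_HOME`, put the Spark client's `python` directory on the import path, and locate the `shell.py` script inside `pyspark`. The function names are hypothetical, and adding the bundled Py4J zip from `python/lib` is an assumption based on how Spark distributions are normally laid out.

```python
import glob
import os
import sys


def pyspark_paths(spark_home):
    """Return the entries that need to be on sys.path for PySpark imports."""
    python_dir = os.path.join(spark_home, "python")
    # Py4J ships as a versioned zip; match it instead of hard-coding a version.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def configure(env=None):
    """Extend sys.path from SPARK_HOME and return the path of pyspark's shell.py."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        # This is the error you would see if the exports were skipped.
        raise ValueError("SPARK_HOME environment variable is not set")
    for path in pyspark_paths(spark_home):
        if path not in sys.path:
            sys.path.insert(0, path)
    return os.path.join(spark_home, "python", "pyspark", "shell.py")


def start_pyspark():
    """Run pyspark's shell.py, which creates the SparkContext named `sc`."""
    shell_script = configure()
    with open(shell_script) as handle:
        exec(compile(handle.read(), shell_script, "exec"), globals())


# A real startup file would end with a call to start_pyspark().
```

Keeping the path logic in `configure` makes it possible to check the environment variable handling without a Spark installation; only `start_pyspark` needs Spark to be present.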
The IP is correct, but we want port 8889, so we can copy that. We double-click in the port field and give it the same port number, 8889, for host and guest. OK, OK, and you can see it's configured on the fly. Now we copy the URL, open a fresh browser, paste the URL, and we have our IPython notebook. If you look, you can see we are in the Spark home directory, because there is the python directory where we have our PySpark dependencies. We just hope now that our Spark configuration file ran and that Spark is connected. To test that Spark is connected, we create a new notebook and give it a name: File, Rename, and call it spark1. Now we have a notebook called spark1, and to test that Spark is working, all we need to do is type sc, which is the Spark context. So enter that and run, and you can see it's actually found the Spark context. So we've successfully configured PySpark to work with the notebook. It was a bit tricky, because we needed to create a configuration directory. When we ran the Jupyter configuration command, it just created the Jupyter configuration file. So we needed to create a .ipython configuration directory. In there we needed to create profile_default, inside profile_default we needed to create startup, and inside startup we needed to create the Python configuration file. We tested that this file worked, and when we booted up our notebook, it called the file by default. There we have a Spark context, and with the Spark context we now have a Spark bridge to HDFS. We've successfully configured Spark from the Python side. In this application we have two components: the Python component, which we've just successfully configured, and a Java component. For our Java we're going to use Gradle, so again we download and install Gradle, and we will create a Hadoop Gradle application.
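For the notebook check described above, typing `sc` only shows that the name exists. A slightly stronger check, sketched here with hypothetical helper names, reads a few attributes from the context and runs a tiny job. It assumes `sc` is the SparkContext created by the startup file; `version`, `master`, `appName`, and `parallelize` are standard SparkContext members.

```python
def summarize_context(sc):
    """Collect a few identifying attributes from a SparkContext."""
    return {
        "version": sc.version,
        "master": sc.master,
        "app_name": sc.appName,
    }


def smoke_test(sc):
    """Run a tiny job; the sum of 0..99 should come back as 4950."""
    return sc.parallelize(range(100)).sum()


# In a notebook cell, once `sc` exists:
# print(summarize_context(sc))
# print(smoke_test(sc))
```

If the small job returns, the notebook is not just holding a context object but can actually submit work to Spark.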
We'll look at how we can install Gradle and set up a Hadoop Java Gradle application. Now we will move on to installing the Java dependencies. We will install our build tool and transitive dependency manager, Gradle, and this is the download link for Gradle. If you go in a web browser to that link, you will find the downloads for Gradle. What we want is the binary distribution. It's usually a good idea to use the latest version of things, but sometimes this can cause conflicts, so it's also possible to download previous releases. When in doubt, choose the latest release. So we download this. Here I have downloaded it into my home directory. We can now use the SSH file browser to go to the virtual machine again. We go to the root directory and just paste it in. Now that it's in the root directory, we need to unpack it. We rename the unpacked directory to gradle so that our paths are nice and simple. Inside gradle we have the bin directory, and that's what we want to point to so we can call Gradle from the command line. Now we will create a .bashrc file for our environment variables for Python and Java. In a terminal, in the root directory, we open .bashrc. We need to be careful here, because we can break our system. We go down, make some space, and go back up. We copy our export variables: in red are the Python environment variables and in blue are the Java environment variables. So we will configure .bashrc for Python and Java. With the file open in vi, go into insert mode, right-click, and carefully paste exactly where you want it to go. Now hit Escape to leave insert mode. We notice that we have a lot of space here, so out of insert mode, scroll down and hit the delete key to delete the lines so there's not so much space. Hit Escape, Shift-colon w to write, Shift-colon q to quit. Then we need to check that the .bashrc is viable, and this command will do that.
If we issue source .bashrc, then we will get errors if we have errors in our .bashrc syntax. There were no errors, which means that our .bashrc is viable. If we break our .bashrc, then when we reboot the machine with those errors, the machine may be broken and not reboot, so it's vital you do that step. By running source .bashrc, we also push the environment variables into the runtime, so we should get the Gradle version if we issue this command, the Gradle version command. OK, so we're not able to run Gradle because we have permission denied. We can fix that easily. Remember, in the sandbox we are always root. So we just run chmod with -R, capital R, and 777, every possible permission, on the gradle directory. Hit Enter, then try our command again: the Gradle version command, and that time it worked. You can see the Gradle version output, so Gradle is available now. So now set up on the box is our Java build tool and Java runtime dependency manager, Gradle. And we have our Python Spark configuration set up and our Python modules and dependencies installed. What we will do next is download the code via Git from GitHub, and we will get on with building the Java MapReduce code with Gradle and running our Hadoop MapReduce job to set up our initial data for Spark analytics. Now we will pull the source code from GitHub onto the virtual machine. Then we will build a MapReduce job with Gradle, and then we will run the MapReduce job on the cluster. We will find the source code at this link on GitHub. This is the source code. We have a notebooks directory where we have Python notebooks. Then we have a Java source directory, which is where we run our prepare-the-data phase via a Hadoop MapReduce job. So we need to clone this onto our virtual machine. If you're new to GitHub, the clone URL can always be found here. Git is installed on the virtual machine, so you simply pull this link with Git.
Here I am in an SSH file browser logged into the root directory of the Hortonworks virtual machine. I've cleaned things up by creating this folder and moving everything I'm not using at the moment in there, so I can see more of my directory structure. Here I have my build tool installation folder, gradle, that we set up previously, and here I have cloned the source code. This is the source code from the GitHub link. We have a Gradle build file, so what we're going to do is build and run the MapReduce job. We'll look at the output, we'll look at the input, and then we will look at the data on HDFS, and once we've done that we'll step through the code so we can understand what's going on. It's easier to understand code when you look at what it uses as its input and what it produces as its output. In a terminal, in the root directory of the virtual machine, we go inside the Spark analytics project directory where we pulled the source code. To build the code, we first try gradle tasks, and that will run through and show all the tasks. It's going to check that all the dependencies are available; if there are any errors, we will find them there. So we have some tasks here, and the interesting one is fatJar. This is the task that we will use to build the uber JAR for Hadoop. Before we build, we clean the previous build with gradle clean, and this will remove any previous build artifacts, so when we build the project we will be building into a clean folder. The clean was successful, so now we issue gradle build. The first time, this could take a while, because it needs to download all the dependencies. In this phase Gradle is the transitive dependency manager: it's going to use Maven or Ivy to go on the Internet to your declared repositories and download the dependencies for the project. And you can see "build successful". That's what you want to see: built successfully.
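The build sequence above (list the tasks, clean, build, then run the custom uber JAR task) can be sketched as a small script. This is only an illustration: `tasks`, `clean`, and `build` are standard Gradle invocations, while the task name `fatJar` is an assumption about how the project's custom task is spelled, and the helper names are hypothetical.

```python
import subprocess


def gradle_commands(uber_jar_task="fatJar", gradle="gradle"):
    """Return the build steps in the order used in the walkthrough."""
    return [
        [gradle, "tasks"],        # list tasks and surface dependency errors early
        [gradle, "clean"],        # remove artifacts from any previous build
        [gradle, "build"],        # compile; the first run downloads dependencies
        [gradle, uber_jar_task],  # assumed name of the custom uber JAR task
    ]


def run_all(commands, project_dir="."):
    """Run each command in the project directory, stopping at the first failure."""
    for command in commands:
        print("running: " + " ".join(command))
        subprocess.check_call(command, cwd=project_dir)


# Example, from the directory that contains build.gradle:
# run_all(gradle_commands())
```

Because `check_call` raises on a non-zero exit code, a failed clean or build stops the sequence before the uber JAR step.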
So now we're going to type gradle fatJar, and gradle fatJar will build the uber JAR for Hadoop. We try gradle fatJar, and now it's going to assemble all the build artifacts and the dependencies into an uber JAR. So we built successfully. Now we go to the SSH file browser and go to the directory containing the source code. We go to build, we go to libs, and here we can see the JAR. This is our uber JAR for the preparation phase. We simply copy this and move it to the root directory. If you run into permission problems moving the JAR with an external SSH file browser, this simple shell command will move the JAR. Here I am in the top-level directory of the source code, and I issue this move command inside the top-level directory. It drops down into the build directory, drops down into libs, and then moves everything that matches the start of the JAR name. There's only one artifact, the uber JAR, so it will move that into the root directory. We go to our SSH file browser and look in our root directory, and we can see the uber JAR is there. We also need to move the raw data. That's the raw data text file. We can issue a similar move command: everything that starts with R, and only the raw data starts with R, is moved into the root directory. Now we look in our SSH file browser and refresh it. You see we have the raw data text file, which is the input to the MapReduce job, and we have the uber JAR. Before we're ready to run the uber JAR in Hadoop, we must put the raw data into the HDFS file system. To run the JAR, we must first change to the hdfs user, because we're back on Hortonworks, and we must create the directory, which is root. We can clean it any time we need to with this command. Then we put the raw data file into the root directory, and then we run the MapReduce with this command.
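The HDFS steps just listed can be written out as explicit commands. The sketch below builds them as argument lists; `hdfs dfs -mkdir`, `-put`, `-ls`, and `hadoop jar` are standard Hadoop commands, but the file names, the `/user/root` layout, the output directory, and the arguments passed to the JAR are all assumptions, since the exact commands are only shown on screen.

```python
import posixpath


def hdfs_commands(raw_data, uber_jar, hdfs_dir="/user/root", output_dir="/user/root/output"):
    """Return the HDFS preparation and job submission steps as argument lists.

    These are meant to be run as the hdfs user (after `su hdfs`).
    """
    hdfs_input = posixpath.join(hdfs_dir, posixpath.basename(raw_data))
    return [
        ["hdfs", "dfs", "-mkdir", hdfs_dir],          # create the user directory
        ["hdfs", "dfs", "-put", raw_data, hdfs_dir],  # upload the raw data file
        ["hdfs", "dfs", "-ls", hdfs_dir],             # confirm the upload
        # Assumed argument order: input path, then output path.
        ["hadoop", "jar", uber_jar, hdfs_input, output_dir],
    ]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the shape of the commands.
    for command in hdfs_commands("rawdata.txt", "prepare-data.jar"):
        print(" ".join(command))
```

Printing the commands first, rather than running them, makes it easy to compare them against the ones in the course documentation before anything touches the cluster.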
Once the MapReduce has finished, if it works, we can check out the results with Hue. As root, we issue the command su, space, hdfs to log in as the hdfs user. The next command creates the root directory under user. We issue this command, and the directory exists, because we didn't get an exception. So let's put our raw data into the root directory. This is different from Cloudera, because we need to give the path to the directory where we're putting the file. We use the HDFS put command. Now let's check that it's there with the ls command, and we have our input data in HDFS for the MapReduce job. Now we will run our MapReduce job with this command, and if everything works, we'll look at the output in Hue. We'll pause this while the MapReduce code runs. So our MapReduce code completed. We scroll up and look for "Job completed successfully". There it is, so our job completed successfully. Now, in a browser on the host machine, we go to localhost 8000. I'll just bring the browser down so you can see the URL. We type localhost, and this logs into Hue. In Hue we go to the file browser. We want to go to the user directory, and in the user directory to root, and here we can see output. If we look at the output file, this is our output. Here you can see once again we have a key and a value, and the value is a pipe-delimited feature set. Now there is one issue with this data. You can see I have these Next Block and Last Block buttons. If we go to the last block, choose Edit, and go down to the end, we can grab the header and put it up at the top. Now we can save this. We've updated it in Hue, so now we have our data: our feature set is this pipe-delimited string here, and here we have our header. Now this data is wrapping around because I've minimized the browser. Maximize the browser.
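To make the output layout concrete, here is a small parsing sketch for lines shaped like the ones described above: a key, then a pipe-delimited feature set, with the header as the first line. It assumes the key and value are separated by a tab, which is the default for Hadoop text output; the function name and the column names in the example are hypothetical.

```python
def parse_feature_lines(lines):
    """Turn header-first key/value lines into (key, {column: value}) pairs."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    # The header may or may not carry a key column; keep only the feature names.
    columns = rows[0].split("\t")[-1].split("|")
    records = []
    for row in rows[1:]:
        key, _, value = row.partition("\t")
        records.append((key, dict(zip(columns, value.split("|")))))
    return records


if __name__ == "__main__":
    # Made-up sample in the assumed layout, not real course data.
    sample = [
        "id\tyear|labour|total_cost",
        "1\t1970|10.5|99.0",
    ]
    for key, features in parse_feature_lines(sample):
        print(key, features)
```

This also shows why the header has to be moved to the top of the file: a reader that takes column names from the first line would otherwise treat a data row as the header.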
You can see that it's displaying correctly. So now we have our data in HDFS, and we successfully ran our MapReduce job. What we'll do now is quickly step through the code that set up that MapReduce job. This is the code on GitHub. The MapReduce job was a Java job, so for the Java component of the code we've used Gradle as our project management tool, and here is our main build file, build.gradle. If we look at this, you can see we are declaring some plugins. If you use IntelliJ, you can declare the idea plugin; it has plugins for other IDEs. We're using the java plugin and the scala plugin, and a source level. It's important that we choose the correct source level: we can't use Java 8 here. We are setting our Hadoop version to 2.4.0, which is like the latest Hadoop version. Then we declare our remote repositories. We set up our source sets: because we could have Scala code and Java code side by side, it makes sense to declare some source directories. Under src/main, scala will be the root source directory for any Scala packages, and java will be the root source directory for any Java packages. Here we are declaring our Hadoop dependencies. This is our fatJar task that builds our uber JAR; we just saw that it worked correctly. And here we have to exclude some artifacts, otherwise we will get exceptions. So this Gradle build file successfully builds a Java MapReduce job. Now if we go back and look at the MapReduce code itself, it's only basic MapReduce. What's good about the MapReduce here is that it's using the latest API. Once again, we are extending Configured and implementing Tool. Then, with ToolRunner, we create our job with an instance of the Configuration class and an instance of our actual main class, PrepareData. Then we set up a job. If we don't set up our job this way, with the static getInstance method, we will get a deprecation warning.
Now we set our format classes. As in the last topic, we set input and output types, which are just Text because we're just working with strings. We set our mapper and our reducer classes. We set the output format for our reducer, we create our paths, and we submit and run the job and wait for it to complete. So it's very simple MapReduce code. We'll just look at the mapper and the reducer classes quickly. Our mapper, for its inputs, is just taking Text and LongWritable. Just as in the last topic, we split a string on its delimiter and we create our feature set, pipe-delimited. Just as in the last topic, our reducer class is a simple reducer: it just writes out the keys and the values. If we go back, we can look here at the schema set up for our data. The data that we are working with is for Swedish railways, so we have the data for all the railways in Sweden. The data schema is based on an ID, a year, and the type of data, and we are looking at expenditures. The important fields we will look at are the expenditure for labour and the expenditure for electricity as a cost per kilowatt hour, and our dependent variable will be the total cost of running the rail network. So we have a complicated schema for railway companies in Sweden and their total cost. If we go back to the top-level directory, we can just quickly look at the raw data. You can see the schema there,