Learn Spark and Hadoop Overnight on GCP | CS Viz | Skillshare


41 Lessons (3h 35m)
    • 1. Introduction to E-Commerce Data Load and Operation Setting Up Hadoop & Spark

    • 2. Data Explosion and Reduction in Storage Cost

    • 3. When Data is Referred as Big Data

    • 4. Computer Science Behind Big Data Processing

    • 5. Hot and Cold Data

    • 6. Hadoop Architecture

    • 7. Hadoop Cluster Data Operation

    • 8. High Availability and Replication for Enterprise Part 1

    • 9. High Availability and Replication for Enterprise Part 2

    • 10. GCP Dataproc and Modern Big Data Lifecycle

    • 11. Data Load into HDFS or Storage Bucket

    • 12. Configuring and Running Hadoop in GCP with Dataproc

    • 13. SSH Inside the Master Node and HDFS File System

    • 14. Summary - Part 1

    • 15. Introduction to Up and Running With Spark on GCP - Part 2

    • 16. SSH to Google VM Instance from Local Machine

    • 17. Processing With Spark

    • 18. Spark Inner Working Part 1

    • 19. Spark Inner Working Part 2

    • 20. Running Spark on Google® Cloud Platform

    • 21. Launching Jupyter Notebook in Spark

    • 22. A Simple Introduction to RDD

    • 23. RDD Architecture in Spark

    • 24. RDD Transformation and Actions Map and Collect Respectively Part 1

    • 25. RDD Transformation and Actions Map and Collect Respectively Part 2

    • 26. RDD Transformation and Actions Filter and Collect Respectively

    • 27. RDD Transformation and Actions Filter and Collect

    • 28. RDD Transformation and Actions Collect and Reduce

    • 29. RDD Advance Transformation and Actions Flatmaps

    • 30. RDD Advance Transformation and Actions FlatMaps Example

    • 31. RDD Advance Transformation and Actions groupByKey and reduceByKey Basics

    • 32. RDD Advance Transformation and Actions groupByKey and reduceByKey Example

    • 33. RDD Advance Transformation and Actions groupByKey and reduceByKey Example (2)

    • 34. RDD Caching, Persistence and Summary

    • 35. Why Dataframe and Basics of Dataframe

    • 36. Installing Faker to Generate Random Data for Examples

    • 37. Creating Random User Data With Faker and Creating Dataframe

    • 38. Working with DataFrame and Understanding Functionalities Part 1

    • 39. Working with DataFrame and Understanding Functionalities Part 2

    • 40. Working with DataFrame and Understanding Functionalities Part 3

    • 41. Working with DataFrame and Understanding Functionalities Part 4






About This Class

This is a comprehensive, hands-on course on Spark and Hadoop.

  • In this course we focus on Big Data and the open-source solutions around it.

  • We need these tools for the e-commerce end of Project CORE (Create your Own Recommendation Engine), a one-of-a-kind project for learning the technology end to end.

  • We will explore Hadoop, one of the prominent Big Data solutions.

  • We will look at the why and the how of Hadoop, its ecosystem, its architecture, and its basic inner workings, and we will also spin up our first Hadoop cluster in under 2 minutes on Google Cloud.

  • This course is used in Project CORE, a comprehensive hands-on technology project. In Project CORE you will learn more about building your own system with Big Data, Spark, Machine Learning, SAPUI5, Angular 4, D3.js, and SAP® HANA®.

  • With this course you will get a solid introduction to Apache Spark™, a fast and general engine for large-scale data processing.

  • Spark is used in Project CORE to manage big data on the HDFS file system; we store 1.5 million book records and implement a collaborative filtering algorithm on them.

  • Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, and R shells.

  • Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

  • Runs everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.

Meet Your Teacher


CS Viz


Hello, I'm CS.





1. Introduction to E-Commerce Data Load and Operation Setting Up Hadoop & Spark: Hello and welcome back. This is the next section of Project CORE, and in this section we are predominantly going to work on big data. This is the part where we go inside our e-commerce system and understand how big data first emerged, and what the first solutions were that met its biggest challenges. We will look deep into Hadoop, going into the Hadoop Distributed File System: what the architecture is, and how it provides enterprises with high availability and persistent storage. Then we will go into the current, more modern approach of implementing those big data solutions, which is to use cloud technology, in our case Google Cloud, because we already have a Google account. We also get a free tier there, where we can spin up the system and start our HDFS cluster with one master node and multiple slave nodes. This is a wonderful section which will lay a firm foundation for all the developers out there who might never have had a chance to do any administration activity, or to learn about the HDFS file system, which sits a little outside the SAP world, or the enterprise world generally. This will be a good section to show them one of the other options, and how the industry has been solving this big data challenge for the past ten years. We will also see how it evolved, and, when we try to use those services, what the right approach is: where you place most of your data, your transaction data and your historical data, which is going to be huge. Also, a lot of data in unstructured formats is now coming into the picture, and we will see how it is stored in your enterprise system. Mostly it is tagged as cold storage, where it is put because it is still important enterprise data.
But you also have to store that data without spending too much on your hardware, and still get high availability and persistence, which is basically backup, or what we loosely call backing up your data. In this section we will explore that, build a firm foundation in these big data applications and tools, and also explore the entire ecosystem that is built around Hadoop, which is mostly an introduction to the open-source world of big data solutions. By the end of this course you will have a pretty good understanding of what Hadoop is, why it is used, what challenges it solves, and how to spin up your own Hadoop cluster on GCP for free. So let's go ahead and learn the basics of big data and the distributed environment, and spin up our first Hadoop machine in the cloud. I will see you inside, in part one of Project CORE.

2. Data Explosion and Reduction in Storage Cost: Hello and welcome back. In this section we are going to start with Hadoop, but before moving ahead and seeing how Hadoop functions, we will explore the two main reasons, or causes, for something like Hadoop to come into the ecosystem. The first is storage. When we look at storage cost over the past few decades, what we see is that it has come down drastically. In 1980 you might have spent on the order of hundreds of thousands of US dollars just to buy one gigabyte of hard drive space. By 2016, one GB of hard disk space, the same storage with much better capability, robustness, and speed, was available to you for around ten cents. That is the drastic change which has occurred within those decades. The second change is the data generated: when we look at data from past years, and also map the trend out to 2020, we see that there is exponential growth in the data which will be generated in the future as well.
Now, most of the generated data is unstructured; structured data makes up only about 10 to 20% of the total volume of data generated. With these two trends together, what we have is a lower price for storage space and a higher demand for data storage, because we are generating so much data, and most of those data points are unstructured. For example, video files, audio files, and images mostly go into the unstructured category. When we talk about banking records, medical records, or user records, they are all structured data, because we store them in tables in a properly structured manner. So most of the data, like video and audio files, is unstructured, and that is what requires the most attention for storage. We therefore need a storage solution based on this less expensive hardware, the commodity hard disk, with the capability to store both unstructured and structured data within it. This requirement is what caused something like Hadoop to evolve. When we look at Hadoop, you will see that it is specialized to store unstructured data, and it basically uses commodity hard disks to store it. Now, there is one more problem here. When you go to an enterprise and you have data, there are a lot of prechecks which need to be performed before storage can happen. You should have backups; you should have the data redundantly stored in multiple systems for safety, so that if any one system goes down, another will still have the data; and you need the right amount of security around the data, making sure that only the right people can access it. These basic requirements must also be put in place if you are using a storage solution like Hadoop. Can our Hadoop system make the hard disk, which is now inexpensive, meet all those checkpoints?
If it is able to meet all these criteria and checkpoints, then enterprises will be willing to use Hadoop to lower their total cost of ownership of storage. This is the number one reason. In the next section we will elaborate on this point and also understand the other requirements of an enterprise which made Hadoop one of the popular choices for storage.

3. When Data is Referred as Big Data: Welcome back. In this section we are going to focus a little on the enterprise. Here we are working towards Project CORE, and Project CORE should be an enterprise-grade project. Now, when we talk about Hadoop, there are numerous use cases which you can utilize for your own purposes, but when it comes to big businesses, those businesses have a lot of dependencies, and a lot is at stake. For example, if you are a manufacturer of fast-moving consumer goods, like Procter & Gamble, then any mistake in a system which results in a downtime of maybe even 30 seconds might cost millions. That is the stake: keeping a system in place in an enterprise requires serious discussion of the architecture, its robustness, and the reason for introducing the system. Now, when we talk about big businesses, they will adopt a technology or a piece of software only if they see a ten-times return. In these few sections we will discuss the basic reasons, starting from a computer science aspect, and then we will show you how Hadoop provides a 10x improvement, which allowed it to be used in industry. First, let me quickly give you an overview of where Hadoop is used in current enterprise systems. When you are talking about Fortune 500 companies and enterprises of that level, they will be using systems like SAP, Oracle, or Sybase, and they will typically stick to one vendor such as SAP.
SAP is a dominant solution provider at that level, and we will basically focus on the use cases within that spectrum. Now, coming to the computer science aspect of the big data challenge: "big data" is a term coined fairly recently, and what it signifies is that any data which is difficult to operate on, or work with, using the classical approach of operation is termed big data. The way you recognize it is by the three V's, and we will talk briefly about each of them. The first V is velocity. You may have a system in place which is bringing in volumes of data very quickly. For example, take Facebook: Facebook has more than one billion users, who are uploading images simultaneously, texting, sending friend requests, and uploading videos. All those events are happening very quickly. In this scenario we say the data velocity is very high: you are getting a lot of data in a short time. Similarly, in the banking sector, if a bank has a lot of users, they will be doing a lot of transactions, and those transactions result in a huge amount of data flowing through the network, which you have to store and sometimes also process. That is velocity, the first V. The second is volume. Sometimes you have huge data to store; that data can run to gigabytes or petabytes, where a terabyte is roughly 10 to the power 3 gigabytes and a petabyte roughly 10 to the power 6 gigabytes. So it is going to take a lot of storage, and not only storage: you also have to process that data. This is the challenge: storing this data is itself a big headache, and then you have to process over it, which is another major challenge, and we consider data of this huge size to fall into the big data category as well. One more example is YouTube: on YouTube there are billions of videos present, and all those videos need to be stored somewhere.
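To put the "volume" V in concrete terms, here is a quick back-of-the-envelope calculation; the single-disk read speed is an illustrative figure, not from the course, but it shows why one machine cannot keep up with petabyte-scale data:

```python
# Decimal storage units: 1 TB ≈ 10^3 GB, 1 PB ≈ 10^6 GB.
GB = 1
TB = 10**3 * GB
PB = 10**6 * GB

# Reading 1 PB sequentially from a single disk at ~100 MB/s (illustrative):
read_seconds = (1 * PB * 1000) / 100          # PB -> MB, divided by MB/s
read_days = read_seconds / (60 * 60 * 24)
print(round(read_days))  # → 116 (roughly four months on one disk)
```

This is the arithmetic that motivates spreading both the storage and the reads across many machines, which is exactly what Hadoop does.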
And if you are running some analytics or searching through that data, then your search should actually go through all the videos, find the right match, and give you the result; and after you get the result, you should also be able to view that video, which might be any one of the billions of videos on YouTube. On our own channel we also have a lot of content, more than 20,000 videos on the platform, and sometimes we also call this big data in terms of the sheer volume of videos. Now, the third V stands for variety. Sometimes you generate data which is very different in nature. As an example, let's again take Facebook. On Facebook you have content like text, pictures, and videos; you also have events, like who liked your page, who gave a thumbs up, or who commented on your post; and also behavioral data about all the users, which Facebook can use in some form. If you have experience with Facebook advertising, Facebook takes advantage of user behavior to market to them. So this data has to be put in place for business usage, and you cannot really ignore it. The first stage is to store that data, and then you should be able to process it. When your data is large in nature, coming at you very fast, and has different types, formats, or structures, combining all three V's, it is practically impossible for a traditional database system to manage that kind of data and use it for your purposes. This is the place where something like Hadoop, which takes care of all those requirements, comes in.

4. Computer Science Behind Big Data Processing: In this section we will go into theoretical computer science to understand, from a theoretical aspect, why Hadoop works. According to computer science, if we have an algorithm and that algorithm is a single loop, then the nature of the code will be linear, order of n.
Now, if it is a loop inside a loop, the order of the code will be order of n squared, so the curve bends upward; and if it is a loop inside a loop inside a loop, it will be order of n cubed. If you look at the axes here, the Y axis is time and the X axis is the size of your data set: the more data, the more time required. And as you move along, at some point, even with a modest data set, you will be taking a lot of time. Now, what happens if we provide more memory to the computers doing the processing? If we use some of the more advanced algorithms, we can distribute the data and utilize that memory to operate on each subset individually. Algorithms like divide and conquer take this approach to problem solving: divide a big problem into multiple parts, operate on each part separately, and then combine the results and give them back to the user. So what this says is that if I want to perform the work faster, I have to increase my space complexity; "space" here stands for the memory in which the operation is carried out. In this scenario there is one more variable that comes into the picture: the work complexity. Work complexity is time complexity multiplied by space complexity: how much space you are using times how much time you are consuming. Any algorithm, even the world's best algorithm, has to follow this rule: if you want to reduce the time, then you have to increase the space, the memory you are utilizing. That is the basic trade-off. So what Hadoop has done, to reduce the time of processing the data, is to distribute the memory and expand it across a network. This is the basic paradigm on which Hadoop is designed; we will see the architecture, how it is done, what challenges come up in doing it, and how Hadoop solved them.

5. Hot and Cold Data: In the previous section
we came to understand time complexity and work complexity, and how work complexity stays roughly constant. Now, how is this relevant to the enterprise? What happens in an enterprise is that we have some data which has a lot of importance. This is data like the financial records of individual people: their master data, where their names are stored; their locations and addresses; which person transferred money to whom. All those sensitive records are present, along with usernames and passwords. A second kind of data might be important for the business, but not very important. In the first case, if data is crucial and important, it might also be required at the transaction level. For example, if I am doing a transaction and paying some money, then I should also know how much money I have in my bank account; that data is required at the transaction level, and when I am doing those transactions, I need to know where that data value is stored. The second category we are talking about is data that might be important, but not for transactions. For example, what was my response to an email campaign sent by the bank? The bank is trying to sell credit cards and sends every customer an offer; my response to that mail is a data point which is not useful in a transaction, but is useful in analytics, in understanding the behavior of users like me. A third type of data is something which might not be that useful day to day. For example, if I have a lot of systems in place, and those systems are generating logs, those log files tell us how the services are doing; the systems keep dumping data into these files every day, every hour, and every minute. This type of data might not be that crucial for the business; it is only crucial for analysis and understanding, if I run into a situation where my system or database has crashed.
And I am debugging to understand where the problem is. This doesn't happen often in the business, but still the data is useful in those types of situations or scenarios. So there are three categories in which data can be of importance to a business, and what should the business do? The business should take the most important data and put it somewhere it can access it fast; and because it is important and crucial, it will put it in a system that is robust and highly secure as well. The second type of data, which is useful but not very often, and the third category, which might be useful only in rare scenarios, can be put someplace that is not premium-priced storage. So what we saw here is that we have two broad categories of data: the first is what is hot for the business, and the second is what is not that hot, which we normally refer to as a cold data source. The common practice is to put the hot data in the ERP system's main memory, in the system that is actually going to process it; for example, where we are using SAP, we put that data into the SAP HANA system. For cold storage we often use Hadoop, because with lots of events happening and campaigns going on for the marketing team, that sort of data will be generated in huge quantities, and that kind of data will be stored in a system which is comparatively inexpensive in nature. So we use Hadoop to store our cold data source, and a system like SAP HANA to store our hot source of data. This is the existing architecture that many companies are following, and it lets them hit the right balance between the money spent on systems, storage, and infrastructure, and the demands of the data. So, to put it in a nutshell, you will be using a system like SAP HANA as the hot data source in your enterprise, and this data source will be working on OLTP, online transaction processing.
That is where the transactions are happening. And you use something like Hadoop to store your cold data source, which is data that is not used that much. Sometimes that data comes under online analytical processing as well, which is often referred to as OLAP, and OLAP is normally used for running analytics over a huge set of data. In the next section we will look more at Hadoop and how Hadoop stores data. Now, regarding analytics, we are using Hadoop, and we are using Spark on top of Hadoop; we will discuss more about Spark and why Spark had to be invented, but initially we will look into the architecture of Hadoop and how it is able to manage storing big data and processing over it. One more point about Hadoop: a prominent reason for using Hadoop is that it is open source, free of any licensing fees, and it has a big ecosystem which supports the development and enhancement of Hadoop. That is also a main reason why a lot of companies try to implement Hadoop and take advantage of it to store big data and also run analytics over it. Some systems also use online transaction processing with Spark; we will look into that at the point where we use Spark, spun up on top of Hadoop storage as well. So let's go to the next section, where we are going to see the architecture of Hadoop.

6. Hadoop Architecture: In this section we are going to see the Hadoop architecture. Why are we going to learn the Hadoop architecture? Because, as discussed in the last video, most of the time we will be using Hadoop for storing our cold data, which is not very frequently used in our businesses, and using systems which are expensive might not be a good way to go about solving the problem of big data and storage.
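The hot/cold routing described above can be sketched as a simple tiering policy. This is only an illustration of the idea; the threshold, field names, and tier labels are assumptions, not from the course, and real tiering policies are business-specific:

```python
from datetime import date, timedelta

# Illustrative cutoff: records touched in the last 30 days count as "hot".
HOT_WINDOW = timedelta(days=30)

def storage_tier(last_accessed: date, today: date) -> str:
    """Route frequently accessed records to the hot (in-memory) tier,
    everything else to cheap cold storage such as HDFS."""
    if today - last_accessed <= HOT_WINDOW:
        return "hot: in-memory store (e.g. SAP HANA)"
    return "cold: commodity storage (e.g. Hadoop/HDFS)"

print(storage_tier(date(2018, 1, 20), date(2018, 2, 1)))  # recent -> hot tier
print(storage_tier(date(2016, 5, 1), date(2018, 2, 1)))   # stale  -> cold tier
```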
Now let's see how the Hadoop cluster stores a large amount of data within it. The architecture makes high availability possible, makes it possible to scale up the entire data system that holds your cold data storage, and also brings inexpensive hardware into the picture, which brings the total cost of ownership of storage, and of operating on it, down to a very low level. So let's look at the architecture. In this diagram I am showing you one Hadoop cluster; we can have multiple Hadoop clusters arranged in this fashion. Your client will be accessing this switch, and the switch in turn will redirect our client's calls to the systems which store the data. Now, when we talk about all the systems connected to the switch, there are two distinct types of system. The first type is called the name node, and the other systems are called the data nodes, or the worker nodes. Why are they called name node and data node? Basically, we need to be able to access any one of the systems. For example, say this is system number one, this is system number two, this is system number three, and the client's data is stored in system number three. We need to understand which data is stored in which system, so we can fulfill the client's requirement when the corresponding call is made. So one node in this arrangement is dedicated to just understanding and storing which data is placed where, and that node is called the name node. This architecture is commonly referred to as master and slave architecture: the master node, the name node, decides and keeps account of where the data has been stored, and of which node you need to push your task or processing to. So your Hadoop system will first be utilized for storage, and it will also be utilized for computation on that storage.
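This division of labor follows the divide-and-conquer idea from lesson 4: spend more space (the memory of many machines) to cut time. As a toy sketch in plain Python, with `multiprocessing` standing in for the cluster's worker nodes and made-up data for illustration:

```python
from multiprocessing import Pool

def count_words(chunk):
    """Count word occurrences in one chunk of lines (work done on one 'node')."""
    counts = {}
    for line in chunk:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    """Combine the per-chunk results into one answer (the 'conquer' step)."""
    total = {}
    for part in partials:
        for word, n in part.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    lines = ["big data needs big storage", "big data needs processing"] * 2
    chunks = [lines[:2], lines[2:]]               # divide the input
    with Pool(2) as pool:                         # extra space: two workers...
        partials = pool.map(count_words, chunks)  # ...to cut wall-clock time
    print(merge(partials)["big"])                 # → 6
```

Hadoop's MapReduce applies the same shape at cluster scale, except the chunks already live on the data nodes, so the computation is shipped to the data instead of the other way around.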
Your worker nodes will be doing that computation: they will be running heavy programs on the data they hold, and it is the work of the master node to divide the computation among the worker nodes. In this scenario, though, we are mostly talking about storage. Storage is one of the key factors for which we are going to use the Hadoop system alongside our ERP, where we are using SAP HANA, so we will focus on that aspect. If you go into the computation side, doing heavy processing, there will be small differences in the architecture; our main focus when working with ERP companies is to store data and be able to retrieve it when required, while processing and heavy computation will mostly run on our Spark system and also on the SAP HANA system. We are using Hadoop specifically for cold storage of a large amount of data, and in this scenario we have one name node, and all the other slave nodes, the data nodes, are storing data. The name node keeps track of where the data points have been stored, and the switches allow us to redirect the incoming and outgoing traffic. In the next section we will see an example where a client is trying to push data: how the data actually goes into a buffer and gets stored, how the name node is updated, and how the data node is updated. Keep in mind that although this is an entire Hadoop cluster, we might be using a cloud service which internally takes care of all of that, and that is how most companies run now: they don't manage individual infrastructure independently. For example, if I am trying to create a Hadoop cluster to store the historical data of my ERP, which is not used in day-to-day activity, then I will be pushing that data out of my ERP system to this Hadoop cluster, and a service provider like Google or AWS will give me out-of-the-box capability and create the Hadoop cluster internally.
So I don't have to create the infrastructure individually, and those models are commonly referred to as infrastructure as a service. This type of model is what we are going to learn and implement in Project CORE, because we are not going to spin up individual hardware; that is not the way enterprises work nowadays. They run on cloud, and we should also be aware of how that is done. So let's go ahead to the next section and understand how those write and read operations happen within these Hadoop clusters of data storage.

7. Hadoop Cluster Data Operation: Welcome back. In the previous section we got an overview of the architecture of a Hadoop cluster and its data operations. The data operation is the primary thing for which we are going to use Hadoop in the enterprise as well; most of the processing will be done with Spark, which is on the order of ten times faster than Hadoop's MapReduce. We will not be doing much processing with Hadoop at this point; we will be using Hadoop mostly for storage of the cold data source. We also looked into the architecture, how the switches are in place: we have a core switch, and this core switch is in turn connected to other switches. Switch number one is connected to a lot of machines, and those machines have hard disks and also processing units. Out of all these nodes, one node will be serving as the name node, and the main work of a name node is to know the whereabouts of the other nodes. By "whereabouts" I mean that it knows exactly, for each node, how much space is left, where data has been written, and in which node a data operation has been going on. All the information on the location of data across these nodes is kept in the name node. This architecture, as we also saw in the previous section, is commonly referred to as master and slave architecture: the data nodes store the data, and the name node is the one which guides the operation of the entire architecture.
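Before walking through the write path, note that in practice you drive this storage from the cluster's master node with the `hdfs dfs` command line. A minimal sketch of calling it from Python; it assumes the Hadoop CLI is on the PATH (as it is on a Dataproc master node), and the file names and paths are made up for illustration:

```python
import subprocess

def hdfs_cmd(*args):
    """Build the argument vector for an `hdfs dfs` subcommand."""
    return ["hdfs", "dfs", *args]

def hdfs(*args):
    """Run an `hdfs dfs` subcommand on the master node and return its stdout."""
    result = subprocess.run(hdfs_cmd(*args), capture_output=True,
                            text=True, check=True)
    return result.stdout

# On the master node you would do, for example (hypothetical paths):
# hdfs("-mkdir", "-p", "/data")
# hdfs("-put", "books.csv", "/data/books.csv")
# print(hdfs("-ls", "/data"))
```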
In this section we will go through a simple example where a user is trying to write a file into our Hadoop cluster. The user is connected to this Hadoop cluster network; I am calling it a network because all the name nodes and data nodes are connected together with the help of a network and work like one big database or file system. The user first makes a request to the name node, and the name node has a file structure in place. This is a virtual file structure: it contains information about the folder structure of the entire Hadoop cluster. This virtual image is sometimes also referred to as the fsimage, and the fsimage is something you will find on the name node. A client will first be directed to our name node; the name node works out which locations the client can write to, and if our client wants to write a file and put some data in it, the name node will find a place in the entire network. For example, if it found a place here, then our name node will create a file entry here and return success to our user. Once the user gets success on creation of the file, nothing has been transferred yet at the data level. The transfer of the data is done with streams, and these streams of information have preplanned sizes. If I have a large file, the file will be divided into chunks of smaller packets, each of which is also called a block, and these blocks are written through an FS data stream. So an FS data stream will be created, and the stream has a fixed size; the common default setting is 128 megabytes. First, 128 megabytes of the data will be written into this FS data stream: the stream is filled with 128 megabytes of data, and that data is then written into a data node. So this one node will contain that 128-megabyte part of the file.
Once one data stream has been written into a data node, there will be a second call to our name node asking where it can store the second FS data stream. The name node can allocate the same data node which we previously used, or it can allocate a different data node to store the remaining data in our FS data stream. In this way, block by block, all the data will be copied into our data nodes, and our name node will contain all the information about what the data is and where it is stored. Now, one thing you may be wondering is how the name node keeps track of all of this. So this is the operation which is carried out to copy a file from your user into our Hadoop data cluster. In the next section, we will be talking about how our name node manages that architecture and how it provides the fault tolerance and replication which are required for enterprise data. Because sometimes, if you are storing data in a data node, what if that data node is in a remote location and some event occurs which results in the destruction of that data node or the server where the data is stored? So is some replication made, and if replication is made, how are those replicas made? We will cover that in the next section. 8. High Availability and Replication for Enterprise Part 1: Welcome back. In this section we are going to discuss how Hadoop provides high availability and how it takes care of enterprise needs. A big business has big requirements, and everything is costly: even a small downtime in a big business is costly. Imagine you are running a big retail chain where every minute about 1,000 people across geographies buy a product; if your system meets with some downtime, you will lose those customer purchases. That means you're losing money. So any enterprise system demands high availability and uptime. This also makes our architecture much more robust and fault tolerant.
The common norm in business is providing 99.999% uptime, which is also referred to as availability, and your data should also be replicated across multiple pieces of hardware, so a failure in any node does not result in loss of data. So how does Hadoop provide that? The simple way Hadoop guards against data loss is by keeping multiple replicas. For example, if I have data in this node, I can make a copy in this node, one more copy in another node, and I can also select this node and make a copy here; the name node, which is referred to as the NN, is responsible for replicating those copies. Now, how does the name node understand which nodes are up and which nodes are not? There is a concept of a heartbeat. Every node, at a periodic interval, sends a ping to our name node, and in that way our name node understands that they are part of this big Hadoop cluster. If some system fails to send a heartbeat to our name node, the name node understands that something bad happened to that system and that it is out of the Hadoop cluster. It will then replicate that data from one of the remaining replica sites to another node. Now, the replication factor is something you can configure. In this case, we are seeing that the data is replicated three times; if the data is a lot more crucial, you can replicate it more times as well, for example four or five, and if you want to save space, you can keep only two replicas. Now, sometimes a situation can arise where an entire rack is down. For example, this is a big rack, we have a switch allocated to this rack, and we have the computers and disk space inside the computers, and all the replication has been done here: my data replica one is here, my data replica two is here, and my data replica three is here. What if the entire rack goes down? Then all three replicas will be lost.
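The heartbeat idea can be pictured in a few lines of plain Python (the 10-second timeout and the node names are invented for illustration, not Hadoop's actual defaults):

```python
HEARTBEAT_TIMEOUT = 10  # seconds without a ping before a node is considered dead (made-up value)

def live_nodes(last_heartbeat, now):
    """Return the set of nodes whose last heartbeat is recent enough."""
    return {node for node, t in last_heartbeat.items() if now - t <= HEARTBEAT_TIMEOUT}

# The name node records the time of each data node's last ping.
last_ping = {"dn1": 100, "dn2": 95, "dn3": 88}
print(live_nodes(last_ping, now=102))  # dn3 missed its window, so its blocks get re-replicated
```

A node that drops out of the live set is exactly the trigger for the name node to copy that node's blocks from the surviving replicas to a healthy node.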
In that scenario, our name node should understand that all replicas should not be placed in a single rack, and that concept is referred to as rack awareness. All of this is managed by our name node, which is the master node of the entire Hadoop cluster. Now, one more thing we will look at in detail is availability: how Hadoop takes care of the availability we just discussed, where it has to provide about 99.999% uptime. What scenarios have to be handled for that to happen, we are going to look at in the next section. 9. High Availability and Replication for Enterprise Part 2: Now, when we look at our architecture, we see that there is a clear dependency upon the name node. The dependency on any single data node is removed by having multiple replicas, but how can the dependency on the name node be removed? What the name node has is an FS image, which is a virtual image of the entire set of data nodes of the whole Hadoop cluster, and also a log file. This log file contains all the information about what has happened in my Hadoop cluster. So, to stop the name node going down from taking our entire Hadoop cluster down with it, we have a standby name node. The standby name node is a node which is always prepared, has the exact same details stored, and is ready to take over whenever our main, or active, name node goes down; and we have ZooKeeper, which takes care of exactly that. When our main name node is down, ZooKeeper will understand that the name node is down, and it will automatically allow our standby name node to become the primary, or active, name node. So this is the way the name node dependency is removed. We also have a secondary name node. Now, one more situation can arise here: what if my name node goes for a restart? In that case, when our name node is starting again, it will have lost its entire FS image.
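Rack awareness can be sketched as a toy placement routine that never puts two replicas on the same rack (the node-to-rack mapping is invented for illustration; note that real HDFS's default policy is a little looser and does allow two replicas to share one remote rack):

```python
def place_replicas(racks, replication_factor):
    """Pick one node per rack until the replication factor is met.

    `racks` maps rack name -> list of data nodes in that rack.
    Returns fewer replicas than requested if there aren't enough racks.
    """
    placement = []
    for rack, nodes in racks.items():
        if len(placement) == replication_factor:
            break
        placement.append((rack, nodes[0]))  # naive: always take the first node in the rack
    return placement

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3"], "rack3": ["dn4", "dn5"]}
print(place_replicas(cluster, replication_factor=3))
```

With this placement, losing any one rack still leaves two live copies of the block.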
The FS image is a virtual image of the entire Hadoop file system architecture held within main memory, and on a restart it will be lost. One option, when our name node comes back up, is to rebuild that information from the log: if my name node goes down, my main memory will be wiped, so I lose the FS image, and what I can do is take my log, go through it, and according to that history recreate the FS image. But this process is time consuming, because our log files are updated at regular intervals and these files are also big in size, so going through the log and recreating the FS image is going to be a time-consuming task. For that scenario, we have a secondary name node always ready with the FS image. Whenever our active name node is restarted, the secondary name node will provide the FS image, which contains all the information about how the Hadoop cluster is structured as a file system. So this is how the Hadoop cluster provides our system a high level of availability and robustness built into the architecture. Now, from the next section onwards, we will be focusing and narrowing down our scope of discussion, because the architecture, the administration, and the development of Hadoop are three different roles. When we talk about architecture, mostly system admins will take care of that; as developers, we will mostly be working on coding, and some of the time we may have to manage the administration part, but the architecture and administration parts will normally be taken care of outside the development process. So from the next section, we will be focusing our scope on development, and we will see how we will use Hadoop in our project core and how Spark will work with it. We will start the discussion of Google Cloud Platform, Hadoop running on top of it, and also Spark. So let's catch up in the next section. 10.
GCP Dataproc and Modern Big Data Lifecycle: In this section, we are going to start our implementation phase. In the implementation phase, we will be looking only at the cloud provider, because gone are the days when you had to install Hadoop or Spark on your own machine. When you install Hadoop and Spark on your own local machine, or even if you have two or three computers set up in a small lab, you will mostly be working with one node acting as a master and one node acting as a slave, or at most two or three nodes acting as slaves. So that would be a very limited setup to experiment with. But with a modern, well-equipped cloud platform, you can basically set up that infrastructure in less than two minutes; it takes very little time to install and be up and running with your setup. What we have on Google Cloud Platform is Dataproc. Dataproc is an entire end-to-end big data lifecycle management solution which you will find on the cloud platform, and it provides you, out of the box, all the tools and plugins necessary over the life cycle of a big data project. This is the modern way of working on big data, and the corresponding AWS service is EMR. If you are using EMR, you will experience much the same service; there will be slight variations in cost and slight differences in the architecture, but for an end user the services are what Dataproc offers you. So let's go ahead and try to spin up our Hadoop cluster in less than two minutes. What I will try to create is a single master, and I will also create four nodes which will act as slaves. Mostly this will just be a demo of
how easy it is to spin up your servers in GCP, which is Google Cloud Platform. 11. Data Load into HDFS or Storage Bucket: Now, once you have your cloud platform, how do you basically use Hadoop? If you have your own data, you will buy storage: in AWS it is called S3 data storage, which is relatively low priced, so you may prefer that; in Google Cloud Platform you would buy the unified storage, which is Google Cloud Storage. Google Storage can stand in for your HDFS data storage, so out of the box you get high availability, and you can also coordinate your map and reduce activity to have your processing done over a large cluster or big data lake; those activities perform well too. The only drawback is that you cannot really go beyond a certain level of customization. For that, you spin up your own HDFS cluster within your cloud provider and load your data into HDFS. It can be AWS, or, if you prefer, GCP, Hortonworks, or Cloudera; with any of them, you upload your data to the HDFS system, and that will provide you with all the customization you want. Why is this necessary? If you are going for a large implementation of a data lake, HDFS will act as a cold data source where you store the data which is not accessed or processed very frequently. So this will be a cold data source: you spin up your HDFS system on whichever cloud provider you prefer, and this provides you all the capabilities and customization which you might not be able to get with plain object storage. The point is that, out of the box, GCP and AWS have taken care of high availability and of access to large, big data; but if you want to go more custom, you have to spin up your own HDFS services, load the data into HDFS, and operate over it. Now, here we are going to focus completely on one cloud vendor.
It is Google, which is gaining a lot of popularity nowadays, and a wave of migration has been occurring over the past few months. What we are seeing is that Google is winning customers with Dataproc by managing the entire big data life cycle for enterprise and non-enterprise applications. Now, up to this point we have mostly covered the theory aspects of Hadoop. You might be wondering when we will actually be able to do something in this sprint; we are mostly covering theory, but for big data a solid foundation is necessary, and there is a lot we could expand on. This course would really run long if we tried to go into the HDFS file system and show you each and every detail, because over the years so many different tools have come into this Hadoop ecosystem that covering them all is unimaginable. For now, what we have to keep in mind is that, as developers, we will be working with these cloud services, which have taken over and simplified the entire life cycle of big data management, and we are sticking to that. This is how new applications will be built, and it will be very simple: cloud platforms are what will predominantly be used. Let's go ahead and see in our Google Cloud console how to spin up your Dataproc service with Hadoop. So let's go ahead and do that. 12. Configuring and Running Hadoop in GCP with Dataproc: Welcome back. We are on our Cloud Dataproc page, and, as it is written here, it is a faster, easier, more cost-effective way to run Spark and Hadoop. Before anything else, if you want to go ahead and read some description about Dataproc, you can do that, and at the bottom you will find a pricing chart. The pricing chart tells you how much you are going to be charged for the CPU configuration you choose. Initially, we will be starting with our free credit, so when I click "try it for free", it redirects me to my billing page.
My billing page will basically show me how much credit I have left, how much I have been charged for previous storage or computation, and how many days of free credit remain if I have free credit in my account. We use the same free credit which we utilized for our SAP HANA installation and for loading the master data into the system. So let's go ahead and spin up Dataproc, because we do have a lot of balance left in our account, and you will experiment a lot over the course of the project core; this balance will be used for that. Now I will go to the dashboard, and here I have already pinned Dataproc. Dataproc is available somewhere towards the bottom, where you have the big data section with all the applications related to big data; I have pinned it so it's easier to find at the top. I will go to Clusters and start a new cluster. Getting up and running with Cloud Dataproc is very easy; it's just like installing an application on your machine if you're using Windows: you just have to understand what configuration you require, and it will be up and running. The name of the cluster which I will be giving is demo-dataproc. One thing you will notice here is that you cannot use any space or dot or any characters of that kind; you have to use lower-case characters, and a hyphen is the only special character they allow you to use within the name. If you want to know more about any of the options here, you can just hover over the question marks. Regarding the region, we prefer the region nearest to our geographic location, so I will select asia-southeast1, and for the zone, asia-southeast1-a. Depending upon the region and zone, we have a different pricing bucket; we saw with the SAP HANA system as well that when we select a US zone we get a lower price, and when we select the Southeast Asia region we are charged a little more.
But it is always advisable to select the region which is nearby, to have less network latency in getting your data. In big data applications this is pretty much necessary, because a lot of data will be travelling from the server to your machine if you're doing any kind of computation. If you keep everything within the region or zone itself, that case might not arise; but if you're transferring data from a local computer to the network or cloud you have set up, that will surely play a role in deciding the latency, and how fast the processing or the copying gets done. So for our simple example, or demo, we will be creating one master and n worker nodes. What basically happens is that the master node already comes with the YARN resource manager, and the worker nodes will have HDFS installed on them; the whole setup comes pre-built and pre-configured. We just have to select how much computation we need. Also, one limitation comes in here: if you are using the free version, you cannot use more than eight CPUs. In our case, I will just be using two CPUs, and that machine has 7.5 gigabytes of memory, and here I will make the primary disk size 32 gigabytes. I'm not going for high computation for the worker nodes; you can see what you will be getting inside each worker node, and the replication factor is 2 here. The replication factor is the number of copies that are made, as we saw in the previous section where we covered what a replication factor is and how it basically helps you to have high availability. Now, for the machine type, I'm going to select a very low-end machine here: just one CPU each, with 3.75 gigabytes of memory each. I can customize here if I want; you can also use up to 64 cores for the worker nodes, but for that you have to upgrade your account.
If I press on that, it will show me a pricing detail, which you can go through if you want to do that; the billing will be different for high-end requirements. You can go through it if you require that kind of infrastructure in your enterprise. For now, we will keep it to a very simple demo. What we are going to do now is make the primary disk here 32 gigabytes as well, and I'm going to select how many nodes I need. As in the previous section, where I said I would have four nodes, I can go for three worker nodes here. Basically it's asking me how many worker nodes I need, and I will only require three. I can also specify here the local SSD, which is a solid-state disk for faster processing; I will not go for any SSD here, I will just make it zero. As I updated the worker count to three, you can see that the YARN cores value is also updated to three; if I put five here, you can see the corresponding change in the YARN cores and YARN memory is made as well. YARN will be the resource manager for our use case. Just to show you Hadoop, I will keep it at three nodes and three YARN cores. If you want to see in a little more detail what is basically going to be given to you, you can press on the link here, and it will show you the image version. Also, if you want to initialize any more software, for example if you need a Jupyter notebook to also be present there, or to be launched when you run your machine or when your machine goes live, so that you can start or access it straight away, you can do that too. One more thing I would like to show you here is the image version: basically, it tells you what version of Hadoop is going to be given to you.
For example, if you press on "learn more" here, you will see the different versions of Apache Spark and Apache Hadoop which are present in the image given to you. So Dataproc version 1.2 comprises Spark 2.2, Apache Hadoop 2.8, Pig 0.16.0, Hive 2.1, and a lot of other versions which you can go through. Sometimes you don't want to go with the most up-to-date version; you want a version you already have, so you can migrate your data or your development. So you can also select an older version of Spark and Hadoop via the Dataproc version here, with the corresponding versions: as you can see, in Dataproc version 1.1 you have Hadoop 2.7.3, and in 1.0 you have Spark version 1.6.2. Any changes that come to Spark and Hadoop are also actively supported by Dataproc; the Dataproc team makes a new version available for you if there is a new version of Spark or Hadoop or any kind of major update happens in those ecosystems. So if you want to go for that, you can select the newer version of Dataproc and you will have access to it. What we are going to do is just keep the defaults for now; we are not going into detail or advanced configuration at this point. I just want to show you a simple cluster and how you will be able to see it in your dashboard. Let's try to create it. It will take two minutes at most to make everything available for you, and then you will be able to SSH in and work on your cluster. So let me pause the video for a couple of minutes and come back while your infrastructure is getting prepared. 13. SSH Inside the Master Node and HDFS Files System: So it took almost two minutes, or a little less, to get Dataproc ready for us. We have three worker nodes.
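For reference, the same cluster we just built through the console can also be created from the command line. This is a sketch only: the region, zone, machine types, and disk sizes mirror the demo configuration and may need adjusting for your account and gcloud version.

```shell
# Hypothetical CLI equivalent of the console steps above (requires the gcloud SDK
# and an authenticated project; values mirror the demo config).
gcloud dataproc clusters create demo-dataproc \
    --region asia-southeast1 \
    --zone asia-southeast1-a \
    --master-machine-type n1-standard-2 \
    --master-boot-disk-size 32GB \
    --num-workers 3 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 32GB
```

Scripting the creation this way is handy later, because you can delete the cluster after each session and recreate an identical one in about the same two minutes.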
Let me try to go into the details and see the CPU and memory utilization here. I can see all the VM instances: we have one master, and all the others are the worker nodes which we configured. I can also start my SSH session. If I press on this, a window opens; this window is basically an SSH session to the master, and it comes inside a browser window. So even if you are on a Windows machine and you don't have SSH tools like PuTTY, or you don't have a shell, you can use this simple pop-up window, which provides you out-of-the-box SSH to the system. As you can see, this is built with Debian Linux, so your Red Hat commands might not be useful here. The first thing we will do is go and look at the Hadoop file system. To do that, I will run `hadoop fs -ls /`. This shows me the file system of Hadoop, the file system Hadoop gives you as a virtual file system. Inside this file system we will store our files and structured and unstructured data; it is a virtual mapping of how that data is actually placed, kept in your master node's FS image. Now, what we will mostly be working with is Spark. We will put the data into Hadoop with a copy operation, and we will use the data within Spark; more about Spark we will learn in the next section. In this section, we will just check that we do have a pre-configured version of the Spark shell in this GCP node which we have spun up. I will use the command `spark-shell` to start our Spark shell, and you can also use `pyspark` to start the Python-based Spark shell. We will be doing most of this with a Jupyter notebook, so you can save the commands for later. As you can see, the Spark shell started in about 20 to 30 seconds. It also gives me access to a web UI portal, mentioning that if you want to access Spark over a web UI, you can go to this URL.
But if I go to this URL, I will not get Spark, because at this point I have not given any public IP to my Hadoop system, so I would not be able to reach Spark over the web UI. All these things we will be performing in the next section, where we will be working mostly on Spark. Spark is the technology we are going to explore more, compared to Hadoop, because all the processing for machine learning, and all the machine learning libraries we are going to use, are built with Spark. At this point, I'm going to come out of the Spark shell by pressing Ctrl+Z; as soon as I press Ctrl+Z, it brings me out of the Spark shell. In the Spark shell you can perform operations or write scripts in Scala, run Scala code, and you have the Spark environment exposed to you. You can also use pyspark, and pyspark will open a Spark shell where you can perform those activities with Python. We will be spending most of the time with pyspark, because Python is the language we learned in the previous section, and you will be using it to operate on Spark. Most of this activity will be done here; as you can see, I'm already inside my pyspark shell, and in this shell I will be able to use Python and the Spark libraries and context as well. More about this, and a lot of other Spark topics, we will cover in the next section of this sprint. This is the end of the first part of Sprint 3, where we have seen a lot of the theoretical side of big data and understood how Hadoop came into the picture, what the ecosystem is, and how Hadoop is used in modern days. Before closing this section, I would like to stop my instances in the cluster and remove everything, so I don't incur any cost.
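As a taste of what you will type inside the pyspark shell: on small data, an RDD's `map` and `collect` behave like Python's own `map` and `list`, so the sketch below mimics the pattern in plain Python (it runs without Spark), with the equivalent pyspark calls shown as comments.

```python
# Plain-Python stand-in for a tiny pyspark session; the commented lines show
# the equivalent calls at the pyspark prompt, where `sc` (the SparkContext)
# is already created for you.

data = [1, 2, 3, 4, 5]

# rdd = sc.parallelize(data)
# doubled = rdd.map(lambda x: x * 2).collect()
doubled = list(map(lambda x: x * 2, data))

print(doubled)  # [2, 4, 6, 8, 10]
```

The difference, covered in the coming sections, is that Spark spreads that `map` across the worker nodes instead of running it in one Python process.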
For that, I will first come out of the pyspark shell, again by pressing Ctrl+Z. I can close this command prompt, and I will then stop everything. I will go back to the main Clusters page, which will allow me to stop everything. Initially you may not see anything; after an auto-refresh you will see your Dataproc cluster, and I will delete it. It prompts me with a message, I press OK, and it starts to delete the cluster. What happens if you just stop your cluster is that you might still incur the storage cost if you have any storage within those systems, so it's always advisable to delete or terminate everything. What you normally do is terminate everything; then those memory and storage spaces will also be freed up, and you will not incur any cost while you're sleeping or overnight, because those bills can sometimes be really surprising. Now, one thing that is very good with Google is that it always gives you a notification in your email if you have any services running that have still not been terminated or stopped, and you will get those notifications at a constant interval. It's the same with AWS systems too: modern-day cloud systems have built-in capabilities which understand that some of these human errors are inevitable, so they take care of it with notifications at constant intervals. 14. Summary - Part 1: So this was a pretty fun section. We learned a lot about the Hadoop Distributed File System and saw the ecosystem of the entire Hadoop stack. Then, at a later point, we went into Google Cloud and saw how, in the modern day, you will be utilizing the capabilities of HDFS. In the coming few sections, for the remaining parts of Sprints 3 and 4, going deep into Spark, we will be using the cloud for it. We will be using the MLlib library, going back to the Jupyter notebook, and trying to learn how to use machine learning.
There will also be a big prerequisite course on machine learning coming, and this is how we are going to use big data, because this is exactly how you will be working on the project as well. So let's catch up in the next section, where we will be talking more about machine learning and Spark, and see why we switched from Hadoop to Spark and why we are not predominantly using HDFS's inbuilt processing capability, MapReduce; in the next section we will see that. So let's catch up in the next section, where we explore more of the Spark and machine learning aspects of our project core. Thank you for watching and being part of Sprint 3. 15. Introduction to Up and Running With Spark on GCP - Part 2: Hello and welcome back. We are going to start Part 2 of Sprint 3 of the project. This part of the project core will be the extended version of our big data learning from the previous session, Part 1, where we were introduced to big data, saw the HDFS file system, went through the architecture, the basics and the details of it, and saw how you can have your own HDFS file system on Google Cloud Platform. In this section, we are going to focus more on the processing part of big data. With the HDFS file system, or any storage tier, we are going to see a tool which is prominently used in all scenarios of data processing, and the name of the tool is Spark. Spark, as the name suggests, is a lightning-fast way to process your big data, sometimes with a 10x to 100x speed gain over traditional MapReduce. In this section, we are going to go into details on Google Cloud Platform, launch our own cluster, and see each individual part of Spark, which will make you a big data developer on Spark, because you need to be prepared for the next section, which is the machine learning part of the project.
You need to get a good understanding of Spark so you can create a machine learning model, which will be required for our e-commerce project. Complete this section properly, with the examples and all the exercises; we will be keeping this section 80 to 90% practical-oriented, with the 20% of theory that is required to understand the concepts. We will be covering in detail RDDs, all the transformations and actions, all the reduce actions, the architecture inside Spark, and DataFrames, which are the improved version of RDDs; along with DataFrames, we will also see how to access data from your Hadoop system in the Spark system, and all of this you can perform on your own cloud platform. This section has been perfected, and we promise that if you complete it you will be confident enough to work on a project with Spark. Once the basics give you a firm foundation, we launch towards our project for e-commerce machine learning. I hope that I will be able to give you that firm foundation. If you need to ask any question, you can contact me directly by email, or come to the chat in the Slack group of our network. I will be instructing this entire section, so join me and learn Spark hands-on, with practical examples on Google Cloud Platform, in detail. 16. SSH to Google VM Instance from Local Machine: Hello and welcome back. In this section, we are going to go into our Google Cloud Platform, and the first thing we are going to do is create an SSH tunnel. The SSH tunnel will provide us a secure connection to our GCP. Now, GCP already provides you a browser-based SSH, but most of the time, if your virtual machine instance has multiple users, some of those users might not have admin privileges. In those scenarios, you need to provide an SSH setup so that such a user can do basic Hadoop or Spark operations.
For that, they require terminal access, and in this section we are going to cover the RSA-based encryption methodology used to set that up. What happens in this scenario is that you create a public key and a private key on your own local system, and you share the public key with your Google Cloud Platform virtual machine instance. Once you have your public key inside GCP, you will be able to SSH to your GCP by using the private key which is stored on your local machine. So what happens here is: we create a pair of a public key and a private key, and we share the public key with GCP. The GCP instance is the virtual machine we are talking about, and when we log in from the terminal and use the terminal to reach our GCP, we will be using our private key. This is the basic authentication methodology followed for encryption, and it is what we are going to see in three steps. The first step will be the creation of the keys on your own local machine, the second step will be sharing the public key, and the third step will be logging into the virtual machine instance on GCP. So let's go ahead and create the keys on our own system. We come to the terminal, and in the terminal we will create the keys with a command I have already listed here, so we can copy the command itself and paste it. What we are doing is running `ssh-keygen -t rsa` (RSA is the type of encryption we are following), with `-f` and the folder where the key is going to be produced, which is the `.ssh` folder in your home directory, and the key name I will be creating is `gcp-core-new`. It will ask me for a passphrase to protect the key; I enter it, and my key has been produced. Everything ran successfully.
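Putting the three steps together as shell commands (the key name, `USER`, and `EXTERNAL_IP` are placeholders; `-N ""` creates the key without a passphrase for the sketch, whereas in the video a passphrase is entered interactively):

```shell
# Step 1: generate an RSA key pair on the local machine.
mkdir -p "$HOME/.ssh"
ssh-keygen -t rsa -f "$HOME/.ssh/gcp-core-new" -N ""   # -N "" skips the passphrase prompt
chmod 400 "$HOME/.ssh/gcp-core-new"                    # private key: readable by the owner only

# Step 2: show the public key, to be pasted into the VM's SSH Keys field.
cat "$HOME/.ssh/gcp-core-new.pub"

# Step 3: once the public key is on the VM, connect with the private key:
# ssh -i "$HOME/.ssh/gcp-core-new" USER@EXTERNAL_IP
```

The private key never leaves your machine; only the `.pub` half is shared with GCP.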
So what I will do now is navigate to the folder where all the keys are stored, .ssh, and I can do an ls. You can see that gcp-key-new is the name of the key, and there are two files produced here: gcp-key-new and gcp-key-new.pub. What I need to do now is share this key, the public key, with my Google Cloud. Let me open gcp-key-new.pub in the vi editor, and I can do a copy operation: I will copy the entire key. Then I go to my GCP instance; this is my VM instance. In the VM instance, what you need to do first: from the home page you can go directly to the VM instance by clicking the demo cluster, and inside, below, you will see the SSH keys. I have one SSH key here, and I can edit it. So we copy the key with the Ctrl+C command and go into the cloud VM instance; I go inside, the details of the VM instance come up, and I need to press Edit. I will then find the SSH keys; I can do a Ctrl+F just to search the entire page for "ssh". As you can see, this is the SSH key section, and I can add my own SSH key. I paste my key here, and it shows me the user name it belongs to. So I will save it. Once I save it, there is one more operation I need to perform here: I need to enable HTTP and HTTPS traffic. This is done to allow our Google Cloud console to host services which can be accessed via the HTTP or HTTPS protocols; for SSH, we will basically be using the terminal for entering the VM instance. So that has been done; it takes a few seconds to update. As you can see, once updated it is grayed out, because it saves and converts into read-only mode. You can see here that we have the key stored, and I have the user name as well, with which I need to log in.
The next thing: I will go back to my terminal and press Escape and :q just to quit viewing the public key. Now I need to change the permission of my private key. This is the private key, and I will give it owner-only read permission: chmod 400 gcp-key-new. This ensures that only the owner can read the private key, which ssh requires before it will use the key. The next thing I will do is try to log in to my virtual machine. The command for that is pretty simple; I am going to copy it and paste it in the terminal. Now, one mistake we made here at first: our key name is not gcp-key, but gcp-key-new. So we log in to our GCP with ssh: -i, then the key name, which is the private key name, and then the user name. Where do I get this user name? If you look at the VM instance details, this is the user name I can log in with using this key, so I will provide that user name, and then the IP of the system. Where can I get the IP? If you go into the dashboard, you can see the external IP; this is the IP we are talking about. So at the end we write username@IP, and once I do that and press Enter, it asks me for the passphrase of the key. Once I enter it, I go directly inside my GCP instance. I can see the version of the GCP console here: my GCP operating system is Debian GNU/Linux. I can also test my commands, like pyspark, to check that Spark is there, and try a command out. So this is the quick demo of how to install a public key in your GCP and establish an SSH tunnel. In the next few sections, we will be getting our GCP ready for the data processing part with Spark. 17. Processing With Spark: Hello, welcome back. In the previous section, we understood how to have a secure connection to your Google Cloud.
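Put together, the login sequence looks like this; the key name mirrors the one above, while USERNAME and EXTERNAL_IP are placeholders for the values shown in your own VM details.

```shell
# Make the private key readable by the owner only; ssh refuses to use
# a private key with looser permissions.
chmod 400 "$HOME/.ssh/gcp-key-new"

# Log in: -i selects the private key; USERNAME is the name shown next to
# the key in the VM's SSH-keys section, and EXTERNAL_IP comes from the
# Compute Engine dashboard.
ssh -i "$HOME/.ssh/gcp-key-new" USERNAME@EXTERNAL_IP
```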
Now, in this section, we are going to understand the next step forward. While ending the last section, we also saw that we do have Spark present in our GCP, and this Spark will be the next step forward. Now, why are we using Spark and not the built-in HDFS MapReduce programming model? To give you a brief background: if this is your enterprise, and your enterprise is storing all its data in the cloud, and this data is stored on HDFS storage, then why are we going for Spark and not the built-in Hadoop MapReduce library for processing? The answer is provided on the Spark homepage. Spark is an open source product of Apache and, as it says there, it provides lightning-fast cluster computing. If you compare Spark with classical MapReduce, it is many times faster: when the workload runs in memory, which is on RAM, Spark runs up to 100 times faster than Hadoop MapReduce, and when using disk, which is magnetic storage, a Spark program runs about 10 times faster than Hadoop MapReduce. So that is one of the key factors why we are going to learn Spark for processing data stored on HDFS or any other kind of storage. The second important reason is that it is much easier to use: if you have any experience with writing a MapReduce program on HDFS, you have to take care of a lot of boilerplate, but in Spark it is much easier, and you can use Python and Java as well for your own case. In that way you are able to code in a language which you already understand or know. The third reason is that when you are using Spark, you have out-of-the-box libraries like Spark SQL, MLlib, GraphX, and Spark Streaming.
Now, we will be focusing more on MLlib, and we are going to understand how a machine learning model can be implemented with Spark's built-in MLlib library; we are going to use Python for the coding purposes. Spark is very flexible, and when it comes to data crunching at a bigger scale, it is considered the go-to platform over HDFS storage in the open source world. The fourth reason is that Spark can run anywhere: it can run on Hadoop, on your cloud computer, on HBase and Cassandra, and it can utilize the storage of HDFS or your own local file system, or pull in data over other protocols. There is a lot of flexibility within Spark, and, most of all, the community around Spark development is huge. That is the main reason why development has happened so quickly in processing big data, analyzing the data fast, and making it simple to use. So we will be learning Spark in the coming few sections, and our objective in these sections will be to understand and get a good hold on Spark in terms of the development activity which we might be doing at work. Let's go ahead in the next section and go deep into Spark. 18. Spark Inner Working Part 1: Hello, and welcome back. In this section, we are going to understand how Spark works. With the amount of information you are bombarded with if you search this topic in Google or find videos on YouTube or any other platform, it is very difficult to understand what actually goes on inside, how everything gets connected, and how you can actually use Spark and start writing your own programs in the cloud with multiple worker nodes, or just on your own computer where you are experimenting with it. How soon can you understand this so you can work on a real project? In this section, we are going to cover this step by step in five steps. There are actually five steps, and if you get them properly, you will have a firm understanding of how Spark works internally.
So let's start with the first one. First, we are going to see how Spark is basically built, and how it lets you use your favorite programming language, whether Python or R, to develop an application and execute it on a cluster. That is the first part. In the second part, we are going to see a very important piece, which takes center stage and is also termed the heart of Spark: the SparkContext. It is like the CEO of your Spark environment; we will look into it. The third thing we will see is how an operation is carried out when you have a cluster of computers. Spark basically works with a master-slave architecture; again, this is similar to our Hadoop. In this master-slave architecture, what is the master program that drives all of the slaves to do the processing? Spark basically uses the MapReduce equivalent of divide and conquer, which we saw in the previous section. And what is the program that actually computes it, and who manages all this processing? We will look into that in the third part. In the fourth part, we will look at what an RDD is. If you are looking into Spark, or working on Spark, you will see a lot of discussion about RDDs, and RDDs have been improved with a better version now called DataFrames; we will look at DataFrames quickly. This is a basic video where we just touch the critical aspects of how Spark is actually put to work; we will go into detailed practical sections after these critical parts. So let's start with the first thing. The first point I will clarify is that Spark is built with Scala. Scala is a concise language well suited to big data, which was utilized to build Spark, and Scala uses the Java Virtual Machine to run. So, in a way, if you are a Java programmer, you can directly execute your programs on Spark, and those programs will be run by your JVM. Now, what is the advantage of the JVM? It can run on any machine.
For example, if you have a computer that runs Linux and a computer that runs Windows, and both of them are running the same version of the Java Virtual Machine, then your code will execute in both of those environments equally, and you will get the same result, because it is the JVM's responsibility to perform the operation of converting bytecode into machine-level instructions. The same concept applies to Spark: your Spark might be running on a Linux server, or your Spark might be running on a Windows server. In either scenario, you will basically be using a Java Virtual Machine, which takes care of the internal implementation of your code-to-bytecode conversion, and that makes Spark platform independent. Now, coming back to the point: what happens if you are using Python to write your program? There is a library called Py4J. Py4J's job is to bridge your Python program to the JVM, so that it can then be executed on Spark, and Spark will carry out the operation. This Java Virtual Machine will be installed on all of the nodes. For example, if you have a cluster, and the cluster has one master node and multiple slave nodes, all of them will have the Java Virtual Machine, and on each node of the cluster you will actually be running some part of your Spark code, depending upon what task has been assigned; the task assignment we will see in the third point. For now, just understand that your Java Virtual Machine environment will be all set up and ready on every machine, be it master or slave, and the virtual machine is what your code runs in when you are running your code on Spark. Now, the second part.
Now, once we are clear that we run everything on the Java Virtual Machine, let's go towards the SparkContext. SparkContext, sc in short form, is the heartbeat of a Spark environment. It can also be termed as something like a CEO which carries out all operations in a Spark environment. The SparkContext will be utilized a lot when we are working on our programs. Why is the SparkContext utilized? Its main role, internally, is that it provides you RDDs. The RDD is the data structure you will be using to store your data, and on that structure you will write your operations. We will look into RDDs in detail: how those operations give you a result, and how you can work on this RDD data structure to implement map and reduce as well. RDDs are the data structures on which we are going to implement our operations. The SparkContext also provides the connectivity to your execution nodes, or worker nodes, which actually perform those operations. It keeps track, runs things, and makes sure that your worker node is able to compute and give the result back to you. It takes care of all those operations internally; it is what is initially invoked, or called, to create our infrastructure, and then the operations are carried out on the data structure. We will see this in detail in part three. In part three, what we see is an architecture: this architecture is how a Spark program works. You have a driver program, and this program works with the SparkContext. The SparkContext will request the cluster manager to assign some worker nodes, and the cluster manager will assign some independent worker nodes to you. The worker node has an executor unit with task slots, memory, and caches; it is like an individual computer which has compute capability and also the capability to store. These are basically the two main things it requires.
Now, what happens when you are working with the SparkContext is that it will divide the job, with the help of the cluster manager, across all the worker nodes. All the worker nodes execute the tasks given to them and report back to the cluster manager, then back to the SparkContext, and you get the result with the help of the SparkContext. This is how a task is coordinated between the SparkContext and the cluster manager, and how it works internally. We will be spinning up a small Spark cluster on Google Cloud Platform, and we will mostly work with the SparkContext; internally, all of these operations will be carried out, and our SparkContext will act like an interface which takes care of all of this activity. We will get the final result by working only with our data structure. Now, having touched our inner structure, the RDD, let's see what an RDD basically is. RDD stands for Resilient Distributed Datasets. 19. Spark Inner Working Part 2: Now, what basically is an RDD, and why do we need to understand it to work on Spark? When we talk about RDDs, this is the data structure in which you will be storing your data in Spark. RDD stands for Resilient Distributed Datasets. For example, imagine you have a big text file, and the text file is loaded into your Spark system. The Spark system will use the RDD data structure to store that file across the distributed cluster. This is a logical partitioning: for example, it can happen that you are dividing the data into eight parts while, in reality, your data is stored on only four machines, which are four worker nodes. So this is a logical partitioning, or structuring, of the data inside Spark, and mostly we will be working on our RDDs and applying transformations and actions. RDDs by nature are immutable: once you create the data structure and load your data into it, you cannot change the data structure. What you can do is transform that data structure into a new data structure.
So if I draw a new RDD here, this is the new data structure which can be created by applying one of the transformation operations, and that is where we will be spending most of our time: transforming one structure into another. In reality, what happens when you apply those transformation operations from one data structure to another is that you are operating on top of your distributed data. This is how we perform operations on data in Spark. Now, DataFrames, which are an improved, special type of RDD, are much faster, and the main difference between an RDD and a DataFrame is that a DataFrame needs to have a schema. DataFrames are a two-dimensional representation of your data in a tabular format inside Spark. With DataFrames, you have a logical partitioning where you see the data in terms of tables, and because of that it is much easier to perform operations on them: I can use SQL with my DataFrames for my operations. One thing you need to understand is that DataFrames are internally a special type of RDD; they are not a different structure, but a special form of RDD which is faster and much more advanced. We will keep talking about RDDs, but we will implement our project with DataFrames. To give you a practical example: when we ran our machine learning algorithm on RDDs, it took five days to complete the entire model training, and when we used DataFrames for creating the model, it took almost five hours. The hardware configuration has a big impact, and so does the data set, but a drastic difference of 10 or even 100 times in processing speed is possible. Mostly this happens because the architecture of how the data structure communicates between its logical partitions is different. We will see that in detail when we look into the examples.
For now, what you have to understand is that the RDD is the data structure in which Spark stores its data, and RDDs are immutable in nature. The only way we operate on them is by applying transformation operations, and that, in turn, is how you perform operations on Spark and use the map and reduce capability with Spark. We will be using DataFrames, which are a special type of RDD. So this is, in a nutshell, how Spark works. From the next section, we will go into our Jupyter notebook and, one by one, see our SparkContext, our RDDs, and DataFrames in action. Mostly, when you experience any issue or error in your program, you will come across the Py4J library, because this library is responsible for bridging our Python program to our Java Virtual Machine program, and it will throw an exception if it is not able to understand some of the Python code we write. So let's go to the next section, where we start the practical side of Spark. 20. Running Spark on Google® Cloud Platform: Welcome back. Now we have got a basic understanding of what to do, and we also came to understand that when we are working on Spark, the RDD is the most important data structure we need to go through. So let's go ahead and do some practicals related to our RDDs. For that, we will need our cluster back; we deleted the cluster because we were not using it, and even if we had only stopped the cluster, we would still be charged the storage cost. So let's recreate the cluster. The first thing we will do is go into Dataproc here. If you are not able to find Dataproc, you can navigate from the main dashboard to it; Dataproc will be somewhere towards the bottom of the product catalog of Google Cloud, which is here.
I have already pinned it at the top, so I will go to Clusters and start a new cluster. Now, there will be a difference between the way we created the cluster previously and the way we are going to create it now: the only difference is that we are going to start this one with an initialization action. All the other steps will be the same. We will provide a name, cluster-spark-demo, and then we will select the asia-southeast1 region, zone a, and we want a standard setup with one master and worker nodes. We require one CPU; we are going with a minimalistic configuration here. I will make the primary disk size 35 gigabytes (the minimum is 10 gigabytes), and I will make the machine type one CPU as well. Navigating down, I need to change the workers to 35 gigabytes too, which is the minimalistic configuration for us, and we are going with two nodes. The change required here is that we are going to give an initialization action. What is an initialization action? Previously, I mentioned that we can provide an initialization action to start a particular program on the machine which we are going to spin up. Why are we doing it? We need a Jupyter notebook on our Spark master node, and we need the session to be running when we start the machine. So we need to give this path: gs://dataproc-initialization-actions/jupyter/jupyter.sh. It will start an instance of Jupyter notebook when the machine spins up. There are a lot of initialization actions similar to this one if you want to start other services; the list of all the initialization actions is provided in the Google Cloud Platform GitHub repository, dataproc-initialization-actions, and if you scroll down there, you get a list of all the modules which you can start or initialize while your machine is spinning up at the beginning.
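The manual configuration above can also be expressed as a single scripted command. This is a sketch using the gcloud CLI, mirroring the settings just described; the cluster name and sizes are the ones from the video, and flag names may vary across gcloud versions.

```shell
# Create the same minimal Dataproc cluster non-interactively:
# one 1-CPU master, two 1-CPU workers, 35 GB disks, and the Jupyter
# initialization action that starts the notebook on the master.
gcloud dataproc clusters create cluster-spark-demo \
    --region asia-southeast1 \
    --zone asia-southeast1-a \
    --master-machine-type n1-standard-1 \
    --master-boot-disk-size 35 \
    --num-workers 2 \
    --worker-machine-type n1-standard-1 \
    --worker-boot-disk-size 35 \
    --initialization-actions \
        gs://dataproc-initialization-actions/jupyter/jupyter.sh
```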
What I need for now is only the Jupyter notebook, and I will check that everything else is proper before creating the machine. We have two nodes, 35 gigabytes, one CPU core, a minimalistic configuration, and all the settings have been properly set, so I will press Create. Once it starts to create, it will take about two minutes; it is in the process of provisioning, creating the instance for us. One thing I also need to mention is that this process can be automated with the REST API behind it: there is a REST equivalent of the manual activity we did here, and if you call that REST API equivalent, it will do this activity for you automatically. That is needed for automating your tasks in a production environment, or even a growing enterprise environment: you might have multiple scripts doing this for you, a custom process, or a custom template where you specify the machine requirements, initialization actions, or any custom requirements. We will not go into the details of setting up the landscape for an enterprise, but the REST equivalent is already shown in the Google Cloud Platform console; once the machine is set up, I will be able to show you the code for that as well. Until the machine is up and running, which might be one to two minutes, we will pause this video, and continue once the machine is up and running for us. 21. Launching Jupyter Notebook in Spark: So now we have the machine up and running for us. We will go inside this machine and check the CPU utilization; there is not much information, because we just spun up the machine. You can also see the VM instances, where you will see how many worker nodes and master nodes you have: you have one master node, where you have SSH capability with the browser-based terminal. And, coming back to the REST equivalent which I was mentioning previously, I can press on the REST equivalent option here.
This is the code you can call via APIs, and it will create the equivalent setup of one master node and worker nodes for the configuration which we just specified. You can see the 35 gigabytes we just set and the zone of the instance we previously chose; this is the REST equivalent of the same manual activity we just did. So what we are going to do first is go into our cluster. We need to get our IP address so we can log into the Jupyter notebook and start our coding exercises. For that, we go into the VM instances. You can also see that this is the configuration we just performed, and one thing we did not mention previously is the Cloud Storage staging bucket: this is the place where your machine instance state is stored, and even when your machine is stopped, you will still be charged for storage depending on the size of your machine. Now let's go to our VM instances to get the IP, which I will find in Compute Engine: I go to VM Instances, and what it provides me is the external IP. I can put that IP in the browser and add the Hadoop port, 8088; this opens the Hadoop dashboard, which provides an overall summary of the entire Hadoop cluster, its nodes, and some of the metrics that are important for monitoring the Hadoop file system. In this section, though, we are interested in the Jupyter notebook, so I will go to port 8123, the default port at which the Jupyter notebook is provided. I can override this port, but 8123 is the default you need to type to get to the Jupyter notebook. So I can see my Jupyter notebook up and running on the public IP of the master, and I will create a new PySpark session: this PySpark session is the Python session of Spark. The first thing we will look at is the SparkContext, so I type sc.version and press Ctrl+Enter.
It provides me the version of the SparkContext which is already present within the PySpark Jupyter notebook console; it takes a few seconds. As you can see, the version of the SparkContext is 2.2.0, and it also tells us that the SparkContext is available within this Jupyter notebook session, so I can start working on my RDDs now. I will rename this notebook spark-and-rdd-learning. From the next section, we will go into the exercises on RDDs, understanding each piece of functionality that is important for us to use. 22. A Simple Introduction to RDD: So, in this section, I will be starting with some of the basic functionality of RDDs. But before that, I would like to share this document: this is the Spark and RDD document, which will help you navigate the commands, because sometimes it is difficult to catch a command from the video and type it into your own Jupyter notebook, so you can just copy and paste. It has the full set of commands we are going to use in our practical session. The first thing I will do is enable the auto-complete feature. What this provides is that, once I execute this command, I can type any variable, press dot, and it will offer auto-completion when I press Tab. This is very handy if you are working with a Jupyter notebook and you are new to Spark: you might not be aware of all the functionality, and this auto-complete is good for exploration. Even when you are coding, it saves you from typos and spelling mistakes, which are sometimes very difficult to find and debug. Okay, the next thing we will do is check our Python version, because sometimes the version of Python is 2 and sometimes it is 3; if we know which version we are getting, we will be able to write the syntax correctly. So I run !python --version.
As you can see, this is version 3. Some of you might ask: in the previous course, Learn Python Overnight, we covered Python 2, but here the version is 3. To answer that, I would say you do not have to panic, because the two versions are similar. There are slight differences in Python 3: some functions are written differently, some return values are different, and the way the two work under the hood is quite different, but as a developer, most of the functionality will be the same. Only some functions differ, which I will cover over the course of this development. This is an additional advantage, because you will also be getting exposed to Python 3, which is the next level of Python you will be programming in. The main difference we will encounter initially is the print function: print in Python 2.7 was a statement; in Python 3 it is actually a function. So in Python 3 we write print as a function call, with parentheses opened and closed. Let's execute it, and you get the desired result of print, which previously you would have written with just a space before the argument, because it was a statement. In Python 3 it is a function, so that is a major difference we will encounter in this section; we will pick up more differences as we meet them. Now let's go ahead with Spark, because that is the reason we spun up this distributed notebook. The first thing I would like to start with is a range, and that range will be a range of integers. I create a data variable and give it a range, saying: give me the range from 1 to 100. Previously, in Python 2, if I had to print this entire range, I would have done print data, and it would have printed a big list with all the numbers. Now, one mistake you might make if you are not familiar: that Python 2 syntax does not use parentheses.
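The print difference in a small sketch:

```python
data = "hello"

# Python 3: print is a function, so parentheses are required.
print(data)

# Python 2 also allowed the statement form `print data`;
# in Python 3 that line is a SyntaxError.
```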
So this is how you print in Python 3, with the parenthesis notation. As I was saying, when we previously used print with the range, it would give me a big list with all the numbers from 1 to 100, because that is the range in between; we had created a list. Now, in Python 3, if I want the equivalent command, I write something like list and then pass the data: list(data). This returns what was previously returned with just the print data command. So this is one difference you might see, and one more surprising difference is xrange: xrange is not present in Python 3; you have to use range for that as well. What is the difference between xrange and range? To cover this quickly: xrange created a range that was evaluated lazily, meaning a big range was not created instantly. If I create a range of integers from one to 10 million, it is not created instantly, but only when it is required; in that way, processing with xrange was much faster. In Python 3, there have been optimizations under the hood, the range function has been improved, and xrange's lazy-evaluation capability is already covered by range. So the point is that in Python 3 we only use range to create lists of numbers, and this is how we call the list function and pass our range. What we are going to do now is create a simple RDD, and we will partition this list. In the previous section, we got a basic understanding of RDDs, and we understood that an RDD stores data in the form of different partitions, so we will create different partitions. I use the SparkContext and use the function parallelize; I can also use Tab to complete it. I pass my list, and I tell it the number of partitions I want to divide this entire list of integers into.
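To recap the range behavior before moving on, here is a small sketch:

```python
# Python 3's range is evaluated lazily (like Python 2's xrange, which
# no longer exists); list() materializes the numbers when you need them.
data = range(1, 101)     # a lazy range object, not a list
numbers = list(data)     # materializes all 100 integers

print(numbers[:5])       # [1, 2, 3, 4, 5]
print(len(numbers))      # 100
```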
I will say I want to divide this entire list into eight parts, and I will store this into an RDD called range_rdd, and execute it. Now I can check the type of range_rdd, and you can see that it is of type PipelinedRDD; a PipelinedRDD is one kind of RDD. The next thing I would like to show you is some of the features present within this RDD. We now have this Spark data structure with us, and you can drill down further: with the auto-complete feature, I type range_rdd, then dot, then Tab, and it shows all the functions you can perform on this data structure. Initially, we will start with some basic functionality, because I want to give you a feel for how the important functions map and collect work, and how lazy evaluation actually takes place. We will see the coding part in this section and, to help you understand what happens under the hood in a much better way, I will also show you some diagrams on our sketch pad. So let's go ahead and do a basic operation. First I check how many partitions my RDD has, with the getNumPartitions function, and, as we specified, there are eight partitions: the list is divided into eight partitions. The next thing I will do is use a function called map. What map does is map each data element through a function, and within the parentheses I need to provide that function. So, before the map, I need to create it: a function that takes a value and returns whatever value it computes. To use map, I need to define a function which will actually be passed here. So let's define this function first; I will call it sub.
It will have one argument, and what I need to do inside the function is take the value, subtract one, and return the result. So this is what our function will look like. We have got a syntax error here: what we have not specified is a colon. We'll fix it, run it again, and try to use sub. If I say sub(10), it will give me nine; if I say sub(1), it will give me zero. So you get the point: we have a sub function which subtracts one from the integer passed to it. Now what we will do is map each of the integers through the sub function, using this RDD. I say rangeRDD.map and then I pass the function sub. It's that simple. Now, one thing which is also a property of RDDs is that RDDs are immutable in nature. So only if you store the result into a new RDD will it actually give us the result. I will show you in a few seconds how it behaves if we just use this original RDD and do the collect operation; collect is the next operation we're going to see, and we'll see what the result on the original RDD is if we just use the collect function. Okay, so once we have mapped it, what has happened? All the integer elements which are present in the partitions are mapped through this function: all of these integers go through the function and are stored in a new RDD, again with eight partitions. map maps it partition by partition into this new RDD. And what do we need to do to see all those elements within this new RDD, to show us what is present inside all of these partitions? On this new RDD, I will say newRDD.collect(). Once I run that, I will see all of the integer elements in the list, with the sub operation performed. Now, one thing you need to understand is that the operation only happens at line number 83.
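As a standalone sketch of what was just typed in the notebook (in the notebook, sub would be passed to rangeRDD.map(sub); here we test the function itself and model map on a plain list):

```python
# The driver-side function used with map: subtract one from each element.
def sub(value):
    return value - 1

print(sub(10))  # 9
print(sub(1))   # 0

# Applying it to a plain list models what map does element by element.
mapped = [sub(x) for x in range(1, 10)]
print(mapped)   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```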
There is no operation performed when we carried out the map, or when we created the RDD with parallelize, where the partitioning happens, or in any of the operations before. The operation actually happens in physical memory in the collect phase. This is lazy evaluation. The reason Spark does this is that in distributed cluster computing there are normally a lot of computers where your data is stored, and if you tried to carry out the work for each command as it was issued, there would be a lot of meaningless operations happening, which could create network congestion or would increase the overhead in time and memory required to execute. So at the end, when you have some trigger action, only then are all of these commands physically executed in memory. So this is the first example of a Spark RDD. What I will do now is take all of these commands into my sketch pad and help you visualize this, because now we have an understanding of what the output of each command will be. But let's try to visualize it, so you get a very clear understanding of what an RDD is, because this will be the building block and the foundation if you want to develop on Spark; it's all done with RDDs. So let's go ahead to the sketch pad and try to visualize what happened in each of these execution lines. 23. RDD Architecture in Spark: So in this section we are going to see a glimpse of what we did in the previous section with Jupyter, and how it actually took place under the hood. The first thing we did is create a list: we used the range function and provided the arguments 1 and 100, and it created a list for us within that range. Then we stored this list into our RDD and created eight partitions; you can count them, 1 2 3 4 5 6 7 8. So we had eight partitions in the RDD, created with the parallelize command. And how would it be physically stored? In our GCP cluster we have one master node and we have two worker nodes.
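The lazy-evaluation idea described above can be modelled in plain Python with a generator; this is only an analogy for Spark's behaviour, not Spark code:

```python
# Building the pipeline costs nothing; the work happens only when a
# result is demanded -- the role collect() plays for an RDD.
def build_pipeline(data):
    return (x - 1 for x in data)   # nothing is computed yet

pipeline = build_pipeline(range(1, 10))
result = list(pipeline)            # forcing the result triggers the work
print(result)                      # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```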
So for the master node: each system has primary memory, some disk space, and some cores for computation. In the worker nodes as well, you have primary memory, which is RAM, your disk space, the magnetic disk, and some processing units, and there are two worker nodes. Now, this RDD would be stored physically in the worker nodes, and in the main memory of each worker node we have partitions. So these eight partitions will be divided among the two worker nodes and stored there. One thing we also saw in the last section is lazy evaluation. Now, one question which comes to mind here is: once we create the RDD, is the storage automatically done as we show here? What I mean is, when we create the RDD, does the system store the data in this format, with four partitions in slave one and four partitions in slave two? Does it happen automatically? What do you think? The answer to this question is no. The partitioning is not automatically done the moment we declare the partitions of the RDD. This is how it will eventually be stored, but all of this action occurs only when we call collect, because Spark follows lazy evaluation: only an action which triggers some response or output results in all the previous operations being performed. So at this point, there is no storage done on the worker nodes. In the next section we will see how the lazy evaluation took place when we applied the map functionality and the collect functionality on this RDD. 24. RDD Transformation and Actions Map and Collect Respectively Part 1: Now let's look at the operation and what happened under the hood. We have our RDD, which is for now just a logical partitioning; there is no physical partitioning or storage done yet. We just have a list which we want to partition into eight parts. We applied map, and we stored the mapped partitions into a new RDD, because of what we previously saw about the nature of RDDs.
They are immutable, so you cannot change the RDD which is our initial RDD; you need to create a new RDD. We applied a transformation function called map and passed the sub function. This sub function simply does minus one on any value which is passed; for example, if I pass ten, it results in nine. This was the function we defined, and we passed this function into the map transformation. What actually results here is that each partition is mapped to a new partition, with this function applied. But in reality the operation is not carried out yet; it will be evaluated when we have an action operation called collect. Now, coming back to our new RDD: this new RDD also has partitions, exactly like our previous RDD. The only difference is that each element will be generated by applying the sub function to the previous RDD's elements. Now, once we apply the collect action to our new RDD, all of the operations which were previously pending are carried out. The first operation was storage: the RDD is created on the worker nodes, and the partitions are created. The second thing carried out was the map operation, and once the map operation was carried out, the sub function was called for each of the elements in each of these partitions. So 1 became 0, 2 became 1, and so on, and all of the partitions are created for this new RDD, and collect returned us the list. This list is all of the values in each of the partitions of this new RDD combined; it is, in sum, one list. Now, one question I would like to ask, if you have understood this process: what will happen if I use collect on the previous, old RDD, which doesn't have any map operation applied? What would be the result? In the next section I will actually do this operation to make you understand what the result will be. 25.
RDD Transformation and Actions Map and Collect Respectively Part 2: Now, the answer to this question is: when I apply collect on this RDD, I will get the list back, but the values will not have been passed through the sub function. So the list will be just the range I created, from 1 to 99. Let's verify this in our Jupyter notebook. I go to Google Chrome, and what I will basically do here is take the previous RDD, whose name is rdd, and apply rdd.collect(); here I'm just using the auto-complete feature with Tab. I apply the function, and let's see the result. It is 1, 2, ..., 99. This is the result because RDDs are immutable in nature, so I cannot change this RDD. I can create a new RDD with the map function, which is a transformation. We will be seeing more transformations which can be applied to RDDs, and how they will help us do map-reduce-style processing over clusters. But in this case, we used the transformation map, which resulted in a new RDD, and over that RDD we applied collect. If we just apply collect to the previous RDD, we see all the list elements in each partition, unchanged. In the next section we will see more functionality and functions related to RDDs. 26. RDD Transformation and Actions Filter and Collect Respectively: We saw the map functionality and collect, and we also saw how this functionality works under the hood. The next thing we will look into is filter. Now, what happened with map? With map, we specified one function to be executed for each of our data elements as they were transformed into a new RDD. With filter, what we will do is filter this data set, narrowing it down with the help of some condition.
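The immutability point above can be sketched in pure Python: mapping produces a new collection and leaves the source untouched (the names here are illustrative, not from the notebook):

```python
# A pure-Python sketch of RDD immutability: the transformed copy is a
# new collection; the original is unchanged, as collect() on the old
# RDD showed.
original = list(range(1, 100))
mapped = [v - 1 for v in original]
print(original[:3])  # the source still starts 1, 2, 3
print(mapped[:3])    # the new collection starts 0, 1, 2
```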
Imagine we have this big data set and we just want to filter out all the elements which are lower than 10. That gives the first values, from 1 to 9, I would say, because 10 is not less than 10; 10 is equal to 10. These nine values will be the resulting data set, and suppose the operation which we require only needs to be done over these values. That is a very common pattern in Spark: you have large data sets and you want to operate only on a filtered data set. So filter is the function utilized here, and it is commonly used to first filter out the relevant data from the large data set and then do the operation, because if you can reduce the number of operations, you in turn increase the performance of the overall execution. So the filter should always be applied first and then the operation, if you have any kind of filtering to be done. Let's see how this filter can be applied. If I take rangeRDD, which is the RDD I have in the form of a list of elements between 1 and 10, and I want to apply a filter operation on top of it, I would say rangeRDD.filter(), and here in the brackets I have to specify the function which does the filtering. So, as with map, where we created a function, we will first create the function. I executed the command, but we have not passed any function to the filter, so it gave a TypeError. So let's define a function called filter_10. We're going to have a condition: if value (we also need to specify here that we're passing an argument called value) is bigger than 10, then return False, with a capital F, and if the value is smaller than 10, then return True. So this is the function we will be using in the filter operation. Whenever it returns True, the value of that particular element will be kept in the new RDD; if it returns False, the element is removed. Let's first test out our filter function.
So what I would do is say filter_10 and pass, maybe, 100, and I will check if it's working properly. What I got is False. If I try to pass nine, I get True. If I pass ten, it is still True, because in the else condition I have written True: the condition is not satisfied, since 10 is not less than 10, so it goes into the else branch. So in this case 10 is also giving us True, and in the final result we will also have 10. You can also modify this by changing the comparison sign here, returning True here and changing to False here. By doing that, if I execute it, then 10 will give us a False value, and as you can see, 10 is now giving us False. So it is up to you how much you want to filter; the criterion depends on this function. Now, what I will do here is create a new filtered RDD. This is a transformation, and because RDDs are immutable in nature, the rangeRDD cannot be changed; we need to store the result into a new RDD. I will pass the function filter_10 within the argument of filter, and I will run it. Now, to see the result, I need to invoke the action function, which is collect; as you remember from the previous section, this is the step where all of these operations are physically performed, because in Spark we have lazy evaluation. Let's see the result, and you can see that the final result is 1 to 9, because we have filtered out, through this function, all the values from the list which are not required. Now imagine a case where filter and map will both be used: what should be the sequence of the operations? You will come to the conclusion that if you have a very large data set (it can be in the form of a list, or it can be in the form of dictionary elements), it is stored in many partitions across multiple machines in a cluster.
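Here is the filter_10 predicate from above as a standalone sketch (the final version, with the comparison flipped so 10 is excluded); on a plain list, the comprehension models rangeRDD.filter(filter_10).collect():

```python
# Keep only values strictly below 10.
def filter_10(value):
    if value >= 10:
        return False
    else:
        return True

print(filter_10(100))  # False
print(filter_10(9))    # True
print(filter_10(10))   # False, after flipping the comparison as discussed

# Models newFilterRDD = rangeRDD.filter(filter_10); newFilterRDD.collect()
print([v for v in range(1, 11) if filter_10(v)])  # [1, 2, ..., 9]
```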
So you have multiple machines which are worker nodes, and they are all storing the partitions of our RDD. Now, if we have two operations and I filter first, so that my records are only what I need for the operation, what I'm doing is reducing the volume of input which goes into the next operation. If we call the data entering the first operation "data one" and the data entering the second "data two", then data two will always be smaller than data one if you apply a proper filter operation first. So if the first operation performs some filter, you can then apply map or any other operation afterwards. What happens, basically, is that we are working with a smaller data set which is relevant, which increases the performance of our operation, because the data is always less. Because this is a cluster and each node is connected with every other node in a network, there will be some data communication happening when these operations are carried out. Each communication might result in network packets going from one machine to another; for example, if the nodes are A, B, C and D, there might be some communication between A and C, between B and A, and so on, based on what your operations are. In our scenario this communication will basically happen in the collect phase, because in the individual phase each worker node works on local data, but when you combine everything together, all of it is invoked. And if a large amount of data passes through the network, you will take more time to compute, and if you are hosting in a cloud system, you will incur more cost as well, because your operation will take more CPU; it will be more CPU-intensive. So when you're working with Spark, or any big data project, your optimization should be done at the operation level, because with RDDs you do not have an optimization planner like SQL has; with RDDs we don't have any built-in optimization strategy, so it has to be performed by the developer at the operation level.
Now, what you're seeing here is one of the challenges, or disadvantages, of using RDDs: there is no built-in optimization. You have to take care of the operations' performance at the development level, when you're coding, so you're not using the innate capability of the machine to optimize; you're basically relying on a manual effort by the developer. This is one of the reasons the DataFrame was created, which is a special form of RDD. We will look into DataFrames, and more about them, later. Now, why am I linking this to SQL? This is an interesting fact as well: with DataFrames, we will also be able to use SQL queries to extract from this large amount of data. For now, we will look into more operations on RDDs, and all along you will find limitations which gave rise to DataFrames. Today, most operations on Spark are done with DataFrames, because they are a much better version of data storage in Spark. And when you run operations in Python on a Spark machine, remember that Spark's core is written in Scala and runs on the JVM, so the JVM is the engine of Spark, and for our Python code to run on Spark it needs to be bridged through Py4J, which we already discussed previously. With DataFrames, this stack becomes much faster thanks to the optimizer; we will see DataFrames in detail and see scenarios where DataFrames outrun RDDs by a margin. So for now, let's stick to understanding more functionality of RDDs, and then we will pick up DataFrames. In the next section, let's continue discussing more functionality of RDDs and see what more operations you can perform on these data structures. 27. RDD Transformation and Actions Filter and Collect: Welcome back. In the previous section we saw the filter operation; for it we used the filter function, and we also created a function which filtered the values for us.
In a production scenario, we will be using lambda functions. In the Python prerequisite course we already covered lambda functions: they are anonymous, one-liner functions which allow us to write the entire function and the command within a single line. Let's see how it is actually done. I say newFilterRDD1 equals rangeRDD.filter(), and here, instead of passing the function, I will create a lambda function. The syntax is: lambda, then the condition, which is x is smaller than 10, the same condition as before. If the condition holds, it gives True, and if not, it gives False. This is exactly the same as what we defined in the filter_10 function, and it gives us the same result. Let's check it as well: if I look at newFilterRDD1.collect(), you will see that the result is the same as when we used the filter_10 function over the RDD values. Now, one thing we notice here is that the length of the code has significantly reduced, and somehow, once you get used to lambda functions, they are much more readable as well. Initially you might not be very comfortable; with lambda you might be confused about which variable is evaluated first, which one is getting the value, and what is returned. But over time, once you get used to lambda, you will try to do all your one-liner functions with lambda. Since it is a single-line function, we cannot include much processing or condition-checking inside it; you would need a nested lambda in that case. So now let's see some more operations related to our RDDs. These are commonly used functions which you will be utilizing when you operate on RDDs. The first is taking the first element out of the RDD; for example, if you see this list, the first element is one.
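The lambda form of the filter just described, as a standalone sketch; on a plain list, the comprehension models rangeRDD.filter(lambda x: x < 10).collect():

```python
# The same predicate as filter_10, written as an anonymous one-liner.
is_small = lambda x: x < 10

print(is_small(9))    # True
print(is_small(10))   # False
print([v for v in range(1, 11) if is_small(v)])  # [1, 2, ..., 9]
```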
So if I execute the function first(), it returns the first element to me. One more function: if I want to take a few elements out, I can call the function take; take(4) basically takes the first four elements, or five elements, or any number of values, and gives me that number of elements. Now, one thing to notice: if I pass 100 and my RDD only has ten elements, or nine elements, then I will only get back the number of elements which are actually present in my RDD. I can also do some sorting operations on the elements of the RDD. For that I will use newRDD1.takeOrdered (I can use Tab just to check that we have a function like that) and I will say 3. What it gives me is the entire list sorted into increasing, which is ascending, order, returning the first three elements. I can also do the opposite of this operation, which is top(3): the top three elements, by descending order, as if I sorted and took the first elements from the top. So these are a few operations which do sorting. One more thing we can do: if I pass one more argument to reverse this, I can write something like lambda x: -x. What basically happens is that whatever number I'm getting is considered in reversed order, because I'm just making the number negative: negative one is a higher value than negative nine. So I will get the three largest numbers first out of this operation. This is one way to perform that, where a lambda function is passed which temporarily operates on the numbers while the comparison, or the ordering, is being done. 28. RDD Transformation and Actions Collect and Reduce: Welcome back. In this section, we are going to see an operation which is one of the most important operations we are going to use.
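Before moving on to reduce, here is a pure-Python sketch of the retrieval operations just shown; the RDD methods of the same names behave like this on a small data set:

```python
# first / take / takeOrdered / top semantics, modelled on a plain list.
data = list(range(1, 10))                   # 1 .. 9
print(data[0])                              # first()        -> 1
print(data[:4])                             # take(4)        -> [1, 2, 3, 4]
print(sorted(data)[:3])                     # takeOrdered(3) -> [1, 2, 3]
print(sorted(data, reverse=True)[:3])       # top(3)         -> [9, 8, 7]
print(sorted(data, key=lambda x: -x)[:3])   # takeOrdered(3, key=lambda x: -x)
```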
Now, what does this operation do? When you do collect, you get the entire set of values returned. But what if you want to do an operation over all the values and return a single value out of them? Imagine I need to add all of these values and return one single value. How will I do that? If I use the function reduce, it requires a function describing how you want to reduce the entire data sequence. Initially, I will just use a simple add operation with a lambda function: I will say a and b, and do the operation a + b. The result is 45. One thing you need to be careful about when you are using reduce: only operations which are commutative and associative in nature should be applied with reduce. For example, if I do b + a, it will have the same result. But if I try a subtraction operation, a - b, you get -43, and if I do b - a, you get a result of 5. So I cannot apply the reduce operation with subtraction, because it's not commutative in nature. Before going further, let's understand what the commutative and associative properties are, because you may need a small math refresher. For those of you who might need a refresher on what the associative and commutative properties are: for example, say I have variables a, b and c, and the operator +, which does addition. If I have the sequence a + b + c, then adding a and b together first and then adding the result to c is equal to doing the operation a + (b + c), where b + c is added first and then the result is added to a. In both of these cases, the final result is going to be the same. This is the nature of the addition operation. Also, in the case of commutativity, a + b is equal to b + a. In the case of subtraction: subtraction is neither commutative nor associative. For example, a - b is not equal to b - a.
Also, if I do the operation a - b - c: subtracting b - c first and then subtracting that from a is not equal to doing the subtraction a - b first and then subtracting c from the result. This is why operators like subtraction should not be applied with the reduce operation. So, whenever you apply reduce, you get a consistent result only if the operator has both the associative property and the commutative property. Now let's understand how the reduction actually works. For example, here we have our list of elements. How the reduction works is: it takes the first value, applies the reduction to the first two values, and the answer might be r1-2. This value is reduced with the third value and you get r1-3. Then r1-3 is used with the fourth value and you get r1-4; then r1-4 is used with the fifth value, which results in r1-5, and similarly you get all the way to r1-9, and this is how your entire reduction operation takes place internally. Reduce is one of the important commands, or operations, because it gives one single value out: it aggregates a lot of values into one value, and this value can be processed further with other operations, or this value itself can be the result you are looking for. Reduce is used a lot, and we are going to see different versions of reduce as well, when we discuss reduceByKey and groupByKey. Now let's go to the next section, where we're going to see advanced transformations and actions. 29. RDD Advance Transformation and Actions Flatmaps: Welcome back. In the previous section we saw the basic functionality related to RDDs. In this section, we are going to look into some of the advanced functionality related to RDDs. Let's recall the actions and operations we were performing.
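To recap the reduce behaviour from the last section in runnable form, functools.reduce models RDD.reduce on a single partition, and it shows why the operator must be commutative and associative:

```python
from functools import reduce

data = list(range(1, 10))                   # 1 .. 9
print(reduce(lambda a, b: a + b, data))     # 45
print(reduce(lambda a, b: b + a, data))     # 45: addition commutes
print(reduce(lambda a, b: a - b, data))     # -43
print(reduce(lambda a, b: b - a, data))     # 5: subtraction does not
```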
We had one RDD, and we applied the transformation map, and this map created one more RDD. The first scenario is what we saw in the previous few sections: we have values in the RDD, and each was transformed into a different value; each value was passed through a function, and that function was just subtracting one from each value. So 1' was basically 1 - 1, which is 0, and 2' was basically 2 - 1, which is 1. That was the resulting value. Now, what can also happen is that you want to keep track of indexes, or you want to keep track of a key-value pair. For example, take the value 1; it can also be any value, an employee record, employee details, details about some material, about some files, or information about a catalogue or a product. In the key-value pair you have a key and you have a value. In this case we have 'a' as the key, and this key is a unique identification, and 1 is the value. So, for example, if you have multiple products in your shop, each product will have an ID, and your key might be that ID, or maybe a serial number which is uniquely identifiable. Now, this key and value is stored in a tuple. Basically, this is a tuple, and tuples are not changeable, so you cannot really change the value of ('a', 1): tuples are immutable in nature, so you cannot change a value within the tuple. So the key-value pair is maintained. Now, what we can do is apply a transformation on the value of the tuple, so the new RDD will have a new tuple, where the key is the same and the value is what has changed. Now, in this scenario, if we try to access the value with the key, we can do that with the new RDD; and in this case, if I just need to know what the value is based on the key, I can do that as well.
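The tuple immutability point above can be sketched in plain Python (the pair shown is the ('a', 1) example from the lecture):

```python
# A (key, value) pair is a tuple; it cannot be edited in place, so a
# transformation builds a new tuple with the same key.
pair = ("a", 1)
try:
    pair[1] = 2                      # not allowed: tuples are immutable
except TypeError as e:
    print("tuples are immutable:", e)

new_pair = (pair[0], pair[1] + 1)    # new tuple, same key
print(new_pair)                      # ('a', 2)
```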
So this is one scenario we will look into, where we have the mapping of an RDD: we will see the transformation of an RDD with the map functionality and key-value pairs. We're going to use the map functionality for it, which we saw previously. Now, one more variety can also exist: you have the values, and you have the result, the transformed values, after applying the transformation function, and in this scenario we want the original and transformed values in the form of a list. The values would not be in key-value pairs, but in a simple list, with all the values clubbed together. This scenario is achieved with a function called flatMap. So you can see the difference between what happens if you apply flatMap and what happens if you apply map: with map, you get the key-value pairs, if the result is a key-value pair, but with flatMap, you get the contents of the key-value pairs in a flat structure, where everything is equally in a value format. So we will look first into map with key-value pairs, and second into flatMap. This is the first type of advanced RDD usage we are going to look into, so let's go into our Jupyter notebook and start to see map with key-value pairs, and then flatMap. 30. RDD Advance Transformation and Actions FlatMaps Example: So let's see, initially, a simple example of producing tuples with map, with the values transformed, using our RDD. We create a simple list, say [1, 2, 3, 4], and what I will do next is create an RDD, rdd1; I will say sparkContext.parallelize and pass the list. Here I will use the name list_values, because list is a built-in keyword and I should not use a keyword as a variable name, so I rename it to list_values. I pass these values through the parallelize function to create the RDD, and I define how many partitions I need; I create four partitions here.
And the next thing I will do is use the map function. I'll say rdd2 equals rdd1.map, apply a lambda function, and in the lambda function I will create a tuple which will be returned from the lambda. I will return the value itself, which is going to act as the key, and I will have some operation done on the value; I'm just going to add one, as a demo operation to show. So each of the values will go through this lambda function, and we will have a key of the value and a value of value + 1. Let's try to run it. Now, I can also show you a good function here: I can count how many entries are present in my RDD. I can use count(), and there are four entries, because the number of items present in this list is four. Then I run rdd2.collect(), the action operation, and this shows me the tuples, the key-value pairs, such as (2, 3), where the first one is the key and the second one is the value which has been produced by our map function. Now, what we will try to do is use flatMap here. Imagine you use flatMap here: what will the result be? Let's try to use flatMap; one thing we need to do is make the M capital, since this is camel-case syntax. Now, if you see the count, it will be eight, because the number of entries in the final list is the flat values: the sequence of values of the key-value pairs, without the tuples. Let's look at the collect, and you can see that we have 1, 2, 2, 3, and so on. So now, in the final values, we have the keys and the plus-one values, both in a single list. To give you a side-by-side comparison, I can also have the map version in the next line, so you are able to see both; I just run it in a single cell, and as you can see, we have the side-by-side comparison of flatMap and map. So in one you have tuples, and in the other
you have all the values combined into one list. So we have 1 once, 2 twice, 3 twice, 4 twice, and 5 once. This is the difference between map and flatMap. Now, the next thing we are going to see will be some of the most important transformation functions, which are groupByKey and reduceByKey, so let's go ahead and see the next advanced RDD functions. 31. RDD Advance Transformation and Actions groupByKey and reduceByKey Basics: Welcome back. In this section, we are going to see some of the most important reduction operations which you will be performing, and those are reduceByKey and groupByKey. Now, in the previous section, we saw the key-value pair. Consider a case where we have three partitions, and in all three partitions there are two keys which are prominently being used, with different values. For example, 'a' is a key and 'b' is a key, and we have some key-value pairs stored in each of the different partitions. In the first partition, I have the pair ('a', 1), where 'a' is the key and 1 is the value, and a second pair where 'b' is the key and 1 is the value. In the second partition, I have ('a', 1), ('a', 1), ('b', 1), ('b', 1): the same key-value scenario. And in the third partition, I have six entries, where ('a', 1) is present three times and ('b', 1) is present three times. Now we need to do a reduction based on the key. For example, for the 'a' pairs I'm just doing a sum operation here, which is basically your addition operation, which will be performed on the values; let's keep the operation very basic, so we understand what it is. So what I need to do is find all of the keys, and for the corresponding values I need to do a sum. This is the operation we're going to perform, and here is the way we're going to do it.
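Before we work through reduceByKey, here is a pure-Python recap of the map versus flatMap difference from the previous section:

```python
# map: one output tuple per input element.
list_values = [1, 2, 3, 4]
mapped = [(x, x + 1) for x in list_values]
print(len(mapped))  # 4, as count() reported
print(mapped)       # [(1, 2), (2, 3), (3, 4), (4, 5)]

# flatMap: the same results, but flattened into a single list of values.
flat = [item for x in list_values for item in (x, x + 1)]
print(len(flat))    # 8
print(flat)         # [1, 2, 2, 3, 3, 4, 4, 5]
```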
First, we are going to reduce it by key, and then we will also look at a different method where we group it by key. Now, how is the operation fundamentally carried out? With reduceByKey, at the individual level of each partition there will first be some local reduction happening. For example, you can see that the first partition has only one pair per key: 'a' has only the single tuple ('a', 1), and 'b' likewise has only ('b', 1), so there is no reduction happening in that partition. In partition two, we have two pairs for 'a' and two for 'b', and since we are doing a summation operation, the local result is ('a', 2) and ('b', 2). This is what will be carried forward and used in the next stage. In the third partition, we have three value pairs for 'a' and three for 'b', so after the summation operation it holds ('a', 3) and ('b', 3). What happens next is that, based on the key, we club the results together from each partition. In one partition I take all of the 'a' results, and in the other partition I take all of the 'b' results. From the first partition I got ('a', 1), from the second partition ('a', 2), and from the third partition ('a', 3); 'a' is the common key between all of these values, and on those values I perform the summation operation. In the other partition you have all the 'b' values — ('b', 1), ('b', 2), ('b', 3) — and I do the corresponding summation. Now we have the final result: we are able to reduce down to one final value for the 'a' key and one for the 'b' key. The final value for key 'a' is six, and the final value for key 'b' is six. So in the reduction operation we got one value corresponding to each key — that is what a reduction gives you: one single value per key, and since we started with two unique keys, the final reduction results in two unique values.
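The two reduceByKey stages walked through above can be modeled in plain Python — stage 1 reduces inside each partition, stage 2 merges the small per-partition results by key. This is an illustration only (`local_reduce` and `reduce_by_key` are made-up helpers; Spark does all of this internally):

```python
# Plain-Python model of reduceByKey's two stages on the lesson's three partitions.

partitions = [
    [("a", 1), ("b", 1)],                      # partition 1
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],  # partition 2
    [("a", 1)] * 3 + [("b", 1)] * 3,           # partition 3
]

def local_reduce(pairs, fn):
    # combine values that share a key, within one partition
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc

def reduce_by_key(parts, fn):
    per_partition = [local_reduce(p, fn) for p in parts]
    # only these condensed results travel "over the network"
    final = {}
    for part in per_partition:
        for k, v in part.items():
            final[k] = fn(final[k], v) if k in final else v
    return final

print(reduce_by_key(partitions, lambda a, b: a + b))  # {'a': 6, 'b': 6}
```

Per partition the locals are {a:1, b:1}, {a:2, b:2}, and {a:3, b:3}, exactly as in the walkthrough, and the merge step yields 6 for each key.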
Now, in a similar case, if we look at groupByKey, we are also doing a reduction operation, but the way we do it is to first combine and bring together all of the pairs for a key into one partition. The data is shuffled across the network — you can already see the yellow arrows here, and I'm drawing them again in red so you get the idea that I will be getting all the 'a' pairs into one data-node partition from all the existing partitions. These are all logical partitions: they can be on one single machine or on multiple different machines; this is handled internally by Spark. So the network will be used to carry data between partitions one to three. Spark works out how many keys there are, and based on the key it sends all pairs with the same key to one network partition, shuffling them over the network. This is the shuffle operation: it moves all of the pairs with the same key to one machine, while a second machine gets the other keys; if you have many key-value pairs, multiple shuffle partitions are created in between. Now, what difference do we see in the final result? Once we have all the key-value pairs for 'a' in one place, I do the addition operation just once and get the final result ('a', 6). I don't have to perform the addition partition by partition; I gather all the information and do the operation in a single go, and the same for the 'b' case, also in a single go. Now, when you have data in your Spark storage and you use key-value pairs to store it, the key will typically be a small identifier, while the value can be a big file or a big chunk of data stored across multiple machines. So imagine you have large data flowing through the network. Here we have only three partitions, but you can have a lot of partitions in real-life scenarios.
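The groupByKey path can be sketched the same way in plain Python: every raw (key, value) pair is shuffled to its key's bucket first, and the sum is applied once per key at the end. This is an illustration (`group_by_key` is a made-up helper; Spark performs the shuffle internally), but it makes the cost visible — every raw value travels, not a condensed result:

```python
# Plain-Python model of groupByKey: shuffle all raw pairs first, reduce once.
from collections import defaultdict

partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
    [("a", 1)] * 3 + [("b", 1)] * 3,
]

def group_by_key(parts):
    shuffled = defaultdict(list)
    for part in parts:
        for k, v in part:          # every single value crosses the "network"
            shuffled[k].append(v)
    return dict(shuffled)

grouped = group_by_key(partitions)
print(len(grouped["a"]))                          # 6 raw values moved for 'a'
print({k: sum(vs) for k, vs in grouped.items()})  # {'a': 6, 'b': 6}
```

Compare this with the reduceByKey sketch: the final answer is identical, but here six values per key are shuffled instead of three partial sums.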
You might have 64 partitions, or 100, or around that range, and the data traverses the network to one node. Similarly, you won't have only two keys: in our scenario we have only two keys, but you might have 100 keys, or 1,000, or more, and based on that a lot of shuffling happens across the entire network — all the pairs for one key come to one partition, all the pairs for another key go to a second partition, then a third, and so on. So in a scaled-up scenario, where keys are numerous and values are big, groupByKey is not a good approach, because it results in too much network traffic and congestion within the Spark cluster, and sometimes that results in out-of-memory errors. When we're using Spark and this kind of error is triggered, Spark tries to handle it gracefully, but it still will not be able to process the information properly: you may not see a hard crash, but the part of the job that ran out of memory will not have done its work, and the final result will not be correct. Coming back to reduceByKey: when you have a lot of keys and the values are big chunks of data, you perform the operation at the individual partition level first, so each partition hands over a condensed version of its data, and this results in much less network traffic. Therefore reduceByKey is better when you have a scaled-up scenario where keys are numerous and values are bigger. In a smaller scenario, however, what we end up doing is performing the operation multiple times: I performed the operation on each individual node and also on the final node. Here I performed the addition operation five times in total, whereas with groupByKey I performed the addition operation only two times.
So, in retrospect, what should we be using? If we're working with key-value pairs and reducing, the final reduction will always give one value per key. But depending on the data size — if the data is big and there are a lot of key-value pairs involved — we should prefer reduceByKey. If the data is small in nature and you're more concerned with performing the operation once at the final node, then groupByKey is a reasonable way to go. This is one of the most important parts of Spark reductions, which you will encounter a lot when working on projects. There are also other variations of reduction you can use instead of groupByKey, such as combineByKey and foldByKey, but these two are the popular choices, picked based on the nature of the data and the operation you have. So let's see in action how we actually perform reduceByKey and groupByKey on RDDs. 32. RDD Advance Transformation and Actions groupByKey and reduceByKey Example: So let's go ahead and see how groupByKey and reduceByKey work. First, we are going to create a pair RDD of key-value pairs. To create the pair RDD, I take the Spark context and call parallelize, passing a list of tuples: 'a' is the key in the first tuple, 'b' is the key in the second tuple, and I add one more tuple with 'a' as the key and 2 as the value. The second thing we will do is apply groupByKey. The final result will be that all the 'a' entries are combined together and the 'b' entries are combined together — we have only one 'b' here; we're keeping the example really small. The way to apply groupByKey, which we will see first, is to simply call the function on the RDD, and after the groupByKey I will chain a map function; this map function tells Spark what final operation is to be performed.
If you look back at our sketch, in the groupByKey path we have a final operation performed at the end. We need to specify it here, and what we're going to do is a lambda operation for the summation. So I write a lambda and give it the pair (k, v) — the tuple — and I return k, the key, together with the sum of all the values v. The next thing I do is perform a collect operation, so in a single line I'm doing both the transformation and the action. But this will throw an error, and I will tell you the reason. One thing you will notice in Python 3 is that you cannot unpack multiple values in the lambda's parameter list — you can only take a single parameter. For example, here I cannot write the key and value as a tuple parameter the way we used to in Python 2; I receive the key-value pair as one argument and access it with an index: index 0 is the key and index 1 is the value. So let's try to run it. Now I'm missing one bracket here — I had added one extra bracket; let's remove that and run it. Now what it says is that 'kv' is not defined. 33. RDD Advance Transformation and Actions groupByKey and reduceByKey Example (2): In this case, what we're trying to return is a tuple, so I need to put back the bracket as well, which we just deleted. I added the bracket, we ran it, and we got the final result. The final result is correct, because it actually combined all of the 'a' values — giving 3, by adding the values of all the pairs where 'a' is the key — and for 'b' we have 1. So this is the final result, which is correct; we have confirmed it. The next thing we're going to see is a different version of groupByKey: this time we are going to use mapValues instead of map.
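The Python 3 error described above can be sketched without Spark. After groupByKey, each element is conceptually a (key, values) pair; Python 2 allowed `lambda (k, v): (k, sum(v))`, but Python 3 removed tuple-parameter unpacking (PEP 3113), so the lambda receives one argument and you index into it. The `grouped` list below is a plain-Python stand-in for the RDD:

```python
# Python 3 lambdas take the pair as ONE argument; index into it.
grouped = [("a", [1, 2]), ("b", [1])]   # stand-in for groupByKey's output

# Python 2 style (a SyntaxError in Python 3):  lambda (k, v): (k, sum(v))
summed = list(map(lambda kv: (kv[0], sum(kv[1])), grouped))
print(summed)  # [('a', 3), ('b', 1)]
```

This reproduces the lesson's result: 'a' sums to 3 and 'b' stays at 1.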
mapValues is used to iterate only over the values, and in its lambda function we don't need to bother with or mention the keys — we only say what we're going to do with the values, and summation is the operation we're choosing. Let's check that the brackets are closed properly: this bracket belongs to mapValues, so it takes all the values and the sum function is performed over them. I remove the previous index-based access and run it, and it results in the same output, which is the correct output. Now let's look at the other kind of reduction, which is reduceByKey. In this scenario I copy the same command from above and replace groupByKey with reduceByKey, and to it I'm going to pass a lambda function. One difference you will notice is that with reduceByKey you apply the operation at the individual partition level: reduceByKey takes as its argument the function to be performed on each independent partition while the reduction operation takes place. So I specify that the values should be summed up, tidy the expression, and close the brackets. Once we do that, let's try to run it — most probably we will encounter an error now. One thing I want to showcase here is how to find out what an error means. When you work on your Spark programs, you will go through a lot of errors, and sometimes reading them is a painful activity: there is a lot of information and you cannot figure out where to start. The first part is your stack trace, and you should always start looking from the Py4J error — the Py4JJavaError — because that is where you find out what actually went wrong. It shows that an error occurred while calling the collect.
You can see the type of the error: it says 'lambda takes 1 positional argument but 2 were given'. That means that at reduction time, at the individual partition level — if you go back to our diagram — your lambda function gets two arguments, not one. This is one way you can think while writing your own Spark code: when you get an error, you can picture what the lambda function is actually receiving. So the lambda function gets two arguments, and what we need to do is perform the sum operation on those two arguments, because we're reducing by key. Let's run it again — and now we're getting the right result. In this exercise we also saw how to work through a Spark error, understand what is wrong, and fix it, because when you're working on the operations we're performing in these videos, you might encounter a lot of errors of various types; you can then read the stack trace and the error type and fix them as well. There is also one more way to express your reduction: for that we need to import the add operator. We write 'from operator import add' — we're getting the add function from the operator library. Then we copy the command and pass the add function instead of the lambda, and it gives the same result. OK, so this is how you use groupByKey and reduceByKey in Spark, and we also saw how to go about fixing an error when you encounter one. In the next section I will discuss a few of the remaining parts of RDDs, mostly related to storage and caching, and also how you should write your syntax when writing programs for production. After that, we'll go into DataFrames. So let's catch up in the next section, where we finish the RDD part. 34.
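The two-argument requirement can be demonstrated with the standard library: reduceByKey's function always receives two values at a time and folds a key's values pairwise, which is exactly the shape `functools.reduce` expects, and `operator.add` is a ready-made two-argument function:

```python
# Why reduceByKey's lambda needs TWO parameters, shown with functools.reduce.
from functools import reduce
from operator import add

values_for_a = [1, 2]                      # values collected for key 'a'

merged = reduce(lambda a, b: a + b, values_for_a)  # two args folded pairwise
print(merged)  # 3

# operator.add is the same two-argument function, ready-made:
assert reduce(add, values_for_a) == merged
```

A one-argument lambda here would fail with the same "takes 1 positional argument but 2 were given" message the lesson walks through.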
RDD Caching, Persistence and Summary: Welcome back. In this section we are going to look into some of the remaining parts of RDDs which you might find useful. The first thing we will discuss is the caching of RDDs. Spark can also cache your RDD in memory — let me show you what I mean. I take the previous scenario, where we have the pair RDD — let me just confirm we still have access to it; yes, we do. If I use the function getStorageLevel and print the output — this is Python 3, so we need to enclose it in parentheses, and the 'b' is lowercase — you see it shows 'Serialized 1x Replicated'. Now I take the pair RDD and call cache on it, and if we run getStorageLevel again on the next line, you can see it now reads 'Memory Serialized 1x Replicated'. So your RDD will now also be stored in memory as a backup: if something goes wrong and Spark is not able to recover the RDD, you will not lose any data. Most of the time when Spark crashes, it crashes gracefully, and operations going out of memory are the primary reason for Spark to crash. Next: if I want to make it non-cached again, so it is not stored in memory — so we don't have additional space allocated for it, or additional replication in memory — and free up the cache, we can use the function unpersist. Once I call unpersist, if I print the storage level, I can see that my RDD is back to no memory replication. The next thing we are going to look into is how to write proper code for your production environment. Let's take the example of a function we have written in the past, in the rough way; in a production scenario, you should be separating each function onto a new line.
So if I have groupByKey, I bring it onto its own line, then mapValues onto the next line, and collect — which is the action — onto the next line. So we have the transformations and finally the action, and the RDD itself is produced by sc.parallelize; we mention the data there, and we wrap the entire command in parentheses. I'll add a simple map function here and bring in the data from the previous section, where we did groupByKey, and let's try to run it. We have a syntax error — this is a result of Python 3 — so I change it to kv[0] here and kv[1] there, and we get the same final result. This is the way to make your code much cleaner and more readable, and it is the practice you should follow if you are writing code at production level, which might be delivered to a customer or shipped for commercial purposes. The final thing we're going to discuss is the RDD summary. To summarize RDDs: there are basically two types of operations on an RDD. One is the transformation and the other is the action. A transformation converts an RDD into a different RDD, and this new RDD is the one with the transformation applied; the original RDD is not changed — it cannot be altered once created, so RDDs are immutable in nature. The functions that perform transformations are executed only when an action is applied, because Spark uses lazy evaluation: a transformation only physically takes place in memory once you apply an action. We also saw some of the actions, like count, collect, and reduce, and we saw the reductions reduceByKey and groupByKey. So that was RDDs. In the next section we are going to see DataFrames. Now the big question is: why would we use DataFrames?
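The lazy-evaluation behavior summarized above can be modeled with a Python generator: the "transformation" does no work until an "action" consumes it, analogous to how nothing runs in Spark until collect or count. This is a conceptual sketch, not Spark code:

```python
# Lazy evaluation modeled with a generator: work happens only at the "action".
log = []

def transform(xs):
    for x in xs:
        log.append(x)            # records when work actually happens
        yield (x, x + 1)

pipeline = transform([1, 2, 3])  # "transformation" declared: nothing ran yet
assert log == []                 # still lazy — no element was touched

result = list(pipeline)          # the "action" forces execution
print(result)  # [(1, 2), (2, 3), (3, 4)]
print(log)     # [1, 2, 3]
```

The empty `log` before the `list(...)` call is the whole point: declaring the pipeline costs nothing; consuming it triggers the computation.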
This question will be answered in the first part of the DataFrames material, so let's catch up in the next section, where we understand why we use DataFrames and what is different and better about them. To give you just one heads-up: a DataFrame is a special type of RDD where we also specify a schema. Why the schema makes things much faster and better than a plain RDD, we will see in the next section. 35. Why Dataframe and Basics of Dataframe: Hello and welcome back. We are now going to start on DataFrames. Before that, we need to understand why DataFrames came into the picture. When we talk about RDDs: RDDs work on partitions, so you can see there are a lot of partitions here, and all of these partitions are stored in your Spark cluster. The problem is that Spark has no insight into what is stored in those partitions, so when Spark tries to run any action, it cannot do optimization — and that is a reason why your RDD operations are a little slow. Now, what happens if we provide this data with a schema? If we provide a schema along with the data, Spark can use the schema to understand how the data is laid out and can execute optimizations derived from it. That is how we got DataFrames. DataFrames are an improved version of RDDs, sometimes also referred to as schema RDDs. So what we need in addition to our data is a schema. The schema maps our data into a tabular structure, and this is how DataFrames look in our system: they are tabular in nature, with a schema, and this allows us to run SQL queries against the DataFrames. This is one of the most important reasons why we go with DataFrames on projects: we are able to use our existing understanding and knowledge of SQL to extract data from DataFrame structures.
There is also one more structure, the Dataset. Datasets are an evolved, typed version of DataFrames, but we are not going to discuss them much, because most of our project core development will be done with DataFrames, which give very high performance. We will focus our attention on DataFrames because, in terms of speed, they are the fastest of the Spark structures we will use in this course. So let's go ahead and understand what functions and functionality DataFrames have and how to work with them in Spark. 36. Installing Faker to Generate Random Data for Examples: In the previous section we learned about DataFrames, and we're going to start with the DataFrame exercises now. Before that, we need one library to create some random data for us to operate on. For that reason, we are going to install Faker. It's very easy to install: you go to your terminal window or your notebook file and run pip, and it installs Faker for you. When you are in the cloud, inside a Jupyter notebook, add an exclamation mark before the pip command so it is executed as a terminal command. Let's run it — as I have already installed the Faker library, it does not install again. Now let's look at what Faker provides. Faker gives us a lot of fake data, and it has a lot of properties as well. Let's begin using the Faker library. The first thing we need to do is the import: from faker we import Factory, and we call the create function of the Factory, so we have access to a generator of random data. Let's try to create some random data — whatever number of iterations we specify, that many records will be created, so let's create 100 random records.
If I try to print some of the random data, I can iterate using this fake object: I write 'for _ in range(0, 300)' — actually, let's make it ten, or it will fill up a lot of the screen — and print fake.name(); since this is Python 3, I need parentheses on the print. You can see some random names being created, so the Faker factory can be used to create a lot of random data. You can get more information about the different types of data you can generate under its providers. Here I'm interested in records of people working in an organization, but you can also create random barcode data, banking data, and credit-card-style data if you have a use case to test. Faker is a very popular library when you want to quickly test your code against generated random data, so have a look at it if you're interested, at faker.readthedocs.io. Now let's go back to our DataFrames, where we will operate on this newly generated fake, random data. 37. Creating Random User Data With Faker and Creating Dataframe: Welcome back. Now that we have this Faker factory to generate random data, the first thing we're going to do is create a table with a schema of last name, first name, a unique ID, occupation, and age, and we're going to generate 100 records with the Faker library. Let's write the Python code. The first thing I will make is a function which returns one fake record — I'll call it fake_entry — and I'm going to call it multiple times. Inside it I write name = fake.name() and split it; I return name[1] and name[0], then fake.ssn(), which creates a fake ID that is unique because it is a social security number, and then we find a job for the person with fake.job().
That is all random data. To find the age, we take the current year, create a random datetime with Faker, and subtract that date's year from the current year, plus one — so we are, in a way, creating a random age for each person. That's all we need. Next, we need to call this function repeatedly, so I write a helper that takes a count, times, and a function as its arguments: 'for _ in range(times)', so however many times we specify, fake_entry is created. It's a bit of automation — we don't want to create all these records manually. So there is one more function where you provide how many times to call fake_entry, i.e. how many records you need, and inside it we say yield, because we're calling the passed-in function and yielding its result. Then let's call our repeat function: we build the list and say repeat 100 times, passing our fake_entry as the function argument, and run it. There is one mistake: 'times' is misspelled — a small typo — so fix it and run again. Now we can see the data; we can print it and confirm that 100 records have been created. Let's try to map this data into a DataFrame. How do we do that? First I import from pyspark.sql — 'from pyspark.sql import Row' — the Row functionality, which we will use a lot. The next step, the main step of creating a DataFrame, is to use the SQLContext that sits on top of the Spark context. How do we get it? We just refer to sqlContext — it is already available to you; you can simply evaluate it and it will tell you what it is.
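For readers following along without the Faker library installed, the fake_entry/repeat pattern above can be reproduced with the standard library alone. This is a stand-in that only mirrors the record shape (last name, first name, ssn, occupation, age); the sample names and jobs below are made up, and Faker itself produces far more realistic values:

```python
# Stdlib-only stand-in for the Faker-based fake_entry/repeat pattern.
import random

NAMES = [("Smith", "John"), ("Doe", "Jane"), ("Khan", "Ali")]
JOBS = ["Engineer", "Teacher", "Nurse", "Chef"]

def fake_entry():
    last, first = random.choice(NAMES)
    # same layout as a US social security number, purely random digits
    ssn = "%03d-%02d-%04d" % (random.randint(1, 999),
                              random.randint(1, 99),
                              random.randint(1, 9999))
    age = random.randint(1, 90)
    return (last, first, ssn, random.choice(JOBS), age)

def repeat(times, func):
    # lazily call func `times` times, like the helper in the lesson
    for _ in range(times):
        yield func()

data = list(repeat(100, fake_entry))
print(len(data))  # 100
```

The resulting `data` list has the same five-field tuples the lesson feeds into createDataFrame.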
It is of type pyspark.sql.SQLContext. Let's see what we can do with it — it has a lot of functions. First we need to create a DataFrame, so we use the createDataFrame function, passing the data out of which we want the DataFrame made — it's basically a list; our data variable contains a list — and then passing the schema describing how the data is laid out. The schema names the structure within the data: first we have last name, then first name; after that the SSN, the unique social security number; then occupation; and finally age. These are all random values. Let's run it. We have a DataFrame created — one thing we missed is that we did not assign it to any variable, so let's execute it again, and now we have data_df holding the DataFrame. We can also hit tab and browse all the functionality we have on a DataFrame, but first I would like to show you the schema: with printSchema you can see the schema we provided to the DataFrame coming back. Let's try some simple SQL-style queries after that. 38. Working with DataFrame and Understanding Functionalities Part 1: What we just saw was the schema; now let's try some SQL-style operations on the DataFrame. Let me demonstrate a simple one: I write data_df.select('*'). This gives me a new DataFrame, and I can use the show function on it — I want to show 10 records of this new DataFrame. What basically happens is that DataFrames are immutable in nature, so you cannot change a DataFrame; you store the result in a new DataFrame. I will also mention a where condition: using the where function, I ask for the rows where age is less than 10 — we need all the records where age is less than 10. Let's execute it.
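What createDataFrame(data, schema) conceptually does with that schema list is pair every positional field with a column name. That pairing can be modeled with zip and plain dicts — no Spark required; the two sample records below are made up for illustration:

```python
# How a schema list names the positional fields of each record (model only).
schema = ["last_name", "first_name", "ssn", "occupation", "age"]
data = [
    ("Smith", "John", "123-45-6789", "Engineer", 34),
    ("Doe",   "Jane", "987-65-4321", "Teacher",   8),
]

rows = [dict(zip(schema, record)) for record in data]
print(rows[1]["first_name"], rows[1]["age"])  # Jane 8
```

Once fields have names like this, column-based queries (select, where on age) become possible — which is exactly what the schema buys the DataFrame.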
What we get is a new DataFrame containing all of the records that satisfy the condition. Let's try to show the first 50 records of it: you can see there are quite a few — fewer than 50; it's about 20 records — but all of them have an age less than 10, so it really did filter on the condition where age is less than 10. Now let's see how to apply a filter operation on our DataFrame. We take the select as before, and we write a simple filter operation by just writing filter here, and we use a new DataFrame to store the filtered data. Previously, with RDDs, we would use a lambda function: if the element is less than ten, give me the result. But in this scenario, as you can see, we do not have a list — we have a tabular structure — so I reference the column of each row as df.age here, and df.age there as well. If I run this naively, I will get an error, and then we will write the proper syntax. The right way to build this filter operation is to first have a user-defined function. How do we create one? We use pyspark.sql — within it we have a lot of functionality — and we first import udf from the functions module; we also need to import BooleanType from the types module. Why do we need it? Because when we create this user-defined function, we must declare that the return type is going to be Boolean in nature; we call BooleanType so the UDF machinery understands that the user-defined function's return type is a Boolean. So let's create the function: I'll call it less_than_10_udf, and I pass a lambda function — the one I copied from above and pasted — which checks whether the age passed in is less than 10, and we must declare the type of the value we return.
It is going to be of Boolean type, and that is what we need BooleanType for. Let's run this, and now we have access to a user-defined function, less_than_10_udf. Now we are going to apply the filter operation with this new user-defined function. One thing to understand: we can refer to a column of the DataFrame as data_df.age — data_df being the original DataFrame we formed initially — and we pass that column into our user-defined function inside filter, creating the new DataFrame where age is less than 10. Let's check we got the right set of data: I can say show with, say, 50, and you can see these are the same records we got before, and all the ages are less than 10. So this is a different way to write a filter. But the unique thing about DataFrames is the ability to write SQL-style queries: as you saw, it was very simple — essentially just writing SQL — and most of us are familiar with SQL, so it's much easier to work with DataFrames. On top of that, when SQL-style queries are executed on DataFrames, they are optimized internally, which makes the performance much better than with plain RDDs, which were a black box that Spark could not optimize. So let's go to the next section and explore DataFrames further. 39. Working with DataFrame and Understanding Functionalities Part 2: When it comes to DataFrames, I personally find them very easy to work with, because I'm familiar with most of these operations and they are very intuitive. We will see some of the operations like max, average, groupBy, count, and orderBy; we will also see how to drop duplicates from a DataFrame and create a custom DataFrame with only a few fields.
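The where/UDF filtering described above can be sketched on plain rows: the UDF is, at heart, just a Boolean-returning function applied to the age column of every row. The rows and names below are made up for illustration; in PySpark the same lambda gets wrapped with udf(..., BooleanType()):

```python
# The UDF filter modeled in plain Python: a bool-returning predicate on a column.
rows = [
    {"first_name": "John", "age": 34},
    {"first_name": "Jane", "age": 8},
    {"first_name": "Ali",  "age": 5},
]

less_than_10 = lambda age: age < 10        # the UDF's lambda: returns a bool

young = [r for r in rows if less_than_10(r["age"])]
print([r["first_name"] for r in young])    # ['Jane', 'Ali']
```

Declaring the return type (BooleanType in PySpark) is what lets filter know the function's output can be used as a keep/drop decision for each row.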
If you have a big table consisting of multiple columns, you can create a new DataFrame holding just a few of those columns. These are the operations we will see here. Again, in the case of DataFrames as well, we have actions and transformations. An action is where the DataFrame work actually runs in memory and gives you a result; with a transformation, you are transforming a DataFrame into a new DataFrame, and that new DataFrame is not actually materialized at that point — what's produced is a description of how to create it, because transformation operations are evaluated lazily. You will see the DataFrame object formed, but in reality it is not formed in memory until you apply an action to those transformations. We have already been familiarized with this concept of lazy evaluation with RDDs; the same strategy is implemented for DataFrames. The advantage is that Spark gets the chance to optimize your operations, and the functionality is really easy to work with — that's a personal opinion, but you will likely find the same when we go to the practical and see all of these operations in use. So let's go into the Jupyter notebook and try these functionalities on our DataFrames. 40. Working with DataFrame and Understanding Functionalities Part 3: Let's operate on this DataFrame now. One thing we didn't mention in the previous section is which functions are used as actions with DataFrames. Functions like show, take, and first are used as actions, and you should generally use take and first instead of collect. To give you an example, data_df.show(10) displays ten rows in a very nice tabular format — that is an action. I can also use take, and it provides similar functionality, but it gives me Row objects instead of a nice tabular format. I can also inspect the structure.
The column names are also mentioned at the top of each column, but take gives me slightly rawer data: I get the data in the form of a list, and I can process it further as well. I can also get the first item, because it is basically a list, by using the index operation; or this is equivalent to just writing first here, so let's try to run it. Now, first, take, and show are all basically actions which you are going to use with your data frames. Now let's write a query where we sort the entire data frame by age and get the highest age. How will we do that? We will use the function orderBy on the data frame, and we will mention which column we need to order by. We will order it by age, by passing the column name "age" here. Once I do this, it gives me a data frame which is ordered in terms of age. Now I need the first data field. Let's try to show it in the form of a row: I can basically take(1) and show the first one, and it gives me the lowest age. But if I want to reverse it, get the maximum age, and then read the details of that person, I can use descending order here, and it will give me the details of the person having the highest age. Now, we can also have a situation where we might have multiple records for the same person. So what we can do first is see how many records we have; we can use a function called count. We have 100 records. If we now want to check how many of those records are actually distinct, I can use the distinct function and then apply my count. If this is less than our count, that means there are duplicate records; here it matches, which suggests there are no duplicate records. Now let's create a simple data frame where we do have duplicate records. I will use temp_df: I will be calling the SQL context, and on it I will call createDataFrame.
And here I will be passing a list, and each item of the list will be a tuple. Let's have the name "Joe" (the J should be capital) with the value 1; let's have one more record, which we make the same as the first record; and then let's have one more with the name "Anna", which has 15. So let's try to display the data frame. We can see the count as well: it's three. But if we try to run the distinct count now (we'll just copy the command, distinct and then count), we will see the value is two. That means there are some duplicate records. Now I can delete the duplicates with .dropDuplicates(), and then I will specify which columns need to be compared while detecting the duplicates; we will say the first field, name. But one thing you might have noticed is that we have not given any schema to our data frame. So what we can do here is add the schema, with "name", and add the second part, which is basically an integer; we can say it is the score which has been given to them. Let's try to run it, and again the count. And now here we are comparing duplicates by name, so we provide "name" within a list, and we can basically count the result as well. If you want to store it as a new data frame, you can also do that, and then we'll count the new data frame: I can take this temp_df_new and call count on it, so we have two now. I can also see what the data frame contains with show: the Anna record and one of the Joe records remain; the duplicate has been removed. Now let's see how we find the max, min, and average within this data frame.

41. Working with DataFrame and Understanding Functionalities Part 4: Welcome back. In this section we are going to understand how to use the max, min, and average functions which are provided out of the box with data frames. So let's try to use the data frame which we created. This is a very simple example,
so you understand how to use these functions. What I will do first is take the data frame, then use groupBy; I can hit the tab key and it will show me groupBy. After that, one nice thing I would like to show you, which you might be doing anyway, is storing this grouped result in a temporary variable, temp_group_by_df. What I will do is take this temp_group_by_df, press a dot, and it will show me all the functionality I have access to. Now I have agg, avg, count, max, min, pivot, the SQL context, and sum: all the functions and actions which are exposed to me through this temporary grouped data. So what I will do now is call max, and the way you do it is you need to provide which column you need the max of: this is "score". If you just want to see the output, it will give you a data frame in return, so I need the first row of that data frame. What we get now is a Row; if I just want the value, I can use the zeroth index to get it, and this is the integer value which I have now. I can also use other functions here, like min to give me the minimum score, and avg as well, and you can explore more with this groupBy function to get max, min, average, and count. Now, the final part of our data frame exercises is to create one more data frame which will have a new column added, combining any one or two columns of the existing data frame. For example, I have a name and a score in the table, so I will create a new column, name_score, where these two values are just concatenated. The way we will do that is to say temp_data_frame.select, and we will basically use concat, and we are going to pass the temp data frame's columns. We will say which columns we need to concatenate: we will say name, and we need to concatenate with a space.
And then we will provide the second part, or whatever other part we need to concatenate after this name field; we will provide score here. So we need to concatenate name and score, and then we can provide any other fields which we need. For example, if we only need this particular concatenated field in the final data frame, then that's okay; I can also add more columns to this new data frame. So what I will do now is write the action show here, within this line itself, so we get the final result directly, and we are not really doing the further operation of getting this data frame and then applying the show action on it. So let's try to run it. What we have here is our name and the score concatenated. We can also supply a space here, and basically it will give a space between these two concatenated fields, name and score. Now, I can also add some further columns. If I need one more column, or any other columns, to be appended to this particular frame, then I will just mention here, after a comma, whatever additional columns I need in my final data frame. For example, in my final data frame I need the score, and I also need the name as well, so I just mention name here. So in my final data frame we have three columns now: the first one is the concatenation, the second one is the name, and the final one is the score. So let's try to run it, and it gives the same result that we expected. Now what I can do further here is apply a filter operation before the data frame's select operation. So if I need to filter any records which match my condition, I can mention it here. For example, I can mention here: only do this operation if the score is more than ten. So I will say temp_data_frame.score is more than ten. So this is the condition.
If a record satisfies that condition, only then does the select operation apply to it, and we have the final result, as you can see. Now, these are some of the basic operations on data frames, and you might have understood that they are mostly very simple and intuitive, so you can carry out most operations on your own side. To drill down further, you can also look into the Spark documentation; the Spark documentation is the official documentation, and it has sections for each supported language. We will be preferring Python. For that, we have access to all the modules, libraries, and packages at our disposal. You can do a quick search: for example, if I just search for "DataFrame", it will give me all the packages containing DataFrame, and you can see all the functionalities related to it, with examples. You can try out more of these examples, but at this point you are comfortable enough and have a basic idea of DataFrames, which is good enough for us to work on a data set and its operations when we apply our machine learning algorithm. In the next sprint, we will be looking into the machine learning part of Spark. We will also have a prerequisite machine learning course, and that course will be mostly practical and also mathematically driven. Once we implement the machine learning, we will take the help of data frames, and these basic functionalities and functions which you are now comfortable with will be used to showcase and build the model. So in the next section I will be doing the final operation of this sprint, that is, to import our data from the local machine to Hadoop/Spark and then use it in our Spark system. So let's go ahead and finish that part.