2021 Edition - Big Data Hadoop for absolute beginners with GCP Dataproc & Cloudera QuickStart VM | Engineering Tech | Skillshare

Engineering Tech, Big Data, Cloud and AI Solution Architect

14 Lessons (1h 12m)
    • 1. Big Data Hadoop Introduction

    • 2. What is Big Data?

    • 3. What is Hadoop?

    • 4. Hadoop Distributed File System (HDFS)

    • 5. Understanding Google Cloud (GCP) Dataproc

    • 6. Signing up for a Google Cloud free trial

    • 7. Storing a file in HDFS

    • 8. MapReduce and YARN

    • 9. Hive

    • 10. Querying HDFS data using Hive

    • 11. Deleting the Cluster

    • 12. Analyzing a billion records with Hive

    • 13. Cloudera QuickStart VM on GCP

    • 14. Bonus: Running Spark2 on Cloudera






About This Class

Learn Big Data with GCP Dataproc & Cloudera QuickStart VM. You will learn the following with hands-on examples:

1. Big Data Concepts - Data Engineering with Hadoop
2. Hadoop Concepts - HDFS, MapReduce, Hive, YARN
3. Hadoop Hive programming - Analyzing a billion records on a production grade cluster
4. Hadoop HDFS commands
5. How to do Hadoop programming on GCP Dataproc
6. How to set up Cloudera Quick Start VM on GCP

Prerequisites:

1. Basic programming skills
2. Basic knowledge of databases and queries

Meet Your Teacher


Engineering Tech

Big Data, Cloud and AI Solution Architect


Hello, I'm Engineering.





1. Big Data Hadoop Introduction: One of the challenges in learning Big Data Hadoop is the environment setup related hassles. Our key focus in this course is to teach you the basic concepts of Big Data Hadoop, and also to show you how you can be up and running with a three-node Hadoop cluster in just a few minutes. GCP provides $300 free credit for a year for any new subscriber. Using that free credit, you can explore big data, machine learning, virtual machines, storage and many other services for free for a year. You will learn how you can use the fully managed Cloud Dataproc environment on the GCP platform and get started with your Hadoop and Hive programming in just a few minutes. We will also teach you how to set up a Cloudera QuickStart VM on a Google Cloud Platform virtual instance and be up and running with Hadoop and Hive programming, again within a few minutes. You will explore the Hadoop user interface and learn how to use the query editor to do Hive programming. We will teach you the basics of Hive: how you can create internal tables and external tables, store data in Hive and use it for your analytics purposes. I will also teach you the basic commands of Hadoop HDFS so that you know how to interact with the Hadoop file system. This is a course for absolute beginners; if you have basic programming knowledge and some knowledge of SQL queries, that is all you need to handle this course and get started with your big data journey. Thank you. 2. What is Big Data?: So what exactly is big data? Big data is about storing and processing huge volumes of data to derive some sort of meaning out of it. Data helps us understand what exactly has happened so that we can take better decisions. In a big data world, data is processed either in batches or in real-time. It's not just the volume of data; we're also talking about the speed of data. As data volumes grew, the speed at which data gets accumulated in the real world also increased.
There was a need to take a different approach to process such high-volume, high-speed data. Big data has five key characteristics, also known as the five V's of big data. Let's look at them. The first V of big data is volume, which is certainly there in the name. How big is really big data? Let's look at the units of measurement of data. In a computer system, the lowest measure of data is a bit, which can store a one or a zero. Then we have the byte, which can store eight bits. A byte can store a single character. Then we have the kilobyte, which is 1,024 bytes. When we write one page of text, it requires about 20 to 30 kilobytes of storage space in a computer system. One step up from kilobytes is megabytes: a picture taken on a high-resolution camera would typically take about two to three megabytes of storage space. A gigabyte is 1,024 megabytes. These days, we get laptop computers with 1,000 gigabytes of storage space, which is approximately one terabyte, the next unit of measurement. After that we have the petabyte, which is 1,024 terabytes. When we are talking about terabytes and petabytes, we are entering the world of big data. Traditional systems cannot store and process such huge volumes of data. It doesn't stop here. The next unit of measurement is the exabyte, which is 1,024 petabytes. There are many real-world use cases which handle exabytes of data. Google stores about 30 to 40 exabytes of data to do all the processing for their search engine. The next V of big data is velocity. That's the speed at which data gets ingested, or comes into, a big data system. There are millions of tweets every minute. Hundreds of photos get uploaded to Instagram every second. A big data solution must be able to handle such high-speed data in an acceptable timeframe.
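The byte-to-exabyte ladder described above is a simple powers-of-1,024 progression, which a few lines of Python make concrete (unit names and ordering follow the lesson):

```python
# Each storage unit above is 1,024x the previous one, starting from the byte.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB"]

def bytes_in(unit):
    """Number of bytes in one of the given unit (1 KB = 1,024 bytes)."""
    return 1024 ** (UNITS.index(unit) + 1)

print(bytes_in("MB"))                    # 1048576
print(bytes_in("PB") // bytes_in("TB"))  # 1024: a petabyte is 1,024 terabytes
```

Running this confirms, for example, that one exabyte is 1,024 petabytes, the scale at which the lesson says big data systems operate.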
The next V of big data is variety. Data might be coming in from multiple different sources with different attributes and characteristics, and any big data solution that we design should be able to address all kinds of variations in the data. Any data that can be grouped into rows and columns, for example the data that can be stored in a CSV or Excel file, is called structured data. Semi-structured data is data that is not really tabular, but there is some structure to it. For example, the key-value pairs in an XML or JSON file can be called semi-structured data. Everything else that is out there, like text files, images and videos, can be called unstructured data. Any big data solution we design should be able to handle all kinds of data: structured, semi-structured and unstructured. Veracity is the truthfulness of the data; it's extremely critical to have accurate data for the right decision-making. And the final V of big data is value. That's the most important characteristic of big data: the data is of no use unless it adds value to our organization or business process. The core problem that we need to address in any big data solution is the storage aspect, and that is typically solved by having a distributed storage system. Instead of dumping the entire data onto a single machine or piece of hardware, you split it across multiple machines; that is distributed storage. For example, if you have one terabyte of data, you can split it across five different machines, each one storing 200 gigabytes of data. These days, storage is really cheap. You can have multiple computer systems, or commodity hardware, and use them to store data for your big data solution. Since the data is stored in a distributed manner across a set of machines, it needs to be processed in a distributed way. That takes us to distributed computing. Processing should be done where the data is stored, rather than pulling everything to a single machine and doing the processing there.
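The one-terabyte-across-five-machines example works out as a single division (the terabyte is taken as 1,000 GB here, as the lesson rounds it):

```python
# Distributed storage in one line: split a dataset evenly across machines.
def gb_per_machine(total_gb, machines):
    """Approximate storage each machine holds when data is split evenly."""
    return total_gb / machines

print(gb_per_machine(1000, 5))  # 200.0 GB on each of the five machines
```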
We'll start our big data journey with these two concepts, distributed storage and distributed computing. Next we'll look at Hadoop, which is a popular framework for storing data in a distributed manner and also doing the processing where the data is. 3. What is Hadoop?: Hadoop is a framework for distributed storage and distributed computing. Let's dig into Hadoop and understand how it stores data in a distributed way and how it does the parallel processing. Hadoop is designed to scale from a single machine to thousands of machines, with each machine providing its own storage and CPU. When Hadoop was first launched, it had two core components: HDFS, the Hadoop Distributed File System, for distributed storage, and MapReduce, for parallel processing or distributed computing. In a Hadoop framework, multiple machines are assembled to form a cluster of computers. Each machine which stores data is called a DataNode. Each DataNode will have its own CPU and memory for parallel processing. There is a central NameNode, or master node, which manages all these DataNodes. It checks which node is free, which node has more storage or compute capacity, and then allocates resources accordingly. The NameNode acts as the central authority in the Hadoop framework. This kind of architecture is also known as master-slave architecture: the NameNode acts as the master, and it tells the DataNodes what to do. You can also use the NameNode to store data, though typically that is not the practice in the real world. 4. Hadoop Distributed File System (HDFS): Let's now look at HDFS. HDFS, as we described earlier, is the distributed file system in a Hadoop framework. HDFS allows distribution of a large dataset across a cluster of computers. In HDFS, data is stored in blocks of 128 megabytes, and each block can be replicated across multiple machines. By default, the replication factor is three. For that you would need at least three data nodes.
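The block-splitting and replication behaviour described here can be sketched in a few lines of Python. This is a conceptual model only (a naive round-robin placement, not the real HDFS block placement policy):

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size, as described above
REPLICATION = 3       # HDFS default replication factor

def num_blocks(file_size_mb):
    """How many HDFS blocks a file occupies (the last block may be partial)."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

def place_blocks(blocks, datanodes):
    """Assign each block to REPLICATION distinct DataNodes, round-robin."""
    return {
        b: [datanodes[(b + i) % len(datanodes)] for i in range(REPLICATION)]
        for b in range(blocks)
    }

blocks = num_blocks(500)                        # a 500 MB file -> 4 blocks
print(place_blocks(blocks, ["dn1", "dn2", "dn3"]))
```

With three DataNodes and replication factor 3, every block lands on all three machines, which is why losing any two nodes still leaves the data intact.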
And this replication factor is configurable. Here, as you can see, a data block highlighted in the color red is replicated across three machines, and similarly for the data blocks in other colors. With replication factor 3, even when two data nodes go down, the data would still be intact. For example, if the second and third nodes go down, the data represented in the red block will still be available on the first DataNode. This gives Hadoop, or HDFS, its fault-tolerance capability. Data is distributed across multiple machines, so in case of any hardware failure on one or more machines, the data would still be intact. You can always set a higher replication factor. A higher replication factor means your data would be backed up on more machines, and more storage space would be required. So you need to do a trade-off between storage cost and how critical your data is, that is, how many machines you need to back it up on. Before we dive further into the Hadoop framework and understand MapReduce and its parallel processing framework, let's set up a Hadoop cluster on the Google Cloud (GCP) platform and understand the practical side of distributed storage. 5. Understanding Google Cloud (GCP) Dataproc: Now let's jump into the practical side of Big Data Hadoop. The easiest way to get up and running with a production-grade Big Data Hadoop cluster is through Google Cloud Dataproc. It's a fully managed service for Apache Hadoop and Apache Spark. In just a few clicks, we can be up and running with Google Cloud Dataproc. When you sign up for Google Cloud, or GCP, you get $300 free credit, and that's a good amount to try out various services on GCP, including Google Cloud Dataproc and various other big data and machine learning services. Google has per-second billing: you'll be charged for every second you are using the cluster, and that amount will get deducted from your free credit balance. Typically, you will consume about $3 per day for a three-node cluster with 16 GB RAM on each machine. We have $300 of free credit.
So you can keep the cluster running for 100 days without incurring any additional cost. But you don't have to run it all the time: create a cluster, use it, then terminate it and recreate it whenever you need it. Next, we'll understand how to sign up for Google Cloud or GCP. 6. Signing up for a Google Cloud free trial: Let's create a free Google Cloud or GCP account. Search for Google Cloud Platform, click on cloud.google.com, then go to the console. You would need to sign in using your Gmail ID. You can see an option to try it for free. As of November 2020, the free trial period is three months. You get $300 free credit when you sign up for Google Cloud, and that's a big amount to try out various things on Google Cloud Platform for big data, machine learning and other services. You'll have to enter your credit card and other details to sign up. Google will authenticate your card and charge up to $1, which you'll get refunded within a day or two. You will not be automatically charged after the free trial period; Google will send you a message and ask you to renew beyond the free trial period. This is the main Google Cloud Platform homepage, and you can go to various links. One of the important links is the billing link, where you can see how much free credit is available. From here, you can search for different services and try out different things on the GCP cloud platform.
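The free-trial budgeting quoted in the previous section ($300 of credit against roughly $3 per cluster-day) is trivial arithmetic, but worth writing down:

```python
# Free-trial budget sketch using the figures quoted in the lesson.
FREE_CREDIT_USD = 300
CLUSTER_COST_PER_DAY_USD = 3   # approximate, for the three-node cluster above

days_of_cluster_time = FREE_CREDIT_USD // CLUSTER_COST_PER_DAY_USD
print(days_of_cluster_time)  # 100
```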
7. Storing a file in HDFS: Let's log into Google Cloud Platform and create a new Dataproc cluster. Search for Dataproc. If you're trying it for the first time, you'll be asked to enable the API for Dataproc. Just click on that enable API link; it takes a few seconds to enable your Cloud Dataproc service. After that, you'll be taken to the Cloud Dataproc cluster page shown here. It's only the first time that you need to enable the API. Click on Create Cluster. We'll keep the us-central region, and we'll also keep the default zone. We'll have one master and multiple worker or data nodes in the cluster. Keep everything else default on this page and click on configure nodes. We'll create a general-purpose cluster. For the master node we'll keep the default option of four CPUs, and also leave the disk size as 500 gigabytes. For the worker nodes, or data nodes, we'll select the machine type with two CPUs, because at maximum we can have eight CPUs within the free trial limit. So we'll have four CPUs in the master node and two in each DataNode or worker node. We'll leave the default disk size of 500 GB and all other parameters as default. Let's click Create. So this is how we can create a production-grade big data cluster on the Google Cloud Dataproc platform in a few minutes. Now the cluster is ready. Let's click on the cluster name. Next, we'll click on the VM Instances tab. We can see the cluster with one master node and two worker nodes, and we'll also find a link to SSH into the master node. Let's click on that. Now we're logged into the master node. This is a Linux environment. Hadoop is designed to work in a Linux environment, and you need some basic knowledge of Linux, like how to create a directory or how to move files between directories, to work with Hadoop on a Linux platform. During this course we'll be covering all those basic commands, so you do not need to worry if you have no prior Linux background. Just to recap: we have one master node and two worker nodes, or data nodes, in the Hadoop cluster, and right now we have logged into the master node. Dataproc provides a web-based SSH interface using which you can log in to the master node server. Now, in the master node, we can fire the Linux pwd command, that is, the present working directory command, to know where we are. By default, you log in to the home directory. It is showing /home/futurexskill2021 as the home directory. It's a user directory with the same name as the Gmail ID that was used to sign up for the GCP account.
For you it would be different; it would be the same as whatever Gmail ID you used to sign up for GCP. The ls command gives us all the files in the current directory. Right now we do not have any file in this home directory. From here we can navigate to other directories, or we can create sub-directories. Let's now understand how to pull an external file into the master node local environment. We'll go to GitHub and find the futurexskill big data repository. Under this big data repository, we have a file called retailstore.csv. We'll copy this file from the GitHub repository to the Dataproc cluster. This file contains age, salary, gender, country and purchased information for a set of customers. In this lab we are working with a very small file, but whatever commands or techniques you will see would also work for a large file in a big data world. Now, to copy this file, we need to know the raw file path. Click on the Raw tab here, and whatever URL you see, just copy that. Let's go to the master node Linux environment, type wget and then the file path: simply paste whatever we copied by using Ctrl+V and fire this command. wget is the Linux command to pull an external file into the current environment. Now the file has been copied, and we can verify the same by using the ls command. The retailstore.csv file is present in the master node home directory. From here we'll be moving the file to the HDFS directory. HDFS is a distributed file system across a cluster of computers; in our case, we have two worker nodes, so our distributed file system is distributed across those two worker nodes. Let's understand how to move this file from the master node local environment to the distributed HDFS file system. hadoop fs, where fs stands for file system, followed by a space and a dash, is the command to interact with the Hadoop file system from the master node or from any of the worker nodes.
Since we have access to the master node, from the master node, by firing this hadoop fs - command, we can interact with the distributed file system. Let's see how that works. hadoop fs - gives us a handle to the default file system, and we need to specify which command we need to execute. For example, we can say hadoop fs -ls; it would list all the files that are present in the home directory of the HDFS file system. By default, GCP Dataproc doesn't create any user directory in HDFS, and we need to create that. The way to do that is by using the Linux mkdir command. mkdir is the command to create a directory in the Linux environment, and you can prefix it with hadoop fs - to create a directory in the HDFS environment. We'll first create a user home directory in HDFS. It has to be created under the /user directory, with the same name as your local environment username, for example futurexskill2021 in our case. Now if we do a hadoop fs -ls command, it doesn't show anything, but it also doesn't give the message "no such file or directory". That means this directory has been created in the Hadoop HDFS file system. Whatever username we have in the local master node environment, we need to create a directory with the same name in the distributed file system for this user. /home/futurexskill2021 is the local home, and the corresponding home in HDFS is /user/futurexskill2021. Let's create another directory under the futurexskill2021 user directory in HDFS; we'll call it data. Now when we do a hadoop fs -ls, it shows the data directory. We can also do hadoop fs -ls and specify a directory name; by default, it picks everything from the home directory. hadoop fs -ls would give us the content of the home directory, and we can also get the content by specifying the actual directory path. When we say hadoop fs -ls /user, it shows all the directories under the /user directory in HDFS.
So we can see hdfs, hbase; these are some of the directories that Dataproc would have created while creating the cluster. Let's understand how to move retailstore.csv from the master node to the distributed Hadoop directory. When we do ls, we can see the file in the master node local environment. To move this file to the HDFS directory, we need to use the command hadoop fs -put, and then specify the file name and the target directory name. By default the target is the home directory. So let's copy this and paste it here; if we don't specify the target, it would get copied to the home directory. Now if we do hadoop fs -ls, it would show retailstore.csv and also the data directory. And when we do a local ls, it shows only retailstore.csv, the file in the local environment. Let's create another file in the local environment. We can use the Linux touch command to create a blank file. Now when we do ls, we can see two files in the local environment. But when we do hadoop fs -ls, we see retailstore.csv and the data subdirectory, but we don't see the readme.txt, because we have not moved that from the local environment to the Hadoop environment. You need to understand this difference: what is there in the local environment on the master node, and what is there in the Hadoop HDFS distributed file system. Let's now move the readme.txt using the hadoop fs -put command, and this time we'll not specify the destination directory. By default, it goes to the user home directory in HDFS. hadoop fs -ls again, and we can see the README file in the HDFS directory. The same file is also still present in the local directory. Now let's understand how to pull a file from the HDFS directory to the local environment. When we move a file from the local environment to the distributed environment, the file is split into multiple data blocks.
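The put/ls behaviour walked through here can be modelled with a toy, in-memory "HDFS" to make the default-to-home-directory rule explicit. This is a conceptual sketch only, not the real HDFS client API, and the user and file names simply mirror the lab:

```python
class ToyHDFS:
    """Toy model of hadoop fs -put / -ls / -get semantics."""

    def __init__(self, user):
        self.home = f"/user/{user}"   # HDFS home mirrors the local username
        self.files = {}               # full path -> file contents

    def put(self, filename, contents, directory=None):
        # hadoop fs -put: destination defaults to the HDFS home directory
        self.files[f"{directory or self.home}/{filename}"] = contents

    def ls(self, directory=None):
        # hadoop fs -ls: list entries under a directory (home by default)
        d = directory or self.home
        return sorted(p for p in self.files if p.startswith(d + "/"))

    def get(self, filename, directory=None):
        # hadoop fs -get: copy a file back out of HDFS
        return self.files[f"{directory or self.home}/{filename}"]

hdfs = ToyHDFS("futurexskill2021")
hdfs.put("retailstore.csv", "age,salary,gender,country,purchased\n")
hdfs.put("readme.txt", "")   # like the touch-then-put step in the lab
print(hdfs.ls())             # both files under /user/futurexskill2021
```

The point of the model: a bare filename resolves against the HDFS home directory, which is why `-put` and `-ls` work in the lab without any path being typed.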
We have a replication factor of two, so each data block would be available on two nodes, and the file will be stored in a distributed manner. Now, to pull a file from HDFS to the local environment, we can use the hadoop fs -get command. Let's see how that works. First, let's remove the file from the local environment. rm is the command to remove a file in the Linux environment. Now we do not have retailstore.csv in the local environment, but it is present in the Hadoop HDFS directory. We can use the hadoop fs -get command to pull the file from the HDFS directory to the local directory. We can specify a dot here; that means the file would get copied to the present directory. We could also specify a different directory name, and the file would get copied to that directory, provided we have the write permission. Now if we do ls, we can see the file in the local environment. In this lab, we have seen how to store a file in the Hadoop HDFS file system. Though we applied it to a small file, the same technique can be applied to any large file. You have seen how to copy a file from the master node to HDFS, and also how to pull it back from HDFS to the master node. Typically, you will store the file in HDFS and then do some processing using MapReduce or Spark, which we'll see later. Once the processing is done, if you want to share the file with the outside world, you can pull it from HDFS, store it on the master node, and then move it to another location where other applications can access it. In Dataproc, we have access to the master node, but typically, in a production real-world scenario, you would have access to one of the worker or data nodes. You can interact with the HDFS file system from any of the worker or data nodes using hadoop fs commands. 8. MapReduce and YARN: Let's now look at another core component of Hadoop, which is MapReduce. Till now we have seen how to store data in the Hadoop HDFS file system in a distributed manner. Using MapReduce,
we can do parallel processing on this distributed data across the cluster of computers. It's a way of sending computational tasks to HDFS. MapReduce consists of several map tasks and reduce tasks. It splits the data into individual chunks which can be processed by map tasks, while reduce tasks work on the output of the map tasks to produce aggregated data. Let's understand that through an example. This diagram depicts how MapReduce used to work when Hadoop was first released. Typically there will be a resource manager in the master node, along with a JobTracker, and there will be several node managers running on individual nodes, each with a TaskTracker. The JobTracker will send several tasks to the TaskTrackers, and the TaskTrackers on each node would execute those tasks using the CPU and memory available on that node. Now let's understand MapReduce through an example. Imagine you have billions of students and their marks, and you have to calculate their average mark using the Hadoop MapReduce framework. First, you'll store the files containing student marks in the HDFS file system in a distributed manner. Once that is done, you will write several map tasks and reduce tasks to calculate the average mark. A client of the Hadoop platform will send a request to the resource manager to calculate the average mark, and the resource manager will send requests to the node managers to do the actual processing. The mappers would read the files, get the student and mark as a key-value pair, and store the output on disk. The mapper would only look at the fields relevant to calculating the average mark; if there are several other fields, it would ignore them unless they are required for the calculation. The mapper output on each of these nodes is saved to disk. Once the mappers finish their tasks, the reducers kick in, and they typically do aggregation operations like sum and average calculations.
In this case, the reducers will look at the student marks and calculate the average mark on each of these nodes. The node manager will send the output from each node to the resource manager, and the resource manager will combine the output and send it to the requesting client. To summarize: map tasks read each record and get the relevant fields for the calculation, and the output of each map task is stored to disk. Reduce tasks look at the output of the map tasks and perform aggregation operations like calculating sums, averages, et cetera. Depending on the type of analysis we are performing, reduce tasks may or may not be required. In some cases we might do simple filter operations, for example: get me all the students who scored higher than a certain mark. In that case, reduce tasks would not be required. In Hadoop 1, the resource manager was a key component of MapReduce: MapReduce used to do the parallel processing and also the resource management, that is, keeping track of which node is free, which node has more storage, which node can run additional mapper tasks, and things like that. Starting with Hadoop 2, resource management was given to another component called YARN. The basic idea behind YARN, or Yet Another Resource Negotiator, was to free MapReduce from the resource management responsibility, which had created some scalability issues. With YARN in place, other non-MapReduce applications, such as Spark, can easily interact with the Hadoop HDFS file system and do the processing. In the real world, you'll see many examples of data getting stored in Hadoop HDFS, but Spark or other non-MapReduce query tools getting used to interact with that data. To summarize, Hadoop has three core components: HDFS, which is the distributed file system; MapReduce, which is the parallel processing framework; and YARN, which does all the resource management and job scheduling in a Hadoop cluster environment. This is how the Hadoop architecture looks.
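The average-marks example above can be sketched as a single-process simulation: each "map task" produces partial (sum, count) aggregates for its chunk of records, and the "reduce task" merges them into final averages. Student names and marks are illustrative:

```python
from collections import defaultdict

def map_chunk(records):
    """Map task: read (student, mark) records, emit partial (sum, count) pairs."""
    partial = defaultdict(lambda: [0, 0])
    for student, mark in records:
        partial[student][0] += mark
        partial[student][1] += 1
    return dict(partial)

def reduce_partials(partials):
    """Reduce task: merge partial aggregates and compute the average mark."""
    totals = defaultdict(lambda: [0, 0])
    for partial in partials:
        for student, (s, c) in partial.items():
            totals[student][0] += s
            totals[student][1] += c
    return {student: s / c for student, (s, c) in totals.items()}

# Two "DataNodes", each holding one chunk of the student records
chunk1 = [("alice", 80), ("bob", 60)]
chunk2 = [("alice", 90), ("bob", 70)]
print(reduce_partials([map_chunk(chunk1), map_chunk(chunk2)]))
# {'alice': 85.0, 'bob': 65.0}
```

Note the design choice mirrored from the lesson: mappers emit only the fields needed for the calculation, and all cross-chunk aggregation happens in the reduce step.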
Data is stored in a distributed manner in the HDFS file system, and YARN takes care of the resource management. It keeps track of all the storage and compute capacity in the different data nodes, and MapReduce runs on top of YARN to do the final processing. MapReduce is written in Java, and in the older days, MapReduce programming was done using Java, which is quite complex. Later on, Hive and Pig query tools came in, which simplified MapReduce programming by providing a SQL-like interface to the end users. Behind the scenes, Pig and Hive queries get converted to MapReduce Java programs and get executed on the cluster, but that complexity is hidden from the programmers. 9. Hive: Let's now understand Hive, which is a SQL query tool to process data stored in HDFS. This is the year 2021, and Hive is one of the most popular query tools in the Hadoop ecosystem. These days, people do not write MapReduce Java programs; Hive and other query tools are used, which give a much simpler interface to the Hadoop HDFS data. Hive is built on top of MapReduce: whatever queries we write get converted to MapReduce behind the scenes and get executed on the Hadoop cluster, and the complexity of MapReduce programming is hidden from the developers. People with SQL knowledge can use Hive queries to get or store data in HDFS. One thing to note here is that Hive can process only structured data, that is, data that can be stored in a tabular format. And Hive is not a database; it points to the data stored in HDFS. Hive stores only the metadata information, that is, the data about the data (what is the schema, what are the columns, etc.), in a metastore, which is a relational database outside HDFS. In Hive you create tables, which give a logical view of the data stored in HDFS; we'll understand that in detail in the practical session. Hive queries are written using HiveQL, which stands for Hive Query Language. Syntactically, it's very close to SQL.
Let's see some examples of HiveQL to understand it better. 10. Querying HDFS data using Hive: Let's now dive into the practical side of Hive. We'll log in to the Google Cloud console and create a new Dataproc cluster. Search for Dataproc, click Create Cluster, and keep the defaults. Click on configure nodes; for the worker nodes we'll select the two-CPU machine, leave everything else as default, and then create the cluster. Now the cluster is running; let's click on it. Next we'll go to the VM Instances tab and SSH into the master node. We'll first create the HDFS directories the way we have done earlier: first the user directory, then the data directory under it. Next we'll get retailstore.csv onto the master node: copy the raw file path and, using wget, pull it to the master node. Then, using hadoop fs -put, we'll move it to the data directory in HDFS; we need to specify the file name. The file has been copied. Let's check it out, and we can see the file in HDFS. In Linux, to view a file, you can use the cat command; this displays the content of the local file. Similarly, you can write hadoop fs -cat, specify the complete file path, and view the contents of the file in HDFS. So we've pulled the file from outside and stored it in HDFS. Let's understand how we can interact with this data through Hive. On Dataproc, you can simply type hive and enter the Hive console. From here, you can start writing Hive queries and interact with the data stored in HDFS. If you do show databases, it shows the default database, default, which you can use to store your tables, or you can create a new database. Let's create a new database called futurex. The command for that is create database: CREATE DATABASE IF NOT EXISTS futurex. It created the database, and next we can use the database. Now we are within the Hive futurex database. Show tables to see what tables are currently available.
It doesn't show any tables, because we have not created any yet. Next, we'll create a table to interact with the retail store data stored in HDFS. The retail store file has fields like id, salary, gender, country, and purchased, and we can create a table whose schema maps to one or more columns in this file. For now, let's create a table with columns for all the fields in the CSV file. The syntax is CREATE TABLE with any table name, then the fields and their data types: we have specified id as int, salary as float, then gender and country as string, and purchased as int as well. We also need to specify how the data is delimited in the file; the CSV file we have is comma-delimited. Hive does not create physical tables; it points to the data stored in HDFS, so we have to specify the location of the data in HDFS. We create a logical table that points to that data. We can also specify whether the file has a header; otherwise the header row of the CSV file will be treated as the first data row. Let's create this table. Now a retail_cust table has been created pointing to the data directory in HDFS, and when we query it, it will fetch data from all the files stored under that directory. Let's do a SELECT * FROM retail_cust. We got the output: it shows all the records we just saw in the CSV. When we fired the SELECT command, behind the scenes it was converted to a MapReduce job and executed on the cluster via the YARN ResourceManager, and we can see some of that here. Now let's keep this window open and open another terminal to interact with HDFS. Back in the Dataproc VM Instances tab, we'll click the SSH link, and we'll have another window open to interact with the Hadoop file system. We have the file retail_store.csv; let's copy it to create another file, using the cp command in Linux.
Let's call the copied file retail_store2.csv. Next, let's move retail_store2.csv into the same HDFS data directory. Let's do a hadoop fs -ls; we can simply say data, and it will display all the files present under the user data directory. We can see both CSV files. Now, back in the Hive terminal, let's fire the SELECT query again. This time we can see the records twice, because there are two files in the directory: when we fire a SELECT query, Hive looks at all the files and fetches the data from them. Now you can write whatever SQL queries you want to get analytics from this data. Let's fire a query to find all the customers who have a salary greater than 20000: we'll select id and salary with the filter condition salary > 20000. We can see those customer details getting printed. Every query you execute gets converted to a MapReduce program behind the scenes and executed on the cluster. Now let's drop the table and see what happens. The table has been dropped. Let's do a hadoop fs -ls data: we can see that the data directory doesn't exist anymore. So when we dropped the table, Hive also deleted the HDFS directory. When we create a table, by default it is an internal (managed) table, which means Hive is also the owner of the data, and when we drop the table, the data gets deleted. You can instead create an external table in Hive, so that Hive does not have full control over the data; if other applications are using the same data, they will not be impacted if someone drops the table in Hive. To create an external table, the syntax is the same as before; the only thing you need to add is the EXTERNAL keyword, as in CREATE EXTERNAL TABLE. That is all that is required. Let's give it a different name and fire the command. Now we'll do a SELECT * from the new external table. It doesn't display anything, because the data has been deleted.
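The two table styles discussed above can be sketched side by side. This is a hedged sketch: the table, column, and path names are the assumed lab names (`<your-user>` is a placeholder), and it is meant to be run inside the Hive shell on the cluster:

```shell
hive <<'EOF'
-- Internal (managed) table: dropping it also deletes the HDFS data.
CREATE TABLE retail_cust (
  id INT, salary FLOAT, gender STRING, country STRING, purchased INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/<your-user>/data'
TBLPROPERTIES ('skip.header.line.count'='1');

-- External table: dropping it removes only the metadata from the
-- Hive metastore; the files in HDFS stay intact.
CREATE EXTERNAL TABLE retail_cust_ext (
  id INT, salary FLOAT, gender STRING, country STRING, purchased INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/<your-user>/data'
TBLPROPERTIES ('skip.header.line.count'='1');
EOF
```

The `skip.header.line.count` property is what prevents the CSV header row from being read as data.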
Let's create the directory again and copy the files from the master node back into HDFS. First we'll create the data directory, then move retail_store2.csv there; we can also move the first file, so let's do that. Now hadoop fs -ls shows the data directory, and listing the data directory shows both CSV files. Now let's fire the SELECT query again. It doesn't display anything, because the location specified in the table is not correct. Let's drop the table and create it again with the correct directory. Now a SELECT * displays the records. Now let's drop this external table. The table has been dropped. Let's do a hadoop fs -ls data: we can see that both files are still intact. And if you do a SELECT * now, it says the table doesn't exist, because the metadata has been deleted from the Hive metastore, but the actual data still exists in the HDFS file system. Depending on your use case, you can decide whether to create internal or external tables. If multiple applications are using the data, create the Hive tables as external tables so that Hive doesn't have full control over the data.

11. Deleting the Cluster: It is always a best practice to terminate the cluster once you have stopped using it. Otherwise, Google Cloud will charge you for every minute the cluster is running, which gets deducted from your free credit. So create a cluster, do all your work, and once you're done with it, terminate it; whenever you need it again, you can create it again. To terminate a cluster, go to the Dataproc main interface, select the cluster, click Delete, and confirm the deletion. The cluster will get deleted.

12. Analyzing a billion records with Hive: By now we have understood how to store data in HDFS and how to analyze that data using Hive. We can store a huge volume of data in HDFS; if we run out of space, we can add more worker (data) nodes.
That would give us more storage, and it would also give us more compute to run MapReduce jobs in parallel. Up to a certain data volume, say 300 to 500 gigabytes, traditional systems like relational databases can process data at a faster speed; as the data volume grows to terabytes and petabytes, we need to store the data in a distributed manner and then process it in a distributed manner, and that's where Hadoop helps. Let's now process a relatively large volume of data using Hive: we'll create 1 billion records, store them in HDFS, and then analyze the data with Hive. Let's see that in action. We have a retail store file containing 5 million records; the file size is approximately 200 megabytes. We compressed it to a 20-megabyte zip file and uploaded it to a GitHub repository. From there we'll move the file to the Hadoop cluster and create 200 copies of it, so that we have a billion records for our analysis. We cannot upload a file larger than 25 megabytes to GitHub, so we replicate the file on the cluster to create the desired data volume. There are other techniques for uploading huge volumes of data to a big data cluster, but in this lab we'll keep our focus on analyzing the data using Hive. So this retail-store-5-million zip file is already available in our GitHub repository. We'll create a cluster and pull the file to the name node. Let's go to the GCP console and search for Dataproc. Click Create Cluster. We'll keep the same configuration as in the previous lab: for the worker nodes we'll select the two-CPU machine and leave everything else as default. Let's create the cluster. The cluster is now ready; let's click on its name, go to VM Instances, and click the SSH link for the master node. Now let's go to GitHub and find the URL for the file. We'll select the retail-store-5-million zip file; since it is a zip file there is a download option, so we'll right-click and copy the link address.
Then go to the name node console and wget it. The file has been copied; let's check it out. So this is the zip file we have. We can unzip it using the gunzip command. The file has been unzipped, and its size is approximately 140 megabytes; it contains 5 million records. We can see sample records using the head command: specify head, the number of records, and the file name. So this has customer id, salary, gender, and country for a set of customers; we can see ten records. There are many ways to create copies of this file. One of the easy ways is to write a Linux shell for loop that reads the file and creates the copies. The syntax is something like this: for i in 1 to 200, copy the file to a name that includes i, then close the loop. Let's fire this command. This will give us 1 billion records, because each file has 5 million records and we are creating 200 copies. It will take some time, so let's open another console in the meantime. Let's do an ls to check how many files there are; we can see the files are getting copied. You can do ls -ltr to see the files sorted by date. It has created 62 of the 200 files so far; we'll give it a few minutes. While the file copy is in progress, let's create the required HDFS directories. First we'll create the home directory with hadoop fs -mkdir under /user. Let's correct the typo. The home directory has been created. Next let's create a directory called retail under it; we'll use this directory to store the retail store customer files. Let's see how many copies have been created: we're at 132 now. So the files have been copied; let's check. Now we have 200 files, each containing 5 million records. Next we'll move these files to the Hadoop HDFS directory.
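Before the HDFS upload, the local copy loop described above can be sketched as follows. The input file name is an assumption (use whatever the unzipped CSV is called); here a tiny placeholder file is created first so the sketch is runnable anywhere:

```shell
# Work in a scratch directory; in the lab you would run this where the
# unzipped 5-million-record CSV lives, without the placeholder step.
cd "$(mktemp -d)"
printf 'id,salary,gender,country,purchased\n' > retail_store_5m.csv

# Make 200 copies so that, combined, the files hold ~1 billion records.
for i in $(seq 1 200); do
  cp retail_store_5m.csv "retail_store_5m_${i}.csv"
done

ls retail_store_5m_*.csv | wc -l   # counts the 200 copies
```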
We can do that by firing the hadoop fs -put command in a for loop. The syntax is again similar to what we did earlier: it loops over all the files we created and copies them to the retail directory. In a real-world scenario you would take a similar approach to copy batch data that arrives at a particular interval: write a script and schedule it so that the data gets copied to the big data Hadoop platform on a predefined schedule. Let's fire this command. It will again take a few minutes, because there are 200 files and each file takes two to three seconds to copy. While it is copying, let's check the retail directory in HDFS with hadoop fs -ls retail. We can see the files arriving. While the copy runs, let's look at the kind of analysis we can do with this data. We are copying the files to a retail directory in HDFS; then we'll create a table on top of it, through which we can query the data. Hive can be used to transform the data: we can do various filtering and aggregations and store the output in another table, from where it can be picked up by BI tools, reporting tools, or other applications for further analysis. We are using Hive as a tool for large-scale data processing; later we'll see how the same problem can be solved using Spark at a much faster pace. Now we can see the copy is about to finish; it's copying file number 195. It's complete. Let's do a hadoop fs -ls and check: all 200 files have been copied to the retail directory under the user directory in HDFS. We now have an HDFS directory holding 1 billion retail store customer records with customer id, salary, gender, and country information. Let's analyze the data using Hive: we'll log into the Hive shell and start creating the table.
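The upload loop described above might look like this. It is a sketch under the lab's assumed file and directory names, and it requires the Hadoop client available on the master node:

```shell
# Push each local copy into the HDFS retail directory.
for f in retail_store_5m_*.csv; do
  hadoop fs -put "$f" retail/
done

# Spot-check the last few uploaded files.
hadoop fs -ls retail | tail
```

In a scheduled batch pipeline, the same loop body would sit inside a cron job or workflow-scheduler task.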
We can enter the shell by typing hive as we have done earlier. We'll first create the futurex database. Let's fix the typo, and then USE futurex. Let's now create the table retail_cust, which will point to the retail directory in HDFS; it has columns matching the fields of the file. Let's create this table. Once it is created, we can query it to get the data. Let's first fire a simple SELECT query with LIMIT 10; otherwise 1 billion records would get fetched and displayed. This is a SELECT without any aggregation, so there will be no reducer tasks; only mapper tasks should be created. We can see that 40 tasks have been created, their status can be seen during processing, and we are able to fetch the records. Now let's try a slightly more complex query: from this data we'll compute the average salary for each country and store the resulting output in another table. Let's first create an HDFS directory called country where the output will be stored: we'll go to the other shell and create it under the user directory. Next, at the Hive prompt, we'll create a table country_average_salary containing two fields, country as string and average_salary as double, with its location set to the new directory we created. Currently there is nothing in that directory, so a SELECT * from that table returns no records, as expected. Next we'll read data from the retail_cust table, which points to the retail directory in HDFS, do some processing in Hive, and store the result in the country_average_salary table, which points to the country directory in HDFS. We can do that with a simple SQL query: INSERT INTO country_average_salary, then SELECT country and the average salary FROM retail_cust GROUP BY country. The resulting output lands in the new table. So let's fire this command.
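The aggregation step just described can be sketched as one HiveQL session. Table names follow the lab; `<your-user>` in the location path is a placeholder, and the statements are meant for the Hive shell on the cluster:

```shell
hive <<'EOF'
USE futurex;

-- Output table backed by the new HDFS "country" directory.
CREATE TABLE IF NOT EXISTS country_average_salary (
  country STRING, average_salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/<your-user>/country';

-- Aggregate a billion rows: mappers scan the files, reducers average.
INSERT INTO TABLE country_average_salary
SELECT country, AVG(salary)
FROM retail_cust
GROUP BY country;
EOF
```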
Since this query has an aggregation (we are calculating the average salary), reducer tasks will be created along with the mapper tasks. We can see 40 mapper and 1,000 reducer tasks created for this particular query. First all the mapper tasks execute, emitting key-value pairs of country and salary and writing them to disk; then the reducer tasks kick in, reading the mapper output from disk on each node and doing the average calculation. Typically, in a production setup, multiple jobs would be running, and as an administrator you can specify how many mappers and reducers a particular job gets depending on its priority: allocate more mappers and reducers to finish a job quicker. Now the reducers have kicked in, and since there are many reducer tasks, the job runs much faster. The reducers do the average calculation. Now the reducers have completed and the processing is done. Let's first check the destination HDFS directory. We can see multiple files under it. Let's cat the content of one file: it contains one record, and we can look at the other files similarly. We can also change a setting to write everything to a single file. Now let's go to the Hive prompt and select data from the country_average_salary table. One mapper starts, and we get the output. You can also put the Hive commands in a SQL file and execute it from the command line. Let's create a new file using the nano text editor and call it futurex.sql. We'll put in the same script we used in the Hive console: USE futurex, then the query. Ctrl+X and Y to save, then hit Enter. Now we have a SQL script containing the commands we used at the Hive prompt. To execute it, we run hive -f and specify the file name. This is how you can execute a Hive script from the command line.
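Creating and running such a script can be sketched like this. The query matches the one used in the lab, and the `hive -f` call is guarded so the sketch also runs on a machine without the Hive client installed:

```shell
cd "$(mktemp -d)"

# Write the HiveQL script (same statements as entered in the console).
cat > futurex.sql <<'EOF'
USE futurex;
INSERT INTO TABLE country_average_salary
SELECT country, AVG(salary) FROM retail_cust GROUP BY country;
EOF

# Execute it non-interactively; skipped when no Hive client is present.
if command -v hive >/dev/null 2>&1; then
  hive -f futurex.sql
else
  echo "hive client not found; script saved as futurex.sql"
fi
```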
This will again do the same operation and insert the same set of records into the destination directory. In a real-world scenario, you can create directories based on year, month, and day so that the data is partitioned by date, and when you do an incremental load you can point to a particular directory so that duplicate data does not get loaded into the destination.

13. Cloudera QuickStart VM on GCP: Welcome back. In this lab we'll understand how to install the Cloudera QuickStart VM on Google Cloud Platform. Let's log in to the GCP console and search for the virtual machine service: select Compute Engine, the virtual machine service on the GCP platform, and create an instance. Give the instance a name. Now let's select a larger machine type, the maximum we can get under the GCP free trial, and let's also increase the disk size; leave the Linux image as-is. Let's allow HTTP and HTTPS traffic to this instance. We'll also ensure all ports are open to this particular instance: go to the instance details, and you can add a new firewall rule or modify one of the existing rules to open all ports for this instance. With the ports open, once we install CDH we can easily access Hue and Cloudera Manager on this instance. Let's now log into the instance using the SSH link, and we'll install Docker on this machine. Even if you don't know anything about Docker, just follow the setup commands you see here and you'll have Docker installed on your VM; then, using Docker, you can install the Cloudera QuickStart VM easily. The key thing we're trying to achieve here is a Hadoop environment on the GCP platform that you can use within the free tier limit. Once Docker is installed, you can check the version and run a hello-world program to validate that everything is fine.
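The Docker setup and QuickStart launch can be sketched as follows. This assumes a Debian/Ubuntu VM; the `cloudera/quickstart` image name, the port mappings (8888 for Hue, 7180 for Cloudera Manager), and the startup script path follow Cloudera's published Docker instructions:

```shell
# Install Docker via the convenience script and verify it works.
curl -fsSL https://get.docker.com | sudo sh
sudo docker --version
sudo docker run hello-world

# Pull and start the Cloudera QuickStart container.
sudo docker pull cloudera/quickstart:latest
sudo docker run --hostname=quickstart.cloudera --privileged=true -t -i \
  -p 8888:8888 -p 7180:7180 \
  cloudera/quickstart /usr/bin/docker-quickstart
```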
This looks good: docker run hello-world prints "Hello from Docker", which confirms Docker is installed. Switch to the root user, and with the docker command you can check whether Docker is running or not; then just fire the pull command to get the QuickStart image, as simple as that. You can set up the Cloudera QuickStart VM on your local machine too, but for that you would need at least 8 GB of RAM. Now you can see all the images in this Docker instance, including the image for the Cloudera QuickStart VM, and this is the command to start it. 8888 is the port for the Hue user interface and 7180 is the port for Cloudera Manager. All these commands will be available in the resources section. Using the public IP address and port 8888, you can reach the Hadoop user interface, the Hue home page; then log in to Hue with cloudera/cloudera, the default user id and password. This is the Hue home page. From here you can access the different tools available to interact with the Hadoop cluster. Let's go to the Hive editor: here we can execute queries, see the list of databases, and see the tables inside each database. You can also get into Hive from your command prompt: on the SSH instance you can open the hive prompt and execute the same Hive commands there. Initially there is only the default database, so let's create another database called futurex_course_db, and make sure you select it with USE before creating tables. We can see the database got created. Go back to the previous page and refresh the database list, and you'll see the futurex_course_db database. So you can either use the Hive editor or the command prompt, depending on what kind of access you have; if you are just getting started with Hadoop and Hive, it's better to use the Hue query editor. Let's create a table and insert some records. There is nothing there initially.
First we run USE futurex_course_db so that the new table gets created under the futurex_course_db database. Let's create the course table and insert some values. I have switched windows just to demonstrate that you can use either of the two: you can do all your operations either in the Hive editor or in the console. Let's fetch the records we inserted, and here we can see them. So, as you can see, it's really easy to set up the Cloudera QuickStart VM on GCP and get started with Hadoop and Hive programming, and also Spark programming, which we'll see shortly. You can click on the manage jobs link and see all the jobs running on your platform; Hive queries get converted to MapReduce jobs, so you see those jobs in the console. You can interact with the HDFS file system as well: hadoop fs is the command for interacting with HDFS. Let's create a directory using the -mkdir command. We have a single-node cluster here; if you want, you can build a multi-node cluster on GCP as well, but for training and practice a one-node cluster is good enough. You can also access the HDFS folders from the file browser in Hue, so you do not have to worry about firing HDFS commands: you can simply create files, upload files, and move files around. It's really easy to work with the Hadoop user interface. By default, cloudera is the default user, and there are other users too; you can create more users. Let's switch to the right user and create a directory under the cloudera user folder called futurex-data. You can go to the Hue interface and view it there, or view it from here as well.

14. Bonus Running Spark2 on Cloudera: Let's now enter the Spark shell. Here we can do Spark programming using Scala. As you can see, the version is currently 1.6.0; that is the default version.
Let's see how to upgrade the Spark version. We'll first remove the default JDK, which is 1.7. All these commands are available for you to execute. Then we'll install OpenJDK 1.8; as you can see, Java has been upgraded to 1.8. Let's set JAVA_HOME, and we'll install wget, which we'll use to copy a file over to the VM instance. Let's go to the Spark download page and pick any of the stable versions; Spark 2.4.3 is good enough, so we'll select 2.4.3 built for Hadoop 2.6 and copy the link to that Spark installer for the VM instance. Let's install wget. It looks like it didn't get installed, so let's try the download again; this time we're able to pull the file. The archive has been copied to the VM instance. Let's untar it now. We can move the Spark folder to /usr/local/spark, so that will be our new Spark home. Go to that directory and you can see all the Spark files. Next, to make this the default Spark, use the nano text editor (you can install it easily), go to the /usr/bin folder, and find the default Spark launcher scripts. Modify them to point to the new Spark we just installed: instead of /usr/lib/spark, point them to /usr/local/spark, where we installed the latest 2.4.3 version of Spark. Make that change in the other Spark scripts too; modify all of them to use /usr/local/spark. Now let's check the version: we can see Spark 2.4.3. We created a DataFrame to check that everything is fine, and it looks good. You have upgraded to Spark 2.4.3 on this Cloudera QuickStart VM. Next we'll run Spark with Hive. Let's first get rid of the temporary Hive metastore folder; sometimes it creates issues.
After that we can log into the Spark shell and try to fetch data from the Hive table using Spark. We'll fetch data from the course table we created earlier, store it in a Spark DataFrame, and then display it using .show(). So the Spark and Hive interaction works fine on the Cloudera QuickStart VM. Next, we'll create a DataFrame and write it to Hive: we create a temp view using Spark and then use it to populate the Hive table. You can see the table under the Hive databases; we can go to the Hive prompt and fetch data from the new Hive table we created using Spark. Thank you.
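The Spark-to-Hive round trip demonstrated in this final lesson can be sketched as a single spark-shell session. This is a hedged outline: the table and database names are the assumed lab names, and it relies on the Hive-enabled `spark` session that spark-shell provides in Spark 2.x:

```shell
spark-shell <<'EOF'
// Read an existing Hive table into a DataFrame and display it.
val df = spark.sql("SELECT * FROM futurex_course_db.futurex_course")
df.show()

// Write a DataFrame back to Hive via a temporary view.
val out = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
out.createOrReplaceTempView("tmp_people")
spark.sql(
  "CREATE TABLE futurex_course_db.people AS SELECT * FROM tmp_people")
EOF
```

After this runs, the new table is visible from the Hive prompt as well, since both tools share the same metastore.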