Big Data and Hadoop for Beginners - with Hands-on! | Andalib Ansari | Skillshare

Big Data and Hadoop for Beginners - with Hands-on!

Andalib Ansari, Big Data Consultant

27 Lessons (2h 35m)
  • 1. Course Overview (2:21)
  • 2. Introduction to Big Data (9:23)
  • 3. Big Data Job Roles (6:30)
  • 4. Big Data Salaries (2:55)
  • 5. Technology Trends in the Market (6:30)
  • 6. Advice for Big Data Beginners (2:44)
  • 7. Introduction to Hadoop (8:23)
  • 8. Hadoop Ecosystem (5:01)
  • 9. Hadoop 1.x vs Hadoop 2.x (14:13)
  • 10. ETL vs ELT (3:19)
  • 11. Hadoop Vendors (4:20)
  • 12. Managing HDFS from Command Line (9:09)
  • 13. Introduction to Hive (2:41)
  • 14. Hive Architecture (2:28)
  • 15. File Formats in Hive (4:40)
  • 16. SQL vs HQL (3:46)
  • 17. UDF & UDAF in Hive (2:57)
  • 18. Hive Demo (18:50)
  • 19. Introduction to Pig (2:57)
  • 20. Pig Architecture (1:39)
  • 21. Pig Data Model (2:17)
  • 22. How Pig Latin Works (2:57)
  • 23. SQL vs Pig (5:32)
  • 24. UDF in Pig (3:25)
  • 25. Pig Demo (12:49)
  • 26. Designing Data Pipeline using Pig and Hive (7:59)
  • 27. Data Lake (5:24)

265 Students

About This Class

The main objective of this course is to help you understand the complex architectures of Hadoop and its components, point you in the right direction to start, and get you working with Hadoop and its components quickly.

It covers everything you need as a Big Data beginner. Learn about the Big Data market, different job roles, technology trends, the history of Hadoop, HDFS, the Hadoop ecosystem, Hive, and Pig. In this course we will see how a beginner should start with Hadoop, and it comes with plenty of hands-on examples that will help you learn Hadoop quickly.

The course has six sections and focuses on the following topics:

Big Data at a Glance: Learn about Big Data and the different job roles required in the Big Data market. Know the Big Data salary trends around the globe. Learn about the hottest technologies and their trends in the market.

Getting Started with Hadoop: Understand Hadoop and its complex architecture. Learn the Hadoop ecosystem with simple examples. Know the different versions of Hadoop (Hadoop 1.x vs Hadoop 2.x), the different Hadoop vendors in the market, and Hadoop on the cloud. Understand how Hadoop uses the ELT approach. Learn how to install Hadoop on your machine, and run HDFS commands from the command line to manage HDFS.

Getting Started with Hive: Understand what kind of problems Hive solves in Big Data. Learn its architectural design and working mechanism. Know the data models in Hive, the different file formats Hive supports, Hive queries, and more. We will run queries in Hive.

Getting Started with Pig: Understand how Pig solves problems in Big Data. Learn its architectural design and working mechanism, and understand how Pig Latin works in Pig. You will understand the differences between SQL and Pig Latin. Demos cover running different queries in Pig.

Use Cases: Real-life applications of Hadoop are really important for understanding Hadoop and its components, so we will learn by designing a sample data pipeline in Hadoop to process big data. You will also understand how companies are adopting a modern data architecture, i.e. the Data Lake, in their data infrastructure.

Practice: Practice with huge data sets. Learn design and optimization techniques by building data models and data pipelines with data sets from real-life applications.

Meet Your Teacher

Andalib Ansari

Big Data Consultant

Andalib Ansari is a Big Data consultant based out of Mumbai. He helps companies and people solve business problems using Big Data technologies, and he is also passionate about guiding and training people on different Big Data tools and technologies.

He has solid exposure to Big Data tools and technologies and has worked with various clients, including top-level Mobile Network Operators (MNOs) from Latin America and the US, to solve different business problems for different use cases, and has designed optimized data pipelines using Big Data technologies on the cloud.

Transcripts

1. Course Overview: Are you excited to learn about Big Data and Hadoop? Do you want to know how to design data pipelines in Hadoop to process big data? Do you want to become a Big Data expert and get some exciting opportunities to work on? Do you feel the internet is overloaded with lots of content, and you often get confused about where to begin? All right, you've landed at the right place. Here in this course I'm going to give you a detailed introduction to Big Data and its market, so that you can easily understand the technology trends and the different job roles required in the Big Data market. This course is designed to teach all the fundamentals of Hadoop so you can quickly start working with Hadoop, Hive, and Pig. I will go deep into Hadoop, where you will learn about the history of Hadoop, its complex architectures, the ecosystem, different versions of Hadoop, setting up a Hadoop environment on your machine, different vendors in the market, and Hadoop on the cloud. If you look at the companies that are using Hadoop, most of them are also using Hive and Pig in production; Hive and Pig are components of Hadoop. So in this course I have also covered Hive and Pig, where you will learn about their architectures, their working mechanisms, demos, and designing data pipelines using them. For your better understanding I have also included assignments and use cases, through which you will learn about real-life applications of Hadoop and its components. Hi, this is me, Andalib Ansari, your instructor for this course. I'm a Big Data consultant and I have worked on various projects spread across Latin America, the US, and India, in telecom and e-commerce. While creating this course I have mostly focused on content presentation and visuals, so that you can easily understand the complex architectures of Hadoop and its components. At the end of this course, you will be able to understand the real challenges of Big Data, understand Hadoop and how its architecture works, work with Hadoop, Hive, and Pig, design data pipelines, and start your own POCs on Hadoop, Hive, and Pig if you wish. This course will also help you prepare for the certification exams from Hortonworks and Cloudera. Thanks for watching. See you in the course.

2. Introduction to Big Data: Hello guys, welcome to the course. In this lecture we're going to learn about the fundamentals of Big Data. So the question is: what is big data? Well, if you look around, there are many ways data is being generated: tweets generated by millions of users on Twitter, Facebook posts by billions of users, YouTube videos uploaded every minute, and sensors generating data — a Boeing generates terabytes of data in a single flight. These data can be termed big data. If you look at these data, they are very complex to analyze and store. Why? Because they are mostly in semi-structured or unstructured form, which makes it difficult to extract information and business insights. So the question is: why is it difficult to extract information? The answer is simple: because they cannot be processed with traditional systems. So what are these traditional systems? Traditional systems include relational databases like MySQL, Oracle, SQL Server, etc. These databases only store structured data; they cannot store the semi-structured or unstructured data generated by social media sites or sensors. So to store, process, and analyze this big data, we need the right combination of tools and technology. This is where Hadoop comes into the picture. We'll learn about Hadoop in the coming lectures.
Let's have a look at what structured, semi-structured, and unstructured data look like. Any data you see in an Excel file, or data stored in database tables, is structured data. For semi-structured data, an XML file is a good example. Logs generated by servers are unstructured data. Okay, big data is often described using five Vs: volume, velocity, variety, veracity, and value. Let's have a look at these terms individually. Volume refers to the vast amounts of data generated every second. Just think of all the emails, Twitter messages, photos, video clips, and sensor data we produce and share every second; they are not in terabytes but in zettabytes or even higher amounts. This increasingly makes data sets too large to store and analyze using traditional database technology. With big data technology, we can now store and analyze these data with the help of distributed systems, which make computing easier and faster. Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of all the social media messages going viral in seconds, or the speed at which credit card transactions are checked for fraudulent activity. With big data technology, we can now analyze data while it is being generated, without ever putting it into databases. Variety refers to the different types of data we can now use. In the past we focused on structured data that fits neatly into tables or relational databases, such as financial data — sales by product or region. In fact, 80% of the world's data is now unstructured and therefore can't easily be put into tables; just think of photos, video sequences, or social media updates. With big data technology, we can now store, process, and analyze structured, semi-structured, and unstructured data. Veracity refers to the accuracy or truthfulness of data. In every analytics exercise, 40 to 60 percent of the time is spent on data preparation: removing duplicates, fixing partial entries, eliminating null or blank entries, etc. With many forms of big data, quality and accuracy are less controllable — just think of Twitter posts with hashtags — but big data and analytics technology now allows us to work with these types of data. Value: I think value is the most important part when looking at big data. It is all well and good having access to big data, but unless we can turn it into value, it is useless. If you look, 70% of big data projects fail simply because of a lack of use cases and understanding; it is really important for businesses to define use cases before jumping in and starting to collect and store big data. Now the question is: why is big data important? Let's consider a use case. Imagine you're running an e-commerce business and you have a website where you are selling your products. You're capturing very few metrics and no clickstream data from which you could see what is happening. You have millions of website visits: people are coming to your website, browsing the products, adding products to the cart, and making payments. Now let's assume that at a certain moment there are 100 people on the payment page, and out of them 70 were able to make a successful payment while 30 hit some technical issue and were not able to pay. Those 30 people left your website and went to other websites.
Since you're not capturing clickstream data, you will not be able to analyze what problems people are facing on your website, and you will have no idea where those people have gone. You have high website traffic but low conversion. So it's really important to capture as much data as possible from your business, so that you are able to analyze the issues and improve your services. In this case, capturing the clickstream data generated by users on your website is really very important for your business. Okay, I chose this example to help you visualize how different companies are capturing and processing big data in production. InMobi is a mobile ad targeting company that helps businesses improve customer engagement through its mobile customer engagement platform. InMobi serves billions of ads daily to more than 759 million users across 160-plus countries. Just look at the size of the data they are capturing and processing every day, and the number of Hadoop jobs they are running to get insights out of that data. Now let's understand how companies are monetizing big data. Telecom companies are doing business with banks to detect fraud by triangulating location, purchase details, and transaction information; with retail outlets to personalize offers in real time and push them via mobile channels for effectiveness; with travel firms for better targeted marketing based on customers' travel preferences; with social networks to identify true network nodes using complete network information; and with app developers to get insightful information into what kinds of apps are being preferred and why. Credit card companies are doing business with e-commerce firms to help design better real-time offers around payment options; with retail outlets to improve traceability by mapping cardholder information; and with travel firms, using location data to track customer routes. Retailers are doing business with CPG firms, where purchase patterns help CPG firms better design on-demand supply and get tremendous insight into what their joint customers prefer; and with credit card companies to identify customer share of wallet and spending patterns. Okay, that's all for this lecture. See you in the next one. Thank you.

3. Big Data Job Roles: Hello guys, welcome back. In this lecture we're going to learn about the job roles required in a big data career. Well, there are various roles that come into play when we talk about big data: big data analyst, Hadoop administrator, big data engineer, big data scientist, big data manager, big data solutions architect, and chief data officer. Let's have a look at each of them one by one. A big data analyst is one who works with data in a given system and performs analysis on that data set. They generally work with data scientists to perform the necessary jobs. The primary skills required to become a big data analyst are good exposure to different BI tools like Tableau, QlikView, etc., and good programming skills in languages like SQL, R, Java, or Python. Apart from these skills, one should have good working knowledge of the Hadoop framework, such as MapReduce, Hive, Pig, etc. As a big data analyst, one should know what drives an organization, its key performance indicators, and how the available organizational data can contribute to making critical business decisions.
A Hadoop administrator role comes with great responsibility, since it involves maintaining the Hadoop cluster and keeping the cluster up with the least downtime. One should have a good understanding of Linux and bash scripting; a good understanding of networks, hardware, CPU, and memory; and the Hadoop architecture, such as HDFS, Hive, Pig, and HBase. Administrators should be able to deploy a Hadoop cluster, add and remove nodes, keep track of jobs, monitor the critical parts of the cluster, configure NameNode high availability, schedule and configure backups, troubleshoot issues, handle hardware configurations like racks, disk topology, RAID, etc., manage data backup and recovery, do performance monitoring and tuning, and run patches and upgrades. A big data engineer builds what the big data solutions architect has designed. Big data engineers develop, maintain, test, and evaluate big data solutions within organizations. Most of the time they are also involved in the design of the big data solutions, because of the experience they have with Hadoop-based technologies such as MapReduce, Hive, and Pig. A big data engineer builds large-scale data processing systems, is an expert in data warehousing solutions, and should be able to work with the latest NoSQL technologies. On the skills side, they should have a good understanding of data warehousing, ETL, and business intelligence; and on the Hadoop side: Hadoop, Hive, Pig, HDFS, NoSQL databases like MongoDB and Cassandra, experience working with cloud environments, and good familiarity with building large-scale data processing systems using Hadoop solutions. Big data scientist is said to be the sexiest job of the 21st century. Successful big data scientists will be in high demand and will be able to earn very nice salaries. But to be successful, big data scientists need a wide range of skills that until now did not even fit into one department. As a big data scientist, you should have a good understanding of machine learning, predictive modeling, statistical analysis, natural language processing, Hadoop, MapReduce, Hive, Pig, NoSQL databases like MongoDB and Cassandra, and programming languages like Python, R, Java, Clojure, etc. A big data manager is the middleman between the technical team members and the strategic management of an organization; therefore the big data manager needs to understand both sides of the coin. Ideally the big data manager has an IT background with strategic experience, so they should have excellent communication skills, experience in handling a big data team, and good exposure to machine learning, predictive modeling, statistical analysis, and the Hadoop framework. They should have good knowledge of HDFS, MapReduce, Hive, Pig, NoSQL databases like MongoDB and Cassandra, and programming languages like Python, R, Java, etc. To become a big data solutions architect, you should have good exposure to designing large-scale data systems; good exposure to the Hadoop ecosystem, like Hadoop, Hive, Pig, Mahout, Oozie, ZooKeeper, Sqoop, etc.; NoSQL databases like MongoDB, Cassandra, etc.; RDBMSs; data warehousing and ETL tools like Pentaho, Informatica, or Talend; Python, Java, R; and the cloud. These are the main skills required for a big data solutions architect.
To become a chief data officer, you should have good exposure to data governance and data quality; expertise in creating and deploying best practices and methodologies across the organization; familiarity with the major big data solutions and products available in the market; knowledge about building and supporting big data teams across the organization; and good exposure to machine learning, statistical analysis, predictive modeling, developing business use cases, etc. Okay, that's all for this lecture. Thank you, guys.

4. Big Data Salaries: Welcome back. In the previous lecture we learned about the different job roles in the big data market; in this lecture we'll learn how big data professionals are being paid. Here is a story from the Wall Street Journal. Tom Davenport, who teaches an executive program in big data and analytics, said some data scientists are earning annual salaries as high as $300,000, which is pretty good for somebody who doesn't have anyone else working for them. Davenport also said such workers are motivated by the problems and opportunities data provides. The big data job market is an extremely competitive one. Indeed.com is one of the largest job search portals around the globe, so let's have a look at some salary trends for big data professionals on indeed.com. Look at the salary trend for a big data engineer in San Francisco: the average salary is about $153,000, and it is increasing with time. For a big data scientist it is about $167,000, also increasing with time. For a big data analyst it is about $160,000, and for a solutions architect it is around $212,000. In New York it is about $189,000. So salaries for big data professionals are very high in the market, and they vary from place to place. Based on my research, this is how the salaries of big data professionals vary with experience: for a big data analyst, the salary could be anywhere between $50,000 and $110,000; for a big data scientist, anywhere between $85,000 and $170,000, based on your experience; for a big data manager, between $90,000 and $240,000; and for a big data engineer, between $70,000 and $165,000. I hope you got some idea of how big data professionals are being paid in the market. That's all for this lecture. See you in the next one. Thank you.

5. Technology Trends in the Market: Hello guys, welcome back. In this lecture we're going to learn about technology trends in the market. Well, I believe if you live in the world of information technology, then you should know about the latest tools and technologies in the market. Knowing these will help you prepare accordingly and survive in the market; it will also help you earn good money if you have the right skills at the right time. The graph I'm going to show you is a branded tool created by Gartner; Gartner is an IT research and consultancy company. The graph is often called Gartner's hype cycle, and it represents the life-cycle stages a technology goes through, from conception to maturity and widespread adoption. Okay, if you look at the graph, there are five parts; I'll go through them one by one. Innovation trigger: in this stage a technology is conceptualized. There may be prototypes, but there are often no functional products or market studies; the potential inspires media interest and sometimes proofs of concept. Peak of inflated expectations: the technology is implemented, especially by early adopters.
There is a lot of publicity about both successful and unsuccessful implementations. In this lecture my main intention is to show you the Internet of Things, popularly known as IoT. In this graph, the most hyped technology is the Internet of Things, and if you look on the web, the most searched topics are IoT and big data. According to International Data Corporation, the worldwide market for IoT solutions will grow from $1.9 trillion in 2013 to $7.1 trillion in 2020. IDC estimates that as of the end of 2013 there were 9.1 billion IoT units installed, and it expects the installed base of IoT units to grow to 28.1 billion units by 2020. So the question is: what is the Internet of Things? Well, the Internet of Things is a scenario in which objects, animals, or people are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction. A "thing" in the Internet of Things can be a person with a heart monitor implant, a farm animal with a biochip transponder, an automobile with built-in sensors to alert the driver when tire pressure is low, or any other natural or man-made object that can be assigned an IP address and given the ability to transfer data over a network. Let's see some applications of IoT: a fridge signaling "no milk" to a grocery store, or getting an automatic text message when you enter the grocery store — these could be applications on the consumer side — plus connected cars, smart cities, and smart malls. On the business side of IoT applications, we could analyze log data to resolve support issues and unravel new revenue opportunities. For example, General Electric, one of the world's largest manufacturers, is using big data analytics to predict maintenance needs. GE manufactures jet engines, turbines, and medical scanners. It uses operational data from sensors on its machinery and engines for pattern analysis, and it uses that analysis to provide services tied to its products, designed to minimize downtime caused by parts failures. Real-time analytics also enables machines to adapt continually and improve efficiency. The airline industry spends $200 billion on fuel per year, so a two percent saving is $4 billion; GE provides software that enables airline pilots to manage fuel efficiency. So now you can imagine the amount of data that is going to be generated by these IoT devices, and the demand for big data engineers in the market in the coming future. Okay, coming back to the graph. The third part is the trough of disillusionment: flaws and failures lead to some disappointment in the technology; some producers are unsuccessful or drop their products, and continued investment by other producers is contingent on addressing problems successfully. If you look at the graph, big data sits between the peak of inflated expectations and the trough of disillusionment. Why? Because big data technology is still not mature; there are a lot of features which need to be added to the technology — we'll see this in the coming lectures. But yes, the big data community is very strong, people are contributing a lot, and a lot of improvements and announcements come out every day. Slope of enlightenment: the technology's potential for further applications becomes more broadly understood, and an increasing number of companies implement or test it in their environments; some producers create further generations of products.
Plateau of productivity: the technology becomes widely implemented, and its place in the market and its applications are well understood. Standards arise for evaluating technology providers. All right, that's all for this lecture. See you in the next one. Thank you.

6. Advice for Big Data Beginners: Welcome back. My main intention in including this lecture is to guide you, as a big data beginner, on what things you should know, what habits you should build, and what skills you should have before jumping into the big data field, and how you should proceed. I'm going to answer all these questions. On the habits part: you should attend as many meetups as you can. If you're not on Meetup, go and sign up, and join as many big data groups as you can near your area. Attend conferences on big data. The main benefits are that you'll find and meet people with the same interests, and there you can have a good amount of knowledge sharing. Start following big data news on online channels like TechCrunch, Prismatic, VentureBeat, etc. You should also start reading the engineering blogs of companies that are using big data. Trust me, if you build these habits, they are going to add value to your big data skills, and in the long run they will repay you in your big data career. As a big data developer, you'll spend most of your time in data preparation, and since this course is intended to teach you about big data technologies and how you can process big data using Hadoop and its components, having basic skills in relational data stores, ETL, BI, and data warehousing will be a plus for you. Nowadays, in most companies you'll find there is a shift in analytical platforms: there is a migration from traditional data stores to Hadoop to handle large volumes of data, and most analytical tasks are being performed using these big data tools and technologies. So once you learn Hadoop and its components, you should start playing with them. Pick any small data set you like and play with HDFS, Hive, and Pig. Take any big data use case and try to achieve it using Hive and Pig. I believe that before playing with huge volumes of data, you should first set up the data pipelines and see the data flow with simple data sets. Once you have set up all the data pipelines, repeat the same task with huge volumes of data. Try to implement the data processing techniques you learn in this course. In some cases you might need to iterate your task with different configurations in Hadoop to achieve an optimal solution; in that case I would suggest you always benchmark your iterations to find the best solution. That's all for this lecture.

7. Introduction to Hadoop: Welcome back. In this lecture we're going to learn about the history and fundamentals of Hadoop. Hadoop is named after a toy elephant belonging to the son of its developer, Doug Cutting. The original project that would become Hadoop was web indexing software called Nutch. Google released two white papers, namely the Google File System and MapReduce, in 2003 and 2004 respectively. Nutch's developers used these papers to build a processing framework that relied on tens of computers rather than a single machine, with the goal of building its web search infrastructure. Yahoo used Nutch's storage and processing ideas to form the backbone of Hadoop. In its earliest implementations at Yahoo, Hadoop ran on only 5 to 20 nodes.
Yahoo's decision to set up a research grid for its data scientists helped the research team gradually scale Hadoop clusters up from dozens to hundreds of nodes. By 2008, Yahoo was ready to debut Hadoop as the engine of its web search: using a Hadoop cluster with around 10,000 nodes, the company was able to increase its search speed. In 2011, Yahoo was running its search engine across 42,000 nodes. With many more players involved in the open-source project than in its early days, Hadoop continues to evolve and branch off in new directions. So the question is: what is Hadoop? According to Apache, Hadoop is open-source software that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer. So, to understand Hadoop in very simple terms, you have to understand two fundamental things about it: how Hadoop stores files, that is HDFS, and how it processes data, that is MapReduce. HDFS is the storage engine of Hadoop, where you can store files of any size, ranging from MBs to TBs or even higher, depending on your configuration and business needs. It also lets you store as many files as you want, and it stores the files in a distributed fashion, spreading them across multiple machines; we'll see that shortly. MapReduce is the data processing engine of Hadoop, which processes data sitting in HDFS. In the traditional data processing approach, data is moved over a network and processed by software; moving data over the network can be very, very slow, especially for really large data sets. Hadoop uses a smarter approach: rather than moving data to the code, it moves the processing code to the data sitting on the distributed machines. The code processes the data there and only returns the results. Now you can imagine the amount of network latency being saved here — this is the beauty of Hadoop. So we have seen that HDFS and MapReduce are made for each other to solve problems in processing data at very large scale, where HDFS provides a file system and MapReduce provides a distributed data processing framework.

One thing I would like to remind you of again: this course is based entirely on the Hadoop 2.2 version, so whatever is covered here relates to the Hadoop 2.x version. But yes, wherever I felt you should know about the older version, I have covered that in this course as well. In the coming lecture we'll see the differences between the Hadoop 1.x and Hadoop 2.x versions. Coming back to the topic: understanding MapReduce. If you look at MapReduce at a very high level, there are two parts, map and reduce. Typically, applications implement a map and a reduce method in Java by implementing the appropriate interface or abstract class. They also specify input and output locations and some configuration; the rest is taken care of by the framework. Now we will look at a word count problem, which is like the "hello world" program of MapReduce programming. If you look at the screenshot, I have created a text file which contains some lines. Using MapReduce programming, we will count the occurrences of the words "hello" and "world" in this text. Instead of looking at how the MapReduce framework will solve this problem, let's first understand how a Linux developer would solve it using bash scripts. There are two bash scripts here, mapper.sh and reducer.sh (a sketch of both follows below). I will feed my text file into mapper.sh using a pipe. The script mapper.sh will read it line by line and tokenize each line. If a token is "hello", it will print a key-value pair as "hello,1"; if the token is "world", it will print a key-value pair as "world,1". The script ignores all other tokens, as you can see in the screenshot. So the logic is quite simple: a while loop reads each line, a for loop tokenizes each word in the line, and each token is then examined; if it is "hello" or "world", an appropriate key-value pair is printed. If you try this on your own, don't forget to give execute permission to your mapper.sh and reducer.sh scripts using the chmod command, as shown on the screen. Let's look at reducer.sh. I will feed the output of mapper.sh into reducer.sh through a pipe. The reducer will examine each key-value pair produced by the mapper and will simply count how many times it found the first pair, "hello,1", and how many times it found the second pair, "world,1". Finally it will print my desired output, as shown on the screen. So that's what MapReduce is at a very high level: the MapReduce framework feeds input data to the mapper; the mapper method, developed by a programmer, knows what to do with this data, so it processes the data and generates key-value pairs, which are given back to the framework. The framework performs a shuffle and sort operation on all the key-value pairs generated from the various nodes across the cluster, and then feeds these key-value pairs to the reducer. The reducer is again a method written by a programmer, and it knows what to do with these key-value pairs; the reducer performs the reduce operation to generate the final result. Okay, that's all for this lecture. See you in the next one. Thank you.
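The two scripts above appear only as screenshots in the video. As a rough sketch following the logic described in the lecture (the file names, sample paths, and the Hadoop Streaming jar location are illustrative assumptions, not taken from the course), mapper.sh and reducer.sh could look like this:

    #!/bin/bash
    # mapper.sh - read lines from stdin, emit "hello,1" / "world,1" for matching tokens
    while read -r line; do
      for token in $line; do
        if [ "$token" = "hello" ]; then
          echo "hello,1"
        elif [ "$token" = "world" ]; then
          echo "world,1"
        fi
      done
    done

    #!/bin/bash
    # reducer.sh - count how many times each key-value pair appears on stdin
    hello_count=0
    world_count=0
    while read -r pair; do
      case "$pair" in
        "hello,1") hello_count=$((hello_count + 1)) ;;
        "world,1") world_count=$((world_count + 1)) ;;
      esac
    done
    echo "hello ${hello_count}"
    echo "world ${world_count}"

You can test the pair locally with a plain pipe, and the same scripts can also be run on a cluster through Hadoop Streaming (the streaming jar path varies by distribution, so treat it as a placeholder):

    chmod +x mapper.sh reducer.sh
    # local test: mapper -> sort (stands in for shuffle/sort) -> reducer
    cat input.txt | ./mapper.sh | sort | ./reducer.sh
    # same logic on the cluster via Hadoop Streaming
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
      -input /user/sample/input.txt -output /user/sample/wordcount_out \
      -mapper mapper.sh -reducer reducer.sh -file mapper.sh -file reducer.sh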
8. Hadoop Ecosystem: Hello, everyone. Welcome back. In the previous lecture we learned about the fundamentals of Hadoop; in this lecture we're going to learn about the Hadoop ecosystem. The Hadoop platform consists of two key services: a reliable distributed file system called the Hadoop Distributed File System, that is HDFS, and the high-performance parallel data processing engine called Hadoop MapReduce, both of which we covered in the previous lecture. If you look at the Hadoop ecosystem, there are several tools available to address particular needs, like Hive, Pig, Mahout, Oozie, Sqoop, etc. These tools are called components of the Hadoop ecosystem, and they provide a means to access and process data sitting in HDFS. Let me give you an example to help you visualize the Hadoop ecosystem. Look at this picture: what do you see? A smartphone with many apps installed on it, right? Let us consider this smartphone as the Hadoop ecosystem, and its apps as nothing but the components of the Hadoop ecosystem. Consider your phone memory as HDFS. You have photos and videos on your phone, and you can share those photos or videos using apps like Facebook, LinkedIn, Twitter, etc. It means you are accessing the phone's data using these apps, right? In the same way, the components of Hadoop can access and process data residing in HDFS. Each component of the Hadoop ecosystem is designed to meet certain business needs. Let's have a look at them one by one. Hive is like a data warehouse built on top of Hadoop. Instead of writing complex MapReduce code in Java or another language, Hive uses a SQL-based query language to interact with data sitting in Hadoop. Pig is a data flow language which uses Pig Latin scripts to interact with data sitting in Hadoop. It also hides the complexity of writing MapReduce code in programming languages like Java. Pig Latin is similar to SQL; you can write Pig scripts to process big data. Sqoop stands for "SQL to Hadoop"; basically, Sqoop is a tool used to transfer data from an RDBMS to HDFS and vice versa. Oozie is a workflow application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. It is integrated with the Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop. Oozie can also schedule jobs specific to a system, like Java programs or shell scripts. ZooKeeper provides operational services for a Hadoop cluster: a distributed configuration service, a synchronization service, and a naming registry for distributed systems. HBase is an open-source NoSQL database that provides real-time read/write access to those large data sets. Apache HBase scales linearly to handle huge data sets with billions of rows and millions of columns, and it easily combines data sources that use a wide variety of different structures and schemas. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS; for instance, Flume can be used to collect server logs and dump them into HDFS in real time. Mahout is a library of scalable machine-learning algorithms implemented on top of Hadoop using the MapReduce paradigm. Once big data is stored in HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets. Mahout supports four main data science use cases: collaborative filtering, clustering, classification, and frequent itemset mining. Okay, that's all for this lecture. See you in the next one. Thank you.

9. Hadoop 1.x vs Hadoop 2.x: Welcome back. In this lecture we're going to learn about the different versions of Hadoop, that is, Hadoop 1.x and Hadoop 2.x. Before we start, let's understand some basic Hadoop terms. A Hadoop client is a machine which is not part of the Hadoop cluster but has some configuration so that a user can submit a Hadoop job, which is then executed on the Hadoop cluster. Generally, client machines have Hadoop installed with all the cluster settings but are neither master nor slave. Instead, the role of the client machine is to load data into the cluster, submit MapReduce jobs describing how the data should be processed, and then retrieve or view the results of the job when it is finished. In smaller clusters, say 30 nodes, you may have a single physical server playing multiple roles, such as both JobTracker and NameNode; with medium to large clusters, you will have each role operating on its own server machine. In our previous lectures we learned about HDFS and MapReduce, and we saw how a MapReduce job is broken down into individual tasks called mapper and reducer. In Hadoop 1.x, the delegation of tasks is handled by two daemons called JobTracker and TaskTracker; a daemon is a long-lived process.
The JobTracker oversees how MapReduce jobs are split up into tasks and divided among the nodes within the cluster; the JobTracker resides on the name node. The TaskTracker accepts tasks from the JobTracker, performs the work, and alerts the JobTracker once it is done. TaskTrackers and DataNodes are located on the same nodes to improve performance. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode. The NameNode stores metadata about the data being stored on the DataNodes, whereas the DataNodes store the actual data. So the NameNode has information like which blocks are stored on which rack or on which DataNode, and other details. The NameNode runs on the master node, while DataNodes store data in HDFS. A functional file system has more than one DataNode, with data replicated across them. DataNode instances can talk to each other, which is what they do when they are replicating data; DataNodes run on slave nodes. So in simple terms, we can say Hadoop has a master-slave architecture, where the NameNode runs on the master node and the DataNodes run on slave nodes; the TaskTracker daemon is a slave to the JobTracker and the DataNode daemon is a slave to the NameNode.

Let's understand how Hadoop stores a file. Assume you have a file of 1 GB. Hadoop will break the file into blocks, depending on the block size you have decided on, and store them across DataNodes. Say you have configured a block size of 256 MB; it will break the file into four blocks and store them across different DataNodes. The NameNode holds the file's metadata, like which blocks are stored on which DataNodes. By the way, the block size is the smallest unit of data that a file system can store: the default block size in a UNIX system is 4 KB, whereas in Hadoop it is 64 MB. Now let's understand why Hadoop has a bigger block size. As we have seen in previous lectures, HDFS is meant to handle large files. Say you have a 1000 MB file in HDFS and you have configured the block size as 4 KB; you would have to make 256,000 requests to get that file — one request per block. In HDFS those requests go across the network and come with a lot of overhead. Each request has to be processed by the NameNode to figure out where that block can be found, which is a lot of traffic. If you use 64 MB blocks, the number of requests goes down to 16, which greatly reduces the overhead and the load on the NameNode. Now let's understand what data replication is. For high availability of data, Hadoop stores copies of the same blocks across different DataNodes and racks, so if at some point a DataNode goes down, the same data can be accessed through other DataNodes. By default, Hadoop stores three copies of each block across different DataNodes and racks; however, the replication factor can be increased or decreased depending on your business needs.
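As a small hands-on aside (not part of the original lecture), you can see blocks and replication for yourself with a few standard commands on a Hadoop 2.x cluster; the file and path names here are placeholders:

    # upload a file with a non-default block size (256 MB = 268435456 bytes)
    hadoop fs -D dfs.blocksize=268435456 -put bigfile.dat /user/sample/bigfile.dat

    # list the file's blocks, their replicas and locations, and overall health
    hdfs fsck /user/sample/bigfile.dat -files -blocks -locations

    # change the replication factor of this file to 2 and wait until it completes
    hadoop fs -setrep -w 2 /user/sample/bigfile.dat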
Let's discuss the secondary NameNode. As we have seen, the NameNode holds metadata like block information, rack information, DataNode information, etc., and all this information is stored in main memory; that's the reason it is called a single point of failure in a Hadoop cluster. Now let's understand in depth how the NameNode and secondary NameNode work. The NameNode also stores metadata in persistent storage, in the form of edit logs and an fsimage; you can see in the diagram how a NameNode stores this information. The fsimage is a snapshot of the file system when the NameNode started, whereas the edit logs are a sequence of changes made to the file system after the NameNode started. Only on a restart of the NameNode are the edit logs applied to the fsimage to get the latest snapshot of the file system. But NameNode restarts are very rare in production clusters, which means the edit logs can grow very large on clusters where the NameNode runs for a long period of time. In this case we can face situations like these: one, the edit logs become very large, which is challenging to manage; two, a NameNode restart takes a long time because a lot of changes have to be merged; three, if the NameNode goes down, we lose a huge amount of metadata, since the fsimage is very old. To overcome these issues, we need a mechanism that keeps the edit log size manageable and keeps an up-to-date fsimage, so that the load on the NameNode reduces. This is where the secondary NameNode comes into the picture. It's very similar to a Windows restore point, which allows us to take a snapshot of the system so that if something goes wrong, we can roll back to the last restore point. The secondary NameNode helps overcome these issues by taking over the responsibility of merging the edit logs with the fsimage from the NameNode: it gets the edit logs from the NameNode at regular intervals and applies them to the fsimage. Once it has a new fsimage, it copies it back to the NameNode, and the NameNode will use this fsimage for the next restart, which reduces the startup time. So we can say the secondary NameNode puts a checkpoint in the file system, which helps the NameNode function better.

Now coming back to the main topic: how the 1.x version is different from 2.x. There are various limitations you'll face when using the 1.x version: you can have only up to 4,000 nodes in a cluster; the JobTracker becomes a bottleneck, since it acts as resource manager, job scheduler, and monitor; there is only one NameNode to manage HDFS; the MapReduce slots are static; and you can only run MapReduce jobs — no custom job types are allowed. Now let's understand how a read request is processed in Hadoop 1.x. In a Hadoop cluster, the DataNodes keep sending heartbeats and block reports to the NameNode, so the NameNode knows which DataNodes are alive and which are down. When a Hadoop client requests a read operation on the cluster, the NameNode knows the locations of the blocks, so it returns the DataNode and block IDs needed to perform the read operation. Now let's see how a write operation is performed in Hadoop 1.x. When a Hadoop client requests a write operation, the NameNode returns the DataNode IDs, and then the Hadoop client performs the write. Data replication is done by the DataNodes themselves, and then they send block reports back to the NameNode. Meanwhile, the secondary NameNode keeps merging the edit logs to keep the fsimage up to date. Running a job in Hadoop 1.x: when a Hadoop job is submitted by a Hadoop client, the JobTracker and TaskTrackers take care of the job. The JobTracker oversees how the MapReduce job is split up into tasks, that is, mappers and reducers, and divided among the nodes within the cluster. The TaskTracker accepts tasks from the JobTracker, performs the work, and then alerts the JobTracker once it is done.
Various major improvements have been made in Hadoop 2.x. Hadoop 2.x now supports up to 10,000 nodes per cluster. It supports multiple NameNodes to manage HDFS, and it introduces YARN for efficient cluster utilization. YARN stands for Yet Another Resource Negotiator. Hadoop 2.x has the concept of containers, while Hadoop 1.x had slots: containers are generic and can run any type of task, but a slot can run either a map or a reduce task. Because of containers, multiple distributed computing models can coexist within the same cluster, so the utilization of a Hadoop 2.x cluster is significantly higher than that of a Hadoop 1.x cluster. At a very high level, with the introduction of YARN in Hadoop 2.x, the JobTracker has been replaced by the ResourceManager and the TaskTracker has been replaced by the NodeManager. The ResourceManager helps in scheduling jobs and also takes care of scalability and support for alternative programming paradigms. The NodeManager takes care of the individual compute nodes in the Hadoop cluster: this includes keeping up to date with the ResourceManager, overseeing container life-cycle management, monitoring the resource usage (such as CPU and memory) of individual containers, tracking node health, log management, and other services which may be used by different YARN applications. Hadoop 2.x is also backward compatible with MapReduce code written for Hadoop 1.x, and going forward other applications can be integrated with Hadoop 2.x, so it goes beyond MapReduce. The read and write operations are almost the same as what we saw in Hadoop 1.x; the only difference in this architecture is the registration of DataNodes with multiple NameNodes. If a NameNode goes down, the data can still be accessed with the help of the other NameNodes — look at the block pools in the diagram and how the DataNodes are registered with different NameNodes. In Hadoop 2.x we can say the NameNode has high availability, so Hadoop 2.x can take care of the cluster automatically when a NameNode goes down. The read and write operations are otherwise quite similar to Hadoop 1.x. Running a job in Hadoop 2.x: when a Hadoop client submits a job, the ResourceManager takes care of the job and deploys it to the cluster, where the NodeManagers take care of the tasks. With this, I'm wrapping up my lecture. I hope you enjoyed learning about the Hadoop architectures and the different versions. See you in the next lecture. Thank you.
Redesign ideal so that it can easily transform the data that is supposed to be uploaded to the data warehouse. While extracting the data, we extract only columns or tables that are required for the ideal process. Rest is ignored. It requires a separate infrastructure to maintain ideal and data warehouses start. And the most painful task is that whenever business requirement changes, you need to redesign your holy step to include the changes, which is a costly affair. Let me give you an example. Let's assume your new business requirements. Want some new columns to be added into your data warehouse model? So in that case, you need to redesign your data warehouse model to include those new columns. And since your ideal job is designed to select Onley, dedicated columns or tables, so you need to redesign your it'll job, too. Okay, when you talk about how do your data comes first into how do? Then you think about designing data pipelines to meet you analytics requirement. So, in case off her do the most widely used approaches guilty, you extract all your auditor on, load them into how do and then you do the data transformation to meet your analytics equipment. But yes, good, very depending upon business use cases, the best thing about lt approaches that let's say in if in future, your business requirement changes, you do not need to worry about extracting auditor again, since your data resides in its original draw from it in her do. Whereas in case off traditional data warehouse, it is not because they have been transformed by your ideal job. I hope you guard idea how healthy approach is used in a do. That's all for this lecture. See, in the next one. Thank you. 11. Hadoop Vendors: anyways, welcome back in this lecture, I'm going to talk about Hadoop distribution by different windows. So before I start, let me tell you the difference between her do offered Bear Party and Hadoop offered by these market windows. Well, if you look there two ways, you can set up her do on your test or production involvement. Number one. You can download the Banbury files off a do and it's company from Apart Cheese website and set up the involvement manually. Number two. You can up for any of the distributions provided by vendors in the market, so a very highly bill. The difference between them is that if you opt for Hadoop distribution by any of the windows, you will get support. Some added features on top off Duke and its competent on a very nice gear a system to manage the clusters. Okay, let's have a look at some of the top vendors in the market. Imagines elastic map reduce. Popularly known is EMR was one of the first commercial Hadoop offerings on the market and leads in global market prisons. EMR is her group in the club liberating Amazing easy to for compute Amazing s three stories on other services. Claudia Reyes Focus. Don't innovating her do. Based on enterprise demands, its Hadoop suite is known as clothing distribution. Also known as Sidi it. It has built a faster a school Injun on top. Off her do quite Impala clouded eyes build a very nice gear, a system known as clothing, a manager for management and monitoring off a roof. Harden looks. Hello, Distribution is hard and Books Data Platform, popularly known as SDP Hardened Works strategy is to drive all innovation through open source community. In this course, I will be using hardened works for distribution to demonstrate her group and its company. Autumn Looks also provides a very nice do. A system known as a party embody for a place to management. 
11. Hadoop Vendors: Hi guys, welcome back. In this lecture I'm going to talk about Hadoop distributions from different vendors. Before I start, let me explain the difference between Hadoop offered by Apache and Hadoop offered by these market vendors. There are two ways you can set up Hadoop in your test or production environment. Number one, you can download the binaries of Hadoop and its components from Apache's website and set up the environment manually. Number two, you can opt for one of the distributions provided by vendors in the market. At a very high level, the difference is that if you opt for a Hadoop distribution from a vendor, you get support, some added features on top of Hadoop and its components, and a very nice UI to manage the clusters. Okay, let's have a look at some of the top vendors in the market. Amazon's Elastic MapReduce, popularly known as EMR, was one of the first commercial Hadoop offerings on the market and leads in global market presence. EMR is Hadoop in the cloud, leveraging Amazon EC2 for compute, Amazon S3 for storage, and other services. Cloudera's focus is on innovating Hadoop based on enterprise demands; its Hadoop suite is known as Cloudera's Distribution including Apache Hadoop, also known as CDH. It has built a faster SQL engine on top of Hadoop called Impala, and Cloudera also provides a very nice UI known as Cloudera Manager for management and monitoring of Hadoop. Hortonworks' Hadoop distribution is the Hortonworks Data Platform, popularly known as HDP. Hortonworks' strategy is to drive all innovation through the open-source community. In this course I will be using the Hortonworks distribution to demonstrate Hadoop and its components. Hortonworks also provides a very nice UI known as Apache Ambari for cluster management. IBM's InfoSphere BigInsights provides easy integration with other IBM tools like SPSS, advanced analytics, workload management for high-performance computing, BI tools, and data management and modeling tools. Its Hadoop suite includes sophisticated text analytics modules, IBM BigSheets for data exploration, and a variety of performance, reliability, security, and administrative features. MapR Technologies is the third big player, but it lacks the market presence of Cloudera and Hortonworks. Its Hadoop distribution supports a network file system, one of its key innovations, and it supports running arbitrary code in the cluster, performance enhancements for HBase, as well as high-availability and disaster-recovery features. Teradata was the first enterprise data warehouse vendor to provide a full-featured, enterprise-grade Hadoop appliance. It was also the first to roll out an appliance family that integrated Hadoop, the enterprise data warehouse, and data management tools in a single rack. Teradata's Hadoop distribution includes an MPP SQL-on-Hadoop engine that provides MPP-like SQL performance on Hadoop; MPP stands for massively parallel processing. Teradata, being a specialist in enterprise data warehouses, has partnered with Hortonworks to offer Hadoop as an appliance, and the Teradata distribution includes integration with Teradata's management tools and a suite of federated SQL engines that lets customers query data from its data warehouse. And that's all for this lecture. See you in the next one.

12. Managing HDFS from Command Line: Hey guys, welcome back. In this lecture we're going to learn basic commands to manage HDFS. Learning these basic commands will help you a lot when you start playing with Hadoop. I'm assuming you have a working Hadoop environment on your machine; if you haven't installed it yet, I would suggest you go back, install Hadoop, and then attend this lecture. So, to create a directory in HDFS, you can use hadoop fs -mkdir and the directory path; in the example I'm creating two directories, dir1 and dir2, within the user directory. To list the files within a directory, you can fire the command hadoop fs -ls and the directory location. If you want to see how many blocks a particular file in HDFS has, you can check that with the fsck command; fsck generates a summary report that lists the overall health of the file system. HDFS is considered healthy if, and only if, all files have the minimum number of replicas available. To copy files from one location to another within HDFS, you can use the -cp command, as shown in the example. If you want to upload files from your local directory to HDFS, you can use the -put command, and to download files from HDFS you can use the -get command; see the example here. To find a file's size in HDFS you can use the -du command, and if you want to remove files from HDFS you can use the -rm command. To get help, you can type the -help command on the terminal. A quick reference of these commands follows below; after that, let's see them in action.
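Here is the quick reference of the commands covered in this lecture, collected in one place (the paths are placeholders):

    hadoop fs -mkdir /user/dir1                          # create a directory in HDFS
    hadoop fs -ls /user                                  # list files and directories
    hadoop fs -put testfile11.txt /user/dir1             # upload a local file to HDFS
    hadoop fs -cp /user/dir1/testfile11.txt /user/dir2   # copy a file within HDFS
    hadoop fs -get /user/dir1/testfile11.txt .           # download a file from HDFS
    hadoop fs -du /user/dir1                             # show file sizes in bytes
    hadoop fs -rm /user/dir2/testfile11.txt              # remove a file from HDFS
    hadoop fsck /user/dir1/testfile11.txt                # check a file's block-level health
    hadoop fs -help                                      # list all available commands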
So it will list all the directories within the user directory, and you can see here dir1, which we just created. Now let's put a file into this directory from our local directory. Okay, so in our local directory I'll create a dummy file: touch testfile11.txt. Then I'll put this file into HDFS: hadoop fs -put testfile11.txt /user/dir1. This command will upload the file from our local directory to HDFS. You can see the file in dir1 with hadoop fs -ls /user/dir1. Okay, so now this file has been uploaded to HDFS. Now let's create another directory in HDFS, where we will see how we can copy a file from one location to another within HDFS. To create another directory: hadoop fs -mkdir /user/dir2. This will create another directory within the user directory. You can see dir2 with hadoop fs -ls /user; the new directory has been created. So what I'll do now is take the file we uploaded to dir1 and copy it from dir1 to dir2. I'll type hadoop fs -cp /user/dir1/testfile11.txt /user/dir2. This command will copy the file sitting in dir1 into dir2, and we can see it with hadoop fs -ls /user/dir2. Now you can see the file has been copied to dir2. Okay, we learned about the fsck command, which gives a summary of the file system health in HDFS. Let's check that command: I'll type hadoop fsck /user/sample_data/salary/salary.csv. The report says whether a file is healthy in HDFS or not; you can see the status here, and the file is healthy. If there are any missing blocks, you can see them in the report, and you can also see details about the replication factor, data nodes, corrupt blocks, et cetera. And if you want to see the size of a file in HDFS, you can type hadoop fs -du and the path; the size is shown in bytes, as you can see here. If you want to remove some files from a directory in HDFS, you can type hadoop fs -rm /user/dir2/*, and it will delete all files within dir2. And if you want to download some files from HDFS to your local directory, you can type hadoop fs -get and the path under /user/sample_data. Let me copy this one to our local directory. Oh, it says the file already exists, so let me remove that first. So what I'm doing here is downloading the salary.csv file to a local directory, as you can see. All right, that's all for this lecture. I hope you enjoyed learning the HDFS commands, and I would recommend you go and try them on your machine. Thank you. See you in the next lecture. 13. Introduction to Hive: Welcome back. This lecture is going to give you some basic knowledge about Hive. Before I start, let me give you an idea about why Hive and Pig were developed. I'll go with a simple example. Let's assume there are two files, namely customers and customer transactions, in HDFS. Now, if someone asks, "Tell me the top five paying customers by geo," then to answer this question you would write a MapReduce program to solve the problem.
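Just to preview where this is heading: once Hive is introduced, a question like this collapses into a few lines of HiveQL. The sketch below is only illustrative; the table and column names (customers, transactions, amount, geo) are assumptions, and it returns the top five paying customers overall with their geo rather than a per-geo ranking, which would need a window function.

```sql
-- Hypothetical tables: customers(customer_id, name, geo), transactions(customer_id, amount)
SELECT c.geo, c.name, SUM(t.amount) AS total_paid
FROM customers c
JOIN transactions t ON (c.customer_id = t.customer_id)
GROUP BY c.geo, c.name
ORDER BY total_paid DESC
LIMIT 5;
```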
Because of the extreme simplicity of MapReduce, you have to deal with much lower-level hacking. With the many-stage, branching data flows that arise in practice, you have to repeatedly code standard operations such as joins by hand. These practices waste time, introduce bugs, harm readability, and reduce opportunities for optimization. There is a lot that is repetitive during the data preparation process, so there is a need for a high-level tool that accomplishes these things easily, hiding all the complexity inside. That's where Hive and Pig come into the picture. Hive provides a familiar model for those who know SQL and allows them to think and work from a relational database perspective. It provides a simple query language called HiveQL, which is based on SQL and which enables users familiar with SQL to do ad hoc querying, summarization, and data analysis. At the same time, HiveQL also allows traditional MapReduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities. Hive is a component of the Hadoop ecosystem. Hive is a data warehousing infrastructure for Hadoop; its primary responsibility is to provide data summarization, query, and analysis. It supports analysis of large data sets stored in Hadoop's HDFS file system. Now let's see what Hive is not. Hive is not built to get a quick response to queries; rather, it is built for data mining applications. It is not designed for online transaction processing, and Hive does not offer real-time queries. It is best used in batch jobs. That's all for this lecture. See you in the next one. Thank you. 14. Hive Architecture: Welcome back. In this lecture we're going to learn about the Hive architecture. Well, this is the architecture of Hive. When commands and queries are submitted to Hive, they go to the driver. The driver will compile, optimize, and execute them as a sequence of MapReduce jobs. It may seem as if the driver generates Java MapReduce jobs internally, but that's not the case: Hive has generic mapper and reducer modules which operate based on information in an XML plan file. Now let's understand the main components of Hive. UI: this is the interface for users to submit queries and other operations to the system. Driver: receives the queries. This component implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces. Compiler: parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of table and partition metadata looked up from the metastore. Metastore: the Hive table definitions and mappings to the data are stored in the metastore. The metastore is a traditional relational database, usually MySQL. The metastore consists of the metastore service and the database; the metastore service provides the interface to Hive, and the database stores the data definitions, the mappings to the data, and so on. Execution engine: executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages on the appropriate system components. Optimizer: helps to optimize the query plan. The query can be run on a sample of the data to get the data distribution, which can be used to generate a better plan.
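One easy way to see the compiler and optimizer at work is Hive's EXPLAIN statement, which prints the plan (the DAG of stages) without running the query. A minimal sketch, assuming a hypothetical employees table:

```sql
EXPLAIN
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department;
```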
That's all for this lecture. I hope you got the idea of how the Hive components work internally. See you in the next lecture. 15. File Formats in Hive: Welcome back. In this lecture, we're going to learn about file formats in Hive. Well, if you look around, data is growing at a very high rate, and today almost every business is capturing big data. But the problem arises when you try to access that big data. In this lecture, we will discuss how different file formats in Hive help to store and access data in Hadoop. First of all, I'll be creating an external table in Hive to read a file sitting in HDFS. To create a table that stores the file as a text file, we need to specify the file type. Then we will pull data from the external table into a salary_text table, which will store data as a text file. RCFile stands for Record Columnar File. RCFile is a data storage structure that determines how to minimize the space required for relational data in HDFS. It does this by changing the format of the data using the MapReduce framework. The RCFile combines multiple functions such as data storage formatting, data compression, and data access optimization. So it helps in fast data storing, improved query processing, optimized storage space utilization, and dynamic data access patterns. The RCFile format can partition the data both horizontally and vertically. This allows it to fetch only the specific fields that are required for analysis, thereby eliminating the time needed to scan the whole table in a database. The overall data size reduction can be up to 15% of the original data format. As you can see on the screen, I have created a sample RCFile table. Before loading data into the table, you have to run these three set commands to enable compression. Once you load data, you can run a sample query on an individual column to see how many bytes are read when MapReduce starts; it will be less than what you see with a normal text table. Parquet: it is a columnar store that gives us advantages for storing and scanning data. Storing the data column-wise allows for better compression, which gives us faster scans while using less storage. It is also helpful for wide tables and for things like column-level aggregations. The overall data size reduction can be up to 60% of the original data format. Creating a Parquet table is quite simple: you just need to specify the storage type, like you did for the RCFile table, and then you need to load the data. ORC stands for Optimized Row Columnar. The ORC file format provides a more efficient way to store relational data than RCFile, reducing the stored data by up to 75% of the original. The ORC file format performs better than other Hive file formats when Hive is reading, writing, and processing data; in comparison to RCFile, ORC takes less time to access data and takes less space to store data. However, the ORC file increases CPU overhead by increasing the time it takes to compress the relational data. Creating an ORC table is similar to what we did for RCFile or Parquet; you just need to specify the storage type during table creation. At last, look at how the different file formats reduce the original size of the raw data, ORC being the most efficient one here.
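For your own practice, the table definitions discussed in this lecture look roughly like the sketch below. The column names are illustrative, the three compression settings are the ones commonly used for this purpose (the lecture's exact commands may differ), and native STORED AS PARQUET support depends on the Hive version in your sandbox.

```sql
CREATE TABLE salary_text    (emp_id INT, salary DOUBLE, pay_month STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

CREATE TABLE salary_rc      (emp_id INT, salary DOUBLE, pay_month STRING) STORED AS RCFILE;
CREATE TABLE salary_parquet (emp_id INT, salary DOUBLE, pay_month STRING) STORED AS PARQUET;
CREATE TABLE salary_orc     (emp_id INT, salary DOUBLE, pay_month STRING) STORED AS ORC;

-- Typical settings enabled before loading the compressed tables
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Populate each format from the text table, then compare directory sizes in HDFS
INSERT OVERWRITE TABLE salary_orc SELECT * FROM salary_text;
```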
That's all for this lecture. Go and practice on your machine. See you in the next lecture. Thank you. 16. SQL vs HQL: Welcome back. In this lecture we will learn about Hive queries, and we will also see the similarities and dissimilarities between SQL and HQL. SQL stands for Structured Query Language, whereas HQL stands for Hive Query Language. When it comes to Hive queries, they're quite similar to SQL queries. When using Hive, you access metadata about the schema and tables by executing statements written in HiveQL. Unsurprisingly, these metadata statements are quite similar to what you see in the SQL world. Look at the statements for selecting databases, listing databases, listing tables, and describing and creating databases in Hive, and how similar they are to what you see in SQL. Note that there are three ways to describe a table in Hive. To see the primary info of a Hive table, use DESCRIBE table. The extended form displays additional information: low-level details such as whether the table is internal or external, when it was created, the file format, the location of the data in HDFS, whether the object is a table or a view, and, for views, the text of the query from the view definition. To see all of these details laid out in a clean manner, use DESCRIBE FORMATTED table_name; this is the command to see all the information in a readable layout. You can also run Hive queries from the command line, and there are various ways to do so; look at the examples here. For instance, if you want to run some Hive queries in silent mode, you can specify the -S and -e options on the terminal. You can also set Hive config variables when running queries from the command line. If you want to run Hive queries from a SQL file, you can specify the -f option and the file name on the terminal. Hive provides a lot of capabilities when you are in the Hive shell. You can run a script within the Hive shell using the source command, as shown here in the example. You can list files in HDFS using the dfs command, and if you want to list files from the home directory, you can run the !ls command. You can use the set command for configuration variables, and you can use tab completion in the Hive shell. One can reset all variables using the reset command within the shell, and you can add jars, list jars, and delete jars from within it. Hive queries themselves are very similar to SQL queries: look at the queries here for selecting columns, finding distinct values, doing an ORDER BY operation, a join, and so on. There are limitations in Hive queries, which we'll see in our next lectures. Also, we will see these queries in action in our Hive demo lecture. That's all for this lecture. Go and practice what you have learned today. 17. UDF & UDAF in Hive: Welcome back. In this lecture we're going to learn about UDFs and UDAFs in Hive. User-defined functions let you call your own application logic for processing column values during a Hive query. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or do other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions. You can use UDFs to simplify query logic when producing reports, or to transform data in flexible ways when copying data from one table to another. For example, if you fire SELECT lower(name) FROM employees, then for each row in the table employees, the lower UDF takes one argument, the value of name, and outputs one value, the lowercase representation of the name. And if you fire SELECT datediff(end_date, start_date) FROM employees, then for each row in the table employees, the datediff UDF takes two arguments, the values of the end date and the start date, and outputs one value, the difference in time between these two dates.
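Written out, those two built-in UDF calls are simply the following; the column names are assumed from the lecture's wording, and note that Hive's datediff expects the end date first and returns the difference in days.

```sql
SELECT lower(name) FROM employees;
SELECT datediff(end_date, start_date) FROM employees;
```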
Each argument of a UDF can be a column of the table, a constant value, the result of another UDF, or the result of an arithmetic computation. UDAF stands for user-defined aggregate function, which accepts a group of values and returns a single value. UDAFs are used to summarize and condense sets of rows, in the same style as the built-in COUNT, MAX, SUM, or AVG functions. When a UDAF is called in a query that uses a GROUP BY clause, the function is called once for each combination of GROUP BY values. It evaluates multiple rows but returns a single value; see the example for closest restaurant. It evaluates batches of rows and returns a separate value for each batch; see the example of most profitable location here. You can see an example here: first of all, we're creating a UDF to convert Celsius into Fahrenheit and adding the UDF jar into Hive; then we're creating a named function so that we can call it in a query. That's all for this lecture. See you in the next one. 18. Hive Demo: Welcome back. So far we have learned a lot of things about Hive: the Hive architecture, data models, different file formats, and the differences between SQL and HQL. Now it's time to have some hands-on. Whatever we have learned about Hive in our previous lectures, we will see in action in this lecture. With this lecture I have attached the sample data sets and scripts used in this demo; they are for your exercise on your own machine after you finish this lecture. Well, first of all, we need to start Hadoop on our machine; I have already started Hadoop on my machine. In the Hadoop installation guide, we already learned that there are two ways to interact with Hadoop via a terminal: either you can access it through the VirtualBox window or you can access it through your local terminal. And there is one more way to interact with Hadoop: via the Hortonworks web interface. I will demonstrate all three logins. Let's have a look. VirtualBox login: first of all, I need to get into the VirtualBox window; since I'm using a Mac, the key combination to do that is a little different. Okay, I enter the sandbox username and the password, hadoop, and I am in the home directory of the sandbox. The other way is via the local terminal: I just type ssh -p with the port number and root@127.0.0.1, then enter the password, as you can see here. And the third way is via the web interface: you just need to type this URL. Hortonworks provides a very nice web interface to interact with Hadoop and its components. The web interface has a lot of capabilities: you can upload files to HDFS, browse files in HDFS, run queries on different components of Hadoop, design workflow jobs, and much more. Let's have a look at how we can create a directory and upload files to HDFS through the web interface. I'll go to the file browser. So we are in HDFS, and you can see the directories here; these are the directories sitting in HDFS, and if you remember, these are the directories we created during our HDFS commands lecture. Let me create a new directory. So a new directory has been created in HDFS. Let's upload a file here, the department file, to HDFS. So this file has been uploaded to HDFS; look how easy it is to upload files into HDFS through the web interface. Now let's create a database in Hive; then we will run our sample queries in this database. I'll go to the Hive editor here. Okay, let me run the CREATE DATABASE statement. So a new database has been created; if you click on the database tab, you can see the list of databases here. Okay, for the demo I have already uploaded sample data sets into HDFS. Let me show you the directories and files: I'll go to the file browser, then user, and then sample data. So I have uploaded the salary files in a salary directory, the department file in a department directory, and so on. I will create some external tables to read these files. So let me go to the Hive editor.
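The external table about to be created typically looks something like the sketch below; the actual column list, delimiter, and HDFS location depend on the attached sample files, so treat these as assumptions.

```sql
CREATE EXTERNAL TABLE employees (
  emp_id  INT,
  name    STRING,
  dept_id INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/sample_data/employees';
```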
Okay. Ah, look. Been created. Deliveries cream dude, Avi's five be able. So a new database has been created if you click on this database system. So you didn't see the list of databases here? Okay. For demo? I have already uploaded. Simple later sits in his defense. Let me show you the director and files. I shall go toe filing riser. Use it. Aan den symbol leader. So I have uploaded Sally files in salad directory department, filing department 80 something like this. I will create some external tables to read these files. So let me go to hive. I demo. Okay, so I'm creating an external table called employees and I'm mentioning the location of the fire in his defense. Let's create it. If you Google Ebel's employees will has been created. You can see the column names here. And if you want to see some Temple Leader, just click on this tab. Simple. We'll see the values here. Let's go to hire a liter will create some other tables. Also a clear department location of the department. In his defense, click on Tables department has been created. And if we want to see some sample leader, click on this. It's great. And that there be with Sandri. I'll go to hire editor and will take me all queries. So we have created three external tables here to read file sitting in his defense. This is the simple leader. Let's create some internal table in hives. Did you copy the Brady? This time? I'll be creating that They will using terminal I shall go to done on Wednesday. I live. It will take some minutes. Doping hopes in. Let's wait, Staking, dying. Now we are in the hypes and really die. So races we will be creating their table in high underscored demo. Let me base the brilliant that will launch some members of the job here. Look at the status here, so that table has been created suddenly in the Scranton looked fires. I'm greedy limit. So the location of the data is in ABS. It's less hives, less warehouse, less high underscore Demo door Devi. So if you relate this internal table, both data as well as schema will be deleted from high. So in our previous lecture, we learned about file formats in hives. Let's run those qualities. So for stuff, all L created will be creating this David sellin escort Next based. Quitting. So Havens. Now let's loaded into the stable, certainly in the school text. So we loading the data from extended, it will learn some member of the job. You can see their status here. So we have created our table that the stores data as a text file. Let's fire Sally in the school next. I'll create inevitable that the stores far lesser falsified So tables So you can see this structure certainly in school to see let me loaded. So before I load the data, I'll have to run these commands as we have discussed in our previous lectures. Okay, Now let a little later and plus he find it. It is launching the map Readers shop. I'll create an adaptable that the stores file is a market for it. So let's glow didn't. So Once it is done, I'll show you the different size off files we'll see table. So copy that equity. So tables now load the data into this tip so some man produces going on. So we will see how the file size varies across these tables. So let's go toe file browser in his defense. Oh, let me go through High Day burns. So let's see the size of the file for location. So it's around 97.3. Gaby on If we go toe for worse e file, you're gonna see the difference. Look, it's only 15 Gibby. See how the file is being compressed across different, you know, file formats in high. You can see their differences. 
Now let's run a join query in Hive. We will be running this query, which gives the count of employees by department whose salary was greater than 1000 in August 2008. Let me go to the Hive query editor and paste the query. It's taking some time; let's wait. It is throwing some logs about what is going on, and you can see the MapReduce status here. It's taking some time. Now we've got the result: see, the employee count by department. Now let's run a Hive query from the command line. Let me copy the query; I'll go to the Hortonworks home directory and paste the query. See, this is how we can also execute queries from the terminal. It's taking some time, launching some MapReduce jobs. Yeah, you can see the result. Let's run one more query; again it takes some time. Look at the result. With this, I'm wrapping up my Hive demo lecture. I hope you enjoyed seeing Hive in action. See you in the next lecture. 19. Introduction to Pig: Hello, welcome back. In this lecture we're going to learn about the history and basic fundamentals of Pig. Pig is designed to handle any kind of data. Pig is a high-level, extensible language designed to reduce the complexities of coding MapReduce applications. Pig was developed at Yahoo to help people use Hadoop and to emphasize analyzing large unstructured data sets by minimizing the time spent on writing mapper and reducer functions. All tasks are encoded in a manner that helps the system optimize the execution automatically, because typically 10 lines of code in Pig equal 200 lines of code in Java. Pig converts its operators into MapReduce code. Pig is composed of two components: one is the Pig Latin programming language, and the other is the Pig runtime environment. Pig is a high-level language platform developed to execute queries on huge data sets stored in HDFS using Hadoop. It is similar to the SQL query language but applied to larger data sets and with additional features. The language used in Pig is called Pig Latin. It is very similar to SQL. It is used to load the data, apply transformations, and dump the data in the required format. Pig converts all operations into map and reduce tasks, which can be efficiently processed on Hadoop. It basically allows us to concentrate on the whole operation, irrespective of the individual mapper and reducer functions. Pig can be used as an ETL tool to design data pipelines; it allows a detailed, step-by-step process by which the data has to be transformed. Pig can be used for research and development, and it can also be used for iterative data processing. Benefits of Pig: Pig takes few lines of code to process a complex task. Pig is self-optimizing. No Java knowledge is required to learn Pig Latin. It can be used for ad hoc queries. Pig supports SQL-like capabilities such as join, sort, filter, mathematical functions, et cetera. Lesser development time: a Pig script takes about 5% of the time compared to writing MapReduce programs in Java. It is good for batch processing jobs. It can process structured, semi-structured, and unstructured data. With this, I am wrapping up my lecture. See you in the next lecture. 20. Pig Architecture: Welcome back. In this lecture we're going to learn about the Pig architecture. We will see how the different components of Pig work together. As we learned in our previous lecture, Pig is a high-level, extensible language designed to reduce the complexities of coding MapReduce applications.
So when you submit Pig queries, they are initially handled by the parser. The parser checks the syntax of the script and does type checking and other checks. The output of the parser is a DAG, that is, a directed acyclic graph, which represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are represented as the nodes, and the data flows are represented as edges. So the parser basically generates a logical plan as its output. Optimizer: the logical plan is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown. Compiler: the compiler compiles the optimized logical plan into a series of MapReduce jobs. Execution engine: finally, the MapReduce jobs are submitted to Hadoop in sorted order, and these MapReduce jobs are executed on Hadoop, producing the desired results. That's all for this lecture. I hope you got the idea of how Pig works internally. See you in the next lecture. 21. Pig Data Model: Welcome back. In this lecture we're going to learn about the Pig data model. Well, Pig has a very limited set of data types. Pig data types are classified into two kinds: primitive and complex. The primitive data types are also called simple data types; they include int, long, float, double, et cetera. Pig supports three complex data types: a tuple, which is an ordered set of fields; a bag, which is a set of tuples; and a map, which is a set of key-value pairs. Pig supports a lot of operators, each with its own functionality; most of them are mentioned here. The LOAD operator reads data from the file system. The DUMP operator writes output to standard output. The LIMIT operator limits the number of records. The GROUP operator collects records with the same key from one or more inputs. The DESCRIBE operator returns the schema of a relation. The FOREACH...GENERATE operator applies expressions to each record and outputs one or more records. The FILTER operator selects tuples from a relation based on some condition. The JOIN operator joins two or more inputs based on a key. The SPLIT operator splits data into two or more sets based on filter conditions. The SAMPLE operator selects a random sample of data based on a specified sample size. The ORDER operator sorts records based on a key. The DISTINCT operator removes duplicate records. The STORE operator writes data to the file system. UNION merges two data sets. The RANK operator returns each tuple with its rank within a relation. That's all for this lecture. See you in the next lecture. 22. How Pig Latin Works: Welcome back. In this lecture, the main aim is to give you an idea of how Pig Latin works in Pig. Pig Latin statements work with relations. A relation can be defined as follows: a relation is a bag; a bag is a collection of tuples; a tuple is an ordered set of fields; a field is a piece of data. A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table. Unlike a relational table, however, Pig relations don't require that every tuple contain the same number of fields, or that the fields in the same position have the same type. Also, relations are unordered, which means there is no guarantee that tuples are processed in any particular order. Relations are referred to by name, or alias. Names are assigned by the user as part of a Pig Latin statement. So in this example, the name or alias of the relation is A.
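A minimal sketch along the lines of the example being described, with an assumed file path and field names:

```pig
A = LOAD '/user/sample_data/users' USING PigStorage(',')
        AS (user_id:int, name:chararray, age:int);
DESCRIBE A;          -- show the schema of relation A
D = SAMPLE A 0.1;    -- take a random sample of the bag
E = LIMIT D 10;      -- keep at most ten tuples
DUMP E;              -- print the result on the screen
```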
Look at the example of how I'm loading the data and specifying the schema in the relation, and dumping the results of the relation on the screen. In a relation, fields are referred to by positional notation or by name. Positional notation is generated by the system; it is indicated with a dollar sign and begins with zero. So, for example, $0 refers to the first field in the file and $2 refers to the third field in the file. Names are assigned by the user using a schema. In this example, I have some files sitting in HDFS and I have defined a relation in Pig to read these files. As we just learned, a relation is a bag, which is similar to a table in a relational database. When you fire DESCRIBE D, it will show the schema of relation D. In the third line, we're taking a sample of bag D and limiting the number of rows to ten, and finally we are printing the results on the screen using the DUMP operator. Look at the relations for complex data types: the output of such a relation will have a complex data type. I hope you got the idea about relations in Pig. Go and practice on your machine today. See you in the next lecture. 23. SQL vs Pig: Hello, welcome back. In this lecture we're going to learn about the similarities and dissimilarities between SQL and Pig. Let's understand some fundamentals of Pig Latin. Pig Latin is procedural, whereas SQL is declarative. Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline. Pig Latin allows the developer to select specific operator implementations directly, rather than relying on the optimizer. Pig Latin supports splits in the pipeline. Pig Latin allows developers to insert their own code almost anywhere in the data pipeline. Again, Pig Latin is procedural, while SQL, on the other hand, is declarative. Let's understand this with an example. Consider a simple pipeline where data from the sources users and clicks is to be joined and filtered, then joined to data from a third source called geoinfo and aggregated, and finally stored into a table called ValuableClicksPerDMA. In SQL this could be written as shown here: in the inner query we're joining the sources users and clicks, and then in the outer query we're joining them with geoinfo and finally storing the result into ValuableClicksPerDMA. The same thing can be written in Pig as shown here; look at the relations. Look how SQL forces the pipeline to be written inside-out, with the operations that need to happen first buried in the FROM clause of the query. This could be resolved with the use of intermediate or temporary tables, but then the pipeline becomes an orchestrated set of SQL queries where the ordering only becomes apparent by looking at a master script that glues all the SQL together. Also, depending on how the database handles temporary tables, there may be cleanup issues to deal with. In contrast, Pig Latin lets users express exactly the data flow, without forcing them either to think inside-out or to construct a set of temporary tables and manage how those tables are used between different SQL queries. The pipeline given here is obviously simple; it consists of only two very simple steps. In practice, data pipelines at large organizations are often quite complex. If each Pig Latin script spans ten such steps, then the number of scripts to manage in source control, the code maintenance, and the workflow specification drop by an order of magnitude.
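For reference, the Pig Latin side of that pipeline has roughly the shape sketched below. The field names, paths, and the particular filter condition are assumptions used only to show the step-by-step data flow.

```pig
users       = LOAD 'users'   AS (name:chararray, age:int);
clicks      = LOAD 'clicks'  AS (user:chararray, url:chararray, value:int);
valuable    = FILTER clicks BY value > 0;
user_clicks = JOIN users BY name, valuable BY user;
geoinfo     = LOAD 'geoinfo' AS (url:chararray, dma:chararray);
user_geo    = JOIN user_clicks BY valuable::url, geoinfo BY url;
by_dma      = GROUP user_geo BY geoinfo::dma;
per_dma     = FOREACH by_dma GENERATE group AS dma, COUNT(user_geo) AS num_clicks;
STORE per_dma INTO 'ValuableClicksPerDMA';
```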
There are some keywords in Pig Latin which work similarly to what we see in the SQL world. In Pig Latin, FILTER is quite similar to the WHERE clause in SQL: the syntax is different, but conceptually it is the same as a SQL WHERE clause, where we're filtering data based on some condition. Since filtering is done in a separate statement from a group or aggregation, the distinction between HAVING and WHERE does not exist in Pig. ORDER BY behaves pretty much the same in Pig as it does in SQL. In Pig, joins can have their execution strategy specified and they look a little different, but in essence these are the same joins you know from SQL, and you can think about them in the same way; all join types are supported by Pig. There are similarities in grouping too: everything in SQL is a row, and the grouping created is not persistent; only the aggregated data produced remains. Something equivalent to subqueries is natural in Pig Latin, because every step has a declared alias, so reusing intermediate results is intuitive and generally does not involve rebuilding them. Pig's list of built-in functions is growing, but it is still much smaller than what Oracle or MySQL provides. So what Pig does is allow the user to define aggregate or analytic functions in another language like Java or Python and then apply them in Pig queries without any issue. Here are some examples of how Pig Latin's syntax varies from SQL queries: look at the examples for select queries, running DISTINCT in SQL and in Pig, and running aggregate functions. Look how we can perform joins in Pig and perform union operations. I hope this lecture has given you enough of an idea about how SQL and Pig Latin differ. It's time to start practicing: go and play around with Pig Latin on your machine. We will see Pig Latin in action in our Pig demo lecture. That's all for this lecture. See you in the next one. 24. UDF in Pig: Welcome back. In this lecture we will see how UDFs play an important role in Pig. UDF stands for user-defined function. A Pig UDF is a function that is accessible to Pig but written in a language that isn't Pig Latin. Pig allows users to register UDFs for use within a Pig Latin script. In the previous lecture, we learned that Pig has a limited set of built-in functions in comparison to what Oracle or MySQL provides, so we can write our own analytic functions to process data. UDFs provide the capability to do custom data processing in Pig. UDFs are easy to use and code, and Pig UDFs can be written in multiple languages like Java, Python, JavaScript, et cetera. With Pig UDFs, we can process almost anything: image feature extraction, geo computation, data cleaning, natural language processing, and much more. Pig allows users to combine existing operators with their own code via UDFs. Piggybank is nothing but a collection of user-contributed UDFs that is released along with Pig; Piggybank UDFs have to be registered manually from the Piggybank jar when used in Pig scripts. There are three types of UDFs in Pig: eval UDFs, aggregation UDFs, and filter UDFs. Eval UDFs are used in FOREACH-type statements. Look at the example here: we're defining a relation in the first statement, and in the second step we're doing string processing using our UDF to generate names in lowercase. This is one example of an eval UDF. Aggregation UDFs are eval UDFs that are applied on grouped data.
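Before moving on to aggregation and filter UDFs, here is a hedged sketch of the eval UDF pattern just described; the jar path and the Java class name are hypothetical placeholders, not actual course files.

```pig
REGISTER '/tmp/my-udfs.jar';
DEFINE ToLower com.example.pig.ToLower();

users   = LOAD '/user/sample_data/users' USING PigStorage(',')
              AS (user_id:int, name:chararray);
lowered = FOREACH users GENERATE user_id, ToLower(name) AS name_lc;
DUMP lowered;
```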
Aggregation UDFs are similar to the functions we use with a GROUP BY statement in SQL, like SUM or COUNT; aggregate functions are used to combine multiple pieces of information. In this example, we're calculating the monthly sales by product; look at the relations here. A filter UDF is used to filter data based on some condition; it returns Boolean values. So in this example, we're filtering abusive comments out of a given set of data; look at the relations here. Whenever you submit Pig queries, Pig converts them into a set of MapReduce jobs, and a separate instance of the UDF is constructed and run in each map and reduce task; this is where your UDF will run. I hope you have enjoyed learning about Pig and its concepts. The next lecture is going to be really, really exciting, as you're going to have a lot of hands-on experience. See you in the next lecture. 25. Pig Demo: Welcome back. In this lecture we're going to see Pig in action. I'm going to demonstrate how Pig Latin works in Pig. We will be running different sample queries to see how we can process data and interact with it. With this lecture I have also attached the sample data sets and scripts used in this demo. For the demonstration, I have already uploaded the sample data sets into HDFS. As we learned in our previous lectures, Pig Latin is a data flow language, and each processing step, or relation, results in a new data set. Let's see them in action. I will be demonstrating the queries using the Hortonworks web interface as well as the Pig Grunt shell. Pig can process anything: if a schema for the data is available, Pig will make use of it, both for up-front error checking and for optimization, but if no schema is available, Pig will still process the data, making the best guesses it can based on how the script treats the data. Let's run some sample queries. I'll go to the Pig editor and run the query. I have a file sitting in HDFS at this location. So I'm loading the data by specifying a schema; in the second step, I'm taking a sample of the bag E; and finally, in the third step, I'm dumping the results on the screen. Let's run it; it will take some minutes to run. So this is the result. Let's run some other simple queries in the Pig Grunt shell. I'll go to the Hortonworks home directory and type pig. After some time, we will be in the Pig Grunt shell, and from the Grunt shell we can execute Pig Latin scripts. Now we are in the Grunt shell; let me copy a sample query. In this query, I'm loading the data from HDFS by specifying a schema; in the second statement I'm limiting the rows to ten; and finally I'm printing the results on the screen. Let me run it so you can see the result. Now let's run a query using a GROUP BY statement. This will run some MapReduce jobs to process the data, and finally it will print the results on the screen. You can see the status here; it's taking some time. Now 25% is completed, 50% is completed, now 75% is completed. So look at the result here. Now let's run a join query in Pig. We will be counting the number of employees by department. In the first relation, I'm loading the employee data; in the second relation, I'm loading the department data; in the third relation, I'm joining the two bags EMP and DEPT by department ID; in the next relation, I'm grouping by department name; in the fifth relation, I'm doing a count for each department and generating the groups; and finally I'm printing the results. Let's run it.
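A hedged sketch of what that employee-count-by-department join could look like in Pig Latin; the paths and field names are assumptions rather than the attached data sets.

```pig
emp  = LOAD '/user/sample_data/employees'  USING PigStorage(',')
           AS (emp_id:int, name:chararray, dept_id:int);
dept = LOAD '/user/sample_data/department' USING PigStorage(',')
           AS (dept_id:int, dept_name:chararray);
j    = JOIN emp BY dept_id, dept BY dept_id;
g    = GROUP j BY dept::dept_name;
cnt  = FOREACH g GENERATE group AS dept_name, COUNT(j) AS emp_count;
DUMP cnt;
```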
It will again launch the MapReduce job, and it will take a few minutes to complete. Let's wait; you can see the MapReduce status here. It throws the logs: 25% is completed, now 75% is completed. Here is the result. So this is the count of employees by department. Now, if you want to store Pig results into HDFS, you can do that by using the STORE operator. Let's have a look at the sample query: in the last line, I'm specifying a STORE operator and specifying the output location in HDFS. Let's run it. It will again launch the MapReduce job and will take some minutes to complete: 25% is completed, 50% is completed, 75% is completed. So the job has been successfully completed, and we can see the results in HDFS. We can go to the file browser, let me go to the sample data directory, and you can see the result. Now let's run a word count problem in Pig. We will be finding the occurrences of 'hello world' and 'world war' in the given sample data. In the first relation, I'm loading the sample data set; in the second relation, I'm doing tokenization; in the third relation, I'm doing a filter operation for 'hello world' and 'world war'; in the fourth relation, I'm doing a group by operation on the word; and for each word I'm doing a count operation and finally printing the results. Let's run it. Sorry, let me paste it. So it will launch the MapReduce job, and we can look at the status here; it will take a few minutes to complete. Let's wait. 50% is completed. Look at the results here. I hope you got an idea of how Pig Latin works in Pig. That's all for the Pig demo. Go and practice on your machine today. See you in the next lecture. 26. Designing Data Pipeline using Pig and Hive: Welcome back. By now we have got enough of an idea about how Hadoop and its components work. In this lecture, I'm going to design a data pipeline using Pig and Hive to process logs generated by users on a website; we will be analyzing clickstream data generated by users on a website. Let's understand what clickstream data is. Clickstream data is the information trail a user leaves behind while visiting a website. It is typically captured in semi-structured website log files. The log files contain data elements such as a date and timestamp, the visitor's IP address, the destination URLs of the pages visited, and a user ID that uniquely identifies the website visitor. We'll have a look at the sample data in a moment. The demo scripts and data sets I'll use in this demo have been attached with this lecture; you can run them on your machine once you finish this lecture. Now let's understand what a data pipeline is. In a general sense, a data pipeline is the process of structuring, processing, and transforming data in stages, regardless of what the source data format may be. Some traditional use cases for a data pipeline are preprocessing for data warehousing, joining with other data sets to create new data sets, and feature extraction for input to a machine learning algorithm. A data pipeline is an automated process that executes at regular intervals of time to ingest, clean, transform, and aggregate an incoming feed of data, generating output data sets in a format that is suitable for downstream processing, with no manual intervention. So in this demo, this is how I have designed a sample data pipeline using Pig and Hive to process clickstream data. First of all, we will be uploading the sample logs file into HDFS.
Then a Pig script will transform this data into a structured form, which will then be used by Hive for further analysis. Automating the data pipeline: as we just learned, a data pipeline is an automated process that executes at regular intervals, so you can automate the entire data pipeline by calling your scripts from a cron job. Cron is a time-based job scheduler in UNIX systems from which users can call their scripts. So this is how we can automate the entire data pipeline in Hadoop to process data. Let's have a look at the sample data and the schemas we will be using in this demo. The first file is products: it includes categories of products and their corresponding URLs, and this is the schema of the file: category and URL. Users: this file contains details of the users who visited the website, and this is the schema: it contains user ID, date of birth, and gender. Logs: this is the semi-structured website log that includes data like timestamp, user ID, IP address, and, basically, clickstream data. Okay, so first of all we will upload the logs file into HDFS. Then we will process the semi-structured data into a structured form using Pig and will dump the processed data into another directory in HDFS. After processing the logs with Pig, the processed data will look like this: it will contain log date, IP, URL, user ID, city, country, and state. Then we will design an external table in Hive to read this processed data for further analysis. Since Hive provides JDBC/ODBC connectivity, we can connect visualization tools like Tableau, et cetera, to visualize and analyze the data. There is an assignment for you in this data pipeline. The assignment is: you have to join all three tables, that is, the processed logs, products, and users, to create a new flat table in Hive. Here I have given you a hint on how to join the tables, and the table schema should look like this: it should contain user ID, age, gender, country, state, city, log date, IP address, product category, and URL. Once you create the table, you should be able to answer these queries: top five products visited by users, geo-wise count of user visits, and geo-wise count of visits by gender and by product. So write Hive queries to answer all these questions. In this data pipeline, I will be processing the data with Pig only, and the rest should be done by you, as set out in the assignment. Now let's see how we can process the website logs using Pig. I have started my sandbox. Let me show you the file location where I have uploaded the log files: this is the location, and this is the website log data, the products data, and the users data. Let me show you the Pig script which I'll use to parse the logs. In the first relation, I'm reading the data from HDFS; in the second relation, I'm naming the columns; in the third relation, I'm converting state and country to uppercase; and finally I'm storing the results. Let's run this query in Pig. I'll go over to the Pig Grunt shell; it starts in a few seconds, and I'll paste my query. Now it will parse the logs and store the results into HDFS. We'll see the output in a moment; it's taking some time, let's wait. And the job is successful. Let's see the result: I'll go to the file browser, and in the sample data directory you can see the parsed logs. So now this is in a structured form; this is the output of the Pig script.
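A hedged sketch of that parsing step in Pig Latin is shown below. The raw log layout, the delimiter, and the paths are assumptions; adjust them to match the attached files before running.

```pig
raw    = LOAD '/user/sample_data/logs' USING PigStorage(',')
             AS (log_date:chararray, ip:chararray, url:chararray,
                 user_id:chararray, city:chararray, country:chararray, state:chararray);
parsed = FOREACH raw GENERATE log_date, ip, url, user_id, city,
                              UPPER(country) AS country, UPPER(state) AS state;
STORE parsed INTO '/user/sample_data/parsed_logs' USING PigStorage(',');
```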
That's all for this lecture. Go and complete your assignment today. Thank you. 27. Data Lake: Hello, welcome back. This is the last lecture of the course, and I hope your learning journey has been great so far. My main aim in this lecture is to give you an idea of how different companies are adopting a modern data architecture, the data lake, and how it can give more value to businesses. Pentaho CTO James Dixon is credited with coining the term data lake. As he described it in his blog entry: if you think of a data mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the data lake is a large body of water in a more natural state; the contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine it, dive in, or take samples. Data description and challenges: Exponential growth: an estimated 2.8 zettabytes of data in 2012 is expected to grow to 40 zettabytes by 2020, and 85% of this data growth is expected to come from new types, with machine-generated data projected to increase 15x by 2020, as per IDC. Varied nature: the incoming data can have little or no structure, or a structure that changes too frequently for reliable schema creation at time of ingest. Value at high volumes: the incoming data can have little or no value as individual records or small groups of records, but at high volumes and over longer historical perspectives it can be inspected for patterns and used for advanced analytic applications. So the aims of a data lake are these. Collect everything: a data lake contains all data, both raw sources over extended periods of time as well as any processed data. Dive in anywhere: a data lake enables users across multiple business units to refine, explore, and enrich data on their own terms. Flexible access: a data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory, and other processing engines. Now let's understand how the data lake approach is different from the traditional data warehouse approach. In our ETL vs ELT lecture, we saw that in the traditional data warehouse approach, data is collected from different sources, transformed by an ETL process, and then loaded into the data warehouse. The data warehouse was able to store only structured data; it was not able to store any semi-structured or unstructured data. We also saw its various limitations on the design side. Data lake: Hadoop provides a low-cost, scale-out approach to data storage and processing, since it is designed to run on large numbers of commodity servers, and we also saw in our previous lectures that HDFS can store any type of data and any size of data. So Hadoop has become the backbone of the data lake. A data lake captures everything, and all captured data is kept in its original raw format. And since Hadoop has many query engines like Hive, Pig, Mahout, et cetera, users can come to examine the data and dive in to get insights. Also, Hadoop provides easy integration with other applications outside Hadoop. So a data lake can deliver maximum scale and insight with the lowest possible friction and cost. To differentiate a data warehouse and a data lake: a data warehouse stores only structured or processed data, whereas a data lake stores any kind of data in its original raw format. A data warehouse is schema-on-write, whereas a data lake gives the capability of schema-on-read because of Hadoop's many query engines.
Storing huge volumes of data is expensive in a traditional data warehouse, whereas a data lake is designed for low-cost storage. A data warehouse is not that flexible when compared with a data lake; a data lake is more flexible in almost every respect. And a data warehouse is mostly used by business professionals, whereas a data lake is, as of now, mostly used by data scientists. With this, I'm wrapping up my lecture. I hope you find the course helpful. I wish you all the best for your career in big data, and I'd appreciate it if you leave your feedback and reviews. Thank you for taking the course. Have a great journey ahead.