Architecting Big Data Solutions - Use Cases and Scenarios | Kumaran Ponnambalam | Skillshare

Architecting Big Data Solutions - Use Cases and Scenarios

Kumaran Ponnambalam, Dedicated to Data Science Education

Lessons in This Class

37 Lessons (5h 21m)
    • 1. Intro to ABDS

    • 2. Traditional Data Solutions

    • 3. Big Data Solutions

    • 4. Current Big Data Trends

    • 5. Intro to Big Data Solutions

    • 6. Architecture template

    • 7. Intro to Technology options

    • 8. Challenges with Big Data Technologies

    • 9. Acquire overview

    • 10. Acquire options SQL and Files

    • 11. Acquire Options REST and Streaming

    • 12. Transport Overview

    • 13. Transport Options SFTP and Sqoop

    • 14. Transport Options Flume and Kafka

    • 15. Persistence Overview

    • 16. Persist Options RDBMS and HDFS

    • 17. Persist Options Cassandra and MongoDB

    • 18. Persist Options Neo4j and ElasticSearch

    • 19. Transformation module

    • 20. Transform Options MapReduce and SQL

    • 21. Transform Options Spark and ETL Products

    • 22. Reporting module

    • 23. Reporting Options Impala and Spark SQL

    • 24. Reporting Options Third Party and Elastic

    • 25. Advanced Analytics Overview

    • 26. Advanced Analytics Options R and Python

    • 27. Advanced Analytics Apache Spark and Commercial Software

    • 28. Use Case 1 Enterprise Data Backup

    • 29. Use Case 2 Media File Store

    • 30. Use Case 3 Social Media Sentiment Analysis

    • 31. Use Case 4 Credit Card Fraud Detection

    • 32. Use Case 5 Operations Analytics

    • 33. Use Case 6 News Articles Recommendations

    • 34. Use Case 7 Customer 360

    • 35. Use Case 8 IOT Connected Car

    • 36. Transitioning to Big Data

    • 37. Closing Remarks ABDS

About This Class

The Big Data phenomenon is sweeping across the IT landscape. New technologies are born, new ways of analyzing data are created and new business revenue streams are discovered every day. If you are in the IT field, Big Data should already be impacting you in some way.

Building Big Data solutions is radically different from how traditional software solutions were built. You cannot take what you learnt in the traditional data solutions world and apply it verbatim to Big Data solutions. You need to understand the unique problem characteristics that drive Big Data and also become familiar with the unending technology options available to solve them.

This course will show you how Big Data solutions are built by stitching together Big Data technologies. It explains the modules in a Big Data pipeline, the options available for each module, and the advantages, shortcomings and use cases for each option.

This course is a great interview preparation resource for Big Data! Anyone, fresher or experienced, should take this course.

Note: This is a theory course. There is no source code or programming included.

Meet Your Teacher

Kumaran Ponnambalam

Dedicated to Data Science Education


V2 Maestros is dedicated to teaching data science and Big Data at affordable costs to the world. Our instructors have real-world experience practicing data science and delivering business results. Data science is a hot and happening field in the IT industry. Unfortunately, the resources available for learning this skill are hard to find and expensive. We hope to ease this problem by providing quality education at affordable rates, thereby building data science talent across the world.

1. Intro to ABDS: Hi. Welcome to this course, Architecting Big Data Solutions. This is your instructor, Kumaran, here. First of all, thank you for registering for this course. I hope this course is going to help you in your career. Let's start with the goal of the course. The goal of the course is to educate students about big data solutions, their architecture and technology options, and to help them solve real-world problems. If you look at all the training and education material out there, you will find that there is a lot of material about individual technologies: Hadoop, Spark, big data, NoSQL, that kind of stuff. But you will hardly find anything that talks about putting them all together to build a complete solution. This course focuses at a much higher level, not on the intricacies of how each one works, but rather on how you would look at each of these technology options, take them, and stitch them together to come up with a big data solution and solve a real problem. So what do you get by taking this course? You appreciate the difference between traditional data solutions and big data solutions, and how they are really different from each other. You understand the modules of a big data architecture: what is a big data architecture, and what are the various pieces in that puzzle? Then we look at the various technology options available for each of these modules, which you can pick and choose and then stitch together to build a solution. Then you learn about each of these options: what are the advantages and disadvantages, and when do you use each of these technology options? You then apply this learning in eight real-world use cases. We go from simple use cases to complex use cases, trying to put the things that we learned together into building enterprise use cases.
And then, in general, you get an overview of the various big data technologies out there: their general advantages, shortcomings and use cases. That is going to help you in your job interviews, because the kind of question people typically ask in job interviews is to compare and contrast these various technologies, and I believe this is going to help you in taking some interviews. What do we have in this course? What is the course structure? We start with a comparison of traditional data solutions versus big data solutions. We look at an architecture template, showing how a typical big data architecture is laid out. Then we look at the various modules in that architecture. For each of the modules, we are going to look at what it is that we need to architect in that module, what features should be there, and what kind of things you need to take care of. And then, of course, we will talk about best practices for each of these modules. We then get down to technology options for each of these modules: what are the options available? We look at the advantages and shortcomings of each option, and when to use which option. Unfortunately, in the big data world, there is no one-size-fits-all kind of solution, so you have to go use case by use case to pick the right option, put it in there and use it. And then finally we look at eight enterprise use cases, to take the learnings from the earlier course material and apply them to build these solutions. You are going to be working at a drawing-board level on how the solution would look, trying to get all the pieces and stitching them all together. Things not covered: there is no programming covered in this particular course. This is an overview-level course that stays at a drawing-board level to figure out how an architecture will look.
We are not going to be focusing on any kind of programming in this course, and we are not focused on building a big data solution from scratch. When I say from scratch, I mean that you are not going to sit and code the entire solution yourself. Rather, you are going to look at existing technology options, take them, and use them as a part of your solution. And that is how, today, almost all big data solutions are built. We hope that this course is going to help you in your career. Best of luck in taking this course and also in your career. I hope this course is going to be really useful to you. Thank you. Bye.

2. Traditional Data Solutions: Hey, welcome to this lecture on how traditional data solutions work. When you are trying to look at big data architectures, one of the first things we need to understand is how they are different from the traditional data solutions that have been in existence for the last 20 to 25 years. So let's start by taking a look at the various characteristics of traditional data. Traditional data is all about numbers. That's where the whole computer industry started: looking at computers as number-crunching machines. A lot of the early applications, and a lot of the applications that were developed in the 1980s, the 1990s and even the 2000s, were mostly about number crunching. We are talking about business number crunching, things like finance, sales and payroll, where a lot of numbers are being created, and these computers were used to crunch numbers: summarize numbers, add up numbers and stuff like that. Traditional data also had a very well-defined schema. When I say schema, the structure of the data is very clear: there is an ID, which is a number of size 20; there is a name, which is a character field of size 45. That's a very, very well-defined schema.
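The fixed schema described above, a numeric ID plus a name limited to 45 characters, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, and the table name `employee` and the helper `insert_employee` are hypothetical names chosen for the example. SQLite does not enforce declared column sizes, so the 45-character limit is written as a CHECK constraint, and the "size 20" on the ID is not modeled.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")

# The structure is decided before any data arrives.
conn.execute(
    """
    CREATE TABLE employee (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL CHECK (length(name) <= 45)
    )
    """
)


def insert_employee(conn, emp_id, name):
    """Return True if the row conforms to the schema and was stored."""
    try:
        # "with conn" commits on success and rolls back on error.
        with conn:
            conn.execute(
                "INSERT INTO employee (id, name) VALUES (?, ?)",
                (emp_id, name),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised for a missing name, an over-long name, or a duplicate ID.
        return False
```

Calling `insert_employee(conn, 1, "Asha")` stores the row, while a 46-character name or a missing name is rejected by the database itself. The point is that a record either fits the structure or is refused.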
Typically the data conformed to the schema to which it was prepared. There is predefined linking between the data, that is, how one set of data links to another. For example, looking at payroll: how does payroll link to the employee records? There is an ID link. We talk about all these foreign keys that are used to link data to one another; that is well defined, and it is all known before the schema is even put into place. The attributes of the data hardly change. It is all pretty much predefined, because these are pretty standard applications and everybody knows what kind of data is stored in them. The data that is stored in the application doesn't change that much. Typically, the data resides within an enterprise. There was no concept of a cloud or anything like a centralized data center. The data belongs to an enterprise and resides within the enterprise. Of course, when the enterprise grows big, like the banks that are spread across the world, the data can get spread out. But typically, in a medium-sized business, it is all sitting within the enterprise, and maybe in a single location. There is a centralized data repository, a central place in which all the data is stored. There is possibly one huge server that manages all the data, and that is the place where all the data goes and gets stored. The backups that used to happen were offline, tape-based backups: backups are taken and then stored in a separate place. Restoring from the backups can take many hours whenever it is required. So this is how traditional data looks and works. In traditional data processing, there is a very small distance between the source and the sink. When I say source, I am talking about the UI, and the sink is the database.
So the distance between the UI and the database is small. Typically the data entry happens in a data center and the database sits in the same data center, so typically it is a wired link, a LAN kind of link. The data does not travel across a WAN, and typically there are very small distances between the source and the sink. The data transfer is pretty much instantaneous. It's not like the data has to be archived, moved around, staged and so on. The data movement is simple and straightforward: from the UI to the database, from the database to the data processor and back to the database, or from the database to the reporting layer. Very small distances, instantaneous transfers. Data moves to the application code for processing. This is a big difference between traditional data processing and big data processing: the data in the database is brought into memory, then moved across the network to the application server, and it is the application server that works on the data. Once the data is processed, it is put back into the database server for storage. So this is a big difference. Data sizes are small, and the data is moved across the wire to the application server for processing. Data validation happens at the source. When I say validation happens at the source: typically, the way data gets into these systems is through a UI, with somebody sitting there and entering data, and the UI does its validation on the data to make sure that no bad data gets into the system. For example, if there is a field called country, typically there is a drop-down list of countries, so you cannot enter a wrong value for the country. It is the same when somebody is entering a date. The date is validated at the entry level: you enforce the format in which the date has to be entered.
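The two source-side checks just described, a country restricted to a fixed list and a date restricted to an enforced format, can be sketched as follows. This is a minimal illustration, not a real UI framework: the country list, the `YYYY-MM-DD` date format and the function name `validate_entry` are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical drop-down contents; a real UI would load its own list.
ALLOWED_COUNTRIES = {"India", "United States", "Germany"}

# Assumed entry format for dates (for example 2024-02-29).
DATE_FORMAT = "%Y-%m-%d"


def validate_entry(country, date_text):
    """Return a list of error messages; an empty list means the entry is clean."""
    errors = []

    # Equivalent of the drop-down: only known countries are accepted.
    if country not in ALLOWED_COUNTRIES:
        errors.append(f"unknown country: {country!r}")

    # Equivalent of the enforced date format: reject anything that
    # does not parse as a real calendar date in that format.
    try:
        datetime.strptime(date_text, DATE_FORMAT)
    except (TypeError, ValueError):
        errors.append(f"date must be a valid date in the form {DATE_FORMAT}")

    return errors
```

An entry is saved only when `validate_entry` returns an empty list. A month of 13 or a country outside the list is refused before it ever reaches the database, which is why the traditional system can assume clean data downstream.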
If somebody puts in the wrong month or an incorrect date, an exception is thrown and the data is not accepted. So you make sure that data gets into the system in a clean fashion. There is no incomplete data and no dirty data. Some fields are mandatory in the UI; the UI is going to scream at you saying, "You need to enter this value, otherwise I am not going to save the data." That way, a traditional system does not have the issue of incomplete or dirty data, which is something we will come back to when we look at big data solutions. RDBMSs are very much built for number crunching. That is why they came up: they were number-crunching machines. They were bad at text processing, but they were excellent number-crunching machines, and they still are. Big data has not beaten the RDBMS in terms of number-crunching capabilities. Pre-summarized and pre-computed data: in traditional data processing you have a lot of pre-summarized data. You have all this transaction data, and then you pre-compute summaries and store them in summary tables: half-hourly summaries, hourly summaries, daily summaries, yearly summaries, summaries by department, summaries by product. You store all of these pre-computed and pre-summarized. In traditional data processing, reporting is primarily pre-canned. When I say pre-canned, there are already predefined reports. You go to a reporting system, and it has something like 50 reports already there. You enter the parameters, and it gives you back that report in a predefined format. You typically cannot determine, ad hoc, what format you want the reports to be in. Of course, there are systems that came up later that allow you to create your own flexible reports, but mostly the reports were very much pre-canned. What about traditional solution architecture?
So when you, as an architect, are trying to design a traditional solution, how does it look? Generally there is a single centralized data store in the middle, where all data is stored. There is typically what you call the three-tier architecture, where there is a presentation layer, a business layer and a data layer. Big data architectures typically don't have a presentation layer; there might be one, but the layering is not as simple as what you see in a traditional database system. The code is monolithic. You are not going to buy five different products and try to combine them. There is one huge body of code that is written from scratch using something like Java or C++, or something like Oracle Forms: you either build a monolithic piece of code right from scratch, or you buy a product from the market, and that product is usually a single product that does everything for you, like SAP or Oracle Financials. They are typically monolithic products: one huge product that you buy and deploy, and it does everything for you. There is hardly any integration between products, and even where integration exists, it is through custom interfaces. Standard APIs do not exist. General standards do not exist. So if anybody wants to integrate product A with product B, that typically requires a custom integration project. Any time you wanted to change the solution, its functionality or its data, it needed a full life-cycle project. You always start with version 1.0, then go to 1.1, 1.2 and so on; you keep going, and there are projects being run. Every project has its own requirements, its own business document, its own project planning, execution, tracking, bug fixing, everything. So this is how traditional data solution architectures look. What were the challenges people faced with respect to traditional data?
One of the biggest challenges is that text cannot be processed at all, or it cannot be processed in an economical fashion. Traditional data solutions cannot handle incomplete or dirty data. They assume that as the data comes into the system, it is already complete and already clean. Then there is, of course, the high cost of storing text data. If you look at any RDBMS, you see that typically the cost of the system itself varies with how much data you are going to store. Text data takes a lot of space compared to numbers, and it costs money in terms of hardware and software to store text data. Backup and restoration are time-consuming. Backups themselves were time-consuming. The traditional way RDBMSs work is that you have a backup process that runs every day, and it does have things like incremental backups. But whenever you get into restoration, the restoration process is always time-consuming if you have to restore data. There are high management and licensing costs associated with traditional data solutions. Typically, when you have an RDBMS, it has its own management and licensing costs. If you are buying a product off the shelf, like SAP or any of those ERP solutions, they have big costs associated with them. You need trained people in all these solutions to do things for you. And, of course, schema changes take significant time. If you want to add a column to a table in a production system that already has millions of rows of data, that goes through a lot of process before you can go and make the change. So there are a lot of these challenges that required a new paradigm to come in. That's why big data came in. So these are the traditional data challenges. There are, of course, more; you can come up with a lot more. I have just listed a few significant ones here. I hope this lecture has been useful to you.
Thank you.

3. Big Data Solutions: Hey, welcome to this lecture on how big data solutions look and how they stack up against traditional data solutions. So let's start with what big data is. We keep hearing this term a lot. What exactly does it mean? There are a lot of overlapping definitions of big data, but let us go with what Gartner has said: big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making and process automation. There is a lot going into that one sentence, so let's start breaking it down. When you look at big data, we are first talking about variety of data: not just numbers, but text, video, audio and a lot of machine data. Then there is volume. There is a lot of data, in the range of terabytes or petabytes. There is velocity, the speed at which the data is coming into the system. Typically, velocity is not under your control, because you are not controlling the number of UIs in which people are entering data. The data is typically generated by machines, by sources that are not under your control. So the velocity is not really under your control, and you have to plan for that. There is veracity of data: the data may be inaccurate, dirty or incomplete, and you need to be able to take care of all these kinds of things before you can use the data in a predictable way. So what led to big data? Why did this whole concept of big data come into the picture? What triggered it? The first thing is cloud adoption. As things started moving to the cloud, the amount of data that people have to deal with has multiplied, because one cloud deployment is typically supposed to handle multiple enterprises. Then there is social media, which led to an explosion of data created on the web. There are a lot of people tweeting, a lot of people putting in comments, and that is creating so much social media data. There is the mobile explosion. Every mobile device becomes a user interface in which data is generated, and everybody is doing something on their mobile devices that needs to be captured and analyzed. So that is a lot of data coming in. There is machine-generated data: data that is not entered by users but created by machines. Typically these are trackers or sensors that are reading something every nanosecond and generating a record every nanosecond, kind of thing. That is creating a lot of machine data: all these trackers, the Fitbits, the sensors in airplanes, the sensors everywhere in shops, cars and servers. Any kind of electronic machinery is generating a lot of data. And, of course, there is data-driven management that has come in. Traditionally, people made decisions based on intuition, but now they want to analyze data and use the data to make their decisions. So there is generally a need for more and more data analysis, on new kinds of data and with new kinds of analysis, and that is driving data-driven management these days. All of these have led to this concept of big data. So what do you define as a big data application? It's a very general term that everybody talks about, but in general, one of the following things needs to be true about a big data application. First of all, we are talking about data in terms of terabytes or petabytes. There should be more than one source or form of data: data generated from more than one source system, and in more than one form, whether that is text, audio, video or numbers, that kind of stuff.
Are we talking about a lot of text or media data, not just numbers? We are talking about huge processing loads: when data is in terabytes and petabytes and you are trying to take the data and do some huge transformation on it, you are talking about processing loads that are not going to fit into just one processor or just a few processors. There is real-time stream processing involved, which means that as data is coming into the system, you are processing it and generating some insight on the fly. There is advanced analytics. When we talk about advanced analytics, we are talking about machine learning, so there is machine learning involved in trying to analyze the data and come up with some insight on it. There is a big deployment footprint in terms of how much hardware is used. Typically, we are talking about a few hundreds or a few thousands of servers when you are trying to run a big data application. There are changing user requirements. User requirements are very fluid and flexible, because in the old traditional systems the data was fixed, so what the users expected out of the data was also fixed. But in this case, the data is changing. It is unpredictable what the data will show. As users look into the data, they realize that a new kind of analysis is required, because the data is showing something new and they want to do some different kind of analysis on it. And they can't sit and wait another six months for you to come up with a new report. So there are constantly changing, evolving user requirements based on what the data is giving them. And, of course, these big data applications should be relatively cheaper to build, run and maintain. So these are the characteristics of a big data application. So how do big data products in the market stack up?
There are a lot of big data products and technologies coming into the market. Typically, they are all open source. That's pretty good, because it reduces your costs significantly, almost down to zero. They support open integration: open integration technologies and open integration APIs, with a set of standards in place that makes it easy for you to integrate with any other product. They have very high interoperability, so it's easy for you to pull in a few components and stitch them together easily. They are constantly evolving, which is both good and bad. They are constantly evolving because these are new products. They have just come to the market and are still building capabilities. It is good because you are getting a lot of new features; bad because every version has significant changes and it is very difficult to keep up with them. These are also what I would call immature. When I say immature, it is not a bad thing; it simply means that the product is still evolving. The product has still not figured out what exactly it is supposed to do. You start with a new concept, you start building features, then users start asking for new features. It's not like an RDBMS, where we can say, OK, an RDBMS must have these features: 1, 2, 3, 4, 5. What does a big data product look like? What is Spark supposed to do? What is Hadoop supposed to do? People are still debating that. They are constantly adding features and changing features, and all kinds of things are happening. That is what I call immature. So this is how big data products stack up against traditional data solutions. Thank you.

4. Current Big Data Trends: Hi. Welcome to this lecture on current big data trends. This is your instructor, Kumaran.
So let's take a look at what is happening in the big data world. How are the products shaping up? Technology-wise, when you look at big data technologies, there are numerous companies and projects that have come up today. A lot of companies are coming up with a variety of new products. They are not even at 1.0, just 0.1 or 0.2 kinds of things. There are a lot of incubator projects coming up, and they are mainly open source, which is good, because it is pretty easy to look at them, check them out and try them. Mostly they are cloud focused: from a deployment point of view and a management point of view, they have been built for the cloud. A lot of these technologies focus on one thing. They do not come up with a suite that does a lot of things, or a full-fledged product offering; they concentrate on one specific area and one specific problem and come up with a product for it. And they typically have open interfaces for integration. So they focus on one thing, and what they say is: you can use this as a module in your solution, and it integrates very well with other modules or other solutions in the market. That's how they are coming up. A great example is the NoSQL market. There is no one NoSQL solution that is going to fit all your needs. Each NoSQL solution addresses a different use case, and each is very specific to and focused on that use case. That is happening everywhere. There are numerous products and numerous companies coming up with products, and you don't know which one to really choose, or whether any of them is really mature enough for adoption. There is phenomenal growth in adoption. When I say adoption, these new technologies are being adopted by other new-technology companies.
This is how it happens: somebody comes up with Hadoop, and Hadoop is used as a base for coming up with other products by another set of companies. There is a lot of adoption within startups, where they pick up one technology and try to come up with another technology on top of it, and that has been going on. There are a number of immature alternatives in each segment. Again, when I say immature, it is not a bad thing. It is just product evolution. Any new product or new technology that comes into the market will go through a maturity cycle, and currently a lot of these products are at an immature level. That's what it means. So what is happening in software product organizations? When I say software product organizations, we are talking about companies that produce software products, like Microsoft or Apple, to name two of the top companies in the world, and a lot of companies like that. So what are they doing? New product demands are driving new product features: cloud, social media and mobile are driving new features in the product. Everybody wants to be cloud enabled, social media enabled and mobile enabled, and that is driving a lot of product features inside their organizations. As you know, all of these generate a lot of data, and that is what they need to handle. Big data is considered necessary for cost savings. With any of these software products, you will typically see that they require a database, and the database comes at a lot of cost. So they are trying to use big data technologies, because they are open source: take big data technologies and wrap them into the product offering at almost no cost. So they are looking at big data as a cost-saving feature within the product, rather than having to rely on traditional database solutions.
People are asking for flexible ad hoc analytics capabilities inside their products. Any product in the world needs some kind of analytics, so they are trying to come up with flexible analytics capabilities, and those demand flexible schemas also. A simple example: let's start with your operating system. The operating system generates a lot of logs, error logs, and alerts. If you want to collect these logs and alerts from so many PCs or so many laptops, get them to a central place, and analyze them, that needs a lot of ad hoc analysis capabilities as well as big data capabilities. Everybody also wants to add advanced analytics capabilities to their reporting solution. If you are in retail or in the financial sector, you want to use the data to predict something. If you are making hardware products or software products, there is also the need to get logs from various deployments, analyze them, and try to predict failures: predict which system is likely to fail and then go fix it before it fails. A lot of things like this are happening in terms of advanced analytics in each of the software products. Now let's look at the other side of the world, which is enterprise IT. By enterprise IT, we are talking about a company that is primarily not a computer company: say it's a bank, or some other kind of business. All these companies have within them an enterprise IT department. You may call it the IT department or EDP or whatever people call it. What is happening within these departments about big data? They are curious and scared at the same time. They are looking at big data warily, because these IT organizations typically move at a very slow and stable pace. They do not go and adopt products just like that.
They typically spend a lot of time looking at new products before adopting them, and when they adopt a product, the product stays in their system for a long time. You don't just go and acquire a new product, because acquiring a new product takes them something like six months. After that they develop a solution on the product and deploy it, which takes some one to two years, and then the solution stays in their setup for another 10 years before they look at another new product. So it's a pretty slow-paced thing. Now they are looking at the market and thinking, there are so many things coming up, and they are kind of scared about what's happening around the world. They are mandated to adopt social, cloud, and mobile data because their parent organizations have to be involved in these areas. A bank, for example, does not have an option today. It has to be in the mobile world, it has to get into the cloud world, and it has to get into the social media world, because that's where the customers are. So, as a service, enterprise IT also has to adapt to all these new data sources, get the data, and start analyzing it. There is competitive pressure today to be data driven in the management world. Data-driven management is the new in thing, and if you are not data driven, you are considered really old. That's what people have been saying. They are also in watch-and-wait mode until the technology matures. Typically, enterprise IT organizations do not get themselves into immature technologies; they wait for the technologies to mature. They don't want to be the scapegoat or the guinea pig for a new technology, so they typically wait for version 2.0 or 3.0 before they start adopting. But the world is moving pretty fast. That's why they are curious and scared at the same time, because they don't know whether they can wait another five years.
Things may take that long to mature, but their own company may not wait for them and will move forward, so that's the spot they are in. They are starting a lot of proof-of-concept projects. That's why there is a huge amount of demand for big data professionals, because everybody wants to get into big data. They want to start off some projects and see how big data can fit into their organizations. They are also looking at moving towards the cloud for cost-saving purposes. This is what is happening in the enterprise IT world as of now. Hope this lecture has been helpful to you. Thank you.
5. Intro to Big Data Solutions: Hi. Welcome to this lecture on an introduction to big data solution architecture. This is your instructor, Kumaran. You may be an existing architect of regular traditional solutions, or you may be a student trying to understand how big data solutions are designed. Big data solutions are radically different from regular traditional data solutions, and that's what we are going to see in this lecture: how they are different. Let's start with what a big data solution is, or a big data application, if you want to call it that. Of course, the goal of a big data solution is to acquire and assemble big data, big data being the definition we have seen before, the four Vs. You are going to be looking at data from various sources, from RDBMSs to flat files to social media to mobile. It is also going to be in various formats: it could be text-based data, JSON, numbers, or media like voice files and video files. It can be anything. The next goal is to process and persist it in scalable and flexible data stores. So you are going to be processing and persisting the data in pretty large, scalable data stores, which are flexible in terms of schema and flexible in terms of what you can do with them. That's what we are going to be doing in a big data solution. You also provide flexible, open APIs for querying.
Whether it is an SQL interface or a REST interface, you provide good open APIs by which people can query the data. One thing to remember about big data solutions is that they do not really focus on the data entry part, nor really on the reporting UI. Those are end-user functionality. We are largely going to be focused on the back end: how you get the data, move that big volume of data around various places, and get things done. The next goal is to provide advanced analytics capabilities: machine learning, prediction, those kinds of capabilities, because big data has almost always been associated with them. Even if you start without this capability, you will pretty much realize as things go on that you want to add it, because all organizations today are looking at advanced analytics to help their business. And you use big data technologies to knit the solution together rather than building from the ground up. No enterprise is sitting and developing a big data solution from the ground up all by itself. Rather, they want to go and get some solutions that are in the market and kind of knit them together to create a solution, and that's what we are going to see later in the discussions. So how does a traditional application differ from a big data application? It is going to be different in a variety of ways. First, look at data acquisition. In traditional applications, data is acquired by data entry from end users: there is a UI in which people typically go and enter some data, and it is collected that way. In big data applications, the data is fetched from databases, machine logs, or social media. Now, in big data also, you might argue that there is some kind of data entry going on. At somebody like Amazon, for example, there are people entering data into a UI.
But a big data solution's purview does not usually include those applications; they are typically considered a different application. The data entry and data collection part typically goes into what they call an operational data store, or ODS, and your big data solution starts from the ODS, not really from the UI data entry. People can enter data in the UI, and it can get stored on hundreds of servers. The big data solution starts by collecting data from these hundreds of servers and then processing it. So data entry, you might argue, is a part of the big data solution, but it is better to keep it separate, because the technologies involved and the kind of skills required to build those applications are totally different from these kinds of applications. Next is data validation. In traditional solutions, validation is typically done during data entry. You typically have a UI with a form where the user enters some values, and validation happens then and there. If you enter a wrong date, you are immediately prompted for it. In fact, the UI prevents you from entering anything wrong by giving you lists of values or options to choose from rather than asking you to enter free-form text. Big data applications, on the other hand, tend to deal with a lot of dirty data, because it is typically free-form text, and there can be a lot of missing data, a lot of misspelled data, all kinds of stuff. So when it comes to cleansing, traditional data solutions do not have a cleansing step, because the data is already validated during data entry, whereas in big data applications the data is coming from the web or social media, and there is a lot of cleansing involved. When it comes to transformation, traditional solutions do not do many transformations. Typically, you do some summarization of data: you convert transaction events to records, and records to hourly summaries to daily summaries.
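To make that summarization step concrete, here is a minimal sketch in Python of rolling transaction events up into hourly and then daily summaries. It is only an illustration under assumptions: the event layout (a timestamp and an amount) and the names `hourly_summary` and `daily_summary` are hypothetical and do not come from the lecture.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical transaction events: (ISO timestamp, amount).
EVENTS = [
    ("2024-01-01T09:15:00", 20.0),
    ("2024-01-01T09:45:00", 5.0),
    ("2024-01-01T10:05:00", 12.5),
    ("2024-01-02T09:30:00", 7.5),
]

def hourly_summary(events):
    """Roll raw events up into one total per (date, hour)."""
    totals = defaultdict(float)
    for stamp, amount in events:
        ts = datetime.fromisoformat(stamp)
        totals[(ts.date().isoformat(), ts.hour)] += amount
    return dict(totals)

def daily_summary(hourly):
    """Roll hourly totals up into one total per date."""
    totals = defaultdict(float)
    for (day, _hour), amount in hourly.items():
        totals[day] += amount
    return dict(totals)

hourly = hourly_summary(EVENTS)
daily = daily_summary(hourly)
```

Each level is built from the level below it, so the daily numbers always agree with the hourly ones.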
That's what you are typically doing in a traditional application, whereas in a big data application you are doing things like text-to-number conversion and data enrichment. You also do summarization, but you do a lot more transformation work in big data. Next is persistence. Traditional solutions typically have one single centralized RDBMS. Big data applications, on the other hand, are distributed and use polyglot persistence, which means you would use different data store types. You might combine an RDBMS with a NoSQL database in order to get the things you want. Then look at application architecture. Traditional solutions typically have what you call a three-tier or n-tier architecture, centered on the business layer, whereas big data applications are data centered and integration oriented. In a business-layer application, there is a central business layer to which data is moved: data is moved from the data store to the business layer for processing and then back. In big data applications, you are not going to be moving the data; that's pretty costly. Rather, you are going to be moving the code to where the data is. Finally, look at usage. In traditional applications, you are talking about reporting, analytics, statutory data reporting, and stuff like that. In big data, you are a bit more focused on advanced analytics: machine learning, predictive and prescriptive kinds of analytics. Different kinds of use cases drive traditional and big data solutions, and it's important for you to understand how they are different. As an architect, you are going to be looking at them in a different way compared to traditional solutions. One of the biggest things you are going to face when looking at big data is the differentiation between historical and real-time data.
Now, traditionally, if you look at a regular business application, what happens is that you collect data in real time only, and that data is then used for all kinds of historical purposes also. But given the volume of data you are dealing with, it is not possible to really do everything in real time in a big data solution. Because of the stream of data being generated and the volume being generated, it is not really possible to build a solution that processes every piece of data in real time. Even if you wanted to, it is going to be really, really costly, because you have to design your solution to take care of the maximum load, and that could be really, really high. That's also because in traditional applications you have some control over the data inflow. For example, if you are building a traditional financial accounting application, data entry is typically done by users. That's typically a lot slower, and you control the number of clients. You know there are going to be 50 people or 100 people entering data at any point in time; that is the maximum load. Whereas when you are looking at social media, you don't know how many people are going to be tweeting about your company, and you cannot really control it. There could be real spikes in terms of the data coming in. So there is a difference that you need to acknowledge between real time and historical. You can build a solution that does both at the same time, but that's going to be seriously costly. The way I would compare them is that real-time data is like a sports car and historical data is more like a truck. They have very different functionalities. If you want to combine them, that means you are trying to build a vehicle that has the capabilities of a sports car and the capacity of a truck, and that is going to be a nearly impossible thing.
And even if you build one, it is going to be very costly. Great. So let's look at how historical and real-time data differ from each other. Historical data is store and forward. Real-time data is streaming: the data is coming in, and you are just sitting and listening to it, and there can be a lot of load. With historical data, you typically go pull the data; real-time data is being pushed to you. With historical data, you are really looking at end-of-day processing or end-of-hour processing, so batch processing is going on, whereas real time is event-based triggering: events are happening, things are being pushed to you, and as events happen, you have to respond. With historical data, you also deal with completed records. For example, take a web session. The historical record is created after the web session or login session of a user is over, so that is done afterwards. In real time, you get live updates: as the user clicks every link on the web page, you get an event, so there are live records being maintained. If you are maintaining one record per session, you create that record when the user logs in, and you continuously update the record as more and more actions happen on the user side. With historical data, whenever there is missing data, you do a full publish or republish. In real time, it is always the delta that is being published. You don't publish the entire data; rather, you publish deltas, and you need to know how to handle delta data. With historical data, one of the requirements is no loss of data. You can be slow, but you can't lose data. With real time, the requirement is that it has to be fast, but there could be possible loss of data. Real-time data is only going to be used for some key reporting purposes, so you might be looking at a smaller set of data, and it's okay if there is a little bit of difference in terms of numbers.
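The web session example can be sketched in a few lines of Python. This is only an illustration under assumptions: the event layout and the names `apply_delta` and `build_completed` are hypothetical. The real-time path updates one live record per session as each event is pushed in, while the historical path builds the completed record in one batch after the session is over.

```python
# Hypothetical click events for one web session: (session_id, event, page).
EVENTS = [
    ("s1", "login", "/home"),
    ("s1", "click", "/products"),
    ("s1", "click", "/cart"),
    ("s1", "logout", "/home"),
]

def apply_delta(live, event):
    """Real-time style: update the live session record with one pushed event."""
    session_id, kind, page = event
    record = live.setdefault(
        session_id, {"clicks": 0, "last_page": None, "open": True}
    )
    if kind == "click":
        record["clicks"] += 1
    record["last_page"] = page
    if kind == "logout":
        record["open"] = False
    return record

def build_completed(events):
    """Historical style: build the finished record in one batch afterwards."""
    return {
        "clicks": sum(1 for _, kind, _ in events if kind == "click"),
        "last_page": events[-1][2],
        "open": False,
    }

live = {}
snapshots = []
for event in EVENTS:
    # Keep a copy of the live record as it looked after each event.
    snapshots.append(dict(apply_delta(live, event)))

completed = build_completed(EVENTS)
```

Once the session is over, the live record and the completed record agree; the difference is that the live record was usable, in partial form, the whole time.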
Historical data is used for detailed analytics, whereas real-time data is used for snapshot or intraday analytics, or immediate analytics, whatever you want to call it. When it comes to advanced analytics and machine learning, historical data is used only for model building: you build a model to predict something. Real-time data is what is actually used for making a prediction. So there is a lot of difference between how historical data is created, processed, and used versus real-time data. This is just for you to get a picture of how they are different. Hope this has been helpful. We will keep moving on to more discussions like this when we get to the architecture. Thank you.
6. Architecture template: Hi. Welcome to this lecture on the architecture template for big data solutions. In this lecture, we are going to see what an overall big data solution looks like and what the various modules in a big data solution are. We will be working on the details of these modules in the later lectures. So let's start with what the various modules of a big data solution are, the various components. In the case of regular traditional data solutions, the modules look a lot similar to each other in terms of the code base or how the UIs look. That's because you build them all from scratch; they just have different functionality. In big data applications, there are a lot of differences between these modules in terms of what I would call their shape, size, and that kind of stuff, and you would be using different technologies for each of these modules. So let's start with the first module, which is the acquisition module. The acquisition module's job is to connect with your data sources and acquire the data. The focus here is on connecting to them and getting the data. And of course, the connections can be in either batch mode or streaming mode.
Whether in batch mode or streaming mode, there can be multiple formats for the data that is coming in. Then comes the transportation layer. There is a big transportation layer involved in big data, because the sources of the data are typically very far from where the destination is. So there is a significant transportation effort involved in moving data over the Internet and over organizational boundaries. The collection points can be pretty numerous: they can be on the web, in the cloud, or in different data centers, and the data has to be moved. This is big data, not small data, and it has to be moved across all these organizational boundaries to get to the destination. Then comes persistence. Persistence is storing data, and persistence in a big data solution can be polyglot, which means you would be using different types of data sinks to store different kinds of data. There is not going to be one solution that fits all. Unfortunately, as we will see in the later lectures, we do not have a one-size-fits-all solution, so you might be using different database types to store different kinds of data. Transformation is a long process that involves getting the data, cleansing the data, linking, translating, and summarizing. There are a lot of activities happening in the transformation layer of a big data solution, so it is a significant layer in a big data solution. Then there is reporting. You of course want to use the data for some kind of UI-based reporting, and you also want to provide some APIs by which third-party applications or other applications can get and use this data. So that is the reporting module.
And finally there is the advanced analytics module. The advanced analytics module handles things like machine learning, predictions, actionable predictions, and that kind of stuff. Now, if you look at all these layers, not all layers are mandatory for a big data solution. It depends upon what you want to achieve in the big data solution. Sometimes it is simple; sometimes it is complex. So these are the various modules that are typically involved in a big data solution. And finally, there is a management layer. The management layer's job is to manage all of this. For the management layer, there are pretty few options available. Typically, the individual technology options each give you some management capabilities, and then you might want to build a management layer on top to gather all the data and present it for management purposes. So let's look again at how the template looks. We start with the acquisition module, whose job is to acquire data from wherever the sources are. Then there is the transport module, whose job is to get the data and transport it over various layers to get to your destination, the destination being a big data store. In the big data store, you are going to be doing polyglot persistence, possibly, because you might be storing the data in different databases. Then there is a transform layer, which is a series of jobs that can do a number of activities. The transform layer will typically read data from the persistence layer, transform the data, and put it back into the persistence layer. Sometimes transformation can also happen in the transport layer itself, as the data is moving; that is also possible if it is a real-time kind of data system. But typically, the transformation module is a series of batch processes that work on the data in the persistence layer: they read from it, transform it, and write back to it.
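The first four modules of the template can be sketched as plain Python functions chained together. This is a toy illustration under assumptions, not a real implementation: the in-memory `SOURCES` and `store` dictionaries are hypothetical stand-ins for real data sources and data stores, and all the function names are made up for the example.

```python
import json

def acquire(sources):
    """Acquisition: connect to each source and pull its raw records."""
    for name, records in sources.items():
        for record in records:
            yield name, record

def transport(acquired):
    """Transport: serialize each record as if it were crossing a boundary."""
    for name, record in acquired:
        yield name, json.dumps(record)

def persist(store, transported):
    """Persistence: land each record in the area that suits its source."""
    for name, payload in transported:
        store.setdefault(name, []).append(json.loads(payload))

def transform(store):
    """Transformation: batch job that reads from the store, cleanses the
    data, and writes the result back to the same store."""
    cleaned = [
        {"user": r["user"].strip().lower(), "text": r["text"].strip()}
        for r in store.get("social", [])
        if r.get("text")  # drop records with missing text
    ]
    store["social_clean"] = cleaned

# Hypothetical in-memory sources standing in for real systems.
SOURCES = {
    "social": [
        {"user": " Alice ", "text": "great service "},
        {"user": "bob", "text": ""},
    ],
    "logs": [{"level": "ERROR", "msg": "disk full"}],
}

store = {}
persist(store, transport(acquire(SOURCES)))
transform(store)
```

Note that `transform` never talks to the sources; it only reads from and writes to the persistence layer, which is the batch pattern described above.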
Then, of course, there is the management layer, which can go and manage all these various modules, work on them, and see how they can all fit together. There is a reporting layer, whose job is to provide a way by which users can look at the data in the persistence layer and do some reports, get the data out, and do some graphics. It can be just visual things, or it can be an API for you to get the data out of the system. And finally there is the analytics layer, the advanced analytics layer, which can read this data, perform analytics, and then write back to the same persistence layer. So these are all the various modules, and the diagram shows you how they typically work together. This is what a typical big data system will look like. Depending upon your use case, some modules might be big, some modules may be small, and some modules may be nonexistent, but this is the overall picture. We will be exploring each of these modules in detail in the upcoming sections. Thank you.
7. Intro to Technology options: Hi, welcome to this lecture on technology options. During this entire course, we are going to be discussing a lot of technology options for building big data solutions. This lecture gives a little introduction as to what we are going to be doing around that area. About technology options: in this specific course, we are only going to be discussing popular options. There are so many options available today. A lot of them are upcoming; a lot of them are at version 0.1 or 0.2. Everybody has kind of come up with a big data solution, so there is a huge list, and we don't want to be going through each one of them, because that would be pretty boring. We are going to be looking at only a few popular options, and we are not going to be getting into a very detailed discussion.
We keep the discussion light because each of these popular options is a course in itself if you want to get into the real details. What we are going to focus on are the salient features of each particular technology option or product, and its advantages and shortcomings. We are trying to take more of a comparative view: what are the differences between some products, and where are they most useful? We focus on the advantages, shortcomings, and use cases where we are going to be using them. We definitely encourage you to seek out other resources for deeper learning of these technologies. We did think about adding more content, but then we thought about how much content that would be, especially for a theory-only course. There is no point coming up with a 30-hour lecture on big data technologies in a solutions course. You can always seek out other resources to learn more about these technology options. There is a big difference in the way traditional solutions are built versus how big data solutions are built. Traditional solutions are typically built from scratch. These are monolithic applications: one huge application that is homegrown. You build the whole solution in house, or you buy the whole thing from the market, from a vendor. Typically you would build an entire ERP solution in house using something like Oracle Forms, or you buy it from a vendor, but it is going to be one monolithic application. It is typically written in one single programming language: thousands of lines of code are written in a monolithic application to build everything that you want. Typically there is a single centralized data store in this application, and there are typically high development and maintenance costs. This is how traditional applications have been built so far, whether you build one single application or buy it.
That one application does everything for you, right from start to end. That's how applications have been built so far. But the new way of doing it, the big data way, is the assemble-and-stitch way. In the assemble-and-stitch way, rather than trying to build everything from scratch, you assemble pieces of technology from various options and then stitch them together. One of the reasons why you need to assemble and stitch is that there is no one solution that fits all, no one technology that fits all the needs in the world today. It might come later, but today that is not the case. You have to choose the best of breed for every module that you have, and then you assemble and stitch them. Big data processing has two common demands: massive scalability, and reliability at that scale. Both are things that you cannot build easily from scratch. They require a significant amount of effort and programming. That is one reason why you don't want to build big data solutions from scratch. Rather, you want to piggyback on a technology that is already available. And what is already available is a number of products and technologies, especially open source ones. It's a good thing that they are open source, but at the same time there are too many of them. There are a lot of options, and a lot of people are actively building solutions. They typically support excellent open integration. That's a good thing: these technologies work well with each other and have support for each other, so you can easily get them and stitch them together. So how are you going to go about building your application? You first acquire the most suitable components for your solution.
You first understand your use case, then come up with a solution and say, these are the best components for my solution, and then you go get them. Then you stitch them and integrate them to create a solution. This way there is minimal custom work. You want to focus on minimal custom work, and that also means a very fast time to production. Big data projects shouldn't be running for two or three years; they should be more like two-month or three-month projects. Your job is to come up with an architecture that uses a lot of existing components, pull them in, stitch them together, and then deploy the solution. That's the way you are going to get pretty fast time to production. Thank you.
8. Challenges with Big Data Technologies: Hi. Welcome to this lecture on challenges with big data technologies. We are always excited about this whole big data world. It's all so new and grand, and we all want to get into that world and do it. But there are a lot of challenges when it comes to using these big data technologies from an enterprise point of view and an end-user point of view, and that's what we are going to see in this lecture. The first problem is that there are too many options. If you look at something like databases, in RDBMSs there are very few options: there is Oracle, there is MySQL, there is MS SQL, there is Postgres. There are very few options, with a very clearly defined market for them. But when the RDBMS field started, if you go back some 20 or 25 years, there were like 20-odd RDBMS products at that time, because the field was new and everybody was trying to come up with their own RDBMS product. Then, after some time, things settled down. Some products grew to become market leaders.
Some products went down. You used to hear about things like Ingres and Sybase, and HP had its own RDBMS product and Sun had its own stuff, and all those died down. Big data today is at a pretty similar stage to where RDBMSs were 20 to 25 years ago. It's a new field, everybody is trying to come up with a product, and there are too many options at this point. Everybody is thinking, okay, I think I can do something here; let me go build a product that I think is going to be new and exciting. And every product addresses a narrow, specific need. Nobody is building a product that says, okay, I am going to cover the entire gamut of big data. Rather, everybody is building a product for a specific use case, for a specific module. There is nothing that covers everything, and there is no one-size, one-product-fits-all situation at this point. Everybody is also trying to expand to cover other use cases. That's what typically happens: everybody starts with a product that covers a specific use case or domain, and then they try to expand it to cover everything else. That's the stage a lot of these products are in. The problem is also that replacement technologies are being invented at a very fast pace. If you go back four years, you would have heard about MapReduce being the be-all and end-all of big data processing. But then people figured out that MapReduce had its problems, and they came up with Apache Spark. Now everybody is like, okay, Apache Spark is the thing and MapReduce is dying. And before you can say that, there is something called Flink coming up to compete with Spark, and you are like, hold on a minute, Spark is good, so why is there a Flink? People are coming up with newer and newer technologies, so at this point things are not settling down.
That's a big issue for a big data architect, because when you are trying to come up with a solution, you want a solution that can stay and work for at least another five years. You are not going to build a solution that can only work for six months and then die. That means that the products you are using as part of the solution should also be robust and should live and grow during that period. Most of these products are immature and incomplete. Immaturity in a product is not necessarily a negative thing. It just says that the product is still at a very childlike stage: it was just born and it is still growing. It is not fully mature in terms of its capabilities, and it is not yet where it is supposed to be in terms of stability. That's where we are. And they are incomplete because the products are just doing the main thing they do. They do not have things like management capabilities, monitoring capabilities, and stuff like that. These products are still in a toddler or teenager kind of mode; they still have to grow and mature. These products also have very high churn. Things are changing very fast, which means that new libraries are coming up, old libraries are getting replaced, and interfaces are getting replaced really quickly. That creates a lot of churn in terms of what we want to use. Field support and services are still very primitive. That's one issue when you are building a big data solution: for the products you are using in your solution, you want to have some support and service. They still need to address things like administration and usability. There is also going to be a shortage of skilled and experienced personnel. If you need people to implement your architecture, you need to be ready to spend some money to get good people for that.
It is difficult to predict the future of a lot of these products, because things are changing very rapidly. It is also not future-safe, because technologies are going out of vogue before the first release of the application. By the first release, I'm talking about the 1.0 release. A lot of the products are still in sub-1.0 mode. They're still being adopted and used, but things are changing pretty quickly. Enterprises typically like their investment to be safe for at least 10 years. When they invest in a technology, they want the technology to stay in the market and keep growing in the market. The good thing about big data is that it's all open source, so it's cheaper. But there are also costs associated with acquiring the technology, putting the technology in place and using the technology, and people do not want to be in a situation where their technology gets outdated pretty fast. The companies supporting these technologies are mostly small companies and startups. That's a little worrying, because startups can change pretty quickly: they can get acquired and they can change direction. The product you're trying to use can all of a sudden lose support. Most of the time you're using a product that is free, so there is no commitment from the vendor that they will continue to support the product. So this is a fluid thing that you need to work on. And, of course, market size and deployment size, that is, how many people have adopted the technology, are also pretty small. If you're trying to choose a technology, you need to consider all these things, because you want to use a technology that is stable. What you can expect in the next 5 to 10 years is that there will be a few products that will grow and become market leaders. We don't know which ones, but there will be a few products that will grow and become market leaders.
Maybe you can take a bet on certain things based on wider adoption. The more people are trying to use a technology, the better the chance that the technology will stay and grow. You can see a lot of merging of products starting to happen, so that they get closer and closer to the one-size-fits-all situation. You will have fewer, more mature options, and they will possibly have stable features: APIs are not going to change, libraries are not going to change, so things are a lot more stable. Then how do you make your investments future-safe? I mean, how do you as a big data architect choose technologies that are going to stay in the market as they evolve? Look for product and developer support. Companies that offer product and developer support typically have a better chance to stay in the market. Look for cloud options. That is one good option to go with, because when you pick a cloud option, there is typically a chance that the provider will take care of a lot of the upgrades, changes, compatibility issues and so on. Look for adoption by leading companies and products. If the technology is being used by some leading companies or by other products, it means there is typically a support network there. There are people who are willing to pay money to keep the product alive. So that's something good. And look for open APIs and open data formats. In case you have to switch technologies, it becomes easy for you to do that switch. As a big data architect, you can say, okay, I'm going to wait for five years. That's fine, but it means you might be losing a lot of business opportunities. You don't want to be doing that. At the same time, the technology is still at a very nascent stage, and until the technologies mature, you have to work around them and work through them to get these solutions running.
So that's a big challenge for a big data architect at this point. Hopefully, this lecture has been useful to you. We'll be getting into more details about these technologies in the later lectures. Thank you. 9. Acquire overview: Hi. Welcome to this lecture on the acquisition module. Let's take a deep dive into what this module is supposed to do and what the best practices are. Let's start with the responsibilities of the acquisition module. Its main responsibility is, first of all, to connect and maintain a connection to the source. You might have more than one source, and each source might have its own connection module. Its main job is to connect and maintain the connection. Sometimes it is a batch process: you just connect, do things and disconnect. With a real-time source, you're going to connect and keep that connection going forever. You need to execute protocol responsibilities. When you're connecting to a source, there are some protocols involved depending on the type of the source, whether it is a REST API or SQL, and responsibilities like reconnects, handshakes and error handling all have to be handled by the acquisition module. Data format conversion is a key responsibility, because most probably you want to store the data in a format suited for big data and analytics. It could be a JSON format or a sequence file format, and the source data most probably will not come in that format, so you might be doing format conversion. The acquisition module might also do filtering, to drop data that is not required. It can do local caching: typically, when you're looking at a real-time data source, the source may generate more data than the pipeline or the transport layer can handle, so you might be doing some local caching. Compression is also a key responsibility when you're transmitting data, especially text data, over the wire.
You want to compress it so that it uses less bandwidth and can move a lot faster. And there is, of course, encryption. You want to encrypt the data when you transfer it over the wire so that there is no theft of data while you're doing the transfer. You need to be concerned about this because a lot of data theft happens when data is sent over the wire. If you're working with sensitive data like customer information or credit card information, there is a possibility of data theft. What are the various source types you have to deal with? You have to deal with databases: RDBMS databases, or even NoSQL databases. You're going to be looking at data that is sitting in databases and in data warehouses. You might be dealing with data in files. File data could be anything: it could be media data like recordings, videos and images, or it could be RDBMS data that is provided to you in files, because there could be a third party where you have no connection to the third party's data source, but they give you the data in files. It might be HTTP and REST. In the case of real-time data, there could be real-time data streams coming from any kind of application. And there could be custom sources: if your own custom in-house application is generating the data, you might want to connect to it through some custom interface and get the data. So the source could be of a lot of types, as you can see. What are the things you need to consider as an architect when you're building an acquisition module? First, you need to think about how you can identify new data. So there is a data source.
It's not going to be a one-time read; you're going to keep going back to the data source every day or every hour, and you're going to be getting new data. The most important question is: how do I identify what data is new? Typically, in an RDBMS, you might look at a primary key, which is a continuously generated number, and keep track of which number you got last time, so you know where to start from. Or you can look at timestamps and do the same. This is pretty important because you do not want to be getting duplicate data and you do not want to be missing data, so it is very important that this is tracked clearly and cleanly. Next is reacquisition and retransmission. In case there are transmission errors or acquisition errors, how do you go about reacquiring the data and retransmitting it, once again making sure you're not missing any data and not double transmitting anything? This is another important thing you need to consider when you're architecting a solution. Data loss: how do we prevent data loss? How do we not miss anything? That's another thing you as an architect have to consider while you're building it. Buffering at the source, in case the transport layer cannot cope: there are going to be spikes of data coming from the source. Will the transport layer be able to handle them? Otherwise, you need to do some kind of buffering at the source. Buffering also means that the buffered data is stored in a very secure and reliable way so that we do not lose any data in flight. Keeping data in memory could be risky: what if the box goes down, or the box starts choking, and then you lose the data? The most important thing here is not to lose data while you're buffering. Then there is security: the source provider might have its own security policy.
The source provider might be a DBA, an internal DBA or an external third party. They might have some policies, some security requirements that you need to consider. Also remember that when you're moving data across the wire, there is a big security concern of somebody stealing the data. So this is something you need to consider. Privacy: if you're getting other people's data, let's say from the Internet, from Twitter or something like that, consider privacy. Make sure that you do not get any information that you're not authorized to get and that you're not invading people's privacy. This is something else to keep in mind as an architect. And finally, you need to architect for alarming. In case something is going wrong, there has to be a system through which alerts or alarms are raised, so that administrators can monitor the system and, when things go wrong, take a look quickly. Alarming is a key management feature that you need to take care of when you are architecting a solution. What are the best practices recommended for architecting the acquisition module? The first thing is to involve the source owners to establish a good handshake. When you're building an application, discuss with the source owners how you can make this acquisition module reliable. For things like handshakes and protocols, the only way you can make them robust is by actually talking to the owners, getting them involved in the solution and getting them to help you come up with a protocol that is really safe and secure. Security, privacy and those kinds of things: talk to the source owners and establish them. Make sure you work with them to identify new data. If you're coming up with a scheme that says this is how I'm going to be identifying new data, they also need to validate that it's going to work that way.
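The idea of keeping track of the last primary key you fetched, discussed earlier in this lecture, can be sketched roughly as below. This is only a minimal illustration, not a definitive implementation: the `events` table, its columns and the `fetch_new_rows` helper are hypothetical names made up for the example, and an in-memory SQLite database stands in for the real source.

```python
import sqlite3


def fetch_new_rows(conn, last_seen_id, batch_size=1000):
    """Return rows whose id is greater than last_seen_id, plus the new watermark.

    Hypothetical helper: assumes the source table has a continuously
    increasing integer primary key named id.
    """
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen_id, batch_size),
    ).fetchall()
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark


# Stand-in source database with three existing rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)])

# First run: start from zero and pick up everything.
first_batch, watermark = fetch_new_rows(conn, last_seen_id=0)

# New data arrives at the source between runs.
conn.execute("INSERT INTO events (payload) VALUES ('d')")

# Second run: only rows after the stored watermark are fetched.
second_batch, watermark = fetch_new_rows(conn, last_seen_id=watermark)
```

In a real solution the watermark would be saved somewhere durable, and only after the batch has been safely handed to the transport layer, so that a failure leads to a re-fetch rather than to missing data. Whether an ID or a timestamp is the right watermark is exactly the kind of thing to validate with the source owners.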
Also, they can't be going and changing their source data schemas without informing you. Those kinds of things need to be worked out with the source owners. Go for reliable and open APIs. Always use open APIs rather than custom ones wherever possible. The reason is that these open APIs are already built and used by a lot of people, so there is a lot of reliability there. Plus, it is easy for you to switch products, because when products conform to open APIs, it is easy to bring in a different product and use the same APIs once again. So go for open and reliable APIs. Native APIs and formats: you're going to be getting native APIs and formats a lot of the time. They should be standardized as early as possible. You might have data from four sources, each in a different format. Standardize them as quickly as possible, which means possibly at the source itself: use a converter or something like that to convert them to a standard format before even transmitting. The earlier you do the conversion, the easier it is to handle the data later, because if you keep four different formats, you have to write possibly four different transport layers and four different transformation layers later. So you want to convert to a standard format as early as possible. Real time and historical: there is a tendency to use the same channel for both, which is not a bad idea. Think about whether you can use one channel to get both the real-time response times and the reliability of the historical channel; if not, possibly consider separate channels. Again, look at the use cases for what real time is supposed to do and what historical is supposed to do. If they are very different, then possibly consider separate channels. And do pay attention to privacy and security. Those are pretty important things that can come back to bite you any time later. So think about them.
Based on that, design your acquisition module. Thank you. 10. Acquire options SQL and Files: Hi. Welcome to this lecture on options for acquiring data. Let's start looking at what kind of options we have to connect to data sources and acquire data. We talked about a list of things that we need to architect for and some of the best practices, but all of that is possible only if the source supports some way by which they can be implemented. The first option we're going to talk about is the SQL query, or "sequel" query, however you want to call it. If the source is an RDBMS, SQL is a given, and SQL is really powerful in that it can do a lot of things for you. You may ask why, in big data, we should bother about SQL. It is to acquire data, because that's the way to go with an RDBMS kind of source. It is the traditional way of extracting data from relational databases, and the good thing about SQL is that it's a mature technology. It has been around for a very long time and is really mature in terms of its capabilities. A lot of the implementations are really well optimized. It has the ability to transform data: you can do joins, GROUP BY, CUBE and filtering. These are great capabilities because you can use them at the source. For example, you can do joins and denormalize data at the source rather than having to do that in the transformation layer, which minimizes the amount of work you have to do later. So if you have data that needs to be joined, filtered or even summarized, it's better to use SQL itself to do all that work for you up front. That reduces the amount of data you have to transfer, and it also reduces the transformation steps you do later in your processing. That's a great thing about SQL. Of course, SQL databases support indexing, which takes care of performance. You can do all of this without any programming.
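To make the point about joining, filtering and summarizing at the source concrete, here is a small sketch. It is illustrative only: the `customers` and `orders` tables and their columns are made-up names, and an in-memory SQLite database stands in for whatever RDBMS the source actually runs.

```python
import sqlite3

# Stand-in source database with two normalized tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL,
        status TEXT
    );
    INSERT INTO customers VALUES (1, 'east'), (2, 'west');
    INSERT INTO orders VALUES
        (1, 1, 10.0, 'paid'),
        (2, 1, 5.0, 'paid'),
        (3, 2, 7.0, 'paid'),
        (4, 2, 99.0, 'cancelled');
    """
)

# Join, filter and summarize inside the database, so that only the
# small summarized result has to leave the source.
SUMMARY_QUERY = """
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.status = 'paid'
    GROUP BY c.region
    ORDER BY c.region
"""
summary = conn.execute(SUMMARY_QUERY).fetchall()
```

Four order rows go in and two summary rows come out, which is the effect described above: less data to transfer and fewer transformation steps later in the pipeline.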
SQL also supports encryption and compression, which is a good thing. When you extract data with SQL, you can transmit it over the wire directly, or you can put it in files and move the files. SQL also gives you the ability to go after incremental data. That's a great thing with SQL, because it supports querying by timestamps or by ID fields. Those are some of the good things you can do with SQL. Now let's look at the advantages and shortcomings of SQL. The advantages are that it has extensive support from various programming languages, tools and products. It is a mature technology, which means you're going to find a lot of tools that support SQL, with JDBC and ODBC connectivity and so on. And, of course, a lot of people know how to use it. It's a very popular and mature technology, so you can easily find skills and products that work with it. It supports incremental fetches as well as filtering. The good thing about SQL is that you can do a lot of that up front. The shortcomings of SQL: it is limited to databases. Not all data sources support SQL, which is kind of a bad thing because SQL is very powerful and lets you do a lot with very minimal work. The other thing is inter-organizational barriers. When you're trying to acquire data, you do not always get the option to connect directly to a database. If you get that option, yes, you can use SQL. But a lot of people do not give direct access to their databases, so you need to go through an application layer or an API layer, or they may even say there is no direct access at all.
Rather, they will extract the data into files themselves and give you the files. In those cases, you cannot really use SQL. That's one of the limitations: SQL itself is very powerful, but you do not get a lot of opportunities to use it unless the data source is controlled by you and you can go directly to the data source and pull the data. Use cases for SQL are RDBMS sources, especially when the RDBMS sources are within the enterprise. You can go and quickly pull data through SQL, and, of course, even data sitting in Hive or other Hadoop-related sources can be pulled through SQL. It is pretty powerful, except that sometimes SQL access is not provided to us. That is the only issue we have with SQL, but it's a very powerful tool that you can use for data extraction. The next option you have is files: getting data as files. Files are a simple and common way of exchanging and moving data. Almost any data source owner will be willing to give you data in files. You don't have to connect to their systems directly, so there is no issue of them worrying about things like, okay, are you going to connect to the database and mess around with the data? There are no questions like that. Any DBA will be able to give you RDBMS data in a file format, and an application, for that matter, typically has a data extraction capability that can give you data in file formats. Files are a very common way of exchanging data, especially in an inter-organizational situation, and there are very standard tools for moving files, encrypting files and so on. So it's a very simple and commonly used method for exchanging data. A lot of these applications can export their data into files such as CSV files, XML files and JSON files.
Media can only be stored in files. Media such as voice recordings, video recordings and images are typically stored only in files. So files are a very popular way by which data is exchanged between organizations. Advantages of files: all systems and applications support file-based data extraction. That is the great thing; you can go to anybody and say, okay, you have an application that has data, can you extract it and give me a CSV file? Yes. So files are easy to work with, and they work easily across inter-organizational boundaries. When you ask, okay, I need this data, can I connect directly to your database, people think twice and say no, there are security issues. Can you give me the same data in files? Yes. That's a great thing about files. There are common tools and utilities for working with files: extracting data into files, copying files, moving files, zipping up files and securing files. You can put a password on them or use tar formats, and all operating systems support file operations. There are so many things you can do with files, which makes them a very commonly used method for exchanging data. The shortcoming of files is that they are slow in terms of how you move them around; they are relatively slow compared to some other methods. There are often many manual steps involved: somebody copies the file from point A to point B, and then somebody moves it from point B to point C. Data is exposed, because these are text files, unless they are properly encrypted. That's something you have to take care of when you're working with files. Use cases for files are inter-organizational data movement: when you have to move files across organizations, whether between companies, between departments within a company or even between two applications.
This is a very popular format. It is the only way by which you can move media files, and files do support capabilities for data encryption and compression. Very standard tools are available. So that's a good thing that you can do using files. Thank you. 11. Acquire Options REST and Streaming: Continuing on the acquire options, the next option we're looking at is REST APIs. REST is a web-based API standard for exchanging data and for performing CRUD operations; CRUD stands for create, read, update and delete. REST APIs are like a SQL kind of capability: they can be used for retrieving data and also for updating data. REST is a very popular standard across the web that lets you exchange data. It decouples the consumers from the producers. That's the good thing about REST: it decouples the consumers from the producers and gives you a simple way of accessing data from any supported source. It provides for stateless access, in that every query you send is a stateless query, and it comes back with a result which is self-contained. There is no state to be maintained between multiple requests, which makes the work a lot simpler. It provides a nice uniform interface based on HTTP standards for doing GET, POST, PUT, DELETE and the various operations that you can perform using a REST API. It supports advanced security. You typically use the OAuth scheme of authentication for data fetches, which is a good scheme because it uses API keys rather than user names and passwords, and that makes it a lot more secure. It supports encryption, so when you're moving data across the web, it is a lot safer and more secure. And it is supported by most cloud-based and mobile data sources: Twitter, Facebook, Salesforce, all of them provide REST APIs to extract data.
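As a rough illustration of what "stateless and self-contained" means in practice, the sketch below builds a single REST request that carries everything the server needs: the position to resume from, the page size and the credentials. The endpoint URL, the `since_id` and `limit` parameters and the `build_fetch_request` helper are all hypothetical, chosen only for this example; a real provider's API will define its own names. The bearer token header is one common way of presenting an OAuth access token.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint used only for illustration.
BASE_URL = "https://api.example.com/v1/posts"


def build_fetch_request(api_token, since_id, page_size=100):
    """Build one self-contained GET request.

    Nothing is remembered on the server between calls, so the resume
    position and the credentials travel with every request.
    """
    query = urlencode({"since_id": since_id, "limit": page_size})
    request = Request(f"{BASE_URL}?{query}", method="GET")
    # Token-based authentication instead of a user name and password.
    request.add_header("Authorization", f"Bearer {api_token}")
    request.add_header("Accept", "application/json")
    return request


request = build_fetch_request("demo-token", since_id=42)
```

The request is only constructed here, not sent. Sending it would be a call to `urllib.request.urlopen(request)` against a real endpoint, over HTTPS so that the data is encrypted on the wire.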
REST is becoming the de facto standard on the new web and cloud infrastructure. What SQL used to be for databases, REST is becoming for the web. The advantages of REST are that it is a standard for Internet data exchange that is becoming the de facto standard. It has excellent security and scalability built in. It's pretty simple to use, easy to integrate and easy to learn, and almost all programming languages support REST. Actually, even if a language doesn't support it directly, all you need is for the programming language to support HTTP, and you can do REST pretty easily. Shortcomings of REST: there is redundant information that may have to be passed around because of the statelessness. Every request has to be self-complete and self-sufficient, so that could be one of the limitations. A bigger headache with REST APIs is rate limits. Whenever popular big providers such as Twitter, Facebook or Salesforce give you REST APIs, there is a limit on how many REST queries you can make within a given amount of time and how much data you can fetch. That becomes a big headache, because you have to optimize your application suitably so that you do not hit these rate limits, or you have to pay extra for getting additional data, which again means that you have to architect in a way that you are not paying too much money for accessing these REST APIs. REST also does not support real time. If you want real-time data, you need to use streaming APIs separately. The use cases for REST APIs are cloud and social media data sources. Whenever you have to get data from cloud or social media, you have to use REST; there is no other way. If you're getting data from mobile data sources, REST is also becoming pretty popular. It can also be used internally for exchanging data. Real-time metadata can be exchanged using REST APIs.
That's another use case, but mostly, when you're getting data from the cloud, you are going to be using REST APIs. The fourth option is streaming. Streaming is a real-time publish and subscribe model, which means that you have a subscriber and you have a publisher. The subscriber, which is typically your client, subscribes to a source and says, I need real-time data from you, and you establish a persistent connection. Whenever new data occurs for a topic or an object that you have subscribed to, the data is pushed to you. It is not a pull model; it is a push model. Clients subscribe to a specific topic or a subset of data, an HTTP connection is kept open all the time, and the server pushes data to the client whenever new data is available. For example, if you want a real-time Twitter feed, you open a subscription to Twitter and say, I want any tweets that happen on this specific handle. Whenever somebody tweets on that handle, the data is pushed to you, and you acquire the data and start using it. It uses secure keys and encryption, pretty similar to REST APIs, so the data is pretty secure. It again has rate limits, the same way a REST API has. In the case of streaming, especially with popular data sources like the social media sources, there are again limits on what kind of data is streamed and how much data is streamed. But this is a way by which you would get data from any kind of social media data source. The advantage of streaming is that it is real-time, instantaneous data transfer. Yes, it is really real time. Streaming can also be implemented by you in your own application through a custom streamer. It can also give you only deltas, which is a most important thing.
It propagates only the changes, not the entire records, which limits the amount of data that is transferred across the wire. It is supported by major cloud providers: Twitter, Facebook, Salesforce, everybody supports streaming. Shortcomings of streaming: there is loss of data if the connection is broken. The connection has to be kept alive all the time. If the connection is lost, any data that occurs between the time you lose the connection and the time you reestablish it is lost. That's one issue with streaming. There are rate limits once again, which can come up and affect you in terms of how much data is streamed. You might need to supplement streaming with historical data pulls, again because of the size issue. You might use streaming only to get the data that you need for real-time activities, and supplement that with historical data to get everything else. Rather than depending on streaming to give you all the data, you use streaming only for the instantaneous, minimal data that is required for real-time activity, and you supplement that with a historical channel that gives you the rest of the data. Use cases for streaming are real-time sentiment analysis: for example, you may want to subscribe to social media and see what people are tweeting about your company in real time. This is a great use case. Real-time reporting: you want to have some real-time reports based on real-time activities that are happening. For example, you might have a customer support Twitter handle, and people are writing about your company; you might want to use that for real-time reporting, or for any kind of real-time action that is based on user behavior. So streaming is strictly for real-time use. I would not recommend depending on streaming alone for getting historical data because, as I said, data can be lost.
Whenever the connection is broken, data is lost in the middle, so you need to supplement streaming with historical data feeds anyway. These are the various options available to you for acquisition. We discussed the popular options here; there may be other ones also, and we are not going to go through them, but hopefully this pretty much covers all the use cases that you are interested in. 12. Transport Overview: Hey, welcome to this lecture about the transport module. The transport module plays a big role in big data compared to a regular application, because transport is almost nothing when it comes to regular, non-big-data applications. But in the big data world, there is a huge amount of data, and that huge amount of data has to be moved, and that is really a big task. One, the data is huge. Second, the distance between the source and the destination is also larger; it is usually across multiple organizational boundaries. So the transport layer acquires a very important role when it comes to big data. There are generally two types of transport modules that are typically in place. The first is the store-and-forward type of module, where the data is stored and forwarded. It is moved from one place to another, step by step. It is almost like sending a parcel through a courier service: you deposit the parcel at the source, and then a transport truck takes it from one location and drops it at another, and then another truck takes it from that location to the next. That's a store-and-forward mechanism: data is received at the source and moved from place to place, one hop at a time. Data is usually moved in units like files or directories, and there is tracking of completion.
And there is retransmission of the entire unit in case there is a transmission failure. On the other hand, there is the streaming type of transport module, where the data is moved continuously to the destination. There is a live connection between the source and the destination, and data flows almost like water through a pipe. The data has to be throttled at the source so that it does not flood the pipe. Similarly, the data has to be throttled for the sink so that the sink is able to receive the data as it is provided by the transport layer. Throttling becomes pretty important here so that nothing overflows anywhere and data does not get lost. There is also the need for in-flight storage: in case a lot of throttling is happening, you cannot push the data as fast as possible, or as fast as the sink can consume it, so there has to be in-flight storage also. These are the two types of transport layers that you would typically encounter. The first one, store and forward, is typically used for historical data, while streaming is used for real-time data. In terms of the responsibilities of the transport layer, its first responsibility is to maintain a link with the acquisition module. Translate data to a protocol-optimal format: sometimes the data has to be translated into a protocol-optimal format. If you're moving text files, that may not be the case, but sometimes you might want to zip the data files so that the data is compact and takes less bandwidth to transport. You, of course, have to move the data. The transport layer also has to make sure the data is secure while it is being moved, so that it is not open for anybody to sniff the data and identify what is there. You need to maintain a link with the persistence module.
That is the sink side: save the data in the persistence module and, of course, confirm that the data has been correctly and fully acquired by the persistence module before the transport layer drops it. You need to track the data as it moves across the transport layer, the same way a courier service tracks where your packages currently are. So you need to track the data from the source all the way to the destination. You need a way to retransmit in case there are failures in transport. And, of course, there has to be reporting: how many packages were received, how many were transported, how many retransmits happened, how many packages were lost, and whether there were any such events, so that an administrator can see whether everything is working as desired or whether something has to be fixed.

When it comes to architecture, what are the things you want to take care of while architecting the transport layer? You need to architect for speed; you want to move the data as quickly as possible. There has to be throttling, especially in a real-time system, because you don't want the source to flood the transport layer and you don't want the transport layer to flood the sink. So you need to make sure the data is properly throttled. There has to be reliability, so that there is no data loss in transport. This is pretty important, especially when you couple it with throttling, because when you throttle you need some kind of temporary storage. You can't keep the data in memory all the time, because pretty soon you might run out of memory and start losing data. So you need some kind of disk-based data persistence for throttling purposes. Typically the products you use already provide this kind of throttling if you are using a standard product. You need redundancy, so that there are multiple channels through which transport can happen; one node failing should not mean the whole pipeline comes to a halt. There has to be scalability, so that you can transport a large amount of data and a number of sinks can receive it. Scalability becomes important especially for real-time data. There has to be status reporting and alarming, of course: you need good reporting systems to help your administrators monitor the solution. As usual, architect for compression, so that less bandwidth is consumed, and architect for encryption, so that the data is secure as it moves across the transport layer. These are the points you need to consider while architecting a transport layer. If you're picking a product, most of them are taken care of for you. But if you're writing a custom transport layer, you need to factor all of these into your design.

Best practices for the transport layer: do not reinvent the wheel, especially when it comes to moving big data. Big data is big, and scalability and reliability at that scale are not simple things you can build all by yourself. It is always better to piggyback on proven messaging and transport frameworks and protocols. There are a lot of messaging protocols available, both those discussed in this course and those that are not. Pick one and piggyback on it; writing one from scratch is a lot of work. Look for integrations between transport technologies and the other modules.
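The throttling requirement discussed above can be illustrated with a small sketch. This is only an in-memory illustration, not how any particular transport product implements it: the function name `run_pipeline` and the `buffer_size` parameter are made up for this example, and a real system would back the buffer with disk-based storage as the lecture recommends.

```python
import queue
import threading


def run_pipeline(records, buffer_size=4):
    """Move records from a source to a sink through a bounded in-flight buffer."""
    # A bounded queue: when it is full, the source blocks instead of flooding the pipe.
    buffer = queue.Queue(maxsize=buffer_size)
    received = []
    done = object()  # sentinel that marks the end of the stream

    def source():
        for record in records:
            buffer.put(record)  # blocks while the buffer is full (throttling)
        buffer.put(done)

    def sink():
        while True:
            record = buffer.get()  # blocks while the buffer is empty
            if record is done:
                break
            received.append(record)

    threads = [threading.Thread(target=source), threading.Thread(target=sink)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received
```

The design choice here is that the source is slowed down rather than data being dropped, which is the behavior the lecture asks for when it says the source must not flood the transport layer.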
When you're choosing a transport layer, it has to integrate very well with the acquisition layer, the persistence layer and the transformation layer. So you need to see how well the product you're choosing for the transport module fits with all these other modules. The choices have to mingle and work well together, so that's something to look out for.

Be aware of data transport costs. Transport has a significant cost impact, because if you're moving data through the Internet there are bandwidth requirements that come into play. If you're moving it over a VPN, again, you need to provision bandwidth for the kind of data you're going to be moving. Otherwise you're going to have issues with throttling. You need to be aware of how much it costs, because a lot of times a dedicated channel for moving data comes at a price. You might want to look at techniques like compression to make sure you use the minimum bandwidth possible. All of this you need to consider in your architecture.

Use reliable in-flight storage. When you're moving data from one location to another, make sure there are stop points in the middle, ideally file-based stop points, so that if for any reason the process crashes in the middle you do not lose all the in-flight data. In-flight data management is pretty important in the transport layer. Either you need the ability to retransmit data that was lost in flight, or you need the ability to retrieve it and continue from where you left off. Both require some thought to be put into architecting the solution.

Consider security measures to prevent data theft, which is becoming a very important issue, especially in the Internet world.
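The file-based stop point mentioned above could look something like the following minimal sketch. Everything here is hypothetical and chosen for illustration: the `SpoolFile` class, the file names `spool.jsonl` and `delivered.ckpt`, and the idea of counting delivered records are assumptions, not the design of any specific product.

```python
import json
import os
import tempfile


class SpoolFile:
    """Hypothetical stop point: records go to disk first, and a small
    checkpoint file remembers how many the sink has confirmed."""

    def __init__(self, directory):
        self.data_path = os.path.join(directory, "spool.jsonl")
        self.ckpt_path = os.path.join(directory, "delivered.ckpt")

    def append(self, record):
        with open(self.data_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def delivered_count(self):
        if not os.path.exists(self.ckpt_path):
            return 0
        with open(self.ckpt_path, encoding="utf-8") as f:
            return int(f.read().strip() or 0)

    def pending(self):
        """Records written to the spool but not yet confirmed by the sink."""
        if not os.path.exists(self.data_path):
            return []
        with open(self.data_path, encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]
        return records[self.delivered_count():]

    def mark_delivered(self, count):
        # Write to a temp file and rename it, so a crash cannot leave
        # a half-written checkpoint behind.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.write(str(self.delivered_count() + count))
        os.replace(tmp, self.ckpt_path)


def demo():
    with tempfile.TemporaryDirectory() as directory:
        spool = SpoolFile(directory)
        for i in range(5):
            spool.append({"id": i})
        spool.mark_delivered(2)            # the sink confirmed the first two
        restarted = SpoolFile(directory)   # simulate a process restart
        return [record["id"] for record in restarted.pending()]
```

After the simulated restart, only the unconfirmed records remain pending, so the transport can continue from where it left off instead of losing the in-flight data.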
Your enterprise's data needs to be secure as it moves through the transport layer. Typically, once it reaches the persistence layer it is fairly safe, because it is now in a secure environment. But while it is in transport, if it is moving across organizational boundaries or across the Internet, there are security issues you need to take care of. Hopefully this is helpful for you. We will now look at the options for transport in the next lecture. Thank you.

13. Transport Options SFTP and Sqoop: Hi. Welcome to this lecture on transport options. This is your instructor, Kumaran. In this lecture, we're going to start looking at the various options available for transporting data from the source to the sink. We're going to start with the simplest of all, which is the file move or copy command. This is something you will have used a lot, and you might be wondering why it even shows up here. Yes, it is simple and straightforward, but it is still a great way of moving files or data between two locations where that is possible. It is one of the simplest ways of moving large files. It is supported on all operating systems. If you're moving data between operating systems, say between Windows and Linux, that might require some adapter software, but it is still possible. It can be quickly scheduled and automated, and all programming languages have libraries for moving and transferring files, which makes life pretty easy. If the distance between the source and the destination is short, that is, within the same network, this is one of the easiest ways to move files.

What are the advantages of file move and copy? It is simple and straightforward to use.
It doesn't require any special skills; everybody knows how to do it. It comes with the operating system. The shortcomings are that moving data between different operating systems requires adapters, meaning special software for doing the move. Moving data across a WAN might lead to reliability issues. A file move or copy by itself does not give you much security or encryption, so that is an issue. Managing large files can also become difficult, especially on slow-bandwidth channels. So between two machines this is a simple thing to use, but if you're trying to move data across a WAN or across the Internet, it can get pretty tricky.

Use cases are intra-enterprise: between two servers within your enterprise network, this is the easiest way to move data, and it may be the only way you need. It also suits media files. File move and copy is simple and straightforward, and it has its use cases even in the big data world.

The next most popular tool for transport is SFTP, the SSH File Transfer Protocol, often called Secure File Transfer Protocol. SFTP is a network protocol for file access and transfer, and it is supported on all operating systems. It uses a secure channel for data protection: internally it uses Secure Shell (SSH) for moving data, so by default it provides encryption, and that protects your data. It has support for authorization and authentication, so that again is good.
So that takes care of the security issues. SFTP has built-in data integrity checks: when it moves data between a source and a destination, it does cross-checks, checksums and so on to make sure the data has been correctly moved. It can resume interrupted transfers. That is one great thing with SFTP: if you're moving data between two locations where the channel has slow bandwidth but the file is huge, this is going to be a key feature. Files carry their basic attributes. Files typically have attributes like timestamps and names, and those get carried over, which is information the destination can use for various purposes. It has wide support across operating systems; there are a number of tools and utilities that do SFTP, and there are libraries that can do SFTP. So this is a very popular protocol for moving files.

What are the advantages of SFTP? There is wide support for SFTP across operating systems, tools and utilities. Any OS has a couple of different SFTP clients, and a lot of open source and commercial software can do SFTP for you. That's a great advantage. It is mature and widely accepted. Being widely accepted is very important, especially when you're setting up data transfer between two organizations and need to agree on a mutually acceptable tool. SFTP tends to be the first choice, because if you propose anything else, people start asking about security, about this and about that. SFTP is a very commonly used protocol that people typically agree to immediately, and it works as a kind of handshake interface between two organizations.
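To make the ideas of an integrity check and a resumed transfer concrete, here is a minimal local-file sketch. It does not use SFTP and is not how SFTP implements these features internally; the helper names `sha256_of` and `resume_copy`, and the `max_bytes` parameter that simulates an interruption, are assumptions made for this illustration. It also assumes the partial destination file is an exact prefix of the source.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Checksum a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resume_copy(src, dst, chunk_size=65536, max_bytes=None):
    """Copy src to dst, continuing from however many bytes dst already holds.

    max_bytes exists only to simulate an interrupted transfer.
    Returns the total number of bytes now present in dst.
    """
    offset = os.path.getsize(dst) if os.path.exists(dst) else 0
    sent = 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(offset)  # skip the part that already arrived
        while max_bytes is None or sent < max_bytes:
            size = chunk_size if max_bytes is None else min(chunk_size, max_bytes - sent)
            chunk = fin.read(size)
            if not chunk:
                break
            fout.write(chunk)
            sent += len(chunk)
    return offset + sent


def demo():
    with tempfile.TemporaryDirectory() as directory:
        src = os.path.join(directory, "src.bin")
        dst = os.path.join(directory, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(200_000))
        resume_copy(src, dst, max_bytes=70_000)  # "interrupted" transfer
        total = resume_copy(src, dst)            # resume from byte 70,000
        return total, sha256_of(src) == sha256_of(dst)
```

Comparing checksums at the end is what confirms that the resumed copy matches the source, which matters most on slow channels with huge files where restarting from zero is expensive.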
So that's a popular protocol you want to keep in mind. Data security across the Internet, VPN and WAN is also a great advantage of SFTP. Even for data that sits in an RDBMS, sometimes you want to dump it into files and then SFTP them over. It is a very popular way of exchanging data between enterprises.

Shortcomings: firewalls might have problems accepting SFTP, so you need to take care of opening up firewalls to enable SFTP if you're moving data across enterprises. Passwords need to be shared across parties, and password sharing is considered less secure these days. If you look at how the cloud works these days, access is typically done through API keys and the like. Transfer speeds are slower: because there is encryption going on, cross-checking going on and checksums going on, transfers are slower.

Use cases: inter-enterprise file sharing. This is one of the most popular ways for two companies to agree on sharing information, because both IT departments are typically okay with giving SFTP access to another party. Media file transfer is again a good fit, because media files are huge and SFTP takes care of cross-checking that the file was correctly transmitted. And finally, of course, if you're moving log files, SFTP is again a great option.

The third option we're going to look at is Apache Sqoop. What is Apache Sqoop? It is a command-line tool for transferring data between relational databases and Apache Hadoop. So when you have data in a relational database, which is all about tables and columns that you can query with SQL, and you want to take that data and transfer it into Apache Hadoop as sequence files, Apache Sqoop is the tool to use. Sqoop allows you to transfer entire databases, tables, or the results of an SQL statement. So you can write an SQL statement that has a GROUP BY, a filter and so on, filter the data on the fly, and the tool transfers the result. It does not need any programming at all. It is a command-line tool: you provide a set of parameters, such as where the source is (database name, driver name, user name, password), which table, database or SQL to execute, and where the destination is (the Hadoop system and its port number), and it does the magic for you. You can then write a script for this command and automate it using a scheduler, so that it keeps transferring data on a periodic basis. It supports various file formats on the Hadoop side, such as Avro, sequence files, Parquet and plain text. It can transfer data into Hive and HBase, it can get data out of Hive and HBase as well, and it works through JDBC drivers. It supports parallelism, so there is scalability with Apache Sqoop. It also supports incremental transfers: you identify the column to use for tracking new data, and Sqoop keeps track by itself of what is new in the table. Every time it transfers, it keeps note of how far it got in this run, so the next time it starts from there and continues. That is a great capability Sqoop provides. Finally, it has support for BLOBs, which is again a great advantage. Sqoop is something the Hadoop community came up with as a great way of moving existing data into Hadoop.
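The incremental-transfer idea described above, tracking a column and remembering the last value seen, can be sketched as follows. This is an illustration of the concept using an in-memory SQLite database, not Sqoop's implementation or command line; the function `incremental_extract`, the `orders` table and its columns are hypothetical names chosen for the example.

```python
import sqlite3


def incremental_extract(conn, table, check_column, last_value):
    """Fetch only the rows whose check column is greater than the last value seen.

    Table and column names are interpolated directly into the SQL; in this
    sketch they are trusted constants, not user input.
    """
    query = f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}"
    rows = conn.execute(query, (last_value,)).fetchall()
    # Remember how far this run got, so the next run can start from there.
    new_last = rows[-1][check_column] if rows else last_value
    return rows, new_last


def demo():
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,)])

    first, last = incremental_extract(conn, "orders", "id", 0)
    conn.execute("INSERT INTO orders (amount) VALUES (?)", (30.0,))
    second, last = incremental_extract(conn, "orders", "id", last)
    return [row["id"] for row in first], [row["id"] for row in second], last
```

The second run picks up only the row added after the first run, which is the behavior that makes periodic, scheduled transfers cheap.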
It is a simple, straightforward tool that works great. Let's see what its advantages are. It is simple and straightforward to use: it is just a command line, so you put in your parameters and it starts working like magic. It has parallelism to speed up transfers. And it is bidirectional: you can move data from Hadoop back to a relational database. This is great, because what typically happens is that a lot of raw data is pushed into Hadoop, a lot of processing happens within Hadoop, and the data is summarized and stored in Hadoop. The summary data can then be moved back to the relational database, from where it can be used for reporting. So being bidirectional is of great use.

Shortcomings: it is predominantly Linux-based, which is usually okay. Security is weak: it doesn't have strong security measures; for example, the database password has to be given in the clear. That is something people are working on. It doesn't have any built-in transformation support, so you cannot transform data on the fly. You can, of course, do some transformation in the SQL that is used to fetch the data, but Sqoop itself doesn't offer much there. It doesn't have any streaming support either, so it is a tool for historical data that you run from time to time to fetch batches of data.

What are the use cases for Sqoop? Hadoop-based backups and data warehouses. When you want to move data from an RDBMS to Hadoop for backups, move data into HBase or Hive, or move data from Hadoop back to an RDBMS, in short any time an RDBMS is either the source or the sink, Apache Sqoop is the tool to use. That is another great advantage of Apache Sqoop.
It is a great tool to keep in mind whenever you want to get data out of an RDBMS into Hadoop and back. Thank you.

14. Transport Options Flume and Kafka: Hi, continuing our discussion of the various options for transport, the next thing we're looking at is Apache Flume. Apache Flume is a distributed service for collecting, aggregating and moving large amounts of log and streaming data. Flume was specifically created with the use case of log files in mind: you might have a farm of web servers, hundreds of them, with web logs being created on each, and you want to collect these logs from every server and send them to a central place for large-scale processing. That is the use case for which Flume was built.

The way Flume works is that it has a source, a channel and a sink. The source is a component deployed on every server, from where it collects data locally. The data collected locally is then sent across the channel, and there is a sink into which the data is deposited. So data is collected at the source, moved across the channel and deposited in the sink. The sources can span a large number of servers, so you can collect data from a whole farm of web servers. It supports multiple types of sources. The sources can be files, or they can be streams: a local application can deposit strings into the local Flume client instance. It can also do HTTP: you can set it up to receive HTTP POSTs and then send the content of each POST on to various sinks. It can support Twitter streams as well. There is support for multiple sink types too, such as an RDBMS sink, a Hadoop sink and an HBase sink. It has multiple sink types.
These are supported out of the box, and you can add custom sources and sinks through code. So if you have your own client application running on various servers and you want to transport data to your own sink, you can still use Flume by writing custom sources and sinks, through which your application uses Flume as the channel to send data across to your application on the sink side. So customization is possible with Flume. It is robust and fault tolerant, and it has throttling, failover and recovery capabilities, which is a pretty impressive set of features. It also supports in-flight data processing: while data is moving through the channel, you can do some processing on it by writing interceptor code.

How does Apache Flume stack up in terms of advantages? It is highly configuration driven: you write a set of configuration files, and it is highly configurable in terms of what it can do. It is massively scalable; as I said, it was written to collect log files from many web servers. And you can do a lot of customization by writing custom sources and custom sinks in Java, which can do custom processing for you.

Shortcomings: there are no ordering guarantees. This is the big headache with Apache Flume, and I believe they are working hard to get rid of it. No ordering guarantees means that a set of events you put in some order at the source may not come out in the same order on the sink side.
So you need your own way of managing ordering on the sink side; once you receive the data, you have to reorder it. It is also possible to end up with duplicate data, the same data being transmitted twice. And there are no replication capabilities available in Apache Flume. But it is really scalable.

Use cases: log shipping. Shipping log files is one of the major use cases for Apache Flume. Twitter streaming is another use case; it can hook up to Twitter, get data and propagate it. It can also do what is called edge server processing. Especially in mobile systems you might have edge servers, which sit on the edge of the network and interact with the real world, for example with wireless devices. You can capture events at that point and put them on a Flume channel, and Flume takes care of shipping those events to the central repository. That is one of the other use cases for Flume.

Let's look at one more tool: Apache Kafka. One question you might be asking is why there is both Apache Flume and Apache Kafka when they overlap in functionality. That is exactly what we talked about earlier: there are so many products coming up, developed by different people, all put into open source and developed independently. Somewhere down the line you expect one of them to grow and another to fade, or they may merge into a single product. We don't know. These are overlapping products: they have common capabilities and they have separate individual capabilities. At this point we have many options because of that. Apache Kafka is an open source message broker platform for real-time data feeds. Kafka is more focused on real time than Flume. Even though Flume supports streaming, Kafka is really more focused on real-time feeds.
It has a publish-subscribe architecture. Publishers publish data on topics, and subscribers subscribe to these topics. As and when new information is put into Kafka for a specific topic, that information is sent to all the subscribers of the topic, who then process it. So Kafka works on a publish-subscribe architecture. It was developed at LinkedIn and it is run there at scale, which tells you it has already been proven under real-life loads. Topics are published, and there can be multiple subscribers for every topic who receive the data. Kafka gives you ordering guarantees (strictly, within each partition of a topic), which are not there in Flume. If real-time ordering of data is important for you, then you can use Kafka for that purpose. Coding is required for publishers and subscribers to interface with Kafka. Kafka does support some standard publishers and subscribers, but if you're doing anything custom you have to write some code to interface with it. Kafka is not highly configuration driven the way Flume is. It supports replication and high availability. Those are the differentiating features of Kafka.

What are the advantages of Kafka? It is a highly scalable, real-time messaging system. A single topic can be sent to multiple subscribers, which is a great advantage. It also provides ordering guarantees for the data coming into your system: the data comes out in the same order in which you put it in. That is a great advantage.

Shortcomings of Kafka: coding is required for publishers and subscribers.
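The publish-subscribe flow described in this lecture can be pictured with a toy in-memory broker. This is deliberately not Kafka's API and leaves out partitions, replication and persistence; the `TinyBroker` class, its methods and the topic and subscriber names are all invented for this illustration.

```python
from collections import defaultdict


class TinyBroker:
    """Toy in-memory broker, for illustration only."""

    def __init__(self):
        self.logs = defaultdict(list)    # topic -> messages in publish order
        self.offsets = defaultdict(int)  # (topic, subscriber) -> next index to read

    def publish(self, topic, message):
        # Append-only: the log keeps messages in the order they were published.
        self.logs[topic].append(message)

    def poll(self, topic, subscriber):
        """Return the messages this subscriber has not seen yet, in publish order."""
        start = self.offsets[(topic, subscriber)]
        messages = self.logs[topic][start:]
        self.offsets[(topic, subscriber)] = len(self.logs[topic])
        return messages


def demo():
    broker = TinyBroker()
    for value in ("cpu=10", "cpu=20", "cpu=30"):
        broker.publish("metrics", value)
    first = broker.poll("metrics", "dashboard")
    broker.publish("metrics", "cpu=40")
    second = broker.poll("metrics", "dashboard")
    alerts = broker.poll("metrics", "alerting")  # a second, independent subscriber
    return first, second, alerts
```

Each subscriber tracks its own position in the topic, so one topic can feed several subscribers and each of them still sees every message in the order it was published.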
That coding is an overhead you have to accept in order to use Kafka. In terms of support, and here we are talking about tech support, support for Kafka is not that strong at the time I'm recording this lecture when compared to Flume, so that is a limitation. Of course, the web is always there for you: you can post queries on Stack Overflow and get answers. But if you want to go and buy corporate support, there are some limitations.

Use cases for Kafka: real-time analytics is a great use case, where you get data from a large number of places, with a lot of publishers and a lot of subscribers, and you want to do all of that in real time. Kafka is a great engine for that. Operational metrics aggregation is another use case. For example, if you're running a data center with thousands of servers and VMs running various things, and you want to collect performance data, errors and alerts across all the servers into a central place and aggregate them, Kafka is a great tool for that kind of collection. It can also be used for complex event processing. For example, if you are getting a lot of events from various mobile clients and they have to be collected, accumulated and processed, Kafka is again a great option.

So, as we have seen, there are many options for doing the transport. Each has its own capabilities, advantages and disadvantages, and you need to choose the right component for your architecture based on what your use case needs. Hopefully this is helpful for you. Thank you.

15. Persistence Overview: Hey, welcome to this lecture on the persistence module. This is your instructor, Kumaran.
The persistence of big data is a big challenge because, as you know, it is big data, and the database system we use needs to scale to that level, supporting the terabytes or petabytes of data that a big data system manages. So let's start and see what the responsibilities of a persistence module are, especially in the big data world.

The first thing the persistence module has to offer is reliable data storage: data, once put into the database, cannot be lost in any way. Then come the ACID properties, a list of properties that any database system needs to comply with: atomicity, consistency, isolation and durability. Different database systems offer these at different levels. Big data NoSQL databases are typically somewhat lacking here, while RDBMSs are the ones that are more mature in this respect. So you need to be careful about what you can give up and what you cannot when choosing a database for your use case. Schema: the database system should provide a schema through which it associates meaning with the data stored in it. It needs the capability to support transactions: especially when you insert data into multiple tables, you should be able to combine the inserts into one logical transaction. That way you can control which writes happened and know clearly when something failed. Data access capabilities: ad hoc data access through SQL or APIs, and access through JDBC drivers when you're programming, are also required of a database. Response times when querying data are another very important aspect, and the response times required for your use case have a strong influence on which database you choose for your data storage. And finally there is scalability at big data scale: multi-cluster, shared-nothing. A lot of architectural components go into creating a truly scalable architecture, one that does not depend on a single CPU or a single machine. You need to be able to scale horizontally to any number of boxes and any number of volumes, even across data centers, to provide true scalability at the data storage level.

So what do you need to architect for? As an architect, you would expect most of these capabilities to be provided by a solution from the open source or commercial world. The first thing is, of course, scalability. Big data needs scalability; there is no need to say more about that. It needs consistency in terms of data storage: read consistency, write consistency, all of those need to be enforced. It needs the ability to support transactions; if the database does not provide transaction support, you need a clear way to implement it in your own code to get consistent transactions. You need to look at read-intensive versus write-intensive use cases and choose the architecture appropriately; sometimes the workload is read intensive and sometimes it is write intensive, depending on your situation. You need to look at mutable versus immutable data. Immutable data is write-once only, while mutable data keeps changing. Write-intensive, mutable data is common in real-time scenarios, where the same data may be updated again and again based on what is happening. That is something you have to architect for. Data cataloguing covers schema and metadata, that is, data about data: you need a good catalog, which is what converts this data into a proper data reservoir. You need the catalog first, before you go and do querying.

Latency requirements: look at real-time versus historical latency requirements. Real time definitely needs latency of a second or less, and that is something you need to architect for. From the moment you acquire the data all the way to your reporting, the latency may have to be within a second or two. So you have to architect to move, store and transform the data that fast. Static versus ad hoc loads: in a big data situation you are going to encounter a lot of ad hoc loads. That is when your data scientists sit and query the data, doing analytics and transformations on an ad hoc basis. Ad hoc loads typically scan a huge set of data, and that can choke your database. So you need to provide for ad hoc capability in terms of scalability too: when somebody runs an ad hoc load, it should not affect your regular data processing or your regular reporting. Ad hoc loads on big data databases can be huge, so you need to be careful about when and how they are run. Typically these databases give you resource allocation controls, such as how many nodes and how many CPUs to allocate to a workload, and you need to configure those and run the loads accordingly. A flexible schema is something you want to architect for very deliberately, because big data lends itself to a lot of ad hoc analysis, and a lot of new data attributes may get added as time goes on. So you need to provide a flexible schema, so that your data scientists, data analysts, programmers and data engineers can keep adding new attributes as and when they show up, without a heavy change process.
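One possible way to get the flexible schema described above, even on a relational database, is to store attributes as name-value pairs instead of fixed columns. The sketch below uses an in-memory SQLite database purely for illustration; the `attributes` table, the helper functions and the sample attribute names are hypothetical, and values are stored as text to keep the example small.

```python
import sqlite3


def make_store():
    """Create a table where each row is one (entity, attribute name, value) triple."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE attributes ("
        "entity_id INTEGER, name TEXT, value TEXT, "
        "PRIMARY KEY (entity_id, name))"
    )
    return conn


def set_attribute(conn, entity_id, name, value):
    # A new attribute is just a new row; no ALTER TABLE is needed.
    conn.execute(
        "INSERT OR REPLACE INTO attributes VALUES (?, ?, ?)",
        (entity_id, name, str(value)),
    )


def get_entity(conn, entity_id):
    rows = conn.execute(
        "SELECT name, value FROM attributes WHERE entity_id = ? ORDER BY name",
        (entity_id,),
    )
    return dict(rows.fetchall())


def demo():
    conn = make_store()
    set_attribute(conn, 1, "age", 31)
    set_attribute(conn, 1, "city", "Pune")
    # Later, an analyst introduces an attribute nobody planned for.
    set_attribute(conn, 1, "segment", "gold")
    return get_entity(conn, 1)
```

The trade-off is that typed columns and simple per-column queries are given up in exchange for being able to add attributes at any time, so this layout suits attributes that change often more than a stable core schema.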
None of that should require a database redesign, re-population, or anything like that. Best practices for choosing a persistence model: horses for courses. There is no one-size-fits-all database technology available today for big data use cases. There are a number of databases, good databases, but they are all horses for courses: each is suited to a few use cases. So you have to choose one that suits your exact use case, and sometimes you have to use multiple of them. The same solution might actually use multiple database types, an RDBMS and a NoSQL type or an in-memory type, depending upon the specific use case and how you want to architect the solution. That is what is called polyglot persistence, in which the data is stored on multiple different database servers and the appropriate server is used based on the use case. Keep your schema schema-less and flexible, so that the schema can change any time and you can keep adding new attributes any time. If you look at NoSQL databases, those are the ones that give you this capability. Even if you have to use an RDBMS, it is still possible to have a kind of flexible schema if you store name-value pairs as opposed to creating rigid columns. Keep data at the lowest granularity possible. If you are talking about transactions, keep the data at the transaction level and not at the summary level. The reason is that when you are talking about flexible querying, about ad hoc querying, you do not know what kind of use cases your data scientists and analysts are going to come up with later. So do not pre-summarize the data; keep it at a granular level so that it gives them the flexibility to summarize the way they want and when they want. Summarize data only if needed, for example when you have standard reports that have to run and they do not run fast enough on the granular data.
In that case maybe you create pre-defined summaries, but be watchful of when you create summaries. Do not blindly create a lot of summaries. This is big data, and typically these database systems scale very well. Only when you find that reports run off the granular data do not scale enough should you go and summarize data. Consider your real-time application needs when you choose your big data database. Look at both real-time and historical use cases and see which use cases have more priority. Sometimes you might actually have two systems, one for real time and another for historical data, if both are needed by the system. Don't push too hard to create one solution that fits all, because you may end up doing n times more work trying to create that one solution than you would creating two different solutions. Do away with backups. In the big data world there is no place for backups, because they would be very time-consuming, and restoring data is also very time-consuming. That is why Hadoop came up with the concept of multiple copies of data, so there is no need for you to take a backup. That is one of the reasons Hadoop itself came into the picture, and you want to continue that practice and not have any kind of backup system going on. So watch out for this one and make sure you are not going to require backups for your data. Architect in such a way that the data itself is replicated across multiple copies, so that work can continue on other copies when one node is not available. Thank you.

16. Persist Options RDBMS and HDFS: Okay, welcome to this lecture about options for persistence. This is your instructor Kumaran here. Choosing a persistence layer is one of the most important decisions you make as a big data architect, because a lot of the problems in big data relate to the persistence layer. So what are the options we have in persistence?
The first one is RDBMS. You might be surprised that this one shows up in a big data list, because it is the problems with RDBMS that actually triggered the development of big data persistence solutions like Hadoop. But RDBMS still has a big role to play in big data architectures, because some of its unique advantages have still not been replicated by the big data technologies that we have. It stores data in tables and columns. By now everyone knows how an RDBMS works, and pretty much everybody in IT will have some experience with one. RDBMSs are built for number crunching. That is what they were born for and what they are really good at. They have outstanding query performance; the technology is highly optimized to get the best query performance out of your resources. The biggest limitations of RDBMS come with respect to scalability, because almost all RDBMS products are built around a single server that manages everything. That is where the limitation comes in. The schema needs to be predefined: an RDBMS is all schema-based, so you can't really have a flexible schema. Modifying the schema in an RDBMS can take time and resources and can halt the operations you are running. There are very few mature options: Oracle, MySQL, Postgres, and then Microsoft SQL Server. Most probably your enterprise already has one of these RDBMSs running and in use in some fashion. So how does RDBMS compare, from an advantages point of view?
It is a very mature technology. It has been around for 25 years, and it has grown and been thoroughly optimized for what it does. It has excellent query performance and excellent third-party tool support: ODBC and JDBC support is there in almost every reporting, analytics, or ETL tool you will find, so integration is pretty straightforward. It has excellent ACID support. If you really care about data consistency and integrity, this is the tool to go with. Think about an RDBMS especially when you have hundreds of clients trying to work on the same table. That is where an RDBMS really comes to life. If you look at typical big data use cases, you do not have hundreds of different clients trying to work on the same table and insert data. Typically there are a couple of clients in the big data world doing some ETL processing and trying to insert and update data. An RDBMS really shines when you have many clients trying to update the same table or the same record. Shortcomings: scalability in size, with respect to terabytes and petabytes. It has a pretty rigid schema; that is its shortcoming. You need a predefined schema with tables and columns for it to work, and any time you want to change the schema it is going to be a costly affair. Cost, of course: if you are buying Oracle, that is one of the costliest products you can buy today, and also Microsoft SQL Server. Whenever you talk about scaling these RDBMSs, it comes with a lot of hardware and software cost. RDBMSs are not good with text storage. Even though today they support BLOBs and the like, they still lack a lot of text processing and text querying capability.
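The ACID point above can be made concrete with a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for an RDBMS; the `account` table, the `CHECK` constraint, and the `transfer` helper are hypothetical names chosen for illustration and are not from the lecture. The idea it shows is atomicity: either both updates in a transaction happen, or neither does.

```python
import sqlite3

# In-memory database; the schema below is an assumption for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint fired; the credit to dst was rolled back too.
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # would make account 1 negative
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

After the failed transfer, neither account has changed, which is the kind of guarantee the lecture says you would otherwise have to build into your own application code.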
The USP of an RDBMS is all about number crunching. Use cases for RDBMS: when do we use an RDBMS? When we want to store metadata. Say you are building your own custom application and you have data in a big data repository. Big data has not been that good at maintaining metadata. Even something like Hive internally uses an RDBMS to keep track of its schema definitions. So that is a use for an RDBMS, and even if you are building your own custom reporting or ETL solutions, you can still use an RDBMS for storing metadata. Metadata is typically small, and an RDBMS makes it very easy to manage. Multi-update cases, or work-in-progress data: this means records that have to be continuously updated, especially by multiple clients. That is where an RDBMS might be used. When we say multi-update or work-in-progress data, we are not really talking about terabytes or petabytes, because we are not talking about a year's worth of data. We are talking about data in progress, such as live transactions or live user sessions. That is not in the size of terabytes; typically it is in the range of megabytes, at most gigabytes. So you use an RDBMS like a temporary data store. Say somebody logs into a session: you create a record for them in the RDBMS, and you keep updating the record while the session is in progress. When the session closes, you take that record, put it in big data storage, and delete it from the RDBMS. That way the number of records in the RDBMS stays minimal, and you use it just for the multi-update scenario. You can also use it to store summary data: you can use big data for all the processing you want to do.
Once the processing is all done, the summary that comes out of it can go into an RDBMS again. The sizes may not be that big when you are looking at summary results. Similarly, if you run some analysis and then want to compile the results and store them in a table, that is again a good place for an RDBMS, because reporting out of an RDBMS is very easy. So putting summary data and result sets in an RDBMS makes sense, provided the sizes are not that great. Typically, if you can use an RDBMS, use an RDBMS, because it is a very mature technology that will work for you in any kind of use case. The next option we have is HDFS, the file system that comes with Hadoop. This is the one that revolutionized the whole big data paradigm; this is where it all started. HDFS was created to store files in a file system that is not limited to one box or one SAN. It is a distributed file system that can span thousands of nodes. It can store very large files, and a single file can span many nodes. It also keeps multiple copies of the same file, which eliminates the need for backups. It can run on commodity servers; it does not need an expensive SAN kind of setup. That makes it a lot more cost-effective. It is resilient to node failures. You might have thousands of nodes, and if a few nodes fail at any point in time, that is OK. It will continue to run, because it keeps track of where the data is and where the backup copies are, and it can keep them in sync. So it has all these kinds of things.
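The idea of splitting a file into blocks and keeping several copies of each block on different nodes can be sketched in a few lines. This is a toy model only, not the HDFS API: the block size, the replication factor of 3, the round-robin placement, and the function names are all assumptions made for illustration.

```python
def write_file(data, nodes, block_size=8, replication=3):
    """Split `data` into blocks and copy each block to `replication` nodes.

    Returns (stores, n_blocks), where stores maps node -> {block index: bytes}.
    Placement is simple round-robin; real systems use smarter policies.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    stores = {node: {} for node in nodes}
    n_blocks = 0
    for idx, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        for r in range(replication):
            stores[nodes[(idx + r) % len(nodes)]][idx] = block
        n_blocks = idx + 1
    return stores, n_blocks


def read_file(stores, n_blocks, failed=()):
    """Reassemble the file from any surviving replica of each block."""
    out = []
    for idx in range(n_blocks):
        for node, blocks in stores.items():
            if node not in failed and idx in blocks:
                out.append(blocks[idx])
                break
        else:
            raise OSError(f"block {idx} has no live replica")
    return b"".join(out)


nodes = ["n1", "n2", "n3", "n4", "n5"]
data = b"raw crawl data spread over many nodes"
stores, n_blocks = write_file(data, nodes)
recovered = read_file(stores, n_blocks, failed={"n2", "n4"})
```

With three copies of every block, any two nodes can be down and the file is still readable, which is the reasoning behind the "no backups needed" claim in the lecture. Losing all three holders of one block would still lose data.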
HDFS was originally created to handle web-crawling kinds of projects, where somebody is crawling the web: the Googles and the Yahoos. They started crawling the web and collecting so much data that they needed a place to put it all, and that is where HDFS and similar systems for storing data across many servers came from. That is still a great use case for HDFS even today. It is an open source Apache project, so of course it is free: you can download it and start using it any way you want. It can scale from one server to any number of servers. It has limitations on Windows, but people are trying to find ways around that. The HDFS server is written in Java, so it is easy to port to any platform you want. It allows parallel execution of MapReduce tasks: when you write MapReduce tasks that process data in HDFS, those tasks can run on many different nodes in parallel, and that gives it phenomenal capabilities. Compared with the other options, the advantages of HDFS are that it is massively scalable and reliable, it supports parallel data processing, it does not need backups, and it is very cost-effective. No backups and very cost-effective: your finance people are going to love that. Parallel data processing makes things really fast, even when the amount of data stored is pretty high. The shortcomings: it does not have any indexes, so any kind of search is going to be very slow. If you want to search for a record, it is going to be really slow. Security is a concern: HDFS by itself is not very secure, so you need to surround it with external security to make sure the data is not easily breached.
MapReduce programming over HDFS is kind of limited to Java. Other programming languages have pipes and similar mechanisms, but Java is the most optimized language for working with HDFS. That is a limitation. Use cases: log files. Any time you have raw log files and you want to store them, this is the way to go. Just dump them there, and after that you can process them, using HDFS either as a staging area or just as storage in the reservoir. Once you get raw data files, this is the place to put them. Media files, like recordings, voice files, video files, and audio files: this is the place to put them. It can also serve as an online backup for RDBMS data. Rather than using a backup system like tape backup or disk backup, you can use HDFS as an online backup for RDBMS data. That is also a very good use case. So we have looked at RDBMS next to HDFS, one from the old world and one from the new world, with very different use cases, and we are going to see more examples of big data stores next.

17. Persist Options Cassandra and MongoDB: Hi. Continuing on the various options for persistence, let's talk about Cassandra. Cassandra is a wide-column big data store. It is NoSQL, and it is classified as a wide-column store. That means it does have tables, but the columns are very flexible: you can keep adding any number of columns you want. You do not need an operation like ALTER TABLE. There is no special operation for adding new columns: when you are inserting data, you just give a new column name, and it picks it up and starts using it.
That way you can have any number of columns for a given row, and each row can have its own unique set of columns if you want. There is no need for all the rows to have the same column names, so it is pretty flexible as to what kind of data you can store within a row. Every row has what you call a key for the row, and then a list of attributes, which are name-value pairs. The most important thing in Cassandra is that one key, the key for the row. As long as you get that key right, Cassandra is going to be a really useful database. It is open source and was developed by Facebook, which tells you the amount of scalability built into the system. In fact, a lot of the big data technologies were developed internally by one of the new internet-world giants like Facebook, LinkedIn, Twitter, Google, or Yahoo, and then made open source. That tells you they are built for really great scalability. It has the dynamic schema we have been talking about: within every row, its attributes are name-value pairs, and you can keep adding any number of attributes. It has a nice decentralized architecture that can keep scaling as much as you want. Node failures are not going to affect overall performance or reliability; it has great reliability. The most important thing in Cassandra is that there is a single index for each table. There is only one index, and that index is based on the row key. As long as you use that row key in your queries, the queries are going to be absolutely fast. If you cannot use that row key in the query, it ends up being a full table scan, and that is going to be really slow.
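The row-key model described above can be sketched with a plain dictionary. This is only a toy illustration of the access pattern, not Cassandra's API or storage engine; the `WideColumnTable` class, its method names, and the sample users are hypothetical.

```python
class WideColumnTable:
    """Toy wide-column table: one dict of name-value pairs per row key."""

    def __init__(self):
        self._rows = {}  # row key -> {column name: value}

    def put(self, row_key, **columns):
        """Insert or update a row; new column names need no schema change."""
        self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        """Lookup by row key: a single dictionary access."""
        return self._rows.get(row_key)

    def scan(self, column, value):
        """Lookup by any other column: has to visit every row."""
        return [key for key, cols in self._rows.items()
                if cols.get(column) == value]


users = WideColumnTable()
users.put("u1", name="Asha", city="Pune")
users.put("u2", name="Ben", favourite_team="Ajax")  # different columns per row
users.put("u1", last_login="2016-01-01")            # new column added on the fly

by_key = users.get("u1")
by_scan = users.scan("city", "Pune")
```

`get` touches exactly one row, while `scan` walks the whole table. That difference is the reason the lecture stresses choosing the row key carefully and avoiding queries that cannot use it.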
So the use cases for Cassandra, which we will see later, are kind of limited to where you can have this nice single ID for every row. If you look at Facebook, I believe that ID is basically your user ID. So they have one row for every user, identified by that user's ID, and inside that row they can store any kind of information about the user: many columns and any number of gigabytes of data about that user. But the whole thing is based on that one ID. It has excellent single-row query performance if you are querying based on the ID. So when a Facebook user logs in, you take that user's login ID and query for that specific record in the table. You get the data just like that, and you can do anything you want with it. It has bad range-scan performance. If you are going to scan for a range of data, such as "give me all users whose date of birth is greater than so-and-so", that is going to take forever to come back. It has no aggregation support: you cannot do GROUP BY, SUM, or AVERAGE; it does not have any operations like that. As I said, this is a very special kind of table that holds all the information about a user or any object. It is more like an object database: there is an object, there is an object ID, and a bunch of attributes for the object. As long as you use that object ID in your queries, it is going to work really well. The advantages of Cassandra are that it has excellent scalability and performance. It is NoSQL, built for the new world, and really scalable. It has very strong security built in, and it handles multiple writes with excellent performance. Again, as long as an update carries that single row ID, you can update the record any number of times, adding and removing attributes as you go.
Cassandra gives you all that capability with excellent performance. Shortcomings: there is no transaction support, so you have to build transaction support into your own application. There are no ad hoc query capabilities: you cannot write a WHERE kind of query against just any column, because it is going to take forever to come back. There is no support for GROUP BY and no support for joins. Those are the limitations of Cassandra. If you look at any of these NoSQL databases, you are going to see this pattern: they are very limited in functionality compared to what an RDBMS can do. They are built for specific use cases, and you need to identify that use case and use them for it. There is no one size fits all at this point in the NoSQL world. Use cases for Cassandra: you want to build a customer 360. A customer 360 is one record that provides a 360-degree view of a customer. The customer 360 record is identified by the customer ID, and you can put any kind of data you want about the customer in it. This is a very good use case, because in any company or enterprise there is this concept of an object or a customer. You have the customer ID, and around that customer ID you want to keep any amount of data about the customer. Querying for that customer becomes very easy: any time you want to query a customer, use the ID and you get the data just like that, even if you have millions of customers. This is especially true for larger websites like Amazon, Facebook, or LinkedIn: the moment a user logs in, you pull the record for that user into memory, and that is pretty fast. Monitoring, statistics, and analytics.
This is in operations analytics, where every node becomes an object. Every node or server you have becomes an object, so there is one record per node, and you keep track of everything about that particular server in that record. Location-based lookup: again, locations can become the index, and then you can carry any kind of information about the location. Those are the Cassandra use cases. Given that it was built at Facebook, you will see that the use cases are very similar to what Facebook might be using it for. The next one we are going to see is MongoDB, another very popular NoSQL database. MongoDB is a document-oriented database. In other words, you store data as documents. These documents are in JSON format, and JSON, as you know, has attributes and values. You can also have nested documents within JSON, which gives you a lot of capability in terms of what you can store, even one-to-many relationships, inside MongoDB. It has pretty strong consistency: if you look at the ACID properties, it gives you strong consistency. It gives you an expressive query language. The querying options in MongoDB are pretty good: you can query on any attribute you want, and you can do a lot of things in the query. It supports multiple indexes, which is a good thing. You can create multiple indexes on the same table (a collection, in MongoDB terms) on the attributes you frequently use for querying. It has support for aggregations like SUM, GROUP BY, AVERAGE, MIN, and MAX. It has great scalability options, replication options, and failover options. It uses a master-slave model, and it can scale across any number of nodes. It is a pretty good general-purpose database. In fact, in the NoSQL world, this is the closest you can get to an RDBMS.
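The document model described above, with nested documents, queries on any attribute, and group-by style aggregation, can be sketched with plain Python dictionaries. This is not the MongoDB driver or query language; the `orders` data, the dotted-path `find` helper, and `total_by_city` are hypothetical and only illustrate the shape of the data and the kind of questions it answers.

```python
from collections import defaultdict

# Each order is one denormalized document: the customer and the line items
# are nested inside it, so no join is needed to read an order.
orders = [
    {"_id": 1, "customer": {"name": "Asha", "city": "Pune"},
     "items": [{"sku": "A1", "qty": 2, "price": 10.0},
               {"sku": "B7", "qty": 1, "price": 5.0}]},
    {"_id": 2, "customer": {"name": "Ben", "city": "Oslo"},
     "items": [{"sku": "C3", "qty": 3, "price": 4.0}]},
    {"_id": 3, "customer": {"name": "Chen", "city": "Pune"},
     "items": [{"sku": "A1", "qty": 1, "price": 10.0}]},
]


def resolve(doc, path):
    """Follow a dotted path such as 'customer.city' into nested dicts."""
    for part in path.split("."):
        if not isinstance(doc, dict) or part not in doc:
            return None
        doc = doc[part]
    return doc


def find(docs, path, value):
    """Return the documents whose attribute at `path` equals `value`."""
    return [d for d in docs if resolve(d, path) == value]


def total_by_city(docs):
    """Group-by style aggregation: order value summed per customer city."""
    totals = defaultdict(float)
    for d in docs:
        totals[d["customer"]["city"]] += sum(
            item["qty"] * item["price"] for item in d["items"])
    return dict(totals)


pune_ids = [d["_id"] for d in find(orders, "customer.city", "Pune")]
totals = total_by_city(orders)
```

Keeping the customer and line items inside the order document is the denormalized style the lecture recommends for stores without joins.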
So if you are trying to replace an RDBMS, this could be your closest match. It does not fulfill all the requirements of an RDBMS, but it is the closest match. MongoDB, in comparison: advantages. It supports multiple indexes, which is a great one, because it makes querying on any attribute pretty easy. It is open source, which means it is free: you can get it and start using it without paying any money. It has strong, rich query support: the query language is very expressive, so you do not have to write a lot of code to get data out. It is really easy to use. It has support in a lot of programming languages; libraries are available for many of them. Shortcomings: there is no transaction support, like Cassandra. There is no foreign key support, so you have to enforce foreign keys in your code. That is some coding work for you. It also does not support joins. If you look at NoSQL databases, there is a pattern: no joins. That is pretty much because in the NoSQL world you are not expected to do joins. The reason is that you typically store data in a denormalized way, so you do not expect to be doing joins the way you do in the relational world. That is why people are not putting in the effort to create joins in the NoSQL world. So you need to make sure that when you are collecting data, your applications denormalize it and store it in a denormalized form, so that there is no need for joins. That is something you as an architect need to ensure when using these NoSQL databases. Use cases for MongoDB: a write-once, read-many data store.
This is where you want to be using it: you do all your processing, complete all the processing, and then put the data in MongoDB. If you want to store documents, this is a great place, anything stored as documents. Say you are getting user tweets or posts from social media and you want to store them, or you are extracting blog text from web pages and want to store it: this is a great place to start. MongoDB is a great option for real-time analytics because it is pretty fast. You have to set up the right indexes, and then you can use it for real-time analytics. It is also a possible RDBMS replacement. So those are some of the good things about MongoDB. It is pretty popular and more general-purpose than Cassandra. It has its limitations, and I believe they are working to solve them as time goes on. As we discussed, all of these products start with a specific use case and then try to expand their reach across multiple use cases. Thank you.

18. Persist Options Neo4j and ElasticSearch: Continuing on more persistence options, let's get to Neo4j. Neo4j is a graph-oriented database, and it deals with relationships. When you say it is a graph, we are talking about a number of objects and how they are related to each other. This is not just one-to-one relationships, but also how they behave as a group. The best example for visualizing Neo4j is your Facebook friends circle or your LinkedIn relationships, where you are one person related to a set of other people, who in turn are related to other people, who are interrelated. It is like a continuing relationship chain or graph, with many people related to many other people.
Neo4j is a database that is useful for storing those kinds of relationships, where you have a lot of nodes, there are relationships between them, and a relationship can itself have attributes. Typically, if you want to store these things in an RDBMS, you end up with a lot of child tables to store the relationships. Neo4j gives you a great way of storing relationships. It is a very specialized database, used for use cases where relationships matter, such as how one object is related to another. It is ACID-compliant, which is one of the greatest things about Neo4j. Surprisingly, in the NoSQL world you find a database that is close to an RDBMS in some of these attributes. It has transaction support, an excellent graph query language, and join support. Those are all good things about Neo4j, but it is limited to storing relationships. It has very fast relationship traversal: it can traverse the nodes to find the relationship between A and B. Suppose you want to ask: person A and person B, is there even a relationship? Like on LinkedIn, where it shows that someone is in your third circle or fifth circle, or shows the chain of people through which you are connected. Neo4j helps you store and query information like that. That is its main purpose. Advantages of Neo4j: transaction and ACID support. That is one less headache if you use this one compared with other NoSQL databases. It has excellent query support and great traversal of nodes, so you can do these traversals in real time and get information out. And it maps easily to object-oriented applications.
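The "which circle is this person in" question above is a shortest-path traversal over a graph. The sketch below shows the idea with a breadth-first search over an in-memory adjacency dictionary. It is not Neo4j or its query language; the `friends` graph and the `degrees_of_separation` function are hypothetical.

```python
from collections import deque


def degrees_of_separation(graph, start, goal):
    """Return the number of hops between two people, or None if unconnected.

    `graph` maps each person to the set of people they are directly
    connected to. Breadth-first search finds the shortest chain.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == goal:
                return dist + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None


friends = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": set(),  # nobody is connected to E
}

a_to_d = degrees_of_separation(friends, "A", "D")
a_to_e = degrees_of_separation(friends, "A", "E")
```

In a relational schema the same question usually turns into one self-join per hop, which is why the lecture treats relationship-heavy data as the case for a graph store.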
Typically, the databases you use do not map directly to the object-oriented world. That is why people create object-relational mapping layers in between, to map the relational model to the object model. In the case of Neo4j, it maps pretty easily to the object-oriented world, because it talks about objects, relationships, and actions. Shortcomings: it has no built-in user management. I believe that is something that will get solved as time goes on. It does not have any aggregation support: no GROUP BY, SUM, or AVERAGE kind of thing. And it is generally not suitable for data that does not have a lot of relationships. If your data does not have relationships, or relationships are not important for your use case, there is no point in using Neo4j. This is a very specialized database that you want to use only when you really need it. Use cases: master data management. In master data management you are trying to store information about relationships, between data and data, between object and object, or even between node and node. So master data management is one place where Neo4j works well. Network modelling: any time you hear the word network, this is the thing to go to. If you want to model an IT network, where you have many nodes, servers, and switches and want to find which is related to what, this is one place you can use it. Similarly, social network modelling, like what Facebook or LinkedIn does: this is a very good database for that use case. You can also use it for identity management. You have many users and many applications, and you want to know which user can use which application and what kind of permissions they have on each of these applications, including authentication and authorization.
If you want to store that kind of information too, Neo4j is a good fit. To sum up, this is a specialized database. Use it where you want to store relationships among a large number of people or objects, especially when you want to create that kind of relationship map. This is the one to go to, because everything else will struggle: to do the same thing in an RDBMS, for example, you might be writing an endless number of joins to get the same information, and that is not going to work. Next we move to another specialized database, Elasticsearch. Elasticsearch is a full-text search engine, where data is stored as text in attributes, as name-value pairs, and you can search on basically any text you want. It is a distributed document store: highly scalable across any number of nodes, fail-safe, and all of that comes with it. It stores documents as text, and you can search anything in that text, because every field is indexed and searchable. The documents are typically JSON documents. All attributes are indexed by default, so you do not have to create indexes: every field is indexed and every field is searchable. It has excellent query performance, with a lot of flexibility in querying, and it can scale to hundreds of servers for both structured and unstructured data. It also has aggregation support. This is phenomenal: you can do any kind of search on any kind of data. You might already be wondering why we need the other databases, then. It is because Elasticsearch is not a golden sword in every area, and we are going to see that too. The advantage of Elasticsearch is that it has outstanding search capabilities.
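The statement that every field is indexed and searchable can be illustrated with a tiny inverted index. This is a conceptual sketch only and says nothing about how Elasticsearch is implemented internally; the `TinyIndex` class, its tokenizer, and the sample documents are assumptions for illustration.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: every field of every document is tokenized."""

    def __init__(self):
        self._docs = {}
        self._postings = defaultdict(set)  # (field, token) -> set of doc ids

    def add(self, doc_id, doc):
        """Store the document and index each word of each field."""
        self._docs[doc_id] = doc
        for field, value in doc.items():
            for token in re.findall(r"\w+", str(value).lower()):
                self._postings[(field, token)].add(doc_id)

    def search(self, field, word):
        """Return ids of documents whose `field` contains `word`."""
        return sorted(self._postings.get((field, word.lower()), set()))


index = TinyIndex()
index.add(1, {"user": "asha", "text": "Loving the new phone", "lang": "en"})
index.add(2, {"user": "ben", "text": "My phone battery died again", "lang": "en"})
index.add(3, {"user": "chen", "text": "Great coffee today", "lang": "en"})

phone_hits = index.search("text", "Phone")
user_hits = index.search("user", "chen")
```

A search becomes a dictionary lookup instead of a scan over all documents, at the price of extra work and storage on every write. No index had to be declared up front for `user` or `text`, which mirrors the behaviour the lecture describes.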
It has aggregation support, and it has a nice, flexible schema. If you want to build a database that people can use to search anything and everything, this is the one to go to. Shortcomings: there is no ACID support and there is no SQL support. This is where you start having some problems with this database. There are also some data loss risks. That is a problem with the product currently; possibly they will fix it as time goes on, but possible data loss is something people have been talking about. Of course, I believe the company would be working on fixing these things. Use cases: it is not recommended as a primary data store. I've seen this in a lot of places, people recommending that you not use it as your primary data store, but use it as an adjunct data store, into which you also populate data and which people can then use for any kind of querying and aggregation. Elasticsearch also comes with an associated visualization tool called Kibana, so you can use them in combination to do any kind of ad hoc querying and aggregation. It's a great tool for real-time analytics, especially ad hoc real-time analytics. Real-time data can be stored in Elasticsearch as documents, and then you can do any kind of analytics on it very fast, with excellent performance. So this is a good one to use as an adjunct database, where you store data in a pretty flexible, denormalized format, and then people can do any kind of querying on this data. It is a great option to consider if ad hoc querying and aggregation is an important thing for you with big data. Thank you. 19. Transformation module: All right. Welcome to this lecture on the transformation module. The transformation module is where big data architects typically spend a lot of their time.
Developers also spend a lot of time here designing and developing code, so pay close attention to what happens in the transformation. So what are the responsibilities of a transformation module in a big data situation? The first would be cleansing data: looking at the data and cleaning out a lot of unwanted stuff, such as extra junk and garble, removing junk characters, and reformatting dates. There is a lot of cleansing that typically happens on data, especially if the data is coming from social media or from some kind of web crawling application. You might also be filtering data, which is removing any unwanted or incomplete data that you don't want to use for further processing. Then standardization: standardization in terms of format, like date formats and name formats, and of content, such as initial caps, all caps, or all lowercase. There are all kinds of standardization you want to do, and first of all you want to make sure you note down all the standardizations you need, because that makes things a lot easier down the line when you do various activities. Then data enrichment, like adding names. This is the denormalization stuff. You want to denormalize names all over; you don't want to be keeping a lot of referential integrity when it comes to big data. You want to denormalize them all and put the names alongside the IDs, once, in the record. The expectation is that, in big data situations, you are not really going to be updating the data very frequently or changing these names. Then comes bucketization. Bucketization applies, for example, to age.
You might want to create an additional attribute like age range, such as 1 to 10, 10 to 20, 20 to 40, and bucket the data, because you also want to do analytics and querying based on the bucketization. You can also do categorization, which is also a kind of denormalization. Basically, if you have a customer record and you have categories of customers, you want to put all that information in the same record. Then there is integration of data, especially between data sources. For example, customer information is coming from your CRM application, there is social media information, and there could be something else. You want to integrate them all together and keep them as a single record. When it comes to big data, it is fewer tables with a lot of data. There are not going to be hundreds of tables in a big data situation. There are going to be very few tables. You're going to be denormalizing everything, integrating everything, and storing it all in a single denormalized structure. You are not going to be bothered about space when it comes to big data. Space is a given, because you are using commodity hardware, so it's okay to denormalize and keep the data that way. You are not going to be creating hundreds of tables and trying to link them together with joins; you are going to be denormalizing everything. And finally, summarization and reporting: you may have to do some kind of summarization, or create some summary records, that might make your reporting down the line a lot easier. What are the things you are going to architect in the transformation layer? The first thing you are going to be architecting is the difference between real time and historical: whether you want to process them in the same pipeline or you want to separate them, because real time needs speed.
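As an aside, the standardization, bucketization, and denormalization responsibilities listed above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the lecture: the field names, the accepted date formats, the `40+` bucket, and the `category_names` lookup are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical buckets based on the lecture's "1 to 10, 10 to 20, 20 to 40" example.
AGE_BUCKETS = [(0, 10, "0-10"), (10, 20, "10-20"), (20, 40, "20-40")]

# Assumed input date formats; a real feed would define its own list.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def standardize_name(name):
    """Collapse extra whitespace and convert to all caps."""
    return " ".join(name.split()).upper()


def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def bucketize_age(age):
    """Map an age to a range label so queries can group on the bucket."""
    for low, high, label in AGE_BUCKETS:
        if low <= age < high:
            return label
    return "40+" if age >= 40 else "UNKNOWN"


def transform(record, category_names):
    """Standardize, bucketize, and denormalize one customer record."""
    return {
        "customer_id": record["customer_id"],
        "name": standardize_name(record["name"]),
        "signup_date": standardize_date(record["signup_date"]),
        "age": record["age"],
        "age_range": bucketize_age(record["age"]),
        # Denormalization: store the category name next to its ID in the record.
        "category_id": record["category_id"],
        "category_name": category_names.get(record["category_id"], "UNKNOWN"),
    }
```

Each function works on one record at a time, which is what lets an engine run the same logic in parallel across many records.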
Historical can take a lot of time, especially when it comes to data processing where you are parsing record by record. You might not want to be doing all of it in one shot, because that takes time. If you have some real-time requirements, you may want to isolate them and process them separately, because real-time requirements are only the critical requirements, not all the requirements, whereas for historical you can take your own time to process the data. Templates: you want to create templates for processing data. That's a good practice. Generally, with the kind of tools that you have for writing these transformation layers, the code tends to be ad hoc scripts, and you want to watch out for that. Don't start building a lot of ad hoc scripts. Rather, you want to get into a function kind of framework or a template kind of framework, so that there is a lot of reusability built into the code, and then it can be used for a lot of other purposes as well. Denormalization: you want to denormalize all your big data and not have any kind of referential integrity running. You have to really plan how many tables you want to have. If you look at any of the large deployments, they don't have more than ten tables in the system. Actually, there are very few tables, but a very large amount of data in each of those tables. That's how they are typically architected. Reprocessing: any kind of reprocessing is something you need to architect for, because there is always a possibility that you find that some processing did not go through and you have to reprocess something. So that is something you want to take into consideration when you are building an architecture. You also need to have parallelism. How can you achieve parallelism in code? Typically, a lot of the tools bring in parallelism by themselves, but you need to make sure that you also use it when you are triggering certain applications.
There's no point in having a tool that can parallelize things and not triggering that parallelism. You need to know how to trigger parallelism in each of the tools or technologies that you are using. You need to enable parallelism in the setup part, which is how you configure your engines and even the VMs. You should tell it how many threads you want to run. How are you going to decide that? The thread count is possibly based on the number of cores you have on the box, but you have to decide it. Similarly, in your code, you make sure that you code it in such a way that it enables parallelism when it actually runs. Of course, you want to make sure that things run as fast as possible, so architect for that and provide for that. Also, work-in-progress storage. It is not possible every time to keep all the data you want in memory, because you are going to quickly run out of memory. You also want to create a pipeline, or stage-by-stage processing, where you do one set of processing, store the results in work-in-progress storage, and then do the next set of processing. In big data this is something you need to provide for, because it is very hard to do a lot of things in memory; you will run out of memory pretty soon. So you need to watch out for that, and maybe plan it as a step-by-step kind of thing. Now, a lot of these comments contradict each other. Once you say speed; next you say you want to be doing step-by-step processing, which might actually delay some things. You need to come up with a nice compromise as to how you are going to balance speed against work-in-progress storage and things like that. Best practices: keep real-time and historical data separate when the data is greater than terabytes.
It's better to keep the data separate. Otherwise, if you try to build both of them together, you need to watch out, because you might get into really big trouble trying to build both the volume of historical data and the throughput expectations of real time into the same single application. Use MapReduce as much as possible. MapReduce here is more of a general concept; I'm not talking just about Hadoop MapReduce. The MapReduce kind of paradigm is also supported elsewhere. For example, Spark also supports the MapReduce kind of paradigm, and you should try to use it, because the good thing about MapReduce is that, when you use map- and reduce-based functions, it gives you a lot of parallel processing capabilities based on the number of nodes you have. So that is a good thing for you to do, and almost every big data processing tool gives you some form of MapReduce as a concept. Again, I am not really talking about Hadoop MapReduce; it is generally MapReduce as a concept, and you should try to use it. Do not reinvent the wheel. Don't think that you can build something yourself, because that is going to be a pretty costly affair in terms of time and money. People built these data stores for big companies like Facebook, LinkedIn, and Twitter, so they are built with a lot of scalability in mind. Try to use the existing technologies as much as possible. Build template code and functions for enterprise-known use cases. Your organization will have its own processing logic as to how you process various things and how you want to summarize stuff. Try to build template code and functions whenever possible, so that you improve the reusability of the code. Keep intermediate data for some time; do not throw away data. There is a tendency in big data to throw away data which you have already processed. Don't do that, because you might have to reprocess.
If your processing involves, say, 10 different steps and there is intermediate data you create, store the intermediate data for some time. Then, if you have to reprocess, you don't have to start all the way from the first step. Maybe you can start from the fifth step or the tenth step. You can architect it in such a way that you can reprocess from any point. Maybe keep it for three days or five days, until people have checked the data and are okay with it. Keep the data for some time; that's a good practice. Summarize data only if required. Big data is all about denormalized data kept at the most granular level. Always keep the original data; do not throw away the raw data, and summarize only if required, because there are some great processing capabilities provided by the big data tools. Unless you think there are a lot of canned reports that are going to run on a regular basis on the summarized data, there is no need for you to summarize anything. And finally, build in monitoring capabilities for performance. The performance of big data processing can deteriorate pretty quickly. So have some monitoring capabilities within your code that print out or log how well your code is performing, because you may see that it runs fine for 20 days or 30 days, and after that it might start going down, and you might need some kind of troubleshooting help at that point. That's it for the transformation part. We will go on and look at the options for transformation. Thank you. 20. Transform Options MapReduce and SQL: Hey, welcome to this lecture on transformation options. What options do you have for transforming data in a big data architecture? The first, of course, is custom code. We all love to write code, so writing custom code in your favorite programming language would be like your favorite option, because you want to go and build everything yourself.
But let's make sure you think before you get there, because when you are creating something for a big data situation, you need to have scalability, reliability, and parallelism. If you are building something from the ground up, you need to build all this stuff. Rather, you want to take an engine, a general processing engine like Spark, SQL, or MapReduce, and then build on that. You still have code you build, but you build it on an engine that gives you these functions by default. Remember that the people who built things from scratch actually built the technologies that we are talking about in this lecture. That is how complex it is. That's why, once they built these technologies, they made them open source and left them for everybody to use. What are the advantages of custom code? You can build for your specific needs and situations. You have easy integration with custom sources and sinks, so if you have your own data sources and your own sinks, very custom to your enterprise, then custom code would be the way to go. And when you build custom code, you can reuse existing computation code from your older systems, because a lot of the time you will be doing the same kind of processing. Those are some of the advantages of custom code. What are the shortcomings? There is too much to build and maintain. It is good if you can limit the amount of custom code you have to build as part of your big data solution. You will still have to build custom code, but watch out for it. Maybe you want to build it for those areas that do not require parallelism and scalability, and even so, limit how much you are going to invest here. There is going to be a long cycle time, because you have to build, test, and maintain the code, and there are heavy resource requirements, with talented people required to build custom code. As for use cases:
I would not recommend custom code unless you have a use case for which there is no ready-made solution available. So first look for ready-made solutions. If there are none, then look at writing custom code. You still have to write some custom code anyway, like scripting and some integration code, but always look for an existing tool or technology rather than trying to build something yourself. Next, let's look at Hadoop MapReduce. Hadoop, as you know, is a combination of two technologies: the HDFS technology and the MapReduce technology. MapReduce is the first big data processing technology that came out, and it revolutionized the way data is processed. We are where we are because of this technology. The good thing about this technology is that code moves to where the data resides. Typically, in a regular application, the kind you have used so far, the data moves from the database layer to the application layer for any kind of processing. In this case, the code is moved to the data. If you have a piece of code, that code is moved to every node in the big cluster, and the processing is done there. Mappers are pieces of code that can work in parallel on individual records and perform transformations. They work on individual records independently, which means that they can work in a truly parallel manner. And you have reducers that can then summarize the output of the mappers and aggregate it. So MapReduce has mapper code that runs on every individual record, and reducer code that can summarize data across records. You can build a series of MapReduce code to build a pipeline, so you can run the actual processing pipeline as a chain of MapReduce jobs.
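The mapper and reducer roles described above can be illustrated with a small word count in plain Python. This is a single-process sketch of the paradigm only, not Hadoop code: the `shuffle` and `run_job` helpers are hypothetical stand-ins for the work the engine does when it groups mapper output by key and distributes it across nodes.

```python
from collections import defaultdict


def mapper(record):
    """Work on one record independently: emit a (word, 1) pair per word."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group mapper output by key; in a cluster the engine does this step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Summarize all values emitted for one key across records."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = (pair for record in records for pair in mapper(record))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())
```

Because `mapper` never looks at more than one record, its calls could run on different nodes at once; only the reducer needs to see data from across records.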
When you build MapReduce code, you are building a piece of code, typically in Java, but you just focus on the functionality you want. The whole business of running it on parallel systems, moving data between the systems, and all those logistics is taken care of by the Hadoop engine. It uses cheap hardware with extreme parallelism. That is a good thing about MapReduce. So what are the advantages of MapReduce? Parallelism, which helps it handle huge data loads. This is what revolutionized big data. It can handle text very easily, and it can work with flexible data very easily. One of the strengths of MapReduce is that it is not schema driven; it is text driven, and that makes it very easy to handle text. You can build custom processing code for your business functionality, so you can build code that focuses on your business functionality and computations and not worry about all the scalability and parallelism concerns that come with big data. Shortcomings: it is not suitable for real time. MapReduce is a really batch-oriented operation. Reducers can be choking points if you are expecting the reducer to do a lot of stuff, because the reducer is a single piece of code through which all data passes. So you need to architect carefully what the functionality of the mappers is and what the functionality of the reducers is, and make sure that the reducer's functionality is minimized as much as possible. You also need developers who can think in the MapReduce paradigm. You need to architect the mappers and reducers correctly for them to work in an optimal way, and getting developers into this MapReduce kind of thinking requires some training and some experience. Use cases for MapReduce: batch mode processing. For any kind of batch mode processing on flat files or text files, this is a great option. Text mining is another.
For text mining, you look at the text, split it into strings, come up with the words, and do all the n-gram processing. Text cleansing, such as upper case to lower case conversion: for all of this, MapReduce is a great place to do it. Data cleansing and filtering: you want to clean the data in a record-by-record manner. Similarly, filtering of data is again record by record, so this is again an excellent place to do it. And, of course, analyzing media files. If you want to analyze media, there might be some script that can go through a media file and discover some information about the media; for that, MapReduce is a great option. But then, as we have been saying, this is a slightly old technology. It is still very good when it comes to batch mode processing, but it is also slower and not really suitable for real time. It still carries a lot of weight in terms of what it can achieve. The next option is the SQL-like query. I call it SQL-like because sometimes it's called SQL, sometimes it's called CQL, and everybody has their own query language. Every NoSQL database, or any other database that you use in your big data world, has some form of query language that is supported. The data products either have some kind of SQL support natively, or there is a product like Hive or Impala that gives you a querying interface over the data stored in Hadoop, and they come with many different sets of capabilities. What these SQL queries can do is filtering, cleansing, transformation, and summarization, and they can insert the output back into the source. For example, in a select statement itself, you can do filtering using the where clause. You can do cleansing by using some functions. You can do a transformation from upper case to lower case.
That is a simple thing to do using a function called upper or lower. A lot of these query engines also let you write your own custom functions, so you can use those as part of the query to do your own transformation. You can do summarization with group by. But different engines have different capabilities, and not all of them support all the capabilities. Some SQL engines also allow you to insert or update back to the source. You can say insert into a table as select something from another table, and that in one shot does filtering, transformation, cleansing, and inserting into another table. You can put all of them together in one SQL statement. That is pretty powerful. In what SQL can do, you are again limited to what the database supports. In this case, the SQL engine does the heavy lifting. It also has its own optimization algorithms to make sure it processes the queries in a very good manner and does all kinds of load balancing. So a simple thing to do is to write a query in SQL, put it in a script, and then it can keep running forever. Advantages of this query mechanism: minimum effort for maximum returns. One query, and you can do a lot with it. The engines are optimized for performance and parallelism; somebody has already invested time and material into that, so you reap all those benefits by writing a simple query. You can do a lot using an Impala engine or a Hive engine. They give you a lot of speed and a lot of ease of use. They have their own cataloging and metadata. All you have to do is create some shell scripts and put them in as cron jobs, and that will do the work for you. Shortcomings: it has limited capabilities.
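To make the one-statement idea above concrete, here is a small sketch of an insert-as-select that filters, transforms, and summarizes in a single query. It uses Python's built-in `sqlite3` module as a stand-in for a big data SQL engine such as Hive or Impala, whose syntax and function sets differ; the `raw_orders` and `order_summary` tables and the `summarize_orders` function are hypothetical names made up for the example.

```python
import sqlite3


def summarize_orders():
    """Filter, transform, and summarize with one INSERT ... SELECT statement."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE raw_orders (customer TEXT, amount REAL);
        CREATE TABLE order_summary (customer TEXT, total REAL);
        INSERT INTO raw_orders VALUES
            ('alice', 10.0), ('Alice', 5.0), ('bob', 7.5), (NULL, 3.0);
    """)
    conn.execute("""
        INSERT INTO order_summary (customer, total)
        SELECT UPPER(customer), SUM(amount)   -- transformation and summarization
        FROM raw_orders
        WHERE customer IS NOT NULL            -- filtering out incomplete rows
        GROUP BY UPPER(customer)
    """)
    rows = conn.execute(
        "SELECT customer, total FROM order_summary ORDER BY customer"
    ).fetchall()
    conn.close()
    return rows
```

In a real deployment the same kind of statement would sit in a script scheduled as a cron job, with the engine handling optimization and parallelism.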
Each of these SQL engines comes with its own set of capabilities, which are actually primitive compared to what you get in an RDBMS. If you go to an RDBMS like Oracle or MySQL, the number of functions you have there is large. When you come to any of these big data engines, there are not a lot of functions. For things like date formatting functions and string functions, they don't have a lot of capability, so you need to write some custom functions yourself. You have to write them, and the engines do provide some ways in which you can write Java classes or something like that to create these custom functions. Combining different sources and sinks is difficult, because the query language is typically limited to one data source or sink. For example, Cassandra can insert into a table in the same database system, not into a different one. You cannot do SQL queries between Cassandra and MongoDB, or between MongoDB and MySQL. So that is a limitation there. Use cases: filtering, if you can do it within the query, summarization, and copying data, of course. For all of them, if the query engine allows it, this is a great way to do it. We will see a couple more options in the next lecture. Thank you. 21. Transform Options Spark and ETL Products: Okay, let's move on to the big elephant in the room, and that is Apache Spark. Apache Spark is the new-generation general data processing engine. It is built for data processing, for performing the transformations that we have been talking about, and it eliminates a number of shortcomings of the traditional MapReduce. When MapReduce first came into the world, with limited use cases, it was running fine. But as people started using big data technologies more and more, they found that the MapReduce paradigm in Hadoop was not fitting a lot of their requirements.
They needed a lot more: they wanted speed, they wanted some flexibility, they wanted to do a lot of other operations, and they wanted tighter integration with programming languages. Apache Spark was born to fulfill a lot of these needs. It works on data in memory, which makes it really fast, and it works in a distributed fashion: it distributes the load across workers and then collects the output back. It does a phenomenal job of doing things really fast. It supports MapReduce-type operations, so you can still write mappers and reducers, but it's a lot faster. Not only that, MapReduce programming itself is a lot easier in Apache Spark compared to Hadoop MapReduce. It is simple, and you can write one line of code to do all of that. Map and reduce are actually functions in Apache Spark, so you don't have to create classes for them. It supports streaming, so you can do stream processing in Apache Spark. As data is coming in, you can subscribe to a stream or a publisher, and then, as the data arrives, you can perform operations on the stream. That is a really cool capability for real-time processing. It supports Java, Python, R, and Scala. Even though Spark is written natively in Scala, you can work with it through Java, Python, or R, which gives you a lot of flexibility in what programming language you want to use, and that is a great benefit. It has SQL and graph capabilities as well. There is Spark SQL, which has phenomenal capabilities, I would say, because it provides you SQL-like operations such as select, group by, order by, and filtering, which you can write in one or two lines of code, and it internally converts them into parallel jobs and does things for you at the back end. At the front end you are doing simple stuff, but there is a lot going on at the back end.
Behind the scenes, it uses a lot of powerful processing capabilities. It also has graph processing available, in case you want to work with a lot of graph kind of information, which is all the linking between objects. So it does that too. It also has interactive processing capability. In MapReduce, you write a MapReduce program and run it, whereas in Spark you can use the interactive processing capabilities to work on data line by line. It is almost like having an SQL window in which you keep writing SQL statements. You have the Spark interactive command-line prompt, in which you can keep giving Spark commands one by one and work on data, and it takes care of storing the variables, keeping track of the variables in memory, and so on. That gives you some phenomenal processing power. You can use it for ad hoc processing when you have a data set and you are just working on it. But once you know what processing you want, you can put all of it into a script and run it as a headless script as well. So that's the phenomenal processing power that you have with Apache Spark. Advantages of Apache Spark: it's fast, it's flexible, it's powerful. It supports different kinds of processing capabilities: SQL-like processing, MapReduce-like processing, real-time data stream processing, and graph processing. It can run along with Hadoop, it can run standalone, it can run on a Windows box, and it can run along with Mesos. Shortcomings: there is a significant coding effort. If you compare it with SQL, it's more coding effort; if you compare it with MapReduce, it's less coding effort. It is still immature. When I say immature, I mean it is rapidly evolving.
You can see that between different versions of Apache Spark there is migration that you need to do, because a lot of new capabilities are being added and a lot of old capabilities are being dropped. It is a fast-moving technology, so you need to be careful: what you build today, you may have to migrate a year down the line. So that's something to keep in mind. It has significant hardware requirements in terms of memory and CPU, simply because it's optimized for speed. Of course, it needs resources to run at that speed on large volumes of data. Use cases: it has a wide range of use cases, from text processing and number processing to data filtering and transformation; you can do almost anything. Additionally, it can be used for interactive processing. Typically, when you are trying to build a project, you are not going to start by writing a full program from scratch. First, you are going to be doing interactive processing. You try out a few things: you take the data, filter the data, see how it looks, maybe then apply a machine learning algorithm, try it out, and see how it looks. That is where you are playing around with the data, and interactive processing helps you there. Then, of course, real-time stream processing: Apache Spark is a great fit for real-time processing as well. I would go ahead and say this is the de facto standard now for transformation engines. Is it the best option available? Yes, possibly. But watch out: there is something called Apache Flink that is coming up, and I don't know where that is going to go. At this point, though, Spark is the de facto standard. Now, Apache Spark is not only a candidate for transformation. It is also a candidate for reporting and for advanced analytics.
It also helps in the transport layer, because it has tight integration with things like Kafka and Flume. It can help in acquisition as well, because it has tight integration with Twitter. It has a variety of libraries for various data sources, such as Cassandra, MongoDB, Neo4j, and RDBMSs, with connectors to all these databases. So it gives you a wide range of capabilities that makes it a very good option for any kind of transformation. In fact, almost all the use cases that we will be looking at use Apache Spark for transformation. So that's the power of Apache Spark. Then come ETL products, which are products in the market for extract, transform, and load, which is basically the same kind of functionality. There are a lot of products out there for this function, and more are coming up. People are developing products like this left, right, and center. Some of the very popular ones are Talend, Pentaho, Jaspersoft, and SnapLogic. All these offerings have commercial and open-source versions. The open-source version typically comes with limited functionality, and the commercial version comes with a lot of functionality. These products have visual IDEs and pipeline builders, so you can have a designer where you drag and drop stuff and connect various images and icons, and then you have a pipeline going. It is pretty easy to build things using these pipeline builders, and that's really cool. You can build flows right from acquisition through transformation to storage. So this is an end-to-end ETL engine. Even though we are talking about it here in the category of transformation, there is also a place for it as an acquire option as well as a transport option. That is a good thing. It also has support for custom functions.
products typically have a lot of connectors to the various RDBMS and NoSQL databases that are out there, and you can write custom functions if you have to do some specialized processing. Also, there is operations and management available in these products. What that means is that you might have a sandbox where you develop your scripts, and there is a production deployment where you deploy the scripts. They have a way by which you can click and deploy a script from your sandbox to production. Then there is operations management: you can use schedulers, you can run the scripts at any time you want, and then you can manage them and look at their status. So it comes with its own full baggage of operational features. What are the advantages? One of the advantages of ETL products is that it is easy to build workflows. You have drag-and-drop capabilities, so they are pretty easy to build. They have good integration with various data sources and sinks; they do have a lot of connectors to everything and everywhere. They have management capabilities, as we just talked about, which is a good thing. If you are going with a Spark kind of implementation, you may have to build the management capabilities yourself, like how you move code from a development sandbox to QA to production; you have to do all this yourself. Shortcomings: they can get complex pretty quickly, because these products are built for some standard use cases, and the moment your use case starts getting a little out of sync with the standard ones, this can get really complex. Maturity would again be a question, because these are new products. Scripting and tooling are still moving; there are a lot of moving parts that are still being aggressively developed. And they can get really pricey for commercial offerings. You might want to watch out for this. You might think there is an open source version, but it hardly has anything.
If you want to build real products, that comes through the commercial license, and the commercial licenses are very pricey. Inter-organizational workflows can become tricky. Typically, these ETL products work well within a single organization, where the sources and the sinks are all in a single organization. But once you start building a network of data flows and pipelines across organizations, this starts getting messy pretty quickly. Use cases: what I am going to say here is a bit iffy. Any use case is supported on paper, and the devil is in the details. All these products tell you that they can support any of the popular use cases that you have. You go to their website, and they are going to say we can do this, we can do that and everything, but the devil is in the details. Before you commit yourself to any of these products, please take one of the products and try it out. Sometimes yours might end up being a very easy use case, and then the product will just fit in, and you are up and running without any issues. Because it is a UI-based product, you don't really do a lot of coding. Just drag and drop, and in a couple of days you have this up and running as a good application. So that is the one good thing about this option. But then it can pretty quickly turn very tricky, trying to do everything through the UI. Sometimes you have to bend over backwards to get some functionality, and it might get messy also. So there are dark places you might want to watch out for here. Thank you. 22. Reporting module: Hi. Welcome to this lecture on the reporting module. This is your instructor, Kumaran, here. One of the big purposes of going into a big data architecture is to build some kind of reporting solution that was not possible earlier with your regular reporting stuff. So that is something you want to give consideration.
Because reporting is the area where others in the company can easily see the work you have done in big data. People who are not technically involved in the project are asking: okay, we had this big data project, what is it going to give us? Reporting is the place where they can go and look at something new that was not offered to them before through the traditional data solution. So you might want to give some thought to reporting, because this is where you can show the value that you brought through big data, which was not possible earlier through the traditional data solution. So what are the responsibilities of the reporting layer? Of course, it starts with canned reports: provide some ready-made reports that people can go and execute on a daily or weekly basis and get some data out of. It would also be good if your reporting layer has a do-it-yourself report designer, where people can go and create their own reports by just dragging and dropping a few columns and then getting the report going, along with some graphs or something similar, where they can play around with the data. That gives people some ownership in your reporting solution. A dashboard designer would be good, because you want to create a number of dashboards, possibly personalized dashboards for each and every individual, so they can look at a lot of data, different kinds of data, all at once on the same dashboard. You will, of course, need APIs to extract data from the persistence layer, so others can build applications on the data that you have processed, created and kept in the big data repository. Other people can query it through APIs, possibly REST APIs, and be able to do some kind of reporting out of it.
Through the REST APIs they can use a SQL kind of API, so they can get some data out of it and build more applications on the data that is stored in the big data repository. Authentication and authorization: any reporting solution that you build should have its own proper authentication and authorization schemes, for security and privacy reasons. Reporting should also provide for real-time graphical presentation. There is no point in just processing real-time data and keeping it if you cannot also visualize it in real time, which means there has to be very little latency between where the data is created and where the data is presented. On a dashboard you want to show refresh rates of one or two seconds, where the data keeps refreshing without any delay or latency. And finally, the reporting layer should also have some kind of alerting, so people can manage the reporting layer if things are going wrong. If connectivity to the data layer is lost, or anything like that is happening, people can look at alerts and see what is going on. So alerting is also a key feature when you are building the reporting layer. What do you want to architect in the reporting layer? What is it that you want to focus on? You want to focus on response times. You want to make sure that people who are using the reports interactively do not spend a lot of time sitting and waiting for the reports to show up. One of the things about big data technologies is that there is a significant amount of latency or delay in executing queries. So when you are giving reports that are interactive in nature, you want to use technology that can provide very low latency and very low response times, and you need to architect in such a way that the data is built and stored for this kind of access. Mobile and web access: both are important today.
Any reporting solution should have mobile access these days. Personalization, where every individual can build their own nice little dashboards and reports. That way they each have one dashboard to go to for looking at the things that matter to them, rather than trying to share dashboards across people. They don't want to be working with 200 canned reports. Rather, they just want to look at one report that has data for them from all the places, all the data that they want. It will be a cool dashboard for them. Advanced graphical capabilities: a lot of tools today provide advanced graphics capabilities, much more than your pie charts and bar charts, and this is also a very important feature these days. Threshold management: threshold management is about the reporting layer being more of a push kind of layer. What I am trying to say here is that you are not only capturing and storing data, you are also looking at the data, possibly in real time, to see if certain thresholds are exceeded in your application. Suppose you are serving a bunch of pages for your enterprise, and the delay in rendering the pages is going beyond a certain level. You want the reporting layer to tell you that something is happening with respect to the performance of your applications. Or maybe your sales are going down: you are looking at minute-by-minute sales or minute-by-minute production, and it is going up or going down. You want some kind of alerting based on that kind of data change as well. Integration with other systems: of course, you want to make sure that the reporting layer does not just integrate with your big data sources. It can also integrate with other data sources that are outside your big data application framework. It can also integrate with your traditional sources.
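The threshold management idea described above (push an alert when page rendering delay or per-minute sales crosses a limit) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names `page_render_ms` and `sales_per_minute`, the limits, and the `Threshold` and `check_thresholds` helpers are all hypothetical and not taken from any particular reporting tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Threshold:
    metric: str     # hypothetical metric name, e.g. "page_render_ms"
    limit: float    # value beyond which an alert is raised
    direction: str  # "above" alerts when value > limit, "below" when value < limit


def check_thresholds(sample: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one alert message per threshold breached by this sample."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            continue  # metric missing from this sample; nothing to check
        breached = value > t.limit if t.direction == "above" else value < t.limit
        if breached:
            alerts.append(f"{t.metric}={value} is {t.direction} limit {t.limit}")
    return alerts


# Hypothetical limits: rendering slower than 2 seconds, or fewer than 50 sales a minute.
thresholds = [
    Threshold("page_render_ms", 2000.0, "above"),
    Threshold("sales_per_minute", 50.0, "below"),
]

# One minute's worth of (made-up) measurements.
sample = {"page_render_ms": 2600.0, "sales_per_minute": 72.0}
alerts = check_thresholds(sample, thresholds)
print(alerts)
```

In a real reporting layer the sample would come from the streaming or persistence layer, and the alert would be pushed to a dashboard, email or pager rather than printed.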
The reporting layer should also be pretty flexible for you. You don't want to be using one reporting layer for your traditional solutions and one for your big data solutions; you possibly want one single layer for both of them. Finally, search. Search is becoming a more and more key activity these days. Ever since Google came out with a UI that has just one search box, search has become more and more important. People are trying to build applications where they can search, rather than having a fixed set of parameters to enter for a report. Say I am looking at my daily factory performance report: I have to enter a start date and an end date, and then put in some filtering, something like that. People want more flexibility. They just want to start typing something and have the reports come up. So flexible search is becoming more and more important these days, and there are tools and technologies that have been built with this capability, so you want to consider that also as a part of building your reporting layer. Best practices: pick a tool that is easy to use and has good graphics capabilities. The tool should have good integration with a variety of data sources, from RDBMS and NoSQL databases to REST APIs and web-based APIs; everything. It should be able to do aggregation on the fly. It should have enough performance and scalability in itself so that it does not have to depend upon other layers for things like parallel processing and in-memory processing. You should try to use open standards for easy integration: open standards like REST APIs, support for ODBC and JDBC connectivity, and stuff like that. It should provide for specialized, personalized dashboards. That is a good best practice these days.
Everybody wants to have their own dashboard that they can look at on mobile or on the web. Remember that the reporting layer is the one place where the entire company looks at the work that you have done for big data. So you want this reporting layer to be really cool, really differentiating with respect to the other solutions that are there. One way to do that is to give people personalized dashboards, which they probably don't have when they are using a traditional data solution. Design for multiple interfaces: you should make sure that when you are designing your solution, it covers mobile, web and embedded kinds of solutions. That is something you want to consider also. And finally, search again: think about providing people flexible search on the data. That could be a really cool option that people are going to really like. So these are things for you to consider in the reporting layer. A lot of people think that when you start architecting a big data solution, it stops with transforming and storing the data in the database. No, it actually continues into the reporting layer and the advanced analytics layer. So please remember that also. Thank you. 23. Reporting Options Impala and Spark SQL: Hi. Welcome to this lecture on reporting options. This is your instructor, Kumaran, here. When it comes to basic reporting or basic analytics reporting, there are a few options available for you in the big data world, and we will start with Cloudera Impala. We are not going to talk about Hive, because Impala kind of literally replaces Hive and overcomes a lot of the deficiencies of Hive. So we will start with Impala. Impala is an in-memory distributed query engine for Hadoop. This means that for data stored in Hadoop, Impala should be able to give you a lot of querying ability on the data.
It has an interactive shell, so if you are used to SQL*Plus in Oracle or similar SQL shells in other database engines, it will be a similar thing for you. There is a command prompt there; you can start typing in queries and results start showing up, and it is very fast compared to Hive, because internally Hive used to do MapReduce. This one does not do MapReduce, and it has more optimized logic to take care of the deficiencies of Hive. It supports joins, subqueries and aggregation, so it is a pretty powerful tool. It supports Hadoop data: it supports raw data that is stored in Hadoop as sequence files or CSV files of various formats, and it can also support querying data that is stored in HBase. There are ODBC drivers and Thrift APIs available that can work on Impala. You put these ODBC drivers and Thrift APIs on Impala, and it gives you a querying facility where you can use an ODBC driver to query data that is stored in Hadoop. That is a great advantage of using Impala, because now you can either use a shell, or you can use the ODBC drivers from within Java code or something like that, to query the data through Impala. What are the advantages of Impala? You start with the fact that it has very fast data access to Hadoop compared to Hive. It of course has a familiar SQL interface; people who are used to SQL in the regular RDBMS world will find it very familiar and can quickly start using it. As a big data architect, you are not only concerned about the end users. You are also concerned about database administrators, data analysts, or even data scientists and developers who want to query the data. So this kind of tool gives you great access. When it comes to the reporting options, it is not like you should choose only one of them. You can choose a number of them.
You can have Impala side by side with other options; you do not need to restrict yourself to one option. You can have Impala side by side with the other options that we are going to discuss. And it has pretty strong integration with Hadoop. In terms of shortcomings, there is no graphics support. There is no ability to convert results into charts; graphical capabilities are not there. There are no fault tolerance capabilities: if a query is running and it breaks, it breaks, and you have to run it again. And it does not have support for other data stores or databases, so it is limited to data that is stored in Hadoop. If you have data in Cassandra or something like that, you need to use the query tool that comes with Cassandra. Use cases: of course, ad hoc querying of data that is stored in Hadoop is the main use case. It has an API interface, so you can use that for programmatic access as well, and also for data stored in HBase. You can use this interface to query that data, as against other kinds of databases like Cassandra or MongoDB, which have their own clients through which you can query the data. That is pretty much Cloudera Impala; it is pretty much limited to what is there in Hadoop and HBase. Next comes the star, which is Spark SQL. Spark SQL provides SQL-like programming capabilities, and it is very easy to use and pretty powerful. Internally, Spark SQL is implemented as map-reduce operations on Spark RDDs or Spark DataFrames. This map-reduce is not the Hadoop MapReduce, but the map-reduce that is supported by Spark itself. It is very fast and very flexible, and it supports aggregations and joins, along with a lot of powerful coding techniques. In one or two lines, you can do a lot of stuff using the capabilities provided by Spark SQL.
It has machine learning integration with Spark ML. In fact, Spark's machine learning is built upon Spark SQL, which makes the integration with Spark ML really good, especially when you want to do advanced analytics. It can be used for both interactive and stream programming, so you can do stream programming with it and interactive programming with it. So that is a pretty powerful capability for you. Advantages: a rich set of capabilities; Spark comes with a really rich set of capabilities. Familiar syntax: the same group by, order by, sum, min, max, filtering, where clauses and stuff like that. Excellent performance, because this is built upon Spark; it has all the same map-reduce that comes with Spark in terms of scalability, fault tolerance and stuff like that. It is supported in multiple languages: Java, Scala, Python and even R. So there is a wide variety of languages that it supports, and you can easily integrate it with other libraries. One of the capabilities that Spark provides is that you can easily integrate this with other libraries in the same Spark script or job. Shortcomings: once again, there are no graphics. It is more of a programming kind of language; programming is required. Even though you say there is an interactive shell, it is almost like you are programming something, and there is no graphics support in Spark SQL. Use cases: programmatic querying of large data sets. This is a great engine for doing programmatic querying of large data. By programmatic, I mean that you are querying from within other software programs. Or you might have a Spark script that you write in Java, Scala or Python, and as a part of that you have SQL capabilities, and that is what you can do with this one.
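To make the "familiar syntax" point concrete, here is a small sketch of the kind of query text the lecture has in mind: a where clause, a group by, sum/min/max aggregates and an order by. So that it runs anywhere, the sketch executes the query with Python's built-in `sqlite3` module rather than Spark; that substitution, the `sales` table and its values are all assumptions made for illustration. In a Spark script, similar SQL text would typically be run through a Spark session against a DataFrame registered as a temporary view.

```python
import sqlite3

# In-memory database standing in for a large data set (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("NA", 120.0), ("NA", 80.0), ("APAC", 50.0), ("EU", 10.0), ("EU", 95.0)],
)

# The familiar clauses: WHERE filtering, GROUP BY, SUM/MIN/MAX, ORDER BY.
query = """
    SELECT region,
           SUM(amount) AS total,
           MIN(amount) AS smallest,
           MAX(amount) AS largest
    FROM sales
    WHERE amount > 20
    GROUP BY region
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The point of the sketch is only that the query itself looks like ordinary SQL; the scalability and fault tolerance described in the lecture come from the engine that executes it, not from the syntax.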
A good thing about Spark SQL is that you have a single system, Spark, for ETL, analytics, advanced analytics, real-time processing, everything. That is one thing that Spark gives you: all kinds of capabilities. So once you adopt Spark, you get one technology that can be used for various modules. And, of course, real time: when data is coming in streams, you can do some Spark SQL based querying on the stream data that is coming in and do some analytics also. Again, that is something that Spark offers you as its power. We will continue with more options in the next lecture. Thank you. 24. Reporting Options Third Party and Elastic: Let's now talk about third-party tools. There are a number of open source and commercial options available for third-party big data analytics tools. These tools typically support a rich set of capabilities, and they can work with any kind of NoSQL databases, or Hadoop or HBase and stuff like that. So the choice among this kind of tools narrows down to cost and familiarity, as well as use case matching. Do you really want to invest in third-party tools? The answer to that question will depend upon what kind of reporting capabilities you need in your solution. Do you see a lot of end users getting into your product and using the tools to do reports, visualizations, excellent graphics, custom dashboards and stuff like that? Then you might want to invest in third-party tools. Third-party tools will definitely reside side by side with the other reporting tools that you would be using anyway. That depends again, as I said, upon the use cases you have. The options for you include things like Tableau, Pentaho, Jasper, Qlik, Birst and a bunch of others like them. They have excellent graphical capabilities, and they have integration through native or ODBC/JDBC drivers to any of these RDBMS and
NoSQL databases. They have visual design capabilities: you can go and draw a report or dashboard all by yourself by dragging and dropping, with almost no coding. They do have authentication and authorization integrations. You can easily hook them onto your own enterprise login, and you can have the same login, a single sign-on kind of thing, working for these third-party tools also. Now, comparing and contrasting. Advantages: they have rich graphics. There are excellent templates for visualizations, graphs and dashboards. There is an easy-to-use design experience, where you can go and design these reports. They have support for authentication and authorization schemes, and you can do some customization also in terms of logos, look and feel, and stuff like that. Shortcoming-wise, there is cost. They do cost a lot, so you need to really evaluate whether you want to invest that kind of money to get this kind of capability, and whether your organization really needs that kind of capability. Native support levels: how well do they support each of these databases? Of course, there is always marketing that says they support anything and everything, but you need to try it out and see how the integrations are truly working. Use cases: where enterprise dashboards or reports are required and there is extensive use, where you see that there are going to be multiple users using these dashboards, both on the web and on mobile. By the way, these tools also have excellent mobile support if you want it. They are a fit when there are multiple data sources that need to be used for the reporting process; they can integrate with a lot of data sources. And when you need to provide visual designers where your end users can whip up dashboards and reports, these third-party tools will be a good use case for all of them.
And as I said, you need to really weigh the cost versus the benefit of these tools. The last option we will be discussing is Elastic. I am saying Elastic and not Elasticsearch because we are talking about this company called Elastic, which has a bunch of products that will help you with respect to the reporting layer. Elastic has a product called Elasticsearch. It is an open source product that provides an excellent search engine on existing data. We talked about Elasticsearch also in the persistence options. It gives you an excellent search engine built in, and on top of that Elastic gives you another product called Kibana that provides excellent visualization capabilities on the Elasticsearch data. So you can have data in Elasticsearch, and you can use Kibana to provide visualization capabilities. It has aggregation capabilities. There is tight integration between Elasticsearch and Kibana. It has streaming data support, so that is good for real-time analytics: you can have real-time data getting into Elasticsearch, and you can use Kibana for visualization purposes. There is excellent graphics support, of course, in Kibana. Scalability comes from Elasticsearch; Elasticsearch has excellent scalability. And of course there is search: this one gives you search capabilities by default, and this is the best search engine you can find in the NoSQL world. Comparing Elastic, the advantages are, of course, that you get rich graphics. You get flexible query capabilities that come with Elasticsearch. You get real-time analytics, and of course you get search. You get some great reports out of the box when using Elasticsearch and Kibana. Shortcomings: there is additional work in populating Elasticsearch. I believe you are not going to be using Elasticsearch as your primary database. You might be using something else, and you would be taking data from the other systems and populating Elasticsearch for the purpose of reporting.
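The "additional work" of copying rows from a primary store into Elasticsearch for reporting can be sketched as follows. This is a minimal illustration that assumes the newline-delimited action-plus-document format accepted by Elasticsearch's bulk endpoint; the index name `sales_reporting`, the row fields and the `build_bulk_payload` helper are hypothetical, and actually sending the payload over HTTP is deliberately left out.

```python
import json
from typing import Dict, Iterable


def build_bulk_payload(index_name: str, rows: Iterable[Dict]) -> str:
    """Turn rows from a primary store into a bulk-index request body.

    Each row becomes two lines: an action line naming the target index and
    document id, followed by the document itself. The body ends with a newline.
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": str(row["id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical rows as they might be read from the primary database.
rows = [
    {"id": 1, "region": "NA", "amount": 120.0},
    {"id": 2, "region": "APAC", "amount": 50.0},
]
payload = build_bulk_payload("sales_reporting", rows)
print(payload)
```

A job like this would have to run on a schedule or be driven by change events, which is exactly the extra moving part the lecture warns about.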
So there is some additional work. You might have accuracy issues; this is something we hear about when we are talking about Elasticsearch. But I believe the product will mature, and these issues might go away in the future. Use cases: where you need enterprise dashboards and reports, pretty similar to the third-party products we have been talking about. When you want to give users ad hoc querying UIs, this is a great use case. And of course, if you want to have real-time monitoring, that is another great use case for Elastic. So these are the capabilities that you get with Elastic. Both Kibana and Elasticsearch are freely downloadable, so you can go and download them, use them, and see how well they work for your use case. I hope this has been helpful for you. Thank you. 25. Advanced Analytics Overview: Hi. Welcome to this lecture on the advanced analytics module. When you are trying to design an architecture for big data solutions, typically the focus is on ETL, the extract, transform and load process, as well as basic reporting. Advanced analytics comes more as an afterthought, because people typically think that advanced analytics is something that is done ad hoc, and that we do not really have to architect anything for it. But that is not the case, because advanced analytics, even though it is an end-user kind of work, does take a lot of resources, a lot of the shared resources in your database, your processing engines and so on. So it is important to integrate advanced analytics into your big data architecture such that the components are leveraging one another, and so that you do not discover later on that you are missing something here and something there and need to add a few things. So let's start with understanding what advanced analytics is.
There are a number of types of analytics out there, and different organizations implement analytics at different levels. To start with, we have descriptive analytics, which is about what happened. For example, what were the total sales for last month across the world? And then what were the sales for North America, Asia-Pacific (APAC) and Europe? That is descriptive. Exploratory is then trying to find out why something is happening. Say you see that North American sales were up 10% and APAC was down 5%. Why? What is the reason that North American sales are higher? Then you start looking at different things. What about the product mix; are some products selling better in North America? Do you have better teams or discounts in North America? What is happening between the various countries? That is exploratory. Inferential is where you apply statistical techniques to analyze a sample and then extrapolate what you find in the sample to the entire population. Predictive analytics is trying to forecast what is going to happen based on what has already happened before. Causal analysis is trying to understand how a change in one variable will cause a change in another variable. Which is to say: what if I change my product pricing? If I give a 10% discount, will I get a 20% increase in my sales? That kind of analysis is causal analysis. There is another term that is used, called deep analytics. Deep analytics is more like a combination of all these different types of analysis, mostly causal, predictive and inferential. The goal is to get deeper into the problem and use various advanced techniques to understand how certain things are working and how they are supposed to work, and to predict what is going to happen. When we say something is advanced analytics, we are talking about the bottom three or the bottom four in the list: inferential,
predictive, causal and deep analytics. This is typically done by people whom we call analysts, or even data scientists or statisticians. These people take the data you have mined and stored in your big data repository, analyze the data, and come up with various findings and predictions. So when you are trying to architect an advanced analytics module, what are the responsibilities you expect that module to have? First of all, the module needs to have model building capabilities. You should have the ability to build a variety of models: statistical models and prediction models, whether supervised or unsupervised. It should support various validation techniques. If you learn more about data science, you will understand more about what these validation techniques are. There are different kinds of algorithms. The same goes for ensemble algorithms: ensemble algorithms try to use multiple models built on different subsets of the data, and there has to be support for ensemble algorithms. It should provide for interactive development, because advanced analytics is initially an interactive process, where the data analyst or the data scientist sits and works with the data interactively. They do step one and see what they find, then decide what to do in step two, then what to do in step three. It is an interactive process of working with the data and coming up with findings, so the advanced analytics module should provide interactive analytics capabilities. But it should also provide automation capabilities, because once the interactive analytics is done and a model has been decided on, as in this is how we are going to build the model and we can roll it to production, you are going to take that, automate it and operationalize it as a process.
So there have to be automation capabilities, as to how you can take and automate the code and actually build applications or products. And finally, you should be able to predict in real time. How well you can predict in real time is also one of the responsibilities of the module. What do you architect for? What are the things you have to consider when you are architecting the advanced analytics platform? The first one is scalability. Advanced analytics operations typically take a lot of CPU time. There is a lot of number crunching and text crunching going on, which is typically very CPU intensive. If you are using a large amount of data, it also becomes memory intensive. And if the data is really large, it may span multiple nodes and clusters. So scalability is one of the most important aspects you want to architect into an advanced analytics module. Then performance: how fast can you perform with these algorithms, especially when you are actually predicting? Predictions typically happen in real time, when a user is logged into your website or when somebody is talking on the phone. So predictions have to happen in real time, and they have to have a sub-second response time, so you need to architect for that as well. You also need to architect for validation. The ability to validate both the models and the predictions for accuracy is an important aspect of an advanced analytics module. Algorithms: there are a lot of algorithms for model building, but these algorithms also come with various options, tuning parameters and configurations by which you can tune the algorithms, try different parameters, and see if the model is improving or not. So it is not just important that you support the algorithms; you should also support the various options for tuning.
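The validation and tuning responsibilities described above can be illustrated with a small, self-contained sketch: a k-fold cross-validation loop that scores a model for several values of a tuning parameter and keeps the best one. Everything here is an assumption chosen for illustration, including the synthetic data, the simple nearest-neighbour averaging model, and the helper names `knn_predict` and `cross_val_mse`; a real project would normally rely on an established machine learning library instead.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (feature, target)


def knn_predict(train: Sequence[Point], x: float, k: int) -> float:
    """Predict by averaging the targets of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean(p[1] for p in nearest)


def cross_val_mse(data: Sequence[Point], k: int, folds: int = 5) -> float:
    """Mean squared error of the model, averaged over held-out folds."""
    errors: List[float] = []
    for f in range(folds):
        # Every folds-th point (offset f) is held out; the rest is training data.
        test = [p for i, p in enumerate(data) if i % folds == f]
        train = [p for i, p in enumerate(data) if i % folds != f]
        errors.append(mean((knn_predict(train, x, k) - y) ** 2 for x, y in test))
    return mean(errors)


# Synthetic data: the target is roughly 2*x with a small alternating disturbance.
data = [(float(i), 2.0 * i + (1.0 if i % 2 else -1.0)) for i in range(30)]

# Tuning: try several values of the parameter k and keep the one with the lowest error.
scores = {k: cross_val_mse(data, k) for k in (1, 2, 4, 8)}
best_k = min(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Each candidate parameter value multiplies the amount of model building, which is one reason the lecture stresses CPU and memory capacity for this module.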
Finally, you have to architect for automation and operationalization. Once you have found a way to build a good model, that model building process needs to be automated, implemented, and operationalized so that it can keep running in the background and keep building models on a daily basis. That kind of automation capability also needs to be built into your advanced analytics modules. Best practices: the architecture should be aligned with the methodology. When you have advanced analytics, you have a set of analysts or data scientists, and they typically have a methodology for how they go about doing things. They have their own processes. You need to understand what their processes are, and then build the architecture such that the product is aligned with the process. The two go hand in hand, and that makes the work of the architect easy as well as the work of the data scientists. So the methodology should be aligned with the architecture. You need to plan for ad hoc model building, which means you have to provide capacity for ad hoc model building: CPU, memory, and access to your data. Typically, architects only architect for the known capacity, as in "how many web users are going to be hitting the system," and come up with the sizing based on that. But you also have to give yourself scope for this ad hoc advanced analysis. Somebody just goes and starts building a model on the same box, and that is going to put extra load on your persistence layer and on your computing layer, so you need to allocate resources for that also.
So make sure that what you do in ad hoc model building does not affect the regular activities that are happening and the regular reporting that is happening. Not all advanced analytics projects yield results. That's one thing we have to be very clear about. If you go through data science related courses, you will understand this. It's not like "OK, I start a project and I will have some improvements." Not all projects are going to yield results if the data does not have any signals. So some expectation setting has to be done pretty clearly here, because other people in the company might hear that competitors are using predictive analytics to do this and that, and say we should also try to do the same thing. But they may not understand the fact that not all data has signals, and if there is no signal, there's no prediction that can happen. That is something you want to keep in mind. And always keep automation and operationalization in mind in whatever you're doing, so that at some point you go and start automating and operationalizing things in the advanced analytics module. So this is how you would look at the advanced analytics module and try to architect it. Hope this has been helpful for you. In the next lecture, we're going to be looking at the advanced analytics options. Thank you.

26. Advanced Analytics Options R and Python: Hi. Welcome to this lecture on advanced analytics options. For advanced analytics, we're looking at programming languages or tools that provide you with the features required for advanced analytics that we talked about in the earlier lecture. We're going to start off with R. R is a language, or even a kind of environment, for statistical computing and graphics. It is a very old language, and it has a pretty long history of use by statisticians.
It is a specialized language built for statistical computing and has been used for a very long time by statisticians. It was not widely used before for other purposes, but after the big data wave came in and there was more traction for predictive analytics, R started getting a lot of attention. The good thing about R is that it has a wide package set of various machine learning algorithms. There are tons of implementations of machine learning algorithms. You have so many options, so many implementations, and so many variations of the algorithms, and you can keep trying a number of them to figure out which one fits your data and gives you a good model that gives you good predictions. It does have capabilities for data cleansing and transformation. It has an excellent graphics package, and there is RStudio, which is an IDE for interactive programming. You can use RStudio to write code, you can build applications with it, you can do interactive programming with it, and you can do documentation with it. It is amazing. Now the bad thing about R is that it runs on data in memory. R is going to load the data into memory and then work on the data in memory. That means it is limited to the memory in the local box, and that severely restricts its capability. There are some commercial versions of R that have been worked on by third parties who are trying to add those scalability capabilities into R. That is good for R, but it comes at a price. That is something you want to remember. Now, summarizing R: the advantages are that it has an excellent set of machine learning algorithms. It has graphics, and there are other presentation tools for you to build presentations or do documentation. It has interactive model building capabilities.
RStudio provides those, which is a great feature. R is pretty mature. It has been there for a very long time, and it has matured with the number of packages and algorithms that have come through. The shortcomings are that, as we talked about, scaling is limited by local memory, so its ability to handle really big data is questionable. R cannot be used to build robust application suites the way you would use Java to build a robust application, like J2EE applications. R is very limited there, to scripts and that kind of thing, not much else. Big data capabilities are limited. There are things coming in, like R and Hadoop bridges, for those kinds of integrations. But it is still pretty limited, because those things do not use the full capabilities of Hadoop: once you get the data from Hadoop and move it into R, it again becomes limited by R's memory. So there are some limitations with using R. What are the use cases for R? You can do interactive model building and trials on small data sets. You might think that R doesn't have a lot of value, but no, because data scientists spend a lot of time trying to understand the data and play around with it, and they typically do that on small data sets. When you have small data sets, R is a great way to start playing around with the data, trying to understand it, and trying to build models. Once you figure out what you want to do with the data, you can translate that into maybe Java or maybe Apache Spark and do the actual thing. So R is a good sandbox in which you can sit and play around with the data. It can be used for small data sets and applications, and you can always get a box that has 32 GB or 64 GB of memory.
You can get a box in AWS, set up R, and do a lot of analysis with that, actually. It can be used to make presentations also. So these are the things about R. Next we move on to Python, which is pretty similar to R in terms of its positioning. But the thing about Python is that it's a regular programming language, a standard programming language that has data science related packages and capabilities. So with Python you don't just do the data science part as in R; you can do a lot of other things with the general programming capabilities that Python has. Python has a number of packages, like NumPy, SciPy, pandas, and scikit-learn, that help in managing and processing data and doing data science. It has a vast array of third-party libraries. It has great data cleansing capabilities and graphics capabilities. There are environments available for interactive programming: there is IPython, there is the notebook, and you can look at Spyder and Anaconda. There are a lot of capabilities available there, and it has integration with Spark also. It is possible to pass data back and forth between Python and Spark if you want to do so. Python is a multipurpose language. It can be used to do a number of things. You can build scripts for doing anything. It's a general purpose programming language, which also means that from a skill set point of view, if you are already familiar with Python, then Python is the way to go for machine learning, rather than learning a totally new language like R. So, the advantages of Python: it has a number of graphics and data cleansing tools, it has interactive model building capabilities, and it has good integration with Apache Spark. It has an easier learning curve compared to R, simply because Python is a general programming language.
If you come from another programming language, its syntax is much easier to understand and learn than R's. The shortcomings of Python: scaling is again limited to local memory, pretty similar to how R works. There are limited ML implementations, that is, machine learning algorithm implementations, compared to R. The set is still sizable, but it is limited if you compare it with R. The use cases are pretty similar to what you would use R for: interactive model building and trials on small data sets. You can do a lot of ETL work with Python. That's pretty awesome. You can do data cleansing work with Python, and you can build, of course, more advanced analytics applications. Now, R and Python: you need to provide for at least one of them in your architecture, so people can use these tools to go and play around with the data. Typically they can reside side by side with the other options that you'll be seeing later, so it's not that you need to choose only one. You can choose more than one, depending on what your data scientists are familiar with. You need to provide for these capabilities also. Thank you.

27. Advanced Analytics Apache Spark and Commerical Software: Hi. Continuing on the various options for advanced analytics, we will now look at Apache Spark. Apache Spark, as you know, has a very rich set of capabilities as a good transformation engine, analytics engine, and query engine. It is also a good machine learning engine. Apache Spark has machine learning libraries. There are two of them: the MLlib library and the ML library. The MLlib library is being phased out. The ML library is the new one, and it supports a good set of machine learning algorithms. I wouldn't call it as rich as what is in R or Python, but it is good. And the good thing is, they keep adding new sets of algorithms all the time.
So I believe in a couple of years it should have a really rich set of libraries. It uses DataFrames from Spark SQL, so the data input for all these libraries is DataFrames from Spark SQL. Why this is important is that when you get data from other sources, or load data from CSV, or transform data, you're doing it in DataFrames anyway. Using DataFrames as the input makes it very easy, because you don't have to go through specific transformations to fit your data into the machine learning libraries. It has a standardized approach where the interfaces are very similar for all the algorithms. It is in fact very easy to switch between algorithms pretty quickly in the same piece of code. That makes learning easy and also makes building code pretty easy. The good thing about Spark is that the algorithms can scale across a cluster. If you have a cluster of ten machines, the algorithm is capable of executing across the cluster, dividing its job across the cluster and integrating the results. It uses a map-reduce style of processing internally and optimizes it in such a way that it is really useful for you. It can speed things up and handle a large volume of data. As against R or Python, which are focused only on the local node and the local memory, Spark can really scale horizontally and can work on huge data sets. You can use Scala, Java, or Python to build Spark code. Now we also have R coming in. So that's really good; you have a lot of options to build code. Interactive data modeling is possible. There is an interactive shell in which you can sit and code interactively and try various things, and once you have a firmed-up model building procedure, you can take it and automate it. It has excellent integration with big data sources. That is a big plus.
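The idea of dividing a job across a cluster and then integrating the results can be sketched in plain Python. This is not Spark's API and not how Spark is implemented; it is a stand-in under simple assumptions, where each list slice plays the role of one machine's partition and the names `partition`, `map_partition`, and `merge` are hypothetical. Each partition computes small partial statistics locally, and only those partial results are combined.

```python
from functools import reduce


def partition(rows, n_parts):
    """Split rows into n_parts roughly equal slices, one per 'machine'."""
    return [rows[i::n_parts] for i in range(n_parts)]


def map_partition(part):
    """Local step: each partition reports (count, sum, sum of squares)."""
    return (len(part), sum(part), sum(x * x for x in part))


def merge(a, b):
    """Integration step: partial results combine by element-wise addition."""
    return tuple(x + y for x, y in zip(a, b))


def distributed_mean_variance(rows, n_parts=10):
    """Mean and population variance computed from per-partition partials."""
    partials = [map_partition(p) for p in partition(rows, n_parts)]
    count, total, total_sq = reduce(merge, partials)
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance


rows = [float(x) for x in range(1, 101)]
mean, variance = distributed_mean_variance(rows)
print(mean, variance)
```

Because the merge step only adds small tuples, the full data set never has to sit in one machine's memory. That is the property that lets this style of computation scale horizontally, in contrast with the local-memory limit described for R and Python.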
Integration with big data sources matters when you're trying to build a big-data-based architecture. Spark has integration with Hadoop HDFS, all the NoSQL sources, and many more. So that's really good. Real-time analytics and predictions are possible with streaming. If you are using streaming, it is possible for you to do in-stream analytics and in-stream predictions with Apache Spark. The advantages of Apache Spark, as we've seen: it has excellent scalability, especially compared to R and Python. It has interactive model building capabilities, which is good, so your data scientists can sit and build models in an interactive fashion. It has real-time prediction capabilities: as data is coming in, you can work on the stream and make predictions. It has support for various data sources and tools, which is again a plus. And it has support for multiple programming languages, so you can choose a programming language of your choice. The shortcomings are that there is no graphics support. There is absolutely no graphics support for doing any kind of visualization. There's no IDE. There is only a shell for interactive programming; it's more like command-line programming that you do with Spark. It has a limited set of algorithms and implementations compared to R, but I think that set is growing as time goes on. It is a fast-evolving product, but it is not mature yet. There are still significant changes happening with Spark, so that is something for you to watch out for. Use cases: predictive modeling on very large data sets; model building on NoSQL data sources, since it can connect directly to NoSQL data sources and do model building; and of course work on clusters, which is its strength, because it can scale across a cluster. And of course, it can be used for real-time predictions. Spark is really becoming the main product for any kind of big data model building.
We have Apache Mahout, but I haven't talked about Mahout at all in the lecture, because Mahout is kind of fading away. It has a very limited set of algorithms, and it's not going to be really useful. Spark is far superior to it. That's why I don't talk about Mahout at all in this course, as Spark has a really good set of capabilities. Next, commercial software. There is a lot of commercial software available for you to do advanced analytics processing. There are things like Tableau, SAS, RapidMiner, and so on: a lot of products that are available for doing advanced analytics. They have a good set of algorithms. Some of them are pretty mature and some of them are evolving, but they have a pretty good set of algorithms. They have some very good graphics support for visualizations. They can scale and work with big data sources. They can work with NoSQL. All this is good, so there are a lot of capabilities in commercial software. They can pretty much do everything that something like Spark can do. The only problem is that they are very pricey. These are extremely costly products, and that's the only thing for you to watch out for. If your company can afford these products, you can go and buy them and use them. But as I said, they're very pricey. Spark is building up as a very good open source alternative that anybody can pick up and use. That's why I'm not going to discuss the advantages and shortcomings of commercial software. There's only one shortcoming: it's very pricey. Otherwise it is pretty capable in terms of what it can do. Thank you.

28. Use Case 1 Enterprise Data Backup: Hi. Welcome to the first use case for big data architecture, the enterprise data backup. This is your instructor, Kumaran, here. What is the use case?
The use case we're looking at is that ABC Enterprise, a fictional enterprise, currently keeps 18 months of CRM data in an RDBMS that is online and seven years of archive data on tapes. This is how a lot of companies work: they keep some data, say 11 months or 13 months of data, in an RDBMS for online access, and a larger amount of data, seven years or more, is kept on tapes as offline backups. The reason they use backups on tapes is not only that they need a backup of the data; they also want to keep data for a longer time in case they want to use it for any analytics purposes. But keeping that kind of data in an RDBMS can prove to be very costly in terms of licensing costs, hardware costs, and so on. So they keep the data on tapes, and any time they need to access the data for any kind of analytics, it's going to be a project in itself to get the data loaded back into a database, and it's going to take some time before everything gets done. So they want to see if they can create an online backup where they keep the data in HDFS, because HDFS provides reliability in terms of multiple copies of the data. It is resilient to node failures. It also provides easy access to the data, so that people can actually use it. Sometimes the analytics people come and say, "OK, we want to look at data for the past three or four years and see how things are," and they can always look at the data because it is available online. And of course, it can provide ad hoc querying capabilities on the data. Now, this may look like a very simple use case, but this is where a lot of companies today are starting with big data. The reason is that big data is a new technology in an organization. People need to first get familiar with the technology.
Whether it is the architects, the developers, the operations people, the database administrators, the BI people, or the users, everybody needs to first get familiar and comfortable with the technology before you start investing in more significant use cases and people invest more time and effort into bigger use cases. This is where a lot of organizations, big organizations, are starting. So what are the characteristics of this use case? The source of data is an RDBMS. Typically, CRM applications use large RDBMSs in which they store all the data, with a large number of tables and columns. The data types are numeric and relational data. There's hardly any text data that you will find in a CRM RDBMS. The mode of operation of this use case is purely historical: the data pull is going to be historical and the data access is going to be historical. There's no need for any kind of real-time work here. How is the data acquisition going to be? The data acquisition is going to be in pull mode, where you're going to be pulling data on a periodic basis from the source and pushing it to the sink. What kind of availability do we need? The data is typically available after one day. Data that is available after one day is OK here. What kind of store type is needed? Write once, read many. You want to take the data from the CRM and write it once into the data store, and after that you are only going to be reading it any number of times. You hardly ever have an overwrite, except in a case like a failure in the CRM or something like that. Typically you write once, you're done with it, and it is going to stay there forever. What kind of response time do you need on the sink data? The response time can be as good as possible. You're not going to be using it on a regular basis. Somebody is going to read it rarely.
And when they really need it, it's OK at that point if some time is taken for running the query. It's a lot better than having to restore a tape backup. So a basic ad hoc querying capability is OK. There is no model building; there are no advanced analytics capabilities needed in this use case. It's a pretty simple, straightforward use case. So how would the architecture for such a use case look? You have a data source that is an RDBMS, usually with a single database connection, which will give you all the data. For pulling data from an RDBMS, the best option is going to be Sqoop. As I talked about, it is the best tool available to acquire data from an RDBMS and push it to any data store. So Sqoop is a script that you put under a scheduler, like a cron scheduler, that can run on a periodic basis and pull data out of the RDBMS. Given that the data really doesn't need any kind of cleansing or anything, you just push it directly into HDFS. A plain HDFS push is enough in this case. In case you want a little more schema going on here, you can possibly also go to Apache Hive or Impala. You can put Impala on the HDFS data to give it ad hoc querying capabilities. You can also use Impala to insert data into HDFS, or use Apache Hive to put a schema on the HDFS data and insert through Apache Hive. This is what the architecture would look like: a very simple, straightforward architecture. But this is typically the first-step architecture that any organization can get into, and it makes life pretty easy for everybody. A good start.
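As a rough illustration of the "Sqoop under a scheduler" step, the sketch below builds the command line for an incremental Sqoop import. The flags used are standard Sqoop 1 import options, but everything else is an assumption made up for the example: the JDBC URL, user, table, column, and paths are hypothetical, and the lecture does not say whether a full or an incremental pull is intended.

```python
import shlex


def sqoop_import_command(jdbc_url, user, password_file, table, target_dir,
                         check_column, last_value, mappers=4):
    """Build the argv list for an incremental (append) Sqoop import to HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", user,
        "--password-file", password_file,
        "--table", table,
        "--target-dir", target_dir,
        "--incremental", "append",       # only pull rows added since last run
        "--check-column", check_column,  # column used to detect new rows
        "--last-value", str(last_value),
        "--num-mappers", str(mappers),
    ]


# All values below are hypothetical placeholders.
cmd = sqoop_import_command(
    jdbc_url="jdbc:mysql://crm-db.example.com/crm",
    user="backup_reader",
    password_file="/user/backup/.crm_password",
    table="customer_interactions",
    target_dir="/archive/crm/customer_interactions",
    check_column="interaction_id",
    last_value=120000,
)
print(shlex.join(cmd))
```

A cron entry would then run a small wrapper around this command once a day, which matches the "available after one day" requirement. The wrapper would also need to remember the last imported value between runs, which is left out here.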
There's hardly any scope for failure in this architecture. Even if things go wrong, people are not really going to scream about it, because you're not doing anything critical for the company. So this is a good place for you to start off any kind of big data work within your company. Hope this is helpful to you. Thank you.

29. Use Case 2 Media File Store: Hi. Welcome to the second use case, the media file store use case. Now, what is this use case about? There is a company, ABC Enterprise, that has a call center where all calls are recorded. If you call any kind of call center, you will hear the message that for training reasons the calls are going to be recorded. There are also statutory reasons for which the calls have to be recorded. So that is going to create a number of media files, or what we call voice files, in whatever format: MP3 format, WAV format, whatever. The recordings have to be kept by these companies for a long period of time for statutory reasons, say for the next seven years. The recordings are also used for analytics purposes. Typically there is some software that can analyze these recordings, understand what is happening, and come up with some analytics as to what the agent is doing, what the voice quality is, whether the agent is following a script, and all kinds of things. So these recordings are kept for statutory reasons and for analytics, and the recordings are a sizable number. Given that there are statutory requirements, the recordings have to be stored in a fashion that is safe from any kind of failure, so backups and so on are required. What ABC wants to do is move from the tape archive they have been using so far, because tapes are difficult to retrieve and put back, to an online archive.
That way they can provide some ad hoc querying capability on the data that is there. You can keep the data always online and at the same time make sure that it is reliable and stored in a safe manner. So what are the characteristics of this use case? The sources are contact center recording solutions, and the data types are media files. This is also related to other forms of media that you may want to track, like recordings, video, audio, and photos. All of them fall into the same kind of use case category. The mode is going to be historical data pull. Typically, it is OK if the data is available after a day. You're going to be moving the recordings, as they happen, from the recording software. Call recordings are typically stored in a separate place, a separate location, because it's a huge number of files. You are going to be pulling the files on a daily basis and pushing them into the repository. The storage type is once again a write-once, read-many kind of storage, and the response time you want on this is as good as possible. Again, these are recordings, so it isn't critical. And there's hardly any model building needed. These are the characteristics of the use case. What would a solution for the use case look like? We start off with the media files. These are not in a database; typically these files are very large. The media files typically sit in a separate location, maybe an online location, maybe a network that is separate from the data center; that could be a possibility. Now, given that the sources are files, how do we want to move them? We move them through SFTP. As we talked about, SFTP gives you great capabilities for moving files. You put the SFTP job under a scheduler.
It can run in a scheduled manner, every hour or every few hours, whatever, and keep pulling all the new media files that are there from the sources. These media files can then be pushed directly into the Hadoop file system, HDFS, so the files start getting archived in HDFS forever. Now, once the data is in HDFS, you can put a media analyzer on it. A media analyzer is what I'm calling, at this point, a very custom analyzer that can read these data files from HDFS, analyze the files, and come up with some findings. This is called a tagging process, where you take a file and tag it for various things, like quality and so on. Depending on the business scenario, tagging can be a lot of things. There is software that can analyze the voice in these media files and convert it to text, and once it's converted to text, it is easier to tag. All the findings that you have on the media files you can populate into a MySQL database: all the tags for every recording, and also all the related entities, like who the customer was and who the agent was. That can be put in a MySQL database. So you have HDFS and your MySQL. Then you can have a custom media reporting solution. This is again custom because it's very specific to the use case. It can bring data out of the MySQL database and provide you some kind of reports, and it can also be used to provide a player for the recordings that are there. So you look at the metadata for a recording, and then you can use a player to play the file too, if somebody wants to listen to it from HDFS. So this is how you can set up an architecture for media files. As I was saying, this is a similar architecture even if you want to store files like photos, videos, audio files, or any kind of large files that you want to archive.
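The split between large files in HDFS and small, queryable metadata in a relational store can be sketched as follows. This is only an illustration: it uses Python's built-in `sqlite3` as a stand-in for the MySQL database mentioned above, and the table names, column names, and HDFS paths are hypothetical.

```python
import sqlite3

# Hypothetical metadata schema: one row per recording, many tags per recording.
SCHEMA = """
CREATE TABLE recordings (
    id INTEGER PRIMARY KEY,
    hdfs_path TEXT NOT NULL,
    customer TEXT,
    agent TEXT
);
CREATE TABLE tags (
    recording_id INTEGER REFERENCES recordings(id),
    tag TEXT NOT NULL
);
"""


def open_store():
    """Create an in-memory metadata store (stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_recording(conn, hdfs_path, customer, agent, tags):
    """Register one archived file plus the tags the analyzer produced."""
    cur = conn.execute(
        "INSERT INTO recordings (hdfs_path, customer, agent) VALUES (?, ?, ?)",
        (hdfs_path, customer, agent),
    )
    conn.executemany(
        "INSERT INTO tags (recording_id, tag) VALUES (?, ?)",
        [(cur.lastrowid, tag) for tag in tags],
    )
    return cur.lastrowid


def paths_with_tag(conn, tag):
    """Return the HDFS paths of all recordings carrying the given tag."""
    rows = conn.execute(
        "SELECT r.hdfs_path FROM recordings r "
        "JOIN tags t ON t.recording_id = r.id "
        "WHERE t.tag = ? ORDER BY r.hdfs_path",
        (tag,),
    )
    return [path for (path,) in rows]


conn = open_store()
add_recording(conn, "/archive/calls/0001.mp3", "cust-17", "agent-4",
              ["poor-audio", "script-followed"])
add_recording(conn, "/archive/calls/0002.mp3", "cust-22", "agent-9",
              ["script-followed"])
print(paths_with_tag(conn, "script-followed"))
```

The reporting tool only ever queries the small metadata tables. The HDFS path stored in each row is what a player would use to fetch the actual recording when somebody wants to listen to it.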
This is a kind of template architecture for large-file archives. So how does the solution look? The acquisition process is going to be through files. It's not going to be an API or anything like that; it is files. The transport layer is going to be SFTP. SFTP provides security, it provides compression, and it gives you easy tracking capabilities. That is the good thing with SFTP. For persistence, you're going to persist all the media files in HDFS, and you're going to be using MySQL for any kind of analytics data. This is where polyglot persistence comes in. You're using more than one form of storage, picking horses for courses: you put all the large files into HDFS, and all the metadata, which is not going to be that large, you put in MySQL, because MySQL gives you excellent querying capabilities. For transformation, there is a custom media analyzer. There is software available that can listen to a recording, convert it into text, tag it, and so on. That's why I call it custom. For the reporting layer, either you can put Impala on it or you can have a custom reporting tool. You can use Impala on HDFS and do some analytics, and you can also use a custom reporting tool if you want a really custom solution. And finally, advanced analytics: there is no advanced analytics in this use case. Now, remember that the use cases we're discussing are not really in isolation. Typically, in your company, you have to implement two or three use cases together, and then what you do is take all these use cases and combine them to create a solution. So that is what is going to happen.
We are looking at the use cases separately, but in your company you might actually have two or three or four use cases, and you might have to take all of them, combine them, and create an overall architecture. OK. Thank you.

30. Use Case 3 Social Media Sentiment Analysis: Hey, welcome to the use case on social media sentiment analysis. Social media analytics is one of the very popular use cases for bringing big data into your organization. Let's look at what this specific use case is about. XYZ News Corporation, a news company, news corporation, or news channel, whatever you want to call it, tracks popular topics on social media and uses them for its news reporting. This is something that all news channels do today: they track the popular topics for the day, whether it is politics or sports or whatever, and then give you some reporting based on that. They analyze and say how much positive sentiment there is in the world about a specific topic and how much negative sentiment there is. Let's say you take a sports team and see how much positive sentiment and how much negative sentiment there is about them. Typically, when you're picking a popular topic, the number of tweets is huge. It's not like you're going to be getting 100 tweets a day; you are going to be reading 100 tweets a second. That's the kind of volume that these topics typically generate. And they want to look at the tweets as they happen and produce reporting, especially, for example, if some event is going on and they want to tell in real time how the sentiment of the people is changing as the event goes on. Say something like a World Cup final is happening.
And they want to track how people's sentiment is changing as the game goes on, when goals are being scored, and stuff like that. So they want an automated system to capture all these social media interactions on popular topics and do some real-time sentiment analysis. The sentiment analysis then needs to be summarized for the news people, and it has to be archived for future analysis also, so that is going to be big work for them. So how do we go and design an architecture for this use case? Let's look at the characteristics of this use case. The sources are going to be Twitter and Facebook. Of course, you can include other social media websites also, but we will start with Twitter and Facebook, as they are the most popular. The topics are typically hashtags, so hashtags are how you will typically track popular topics. The data types are going to be tweets and posts, and they are going to be in JSON format, so they typically give you JSON. The mode of operation here is real time: real-time acquisition, real-time analytics and reporting. The data acquisition process is going to use streaming, which is a push technology: the clients subscribe to the streaming server, and then as and when tweets happen, Twitter pushes the data to the subscriber. The availability here is real time. It has to be done in real time, and this is on-the-fly analytics. The storage is write many, read many. You may wonder why it is write many, read many; we will look at that as we go along in the example. And the response times are going to be real time. 
You know, they typically have to turn around pretty fast because the whole thing is happening in real time. The model building is through sentiment analysis, and there are libraries available that you can use to do sentiment analysis. It typically comes out with a positive sentiment, a negative sentiment, or a neutral sentiment. Some libraries also provide additional capabilities to understand whether the tweets are happy, sad, angry, those kinds of things; it depends on the library that you use. How does the architecture for sentiment analysis look? We start off with the sources, which are Facebook and Twitter. These websites provide streaming API capabilities: you have to go and create an app on these websites and then set up a stream. Then your code is going to hook onto these streaming hooks and receive a stream of data as and when the events happen. How do we transport the data? We are going to take these streams and put them into Kafka. Kafka does have some direct hooks into Twitter and Facebook. That way you can configure Kafka to automatically hook onto Twitter and Facebook and receive the streams. As these streams come in, Kafka can take them and transport them across your network as various topics. These topics can then be subscribed to by any client, and that client can pull in all the topics and use them. How are we going to analyze this data? Through Spark Streaming. You are going to set up a Spark Streaming client to Kafka, and Kafka is going to keep publishing the data that is coming in from social media to Spark Streaming. 
Then Spark is going to keep listening to the streams and analyze them on the fly. It is going to separate and aggregate the tweets that are coming in by topic. It can do second-by-second or minute-by-minute analysis, depending on what you need. It can do sentiment analysis on every tweet that is coming in, and then it can also aggregate the data. Once it aggregates the data, you can push it into Cassandra. Why do we use Cassandra? As we talked about, Cassandra is a great option if you want to track everything around a specific object. In this use case, the object is the topic, so every topic will have a record in Cassandra. You use Cassandra to create one record for the topic, and then you keep updating that record as and when information keeps coming in about the topic in terms of sentiments, so you can have sentiment counters for the topic as it goes along. That is why I said it is a write many, read many situation, and Cassandra is great for that. You create one record per topic and you keep updating that record as and when data comes in about that topic. That makes Cassandra a great querying tool for data around a specific object. Of course, you are going to have a custom news summary application, your own news application that can read the data out of Cassandra and give you aggregations, all sorts of numbers and charts and everything that you want to publish for your readers or viewers. And of course, you need a custom topics configuration system: you need to tell Kafka and Spark what topics they should look for. 
So you need a custom configuration system, with which you can keep telling Kafka and Spark, okay, these are the topics I want you to listen to and process. As and when the topics change, they will pick up those configurations and keep publishing them. This way you use the big data capabilities, as you can see, of Spark and Kafka and Cassandra, while you also plug in your own custom code to suit your own business use case, in terms of the news summary and topics configuration. That way you are building an overall big data architecture for sentiment analysis. Moving to a summary of the solution: the acquisition process is going to be streaming. As we know, streaming is supported by all social media websites, so that is a popular option for acquiring data. The transport layer is going to be Kafka, because Kafka provides you a scalable way to transport data, and you can create Kafka topics based on the topics you want to listen to, and then whoever needs a topic can subscribe and listen to it. The persistence layer is going to be Cassandra, because Cassandra gives you a great way to store information around a specific object, and the object in this case is the topic. So you take the topic and store everything around that topic in Cassandra. The transformation layer is Apache Spark, because it gives you real-time stream subscription, transformation, and advanced analytics, all of them in one shot. As and when you listen to the topic, you can do all of them in one shot and get the analytics within seconds. The reporting layer is going to be a custom application for reading the Cassandra data; it can summarize for the news and then show all kinds of graphics specific to the news company, so this is a custom option. And advanced analytics is going to be sentiment analysis, and that again uses Apache Spark, which gives you on-the-fly stream processing. 
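The per-topic aggregation described above, where each tweet is scored and one counter record per tracked topic keeps getting updated, can be sketched in plain Python. This is only an illustration of the logic, not Spark Streaming or Cassandra code: the word lists, the function names, and the in-memory dictionary standing in for the Cassandra record are all assumptions made for the example, and a real system would use a proper sentiment library.

```python
# Hypothetical word lists; a real system would use a sentiment library.
POSITIVE = {"great", "love", "win", "amazing"}
NEGATIVE = {"bad", "hate", "lose", "awful"}


def score(text):
    """Label a tweet 'positive', 'negative' or 'neutral' by counting lexicon hits."""
    words = [w.strip(".,!?#").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def hashtags(text):
    """Extract lowercase hashtags (including the leading '#') from a tweet."""
    return {w.strip(".,!?").lower() for w in text.split() if w.startswith("#")}


def aggregate(tweets, tracked_topics, records=None):
    """Update one counter record per tracked topic, as one micro-batch would.

    `records` stands in for the per-topic rows in the persistence layer;
    passing the same dict again keeps updating the same records.
    """
    if records is None:
        records = {}
    for text in tweets:
        label = score(text)
        for tag in hashtags(text) & tracked_topics:
            record = records.setdefault(
                tag, {"positive": 0, "negative": 0, "neutral": 0}
            )
            record[label] += 1
    return records


batch = [
    "What a great goal! #WorldCup",
    "Awful refereeing, I hate this #WorldCup",
    "Watching #WorldCup tonight",
    "I love pizza #food",
]
print(aggregate(batch, tracked_topics={"#worldcup"}))
```

The set of tracked topics plays the role of the topics configuration system: changing that set changes which hashtags get a record, without touching the scoring code.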
Now, if you look at this, a big news corporation is going to be listening to a lot of topics at any time, and these are very popular topics. There is going to be a lot of trending happening on these topics in terms of the number of tweets. So to handle that kind of volume, you need a big data kind of setup to be able to manage that load and keep coming up with analytics. But if you are trying to do the same thing for your own company, and your company is not as popular as, say, the President of the United States, you are not going to get that many tweets. You may not need this scale of infrastructure, but you can still build this structure, because this whole thing we talked about can run on one machine or can scale across multiple nodes. That is the great thing about big data. You can build this and keep scaling as time goes on. So the solution fits whether you want to run everything on a single box or you want to scale across hundreds of servers. That is the great thing about big data applications. Hope this has been helpful to you. Thank you. 31. Use Case 4 Credit Card Fraud Detection: Hi. Welcome to this use case, credit card fraud prediction. This is a very popular use case. Why is it a popular use case? There are a lot of similar use cases, like spam filtering or network intrusion detection. In general, there are a lot of events happening, and you want to classify these events as either good or bad. That is what we are trying to do. So what is the use case we are dealing with here? ABC Systems runs a web-based retail solution, very similar to Amazon, where customers can come in and order any kind of product. 
It is a large website with a lot of people shopping for all kinds of products, a large-scale web-based retail shop. Sometimes credit card thieves use stolen credit card information to make purchases. They steal the information from somewhere, put it in, and make some purchases, and after some time, because the credit card companies invalidate these purchases, you might actually lose money when the credit card transactions turn out to be fraudulent. So that leads to loss of revenue. What your company wants to do is put in place a real-time credit card fraud prediction system, so that as the transaction is happening, you can see if it is a fraudulent transaction and block it before the customer actually buys something and gets away with it. So what are the characteristics of the solution? The source of the data is going to be web transactions, and the data is captured in real time. You are not only capturing the payment data, you are also capturing the behavior of the user. As the user navigates across your website from page to page, you track the behavior of the user also, because you want all that information to figure out whether a transaction is fraudulent or not. The data types you are dealing with are numerical; it is not text. These are more events, as to which pages are being clicked and so on, so it is more numeric data. There could be text, but it is not like megabytes or gigabytes of text for every transaction. The mode is both real time and historical: there is historical data collection, but the prediction happens in real time. Data acquisition is push-mode acquisition, because every browser is going to push data as and when events happen on the browser, as the user is actually doing things on the browser. 
Those events are pushed, possibly from JavaScript, directly into your data acquisition system. The availability is going to be real time, because predictions have to happen in real time; in real time you are going to predict whether the transaction is fraudulent or not. The response time has to be immediate, because, as I said, this is real time. The prediction has to be immediate: you cannot ask the user to wait until you decide whether the transaction is fraudulent or not; you have to do it then and there. And there is model building involved, which is binary classification, that is, classifying whether a specific transaction is going to be fraudulent or not. Now, one thing I want to say once again about all the use cases we are discussing is that the focus for us is on architecting the overall solution. We are not going to focus on specifics such as what the data scientist would do in terms of model building and so on. That is the data scientist's job, not the architect's job. As an architect, you want to put an infrastructure in place that can handle this kind of volume of data and provide the capability for the data engineers and the data scientists to build models, not really build the model itself. So this course is focused on the big data architecture. There are other courses on how you would actually build the models; that is not what we are focusing on here. Just FYI. So, credit card fraud prediction: how would that work? We start off with a custom web payment app, your own company's application, in which people log in, look at the various products that you offer, click on them, buy them, put them in the shopping cart, and stuff like that. 
Now, you might have tens or hundreds of web servers serving this application, and there could be thousands of users at any point in time using this application to click and buy and all kinds of stuff. So these web applications generate a lot of events as the user navigates from page to page, and all those events are sent directly to Apache Kafka. So Kafka is being used to get all these web events and transport them across your network into a centralized data store. An Apache Kafka client would be deployed on all these web servers, and the apps keep pushing the data into Apache Kafka. On the other side, Kafka collects and combines all the data and pushes it into a MongoDB database. So the MongoDB database is going to start accumulating the data as it comes in. Now, you would also feed in a fraud input, which tells you which transactions were actually fraud. This is typically post-facto work: you typically find out later that some transactions were fraudulent, and that input has to come from an external source to mark each of the records as, okay, this is a fraudulent record or this is not a fraudulent record, to say which one is good and which one is bad, because this prior identification is required for building a model for future prediction. So that fraud input has to come from outside to mark each of these records as either fraudulent or not. Now, once you do this, you put Apache Spark to work. Apache Spark will look at all the transactions that are there in MongoDB, with each transaction being flagged as fraudulent or not. 
And you start to build a model that will tell you what a fraudulent transaction looks like and what a non-fraudulent transaction looks like, and you use that to build a model that can predict, while transactions are happening, whether this is a fraudulent transaction or not. So you build a model, and the model itself can be stored back in MongoDB. So you have the historical data stored in MongoDB and the model itself stored in MongoDB, and the model can also be cached in memory within Apache Spark if you want to keep it there. Now, how will the actual prediction happen? You would have another app, a fraud prediction app, which is more like a web front end for Apache Spark itself, an application that your web payment application will call. When the customer says, this is all I want to buy, and clicks Buy, the web app sends a request to the fraud prediction app, asking whether this is a fraudulent transaction or not. The fraud prediction app then relays that to Apache Spark. Apache Spark will then use all the information it has about the transaction and the model it has already built, and use both of them to predict whether this particular transaction is fraudulent or not. That then gets relayed back to the fraud prediction app, which relays it back to the web app, and the web app can take action, for example saying that you cannot buy this at this time, or contact our customer service, or whatever. This is how you put together a fraud prediction setup. You can do a similar thing for spam detection for emails, spam detection for any messages, and you can use a similar architecture for network intrusion detection. This is like a template that you can apply to all these kinds of use cases. 
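The train-then-predict flow described above, with labeled historical transactions going in and an allow or block answer coming back to the web app, can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the nearest-centroid classifier is a deliberately simple stand-in for whatever binary classification model the data scientists would actually build in Spark, and the two features (order amount, pages viewed before purchase), the function names, and the sample records are all hypothetical.

```python
def train(records):
    """Build per-class centroids from (features, is_fraud) pairs.

    `records` stands in for the flagged transactions accumulated in the
    data store; both classes must be present in the history.
    """
    sums, counts = {}, {}
    for features, is_fraud in records:
        acc = sums.setdefault(is_fraud, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[is_fraud] = counts.get(is_fraud, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Return True if the transaction is closer to the fraud centroid."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return squared_distance(model[True]) < squared_distance(model[False])


def check_transaction(model, features):
    """Answer the web app's question at the moment the user clicks Buy."""
    return "block" if predict(model, features) else "allow"


# Hypothetical history: (order amount, pages viewed before purchase), fraud flag.
# Features are left unscaled to keep the sketch short.
history = [
    ((20.0, 12), False),
    ((35.0, 9), False),
    ((900.0, 1), True),
    ((750.0, 2), True),
]
model = train(history)
print(check_transaction(model, (800.0, 1)))
print(check_transaction(model, (30.0, 10)))
```

The split into `train` and `check_transaction` mirrors the architecture: training runs over historical data on its own schedule, while the check is a cheap lookup against an already-built model, which is what lets the response be immediate.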
How does the solution look? Acquisition of data is through web events. These are generated by the custom web app that you built for your organization and deployed on a web farm. The transport layer is going to be Kafka, because Kafka provides reliable, scalable, real-time transport for data; it can collect data from all these web servers and then commit it into MongoDB. MongoDB is a good general-purpose NoSQL database where the web events and transactions can be accumulated, and the models can also be stored in MongoDB. The transformation layer is Spark, so you can use Spark to look at the events and do some transformation, do some summaries, and whatever you want to build in order to get the data ready for model building. Even though we have not put in anything for reporting here, given that the data is in MongoDB, you can put a reporting layer on top of the data that is already in MongoDB and provide some interesting reports also. As I was saying before, even though we look at each of these use cases in isolation, typically in an organization you are going to get two or three use cases and you are going to build an overall solution that puts them all together. Advanced analytics is through Spark, to build a binary classification model to predict whether a particular transaction is fraudulent or not. This is how you put a solution in place. This is an architecture solution; again, as I said, the data scientist has to bother about how the model will actually be built and what attributes and features will be used to build the model, but you are focused here on the architecture part. Hopefully, this is helpful to you. Thank you. 32. Use Case 5 Operations Analytics: Hi. Welcome to the lecture on the next use case, operations analytics. Operations analytics is a big field within big data, especially today when a lot of companies are moving to the cloud. 
A lot of these cloud data centers host web servers, like hundreds of them, and they want to be able to manage the operations of all the servers, look at the health of the servers, and make sure that failures are prevented before they occur. So what is this use case? The use case is very simple. ABC Systems runs a cloud-based data center with hundreds of nodes, and the data center needs to be kept operational 24 by 7. They are running hundreds of VM nodes, and each of these VM nodes has applications running on it. The nodes themselves generate a lot of traps and alarms, like CPU usage and memory usage, and the applications running on the nodes generate a number of logs. Among the logs that are coming in, there are definitely critical logs, which indicate that a critical failure is happening. There could be warning messages, and there would be a lot of informational messages about current utilization and stuff like that. So you need a way to pull in all these logs that are coming out of each of these nodes and be able to, first of all, differentiate between the various types of logs. Whenever critical information is coming in, you want to be able to process it in real time and alert the users. For the rest, you want to store the historical information to generate trends and statistics. So, to help in management, they want to set up a reporting and analytics system that will provide them the following. They want real-time node health monitoring: in real time, they want to know if there are any nodes in a critical state. They want to do historical root cause analysis of problems, which means they want to look at the logs in general, do historical analysis, and try to find out if there are any kinds of patterns in how a failure might happen, and stuff like that. 
And finally, they also want to predict node failures: can we look at the sequence of logs that are coming in and see if this kind of pattern would result in a failure later? And if so, how can we go ahead and prevent that kind of failure? This is what they want to do with operations analytics, and they want you to architect a big data solution to solve it. So what are the characteristics of this requirement? The source of data is going to be server logs. Server logs are generated by VMs and applications. Typically these logs are rolled over; they get rolled over every five minutes or ten minutes, depending on how they are used. You can put monitors on these logs, so whenever a new log message is written to the log, it can be picked up and propagated. The data type being sent is text: log messages are typically text messages, but they also have some predefined structure, like they will start with the timestamp, then may have a node name, and so on. The mode of operation is going to be real time, because we want real-time health information. The data acquisition is going to be streaming, or push, where there would be agents or clients sitting on each of these nodes, listening to all the logs on those nodes, and whenever a new log message arrives, that message is taken and pushed onward. Availability needs to be real time, because we are doing real-time monitoring of the data that is coming in. The storage type is going to be write many, read many, and that is because this is a use case where you are looking at each node. Each node becomes like an object, so you want to track all information around an object, which is the node, or the box, or the VM, whatever you want to call it. 
So you are going to create one node record, and you are going to be updating that record frequently with the current state and historical state, and you are also going to be reading the data for reporting purposes. The response times are going to be real time, because you want real-time network monitoring, and the model building is going to be classification, any kind of classification you prefer, to predict node failures. So how does the architecture look? We start off with a web node farm, which is a farm of web servers or VMs, however you have architected it. This node farm is going to generate log messages that get pushed into Apache Flume. Apache Flume is one of the best options available to propagate logs. You put Apache Flume agents on each of the nodes, and the agents take care of pulling in data from each of the logs and pushing it onto a central stream of logs that flows through multiple Apache Flume servers or agents before it reaches the destination. Now, the data that is coming in through Apache Flume needs to be used for two purposes: you want to do real-time health monitoring, and you also want to do historical analysis. So you first set up a sink for this flow for real-time health monitoring, which would be Spark Streaming. You put Spark Streaming in the middle, and it looks at the events happening in real time, first for events that are critical for you. You don't have to look at all the log messages, because there could be tons of them; you are focused on really serious issues. So you filter out those log messages in Spark Streaming, transform them as appropriate, and push that into Cassandra. In Cassandra, you are going to store one record per node. 
And that one record is going to keep getting updated again and again with all the statistics and everything that you find about the node, and that is why Cassandra is going to be more suitable. Spark Streaming is used for filtering data and summarizing data in real time and putting it into Cassandra, and it can also be used for real-time prediction. You might have historical data that you used to build models for predicting failures, and the same Spark Streaming instance can be used to predict failures for the nodes also, and that also gets updated into Cassandra. On top of Cassandra, you can put an operations dashboard, maybe a custom operations dashboard, or you can buy a third-party product. Real-time monitoring is a big thing that you want to invest in: a good dashboard that can read the data in Cassandra and keep showing you real-time analytics, like node-by-node state as well as a summary of the node states, how many are up, how many are down, how many are at risk of going down, and a lot of stuff like that. In parallel, you also want to dump all the log messages that you are getting into HDFS for future analysis. All the logs you are getting go into HDFS, and you can keep as much as you want. This is a second channel that you open, through which you dump data into HDFS. Then you put another instance of Spark there, whose job is to analyze this HDFS data in a historical fashion. It is going to sit and mine all the logs that are coming in, collect statistics by node, and then update Cassandra with all kinds of statistics, like uptime, downtime, CPU utilization metrics, all kinds of stuff. The same Apache Spark instance can also be used for building a model for node failures. 
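The two-channel split described above, where every log message is archived for later analysis while only critical messages update the per-node record in real time, can be sketched in plain Python. This is a rough illustration of the routing logic only, not Flume or Spark code. The log line layout (timestamp, node name, level, message), the level names, the sample lines, and the in-memory list and dictionary standing in for HDFS and Cassandra are all assumptions made for the example.

```python
def parse(line):
    """Split a log line of the assumed form 'timestamp node level message'."""
    timestamp, node, level, message = line.split(" ", 3)
    return {"timestamp": timestamp, "node": node, "level": level, "message": message}


def route(lines):
    """Archive every message; update a node record only for critical ones."""
    archive = []        # stands in for the raw log store (second channel)
    node_records = {}   # stands in for the one-record-per-node table
    for line in lines:
        event = parse(line)
        archive.append(line)                  # historical path keeps everything
        if event["level"] == "CRITICAL":      # real-time path filters hard
            record = node_records.setdefault(
                event["node"], {"state": "ok", "critical_count": 0}
            )
            record["state"] = "critical"
            record["critical_count"] += 1
            record["last_critical"] = event["timestamp"]
    return archive, node_records


# Hypothetical sample log lines.
logs = [
    "2024-01-01T10:00:00 node-1 INFO cpu=35%",
    "2024-01-01T10:00:05 node-2 CRITICAL disk failure on /dev/sda",
    "2024-01-01T10:00:09 node-1 WARNING memory=91%",
    "2024-01-01T10:00:12 node-2 CRITICAL service stopped",
]
archive, node_records = route(logs)
print(len(archive), node_records)
```

Keeping the real-time branch down to a single cheap filter is the point of the design: the expensive statistics can be computed later over the archive, at whatever pace the batch side can afford.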
That model can also be stored in Cassandra, and it will then be used by the real-time Spark Streaming instance to predict failures. The same data in Cassandra can also be used on the operations dashboard for statistics and analysis. So this is how you take one stream of data and split it into real time and historical. You process critical issues in real time, you leave the rest of the stuff to historical processing, and then you populate the operations dashboard. You can ask the question: why can't we do everything in real time? You can, but then you have to provide for that volume of data processing, right? Because in real time you cannot afford any delays, and the volume of data that you are going to be processing is large, you have to size it up, and the cost of the solution will go up, because you are investing in that many servers that can do parallel processing of data and push the data into Cassandra. In historical data processing, you can afford some delay. So what you need is more of a queuing mechanism, and that queuing is provided to you by HDFS. HDFS is going to keep holding all the data for you, and you can process that data in HDFS in your own time through Apache Spark and put it in Cassandra. So it is up to you how much of the data you want to split between real time and historical, whether you want everything in real time or everything in historical processing, but the more you want to run in real time, the more investment you have to make in terms of hardware and parallel operations. Let's review the overall solution we put in. Acquisition of data is going to be from log files that are being created by the various VMs and applications. The transport layer is going to be Flume. 
Flume agents are deployed on each of these nodes, and they help acquire all the log messages and transport them through a layer. Flume can be set up as a one-box system or a multi-box system, depending on the scalability you need, and it can push all the data down to the big data layer. For persistence, we use HDFS for storing raw files and Cassandra for storing summaries for each node that you have. For transformation, you are going to use Apache Spark for real-time stream subscription and transformation, for computing real-time monitoring statistics, and also for predicting node failures, and the same Apache Spark you would use for all historical log analysis and statistics. You will see that the only option I have been using in all the use cases for transformation and advanced analytics is Apache Spark, because that seems to be the best option available that can provide you all the scalability and reliability that you need today. You can use R and Python for analytics, but that is more for very small-scale operations. Reporting is third party. An important thing: this kind of solution requires a good reporting system, because when you do real-time monitoring, you are going to have many monitors up there. There is going to be an operations center in which you have all this monitoring on the walls and people constantly looking at it. So you want to invest in a good third-party solution that suits you. Advanced analytics, of course, is through Spark, for predicting node failures, and the failure information is also stored as part of the same record for the node. Remember that in the case of Cassandra, you are not going to have so many rows, but rather you are going to have so many columns. So the data is big in terms of the width of the row, not in the number of rows that you have. That is why I use Cassandra. I hope this has been helpful for you. Thank you. 33. Use Case 6 News Articles Recommendations: Hi. 
Welcome to this use case, news article recommendations. News article recommendation is a use case that is very similar to item recommendation: you go to Amazon and it starts recommending, okay, you might also like this item. This use case is also similar to what you see on LinkedIn and Facebook, where they start recommending people who might be friends with you, and stuff like that. All of them follow the same kind of use case paradigm. So what is this use case about? ABC News Corporation hosts a website where users can read articles about various day-to-day happenings. You know about a lot of these sites where you can go and read everyday news, and they have news on various topics and happenings in different countries, et cetera. For every registered user who logs in, ABC News wants to provide a list of recommended articles. They want to track what the user typically reads and likes, and based on that, when the user logs in, they want to give them a recommended list of articles, which says you might like these articles. That is the kind of list they want to create, and they want to create this list based on a web click analysis and prediction system. How they typically do it is that they look at the user behavior when users get to the website: which links they click to go to which articles, and similarly from which article to which article they usually jump, those kinds of relationship analyses. Based on that, they want to build a predictive model. So what are the various characteristics of this use case? The source is going to be web click events, which is when the user goes and clicks on various links on the website. You start tracking them: they went from item A to item B, item C to item D, that kind of link analysis. 
Then you look at the data types. It's going to be text and URLs only; this is going to be a list of URLs the users were visiting. The mode is both real time and historical: we're going to be collecting data in real time as the events happen, but the processing and model building would be historical. The data acquisition is going to be push-based, where the browsers are pushing events, web click events, through the servers to the data processing engine and to the database. The availability of data is going to be real time. In real time, we need to know about user and item relationships, like which users are related to which users. We also want to find similar users and similar items, so that when somebody logs in, you look at the history of users who are similar to this user, and based on that you can make recommendations, and similarly with items. So you want to create all these kinds of relationships. The storage type would be write-many, read-many. You will be building a profile about every user and every news item or news article. So there is one item and you're going to be building information around it, and you're not going to be building attributes; rather, you're going to be building relationships between these various users and articles. The response times are going to be real time in terms of predictions. For model building, you're going to be using various modeling techniques like collaborative filtering or association rules in your advanced analytics models, because you want to understand user relationships, article relationships and stuff like that. So what does the architecture look like? You start out with a web browser, which is not just one browser but a series of browser instances from which users log in. At any point there could be 1,000 users.
They are browsing your news website, and as they look through the links, they generate click events, and these click events are sent to the corresponding web servers. From there, they are all marshaled into an Apache Kafka pipeline. Although there may be multiple instances of Kafka running, you're just going to be pushing them all into Apache Kafka to gather this data. Then you ingest the data from Kafka. Given that there is no real-time analysis needed, you're just going to be pushing it all into HDFS for initial storage, for raw data storage. Once you have pushed the data into HDFS, on a periodic basis you can have Apache Spark look at this data, go through the links, and start to understand the relationships between various users and various news articles. It can look at all those relationships and try to come up with relationship diagrams, or what you call graphs: graphs that show the relationship and the affinity of the relationship between two users or two items. Then all the results of this analysis can be stored in Neo4j. We talked before about how Neo4j is a great database when you want to store relationship-related information, and this is a great opportunity for you to use Neo4j, because you are trying to track relationships between users, relationships between items, and also the user-item relationships. We're going to be capturing all this data in a Neo4j database, which gives you some really great performance when you're trying to traverse relationships. Once you have Neo4j, all you need then is a custom recommendation server. This will be a custom server that you would build, which would act more like a front end to the Neo4j database. Anyone who wants a recommendation for a user will come to this server, and this server would then query the Neo4j database to give you the recommendations.
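To make the link analysis step concrete, here is a minimal sketch in plain Python of what the periodic job could compute: how often readers jump from one article to another. It is only an illustration. The function names (`build_transition_graph`, `recommend_next`) and the event layout are assumptions made for this example; in the architecture described above, the counting would run in Apache Spark and the resulting edges would live in Neo4j rather than in an in-memory dictionary.

```python
from collections import defaultdict


def build_transition_graph(clicks):
    """Count article-to-article jumps from raw click events.

    clicks: iterable of (user_id, timestamp, article_id) tuples
    (a hypothetical event layout, not one given in the course).
    Returns {source_article: {target_article: jump_count}}.
    """
    events_by_user = defaultdict(list)
    for user_id, timestamp, article_id in clicks:
        events_by_user[user_id].append((timestamp, article_id))

    graph = defaultdict(lambda: defaultdict(int))
    for events in events_by_user.values():
        events.sort()  # order each user's clicks by time
        # Pair every click with the next click made by the same user.
        for (_, source), (_, target) in zip(events, events[1:]):
            if source != target:
                graph[source][target] += 1
    return graph


def recommend_next(graph, article_id, k=3):
    """Return up to k articles most often visited right after article_id."""
    targets = graph.get(article_id, {})
    ranked = sorted(targets.items(), key=lambda kv: (-kv[1], kv[0]))
    return [target for target, _ in ranked[:k]]


clicks = [
    ("u1", 1, "A"), ("u1", 2, "B"),
    ("u2", 1, "A"), ("u2", 2, "B"),
    ("u3", 1, "A"), ("u3", 2, "C"),
]
graph = build_transition_graph(clicks)
top = recommend_next(graph, "A")
```

A recommendation server in front of the graph store would answer the same kind of question as `recommend_next`, only by querying the stored relationships instead of a local dictionary.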
So the web browser that the user is going to use to browse is actually going to contact the recommendation server any time a user logs in, and say, "Give me a list of recommendations for this user." The recommendation server then contacts Neo4j, queries it, comes up with some results, and provides them to the web browser. So this is the cycle that you will follow when you are building recommendation engines. The key thing you see here with respect to the other use cases is that you're going to be using a graph database like Neo4j to store the relationships between the users and items. So let's go to a summary of the solution. Acquisition of data is going to be by web clicks and events: web clicks from the browsers are collected through a farm of web servers. For transport, you're going to be using Apache Kafka, which is going to transport this information from the farm of web servers into a centralized HDFS. For persistence, we're going to be using HDFS for storing all the raw events that are coming in, and the analyzed data is then summarized and stored as relationship information in Neo4j. All the transformation work you're going to be doing using Apache Spark, which can look at the data, process it, and then dump it into Neo4j. There are no reporting requirements, even though, if you want, you can always slap a reporting engine onto it. Any time you have a database, you can build a reporting solution on it if you want. From an advanced analytics perspective, you're going to be using Apache Spark once again to do historical event analysis, and then you're going to be doing things like collaborative filtering and association rules mining, and then taking the results of that and storing them back into Neo4j, from which you can do any kind of reporting that you want.
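Collaborative filtering is mentioned here without detail, so the following is a small hedged sketch of one simple form of it: user-based filtering with Jaccard similarity over the sets of articles each user has read. The choice of Jaccard similarity, the function names, and the sample data are all assumptions for illustration; the course does not say which variant is used, and at scale this would be done with Spark rather than plain Python.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_for_user(reads, user, k=3):
    """Suggest unread articles, weighted by how similar their readers are.

    reads: dict mapping user id -> set of article ids the user has read
    (hypothetical input, e.g. derived from the raw click events).
    """
    seen = reads.get(user, set())
    scores = {}
    for other, articles in reads.items():
        if other == user:
            continue
        similarity = jaccard(seen, articles)
        if similarity == 0:
            continue  # users with nothing in common contribute nothing
        for article in articles - seen:
            scores[article] = scores.get(article, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:k]]


reads = {
    "alice": {"A", "B"},
    "bob": {"A", "B", "C"},
    "carol": {"B", "D"},
    "dave": {"E"},
}
suggestions = recommend_for_user(reads, "alice")
```

Here bob shares two of three articles with alice and carol shares one of three, so bob's unread article ranks ahead of carol's, and dave's article is never suggested because he has nothing in common with alice.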
So this is a use case that is very similar to how you do any kind of recommendations based on similarity: trying to find similar users and similar items. This is what you see on Amazon when they try to recommend items to you, or even if you go to LinkedIn: they're trying to look at your relationship tree and trying to recommend friends or people you might know, and stuff like that. Hope this is helpful for you. Thank you. 34. Use Case 7 Customer 360: Hi. Welcome to this use case, Customer 360. Customer 360 is a classic use case that almost every business wants to create and have in their system. What does Customer 360 mean? Customer 360 means having a 360-degree view of your customer: what the customer wants, likes, doesn't like, everything you did with the customer. You want to track everything that you do with the customer and the customer does with you, and keep track of it in a database, so that on that data you can do any kind of analysis, any kind of predictive learning, and then start to improve your business with that customer. So let's have a description of the use case. XYZ Corporation produces and sells computers and accessories. So it's one that sells a lot of computers and accessories, and you will be familiar with a lot of companies that do similar business. It wants to track and manage all information about its customers and create a Customer 360 view. When they say they want a Customer 360 view, they want to see what the customer does on the website when the customer comes in and starts browsing and searching through products: which pages they look at, which reviews they look at. Then they want to look at the customer's purchasing history: what the customer actually bought from you, whether through the website, or whether they called in the order, or whether they went through a store.
You want all of that, and then the kinds of issues that the customer has reported, as well as the kinds of calls the customer has made to your contact center in terms of reporting issues, getting issues resolved, problems with billing, problems with the product, all kinds of stuff. And then you also want to know what the customer is doing on social media: whether he is commenting about your product, and whether he is commenting in a positive or a negative way. You want to get all that information about the customer, and then finally you want to use this information for doing some prospective selling. You want to identify whether it is possible for you to use the customer to sell more products. Can you get the customer to recommend your product? Or can you sell something directly to the customer? You can upsell something when a new product comes in or when an accessory comes in. You want to make all those decisions, and before making those decisions you need, as a base, all the data that you have about the customer. So, characteristics of this use case. The sources of data are going to be many. You have web browsing data. You have CRM data, which talks about the customer's purchase history as well as the customer's issues. You have contact center data, where you look at the emails the customer sent, the chats the customer had, or the recordings of the phone calls, and then analyze that to understand customer sentiment. Then you have social media data, where the customers are tweeting something about your product and you want to capture and see what the customers say. So you have all these kinds of data. The data types, of course, are of many kinds, as you can see: numbers, text, media, everything. The mode of data is going to be historical; a lot of this you're going to collect and process in a historical fashion. The data acquisition is going to be both.
That is, push mode and pull mode, depending upon the type of data sources you have. Availability of data is also kind of historical, to get all the data integration done to build the Customer 360. The storage type: there are multiple storage types in this one, but most of them are write-many, read-many, especially the Customer 360 view. The response times are going to be real time, because you want to query the customer profile and get the profile information in real time. For example, when the customer is actually browsing your website, you want to make some recommendations to the customer based on his profile, so that query is going to be real time. On model building, you would be building a number of models. In this case, you might classify the customer, for example based on sales prospects. You want to cluster customers into logical groups based on their behavior. You want collaborative filtering to find customers who are similar to each other and products that they like. There is a lot of model building you can do with the kind of data that you have. So what does the architecture look like? There are a number of data sources, as you have seen. The browser is one data source: as the customer is browsing, you generate all the browser clicks. The CRM is another data source, where you have information about the customer's orders, as well as information about the incident reports, the trouble tickets created, and stuff like that. Then you have the contact center data, which is all the emails sent, the chats made, and the voice calls, and you try to analyze the content of all this media to see what kind of sentiment the user has and what it is that the user is talking about. And then finally there is social media: the data that is coming through the Internet, and you're going to be getting that data also. So how do you get all this data?
First, let us get the click events from the browser. The click events will be sent to a live Kafka system, from where this data will be marshaled into the big data repository. So click events from browsers are sent to Kafka, and similarly, social media data will be pulled in through real-time streaming APIs and again pushed into Kafka. All of them are sent to Kafka, and from Kafka they will all go and land in HDFS for storage. So this is all text data, raw event and raw text data, that goes through Kafka to HDFS for storage. Now, CRM and contact center data mostly come from databases. CRM and contact center data are mostly already in databases, even though there could be media files. You could also handle them the same way you did in the media file store use case, but at least for this one let's assume that they are pre-processed and kept in an RDBMS kind of database. For that, you're going to be putting in Sqoop with a scheduler, and the transactions from the CRM come in through Sqoop. Similarly, the contact center transactions, or the information from all the recordings and the summarized information, are also fed through Sqoop, and from there they can be dumped into MongoDB. So this is RDBMS data that is processed and put into MongoDB. So you have information in HDFS first and then in MongoDB. Then we have Spark coming in. Spark can pick data from HDFS and MongoDB, connect these records based on the customer, and then analyze all this data, summarize the data by customer, as well as do all kinds of advanced analytics like prediction, clustering, collaborative filtering, classification, everything. And then you put the data finally into a Cassandra database. The Cassandra database would be indexed by customer: the customer ID will be the index, and on that record you'll be storing all kinds of data about the customer.
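The "one wide record per customer" idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: each source is assumed to have been summarized already to at most one record per customer, and the function name `build_customer_360`, the field names, and the `source:field` column naming are invented for the example. In the architecture described above, Spark would do the joining and Cassandra would hold the resulting rows.

```python
from collections import defaultdict


def build_customer_360(sources):
    """Merge per-source customer summaries into one wide record per customer.

    sources: dict mapping a source name (e.g. "web", "crm") to a list of
    dict records, each carrying a "customer_id" key. Every other field
    becomes a column named "<source>:<field>" on that customer's row.
    """
    profiles = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            customer_id = record["customer_id"]
            for field, value in record.items():
                if field == "customer_id":
                    continue
                profiles[customer_id][f"{source_name}:{field}"] = value
    return dict(profiles)


sources = {
    "web": [{"customer_id": 1, "page_views": 12}],
    "crm": [
        {"customer_id": 1, "orders": 3},
        {"customer_id": 2, "orders": 1},
    ],
    "social": [{"customer_id": 2, "sentiment": "positive"}],
}
profiles = build_customer_360(sources)
```

Keying everything on the customer ID is what lets a recommendation server fetch a complete profile with a single lookup, which is the access pattern described for the Cassandra record.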
There will be a huge set of columns for every record, in which you will be storing the Customer 360 information, everything you have analyzed and summarized. Once you have the data in Cassandra, you can put in a custom recommendation server, which can read the data in the Cassandra database. This custom recommendation server can then be used by the browser, the social media, or the CRM applications, which can get real-time information about what can be recommended to the customer and then use it in their own way: present it through the browser, or prompt the contact center agent to upsell something to the customer, anything like that. On this Cassandra database you can also put a reporting layer, because you have some really rich data sitting there. You can put on an ad hoc reporting layer, which your data analysts or customer-facing people can use to analyze the data and get more insights. So this is how you would marshal all the customer data from various sources into one destination, summarized and stored in a nice way that people can then use for any kind of reporting. A review of the solution: the acquisition is going to be through web clicks, transactions, and social media; there are multiple types. The transport layers are going to be a Kafka layer and a Sqoop layer. For the persistence layer, we're going to be using polyglot persistence. There is HDFS, there is MongoDB, there is Cassandra, and there is the RDBMS that already exists. So there are multiple types of databases, and we're choosing the best database for each type of data. Then we come to the transformation layer. Transformation is done through Apache Spark, which can ingest from both real-time and historical data sources, and then ingest all the data and create any kind of summarized data you want. Reporting will be third party.
If you want to put one on, you can use the data that is in MongoDB and Cassandra for building analytics views and dashboards and stuff like that. And finally, advanced analytics: that will again be through Spark, which can generate all kinds of customer classification and clustering data. That information can be stored finally in your Cassandra database and then be used by the reporting layer for doing analytics, as well as by the recommendation server to make recommendations to any of the requesting engines. This is how you would build Customer 360, a very popular use case; almost every company wants to create and have one. Hope this has been helpful to you. Thank you. Bye. 35. Use Case 8 IOT Connected Car: Hey, welcome to this use case on IoT, the connected car. Today, the Internet of Things field is becoming more and more popular. What the Internet of Things means is that you want to make every item in the world Internet connected, so that you can get data out of it, do some analytics, and based on that do some real-time work. In this case, the items you are going to be connecting to the Internet are cars. Cars are pretty costly items, and what the car manufacturers want to provide are services where they can track what is happening in the car. Based on what is happening in the car, they first want to provide some kind of alarms: if something is going to fail in the car, they want to inform the owner or the driver that something is going to fail and they might want to be careful. Also, they can use the same kind of information to sell some products. They can do some location-based marketing if they want to. The Internet of Things is going to be a more and more popular field, and it is also a major use case when it comes to big data and analytics. We're going to look at how one of the use cases for the Internet of Things is going to work.
So the use case description is that PQR Car Company wants to connect cars in real time to their analytics engine. They want to collect information from the cars in real time and analyze it. Cars, as you know, have multiple sensors on them: sensors on the engine, the transmission, the drive train, the fuel system, and so on. Sensors are almost all over the car. They want to collect the sensor data in real time, analyze the data, and look for any potential issues, and if there are any potential issues, they want to generate alarms and inform the driver that certain things might be going wrong. PQR needs a satellite-enabled data collection system. They want to connect cars to either satellites or mobile systems, through which they can collect the data, and they want to collect all the data into a central repository so that they can do analytics and then generate alarms back to the car. Characteristics of this use case: the sources are going to be car sensors, so this is sensor log data. The data type in this case is going to be numbers; it can even be binary, because sensor data is usually numbers or binary data. The mode of collection is both real time and historical, because when you are collecting data from the car, you want to process the critical data in real time, and there could be the rest of the data that you can collect, store, and analyze in your own time; that would be historical. The data acquisition mode is push: the car sensors are pushing the data through satellites to the central repository. Availability of data has to be real time, because you want to create real-time alarms. The storage type is going to be write-many, read-many.
You are going to build a database around each car, each car being an object, with all information about that object stored in that particular record. Response times are going to be real time, of course, because the alarms are going to be real time. Model building is to be able to predict issues with the car based on past data. So you want to build models by which you can predict: OK, certain things are happening in the car, certain kinds of sensor messages are seen, and that means there is a chance that a potential issue may happen in the future. We want to create these kinds of models. So how is the Internet-connected car going to work? First of all, the car has a number of sensors, and from the sensors the data is collected. Sensor data is collected periodically and sent to data collection nodes. These data collection nodes may be satellite nodes or mobile collection centers in which the data is collected. Up to these collection points, it is all going to be a custom solution; we are not going to be involved in it. But once the data gets to these data collection nodes, it is put onto an Apache Flume bus. There are Apache Flume agents in each of the nodes that will collect events from all these nodes and send them into Apache Flume. Once the data gets into Apache Flume, it is collated into a central place, from where you can have a Spark Streaming server listening to the data coming in on Apache Flume and doing analytics in real time on the data that is coming in, and based on that producing both real-time metrics as well as real-time predictions of failures. All these things are happening in real time, and that data is stored in Cassandra. In Cassandra, you're going to be creating one record per car, and then you're going to be collecting all information about that car in that one record.
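The real-time part of this pipeline can be illustrated with a small plain-Python sketch: keep a rolling window of recent readings per car and per sensor, and raise an alarm when the rolling average crosses a limit. Everything specific here is an assumption made for the example: the class name `SensorMonitor`, the sensor name `engine_temp`, the threshold, and the window size. A fixed threshold is also a stand-in for the predictive models the lecture refers to, and in the described architecture this logic would run inside Spark Streaming with results written to the per-car record in Cassandra.

```python
from collections import defaultdict, deque


class SensorMonitor:
    """Tracks a rolling mean per (car, sensor) and flags threshold breaches."""

    def __init__(self, thresholds, window=3):
        self.thresholds = thresholds  # hypothetical limits, keyed by sensor name
        self.window = window
        # One bounded buffer of recent readings per (car_id, sensor) pair.
        self.readings = defaultdict(lambda: deque(maxlen=window))

    def process(self, car_id, sensor, value):
        """Record one reading; return an alarm dict or None."""
        buffer = self.readings[(car_id, sensor)]
        buffer.append(value)
        limit = self.thresholds.get(sensor)
        if limit is None or len(buffer) < self.window:
            return None  # unknown sensor, or not enough data yet
        rolling_mean = sum(buffer) / len(buffer)
        if rolling_mean > limit:
            return {"car_id": car_id, "sensor": sensor, "rolling_mean": rolling_mean}
        return None


monitor = SensorMonitor({"engine_temp": 100}, window=3)
results = [
    monitor.process("car-1", "engine_temp", value)
    for value in (90, 105, 110)
]
```

Only the third reading produces an alarm, because the window is full by then and the mean of 90, 105, and 110 is above the limit of 100. Using a rolling mean instead of a single reading avoids alarming on one noisy value.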
You will also be sending, in parallel, all the sensor data that you get into HDFS for future analysis. So you'll be collecting all the data in HDFS first, and then you can again have Apache Spark working on the data in HDFS to do a lot more analysis. In real time, with Spark Streaming, you'll be focused only on critical items; in historical mode, it will be analyzing a lot more information, like log statistics and all kinds of stuff. And that also would go back to Cassandra, in terms of all the information about the various cars. Once the car data is in Cassandra, you can actually do any kind of real-time reporting or historical reporting based on it, and you can slap a reporting engine on this one. But the most important thing we talked about is the alarms. So you create an alarm server, which would be continuously reading data from this Cassandra database and then pushing those alarms back to the car. The car will have an alarm system that is connected to this alarm server. On a periodic basis, the alarm system is going to contact the alarm server to see if there are any messages for the car, and if there are any alarms, they are pushed to the car, and then the car alarm system will push them to the driver, and the driver can look at the data and react appropriately. So this is how Internet of Things data collection and processing would look. It looks like traditional analytics, the difference being in how the source and the sink work. You will actually see a lot of similarities between the various use cases, because once it gets to data processing, the architectures are going to start looking pretty similar. So this is what it would look like. A review of the solution: the acquisition is going to be events from car sensors, and they are typically embedded in every car.
Then you have a transport layer, which is custom: the custom layer is for transferring data through the mobile networks or satellites. And then there is also Kafka. The persistence layer is going to be HDFS and Cassandra: HDFS for the raw events and Cassandra for the summarized data. Transformation is going to be through Apache Spark; Apache Spark can ingest data in both real-time and historical modes. Reporting is again going to be custom. You can have reporting inside the car to show some statistics and comparisons and stuff like that. For example, you can even show messages like "Currently your average MPG is so-and-so, and other similar models are getting so-and-so," so drivers can compare their car with other cars. That is in-car reporting. And there can be centralized reporting too, for the car manufacturers, where across all cars they can show some summary information. In terms of advanced analytics, you will again use Spark for building your prediction models and actually doing real-time predictions. So this is how IoT is becoming more and more popular today: for every device, every item in the world, you want to give it an IP address, connect it to the Internet, and get data out of it. That is becoming a more and more popular use case for big data analytics. I hope this has been helpful for you. Thank you. 36. Transitioning to Big Data: Hey, welcome to this lecture on transitioning to big data. Typically, you would be part of an IT department or a software product company that currently has traditional data solutions and traditional applications, and you are trying to transition into big data. So what are the things you need to take care of? Let's start with the first one, which is: start with new use cases. If you already have an application that is built and running for some purpose, do not try to move that application.
Do not take it to big data first, because then obviously there will be things like comparison and contrasting. Plus, an already built application typically has pretty rigid features that you have to completely replicate in the new setup, and there will be a lot of challenges with that. Rather than that, look at use cases that are totally new to your company, which might be related to advanced analytics, or social media, or something like mobile, so that the project has some real value that it can show off separately, rather than being compared with something that is already running, which is going to be more difficult for you to sell inside the company. Leave what works alone. Do not try to move something that is already working fine to big data as your first project. Because it's already working, people are happy with it. If you try to move it and something doesn't work, you're going to be getting beaten up for no reason. Choose simple use cases that sell quickly. Simple use cases like data warehousing: a simple use case you can do quickly, and that will convince other people quickly of the value before you start off on more difficult and more time-consuming projects. Do some proof-of-concept projects, always, with big data. Don't jump straight away into product development. Do some proof of concept, try to see that everything works fine and runs fine, and get a good feel for the solution before you dive into doing the production version, because there could be a lot of challenges that you will unearth when you're doing the proof of concept. Big data poses some unique challenges; unless you get your hands dirty, you're not going to see what they are. Be careful about advanced analytics. There is a lot of hype going around regarding predictive analytics and its benefits. Not all data has signals. Not all predictive analytics projects deliver value.
So you have to be pretty clear and pretty upfront when you start off advanced analytics projects, to tell people that the value is something you can only tell them later, and do not set high expectations to start with. Build sandboxes and move them gradually to production. Do not jump straight into production. That's why you choose new use cases: you can have some time to put something in a sandbox, try it for some time, let people experience it for some time and give you feedback, and then you can incorporate that feedback. Finally, don't always think about scalability, as in "This is big data, it needs big scalability." The good thing about all these big data technologies is that they all scale horizontally, which means you can just start with one server with one instance of each of the components running, and then you can keep adding more and more nodes and more and more instances as you grow. That is one of the biggest advantages of all the big data technologies. Don't over-invest heavily up front. You can start off small and then keep growing as your needs grow. Hope this has been helpful to you. Thank you. 37. Closing Remarks ADBS: Hi. Welcome to the closing remarks for the course Architecting Big Data Solutions. I believe you had a great experience learning from this course. Just to review: you have gone through the differences between traditional data solutions and big data solutions. You have looked at an architecture template for how a big data solution is built. You then looked at the modules in a big data solution, and for each of the modules you looked at what the architecture is and what the best practices are. We looked at the technology options available for each of the modules, with their advantages, shortcomings and use cases. And finally we looked at a number of enterprise use cases. So where do we go from here?
What I would recommend is that you continue to learn more about data science and analytics. Learn more about the various big data technologies that we discussed in the course. Look at programming with big data with Python, Scala or R. Look at Apache Spark, that being a very critical component of any kind of architecture that you put in place. Learn about NoSQL, the various NoSQL databases and how to use them. And also look at various functional areas like finance, operations, customer knowledge, etcetera. Learn more, and advance your learning on each of these technologies. That way you get a complete picture of how you would use all the technologies and implement them as a part of any solution. So congratulations on finishing the course, and we hope that this course helps you in advancing your career. Thank you, and best of luck in your future endeavors. This is your instructor, Kumaran, signing off. Bye.