Executive Briefing: Big Data and the Hadoop Ecosystem

Frank Kane, Founder of Sundog Education, ex-Amazon

11 Lessons (42m)
    • 1. Introduction and Learning Objectives (3:00)
    • 2. Please Follow Me on Skillshare (0:16)
    • 3. Definitions and Applications of Big Data (6:31)
    • 4. Vertical vs. Horizontal Scaling (3:09)
    • 5. Big Data: The Lay of the Land (10:03)
    • 6. The Hadoop Ecosystem (7:27)
    • 7. Quiz (1:58)
    • 8. Big Data, Big Decisions (5:14)
    • 9. The Costs of Big Data (3:58)
    • 10. Conclusion (0:31)
    • 11. Let's Stay in Touch (0:46)

292 Students

About This Class

If you're a business leader who keeps hearing about "big data" but is confused about what exactly it means and how it works, this short course will demystify it for you. Anyone curious about how technologies such as Hadoop, Cloudera, Hortonworks, and MapR work and what they're used for will learn enough to be dangerous here.

This is not a hands-on, technical programming course, but rather an executive briefing on what these technologies offer, how they impact your business, and how to make the right decisions surrounding their use.

We pack a lot of information into this 45-minute course, including:

  • How to define "big data" with the "four V's"

  • Real-world examples of big data in business

  • Horizontal vs. vertical scaling

  • The main components of Hadoop

  • Understanding the main data platform vendors (Cloudera and Hortonworks), their product offerings, and the impact of their merger.

  • The many components of the larger Hadoop ecosystem, the buzzwords surrounding it, and how they fit together

  • How to decide between hosting your own "big data" cluster and leasing capacity from cloud service providers such as AWS

  • Understanding the real costs of deploying a big data platform in your organization, including the need to re-organize.

  • Deciding whether you really need "big data" systems in the first place - it's not for everyone!

Your instructor is Frank Kane, a former senior manager from Amazon and IMDb who led the development of numerous "big data" systems from both the technical and managerial side. Frank will share his experience on how these systems work from a technical side, and how they impact your organization from the business side as well.

There is no need to know how to program in order to understand the content in this course. It's designed for someone who is technically literate, but not necessarily a technician.

Meet Your Teacher

Frank Kane

Founder of Sundog Education, ex-Amazon

Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers around the clock. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology and teaching others about big data analysis.

Transcripts

1. Introduction and Learning Objectives: Welcome to your executive briefing on big data. I'm Frank Kane, and I spent my career at Amazon.com, where I led the development of many systems that collected, stored, and analyzed Amazon's massive data sets. I'll share my experience as a senior manager to help you make sense of this complex and rapidly changing field and enable you to make business decisions around the use of big data technologies. Making the wrong choices about how to manage your company's data can come at a very high cost; building and maintaining a cluster is a large, costly, and disruptive task for any organization. How do you know if your data is really big? Do these systems offer enough benefit to justify the expense? Do you host your own systems or lease them in the cloud? And what's the difference between Hortonworks, Cloudera, Hadoop, and all these other technologies? This short course will clear up a lot of the mystery that might surround these systems for you. We'll start off covering some examples of big data and how the ability to process and extract meaning from it can generate real business value. Then we'll do a really quick overview of the many technologies and vendors involved in this field, so you'll at least be familiar with some of the main buzzwords. We'll then talk about the realities of deploying these systems in your organization and the main decisions you'll need to make along the way. Finally, we'll have a frank discussion about the challenges and costs you can expect to encounter, so you can balance that against the potential benefits. This course is targeted toward business leaders at companies with significant data and information technology who just want a quick, high-level overview of the world of big data. The goal is to get you conversant in these technologies and aware of some of the considerations involved with them. If you're looking for in-depth technical information or hands-on programming exercises, you want one of my other courses, not this one. We have a lot to cover, and you don't have a lot of time, so let's dive right in. Here are the objectives of this short course. We'll define what we mean by big data and look at a few examples of how other companies are generating it and extracting value from it. We'll have a really quick crash course on the main vendors involved in the world of big data, the wide variety of open source technologies they piece together in their platforms, and how those technologies work together. We'll talk about some of the main decisions you'll have to make when choosing to deploy a big data platform. And finally, we'll have some real talk about whether your organization really needs these systems or not. There are real and significant costs involved in moving to a whole new data platform, and you need to know the right questions to ask before you make an expensive mistake. I want to reiterate that this course is not for engineers or people looking for low-level technical details on how to administer or use a platform such as Hadoop or Cloudera. This is an executive briefing for business leaders who have little to no prior knowledge of what is really meant by big data but want to understand how it might affect or benefit their business. We're going to keep things quick and high level here; for more depth, you'll want to look at my other courses.

2. Please Follow Me on Skillshare: The world of big data and machine learning is immense and constantly evolving.
If you haven't already, be sure to hit the follow button next to my name on the main page of this course. That way you'll be sure to get announcements of my future courses and news as this industry continues to change.

3. Definitions and Applications of Big Data: People throw around the term big data a lot, but what do we really mean by it? How do you know if your data is big, and what technologies are available to help you manage that data and extract meaning from it at large scale? I like to keep things practical, so let's talk about some real-world examples of big data. We live in a world where there are more Internet-connected devices than there are people, by a lot, and all those devices are constantly creating data that needs to be collected, stored, and analyzed somehow. That's the inherent challenge presented by big data. This proliferation of devices is often referred to as the Internet of Things. As a relatively small example, consider Nest, which was purchased by Google in 2014. By some estimates, they're selling 100,000 Internet-connected thermostats every month. All of those thermostats report usage data back to Nest, where they analyze that data to automatically adjust your thermostat based on how you're interacting with it. For millions of thermostats, they collect every time someone adjusts one, whenever someone walks in front of one, and whenever someone's cell phone is close to one. The resulting data is massive, and they need to employ Google's tools for wrangling that data, as it is truly big. Any large website or retailer will be collecting massive amounts of data related to customer purchases, and even medium-sized websites can generate massive amounts of log data, recording every interaction people have with that website. When I worked at IMDb.com, just figuring out how many people visited our website every day turned out to be a massive technical challenge due to the scale of the data involved. At Amazon, everything we did involved big data, and a curse for any technical proposal would be the observation that it doesn't scale. Walmart is estimated to process over one million purchase transactions per hour, resulting in 2.5 petabytes of data to deal with. You can't just throw that sort of data into a traditional database and expect it to work; new technologies are needed to handle it. There's a lot of intersection between the fields of big data and deep learning or artificial intelligence. Artificial neural networks can take in data from a variety of different sources and learn how to find patterns in that data in much the same way that your brain does. Often the neural networks themselves are relatively simple, but the main challenge is just getting all of that data collected, cleaned up, and fed into them. You can't deploy a large-scale AI system without solving the big data problem first. The medical field also generates a lot of data, whether it's claim data from insurance companies, lab test results, or even individual DNA sequences. Our bodies, and the care we provide for them, also generate massive data sets, and extracting meaning from that data can result in new therapies and new medical tools that can actually save lives. Big data is often said to be characterized by four V's; it's not just the scale of your data that's a consideration. One is velocity. Storing large amounts of data isn't necessarily a hard problem, but handling a constant firehose of new data all the time can become a very challenging problem to deal with.
The rate at which a single computer, or even a network cable, can keep up with data can often be your main limiting factor, and it's what forces you into moving to a more distributed system that can handle higher data rates. Next is volume. That's the "big" in big data. In general, we can define big as being beyond the limits of what a single computer can process efficiently. It's not just about storage space. A single computer is also limited by the disk seek times of its hard drives, its bus speeds, its processors, and the amount of memory it has to work with. So even if you had an infinitely large hard drive, there would still be a point where processing data on a single computer becomes impractical. Variety presents another challenge. Data often comes from a variety of sources, and it can also vary in its quality. Consider a self-driving car. Its data comes from cameras, radars, internal sensors, and maybe even LiDAR data. All of that data varies in its quality, its nature, and its quantity, and it all needs to be processed in different ways before it can be used to train your car how to drive. Veracity is a more recent addition to this list, which calls out that data quality issue more explicitly. Consider the log data from your company's website. That data is going to be polluted by bots, crawlers from search engines, and even malicious attackers trying to bring down your website or steal your company's data. If you're trying to extract insights about your real customers from that data, first you need to somehow filter out all of the data that doesn't come from customers at all. This is a huge problem in the world of Internet advertising, for example, as it's hard to know how many of the clicks on your ad came from actual human beings. Although big data presents its challenges, the rewards of extracting meaning from it are often very compelling. You've probably noticed in recent years that your bank is doing a much better job of detecting potential fraud on your credit cards. The number of transactions Bank of America handles every day is unimaginable, but they have systems that automatically look for unusual behavior on every account and automatically flag it for review by their customers. Although developing that system must have been very expensive, it was probably a bargain compared to the losses they had been seeing from fraud beforehand. The human genome contains the equivalent of 1.5 gigabytes of information, and mining that data has real applications in gene therapy for curing real diseases. Even for more casual use, the company 23andMe has data on the DNA of over one million people, and they mine that data to help people connect with their ancestry and their own individual health risks. A single autonomous vehicle generates an estimated four terabytes of data every day. We're not quite there yet, but training cars to drive themselves using all of that data could ultimately save millions of lives once the cars get better at driving than humans are. And although it isn't necessarily saving the world, perhaps the most common application of big data is just analyzing that data in order to make better business decisions. You might take Google Analytics for granted when you're analyzing trends in how your company's website is being used, but under the hood is a very complex and large system for handling and analyzing all of the data Google Analytics collects from all of the websites that use it.
Once your company outgrows what Google Analytics can provide for you, the problem of handling your own web log data becomes your own.

4. Vertical vs. Horizontal Scaling: Before the world exploded with data, we typically managed it using a monolithic database of some sort. Open source relational databases like MySQL, PostgreSQL, and others are still in wide use today for smaller data applications. In the enterprise, commercial relational databases such as Oracle came with professional support and generally required specialized database administrators to maintain them. But for many applications, they do a fine job, both for fast real-time queries and for the more complex analysis that you might see in data warehousing operations. Although there might be standby instances maintained in case your primary database goes down, these traditional databases typically would run on a single machine. If you needed to store more data or handle more transactions with it, your only option was to get a bigger and faster computer to run it on. We call this vertical scaling, and as your needs increase, you just keep adding more capacity to an individual database host: more disk space, more memory, faster CPUs, whatever it takes. But vertical scaling has some inherent limits, because there is only so much a single computer can do. As large organizations started to generate more data than even the most powerful relational database host could handle, we were forced to find alternative approaches to storing and analyzing all of that data. Big data, if you will. The answer is what we call horizontal scaling. Instead of piling all your data into a single database host, spread it out across a cluster of smaller computers. Then, as your capacity needs increase, you can just keep adding more and more computers to your cluster to handle it, with practically no upper bound. The combined disk space, memory, and processing power of the entire cluster becomes available for storing and analyzing your data, no matter how large it may be. And as long as you have room in your data center, you can keep adding more computers to your distributed data store. Another benefit of horizontal scaling is redundancy. If you have enough computers, you can store backup copies of every piece of data across two, three, or even more physical machines. That way, if any individual computer in your cluster fails, the system can automatically switch to one of the backup copies of that data on other computers in the cluster; you don't need to suffer any downtime while that broken computer is replaced. These individual computers can be very inexpensive compared to hardware purpose-built for large relational databases, and may even be moved between different clusters as your requirements increase or decrease for different applications. You can even lease these computers from cloud-based services such as Google Cloud or Amazon Web Services if you don't want to maintain them yourself. How this works is, of course, more complicated than it looks here. In reality, some sort of master system needs to sit atop this cluster to figure out where requests for data should be routed and to keep track of which computers are currently operational. One of the first open source systems to manage this complexity is called Hadoop, and we'll dive more into it momentarily.
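(To make that routing-and-replication idea concrete, here is a tiny, purely illustrative Python sketch of how a master might decide which machines hold each record and keep redundant copies. This is not how Hadoop actually does it, and every node name and number below is made up.)

    import hashlib

    # Purely illustrative: a toy "master" that spreads records across a cluster
    # and keeps redundant copies. Real systems (HDFS, MongoDB, Cassandra, etc.)
    # are far more sophisticated; all names and numbers here are hypothetical.
    NODES = ["node01", "node02", "node03", "node04", "node05"]
    REPLICAS = 3  # keep 3 copies so losing one machine loses no data

    def nodes_for_key(key: str) -> list[str]:
        """Pick which machines should hold a record, plus its backup copies."""
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
        return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

    def store(key: str, value: str, cluster: dict) -> None:
        """Write the value to every node chosen for this key."""
        for node in nodes_for_key(key):
            cluster.setdefault(node, {})[key] = value

    cluster = {}
    store("customer:1001", "...purchase history...", cluster)
    print(nodes_for_key("customer:1001"))  # e.g. ['node03', 'node04', 'node05']

Losing any one of those machines in this toy example still leaves other copies of the record intact, which is the whole point of the redundancy described above.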
But many other systems have arisen for specialized use cases, including Elasticsearch for search and discovery, MongoDB for handling simple database-like queries at massive scale, Apache Spark for analyzing massive data sets using programming languages, and many more that we'll quickly touch on.

5. Big Data: The Lay of the Land: The world of big data involves a lot of different technologies, and vendors packaging those technologies for you. All of the names and buzzwords can be very overwhelming, and I'm at least going to introduce you to them so they seem less intimidating. Let's start at the top level and discuss Hadoop and some of the main technology vendors that have arisen around it. Our story starts with Hadoop, not that long ago really, in 2011. One of the first companies to be faced with big data was Google, in the early two thousands, and they were kind enough to publish their technical architecture for storing and analyzing their massive data sets gleaned from the entire Internet. A guy named Doug Cutting, who was working at Yahoo at the time, led the development of an open source implementation of the ideas Google had published. He called the system Hadoop after a yellow toy elephant that belonged to his young daughter, and that's also why Hadoop's logo is this yellow elephant. Hadoop consists of three main components: HDFS, YARN, and MapReduce. Let's start with HDFS. HDFS stands for the Hadoop Distributed File System. It's based on the same ideas behind the Google File System, or GFS, but HDFS is open source and freely available to anyone. All it does is manage how data is spread out across a cluster of computers. A lot of thought needs to go into how to spread that data out, to make sure it won't disappear if an individual computer in your cluster fails, while still making it possible for applications using that data to access it locally, on the same computer, whenever possible. HDFS does all of that thinking for you, so you can basically say "upload this data into my cluster," and HDFS will figure out all the specifics of where that data actually gets stored and do it for you. The next component of Hadoop is called YARN, which stands for Yet Another Resource Negotiator. It's the system that keeps track of which computers in your cluster are currently operational, which ones are down, and how heavily loaded each computer is. When you want to run some sort of data analysis task across your entire cluster, YARN is what decides which computers take on that task and how to best distribute it. It can try to make sure the data you want to analyze is physically stored on the same machines that will be running the analysis, while still trying to balance the workload across your entire cluster as best it can. Again, it's a complicated task, but YARN means you don't have to think about it. The third component of Hadoop is MapReduce. This is basically a copy of the same system Google originally used for analyzing its massive data sets, and they didn't even bother to change the name of it. With MapReduce, you can write a simple script to analyze data that might be spread out across an entire cluster, and it will figure out how to split up that analysis into individual tasks that YARN can then distribute for you. A MapReduce script generally involves writing mappers, which are functions that extract and transform the data you want to analyze. This mapping can be performed in parallel across all of the computers, or nodes, in your cluster. Once your data is mapped, a reducer operation can aggregate all that data together, for example summing it up or taking the average, to produce the final answer you want from your script.
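(As a rough illustration of that mapper/reducer pattern, here is a minimal Python sketch that counts page hits in a few made-up log lines. It runs on a single machine just to show the shape of the idea; on a real Hadoop cluster, the framework would run the mapper on many nodes in parallel, shuffle the results by key, and then run the reducer.)

    # A minimal sketch of the map/reduce pattern described above, runnable on a
    # plain Python list so you can see the idea without a cluster. The sample
    # log lines are made up.
    from itertools import groupby
    from operator import itemgetter

    log_lines = [
        "2019-03-01 GET /home",
        "2019-03-01 GET /products",
        "2019-03-02 GET /home",
    ]

    def mapper(line):
        """Extract (page, 1) pairs from each raw log line."""
        _, _, page = line.split()
        yield (page, 1)

    def reducer(page, counts):
        """Aggregate all the counts for one page into a total."""
        return (page, sum(counts))

    # "Map" every input record, then group by key and "reduce" each group.
    mapped = sorted(pair for line in log_lines for pair in mapper(line))
    results = [reducer(k, (c for _, c in g)) for k, g in groupby(mapped, key=itemgetter(0))]
    print(results)  # [('/home', 2), ('/products', 1)]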
Unleashing the power of Google into the open source world made Hadoop very popular, and an entire ecosystem soon evolved around it. Tools emerged that let you query your data using the same SQL syntax you would use in traditional relational databases, without having to write code in MapReduce. More open source systems emerged to further accelerate Hadoop, perform machine learning algorithms across a cluster, manage your cluster with simple user interfaces, and integrate your cluster with a variety of external systems. It soon led to a large and confusing array of individual technologies surrounding Hadoop. This presented an opportunity for vendors who could package some of these open source technologies under a single product, provide a management UI for it, and sell that package as an easy-to-install, cohesive data platform that includes professional support. The two leaders that soon emerged in this space were Hortonworks and Cloudera. Neither of them really talks about Hadoop anymore, since they're trying to sell the larger data platforms that surround it, but Hadoop remains at their core. Let's start by diving into Hortonworks. Their data platform has always been distinguished by loyalty to the open source model; they always choose open source systems as part of their package instead of developing their own. If you're a believer in the open source model and the portability that comes with avoiding lock-in to proprietary solutions, Hortonworks has always been a good choice. Their product offerings are relatively simple: HDP and HDF. Let's dive into HDP a bit. It stands for Hortonworks Data Platform, and it's been their core product. It's a collection of Hadoop and some of the main open source technologies that have sprung up around Hadoop to manage it, query it, and perform machine learning with it. The value HDP adds is bundling these technologies together in such a way that they interoperate properly out of the box, and in such a way that makes it easy to deploy across your cluster. And if you need more open source tools that aren't included with HDP, it's usually pretty easy to add them in yourself. The other product is HDF, or Hortonworks DataFlow. It's similar to HDP, but it focuses on what we call streaming data. It's a collection of tools that specialize in processing and analyzing data as soon as it is received. So instead of running some big analysis job at the end of every day on the data that's been received, HDF allows you to analyze data all the time, 24/7, updating your results the moment new data enters your system. The idea of batch processing your data in discrete chunks goes away and is replaced by systems that monitor data all the time as it's streamed into your cluster and recompute the insights that arise from that data in real time. If your business can't make do with results that are only updated daily or so, HDF gives you the tools to always have up-to-the-second results available. It's not really an either/or situation with HDP and HDF, however; it's possible to integrate open source data streaming technologies into HDP, and some even come built into HDP. Anyhow, it's not the sort of decision you can never change your mind on. The other big player in this space is Cloudera.
Quite honestly, some of their success has arisen from having a more appealing name than Hortonworks. Like Hortonworks, they're bundling technologies built around the Hadoop ecosystem into a cohesive data platform. But unlike Hortonworks, they aren't as committed to using open source systems if they feel they can develop something better on their own. For example, they have a proprietary management UI for Cloudera clusters and developed their own SQL interface, called Impala. Most of what they offer is still open source, and they generally contribute their own systems back into the open source community, but Cloudera does not limit itself to systems that already exist in the open source world. Their product offerings also follow a different strategy. Instead of splitting the world across batch queries and real-time streaming, their products are more focused on the sorts of applications you might want to perform on your cluster. Cloudera Enterprise is their flagship product, much like HDP is to Hortonworks. This is their main product that can do pretty much anything you want. But if you want a more narrowly focused set of applications in your data platform, Cloudera offers several specialized products. One is Cloudera Data Warehouse, which is aimed at users who just want to run the same sorts of queries they ran on their older relational databases, but at larger scale. If all you really care about is ad hoc business analysis of your data and you've outgrown what a single relational database host can do, this could be a good place to start. Another offering is Cloudera Operational Database. Like Hortonworks' HDF, it offers real-time streaming capabilities on your data as it's received. Its focus is still on business analysis of your data, but on a continual basis, not just dealing with ad hoc queries on large batches of existing data. If there are metrics you need up-to-the-second results on at any time, Cloudera Operational Database might be what you want. Next we have Cloudera Data Science. This moves us beyond the world of database queries and into the world of data science and machine learning. It's made for organizations with people who can write code to train systems that can make predictions based on historical data, or to apply artificial intelligence to classifying new data that comes in. In reality, there's a lot of overlap between these offerings from Cloudera, and many of them use the same underlying technology, just packaged in different ways. Again, which product you go with initially isn't a decision you can never undo. So how do you choose between Hortonworks and Cloudera? Well, you no longer have to choose, and that's because Cloudera and Hortonworks have merged. Their combined company will be called Cloudera, so it seems likely that Hortonworks' HDP and HDF products will be phased out over time. Cloudera claims that this merger will combine the strengths of both companies, but the reality is that they both use the same set of underlying open source technologies for the most part; one system in particular, called Apache Spark, is really at the heart of much of it. I expect the new capabilities exposed to existing customers as a result of this merger to be minimal. But given this development, it seems that Cloudera is the right choice for new systems you might consider putting in place. Cloudera won't quite be a monopoly, however. There are still a few other players in the space worth considering.
Amazon offers Elastic MapReduce as part of its Amazon Web Services platform. However, it's fairly limited in its capabilities compared to Cloudera, and it isn't quite as easy to use, nor do you really have the option to run it within your own data centers. MapR is more comparable; it's a Cloudera competitor and is probably the main alternative you should be looking at. Their main differentiator is a proprietary database system that they claim to be several times faster. Like Amazon, Microsoft offers a somewhat limited data platform on its Azure web services, called HDInsight, and Altiscale is another cloud-based solution that I'm seeing mentioned more often as well. And another option is to just roll your own. Remember, most of these platforms rely heavily on open source systems that are freely available to anyone. If you have staff that is up to the technical task of building out their own cluster and installing their own systems onto it, it's entirely possible to set up precisely the systems you need on your own hardware for free. But the ongoing cost of paying people to maintain and update that system may be more than if you were using a pre-packaged solution from one of the vendors we discussed. So that's the lay of the land when it comes to technology vendors. Next, let's dive into the individual technologies that live under the hood in these systems.

6. The Hadoop Ecosystem: This lecture is going to seem like a bit of a whirlwind. There's a huge variety of individual technologies that make up a data platform, and they can all be integrated in interesting ways. My only objective here is to introduce you to the names of some of the main systems you might encounter and to see, at a very high level, what they're for and how they fit together. There's no way you're going to remember all of this, but at least you can come back here for a refresher if you need it, and the names won't feel entirely new to you when you run across them. There is no correct way to organize all of these different technologies, but I've segmented them into three main groups. One is what I call the core Hadoop ecosystem: technologies that are built to run on a Hadoop cluster. We then have external query engines that can execute queries spanning multiple different sources of data, and external data storage systems that stand on their own but can have their data integrated into your Hadoop cluster. Let's dive into the core systems. As we covered earlier, Hadoop itself consists of three main components. The first is HDFS, the Hadoop Distributed File System, which manages the storage of your big data. Next is YARN, which manages the resources on your cluster and figures out how to best distribute work; but an alternative to YARN called Mesos also exists, which can provide better performance in some situations. The third main component of Hadoop is MapReduce, which offers a means to write scripts that analyze data stored across your cluster by mapping that data into new representations in parallel and then reducing that data into the final results you want. MapReduce itself has fallen a little bit out of style, however; most people these days don't write MapReduce scripts, but write their queries on higher-level systems that can use MapReduce under the hood. And a faster alternative to MapReduce has evolved, called Tez.
People usually don't interact with Tez directly, but it offers a more efficient way of breaking complex queries into pieces that can be run in parallel and in the optimal order. Confused yet? Like I said, all I'm trying to do is get you familiar with these names; you need my much longer course to really dive into these technologies. Let's keep going. Let's zoom out and remind ourselves where we are. So at the root of everything is HDFS, which manages the storage of our data. YARN manages the cluster that includes HDFS, but you can use Mesos instead. MapReduce is mostly used as an intermediate layer for distributing SQL-style queries across your cluster, but a faster alternative called Tez is in wider use now. One of those higher-level query engines sitting atop MapReduce or Tez is Pig. It's an older technology that uses a special language called Pig Latin to query your data. But most people would rather use the same language they use with their relational databases, which is called SQL, the Structured Query Language. Hive lets you write standard SQL-style queries on your massive data sets, and it then uses MapReduce or Tez to figure out a strategy for executing that query across your cluster. YARN would then figure out the specifics of what to run on which computer in your cluster, and HDFS, down at the bottom, would manage accessing the data itself. An alternative to Hive on the Cloudera platform is Impala. This is at the heart of Cloudera's Data Warehouse product offering. It's a more user-friendly interface for querying your data and performing business analysis on it. Let's zoom out again and get our bearings. You can see there are a lot of interconnected pieces here, and we need some way of managing the cluster as a whole. How do we see the status of the cluster and quickly install, remove, or configure the systems on it? We need a layer that sits atop the whole thing, and one such layer is Apache Ambari. It's what Hortonworks uses to manage its clusters, but Cloudera sometimes prefers to build its own things, and on Cloudera you'll find Cloudera Manager fulfilling the same role of managing all the various components of your cluster in one place. Next, let's talk about Apache Spark. Spark has emerged as a core component of modern data platforms. While it can run on top of Hadoop's YARN or Mesos, it can also run as a standalone cluster by itself if Spark is all that you need. Spark is a very efficient platform for analyzing big data. It consists of many components of its own, including Spark SQL, which allows you to issue SQL-style queries on your data, and Spark Streaming, which lets you analyze data as soon as it's received; this is what's at the heart of Hortonworks DataFlow, for example. And Spark also includes a machine learning library that allows you to apply complicated data science and machine learning algorithms to massive data sets, all in parallel across your entire cluster. Apache Spark is an incredibly powerful tool, but it does require some programming in the Python, Scala, or Java programming languages in order to use it. There is a lot of momentum behind Spark, and you should make sure the data scientists and machine learning specialists you hire are familiar with it. An alternative to Spark is called Flink, and some companies use Flink instead of Spark.
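(For a sense of what that looks like in practice, here is a minimal PySpark sketch of a SQL-style query. The file path and column names are hypothetical, and it assumes you have a Spark environment available; it is meant only to show the flavor of Spark SQL, not code from this course.)

    # A minimal PySpark sketch (hypothetical file path and column names) showing
    # a SQL-style query on a large data set, whether Spark runs on YARN, Mesos,
    # or as a standalone cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("OrdersByCountry").getOrCreate()

    # Read data that HDFS (or S3, etc.) has already spread across the cluster.
    orders = spark.read.csv("hdfs:///data/orders.csv", header=True, inferSchema=True)

    # Register it as a table and query it with ordinary SQL; Spark plans the
    # distributed execution for you.
    orders.createOrReplaceTempView("orders")
    top_countries = spark.sql("""
        SELECT country, SUM(total) AS revenue
        FROM orders
        GROUP BY country
        ORDER BY revenue DESC
        LIMIT 10
    """)
    top_countries.show()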
Let's go through the remaining technologies really quickly. Apache HBase is an example of what we call a distributed data store, sometimes referred to as a NoSQL database. It's a very fast, scalable way to vend the data on your Hadoop cluster to outside systems or services, provided the queries those outside systems need are simple in nature. Apache Storm is basically an alternative to Spark Streaming for processing real-time streaming data. Oozie is a tool for scheduling jobs on your cluster, and as such, it sits alongside all of the other systems you have. It just lets you define which batch analysis jobs you want to run, when, and on what schedule. ZooKeeper also sits alongside everything. It's basically a place where other systems can reliably store their own data for configuration, keeping track of which nodes in your cluster are currently running, and stuff like that. It's designed to be highly resilient to failure, since the data it maintains is critical to the operation of your cluster. There are also a variety of tools that exist for reliably getting data into your cluster from outside sources. These include Sqoop, which can import data from external relational databases, and Flume and Kafka, which are used for streaming real-time data into your cluster as it's generated. Some of those external data sources might include relational databases such as MySQL, or distributed data stores such as Cassandra or MongoDB. These distributed data stores run on clusters of their own that can serve data to outside systems, such as your website, at high transaction rates, but you can integrate them into your Hadoop cluster too; maybe you run some heavy analysis and processing on your Hadoop cluster and push the results out to MongoDB so the data can be fed into your website. Finally, there are a huge variety of systems that allow you to query data that might span many different systems. You might have data that's accessed through Hive, HBase, MongoDB, external SQL databases, or other sources. These systems allow you to query data across all of those systems at once and combine it together. All of them have different use cases, but they serve the same general purpose. The main technologies in this arena include Apache Drill, Hue, Phoenix, Presto, and Zeppelin. That's quite a whirlwind, and your head probably feels like it's spinning right now. But at least you've had some exposure to the individual technologies that make up a data platform, and the names of them won't feel quite so foreign now. You probably also have a newfound appreciation for the problem that platform providers such as Cloudera solve in trying to figure out how to combine these various open source pieces into a cohesive product.

7. Quiz: All right, we've covered quite a bit already, so let's do a quick quiz to reinforce some of what you've learned. First question: we talked about the four V's that define big data. Which one of these is not one of the four V's? Is it velocity, volume, vitality, or variety? The four V's include velocity, volume, variety, and veracity, not vitality, although big data often is vital to a business. This one's a little trickier. We talked about horizontal scaling and systems that exist across a cluster that you can just keep adding more computers onto as the need arises. This is in contrast to vertical scaling, which is what traditional relational databases depended on: just buying a bigger and bigger computer for a monolithic database host. Which of these technologies employs horizontal, not vertical, scaling?
The answers are MongoDB, Spark, and Hadoop, which are all modern cluster-based approaches to storing and analyzing massive data sets. Oracle, MySQL, and PostgreSQL are relational database management systems that were not designed with horizontal scaling in mind. Which of these data platform providers offers a product specifically for data science? The answer is Cloudera, whose Cloudera Data Science Workbench offers a big data platform specifically designed for data science and machine learning on massive data sets. But the reality is that you could do data science using the other products as well. What are the three original components of Hadoop itself? The answer is the Hadoop Distributed File System, or HDFS; Yet Another Resource Negotiator, or YARN; and MapReduce, which was Google's original engine for analyzing massive data sets in parallel across a cluster.

8. Big Data, Big Decisions: If you decide to deploy a large-scale data platform in your organization, there are a few basic decisions you'll need to make. Let's quickly prepare you for what those decisions are. One big decision is whether to host the computers that make up your big data cluster on premises, in your own data center, or lease them from a cloud provider such as Amazon Web Services. Hosting your own cluster on premises obviously means a large up-front expense, as well as hiring people competent enough to design that cluster and build it out from scratch. You not only need system administrators to install and maintain the computers themselves, but you'll also need Hadoop architects who can work with your system administrators to design a system that meets your business needs and provides access to everyone in your company who needs it. You'll also be responsible for replacing computers that break down, provisioning new computers when you need them, and somehow disposing of older computers that have outlived their usefulness. But the on-premises approach does have a few advantages. If security is a paramount concern, for example if you're dealing with individual financial data or information that's actually classified, hosting the cluster that contains that data safely within your company's firewall is a pretty compelling idea. Storing that sort of information in a third-party cloud service means being careful to encrypt it, both when stored and over the wire as it's being accessed, and it's easy for someone to make a mistake that could threaten the very existence of your business. There are also situations where hosting your own hardware is cheaper. If you expect your cluster to be in constant use, at 100% capacity 24/7, then it's probably cheaper in the long run to just buy the computers, and a data center to put them in, than to lease them from someone else. Hosting your cluster within your own organization can also make it faster and easier to feed data into it; otherwise, you have to push all that data across the Internet into your remotely hosted cluster. But for most applications, installing your data platform on a cloud-based service can make more sense. Usage of data platforms is often very sporadic, and by leasing capacity as you need it, you can avoid paying for computing resources that you aren't using. There are also tricks, such as buying reserved instances, that can reduce your costs even further. A cloud service such as AWS makes it very easy to add or remove computers from your cluster as needed, without worrying about having to purchase, install, test, and set them up yourself.
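(As a back-of-the-envelope illustration of that buy-versus-lease trade-off, here is a small Python sketch. Every number in it is hypothetical; plug in real quotes from your hardware vendor and cloud provider before drawing any conclusions.)

    # A rough sketch of the buy-versus-lease reasoning described above.
    # All figures are made up for illustration only.
    nodes = 20
    on_prem_cost_per_node = 8_000       # purchase price, amortized over its life
    on_prem_lifetime_years = 3
    on_prem_ops_per_year = 30_000       # power, space, admin time for the cluster

    cloud_cost_per_node_hour = 0.50     # leased instance price
    utilization = 0.25                  # fraction of the year the cluster is busy

    on_prem_per_year = nodes * on_prem_cost_per_node / on_prem_lifetime_years + on_prem_ops_per_year
    cloud_per_year = nodes * cloud_cost_per_node_hour * 24 * 365 * utilization

    print(f"on-prem: ${on_prem_per_year:,.0f}/year")
    print(f"cloud:   ${cloud_per_year:,.0f}/year at {utilization:.0%} utilization")
    # At low utilization the leased cluster wins; as utilization approaches
    # 100%, owning the hardware starts to look cheaper.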
Leasing your computing capacity also means you don't have to deal with fixing broken computers. And believe me, they do break; the more computers you have in your cluster, the more likely a hardware failure will occur on any given day. Cloud providers may also offer capabilities that aren't available to your own organization, such as redundant data centers in different geographies or even different continents. The next choice is which technology vendor to choose for your data platform. Remember, rolling your own is an option; under the hood, it's all open source. You can always start by just installing the specific components you need. If you already have technical staff in place who are up to the task, let them provision a few machines on a cloud service and play around with it. It might be sufficient, at least in the near term; the only costs are the time it takes to let someone tinker with it and the cost of the leased hardware. However, as more people from different departments may want to use your cluster for different purposes, things can start to get ugly pretty quickly. You don't want a free-for-all, where anyone can install some new component that might break an existing one, and where someone from one department can run a job on your cluster that renders it inoperable for everyone else. But a highly technical team could conceivably set up a small cluster just for their own purposes, on their own, without writing a huge check to Cloudera in the process. If, however, you prefer to purchase an enterprise-level solution with enterprise-level support, Cloudera is really the main contender now that they have absorbed their biggest competitor. However, alternatives such as MapR are still worth a look. When working with a big data vendor, the main thing you're purchasing is maintenance and support of the software platform itself; remember, the underlying technology is all free, open source stuff. Vendors may also provide fully hosted solutions if you don't even want to worry about managing leased hardware in a service such as Amazon Web Services; truly turnkey solutions are out there, where not only the software but also the hardware is provided out of the box. Of course, this comes at a cost. An important point to keep in mind is that the world of big data continues to evolve quickly. The individual vendors will come and go and merge, as we've seen with Cloudera and Hortonworks. Perhaps the choice of a specific vendor is less important than ensuring that your data and applications will be portable, so you can move them to a different vendor later on should the need arise. This is a strong case for using open source technologies under the hood and avoiding proprietary big data technologies and systems. As an example, Amazon Web Services offers several services that fundamentally serve the same function as open source solutions, but by using them, you're locking yourself into the Amazon ecosystem, making it practically impossible to switch to a different platform down the road. And you never know what will happen. What if the vendor of your choice ends up merging with a competitor of yours, for example? You want to make sure you can move your data quickly if the need arises. The bottom line is to look critically at flashy software capabilities when they're proprietary in nature.

9. The Costs of Big Data: I want to end on an unusual note: maybe you don't need big data at all.
Sure, it's a hot buzzword, and it may seem like any serious technologist is taking advantage of these technologies. But the world of big data involves very high costs, and not just monetary ones. In reality, most companies don't need it and can operate just fine for the foreseeable future using older technologies such as relational databases; there are cases where vertical scaling still makes sense. Ask yourself if big data is really solving an actual pain point that you have right now. If your data is big enough to warrant investing in a big cluster for a data platform of some sort, nature has ways of letting you know. Are your engineers constantly cursing your database administrators because your monolithic relational database keeps failing due to the load being placed on it? Are your database administrators proposing increasingly expensive and complicated schemes to squeeze some more life out of your Oracle or MySQL database? If so, then you know it's time for a new paradigm. Your data and its processing need to be scaled out horizontally instead of vertically, and you're going to have to bite the bullet and invest the time and money to make that happen. I want to stress again that migrating to a big data platform is going to be expensive and disruptive to your business. If you're starting a new business or a new department, it could make sense to start with a modern distributed cluster. But don't underestimate the pain that will come with a migration project for existing data and analysis jobs. You really have to ask yourself, in that case, if it's worth it. What sorts of capabilities will your business gain by being able to analyze more data in many different ways? Do you have concrete ideas on how to turn those insights or capabilities into enough money to justify the expense of migrating to Cloudera or a similar system? Don't do it just because it sounds like the right thing to do. Make sure your big data will pay for itself; if it won't, maybe that data isn't even worth storing in the first place. Big data is not some magical panacea that makes your company more competitive unless you have ideas on how to translate that data into new business. Understand the true costs of building out a big data platform before you undertake such an effort. If you're going to host your own hardware, what will be the fixed costs of that hardware up front, the data center that hosts it, and the networking infrastructure that supports it? If you're going to lease the hardware from a cloud provider, how much will that cost? Remember, you pay not only for the hardware but also for CPU time and network bandwidth as well. Other variable expenses may arise from new types of technical specialists you might not have on staff today, or from the need to retrain your existing staff. Will you need to hire Hadoop administrators and architects? Will you need to create an entire new department to manage and grant access to a centralized cluster for your entire organization? Will you need specialists embedded within each team or department that manages their own dedicated clusters? There are lots of organizational considerations to be had, and they come at different costs. Any big technology migration is also going to involve costs in terms of time and morale. Some of your employees will be excited to work with this new technology.
But others may have specialized skills with your legacy systems and will be worried about their job security. Depending on how you divide up access to your clusters internally, some internal reorganization may be needed as well, and that always comes at a morale cost, too. I'm not trying to scare you off here, but I want to stress that migrating from vertically scaled data systems to horizontally scaled big data platforms can be a very painful and expensive process. Even when it's done, you'll be left with a very complex system that will present ongoing maintenance challenges. Don't undertake the effort lightly. However, many of the businesses of today and tomorrow couldn't exist without big data and the ability to process it. My point is that if you need big data, you probably know it already. But don't switch technology platforms just because it sounds cool; there are many businesses that can operate just fine without it for the foreseeable future.

10. Conclusion: That's it for this quick executive briefing on big data. In review, we've covered what we mean by big data and some of the real-world applications it has enabled. We took a high-level look at the main vendors in this space and at the underlying open source technologies that they are packaging and supporting. We discussed the main decisions you'll have to make when deploying a big data platform, and how to think critically about whether you really need one at all. We packed a lot into this little course, and I hope you've come away having learned a thing or three. Again, my name is Frank Kane, and thank you for spending your valuable time with me.

11. Let's Stay in Touch: Congratulations again on completing this course. It's just one of many that I offer in the fields of AI and big data, and I hope you want to continue your learning journey with me. If you click on my name on the course's main page, you'll see the other courses I offer. There's no fixed order you need to take them in; just go with wherever your curiosity and interest lead you. If you'd like to stay in touch with me outside of Skillshare, my website is sundog-education.com. From there, you can subscribe to our mailing list to be the first to know about new courses and the latest news in the fields of AI and big data. Links to follow us on social media are there as well, and I've also written a few books that you can explore at our website, if you want something a bit more permanent than online videos. Again, congrats on completing a challenging course, and I hope to see you again soon.