Apache Spark 3 with Scala: Hands On with Big Data! | Frank Kane | Skillshare

Apache Spark 3 with Scala: Hands On with Big Data!

Frank Kane, Machine Learning & Big Data, ex-Amazon

Lessons in This Class

  • 1. Introduction (1:28)
  • 2. Installing the Course Materials (13:54)
  • 3. Introduction to Apache Spark (14:26)
  • 4. [Activity] Scala Basics (25:58)
  • 5. [Exercise] Flow Control in Scala (9:28)
  • 6. [Exercise] Functions in Scala (9:08)
  • 7. [Exercise] Data Structures in Scala (22:28)
  • 8. The Resilient Distributed Dataset (11:30)
  • 9. Ratings Histogram Example (11:27)
  • 10. Spark Internals (1:59)
  • 11. Key / Value RDD's, and the Average Friends by Age Example (10:42)
  • 12. [Activity] Running the Average Friends by Age Example (4:51)
  • 13. Filtering RDD's, and the Minimum Temperature Example (5:54)
  • 14. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum (11:35)
  • 15. [Activity] Counting Word Occurrences using Flatmap() (5:46)
  • 16. [Activity] Improving the Word Count Script with Regular Expressions (3:44)
  • 17. [Activity] Sorting the Word Count Results (6:35)
  • 18. [Exercise] Find the Total Amount Spent by Customer (4:30)
  • 19. [Exercise] Check your Results, and Sort Them by Total Amount Spent (5:09)
  • 20. Check Your Results and Implementation Against Mine (3:00)
  • 21. Introduction to SparkSQL (9:44)
  • 22. [Activity] Using SparkSQL (7:05)
  • 23. [Activity] Using DataSets (8:33)
  • 24. [Exercise] Implement the "Friends by Age" example using DataSets (2:40)
  • 25. Exercise Solution: Friends by Age, with Datasets. (7:22)
  • 26. [Activity] Word Count example, using Datasets (10:37)
  • 27. [Activity] Revisiting the Minimum Temperature example, with Datasets (9:00)
  • 28. [Exercise] Implement the "Total Spent by Customer" problem with Datasets (2:10)
  • 29. Exercise Solution: Total Spent by Customer with Datasets (6:28)
  • 30. [Activity] Find the Most Popular Movie (5:24)
  • 31. [Activity] Use Broadcast Variables to Display Movie Names (11:19)
  • 32. [Activity] Find the Most Popular Superhero in a Social Graph (12:18)
  • 33. [Exercise] Find the Most Obscure Superheroes (5:14)
  • 34. Exercise Solution: Find the Most Obscure Superheroes (6:44)
  • 35. Superhero Degrees of Separation: Introducing Breadth-First Search (7:14)
  • 36. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark (7:59)
  • 37. [Activity] Superhero Degrees of Separation: Review the code, and run it! (12:55)
  • 38. Item-Based Collaborative Filtering in Spark, cache(), and persist() (7:59)
  • 39. [Activity] Running the Similar Movies Script using Spark's Cluster Manager (14:48)
  • 40. [Exercise] Improve the Quality of Similar Movies (3:54)
  • 41. [Activity] Using spark-submit to run Spark driver scripts (11:43)
  • 42. [Activity] Packaging driver scripts with SBT (15:06)
  • 43. [Exercise] Package a Script with SBT and Run it Locally with spark-submit (2:04)
  • 44. Exercise solution: Using SBT and spark-submit (9:04)
  • 45. Introducing Amazon Elastic MapReduce (7:11)
  • 46. Creating Similar Movies from One Million Ratings on EMR (11:33)
  • 47. Partitioning (4:18)
  • 48. Best Practices for Running on a Cluster (6:25)
  • 49. Troubleshooting, and Managing Dependencies (10:59)
  • 50. Introducing MLLib (9:55)
  • 51. [Activity] Using MLLib to Produce Movie Recommendations (12:42)
  • 52. Linear Regression with MLLib (6:58)
  • 53. [Activity] Running a Linear Regression with Spark (7:47)
  • 54. [Exercise] Predict Real Estate Values with Decision Trees in Spark (4:56)
  • 55. Exercise Solution: Predicting Real Estate with Decision Trees in Spark (5:47)
  • 56. The DStream API for Spark Streaming (11:28)
  • 57. [Activity] Real-time Monitoring of the Most Popular Hashtags on Twitter (8:51)
  • 58. Structured Streaming (4:03)
  • 59. [Activity] Using Structured Streaming for real-time log analysis (5:33)
  • 60. [Exercise] Windowed Operations with Structured Streaming (6:04)
  • 61. Exercise Solution: Top URL's in a 30-second Window (5:44)
  • 62. GraphX, Pregel, and Breadth-First-Search with Pregel. (6:51)
  • 63. Using the Pregel API with Spark GraphX (4:29)
  • 64. [Activity] Superhero Degrees of Separation using GraphX (7:07)


617 Students

About This Class

New! Updated for Spark 3.0!

"Big data" analysis is a hot and highly valuable skill, and this course will teach you the hottest technology in big data: Apache Spark. Employers including Amazon, eBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think, and you'll be learning from an ex-engineer and senior manager from Amazon and IMDb.

Spark works best when using the Scala programming language, and this course includes a crash-course in Scala to get you up to speed quickly. For those more familiar with Python however, a Python version of this class is also available: "Taming Big Data with Apache Spark and Python - Hands On".

Learn and master the art of framing data analysis problems as Spark problems through over 20 hands-on examples, and then scale them up to run on cloud computing services in this course.

  • Learn the concepts of Spark's Resilient Distributed Datasets

  • Get a crash course in the Scala programming language

  • Develop and run Spark jobs quickly using Scala

  • Translate complex analysis problems into iterative or multi-stage Spark scripts

  • Scale up to larger data sets using Amazon's Elastic MapReduce service

  • Understand how Hadoop YARN distributes Spark across computing clusters

  • Practice using other Spark technologies, like Spark SQL, DataFrames, DataSets, Spark Streaming, and GraphX

By the end of this course, you'll be running code that analyzes gigabytes worth of information – in the cloud – in a matter of minutes. 

We'll have some fun along the way. You'll get warmed up with some simple examples of using Spark to analyze movie ratings data and text in a book. Once you've got the basics under your belt, we'll move to some more complex and interesting tasks. We'll use a million movie ratings to find movies that are similar to each other, and you might even discover some new movies you might like in the process! We'll analyze a social graph of superheroes, learn who the most "popular" superhero is, and develop a system to find "degrees of separation" between superheroes. Are all Marvel superheroes within a few degrees of being connected to Spider-Man? You'll find the answer.

This course is very hands-on; you'll spend most of your time following along with the instructor as we write, analyze, and run real code together – both on your own system, and in the cloud using Amazon's Elastic MapReduce service. 7.5 hours of video content is included, with over 20 real examples of increasing complexity you can build, run and study yourself. Move through them at your own pace, on your own schedule. The course wraps up with an overview of other Spark-based technologies, including Spark SQL, Spark Streaming, and GraphX.

Enroll now, and enjoy the course!

"I studied Spark for the first time using Frank's course "Apache Spark 2 with Scala - Hands On with Big Data!". It was a great starting point for me,  gaining knowledge in Scala and most importantly practical examples of Spark applications. It gave me an understanding of all the relevant Spark core concepts,  RDDs, Dataframes & Datasets, Spark Streaming, AWS EMR. Within a few months of completion, I used the knowledge gained from the course to propose in my current company to  work primarily on Spark applications. Since then I have continued to work with Spark. I would highly recommend any of Franks courses as he simplifies concepts well and his teaching manner is easy to follow and continue with!  " - Joey Faherty

Meet Your Teacher


Frank Kane

Machine Learning & Big Data, ex-Amazon

Teacher

Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.


Level: Beginner


Transcripts

1. Introduction: I spent over nine years at Amazon.com and IMDb.com making sense of their massive datasets, and I want to teach you about the most powerful technology I know for wrangling big data in the cloud today. That's Apache Spark. Using the Scala programming language, Spark can run on a Hadoop cluster to spread out massive data analysis and machine learning tasks in the cloud, and knowing how to do that is a very hot skill to have right now. We'll start off with a crash course in the Scala programming language. Don't worry, it's pretty easy to pick up as long as you've done some programming or scripting before. We'll start with some simple examples, but work our way up to more complicated and interesting examples using real massive datasets. By the end of this course, you will have gone hands-on with over 15 real examples, and you'll be comfortable with writing, debugging, and running your own Spark applications using Scala. Some of them are pretty fun. We'll look at a social network of superheroes and use that data to figure out who is the Kevin Bacon of the superhero universe. We'll also look at a million movie ratings from real people and construct a real movie recommendation engine that runs on a Spark cluster in the cloud using Amazon's Elastic MapReduce service. We'll also do some big machine learning tasks using Spark's MLlib library, and we'll do some graph analysis using Spark's GraphX library. So give it a try with me. I think you'll be surprised at how just a few lines of code can kick off a massive, complex data analysis job on a cluster using Spark. So let's get started. The first thing we need to do is install the software we need, so let's get that out of the way right now.

2. Installing the Course Materials: So let's get everything set up, including Java, IntelliJ, and all the course materials that we need for the entire course.
What we're going to do is start by going to our own website, which will direct you to the course materials where you can download all the project files and the data that you need for this course. We'll go ahead and get that installed on your system. Then we'll install a Java Development Kit if you don't already have one; we just need to make sure that we have a JDK between versions 8 and 14 installed on your system. Odds are, if you're a developer, you already do. After that, we'll install the IntelliJ IDEA Community Edition. It's a free development environment that we can use for Scala and for Spark, and the beauty of it is that it also integrates something called sbt, so it's going to handle all the dirty work of actually installing Apache Spark for us. On Windows, we do have one extra step: we need to kind of fake out Windows into thinking that it's running Hadoop, and I'll show you how to do that. It's not too hard. And finally, we'll set up our project in IntelliJ and run a simple little Hello World program in Apache Spark to make sure it's all working. Let's dive in and I'll walk you through all of it.

So let's start by getting everything set up that you need for this course. Head on over to media.sundog-soft.com/SparkScala.html. Pay attention to capitalization; every little letter counts. You should reach this page here that contains everything you need to get going. Most importantly, let's install the course materials, all the scripts that you need to actually get through this course hands-on. Go ahead and click on this link here to media.sundog-soft.com/SparkScala/SparkScalaCourse.zip. If you are typing that in by hand for some reason, be sure to pay attention to capitalization. Once that downloads, we'll go ahead and unzip it. On Windows I can just right-click that and say Extract All.
On a Mac or Linux, of course, you would just go to a terminal prompt and use the unzip command. What we should get is a SparkScalaCourse folder within a SparkScalaCourse folder. That's correct; that's what we want. Within that second-level folder are all the materials themselves, the whole project for this course. So first of all, let's move this someplace where we're not going to lose it. I'm taking that top-level SparkScalaCourse folder and I'm going to move it to someplace safe. Let's put it on my C drive. All right, so now on my C drive I have a SparkScalaCourse folder, and within that is another SparkScalaCourse folder. On Mac or Linux, of course, you would not have a C drive. You would put it in your home directory, just someplace where you're not going to lose it.

All right, so next we need to get some test data. Unfortunately, the license terms of the data that I like to use don't let me include it myself, so you're going to have to go and download that yourself. That's the MovieLens dataset, a set of 100,000 movie ratings that we're going to play with throughout this course. You can use this handy-dandy link to get it: files.grouplens.org/datasets/movielens/ml-100k.zip. Go ahead and download that. If for some reason the grouplens.org website is down, which happens from time to time, you can usually find the ml-100k file on Kaggle if you need to. Let's go ahead and decompress that as well: right-click, Extract All. Again, just use the unzip command on Mac or Linux. The resulting ml-100k folder should contain this stuff. We're going to take that folder, at that level, and copy it. And I'm going to go back to the course materials folder that I just created, which for me was C:\SparkScalaCourse. In the other SparkScalaCourse directory under that, there's a data folder.
Go into the data folder, and that's where I want to put my ml-100k folder. All right, so this is how things should look at this point, whatever operating system you're on: you want a SparkScalaCourse folder. Within that, there should be another SparkScalaCourse folder. Within that should be a data folder. Within that should be an ml-100k folder, and within that should be all of this test data. Okay, so make sure that all looks right, or else you're going to run into weird problems and you're not going to know what's going on.

Once you're sure that's fine, let's go back to our instructions. The next step is to install IntelliJ IDEA, which is going to be our IDE for developing in this course. I used to tell people to install Eclipse and the Scala IDE, but it seems like IntelliJ is winning the battle there against Eclipse, so I'm going to have you install IntelliJ instead. Now, in order to run Scala code, you first need a JDK, and anything between versions 8 and 14 will do for this course. If you do need to get a JDK, there's a handy-dandy link here to do it. You can just head on over to oracle.com/java and go ahead and get the JDK 14 download for your operating system. For me that's going to be Windows 64-bit. You will need to accept their terms and wait for that to download. Looks like there's a little security warning here. It's fine; I trust it. Once that comes down, we'll go ahead and install it. Obviously, on Linux or Mac you'll probably use an alternative means of getting Java. In fact, you probably already have Java installed if you're on Linux or Mac, so this is probably somewhat of a Windows-specific thing. We'll go ahead and go through the installer here. One thing on Windows is that when you run Linux code like Apache Spark on Windows, sometimes it gets confused when you have spaces in your path. So that space between "Program" and "Files" could actually be a problem.
Let's go ahead and change that just to be safe. I'm going to save this instead into a C:\jdk14 directory. We'll let that do its thing; it shouldn't take too long. And we're done. All right, we're done with that site, so back to our instructions.

Now we can install the IntelliJ IDEA Community Edition. That's going to be our actual IDE. Let's go ahead and click on that link. We want the Community edition, the free one, the open source one. We don't need the Ultimate one. So go ahead and download that for whatever your operating system is. You'll see they offer it for Windows, Mac, and Linux. This is about half a gig, so it will take a few seconds to come down. I'll come back when that's done. Right, the installer downloaded, so let's go ahead and kick it off. It's a pretty standard installer, so let's just walk through it. If you do want desktop shortcuts or file associations, you can do that; it's totally up to you. I'm just going to leave these as is, and that's fine too. It takes about a minute to install, so I'll just come back when that's done. All right, good. Let's go ahead and hit the Run button to actually launch it. I don't want to import any existing settings. It's personal preference whether you like a dark theme or a light theme. I like a light theme, so I'll select that; do whatever you want, though. And now we're going to install plugins. The only plugin that we really need is the Scala plugin, and unfortunately I'm not seeing it offered here, so I'm just going to move on. I'm not seeing it offered here either. If you do see the Scala plugin, though, go ahead and take the opportunity to install it. I did not, so I have to do it the hard way, which isn't that hard. I'm just going to click on the Configure button here from the welcome screen and select Plugins. From there I can find the Scala plugin. Let's go ahead and install that. Fine. All right, and we can restart the IDE to pick that up.
And now one more thing: we need to tell it which JDK to use. So go back to the Configure menu here and go to Structure for New Projects. If you need to select a JDK here, select the one that we just installed, which is going to be 14 for us, and hit OK.

Now there is one more step that we need to do that's only for Windows. If you're on Mac or Linux, you can ignore this next step. We need to sort of trick Windows into thinking that Hadoop is running on it, and doing that is a little bit clunky. The instructions are on your course materials page, under the Windows-only section; just follow them. Go ahead and create a C:\hadoop\bin directory. So I'm going to go to my C drive and create a new folder called hadoop. Within that hadoop folder, I'm going to create another folder called bin. Now I'm going to go back to my course materials, which are under SparkScalaCourse, and you'll see a winutils.exe file there. I'm going to copy that and paste it into C:\hadoop\bin. Next I need to set up a couple of environment variables. The easiest way to do that is to go to your Windows search bar down here and type in "environment variables"; just "env" is probably enough. Select "Edit the system environment variables", which will take you to the System control panel. From here, you can hit the Environment Variables button. We're going to create a new one called HADOOP_HOME, all caps, and the value will be C:\hadoop. We also need to edit our PATH environment variable. If you don't have one, you can make one, but you probably have one already. So I'm just going to edit the one that I have and add an additional path to it by double-clicking here and typing in %HADOOP_HOME%\bin. Hit OK, and OK again, and OK again. All right, so now we're ready to actually try to import the project for the course itself. Let's go back to IntelliJ.
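If the Windows-only step above goes wrong, the symptoms tend to show up later as confusing runtime errors, so a quick sanity check can help. The following is only a sketch in plain Scala, not part of the course materials; the object name `HadoopSetupCheck` and its `problems` helper are hypothetical names chosen for illustration. It assumes the setup described above: a HADOOP_HOME variable pointing at a folder whose bin subfolder holds winutils.exe.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical helper (not from the course): checks the Windows Hadoop shim setup.
object HadoopSetupCheck {

  // Returns human-readable problems; an empty list means the setup looks fine.
  def problems(hadoopHome: Option[String]): List[String] = hadoopHome match {
    case None =>
      List("HADOOP_HOME is not set")
    case Some(home) =>
      // The transcript places winutils.exe in the bin folder under HADOOP_HOME.
      val winutils = Paths.get(home, "bin", "winutils.exe")
      if (Files.exists(winutils)) Nil
      else List(s"winutils.exe not found at $winutils")
  }

  def main(args: Array[String]): Unit = {
    val found = problems(sys.env.get("HADOOP_HOME"))
    if (found.isEmpty) println("Hadoop shim looks OK")
    else found.foreach(p => println(s"Problem: $p"))
  }
}
```

Remember that a program only sees environment variables as they were when it started, which is why restarting the IDE after changing them matters.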
Now, before we load up that project, it's always a good idea to restart your applications after changing environment variables. So if you are on Windows, go ahead and close out of IntelliJ IDEA and restart it. It should be in your Start menu. Now we're going to click on Open or Import, and we want to navigate to the course materials folder, SparkScalaCourse, and the SparkScalaCourse folder inside of that. That's our actual project for the course itself. Hit OK and let it do its thing. It's going to automatically try to make sense out of what's in that folder. I don't want the tip. If we're lucky, it will all just work. Okay, I don't see anything too alarming, so let's hit the build icon just to make sure it built successfully. That's this little hammer icon up here. You'll also find it in the Build menu if you prefer. And it looks like it worked.

So let's give it a shot. Let's go ahead and open up the SparkScalaCourse folder here for the project, and then open up the src folder. Under that, open up main, then scala, then com.sundogsoftware.spark. These are all the scripts for the course. All we have to do is pick one and see if it works. I included a very simple HelloWorld script, so let's double-click on that. You can see here it's not doing a lot, but it is actually using Apache Spark, so it will verify that you have everything configured properly and set up the right way. It's just going to set up a SparkContext and load up the data file inside the MovieLens dataset that we installed earlier, so this will also make sure you have that in the right place. All it's going to do is spin up an Apache Spark job to count the number of lines in that file. That's a very complicated way of doing that, but it will verify that it works. When it's done, it will print out "Hello world" and say that the data file has, hopefully, 100,000 lines, because that is the 100K dataset. So let's see if it works.
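To make that description concrete, here is a minimal sketch of what such a line-counting job can look like. It is an approximation written for illustration, not the course's actual HelloWorld file: the object name `HelloWorldSketch` and the `countLines` helper are hypothetical, and the path data/ml-100k/u.data assumes that the ratings file of the MovieLens 100K download sits where the setup steps put it. It needs the Spark core library on the classpath, which the course's sbt project is described as providing.

```scala
import org.apache.spark.SparkContext

// Illustrative sketch only; not the HelloWorld script shipped with the course.
object HelloWorldSketch {

  // Runs a Spark job that counts the lines of a text file.
  def countLines(sc: SparkContext, path: String): Long =
    sc.textFile(path).count()

  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this process, using every core on this machine.
    val sc = new SparkContext("local[*]", "HelloWorldSketch")
    sc.setLogLevel("ERROR") // keep the console output readable
    try {
      // Assumed location of the ratings file, relative to the project folder.
      val numLines = countLines(sc, "data/ml-100k/u.data")
      println(s"Hello world! The data file has $numLines lines.")
    } finally {
      sc.stop()
    }
  }
}
```

If the data folder is in the wrong place, a job like this fails with an error saying the input path does not exist, which is a useful hint about which setup step to revisit.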
Just right-click on HelloWorld and say Run 'HelloWorld'. At this point there's a really good chance that you're going to get a "class not found" error. If you do, it's just a bug in IntelliJ. If you quit IntelliJ and relaunch it, that should clear it up, and that should kick off Spark. You'll see a few warnings; they're safe to ignore, though. And now it's actually running, and it worked. So there you have it: "Hello world", and the data file has 100,000 lines. If you see that, congratulations, you set up Spark, Scala, IntelliJ, and Java all successfully, and everything's working. Now all we have to do is go through the rest of these scripts throughout the rest of the course, talk about what they do, and learn along the way. If you did not see that output, though, go back; you probably missed a little step somewhere. There's always some little thing, so feel free to post in the Q&A or comments of this course to get help if you need it. But hopefully that'll work for you and we can move on and start learning.

3. Introduction to Apache Spark: So let me introduce you to Apache Spark at a high level and talk about how it works and what it's for, real quick. The official description of Spark is that it's a fast and general engine for large-scale data processing, and that's a pretty good description. Basically, the idea is that you can write a potentially very simple script that describes how you want to transform or analyze a huge amount of data, and Spark will figure out how to distribute that work across an entire cluster of computers for you. So it's an engine for figuring out how to parallelize the processing of your data. You just tell it what you want to do. You know, do I want to take in a bunch of log files, extract some information, and put it somewhere else? Fine.
Spark will go and figure out how to do that across your entire cluster and make that happen as quickly as possible, using the resources of tens or even hundreds of individual machines to do it. So the key to this is its scalability. Again, you just write one single driver program, as we call it. It's just a simple script, written in Scala, Python, or Java, that tells Spark what you want to do to your data. It's then Spark's problem to figure out how to parallelize that and make it scale out across an entire fleet of computers. So the key insight here is that you're not limited to the computing power of one machine. With Apache Spark, you can take a massive dataset that you couldn't hope to process on a single PC and distribute that processing across an entire fleet of computers, in parallel, at the same time. This sort of divide-and-conquer approach is how we can process massive datasets and handle what we call big data.

From an architecture standpoint, your driver program is just something you write. Like I said, it's potentially a very simple script. That gets handed off to a cluster manager of some sort. You need some sort of system that orchestrates your entire cluster of computers, and that might be a Hadoop cluster, in which case Hadoop's YARN cluster manager would be coming into play. It's going to worry about how to spin up the resources you need, how to distribute that work, and where to put the different jobs in the most optimal place. It thinks about things like: how do I run the code in the place where the data is most accessible? So if my data, for example, is split up on a distributed file system, the cluster manager might say, okay, I'm going to run the code that processes that chunk of the data on that same machine, to make it run even faster. However, you don't need to use Hadoop. Spark has its own built-in cluster manager as well.
So if you just want to run Spark in a standalone environment, you can do that too: just install Spark on every machine in your cluster and configure it properly, and it will just work. So Spark can run on its own. It doesn't necessarily need to run on top of Hadoop, although it can. Sometimes you'll want to run other Hadoop applications on the same cluster and set up more complex pipelines of operations, so there can be advantages to running on top of Hadoop, but you don't have to. On the individual machines, or nodes as we call them, you'll be running different executors. Every executor process that gets distributed throughout your cluster has its own cache and its own tasks that are trying to operate on your data. You can see from all the arrows here that pretty much everything is talking to everything else. Your driver program sends out commands to the cluster manager and also directly to the executors when needed. The executors are talking to each other and synchronizing amongst themselves. And of course, the cluster manager is talking to all of those executor processes as well, trying to orchestrate what gets run where, and then collating those results back together to give you your final result when it's all done. So that's the Spark architecture at a very high level.

Why is Spark so popular? Well, Spark has pretty much replaced Hadoop MapReduce, because it can be up to 100 times faster when it's running in memory. So if you have enough memory in your cluster, that's a realistic approximation. If you're actually reading data directly from disk, it will still be about 10 times faster. Why is it so much faster than MapReduce? Well, it's because of what we call a directed acyclic graph engine, or DAG engine. Basically, it's going to look at the workflow that you've described in your driver script and optimize it for you automatically.
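One way to see the workflow-graph idea is to notice that a driver script first only describes its transformations, and Spark does not run anything until a result is requested. The following is a small sketch under stated assumptions, not code from the course: the object name `LazyPipelineSketch` and its helpers are hypothetical, and it runs in local mode rather than on a real cluster.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Illustrative sketch: transformations build up a plan, an action executes it.
object LazyPipelineSketch {

  // Only describes the work (filter, then map); no job runs when this returns.
  def evenSquares(sc: SparkContext, upTo: Int): RDD[Long] =
    sc.parallelize(1 to upTo)
      .filter(_ % 2 == 0)
      .map(n => n.toLong * n)

  // fold is an action: this is the point where Spark plans and runs the job.
  def sumOfEvenSquares(sc: SparkContext, upTo: Int): Long =
    evenSquares(sc, upTo).fold(0L)(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "LazyPipelineSketch")
    sc.setLogLevel("ERROR")
    try {
      // toDebugString prints the lineage Spark has recorded for this RDD.
      println(evenSquares(sc, 1000).toDebugString)
      println(s"Sum of even squares: ${sumOfEvenSquares(sc, 1000)}")
    } finally {
      sc.stop()
    }
  }
}
```

Because the filter and the map do not require moving data between machines, Spark can run them together in a single pass over each partition, which is one small instance of the kind of workflow optimization being described.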
In contrast, in MapReduce you're kind of wedged into a single way of thinking about processing data. You have to explicitly map all of your data in parallel and then define some way of reducing that data back into a final answer. With the DAG engine, though, Spark can be a little bit more flexible. It can organize that workflow in a more complex and potentially more optimal manner. And because Spark is memory-based, that gives it a very big advantage as well. So it's fast, and it's also very easy to use.

It's also hot; it's very widely used. This is actually a very old list of people using Spark, and there are many, many more people using it now. But the point of this slide is just to show you that it's proven technology. It's being used by very large corporations. It's a very mature technology; it's been out for a while. You know, the new features in Spark are kind of slowing down a little bit, at least in the open source world, and that's okay, because it pretty much does everything you need it to do, and it does it quite reliably at this point. So for distributed data processing, Spark is a mature technology and it's very widely adopted.

It's also not that hard. You have your choice of writing your code in Python, Java, or Scala. Obviously in this course we're going to focus on Scala, and we'll talk about why in a moment. But it's easy to use if you know SQL, Structured Query Language. That's the same language you use for interfacing with any relational database. You're going to feel right at home, because Spark has features called Datasets and DataFrames that operate very similarly to SQL statements, and you can even give it SQL commands directly through a feature called Spark SQL. So if you know SQL, you can use Spark. It's just that easy. But not everything is a SQL problem. Not every data analysis or transformation can be defined through a SQL command.
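For the problems that do fit SQL, a sketch of the idea looks like the following. It is an illustration under stated assumptions rather than course code: the `Person` record, the sample rows, and the `averageFriendsByAge` helper are all hypothetical, and it assumes the Spark SQL library is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative sketch: querying a Dataset with an ordinary SQL statement.
object SparkSqlSketch {

  // Hypothetical record type, made up for this example.
  case class Person(name: String, age: Int, numFriends: Int)

  def averageFriendsByAge(spark: SparkSession, people: Seq[Person]): DataFrame = {
    import spark.implicits._ // brings in toDS() and the encoder for Person
    // Expose the Dataset under a table name so plain SQL can refer to it.
    people.toDS().createOrReplaceTempView("people")
    spark.sql(
      "SELECT age, AVG(numFriends) AS avgFriends FROM people GROUP BY age ORDER BY age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]") // local mode; a real cluster would use a different master
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    try {
      val people = Seq(Person("Ana", 30, 10), Person("Ben", 30, 20), Person("Cy", 41, 5))
      averageFriendsByAge(spark, people).show()
    } finally {
      spark.stop()
    }
  }
}
```

The same query could be expressed with Dataset methods such as groupBy and agg instead of a SQL string; which form reads better is largely a matter of taste.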
And if you do want to get to a lower-level API, that's available too. The original API for Spark is called the Resilient Distributed Dataset, or the RDD for short. We're going to go into a lot more depth about how that works in a moment here. But with the RDD API, you can get lower level, and sometimes that can give you even better performance. And it also gives you more flexibility in what you can do. But for most common data transformation or analysis tasks, you can probably define that as a SQL command. And more often than not, you'll be using Datasets, DataFrames, or the Spark SQL API. From a software architecture standpoint, this is how Spark is laid out, and this goes back to Spark's original architecture. The lines have kind of blurred on a few of these in recent years. But at its core is, well, Spark Core, and that's where RDDs live and the like, right? So that's kind of like the underlying engine of Spark itself. And you can go directly to Spark Core. We'll see that in action. But there are these other higher-level APIs built on top of Spark Core as well to make your life easier for specific tasks. One is Spark Streaming. That is obviously a very powerful technology for ingesting data in real time or near real time. You could imagine, for example, having a fleet of servers out there running your website that are feeding data into Spark through Spark Streaming from their log files continuously. And in real time, Spark can monitor that data, look at a window of that data over time, give you analytics over that window of time, and take some action based on it. Simple example: let's say you want to have some sort of an alarm on 500 errors on your website. You can have a Spark Streaming system set up so that the logs are being streamed into Apache Spark. And in real time it's counting up how many 500 errors there are over the past hour, the past minute, whatever you want to monitor, and it takes some action if that exceeds some threshold. 
And obviously much more complex operations are available as well. Maybe a more common application would be transforming that log data and putting it somewhere else. So I could have a Spark Streaming process that ingests data from my logs and transforms it into some format that maybe Elasticsearch wants to see, or something like that. We also have Spark SQL, and that exists to let you interact with Spark through SQL commands, so you can treat Spark just like a giant database that's distributed in nature. So if you can define your data in terms of a table structure, which you usually can, and you can define the problem you want to solve in terms of a SQL command, which you probably can, you can just use Spark SQL to define what you want it to do. And Spark will figure out how to parallelize that across an entire fleet of computers. So that's really exciting, right? It gives you all the flexibility of a relational database, but you're not limited to one machine anymore. You can actually horizontally scale that database. Now, it used to be you had to choose between NoSQL databases if you wanted distributed computing, and a big monolithic relational database if you didn't. And there are still some limitations here. Mind you, doing big joins is still not going to be very efficient in a horizontally partitioned server environment. But you can do it if you want to, right? So it's kind of the best of both worlds. Now, like I mentioned, the lines are blurred in some cases here. So these more modern APIs that we're going to look at later in Spark, using DataFrames and Datasets, are also very similar to SQL in their structure and how they're used. So, you know, do you consider DataFrames and Datasets part of Spark SQL or Spark Core? Again, the lines are kind of blurry there, but the SQL-based interfaces are kind of becoming the predominant way of using Spark. We also have MLlib, Spark's machine learning library. 
And if you do want to do distributed machine learning on Apache Spark, you can do that too. It's a somewhat limited set of algorithms, although they're most of the ones that you would need in practice. So we'll look at that in a later section of the course as well. And that's really exciting, right? Because if you have machine learning that you want to run on a massive dataset, you're no longer limited to what you can do on a single machine. There are some algorithms that to this day, you know, are hard to scale out, but Spark has figured it out for many of the most popular machine learning algorithms that you might want to be using. And finally, there's GraphX. I don't want to talk about that too much. It's kind of fallen by the wayside. GraphX is not about, you know, charts and graphs or printing little lines and stuff like that. It's more graphs in the computer science sense. So we're talking about networks of information. For example, a social network where you have users that are connected to other users is a graph in that sense. And GraphX can do things like analyze those graphs of information, tell you attributes about them, and let you sort of iterate through them in a distributed manner. But GraphX has, again, kind of fallen by the wayside. It hasn't really been well maintained lately, and there are newer alternative APIs these days that are more popular. We'll talk about that more at the end of the course. In this course, we are using the Scala programming language. Why are we using Scala? That's kind of an obscure language, isn't it? Well, there are a few reasons. One is that Spark itself is written in Scala. So by writing your scripts in Scala, you're kind of getting closer to how Spark itself is written and optimized. So, you know, that can potentially lead to better performance. The other thing is that Scala is what we call a functional programming language. 
And as such, it's really a good fit for distributed processing. Scala really pushes you to write your code in such a way that your functions can be distributed across an entire cluster, whereas other languages like Java and Python don't really try to force you into that. So by writing your Spark driver scripts in Scala, you're more likely to be writing code that can be parallelized safely and easily. It also gives you fast performance. Scala compiles down to Java bytecode, so at the end of the day it's running on the JVM, the Java Virtual Machine, and that's pretty darn fast on most systems. Obviously, Java will also give you fast performance because that also compiles down to Java bytecode. But contrast that with writing your Spark scripts in Python, which you can do. You have to go through another layer there, right? That Python code has to go through an extra bridge to reach the JVM, where Spark itself runs, at the end of the day. So writing in Scala just gets you a little bit closer to that ultimate lower level where your code will actually be running. Now to be fair, Python is pretty darn fast in Spark these days. So the difference is not as big as it used to be, but there's still a small difference. The other advantage of Scala, if you want to put it up against Java, is that it's easier to use. So there's going to be a lot less code. You have to write a lot less boilerplate than you would if you were coding in Java. Java has a lot of overhead associated with it in terms of how you actually compile that code and distribute it and stuff like that. It's a lot easier in Scala, it turns out. And like I said, in comparison, Python is slower, but it's not as slow as it used to be. You're still going to get a little bit of an edge with Scala, but that speed gap has been closing over time. But what are the downsides to Scala? Well, one is that you might not know Scala yet. You know, it's not a very common language. 
So you're going to have to go learn the basics of how Scala works, but it's not as hard as you think. For example, let's take a look at this little snippet of code. We're doing the same thing here in Python and in Scala. We're just going to write some code to square the numbers in a dataset. Pretty simple stuff. So if you look at the Python version and the Scala version, they're not that different, right? Syntactically there are little things like, you know, in Scala you have to declare that it's an immutable constant that you're using by saying val. The syntax for defining a list of stuff is slightly different. The syntax for lambda functions is a little bit different, but it's the same idea, right? So Scala has kind of a weird syntax. Sometimes things can be a little bit backwards, and we're going to talk about that. Don't worry about it. But at the end of the day, it doesn't look that much different from Python code in the context of a Spark driver script. And with that, let's actually dive into a crash course in Scala, if you need it. In this next section, we're actually going to go into the basics of Scala. What's different about it? What's weird about it? I do expect that you're going to have some prior experience in writing code in some scripting or programming language. I'm not here to teach you how to program from scratch, guys. That would be a different course. But if you do have some Python under your belt, or C or Java or something, I think you can pick up Scala pretty quickly. So in this next section, if you need it, we have a little bit of an introduction to Scala that will demystify the syntax for you. And as we go through the course, you'll see lots and lots of examples of using Scala. And I think it will just sink in by looking at it enough and seeing enough examples. So let's dive into our Scala crash course if you need it. If you don't, feel free to skip the next section and we'll just dive right into how Spark works. 4. 
[Activity] Scala Basics: Hi, I'm Frank Kane and welcome to my office. We're going to start by doing a little crash course on the Scala programming language itself. Now obviously, if you're already familiar with Scala, you can skip this section and that's fine. But if you're new to Scala and you've had some programming experience before, you'll find this section very helpful for understanding the code that we're going to be looking at throughout this course. It's just enough to be dangerous, right? So don't expect a comprehensive introduction to Scala here in this section, but it's enough to get you through this course at least. Between the examples that we'll go through in this section and the examples later in the course, I think you'll end this course with a pretty good understanding of how Scala works and even how to write your own Scala code. However, if you're new to programming altogether, this isn't going to be enough for you. I would encourage you to go and find an introductory course on Scala that goes into more depth first and then come back to this one. But for the rest of you, let's plow ahead and learn Scala. All right, let's learn Scala. Just to set expectations: you're not going to be a Scala expert at the end of watching four videos with me. What I'm really trying to do here is just get you familiar with the syntax of the Scala programming language and introduce some of the basic constructs. How do I call a function in Scala? How does flow control work? What are some basic data structures I might use with Scala? I want to show you enough Scala code that it's not going to look scary and intimidating to you as we go through the rest of the course. So with that, let's talk about Scala at a high level first. First of all, why learn Scala? Well, you've probably never heard of it before. Maybe you have, but you probably don't know it. It's mostly used for Spark programming. 
But it is uniquely suited for Spark because it's really structured in a way that lends itself to doing distributed processing of data over a cluster. And you'll see why a little bit later on. It's also what Spark itself is built with. So by learning Scala, you'll get access to all the latest and greatest Spark features as they come out. And it can usually take a pretty long time for those features to trickle down to, say, Python support within Spark. And it's also going to be the most efficient way to run Spark code itself. So by using Scala, you will have the fastest and the most reliable Spark jobs that you can possibly create. And I think you'd be pretty surprised at just how much faster and how much more reliable a Spark job written in Scala is compared to, say, the same Spark job written in Python. So even though it might be tempting to stick with the language you already know, learning Scala is worth the effort, and it's really not that hard. Truth is, the same Spark code in Scala and in Python looks very similar at the end of the day. Now, Scala itself runs on top of the Java Virtual Machine. So it just compiles down to Java bytecode and gets run by the JVM. One nice thing about that is that you also have access to all of Java. If there's a Java library you want to pull into your Scala code, you can do that. So you're not limited to what's in the Scala language itself. You can actually reach down to the Java layer and pull up whatever Java you want to use too. And we'll do that later in this course, for example, for dealing with regular expressions in a little bit more of an intuitive manner than you could otherwise. Another key point about Scala is that it is focused around what is called functional programming, where functions are sort of the crux of what we're dealing with. Functions get passed to other functions and chained together in ways you might not be used to. But this is really how Spark works at a fundamental level. 
We basically take an abstraction over a chunk of data and we assign it a function to do some processing on that data. And functional programming in Scala makes that very intuitive to do from a language standpoint. All right, so let's just jump right into the deep end of the pool and sink or swim with Scala. We're just going to write some code, see what happens, and get your hands dirty. Now I didn't actually provide you with a copy of the code I'm going to be going through, because there's actually value in typing it yourself to make it sink in. So let's start by creating what's called a new Scala worksheet. This is going to give us an interactive environment where we can just sort of experiment with Scala code and evaluate it interactively. So go to your File menu in IntelliJ and say New Scala Worksheet. And we'll call this one LearningScala1. And in this first lecture, we're just going to talk about the syntax and structure of the Scala language, because it's a little bit weird compared to other languages out there. So first of all, if you want to write a comment, you can just do a double slash like you would in many other languages. And first we're going to talk about values. So values are immutable constants. So that's an example of a comment line there. Now in other languages we have the concept of variables. You know, it's a very universal thing in programming to assign some value to a named variable, right? And use that throughout your code. Now in Scala, there are two different kinds: one is called values, which are immutable, and the other is variables, which are mutable. And in Scala you want to stick with values as much as possible. So here's an example of how to define one. We could say val, for value, then hello: String = "Hola". And if you want to actually execute that, we can just hit the little play button here. Or you can see there's also a keyboard shortcut of Control Alt W that I'll use from now on. 
And it's creating an environment to execute that in right now, and there we have it. So you can see that it actually executed that command and assigned the value "Hola" to a String called hello. So let's spend some time talking about the syntax here because it's kind of backwards from many other languages, right? So we start off by saying this is a val. That means that we're defining an immutable constant. Once we actually define what hello is, we can't change it ever again. So we're going to call this value hello. And then after the colon, we declare what type it is. So we're saying that hello is a String type. So that's backwards from most other languages, right? Usually you'll see something like String hello, but in Scala it's hello: String. And then we assign it a value. Nothing too weird there, just the string "Hola" in quotation marks. Okay, so that makes sense, right? It's a little bit backwards, but you get used to it pretty quickly. Now let's talk about variables. So variables are mutable. That means you can actually change them after you've defined them. So to define a variable, it's the same thing. You just use var instead of val. So we can say var helloThere, which is also a String, and we'll set that to hello. We're going to assign it the value of the immutable constant hello. So I'm going to hit Control Alt W. And you can see there that helloThere has been assigned the string value "Hola" because that was stored in our immutable value hello. Makes sense so far. But helloThere is a variable, so we can change it now. It's not stuck being "Hola". So we could say something like helloThere = hello + " there" and Control Alt W. And you can see that we've actually modified helloThere to now contain the string "Hola there". So as you can see, variables can be changed; values, however, cannot. We could also use the println command to print out that value explicitly. 
println(helloThere), which does exactly what you would think: it prints out the value of that variable on a line. Control Alt W, and there it is. All right, so you've seen some basic stuff here. First of all, the concept of values and variables. Values, again, are immutable. Once you define them, you cannot change them. Whereas variables are mutable: you can change them after you define them. And also note the syntax here for declaring values and variables. It's val or var, the identifier's name, a colon, the type, and then equals whatever you want to set it to. Again, that's backwards from a lot of other languages. And just to show you what happens and to prove to you that values cannot be changed, let's change that var to a val and see what happens if we try to execute this again. Control Alt W. You can see that we've got some errors here now: reassignment to val. We can't say helloThere = hello + " there" because you cannot change a val. A value is immutable. Let's change that back to var and execute it again, Control Alt W. All right, so far so good, right? Now why do we have this distinction between values and variables in Scala? Well, it's because this is what we call a functional programming language, so Scala is kind of centered around the idea of passing functions around and potentially running them in parallel. That's why it's such a good match for Apache Spark. And the reason that we want to stick with immutable constants whenever we can is to avoid a bunch of thread safety issues and to sort of head them off at the pass. So imagine you have a function that has a variable in it that can change, and you pass that variable into many, many threads. What happens if one thread is trying to change that variable at the same time that another thread is trying to change it to something else? The results become undefined, right? So we avoid a lot of these race conditions by trying to use immutable constants whenever possible. 
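To make the val and var walkthrough above concrete, here is a minimal worksheet-style sketch. The camel-case names hello and helloThere are assumed spellings of the identifiers spoken in the lesson, and the commented-out line shows the kind of reassignment the compiler rejects.

```scala
// Worksheet-style sketch; identifier spellings are assumed from the spoken lesson.
val hello: String = "Hola"        // val: an immutable constant, cannot be reassigned
var helloThere: String = hello    // var: mutable, starts out holding "Hola"
helloThere = hello + " there"     // allowed, because helloThere is a var
println(helloThere)

// hello = "Adios"                // would not compile: reassignment to val
```

Uncommenting the last line is a quick way to see the "reassignment to val" error for yourself.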
If our functions are only acting on and processing immutable data, we don't have to worry about all of those thread-safety issues and race conditions. Now this doesn't limit you as much as you might think. You can still get a lot done just using values, immutable constants. So for example, if I did want to do the above operation and construct the string "Hola there" from the value hello, I can still do that using values. I could say something like val immutableHelloThere = hello + " there". Right? And I could print out that result. Control Alt W. And that works because I'm defining immutableHelloThere all on the same line. I'm taking a previously defined immutable value, adding in another immutable value, and assigning that to a new immutable value. So that's okay. With the variables, we did it a different way. We started off by setting helloThere to hello, and then in another operation, we added the string " there" to it. So as long as we do everything in one line, in one atomic operation, we're still sticking to the rule of using values whenever possible. All right, so we've seen the String data type here in action, right? There are many other data types available to you in Scala. Let's talk about data types. So for example, we could say val numberOne. Whoops, if I type right: numberOne: Int. So an Int is exactly what you think it is, an integer, a whole number. We could also say val truth is a Boolean, and we'll set that to true. Note that in Scala, the true and false constants are all lowercase. There are languages where you would capitalize True and False, but not in Scala. So that's just a Boolean value, true or false. We can also have characters. So we can say val letterA as a Char type, a single character (a 16-bit Unicode value in Scala). We also have double-precision numbers, of course. So we can say pi is a Double, and set that to 3.14159265 or whatever it is. 
And we can also represent a single-precision floating point value with val piSinglePrecision. We'll declare that as a Float, which is a single-precision floating point type, and we'll set that to 3.14159265f, where the f means single-precision floating point. Let's Control Alt W to see that what we have so far is working. So you can see that all of these values have been defined as expected. Let's keep going. There's also a Long data type. Let's say val bigNumber. We'll define that as a Long, and we'll set that to some big integer: 1, 2, 3, 4, 5, 6, 7, 8, 9, and on, whatever you want. We could also have a single-byte number. We can say val smallNumber, declare that as a Byte, equals 127. So a Byte is basically a number that's crammed into a single byte, so it can only represent numbers from negative 128 to positive 127. (If it were unsigned, that would be 0 to 255, but Scala's Byte is signed.) Control Alt W to execute those. Okay, looking good. Moving on, let's talk about how to actually print and display your data and format your output. That's always an important thing, right? So let's say we want to concatenate a bunch of strings together, or a bunch of values together, and print them out. Obviously, you want to be able to view the results of your Spark programs. So this is how you would do that. We could say println("Here is a mess: "). And the secret here is you can just use the plus operator to concatenate stuff together and print it out as one big string. And it doesn't just have to be strings either. It can be any data type. It will implicitly convert that to a string. So I can say "Here is a mess: " + numberOne + truth + letterA + pi + bigNumber. And that should work. Control Alt W. There it is. It crammed it all together because I didn't insert any spaces between everything, but it works. Note also, by the way, there is no semicolon or anything at the end of the lines here. It just assumes that every new line is a new command, basically. 
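The data types and the plus-operator concatenation described above can be sketched as follows. This is only an illustration: the identifier spellings and the exact digits of bigNumber are assumptions, since the lesson only says to pick some big integer.

```scala
// Worksheet-style sketch of the basic types discussed above (names are assumed spellings).
val numberOne: Int = 1
val truth: Boolean = true                   // true/false are lowercase in Scala
val letterA: Char = 'a'                     // single quotes for a Char
val pi: Double = 3.14159265
val piSinglePrecision: Float = 3.14159265f  // the f suffix marks a Float literal
val bigNumber: Long = 1234567890L           // the L suffix marks a Long literal
val smallNumber: Byte = 127                 // largest value a signed Byte can hold

// The + operator converts each value to a String and concatenates without spaces.
val mess: String = "Here is a mess: " + numberOne + truth + letterA + pi + bigNumber
println(mess)

// The signed Byte range can be checked directly.
println(Byte.MinValue)
println(Byte.MaxValue)
```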
So there's no need to explicitly terminate your lines of code in Scala. What if you want to do printf-style formatting? If you're coming from a background of C or C++, you might be familiar with the printf command, which allows you to put in formatting hints for how to actually display numerical data or insert strings and other data into an existing string. Here's how that looks in Scala. We can say println with an f prefix on the string, which means we want printf-style formatting: println(f"Pi is about $piSinglePrecision%.3f"). Okay, so let's break that down a little bit. First of all, note that we have that dollar sign there. That's indicating that we have a variable name, or a value name rather in this case, following that dollar sign. So $piSinglePrecision is going to insert the value of piSinglePrecision. And %.3f means that it's a floating point value that we're going to display there. That's what the f means. And the .3 means that after the decimal point, we only want three digits of precision displayed. So let's go ahead and hit Control Alt W. And you can see that it did exactly what we said: Pi is about 3.142. So it just displayed those three digits following the decimal point, because that's all we wanted. That can be handy if you are displaying a double-precision or even a single-precision number that has more digits of precision than you need. Let's also see how that works with integers. So for example, we could say println(f"Zero padding on the left: $numberOne%05d"), with the f again just indicating printf format. All right. So this time we're saying that we're going to insert the value of numberOne with %05d. The d just means that it's an integer, and the 05 means that I want it padded with leading zeros to a width of at least five digits. So let's see what that does. Control Alt W. 
And we get 00001, giving us the five digits of zero padding on the left that we wanted. That can be useful when you're trying to align output in columns, right? So that's also a handy trick sometimes. Also, if you just want to substitute variables into a string without actually specifying the formatting, that's easy to do as well. For example, println with an s prefix, s for substitute: println(s"I can use the s prefix to use variables like $numberOne $truth $letterA"). All right, Control Alt W. And it worked. So you can see it's a very easy way to just insert value names into your string there without having to use the concatenation operator unnecessarily. So it's just a different way of doing it. Okay, what else can we do? We can include expressions in our print commands. So let me show you how that works. println with the s prefix again: "The s prefix isn't limited to variables; I can include any expression, like" and then a dollar sign, open curly bracket, 1 + 2, close curly bracket. And it automatically completed that line for me. So the key here is those curly brackets. When you have a set of curly brackets after a dollar sign, it's actually going to evaluate the expression inside those curly brackets and print out the result of that as part of the string. Control Alt W. And if we scroll over, we should see that it printed out the number three. So that's a neat trick too, right? Those curly brackets after a dollar sign can actually evaluate an expression within a println command if you use the s prefix. There are also regular expressions. If you're familiar with those, that's a powerful tool for munging your string data. Let's see how that works. So let's start off with a string: val theUltimateAnswer. We'll define that as a String. Again, we're just getting back to how to define a value. And we'll set that to "To life, the universe, and everything is 42." And if you recognize that reference, then you too are a fan of The Hitchhiker's Guide to the Galaxy. Welcome. 
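As a recap of the f and s string prefixes covered above, here is a small worksheet-style sketch. The value names are assumed spellings of the spoken identifiers, and the intermediate names piText, padded, substituted, and evaluated are hypothetical, introduced only so the formatted strings can be inspected.

```scala
// Worksheet-style sketch of the f and s interpolators (intermediate names are hypothetical).
val numberOne: Int = 1
val truth: Boolean = true
val letterA: Char = 'a'
val piSinglePrecision: Float = 3.14159265f

// f prefix: printf-style format specifiers follow the interpolated value.
val piText = f"Pi is about $piSinglePrecision%.3f"         // three digits after the decimal point
val padded = f"Zero padding on the left: $numberOne%05d"   // zero-padded to a width of five

// s prefix: plain substitution, plus arbitrary expressions inside ${ ... }.
val substituted = s"I can use the s prefix to use variables like $numberOne $truth $letterA"
val evaluated = s"The s prefix isn't limited to variables, like ${1 + 2}"

println(piText)
println(padded)
println(substituted)
println(evaluated)
```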
All right, so we have this string, and what we want to do is write a regular expression that will extract the answer to the ultimate question of life, the universe, and everything. So we're going to set up a regular expression to extract that number from the end of that string. We can say val pattern equals triple quotes, one, two, three, and then we're going to write a regular expression to extract the information that we want from that string: dot star, a space, open parenthesis, open square bracket, backslash d, close square bracket, plus, close parenthesis, dot star, and then those three quotes again, dot r. That is, .* ([\d]+).* between the triple quotes, followed by .r. Okay, so the .r means this is a regular expression that we're defining here. And going over how regular expressions work is probably out of scope for this course, but they are a very useful tool for things like extracting information out of log files and things like that. A quick breakdown of what's going on here: the dot star means to match anything in that string, followed by a space. And then within the parentheses is the thing that we're trying to extract from that pattern. And the brackets and the backslash d mean I want to extract a digit, and one or more of them, that's what the plus sign means, followed by any other characters. Okay, so by looking for a bunch of characters followed by a space and then a number followed by anything else, that's going to pull that 42 out of that string. Let's hit Control Alt W just to make sure that's working. All right, cool. So let's take a look over here. We have that string we defined, "To life, the universe, and everything is 42." And now we've defined a matching regular expression object that consists of that regular expression. Okay, so that's what happened with this line here. Now to apply that regular expression to the string, it's very simple. We can just say val pattern(answerString) = theUltimateAnswer. All right, so the syntax there is a little bit weird. 
It means we're going to take the regular expression that we defined in pattern, and we're going to assign the output of that to answerString. So the syntax here is basically saying, I want to take what's in the parentheses of the regular expression and transfer that result to what's in these parentheses. Okay, that's a way of thinking about it. And we're going to feed theUltimateAnswer into that pattern. So the syntax there, again, is a little bit backwards from how you might be thinking about things in other languages. And then we can just print out what that answer is. First, let's convert it to an integer. We can say val answer = answerString.toInt. So that's just showing you how to actually convert one type to another. It would help if I typed answerString correctly. And then I can print that out: println(answer). All right, so we've defined a string, we've defined a regular expression to extract information from that string, and we've defined a statement to actually apply that regular expression to the string and store the result somewhere. We're calling that answerString. We've converted that from a string to an integer, and then we're going to print that integer out. Control Alt W. And it worked. So you can see that we extracted the string "42", we converted that to the integer 42, and printed it out when we were done. All right, moving on. Let's talk about Booleans. It's a really easy thing; they work exactly as you would expect. So for example, we could say val isGreater and set that equal to 1 > 2. What do you think that will come out to? Is one greater than two? No, the answer is false. So that works pretty much the way you would expect. We could say val isLesser = 1 < 2. That's true. We could say val impossible = isGreater & isLesser. And you can see that we can use a single ampersand there. That's fine. But we could also use a double ampersand. Let's see what those do. Control Alt W. 
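The regular-expression steps walked through above (define the string, define the pattern, apply it, convert to an integer) can be put together in one short sketch. The exact wording of the string and the identifier spellings are reconstructed from the spoken lesson and should be treated as assumptions.

```scala
// Worksheet-style sketch of the regular-expression extraction described above.
val theUltimateAnswer: String = "To life, the universe, and everything is 42."

// Triple quotes avoid having to escape the backslash; .r turns the String into a Regex.
val pattern = """.* ([\d]+).*""".r

// Pattern-matching assignment: the text captured by the parenthesized group
// is bound to answerString. The whole string has to match the pattern.
val pattern(answerString) = theUltimateAnswer

// Convert the extracted String to an Int.
val answer: Int = answerString.toInt
println(answer)
```

One caveat worth knowing: this val pattern(...) form throws a MatchError at runtime if the string does not match, so it fits best when you are sure of the input's shape.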
So these actually aren't quite the same thing. In C or C++, a double ampersand is a logical AND, whereas a single ampersand is a bitwise AND. In Scala, a single ampersand between two Booleans is still a logical AND, but it always evaluates both sides, whereas the double ampersand short-circuits: it skips the right-hand side when the left-hand side is already false. So if you're doing a logical operation, which is what we're really trying to do here, you should be using the double ampersand. They do give you the same result in this case, but it's really better form to use the double ampersand there. It works the same way with or. If you want to say isGreater || isLesser, that works. That would be true, presumably. Yep. So you get the idea. Booleans work pretty much as you would expect from other languages. Let's play with it some more. We could say val picard, which is a String, and set that equal to "Picard". And we can say val bestCaptain, another value just with a different name, also a String, equals "Picard". And we can say val isBest, declare that as a Boolean, and we'll set that to picard == bestCaptain. So let's go ahead and execute that, and Control Alt W does what you might think. So the equals-equals here is actually going inside those strings and comparing the values of the strings. It's not comparing the objects themselves or the addresses of the objects. It's actually going into each string and comparing their contents to see if they're identical. So if you want to compare two strings, just use the equals-equals operator. That can be something that's kind of weird in some languages, so it's important to point that out here. And if you do want to debate in the Q&A whether Picard is the best captain, I welcome that discussion. All right. 
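Here is a small sketch of the Boolean operators and string comparison discussed above. The helper rightSide and the counter evaluations are hypothetical, added only to make the difference between the single and double ampersand observable; rebuilt is likewise a hypothetical name used to show that == compares contents rather than object identity.

```scala
// Worksheet-style sketch; rightSide, evaluations, and rebuilt are hypothetical helpers.
val isGreater = 1 > 2                       // false
val isLesser = 1 < 2                        // true
val impossible = isGreater & isLesser       // false
val anotherWay = isGreater && isLesser      // false, same result here
val either = isGreater || isLesser          // true

// Count how often the right-hand side actually gets evaluated.
var evaluations = 0
def rightSide(): Boolean = { evaluations += 1; true }

val viaSingle = isGreater & rightSide()     // & evaluates both sides
val viaDouble = isGreater && rightSide()    // && skips rightSide() because the left is false
println(evaluations)

// == on Strings compares contents; eq compares object identity.
val picard: String = "Picard"
val bestCaptain: String = "Picard"
val isBest: Boolean = picard == bestCaptain
val rebuilt = new String("Picard")          // a distinct object with the same contents
println(picard == rebuilt)
println(picard eq rebuilt)
```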
If you want to get your hands dirty, play around with this stuff, do some more stuff, keep going, guys. So for example, you could write some code that takes the value of pi, doubles it, and then prints it within a string with three decimal places of precision to the right. Okay, so actually I'm going to paste that in here as a little challenge for you. And it's really easy. I'm not even going to give you the answer. You guys can talk in the Q&A if you want to discuss the actual solution. But there's my challenge to you. Go and apply what you just learned and do that. Just write a little snippet of code that takes the value of pi we defined, multiplies it by two, and prints it out within a string with three decimal places of precision to the right. Everything you need to accomplish that should be above you here. So it's a simple little way to get some hands-on exposure. Get your hands dirty, play around a little bit more, and then we'll move on to the next chapter of learning Scala.

5. [Exercise] Flow Control in Scala: So moving on, if you want to save what you've done so far, you can just hit Control-S and close that out, just to keep it around for future reference if you want. And let's make another Scala notebook, or Scala worksheet rather. And we'll call this one the creative name of LearningScala2. And this time we're going to talk about flow control. So let's see how if/else statements work. They work exactly the same way as they do in other languages. There's nothing too weird here. For example, we could say if (1 > 3) println("Impossible!") else println("The world makes sense."). So exactly like every other language there: if some expression is true, you do this expression, else you do some other expression. And if you want to do it all on one line, that's what that looks like. So Control-Alt-W: the world makes sense, as it should. Now if you want to split that up into multiple lines, the syntax again is pretty familiar to you.
At least to those of you who have programmed before. We could just say if (1 > 3), curly bracket, and this allows us to put multiple expressions in that positive case. So we could say println("Impossible!"). And if we wanted to, we could print something else again: "Really?" Then else, do some other thing. We can say println("The world makes sense.") and "still." I don't know, I'm making this up. Control-Alt-W: the world makes sense, still. Cool. So nothing too surprising there. It's very similar to other languages. Now, you know how we can have switch statements in some languages, where you have matching between different cases? Well, what does that look like in Scala? Let's look at an example. We could say val number = 3, assigning the number 3 to a value named number. And we can say number match, curly bracket, case 1, little weird arrow there (equals sign, greater than), println("One"); case 2, you can see where I'm going with this, println("Two"); case 3, println("Three"). And then we can say case underscore, println("Something else"). You can probably guess what that does. Control-Alt-W. We get the value "Three" printed out. So what's going on here is that we have this match statement where we can have a list of different cases that we're going to check against the value number. If it's equal to 1, we'll execute that expression. If it's equal to 2, we execute that expression. If it's equal to 3, we execute that expression, which is what it turns out to be. And this underscore is kind of like a catch-all. It's like the default statement, where any other case that we didn't match is going to hit that instead. So for example, we could set this to 30, and that should hit that final catch-all statement, right? Control-Alt-W. Sure enough, we get "Something else", and we can change that to 2 to get "Two". So that's how a match statement works. You don't see that too often in Scala, but it's there if you need it. Next, let's talk about for-loops.
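As a rough sketch of the if/else and match forms just described: the printed strings follow the ones read out in the lesson, while the describe helper and the verdict value are hypothetical additions used here only so the result can be inspected.

```scala
// One-line if/else. In Scala, if/else is itself an expression that yields a value.
val verdict = if (1 > 3) "Impossible!" else "The world makes sense."
println(verdict)

// Multi-line form: curly brackets allow several expressions per branch.
if (1 > 3) {
  println("Impossible!")
  println("Really?")
} else {
  println("The world makes sense.")
  println("still.")
}

// match plays the role of a switch statement; `case _` is the catch-all.
// describe is a hypothetical helper, not part of the lesson's worksheet.
def describe(number: Int): String = number match {
  case 1 => "One"
  case 2 => "Two"
  case 3 => "Three"
  case _ => "Something else"
}

println(describe(3))   // matches case 3
println(describe(30))  // falls through to the catch-all
```

Wrapping the match in a small function makes it easy to try several inputs without editing the value each time.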
Very common thing to do in most programming languages, right? So how does that work? It's a little bit weird in terms of syntax. One way to do it is for (x <- 1 to 4), where the arrow is a less-than sign followed by a dash. And then we can say val squared = x * x and println(squared). So what this does is iterate through the values one through four. Each time through, it assigns the value of the current iteration to x. And then we compute a new value called squared that multiplies x by itself, and print out the result. So what we should see are four results here, each the square of one of the numbers one to four. Control-Alt-W. And there they are: 1, 4, 9, 16, because that's one squared, two squared, three squared, and four squared. So it pretty much works the way you would expect, right? We also have while loops, just like you would in other languages. So we could also do something like var x = 10. And note that I'm using a mutable variable here. This is generally not good practice in Scala. And I'll say while (x >= 0), curly bracket, println(x), and then x -= 1. So we're going to start off setting x to 10. While x is greater than or equal to 0, we're going to print out the value of x and then subtract one from it. So you can see we had to make x a variable there in order to keep modifying its value. So it's not a structure you're going to see too often in Scala, but you can do it if you need to. Let's just make sure it works. Control-Alt-W. And sure enough, it counts down: 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0. Cool. Interesting how the worksheet formatted that, by the way. It's going out of its way to try to keep things compact. That's kind of cool. Another structure for flow control is the do-while loop. Again, similar to other languages. So we could set x equal to 0. We're still using that same variable that we defined earlier. And we can say do { println(x); x += 1 } while (x <= 10).
So this time we're going to start at 0 and print each value, counting up by one, while x is less than or equal to 10. We're doing that comparison at the end of the block instead of the beginning. That's just how do-while works. Control-Alt-W, and yes, it works: 0, 1, 2, 3, all the way to 10. Okay, that's about it for flow control. Let's talk more about expressions. One thing that's kind of weird about Scala is that whenever you have an expression, it implicitly returns the value of that expression automatically within that block. So for example, let's say {val x = 10; x + 20}. What's going to happen here? What does that actually do? Control-Alt-W. What it did was return the value 30, even though we didn't say, I want to print this out, and I didn't say explicitly, I want to return it. Just the act of having that expression as the last thing in that block means that that's what the block outputs as a result. So this is kind of an important thing to wrap your head around in terms of functional programming. Any kind of block of code like this can be treated as its own little entity, much like a function, and the thing it returns is whatever the last thing in that expression is. Now, you can explicitly return things too, like you can in other languages. But implicitly, the last thing that happens within a block of code is going to be the return value of that expression. So this little chunk of code here acts like a function in and of itself, implicitly, and it returns, in this case, the value 30. Okay, that's kind of important. I could actually just print that out. I can say println and print out that expression, {val x = 10; x + 20}, just like that. And that will print out, hopefully, the value 30. Okay, see how that works? Important point there, because that's probably one of the more confusing points of Scala.
If you learned programming in a different language, which for most people is the case, remember that that last little bit within the block is the return value of that expression. And you can treat that expression as its own entity. Okay, that's going to be important when we talk about functions next. All right, that's enough for this little block here. So again, I'm going to give you a challenge here to practice what you've learned. Let me copy in a little exercise I wrote off to the side here. You may have heard of the Fibonacci sequence. Basically, it's a sequence where every number is the sum of the two numbers before it. So your results should be 0, 1, 1, 2, 3, 5, and so on. See what's going on here? We start off with the values 0 and 1, and then we iterate, adding up the values of the two numbers before. So 0 plus 1 is 1, 1 plus 1 is 2, 1 plus 2 is 3, 2 plus 3 is 5, 3 plus 5 is 8, and so on and so forth. So your challenge is to write a little snippet of code in Scala that does that. You're going to have to apply what you learned about flow control here, and values and variables, to pull that off. This is a little bit more of a challenging exercise. Again, I'm not going to give you the answer. You can look it up easily enough if you do get stuck. But this is a pretty common interview question for showing people that you know how to code and can wrap your head around algorithms. So I do encourage you to take the time to get through that exercise, because it might come in handy on a job interview someday. And it's also good practice for flow control within the Scala language. So go have fun with that. And when we come back, we'll learn some more.

6. [Exercise] Functions in Scala: Moving on, this next lecture is a shorter one, but it's very important. We're going to talk about functions in Scala. And given that it's a functional programming language, obviously that's of great import.
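Before getting into functions, the loop and block-expression ideas from the flow control lesson can be recapped in a short worksheet-style sketch. The names countdown and blockResult are made up for this illustration (the lesson reuses x), and the do-while form is left out here.

```scala
// for loop over the inclusive range 1 to 4, printing each square.
for (x <- 1 to 4) {
  val squared = x * x
  println(squared)
}

// while loop; it needs a mutable var, which is generally avoided in Scala.
// countdown is a hypothetical name for the lesson's mutable x.
var countdown = 10
while (countdown >= 0) {
  println(countdown)
  countdown -= 1
}

// A block is an expression: its value is the last expression inside it.
val blockResult = { val y = 10; y + 20 }
println(blockResult)

// The block can also be passed straight to println.
println({ val y = 10; y + 20 })
```

The last two statements show the same block used first as the right-hand side of a val and then directly as an argument.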
Let's close out LearningScala2 and create yet another worksheet, a new Scala worksheet, and we'll call this one, guess what? LearningScala3. And this time we're going to talk about functions. So the format of a function in Scala is going to be def, function name, then parameter name, colon, type, for as many parameters as you have. And then colon, return type, equals, and then some expression that defines what that function does. This is a really, really weird syntax if you're used to other, non-functional programming languages. So let's look at some examples. We can say, for example, def squareIt(x: Int): Int = { x * x }. So this is how you would define a function that takes in a value and returns the square of that value. Let's break this down. Again, we start with def to define a function, then the name of the function, in this case squareIt, then a parameter. In this case we only have one parameter, called x, and that is of type Int. So x: Int means that the first parameter is an integer named x. And then the return value of this function is going to be an integer itself. So that second colon Int is the return type, again, totally backwards from many other languages. Then we have to say equals to assign that function to some expression. So we're saying this function definition is equal to this expression in the curly brackets, x * x. Don't forget that equals; leaving it out is a very common mistake. So again, we're not explicitly returning a value. The last thing that gets evaluated within the function is the implicit return value of that function. So that's all there is to it. Another example: let's say def cubeIt(x: Int): Int = x * x * x. So we could do all that on one line too, simply as a quick one-liner. Let's see these in action. We can say println(squareIt(2)), passing in the parameter two. And we can say println(cubeIt(3)), passing in the parameter three.
Let's hit Control-Alt-W and execute all that. And you can see that we defined the functions squareIt and cubeIt. We passed in the value 2 to squareIt and got the value 4 back. We passed the value 3 into cubeIt and got 27 back. So it's working just as it should. Again, let's focus for a moment on that syntax. It takes some getting used to: def, function name, parameter, colon, type, colon, return type, equals, and then the expression that defines the function itself. Now, here's where it gets weird, if you didn't think that was weird enough. Functions can actually take other functions as parameters. It's like Inception. If you don't get that pop culture reference, I apologize. But let's say, for example, def transformInt. We'll take in x, which is an Int parameter, and also an f, which is going to be a function that takes an integer and returns an integer as its return value. And this transformInt function itself will return an integer, and we'll set that equal to the following expression: f(x). All right, this takes some thinking, right? So what's going on here? We're defining a new function called transformInt. It takes in a value, an integer named x. And it also takes a function that takes in an integer and returns another integer. So that can be any function that takes an integer and transforms it to some other integer. We know that this in turn will return an integer, and it's set to the expression f(x). So we're going to take the function we passed in, called f, and pass into that function the parameter x that we passed into transformInt. If you need to sit there and think a little bit about how that works, don't be afraid to press pause, because that's a very important concept for functional programming. Let's see what it actually does when we try to use it. We'll say val result = transformInt(2, cubeIt), passing in two and the function cubeIt. So what do you think's going to happen here?
We're passing the value two into x and the function cubeIt into f. So we're going to end up calling cubeIt on the value two here within transformInt. Within transformInt, we're going to call the cubeIt function that we passed in, with the value x. So we should get back two cubed, right? Control-Alt-W. Sure enough, we got the value eight. Alright, again, noodle on that. And if you want to actually print that result out, instead of just relying on what that evaluated to, we could say println(result), Control-Alt-W, and get back the value 8 explicitly there. Okay, so even more important is the concept of lambda functions. Sometimes they're called anonymous functions or function literals; there's a lot of different terminology for this. But basically, you can declare a function inline without even giving it a name. And you'll see this happening a lot in Spark code. For example, I could say transformInt(3, x => x * x * x), where the arrow is an equals sign followed by a greater-than sign. So what I've done is bypass the need to actually define a cubeIt function there. I've just defined that function inline here with what we call a lambda function, or an anonymous function, or a function literal. Again, different terminology for the same thing. So this does the same thing as transformInt(3, cubeIt). But instead of actually defining a cubeIt function, we're just defining the guts of that function here inline, as part of this line. So we're saying transformInt with the value 3, which is going to get passed into x. And into the f parameter of transformInt, we're passing this expression, which says we're going to take an input parameter x and return x times x times x. So it's just kind of a shorthand for passing in really short expressions as functions. And let's just make sure it does what we think it does. If we hit Control-Alt-W, that does give us back the value 27, because it's passing three into x and then cubing that. Okay. Let's look at another example.
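Pulling the function examples so far together, a worksheet-style sketch might look like the following. The names squareIt, cubeIt, and transformInt follow the lesson; cubedInline is a hypothetical name added here to hold the lambda result.

```scala
// def name(parameter: Type): ReturnType = expression
def squareIt(x: Int): Int = {
  x * x
}

// The same shape as a one-liner.
def cubeIt(x: Int): Int = x * x * x

// A function that takes another function: f maps an Int to an Int.
def transformInt(x: Int, f: Int => Int): Int = f(x)

println(squareIt(2))
println(cubeIt(3))

// Passing a named function as the second argument.
val result = transformInt(2, cubeIt)
println(result)

// Passing a function literal (lambda) instead of a named function.
val cubedInline = transformInt(3, x => x * x * x)
println(cubedInline)
```

The compiler infers that x in the function literal is an Int from the declared type of f, so the literal needs no type annotation.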
Same idea: transformInt(10, x => x / 2). So this is defining a new function that we haven't seen before. We're going to use this transformInt function and pass ten into x, and into f we're passing this inline lambda function that takes x and divides it by two. So you would expect that to return 5, right? Sure enough. Another example: we could say transformInt(2, and then x goes to, and we can get more complicated here. We could say, in curly brackets, val y = x * 2; y * y. Alright? And close that out. What do you think's going to happen with this one? What we're doing here is passing in a multi-line expression instead of just a single-line thing. You can do that; that's cool. So here we're setting up a value named y and setting that to x times 2. Again, two is being passed into this function as x. So we're going to start with 2 times 2, which is 4. And then we're going to say y times y, so 4 times 4. So we should get 16, right? Control-Alt-W. Sure enough, we get 16. So you can see those lambda functions can contain multi-line expressions if you just put them in curly brackets. That's fine. So again, it's just a shorthand for defining a short little function without actually defining a named function. It comes in handy sometimes, but it can be a little bit confusing. All right. That's a lot to wrap your head around. Like I said, there's not a lot of material here, but it's hard to get it to sink in. To help it sink in, again, I have a challenge for you. So here's your exercise for this lesson. Strings, it turns out, have the built-in toUpperCase method. So for example, you could say "foo".toUpperCase and it will give you back FOO in all caps. Rage-tweet mode, if you will. Your exercise, your challenge, is to write a function that converts a string to uppercase, and then use that function on a few test strings. And then I want you to do the same thing using a function literal.
That's these little inline functions here, instead of using a separate named function. So practice doing toUpperCase using a traditional function, and then using a function literal, one of these little inline lambda functions, instead. Not too hard, but it's good practice. So go off and take care of that. And I'll see you in the next lesson.

7. [Exercise] Data Structures in Scala: All right, moving on, let's make yet another Scala worksheet. And we'll call this one, you'll never guess, LearningScala4. And this time we're going to talk about data structures. Very important concept in any language. And this is going to be kind of a long one, guys, so bear with me. There are many different kinds of data structures in Scala. We have tuples. That's a very common one when we're dealing with Spark code. They're immutable lists. And you can think of those as like database fields or columns of data. They can be useful for passing around entire rows of data together. So let's see what that looks like. We can say val captainStuff, and if you're not familiar with Star Trek, I apologize for the pop culture references here: "Picard", "Enterprise-D", "NCC-1701-D". It doesn't really matter what these mean. It's a list of strings that are put together into what we call a tuple here. Now, despite the name, a tuple does not necessarily have to have three things in it. It just happens to in this case. So we have this trio of different objects here, these three strings that are grouped together in these parentheses. And we can treat that as a single entity that we're calling captainStuff. Let's print that out. Control-Alt-W. And you can see that that prints out as that object itself, the three strings Picard, Enterprise-D, and NCC-1701-D within those parentheses. It's actually a single object that we've called captainStuff, and that object contains those three strings.
Now we could rip that apart and refer to those components individually. And we do that by referring to them based on their one-based index. That's important: you refer to the individual fields with a one-based index. This can be confusing, because in a lot of programming languages you start counting from 0, right? That's not the case in this particular example. So for example, println(captainStuff._1) is the syntax here. Let's say Control-Alt-W, and that prints out Picard, the first element of that tuple. Likewise, we can say println(captainStuff._2) and println(captainStuff._3). And we get each of those elements stripped out of the original tuple. So that's how you refer to individual elements of a tuple, using that underscore format, starting with one. Okay, what else can we do? We can create key/value pairs, just like we do in many other languages, using the arrow operator. So for example, we could say val picardsShip equals the string "Picard", dash greater-than (the little arrow symbol there), "Enterprise-D". And I could then refer to the value part of that with a similar syntax. I could say println(picardsShip._2), and that'll give me back that value. So Control-Alt-W. You can see we've created that mapping of Picard to Enterprise-D. And if we want to extract that particular value, we can just use ._2 to get it. And you can actually mix different types within a tuple, by the way. We can say val aBunchOfStuff equals, in parentheses, "Kirk", which is a string; 1964, which is a number (the year Star Trek came out, of course); and a Boolean. Let's throw in a Boolean, true. Is that a valid thing to do? Is that legal? Yes, it is. So you can have a tuple that contains different data types as well, in this case a String, an Int, and a Boolean. That's perfectly okay in Scala. Let's talk about lists next.
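The tuple operations covered above can be sketched like this, worksheet-style. The value names are camelCase renderings of the names spoken in the lesson, so treat the exact spelling as an assumption.

```scala
// A tuple groups several values into one immutable object.
val captainStuff = ("Picard", "Enterprise-D", "NCC-1701-D")
println(captainStuff)

// Tuple fields are accessed with a ONE-based index: _1, _2, _3.
println(captainStuff._1)
println(captainStuff._2)
println(captainStuff._3)

// The -> operator builds a key/value pair, which is just a two-element tuple.
val picardsShip = "Picard" -> "Enterprise-D"
println(picardsShip._2)

// Tuples may mix element types: here a String, an Int, and a Boolean.
val aBunchOfStuff = ("Kirk", 1964, true)
println(aBunchOfStuff)
```

Because a pair built with the arrow is an ordinary tuple, the same underscore accessors work on it.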
So lists are kind of like a tuple, but they're an actual collection object that has more functionality. And lists cannot hold items of different types. So: like a tuple, but more functionality, and must be of the same type. Alright, little note for yourself there. It's actually a singly linked list under the hood, if you're into data structures. So we can say val shipList equals a List, capital L, of "Enterprise", "Defiant", "Voyager", "Deep Space Nine". I forgot NX-01, the best one of all, anyway. So what does that do? Control-Alt-W. And yes, we've got a List object that holds a list of strings. Again, the list must contain the same object type. In this case it's a list of String objects. And what can we do with it? Like with a tuple, we can extract individual elements from that list, but the syntax is going to be completely different. So let's say, for example, println(shipList(1)), just like that. What do you think that'll do? Will that give me back Enterprise, or will it give me back Defiant? Let's find out. Control-Alt-W gives me Defiant. So unlike the tuple syntax, we're actually zero-based here. shipList(0) would give us back Enterprise. Let me just prove that to you. But shipList(1) is going to give us Defiant, because we're starting counting from 0 in this case, which is a little bit more what you would expect. We also have head and tail operators available. So we could say println(shipList.head), and we can say println(shipList.tail). head probably does what you expect. That's going to give you the first item in the list. But tail probably isn't what you expect. It actually gives you back all the remaining items beyond the first one. Let's see, Control-Alt-W, and we'll scroll down to the bottom. So head gave us the first thing in the list, Enterprise, which is expected, but tail gives us everything else. That gives us the list of Defiant, Voyager, and Deep Space Nine, sort of a sublist excluding the first item.
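A minimal worksheet-style sketch of the list access just described; shipList is the camelCase rendering of the name spoken in the lesson.

```scala
// All elements of this List share one type (String here).
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")

// List indexing is ZERO-based, unlike tuple fields.
println(shipList(0))
println(shipList(1))

// head is the first element; tail is everything after the first element.
println(shipList.head)
println(shipList.tail)
```

Note that tail returns another List, not a single element, which is why it prints as a list of the three remaining ships.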
Little weird. Scala can be weird sometimes. Let's say you want to iterate through every item in a list. That's a pretty common thing to need to do, right? The syntax for that is simple enough. We can say for, parenthesis, ship <- shipList (so throw the left arrow operator in there), and then some expression that will operate on ship. Let's just print it out. Okay. So this is going to iterate through every item in shipList, extract each one of those strings, assign it to ship, and then print that out. Let's watch. Yeah, it worked. So that printed out each individual ship name. And you know, if we weren't in the worksheet setting, that would actually be on separate lines, of course, but it's trying to be smart in conserving space for us, which is nice. Okay, let's get weird. Let's apply a function literal to a list. This is functional programming; we can do things like that. So we can use the map function to apply any function that we want to every item in a collection. Let's see how that works. We can say, for example, val backwardShips = shipList.map. So this means we're going to map every element of that list through some function that we're going to provide. And here comes that function. We're just going to do that as a function literal. We'll say we start with ship, which is a String, and that's going to be transformed into, in curly brackets, ship.reverse, because there is a built-in reverse operation on strings. Okay, so let's expand that a little bit. It's worth seeing all that in one display there. Okay, so wrap your head around that. Now, this is kind of tying together a lot of what we talked about already. We're going to take a list. We're going to use the map function, which is very important in the world of distributed computing, to apply some function to every element of that list.
And that function is going to be a function literal, where we're just saying some ship string is going to be mapped into its reverse, and we return that. So after executing this, we should have a new list called backwardShips that contains all the ships in reverse. So let's go ahead and print those out, with for (ship <- backwardShips). Again, we're just iterating through everything in that list like we saw before: println(ship). Okay, so let's see what it does. There you have it. backwardShips got assigned to this list where we applied the reverse function to every single element of shipList, and it just made every word backwards. And then we iterated through each one of the elements in that backwardShips list and printed them out. So it worked. Cool. What else can we do? We saw map. You've maybe heard of MapReduce. So let's do reduce. Same concept. Like we had map, we can use reduce to combine together all the items in a collection using some function that we provide. Okay? So for example, let's say val numberList is equal to the list containing 1, 2, 3, 4, 5. Simple enough. Now we're going to say val sum = numberList.reduce. So map would apply a function to every single element of the list. But reduce is going to go through every element in the list, run the same function on it, and let you accumulate all the results of that. It'll make more sense with an example. So we could say, for example, parentheses, and then in parentheses x: Int, y: Int, and map that to x + y. Okay? So let's think about what's going on here. reduce basically keeps a running total of some sort. It's going to take the previous output of the reduce function from the previous iteration and pass that into x. It then takes the current iteration, the number that it's looking at right now, calls that y, and applies some operation to the two.
And the output of that operation will get fed into the next stage when we go to the next element. So let me walk through what's going on here. We're going to start off with one. And we're going to say 1 plus 2, which is 3. We'll then pass that into the next iteration, and we'll have 3 plus 3. The result of that will be six. Then we have six plus four. The result of that will be 10, and then 10 plus 5 should be 15. Alright, so that's how reduce works. When we're done, we'll print out that value: println(sum). Let's execute it. Sure enough, I got 15. So reduce is just a way to collapse all the results into one final answer. Often that's used to compute a grand total or something like that. You could think of more straightforward ways of doing this, but the nice thing about map and reduce functions is that they are very parallelizable. It's very easy to apply map functions in parallel and then combine the results of those map functions in some sort of reduction stage to get the final answer that you want. We'll see that more in action with examples later on in the course. What else can we do? There's a filter operation as well, if you want to remove stuff that you don't want. So filter removes stuff. For example, we could make a new list called iHateFives and assign that to numberList.filter, passing in another inline function here, a function literal, if you will: x: Int goes to x != 5. Alright? So what this is going to do is apply this filter function, "is x not equal to five?", and only keep an element within our resulting list if that's true. So let's see what that does. That gives us back a list of 1, 2, 3, 4, because it excluded the value five. So that's how a filter works. It can filter out invalid information, outliers, whatever you want to do. It's a good way of cleaning data, again, in a way that is parallelizable. We can do the same thing with a slightly different syntax.
We can say val iHateThrees = numberList.filter, and this is kind of a shorthand: underscore, not equal, three. So instead of actually writing out that x and giving that value a name, we don't really need a name. We just want to take whatever value there is and check that against three. So that's a shorthand way of doing it, using the underscore character. Let's see what that does: 1, 2, 4, 5. So it's equivalent syntax (you'll notice I'm using three instead of five) and a little bit easier to type. If you see that syntax, that's what's going on there. The underscore is basically a placeholder for every element of the list. Okay. Now, it's worth noting that Spark has its own map, reduce, and filter functions that distribute these operations. They work in the same way. So Scala has its own built-in map, reduce, and filter, and Spark also has its own map, reduce, and filter, the difference being that Spark is going to be sure to parallelize that across all the machines in your cluster if it can. Okay, hey, you actually understand MapReduce now, by the way. That's the fundamental concept behind MapReduce. It's pretty simple, so that's a little bonus for you. Some more list operations you can do: how do I concatenate lists? What's the syntax for that? Let's create another list so we have something to play with: val moreNumbers = List(6, 7, 8). And we can say val lotsOfNumbers = numberList ++ moreNumbers. So to concatenate two lists together, use that double-plus operator. Let's see what that does. Control-Alt-W. So there you have it. We have a new list called lotsOfNumbers that concatenates numberList, which contained 1, 2, 3, 4, 5, and the moreNumbers list of 6, 7, 8 into a single list: 1, 2, 3, 4, 5, 6, 7, 8. Note that we did not get back a tuple of two lists. It's a single list that contains all the values from those two lists. Here are some more random things we can do with lists, starting with reverse.
We can reverse what's in a list with val reversed = numberList.reverse: 5, 4, 3, 2, 1. And that actually used to be an interview question I would give people at Amazon: reverse the words in a string, or reverse the elements of a list. That used to be a hard thing to do in most programming languages, but if you come in there using Scala, you can do it in one line of code. So that's a little way to cheat there. Sorting: you can say val sorted = reversed.sorted. Oops, I hope I spelled reversed right. So let's go ahead and sort that. And that's back in sorted order: 1, 2, 3, 4, 5. Easy-peasy. What else? val lotsOfDuplicates = numberList ++ numberList. So let's concatenate the two lists together and see what happens. We're concatenating a list to itself, and you see we can do that: 1, 2, 3, 4, 5, 1, 2, 3, 4, 5. So the elements of a list do not need to be unique. You can have duplicates in there. That's what I'm showing you with that little example. But if you did want nothing but distinct values, you can do that too. We could say val distinctValues = lotsOfDuplicates.distinct. And that will return only the distinct values within that list. And you can see we're back to 1, 2, 3, 4, 5. There are no duplicates in that distinct list. If I just want to get the maximum value in a list, that's easy: val maxValue = numberList.max. And that is five. If you wanted to get the total, we could say val total = numberList.sum, a little bit more straightforward way of doing it than using a reduce function. That works too. We got the value 15 there, much more simply. And if you want to check whether an element exists within a list, that's also easy to do: val hasThree = iHateThrees.contains(3). So we're checking the iHateThrees list that we made earlier to see if it contains the value three. It should not. Sure enough, that came back as false. All right, we're getting there, guys.
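The list operations from this stretch of the lesson can be collected into one worksheet-style sketch. Value names are camelCase renderings of what is spoken, and these are the plain Scala collection methods, not Spark's distributed versions.

```scala
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")

// map applies a function literal to every element.
val backwardShips = shipList.map((ship: String) => ship.reverse)
for (ship <- backwardShips) println(ship)

val numberList = List(1, 2, 3, 4, 5)

// reduce carries a running result in x and combines it with each element y.
val sum = numberList.reduce((x: Int, y: Int) => x + y)

// filter keeps only the elements for which the function returns true.
val iHateFives = numberList.filter((x: Int) => x != 5)
val iHateThrees = numberList.filter(_ != 3) // underscore shorthand

// ++ concatenates two lists into a single list.
val moreNumbers = List(6, 7, 8)
val lotsOfNumbers = numberList ++ moreNumbers

// Assorted built-in helpers.
val reversed = numberList.reverse
val sorted = reversed.sorted
val lotsOfDuplicates = numberList ++ numberList
val distinctValues = lotsOfDuplicates.distinct
val maxValue = numberList.max
val total = numberList.sum
val hasThree = iHateThrees.contains(3)

println(s"$sum $iHateFives $iHateThrees $lotsOfNumbers $hasThree")
```

Every one of these calls returns a new value and leaves numberList unchanged, which is why the same list can be reused throughout.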
Last thing I want to talk about is maps. So you might know these as dictionaries in other languages, or key/value lookups on distinct keys. Very useful data structure in a lot of different situations. So let's talk about how that works. Let's set up a map called shipMap. Little comment in here, so we know we're talking about maps; they're important. So val shipMap is going to be equal to a Map, capital M, that contains the following key/value pairs: "Kirk" -> "Enterprise", "Picard" -> "Enterprise-D", "Sisko" -> "Deep Space Nine", "Janeway" -> "Voyager". I could go on, Archer and so on, but I'll spare you the typing. All right, so what we've done here is set up a map object. Make sure I close that off properly. And you can see we got back a map object there that maps strings to strings. Again, the data types need to be the same. So we're mapping captain name strings to their ship name strings. And you can have strings mapping to ints or something like that, but you have to have a consistent type within each key/value pair, at least. That makes sense. So now if we wanted to use that map, we could say something like println(shipMap("Janeway")), and Control Alt W. That gives us back Voyager. So you can see we have a quick little lookup table there that we've defined using a map that will give us back the value Voyager given the key name Janeway. Very useful thing in a lot of situations. What about missing keys? So actually I intentionally omitted one of the ships' captains there. Let's say println(shipMap.contains("Archer")). Archer is not in my map there. That comes back as false. How do I deal with that? What happens if I try to look up Archer, where he doesn't exist in the map? Well, let's try it out: val archersShip = util.Try(shipMap("Archer")) getOrElse "Unknown". So we're going to do a try block here. All right, so we're doing some exception handling here.
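A minimal sketch of the map lookups just described, using the captains and ships named in the lesson. The last line shows Map's own getOrElse as an alternative to the Try wrapper; that alternative is my addition and is not something the lesson uses.

```scala
import scala.util.Try

// Key/value lookup table: captain name -> ship name (String -> String).
val shipMap = Map(
  "Kirk"    -> "Enterprise",
  "Picard"  -> "Enterprise-D",
  "Sisko"   -> "Deep Space Nine",
  "Janeway" -> "Voyager"
)

// Looking up a key that exists returns its value.
val janewaysShip = shipMap("Janeway")

// contains checks for a key without risking an exception.
val hasArcher = shipMap.contains("Archer")

// shipMap("Archer") would throw NoSuchElementException; Try captures the
// failure and getOrElse supplies a fallback value instead.
val archersShip = Try(shipMap("Archer")).getOrElse("Unknown")

// Alternative (not from the lesson): Map has its own getOrElse.
val archersShipAlt = shipMap.getOrElse("Archer", "Unknown")

println(janewaysShip)
println(archersShip)
```

The Try form generalizes to any expression that might throw, while Map.getOrElse avoids raising the exception in the first place.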
And then we'll println archersShip. Oops: println(archersShip), Control Alt W. All right. So basically we're doing a little exception handling here, with a Try and a getOrElse. By wrapping shipMap("Archer") in this Try block, that means that it's not going to go ballistic because Archer doesn't exist. We are getting back sort of an undefined value in that case. So the Try catches the exception that gets triggered, and it uses this getOrElse clause instead and returns the string "Unknown" in that case. So that's one way of handling missing values in a map, or dealing with the possibility of missing values in a map. Okay, that is definitely enough. And we're done with the basics of Scala. So, you know just enough to be dangerous here. Now, we're going to get a lot more practice as we go through the course by looking at a lot of different examples. And that's really how you're going to learn it. But I hope this demystified some of the syntax for you, at least. One last exercise for you. Let me copy in one last little challenge. So here it is. Your challenge, if you choose to accept it: create a list of the numbers one through 20, and print out the numbers that are evenly divisible by three. And there are a few hints in the text here that might guide you. Again, this is a very common interview question I used to give people during phone screens at Amazon. You'd be amazed how many people could not do this successfully in any language they wanted. Try and do it in Scala. There is a modulo operator, like in other languages, that I talk about here, that gives you the remainder after division. So anything that's divisible by three should have a modulo 3 that comes back as 0. So that's your first hint. So just iterate through all the items in the list, testing that modulo as you go, and you can then do it again using a filter function on the list instead. So a couple of ways you can go about it.
Go ahead and practice there. And after that, I think you'll know enough Scala to be dangerous. And we'll move on and actually get into the guts of Apache Spark itself.
8. The Resilient Distributed Dataset: So let's dive into Spark itself, and we'll start with an overview of Spark and also its original API, the RDD. We'll talk about what that means shortly. That was the first API that Spark came out with in Spark 1. And even though it's been largely superseded in subsequent versions by newer APIs, it's still out there. You're still going to need to use it from time to time. And there are some problems, as we'll see, where RDDs are still the most efficient solution. You're also still going to run into some third-party libraries you might want to use that use RDDs, and you might have to look at some legacy code that uses it too, so it's still worth learning. After this section, though, we will shift our focus to more modern APIs like DataFrames and Datasets. But for now, let's start with the basics and learn the core of Spark itself, the RDD. So let's start by talking about Spark's roots, and that is the RDD interface. RDD stands for Resilient Distributed Dataset, not to be confused with Datasets in Spark; that's a different thing. Datasets are, rather confusingly, built on top of RDDs. It's a higher-level API. But at its core, Spark is all about RDDs. And what is this? Well, basically an RDD is a bunch of rows of data of some sort. And because it's divided into rows, those rows can be distributed out to different computers and be processed in parallel on those different computers. So that's why it's distributed. So it's a dataset because it's a bunch of rows of information. It's distributed because we can potentially divide those rows up amongst multiple computers. And it's resilient because, well, Spark makes sure that the processing that you're doing on the RDD gets done one way or another, right?
So it's Spark's problem to figure out what happens if a node goes down in the middle of your operation, and to spin up a new one to take its place. So the first thing you need to do before you create an RDD in Spark is create a Spark context. And your driver program will do this as one of the first things it does. So the Spark context object is what's responsible for making those RDDs resilient and distributed. You don't have to write the code to make sure that you can handle node failures or hardware failures. You don't have to write the code for figuring out how to distribute that data across your entire cluster. The RDD itself, within the Spark context, is figuring out how to do that for you. All you need to worry about is how to transform your data. So your Spark context will create the RDDs that you request. And if you're actually using the Spark shell interactively, it will create an object for you automatically that will be your Spark context. You don't have to create it explicitly; it's already there. This will make more sense when we look at some code. So how do I create an RDD? It's usually pretty simple. One way is to just give it an explicit list of stuff. So in the first example there, we're saying val nums = sc.parallelize(List(1, 2, 3, 4)). That's going to create an RDD called nums that just contains 1, 2, 3, and 4, one on each row. Obviously that's not terribly useful in the real world, because if you are able to write down and hardcode what's in your RDD, it's probably not big data and it's probably not real data either. But, you know, for testing things out, for test cases, sometimes that's helpful. More often, you'll be loading data from some text file somewhere. And to do that, you can just say sc.textFile, assuming sc is your Spark context object, and pass it the path to your giant text file. And that will create an RDD where every line of that text file is its own row in the RDD. And it doesn't have to come from the local file system.
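The two ways of creating an RDD just described might look like the sketch below. It assumes Spark is on the classpath and runs in local mode; the application name and the temporary file are made up so the example can run on its own. In the Spark shell you would skip the SparkContext line because sc already exists.

```scala
import java.nio.file.Files
import org.apache.spark.SparkContext

// Driver-side entry point; "local[*]" means this machine, all CPU cores.
val sc = new SparkContext("local[*]", "CreateRddSketch")

// 1) From an explicit, hardcoded collection (handy for tests).
val nums = sc.parallelize(List(1, 2, 3, 4))

// 2) From a text file: every line of the file becomes one row of the RDD.
//    A small temporary file stands in for "your giant text file" here.
val path = Files.createTempFile("rdd-sketch", ".txt")
Files.write(path, "first line\nsecond line\nthird line".getBytes("UTF-8"))
val lines = sc.textFile(path.toString)

val numCount = nums.count()    // number of rows in nums
val lineCount = lines.count()  // number of lines read from the file

println(s"nums has $numCount rows, lines has $lineCount rows")
sc.stop()
```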
That would kind of defeat the purpose of big data, right? So it can also read from distributed file systems, including HDFS, which is the Hadoop Distributed File System, or even Amazon S3; s3n would be the prefix for that. So there are always extensions for loading data into a Spark RDD from various distributed data sources, because Spark's all about big data, right? So its ability to load data from distributed data stores is kind of fundamental to what you're doing. It doesn't have to be a text file, though. You could even create an RDD from a Hive context, for example, if you're using Hive, and you can then turn around and just run SQL commands on that Hive context if you want to as well. So that's one way of doing SQL in the Hadoop environment. You can also create RDDs directly from JDBC. So if you do have a database out there somewhere, you can pretty easily transform that into an RDD in Spark. You can talk to NoSQL databases like Cassandra or HBase. You can also integrate it with Elasticsearch if you want to. And it doesn't have to be plain text either; you can use structured data like JSON or CSV files, or sequence files or object files. And it also supports various compressed formats as well. Again, with big data, sometimes data is compressed. And as long as it's some lossless format that can be parallelized, then you can create an RDD from it as well. Once you have an RDD, what do you do with it? Well, you probably want to transform it in some way, and there are various operations for doing so. Map, for example, works just like the map in MapReduce, really. It allows you to apply some function to every row of your RDD in parallel. And we'll see an example of that shortly. flatMap is similar. The only difference between map and flatMap is that map has a one-to-one relationship. So map will take one row of your RDD and transform that into one row of another RDD. flatMap, however, can split things up.
So flatMap could take one row of an RDD and create multiple rows in another RDD. So for example, if you had a list of stuff in each row, flatMap could break that out into individual rows for each thing. And we'll see that in an example shortly. You can also do filters. So if you want to remove data or clean out data, you can do that too. You can define a filter function that every row will be tested against. And if that function returns false, that row will be thrown away in the resulting RDD. So that's a good way of cleaning your data and getting rid of missing data, things like that. You can also run distinct on it if you want to eliminate duplicate values in your RDD. You can sample it, if you just want to take a small sample of an RDD for experimentation purposes or whatever. And you can also do fancy set operations between RDDs, like the union of two RDDs or the intersection of two RDDs; you can subtract one RDD from another, or do a Cartesian product between two RDDs as well. Those are more advanced use cases. Let's see it in action. So let's say I want to just square everything in my RDD. Let's say you have an RDD that I'm calling rdd; real creative name, right? And we've made that by doing that parallelize trick again, so we just have an RDD of four rows where the rows contain, respectively, 1, 2, 3, 4. On the next line, we're creating a new RDD called squares by just calling rdd.map(x => x * x). So what this is doing is passing the function "take x and transform that into x times x" and applying that to every row in rdd. The resulting RDD will be called squares. So after we're done running that map operation, squares will contain the following rows: 1, 4, 9, and 16, right? So, follow me there? And the beauty of it is that that can be distributed.
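The transformations listed above can be tried on tiny hardcoded RDDs. This is only a sketch: the sample data and the application name are invented, and it assumes Spark on the classpath in local mode. Results of distinct, intersection, and subtract come back in no guaranteed order, so the sketch sorts them before looking at them.

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "TransformationsSketch")

val rdd = sc.parallelize(List(1, 2, 3, 4))

// map: exactly one output row per input row.
val squares = rdd.map(x => x * x)

// flatMap: one input row may become several output rows.
val sentences = sc.parallelize(List("to be or", "not to be"))
val words = sentences.flatMap(line => line.split(" "))

// filter: rows for which the function returns false are dropped.
val evens = rdd.filter(x => x % 2 == 0)

// distinct: removes duplicate rows.
val uniqueWords = words.distinct()

// sample: a random subset, useful for experimentation (result varies per run).
val someRows = rdd.sample(withReplacement = false, fraction = 0.5)

// Set-style operations between two RDDs.
val other = sc.parallelize(List(3, 4, 5))
val both = rdd.union(other)            // keeps duplicates
val common = rdd.intersection(other)
val onlyInRdd = rdd.subtract(other)
val allPairs = rdd.cartesian(other)    // every (left, right) combination

// Nothing above has run yet; collect() and count() pull results to the driver.
val squaresResult = squares.collect().toList
val wordsResult = words.collect().toList
val evensResult = evens.collect().toList
val uniqueResult = uniqueWords.collect().toList.sorted
val bothResult = both.collect().toList
val commonResult = common.collect().toList.sorted
val onlyInRddResult = onlyInRdd.collect().toList.sorted
val pairCount = allPairs.count()

println(squaresResult)
sc.stop()
```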
So if the RDD was really massive, it could actually split that processing up and handle that squaring on different chunks of that RDD on different nodes within your cluster, and then suck all those results back to your driver script to get the final answers that you want. So that's a little bit of weird syntax there. You might not have seen that before. That x => x * x thing, what's going on there? Well, this is what we're talking about when we say functional programming. So a lot of RDD methods are going to accept a function as a parameter and not just a value. And that's what's going on here. So syntactically, x and then the funky arrow to x times x is the same exact thing as defining a function to square something. We're calling it squareIt in this slide; it takes some integer called x and returns x times x. Actually, the return keyword would be optional in Scala; you can just say x * x and it would just work. And then by saying rdd.map and passing in the squareIt function, it would apply that squareIt function to everything in that RDD. But syntactically it's the same exact thing as saying rdd.map(x => x * x). So you tend to see that shorthand if you have a very simple transformation that you can fit into one line there. But that's all it is. We're passing that function around. So in this example, we're defining a function on how to square numbers, and we're passing that function into the RDD through the map operator there. And that's it; you understand functional programming now. That's the key concept. We're just passing functions around. And the bigger insight here is that the Scala language itself is sort of forcing you into writing these functions that can be distributed. So the features of Scala make it easier to do this somewhat, but that's it. That's all we mean by functional programming.
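To make the equivalence concrete, here is a small sketch that passes the same squaring logic to map twice, once as an inline function literal and once as a named function. The wrapper object MathFunctions and the application name are hypothetical; the function is placed in an object on the assumption that this keeps it easy for Spark to ship to executors when the code is run as a script.

```scala
import org.apache.spark.SparkContext

// Hypothetical holder for the named function described in the lesson.
object MathFunctions {
  def squareIt(x: Int): Int = {
    x * x // the last expression is the result; no return keyword needed
  }
}

val sc = new SparkContext("local[*]", "FunctionPassingSketch")
val rdd = sc.parallelize(List(1, 2, 3, 4))

// Inline function literal: x, the "funky arrow", then the body.
val viaLiteral = rdd.map(x => x * x).collect().toList

// Named function passed as a value; Scala converts the method to a function.
val viaNamed = rdd.map(MathFunctions.squareIt).collect().toList

println(viaLiteral)
println(viaNamed)
sc.stop()
```

Both calls hand map a function of type Int => Int, so the two results should be identical.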
You know, we're sort of thinking in parallel and applying functions to big sets of data by passing those functions around to things like RDDs. That's it. It's not that hard, guys. All right, once you've actually transformed everything in your RDD the way you want it, what do you do with it? How do you get your results back? Well, that's what an RDD action is all about. So an action is basically something that causes your RDD to collapse and give a result back to your driver script. One thing you could do is call collect on the RDD; that will just give you everything in that RDD back as one giant data structure in Scala. You know, if it's a big dataset, that's not going to make a lot of sense. More often you're going to be trying to get more of a reduce operation out of it. For example, we could say count if we just want to count up how much stuff is in the RDD. We can say countByValue if you want to get a count of how many rows exist for each unique value. You could use take to just grab a certain number of rows from the RDD, top to get the top few rows, or reduce to get some sort of grand total from all the rows in your RDD. And there are more actions as well. But the basic idea is that an RDD action gives you some answer back to your driver script. So it's going to force Spark to go out and say, "Okay, all you executor processes, finish what you're doing and give me an answer." And it's going to give that answer back to your driver script. So that's what an action does. And the way that works is kind of interesting. So you have to remember that in Spark, nothing really happens in your driver program until you call an action. It's sort of a lazy evaluation strategy here. And there's a very good reason for that. Until Spark knows what you're trying to achieve through the action that you're trying to get at the end, it doesn't know how to optimize all the operations that you've done.
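The actions named above, and the point that nothing runs until one of them is called, can be sketched as follows. The sample values and the application name are invented. The accumulator is my own device for observing when the map function actually executes and is not part of the lesson.

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "ActionsSketch")
val rdd = sc.parallelize(List(3, 1, 2, 3, 1))

// Each of these is an action: it triggers a job and returns a plain
// Scala value to the driver instead of another RDD.
val everything = rdd.collect().toList         // every row, as a local List
val howMany = rdd.count()                     // number of rows
val perValue = rdd.countByValue()             // Map of value -> occurrences
val firstTwo = rdd.take(2).toList             // the first 2 rows
val largestTwo = rdd.top(2).toList            // the 2 largest rows
val grandTotal = rdd.reduce((x, y) => x + y)  // combine all rows into one

// Lazy evaluation: defining a transformation does not execute it.
val seen = sc.longAccumulator("rows seen")
val doubled = rdd.map { x => seen.add(1); x * 2 }
val seenBeforeAction = seen.sum   // the map function has not run yet
val doubledCount = doubled.count()
val seenAfterAction = seen.sum    // now every row has been visited

println(s"before: $seenBeforeAction, after: $seenAfterAction")
sc.stop()
```

This is also the debugging trick mentioned in the lesson: a temporary action such as count() or take() forces the pending transformations to run so an intermediate result can be inspected.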
So remember, one key to Spark's performance is that it can construct this directed acyclic graph that's very optimized for the end result that you want to achieve. So this can be a little bit confusing as you're debugging your Spark driver scripts because, you know, you can call some map operation or some sort of transformation on your RDD, and it will look like nothing's happened in your code. It's not until you actually call an action that anything is actually going to get executed and sent out to the various nodes in your cluster, and you'll actually get a result. So this can make debugging a little bit tricky, right? Sometimes you need to stick a little temporary action in there to get back an intermediate result and make sure it's what you expect as you're debugging the Spark driver scripts. So, something to be aware of, especially while you're debugging these things. And with that, let's dive into some more specifics next.
9. Ratings Histogram Example: So it's generally best to understand things through an example, right? So let's do that. Let's learn about RDDs by looking at a really simple example of using them. So we're going to dive into a simple example in our course materials that just counts the number of each rating within our MovieLens dataset. So as you recall, the MovieLens dataset is a dataset of 100,000 movie ratings, where people rated a bunch of movies from one to five stars. And let's say we just want to figure out how many one-star ratings there are, how many two-star ratings, how many three-star ratings, et cetera. Well, that's something we can do with Apache Spark. So before I dive into the code here and what it's doing, let's just run it, familiarize yourself with what it's doing, and sort of get comfortable with the code itself. So let's go do that.
So if you open up IntelliJ and go to the RatingsCounter file here, that's the one we're going to be playing with and diving into: a very simple example of an Apache Spark driver script. And we're going to talk about each line and what it does in more detail. But at a high level, what we're doing is loading up the MovieLens dataset across every CPU core of our machine here. And we're going to count up how many times each unique rating value occurs within that dataset. So we're going to load up every line that represents every rating in that dataset. And we're going to split the rating out into its own RDD. We're going to count up how many times each rating actually appears, sort the results, and print them out. And let's just run it and convince ourselves that it works and does something interesting, right? So let's right-click on RatingsCounter and say Run. And there we go. That should go off and spin up Spark. And there are our answers. So what this is telling us is that we had 6,110 one-star ratings, 11,370 two-star ratings, 27,145 three-star ratings, and so on and so forth. And this data itself is kind of insightful, right? It tells us that the most common rating for a movie is four stars. And there's a fair amount of three stars and five stars in there too. So people seem to have a reasonable scale for how they're rating these movies. And they do seem to be reserving one star for the worst of the worst. So out of 100,000 ratings, only 6,000 and some change were actually rated one star. So that gives us some confidence that this data is meaningful, even just in this very simple script here. So, let's go back and talk more about what this code is doing. Let's walk through what this code is doing at each step here. So we start off by importing what we need. And we're declaring this within the package that we've named com.sundogsoftware.spark.
You know, that's sort of a convention in the Java world of how you give your packages a unique name. In this case, I'm using a domain name that I own. And then we import the packages that we need. We're just importing everything from Spark and everything from log4j, because we're going to use log4j to adjust the error level in our logging. That's all that is. Next we set up a Spark context, like we talked about earlier. So we set sc to a new SparkContext. The local[*] means that we're going to be running just on our local machine. We don't actually have a cluster running in our living room here, right? So for the purpose of experimentation and learning, we're just going to be running on our own single machine. But that star means that I'm going to let it parallelize itself, at least across the multiple CPU cores that might exist on my machine. So even though it's on one system, it might actually spin up more than one process and take advantage of the various CPU cores that I have. And RatingsCounter is just what we're naming this application. Next we load up the data, and we just create a lines RDD by calling sc.textFile with a relative path to where that data is. In our code, that path is more explicit as to where it actually resides in our course materials, but you get the idea. So it has a path to the u.data file, which is the data file that contains all the ratings information. An example of the format of that is in the upper left corner here. So every row looks like this. It starts off with a user ID, then a movie ID, then a rating, and then a timestamp. So everything is encoded into some sort of numerical ID here. I mean, I don't know who user 196 is or what movie ID 242 is. There's a separate file for looking those up. But user ID 196 rated movie 242 three stars.
And they did it at this epoch time, which translates into some date that we could humanly read, but computers like epoch seconds. So that's just the format of this data. So the first thing we're going to do is load up all 100,000 rows of that file into a lines RDD, where every row represents one row of that data file. Okay? And again, we're starting with RDDs here. In the next section we're going to talk about how to do all this stuff using Datasets instead, which is the more modern API. But again, they're both useful tools. Sometimes RDDs are going to be faster than Datasets. Sometimes RDDs let you do more. So we're starting with the basics, the lower-level API here of RDDs. And you'll find that using Datasets isn't really that much different. But let's start with RDDs here. So right now we have a lines RDD. Next, we need to do something to that data. So the first thing we want to do is extract the information that we actually care about from that RDD. So we're going to call map on the lines RDD, passing in a function for how we want to transform every row of that RDD. And in this case the function is taking every line and calling it x. And then on that line, that string, we're going to convert it explicitly to a string and call split on it, splitting on the tab character because this is tab-delimited data, and extracting just field number 2. Now remember, in computer science we often start counting from 0. So 2 is actually the third column of data. So what this is doing is extracting that third column, which is the rating itself, and nothing else. So it's extracting that rating from every line and inserting that into a new RDD called ratings. Follow me? So we start off with a lines RDD, which is the rows in the upper left-hand corner there. We call a map function on it to apply this function to every single row of the lines RDD to extract that third column of data, the rating itself.
And that goes into a new RDD called ratings, which in this example would just contain 3, 3, 1, 2, 1, one on each line of the ratings RDD respectively. So that's what mapping is being used for there; it's just to extract the data we care about. And, you know, in some cases you might reformat things or put them in a different format that some downstream process wants. But generally that's what mapping is all about. So now we need to do something with all that information. In this case we're going to perform an action, and our action is countByValue. This is what's going to cause our data to sort of collapse and give us back an answer. So that mapping that we did in the previous slide could have been distributed amongst many executor nodes, right? But by saying we want to count up the values, that's going to give us back an explicit answer and pass some result back to our script as a Scala object. So in this case, we're saying ratings.countByValue. That will do what it says: give you a count of how many times each unique value appears. So in this example, we have two 3 ratings. So we get back the value (3, 2), meaning that for the rating 3, there are two of them. For the rating 1, there are two of them as well. And for the rating 2, there's only one of them. So that's what countByValue returns. And in this case it's not giving us back a new RDD. It's passing that back to our driver script, where it goes into a new value called results at the Scala level. So by calling this action, we've taken those results back from the cluster and put them into our local driver script. And at this point we're no longer in Spark land; we're back in Scala land. We're back within our driver script. That's where the data is living at this point. And at that point we just want to sort it and display it. So we have a little simple Scala code here to take that set of results, convert them to a sequence, and sort them by the second field.
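The slide example walked through above can be reproduced on five hardcoded rows. This is a sketch under stated assumptions: the rows are illustrative lines in the user ID, movie ID, rating, timestamp layout described in the lesson, and the application name is made up. The transcript mentions sorting by the second field; this sketch sorts by the rating value instead, so the output comes out in star order like the run that was read aloud, and that choice is my assumption.

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "RatingsToySketch")

// Illustrative tab-delimited rows: userID, movieID, rating, timestamp.
val lines = sc.parallelize(List(
  "196\t242\t3\t881250949",
  "186\t302\t3\t891717742",
  "22\t377\t1\t878887116",
  "244\t51\t2\t880606923",
  "166\t346\t1\t886397596"
))

// Transformation: keep only field 2 (the third column, counting from 0).
val ratings = lines.map(x => x.toString().split("\t")(2))

// Action: returns a plain Scala Map of rating -> count to the driver.
val results = ratings.countByValue()

// From here on this is ordinary Scala, not Spark.
// Assumption: sort by the rating value (first element of each pair).
val sortedResults = results.toSeq.sortBy(_._1)
sortedResults.foreach(println)

sc.stop()
```

The ratings stay as strings here because nothing in the pipeline needs them to be numbers; they are only counted and sorted.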
And then we just print them out by applying the println function to each row of that final result, to each row of that sequence that we defined, right? So again, this is just Scala syntax here. We're taking the results, converting them to a sequence data structure, and sorting it by that second column. And then for every row of that sequence within Scala, we're applying the println function to print out the contents. And that's how we see our final result here, where we see that we have two 1-star ratings, one 2-star rating, and two 3-star ratings. So it's just that easy. Again, let's go run it and just convince ourselves that it works, now that we have more of an appreciation of what the code is doing. So now that we've talked about that code in more depth, if we look at this again, it should make a little bit more sense. So, some things we didn't talk about. First of all, we've wrapped all this code in a RatingsCounter object. That's just sort of, you know, how things need to be structured for a driver script in Scala. So within this RatingsCounter object, we define a main function. This is basically the structure that every driver script needs to conform to for Apache Spark. And the first thing we did, which we didn't really talk about, was setting the error level on our logs to ERROR. And unfortunately, as you saw here, a few warnings slipped through before that code actually got executed by Spark. But once the driver script actually starts to get executed, that's going to prevent any log messages that are anything less than an error from getting displayed. And that's just to reduce our spam and let us see the results that we want to see, more so than a bunch of warnings that we can't do anything about. After that, we're getting into the code that we talked about in the slides. So we create our new Spark context. We call it sc. It's set up to use every core on our local machine, and we call the app RatingsCounter.
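The overall shape of the driver script described here, an object with a main function that sets the log level and creates the Spark context, might look roughly like the sketch below. It is not the course's actual file: the object name RatingsCounterSketch, the helper countRatings, and the command-line override of the path are all hypothetical, and the package declaration mentioned in the lesson is left out so the snippet stands alone. The default path follows the directory and file name mentioned in the lesson.

```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext

object RatingsCounterSketch {

  /** Counts how often each rating occurs in a tab-delimited ratings file. */
  def countRatings(sc: SparkContext, path: String): Seq[(String, Long)] = {
    val lines = sc.textFile(path)                   // one row per line
    val ratings = lines.map(x => x.split("\t")(2))  // third column only
    // countByValue is the action; sorting happens locally in the driver.
    // Assumption: sorted by rating value so results print in star order.
    ratings.countByValue().toSeq.sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    // Only show log messages at ERROR level or above.
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Run locally, using every core on this machine.
    val sc = new SparkContext("local[*]", "RatingsCounterSketch")

    // Hypothetical convenience: allow the path to be passed as an argument.
    val path = if (args.nonEmpty) args(0) else "data/ml-100k/u.data"

    countRatings(sc, path).foreach(println)
    sc.stop()
  }
}
```

Pulling the pipeline into its own function is a design choice of this sketch; it lets the same logic be exercised against a small test file without going through main.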
We load up, from the data/ml-100k directory, our u.data file, which contains a row for each individual rating in that dataset. This is the meat of it all here. This is where we call the map function on the lines RDD and apply this function that splits every line into its individual fields based on the tab delimiter and extracts the third field. Again, we start counting from 0. That happens to be the actual rating value itself. That gets thrown into a new RDD called ratings, which contains each individual rating. We then call the action countByValue to force Spark to go and give us an answer and figure out the optimal way of getting it. Once that executes, we have a results structure here that's just living within our driver script now. So we no longer have our data on the cluster. We've collapsed that back into some results within the script. And now we can take that resulting Scala data structure, convert it to a sequence sorted by its actual second field there, which is the actual rating value, and then apply println to each sorted result to print out the final answer. And let's run it one more time, just for fun. So again, we can just click on RatingsCounter, right-click, and run. But since we already did it once, it's going to be in a shortcut up here as well; we can just hit the play button and do it again. And there it is again. So, cool. All right, so we've talked about the code and how it works, and you have a rough idea of how to use RDDs in practice. Again, we're going to talk about Datasets later, very soon; don't worry, we're getting there. But first let's talk more about what happened under the hood when we ran all of this within our cluster.
10. Spark Internals: So what actually happened under the hood when this all ran? Let's talk about that in a little bit more depth. So when our driver script got to that action, the countByValue command, an execution plan was created from those RDDs that we defined.
So it's got to figure out, what do I need to do to actually get that final answer? In this case, it looks like this, right? So we start off with that text file, the raw data for our RDD. And we are going to call a mapping operation that can be applied in parallel across multiple processes, or even multiple computers, to extract the ratings that we want there. And then finally, we're going to do a countByValue action to add them all up. So the execution plan is going to say, hey, I can do all that first mapping stuff totally in parallel. But that action is going to require these nodes to talk to each other in some way, sort of communicating and collating their results. So what happens is the job gets broken down into stages based on when data needs to be reorganized. So that parallel mapping can all be one stage that doesn't require any sort of reorganization of the data itself. But the countByValue shuffles things around a little bit, right? So that needs to be its own separate stage in the process. And then every stage gets broken into tasks, which can be distributed across the cluster. So maybe in stage one, you know, those purple arrows are processing parts of the data and that will go off to one executor, maybe the green ones will go off to another executor, and the blue one will go off to a third executor, right? So it thinks about, how do I distribute this data effectively? And that countByValue operation, well, that's not easily parallelizable. So that becomes sort of a different strategy there. And finally, the tasks are scheduled across your entire cluster and executed, and you get your answer. So under the hood, that's what went on to actually get our final answer of how many of each rating type existed in our dataset. It's not that hard, guys, but under the hood, that's kind of what's happening.
11. Key / Value RDD's, and the Average Friends by Age Example: So we saw some very simple RDDs in use with our ratings counter example previously. But there's also a special kind of RDD called the key/value RDD. And with these RDDs, there are some additional operations you can perform that can come in handy. And we'll look at this with an example, our friends by age script here. What we're going to try to do at the end of the day here, or at the end of this lesson anyway, is figure out how to compute the average number of friends by age. So we're starting off with a dataset that has a bunch of individual people, their names, their age, and how many friends they have. Imagine that we got this from some social network somewhere, right? So in this case we're going to structure our data into key/value pairs. RDDs can hold these things. In this case the key will be the age and the value will be the number of friends, because we want to look things up by the age, right, and consolidate things and reduce them by that age. So instead of just a list of ages or a list of numbers of friends in an RDD, we can store this in a more structured manner, where every row of our RDD consists of a tuple of a given age and a number of friends for an individual, the age and number of friends for another individual, et cetera, et cetera. So we now have these key/value pairs on each row of our RDD. How do you create these syntactically? It's really nothing too special in Scala. All you need to do is map pairs of your data into the RDD using tuples. So for example, if I just wanted to take an RDD that had one thing in each row and create a key/value pair with that thing as the key and the number 1 as the value, I could just call a map operation that says take the row, with all of its text as x, and transform that into the tuple (x, 1). And that becomes a key/value pair where the key is x and the value is one. That's all there is to it.
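Turning a plain RDD into a key/value RDD, as just described, is a single map that emits a tuple. A minimal sketch, with invented sample words and application name; the second map shows a value that is itself a tuple.

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "KeyValueSketch")

val things = sc.parallelize(List("spark", "scala", "spark"))

// Each row x becomes the pair (x, 1): key = x, value = 1.
val pairs = things.map(x => (x, 1))

// The value does not have to be a single thing; it can be another tuple.
val richer = things.map(x => (x, (x.length, x.toUpperCase)))

val pairsResult = pairs.collect().toList
val richerResult = richer.collect().toList

println(pairsResult)
sc.stop()
```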
You now have a key/value RDD, and you're not limited to having one thing in the key or one thing in the value, either. You could have tuples or other sorts of objects as values as well. So instead of one, I could have another tuple embedded within that. So I could have x, comma, open parenthesis, you know, something comma something comma something else, close parenthesis, if I wanted to as well. So you can have more complex data types embedded as part of your value too; sometimes that comes in handy. So once you have these key/value pairs in your RDDs, Spark can do some special stuff with them. There are some new functions that become available to you. One is reduceByKey, and that's useful for combining all the values for the same key together somehow. So basically you provide a function that defines, how do I combine all the values together for a given key? In this case, we're passing in this little function that says (x, y), funky arrow, x + y. So that means that to combine the values for a key, I'm going to use the addition operator to combine them together. And it's a little bit confusing what's happening under the hood here. So it might be keeping a running total as it goes through every value for a unique key under the hood there. So x might represent our current running total, and y might represent the new value that it's seeing from a new row and adding into that. So what's important here is that operator on the right-hand side. In this case, we're saying explicitly, we want to add up all the values in this RDD and give us back an answer for each unique key: what's the sum of all the values associated with that key? Make sense? If it doesn't, we'll have an example that will make more sense when you see it in action. We can also say groupByKey. So that will just give you back a group of all the values for any given key. 
It's not necessarily reducing things down, but it just lets you organize your data better and maybe pass that on to some other operation later on. We can also sortByKey. So if you have key/value pairs, you can do that sorting within your cluster. Remember back in our previous example for the ratings counter, we actually got our results back at the Scala level within our driver script and sorted them there. That's really not the most scalable way of doing things, right? It would be preferable if we could sort on the cluster. And if we had a key/value pair instead, we could have done that. Also, we have keys and values, so we can extract all the keys or extract all the values from a key/value RDD. If you need to go back in the other direction and create an RDD of just the keys or an RDD of just the values, that allows you to split them back out from a key/value pair. Also, it's possible to do SQL-style joins if you have two key/value RDDs, and we will have an example of that later on. I mean, this does get into the area where you might ask yourself why you're using RDDs if you're doing SQL operations. So I'll just note that you can do it. But in practice, obviously, you'll probably be using Datasets or DataFrames or the Spark SQL API to do these sorts of operations in modern Spark. Also, if you've gotta do some sort of a mapping operation on a key/value RDD and you're only going to be transforming the value part, a little trick is to use mapValues or flatMapValues. That way you can apply your transformation just to the value part, and that's going to be more efficient than trying to figure out how to keep the key off to the side and only affect the value part. So it's both a more efficient and an easier way of applying a mapping operation to just the values of a key/value RDD. So remember that. So let's dive into our example a little bit more. 
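Before the example, here is a rough sketch of what those key/value operations compute. This is an assumption-laden illustration, not Spark itself: plain Scala collections stand in for RDDs so it runs locally, and the helper functions below are hand-written stand-ins named after the RDD operations mentioned in the lesson.

```scala
object KeyValueOpsSketch {
  // Hypothetical (age, numFriends) pairs; keys may repeat.
  val pairs: Seq[(Int, Int)] = Seq((33, 385), (33, 2), (55, 221))

  // Stand-in for reduceByKey(f): combine all values that share a key.
  def reduceByKey[K, V](data: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Stand-in for groupByKey: collect the values per key without combining them.
  def groupByKey[K, V](data: Seq[(K, V)]): Map[K, Seq[V]] =
    data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Stand-in for mapValues(f): transform only the value, keep the key as-is.
  def mapValues[K, V, W](data: Seq[(K, V)])(f: V => W): Seq[(K, W)] =
    data.map { case (k, v) => (k, f(v)) }

  val sums = reduceByKey(pairs)((x, y) => x + y) // sum of values per key
  val grouped = groupByKey(pairs)                // all values per key
  val sortedByKey = pairs.sortBy(_._1)           // what sortByKey orders by
  val keys = pairs.map(_._1)                     // like rdd.keys
  val values = pairs.map(_._2)                   // like rdd.values
  val withOnes = mapValues(pairs)(x => (x, 1))   // value becomes a tuple
}
```

On a real cluster these operations run in parallel across partitions; the local versions only show the input and output shapes.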
So again, what we're trying to do is figure out the average number of friends for a given age, given some little fake dataset here. So again, imagine we have a social network of some sort, where we have a bunch of people that are in our social network. And for each person, we know their age in years, and we know how many friends they have within our social network. So for example, each row might look like this: some sort of user ID, followed by their name, followed by their age, followed by their number of friends. So Will is 33 years old and has 385 friends. Jean-Luc is 33 years old and has two friends. Hugh is 55 and has 221 friends, and so on and so forth. So the first thing we need to do is map that input data and split it out into some sort of a structure. And here we're defining a parseLine function that takes in each individual line of that RDD and splits it out based on the comma delimiter, extracts the age field and calls it age, extracts the number of friends field and calls it numFriends. And then we return a tuple, a key/value pair of (age, numFriends). So the age in this case is the key and the value is numFriends. And we see here in the next line that we create the lines RDD by loading up the raw fakefriends.csv data. We then call map with that parseLine function to transform that raw data into a new RDD called rdd that contains key/value pairs. So the output here of rdd will look like this at the bottom: key/value pairs. For example, 33 is the key, the age, and 385 is the number of friends. So again, age, number of friends. The key is age, the value is number of friends. Another 33-year-old had two friends. A 55-year-old had 221 friends, so on and so forth. And you can see we can have duplicate keys here, right? So keys do not have to be unique in this context, not till we reduce things. Next, we do want to start reducing this data down though. So in order to compute the average, we need to have two things, right? 
We need to have the grand total sum of how many friends existed for a given age, and how many people that represents. So to compute the average, we're gonna need that total number of friends for a given age and the number of people that existed for that age. And by dividing those two, we'll get the average that we wanted at the end of the day. So the next step is to compute those totals. And that's what's going on here. Now, you'll often see in Spark driver scripts people kinda munging things together into one big long line. So you may as well get used to it. Now, let's break down what's happening in here, right? So let's break this down into its two components. We're starting off with rdd.mapValues(x => (x, 1)). So what this is going to do is take each value and transform that into its own tuple of that value and then one. Now the method to this madness is that by associating a one with every single value there, I can add up all those individual entries and get the grand total of how many existed for that age. Okay. So that one is just saying I'm going to count each individual row as one, and I'm going to sum that up later on to get a grand total. Little bit of a trick there. So after this point, when we run that mapValues operation, this is what our RDD will look like. So for example, (33, 385), which was telling us that a 33-year-old had 385 friends, is going to be transformed into the same thing: the key is still 33, but the new value is a tuple of (385, 1). And we do this for every single row in that RDD. And next we say .reduceByKey. So we didn't explicitly assign the result of that to a new named RDD. We're just going to pass that new RDD directly into another operation here, in this case the reduceByKey transformation. So this operation here will add up both elements of that tuple in the value, right? 
So what's happening here is we're saying we're gonna take two inputs here from the RDD that we're reducing, call them x and y. We're going to add together the first element of both tuples and the second element of both tuples, and add them all up for each unique key. Okay? So see where we're going with this. So at this point, what we have here is, for all of the 33-year-olds out there, we've reduced that data down into the grand total of all the friends that the 33-year-olds had and the total number of 33-year-olds that existed. So think about this for a minute. It's a little bit tricky. Our strategy here was, going back here, okay, we mapped everything to these tuples. For each individual 33-year-old, we had a tuple of the number of friends that person had and the number one. And the trick here is that by doing this and adding them all together, if we add together all the first elements of that tuple, how many friends each 33-year-old had, we get the grand total of how many friends 33-year-olds had. And then by adding together all those ones, we get a count of how many 33-year-olds there were. And that gives us the two things that we need to compute the average that we wanted. And that's what we do in the next step. We say mapValues, and all we do is go through every value, every tuple that we ended up with, and divide the first element by the second to get the final average number that we want in the final result. Okay? At that point we can just collect and display the results. So we're going to say collect to get back those reduced, averaged values. And we're going to sort them and call println on each one to display the final result. So with that, let's go ahead and actually run it in our next lecture. 12. [Activity] Running the Average Friends by Age Example: All right, let's dive into the code and see how it all works in action here, and we'll run it and see if it works. So this is all the stuff we talked about in the previous slides. 
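The steps from the slides can be sketched end to end as follows. Treat this as a local approximation under stated assumptions rather than the course script itself: a Seq of strings stands in for the lines RDD, groupBy plus reduce stands in for reduceByKey, the average uses integer division because both totals are Ints, and the object and function names other than parseLine are invented for this illustration.

```scala
object FriendsByAgeSketch {
  // Split "id,name,age,numFriends" on commas and keep only (age, numFriends).
  def parseLine(line: String): (Int, Int) = {
    val fields = line.split(",")
    val age = fields(2).toInt
    val numFriends = fields(3).toInt
    (age, numFriends)
  }

  def averageFriendsByAge(lines: Seq[String]): Seq[(Int, Int)] = {
    val rdd = lines.map(parseLine)
    // Like mapValues(x => (x, 1)): pair every friend count with a 1.
    val withOnes = rdd.map { case (age, friends) => (age, (friends, 1)) }
    // Like reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)):
    // sum the friend counts and the ones separately for each age.
    val totalsByAge = withOnes.groupBy(_._1).map { case (age, rows) =>
      age -> rows.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    }
    // Like mapValues(x => x._1 / x._2): total friends divided by head count.
    totalsByAge.map { case (age, (total, count)) => (age, total / count) }.toSeq.sorted
  }
}
```

For example, with rows shaped like the ones quoted in the lesson, FriendsByAgeSketch.averageFriendsByAge(Seq("0,Will,33,385", "1,Jean-Luc,33,2", "2,Hugh,55,221")) yields one (age, average) pair per distinct age.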
All right, so let's walk through the code here one line at a time. So we start off declaring what package we live within; that's sort of a standard thing for any Scala script. And then we import the packages we depend upon. In our case, we're being a little bit lazy and just importing everything from Apache Spark and also from Log4j. We then define the object that our script lives within. We're going to call it FriendsByAge, which should match the name of the file. I'm going to skip the parseLine function for now and go down to the main function, because that's where execution will start. The first thing we're doing, again, is setting the log level to error so that we don't get a bunch of log spam about warnings that we can't do anything about. There will be a few that slip through before this gets executed, but we'll catch most of them, at least. Next we create our SparkContext, again set up to run on our local machine using every CPU core. And we'll give our application the name FriendsByAge. Next thing we do is load up our raw data, and that's the fakefriends-noheader.csv file. That's a version of the data that has no header row, because that can mess you up. So be cognizant of that. And we're going to load that up into a lines RDD where every line represents one row of that data. So what does that data look like? Let's take a quick look at it just to remind ourselves. So it looks like this. Remember the first column is the user ID, the second is the name, and then the age and the number of friends, and all we care about for our problem is those last two columns, right? I don't really care what the user ID is, I don't care what their name was. I'm just trying to find out the average number of friends for every given age, so I can throw away those first two columns. And that's what we do in the parseLine function here. So we're going to take the parseLine function, map that to every one of those input rows and get a new RDD called rdd. 
Let's see what parseLine does. So what that's doing is splitting up each line by a comma, because it is a comma-separated value file. And we extract fields 2 and 3. Those are the last two columns in our data. We'll call them age and numFriends and return a tuple of age and numFriends. And again, that tuple will become our key/value pair in our RDD called rdd. So now age is going to be our key and numFriends will be our value for every individual person that we have in our dataset. So that's where we're at. And now we have that really complicated line that we spent a lot of time talking about. 13. Filtering RDD's, and the Minimum Temperature Example: So next let's talk about filtering your RDDs. Filtering is a very simple operation that just lets you clean data out of your RDDs, and you'll find the same concept carries over to the other APIs that Spark offers as well. We'll illustrate this with a couple of examples using weather data. This is real-world data from two different weather stations in Europe. And what we're going to do is use filtering to sort of clean that data before we mine it for the minimum and maximum temperatures recorded within a given year at those weather stations. Using a filter is pretty simple. All you do is provide a function that returns a Boolean. And if that Boolean is true, then we keep that line; otherwise, we remove it from the RDD. So for example, if we wanted to filter out entries that don't have TMIN as the second item in the tuple of data, we could say parsedLines.filter, assuming that parsedLines is the RDD that contains those tuples, and pass in a function where every row of data is called x. And we check if the second element of x is equal to TMIN. If it is, then we keep it. If it isn't, then we remove it in the resulting minTemps RDD. So minTemps contains the filtered results of parsedLines. Here's an example of the raw data that's in our source data for this example. 
So what we're going to try to do here is find the minimum temperature within a year for a given weather station. And that first identifier, that first column of information in our input data, is the station identifier. So that just represents where that temperature reading was taken. After that is the actual year and date that that measurement was taken. So you can see this is a pretty old dataset going back to the year 1800. And you can see that we have many different types of data being recorded in this dataset. TMAX indicates the maximum temperature recorded on that day at that weather station. TMIN is the minimum temperature recorded, and PRCP is the amount of precipitation recorded. And the format of that temperature data is a little bit weird. It's degrees Celsius multiplied by 10. So negative 75 is actually negative 7.5 degrees Celsius. We'll need to deal with that as we parse this data and import it. Here's what the code looks like that parses out that data. So for every line that we read in from that 1800.csv text file, we're going to use that parseLine function to actually make sense of it. You can see that it first splits it up based on the comma delimiter. It extracts the station ID, which is the first field, and the entry type, which is the third field. And finally the temperature number itself. And you can see here that we start off by multiplying that by 0.1 to get to degrees centigrade. And then we convert that to degrees Fahrenheit, because I live in the United States, which I think is one of the last countries that still uses the Fahrenheit scale. But anyway, if you want to leave that part off and keep it in centigrade, fine by me, I won't tell anyone. What we return is a tuple that contains the station ID, the entry type, which in this case will be TMIN, TMAX or PRCP, and the temperature associated with that entry. So next, we need to filter out the information that we don't want. 
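Before moving on to the filter, here is a minimal sketch of the parsing step just described. It follows the lesson's description (split on commas, take fields 0, 2 and 3, scale by 0.1, convert to Fahrenheit); the wrapping object name is an assumption made for this illustration, and the station ID in the usage note is a placeholder, not a real one.

```scala
object ParseTemperatureSketch {
  // Turn one CSV row into (stationID, entryType, temperature in Fahrenheit).
  def parseLine(line: String): (String, String, Float) = {
    val fields = line.split(",")
    val stationID = fields(0) // first column: where the reading was taken
    val entryType = fields(2) // third column: TMIN, TMAX or PRCP
    // Fourth column is tenths of a degree Celsius: scale to Celsius,
    // then multiply by 9/5 and add 32 to get Fahrenheit.
    val temperature = fields(3).toFloat * 0.1f * (9.0f / 5.0f) + 32.0f
    (stationID, entryType, temperature)
  }
}
```

For a row such as "STATION_A,18000101,TMIN,-75", the raw value -75 means -7.5 degrees Celsius, which works out to about 18.5 degrees Fahrenheit. Note that the same conversion is applied to PRCP rows too, where it is meaningless, which is one reason those rows get filtered out next.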
So we're only interested in TMIN entries, because the question I'm answering is, what was the minimum temperature throughout the year for each weather station? So I don't care about the maximum temperatures that were recorded. I don't care about the precipitation. I'm going to throw all that information out. So by just saying parsedLines.filter, checking for whether that second field is equal to TMIN, I can throw out all the data where that isn't true, right? So if x._2 is equal to TMIN, I will preserve that row in the resulting minTemps RDD. Otherwise I'm going to filter it out. So this will get rid of all the other stuff besides TMIN entries. So that's filter in action. After that, we can create the station ID / temperature key/value pairs, because we want to do some tricks with key/value RDDs, drawing on our previous lesson. So we're going to map each of those minimum temperature rows to a new tuple that just consists of the station ID and the temperature. We no longer need that TMIN entry type field there, because we know that everything is a TMIN, so that's just wasted space, right? So we're going to convert that to just a station ID and the minimum temperature for every single row that qualified. And we'll call this new resulting RDD stationTemps. And then to find the minimum, all we have to do is say reduceByKey. So again, we're using sort of a key/value trick here and doing a reduce operation on it. So basically we want to take the running value that we have for the current minimum temperature and compare that to a new row. So that's going to be x and y. And we're going to run the minimum operation, min, to only preserve the minimum value seen. So basically it's going to keep going through, row by row, all the minimum temperatures for a given key, where every key represents a weather station. And it's gonna keep track of the minimum temperature seen. So it'll look at the first row. 
Look at the second row: is the second row less than the first row? Okay, great, that's the new minimum. Is the third row less than the second row? No. Okay, the second one remains the minimum still. So it just keeps going through, keeping track of the minimum temperature that is encountered. And what we'll end up with is a much smaller RDD called minTempsByStation that just contains all the unique keys, in this case just two different weather station IDs, and the minimum temperature seen for each station. At that point, all we have to do is collect the results. So we'll call the collect action to force our RDDs to go and actually do something on our cluster. And then we'll iterate through each result after sorting them and extract the station ID and the temperature. We will format the temperature using a printf format like we talked about back in the introduction to Scala lessons, and print it out again using the substitution format that we talked about as well. So we're kinda putting some of that string formatting stuff we learned earlier in the course into action here. So we're just going to print out the station ID, the words "minimum temperature:", and then the formatted temperature with two digits of precision after the decimal point. So let's go and do it and see if it works. 14. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum: So if you open up the MinTemperatures script in IntelliJ here, you should see this script here. Let's walk through it line by line. And given that you might be new to Scala, I'm gonna spend a little bit more time going through the syntax here. It's a pretty simple example, so it shouldn't take too long. So let's open up this import block here just to make sure we see everything that we're using. So as usual, we start off declaring the package that we're in, com.sundogsoftware.spark, and we import all the stuff we depend on. 
We're importing everything from Apache Spark and from Log4j. And also we need scala.math.min in order to use that min function to keep track of the minimum temperatures that we encounter. So we declare a MinTemperatures object to match the name of our file here. And again, I'll skip past the parseLine function for now and go straight to the main function, because that's where execution starts. First thing we do is set the log level to error to try and get rid of all those warning messages that will clutter up our results. We then create a SparkContext. And here we're saying the master will be local[*], saying that we'll run it on our local machine using all the CPU cores available, and the app name is MinTemperatures. Now let's talk a little bit about the syntax here. So "master =" and "appName =" are just little notations that IntelliJ inserted for us for display purposes. Generally speaking, you don't have to actually say "master =" or "appName ="; you can just pass in the parameters and it will line them up to the expected parameters for that function. So we know that a SparkContext constructor starts with the master and then the app name for the first two parameters. But if you do want to be more explicit in naming them, you can do it this way as well. That syntax is also valid, and that can lead to more readable code. So if I was looking at this outside of IntelliJ, I might not really know what MinTemperatures means or what local[*] means. By explicitly putting in those parameter names, it makes the code a little bit more readable and easier to understand. We saw the same thing up here on Logger.getLogger, where it shows name = "org". We didn't have to say "name ="; we can just say "org". But that makes it more clear that the name of that parameter that we're setting is name, and that tells us what it is. 
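The point about parameter names is general Scala syntax, so here is a small sketch of it that needs no Spark. The describe function is a hypothetical stand-in that simply takes a master and an app name in that order; it is not part of any library.

```scala
object NamedArgsSketch {
  // Hypothetical function with the same first two parameter names as in the lesson.
  def describe(master: String, appName: String): String =
    s"$appName on $master"

  // Positional: arguments are matched to parameters by order.
  val positional = describe("local[*]", "MinTemperatures")

  // Named: same call, but the parameter names are written out in the source.
  val named = describe(master = "local[*]", appName = "MinTemperatures")

  // With names, the order no longer matters.
  val reordered = describe(appName = "MinTemperatures", master = "local[*]")
}
```

All three calls produce the same string. The difference between the IDE's display hints and real named arguments is that named arguments are actually typed into the source file.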
After we get the logger for the "org" name, we are setting the level to Level.ERROR to say that we only want error messages and no warnings or informational messages. Moving on, first thing we do is load up our input data. So we say val lines. So we're making an immutable value here called lines that's set to the text file data/1800.csv. So that will take every row of that CSV file and just put the raw text of each row into the lines RDD. Let's take a quick look at what that data looks like. So this is a visualization of that CSV file in Microsoft Excel. You can see the first column is the station ID, and there's only a few of them in here, followed by the date, which is in year, month, day format. So 1800, January 1st. And this entire file contains all the data for the year 1800. So we don't have to do anything special to say we only want data from the year 1800. That's all that we have. This third column is what type of measurement is recorded in this line of data. So maximum temperature, minimum temperature or precipitation. And the next one is the value associated with that measurement. And again, it's kind of a weird format. So I think the technical word is like deci-centigrade or something. I don't know. Basically it's set to degrees centigrade times 10. So again, negative 75 just means negative 7.5 degrees Celsius. The remaining columns are unused for our purposes, so we can just pretend they don't exist. So that's what we're reading into our lines RDD. We then use our parseLine function and apply that to map every row of the lines RDD to produce a new RDD called parsedLines. So let's take a look at what parseLine is doing. So again, we'll slow down a little bit here on the format. Remember, the way that we declare functions in Scala is we start with def, the name of the function, and any parameters. And remember we declare parameters a little bit backwards from other languages. So we start with the name, a colon, and then the type. 
A lot of other languages are type then name; in Scala, it's name then type. So the parameter is called line and it's expected to be a String. And then we have a colon, and this is what's going to be output by the function. So again, that's backwards from many other languages. A lot of times you put the return type before the name of the function; in Scala, it's after it. So go figure. So what we're returning is a tuple that contains a String, a String, and a floating point number. And we set this function equal to the following block of code. First, we start off with a fields value that's going to split up that line using the comma delimiter. And again, the "regex" name of that parameter is being inserted here for our convenience. We then extract the first field. We start counting from 0 here, so fields(0) is extracting the first member of that and calling it stationID. We extract the third field and call that entryType, and the fourth field we call temperature, after converting it to Fahrenheit. So what's going on here? We're taking that fourth field, which is a string to start off with, because we split that out of string data. So we have to explicitly convert that to a floating point number before we can do numerical operations on it. So we're saying toFloat to make that into a floating point number. We're going to multiply that by 0.1 because, as you recall, everything was multiplied by ten in the source data. So at this point we have actual degrees centigrade. We then do the conversion to degrees Fahrenheit by multiplying by nine-fifths and adding 32. If you wanted to stick with centigrade, that's cool. Just take out that part and you'll be in centigrade. What we return is the final thing that's in this function, so we don't have to explicitly return it. The last thing that's in there is implicitly returned by this function. 
And that's going to be a new tuple that consists of the station ID, which again is a string; the entry type, which again is a string, and that can be TMIN, TMAX or PRCP; and the temperature that we calculated as a floating point value. So that's what each individual row of our resulting parsedLines RDD is going to contain at this point. Next we want to filter out everything but the TMIN entries, because all we're trying to compute is the minimum temperature for the whole year for each station, right? So we can throw out all those TMAX entries. We can throw out all those precipitation entries. And we do that with this line here. So we're saying filter, which is the whole point of this exercise, passing in this little inline function here. So we're going to call each incoming row of the parsedLines RDD x. We are then going to take the second field out of x. And recall that will be the entry type. So looking up here, the second field is entryType. And we're going to check if that entry type equals TMIN. If that returns true, then that row gets passed on into the minTemps RDD; otherwise it filters that entry out and it does not go into minTemps. So the resulting minTemps RDD only contains the rows that match TMIN in that second field. Next we're going to throw away that TMIN, that entry type field, because we don't need it anymore. We know that every row is a minimum temperature at this point. So we can map this again to just extract the first field, which is going to be the station ID, and the third field. And just to be safe, we'll explicitly say that's a floating point number. So our new stationTemps RDD is just going to contain station IDs, which are strings, and temperatures, which are floating point numbers in Fahrenheit. Now we can say reduceByKey, just keeping track of the minimum value that we encounter as we go through every row of stationTemps. 
So as we go through stationTemps, we're going to keep comparing the current minimum to each incoming row and return the minimum between those two. That will have the effect of keeping track of the lowest recorded temperature for each key, because we're reducing by key. So remember each key is a station ID. So what minTempsByStation ends up with is a much smaller RDD that just contains one row for each unique key, containing that key name and the minimum temperature associated with that weather station ID. We then collect the results back to our script and put that in the results object. We then sort it with results.sorted and iterate through it with this for loop. So for (result <- results.sorted), that goes through every entry of results.sorted and extracts each one as result. For each one, we extract the first field of that tuple and call it station. The second field is temperature. We format that temperature. The f prefix means we're going to use a printf format, and the %.2f means that we want two digits of precision to the right of the decimal point. And we stick an F at the end to indicate that it's Fahrenheit. Then we just have to print it out with println. And we're going to substitute using $station for the station value, then "minimum temperature:", and then $formattedTemp, the dollar sign again indicating that we're substituting in that immutable value called formattedTemp. So let's run it and see what happens. Right-click on MinTemperatures and run MinTemperatures. Off it goes. And there we have it. So we can see that for the two weather stations that had minimum temperatures associated with them, the first one was called EZE something or other, and the minimum temperature recorded there was 7.7 degrees Fahrenheit. And for ITE whatever it is, the minimum temperature was 5.36 degrees Fahrenheit. So it appears to have worked. Very cool. Now if you want to get your hands dirty, I always encourage you to go hands-on here. 
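As a recap of the script just walked through, here is a rough local sketch of the filter, map, reduce and format steps. The assumptions are the usual ones for these sketches: a Seq of already-parsed tuples stands in for the parsedLines RDD, groupBy plus reduce stands in for reduceByKey, results are sorted by station ID only, and the object and function names are invented for the illustration.

```scala
import scala.math.min

object MinTemperaturesSketch {
  def minTempsByStation(parsedLines: Seq[(String, String, Float)]): Seq[(String, Float)] = {
    // filter: keep only rows whose entry type (second field) is TMIN.
    val minTemps = parsedLines.filter(x => x._2 == "TMIN")
    // map: drop the entry type, leaving (stationID, temperature) pairs.
    val stationTemps = minTemps.map(x => (x._1, x._3))
    // Like reduceByKey((x, y) => min(x, y)): keep the lowest value per station.
    stationTemps.groupBy(_._1).map { case (station, rows) =>
      station -> rows.map(_._2).reduce((x, y) => min(x, y))
    }.toSeq.sortBy(_._1)
  }

  // f"..." applies printf-style formatting; s"..." substitutes values by name.
  def formatResult(station: String, temp: Float): String = {
    val formattedTemp = f"$temp%.2f F"
    s"$station minimum temperature: $formattedTemp"
  }
}
```

Calling minTempsByStation on a handful of made-up rows and then println(formatResult(station, temp)) for each result reproduces the shape of the output described above, one line per station.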
Here's a really quick challenge for you. What if you wanted to keep track of the maximum temperature for each weather station instead of the minimum temperature? Well, the changes associated with that should be pretty straightforward. If you want a little bit of a challenge, hit pause right now and go and try that yourself. See if you can modify this script to print out the maximum temperatures instead of the minimum temperatures. So hit pause if you want to do that. If not, I'm gonna show you how I did it. So it's pretty simple what you need to change, right? So first of all, you want to change the name of the object and the file to MaxTemperatures for consistency. The parsing of the data will be the same, right? That's not going to change. The data format hasn't changed on us. But what we extract is going to be different, right? minTemps will change to maxTemps, probably. What we're going to do is filter for TMAX instead of TMIN, so that'll throw away everything but the TMAX entries. And then down here when we do our reduceByKey operation, we just want to change the min to a max. And that should pretty much be it. Change how we're displaying the results to say it's a maximum temperature. I guess we have to import scala.math.max as well, right? So pretty straightforward stuff, but it's good to get your hands on and sort of get some confidence in making changes to Scala code and Spark code and seeing that you can do it. It's more about your confidence than anything. If you want to see how I did it, just double-click on the MaxTemperatures script here. And it does everything I just said. So we're importing scala.math.max. I changed the name of it, changed the filename. We are now calling it MaxTemperatures. We're filtering for TMAX and reducing based on the max function and printing out the max temperature. That's pretty much it. Let's go ahead and run that and see what it does. 
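The changes just listed amount to swapping TMIN for TMAX and min for max. A minimal local sketch of that variant, under the same assumptions as before (a Seq of parsed tuples instead of an RDD, groupBy plus reduce instead of reduceByKey, invented object and function names):

```scala
import scala.math.max

object MaxTemperaturesSketch {
  def maxTempsByStation(parsedLines: Seq[(String, String, Float)]): Seq[(String, Float)] = {
    // Keep only TMAX rows this time.
    val maxTemps = parsedLines.filter(x => x._2 == "TMAX")
    val stationTemps = maxTemps.map(x => (x._1, x._3))
    // Like reduceByKey((x, y) => max(x, y)): keep the highest value per station.
    stationTemps.groupBy(_._1).map { case (station, rows) =>
      station -> rows.map(_._2).reduce((x, y) => max(x, y))
    }.toSeq.sortBy(_._1)
  }
}
```

Everything else, including the parsing, stays the same as in the minimum temperature version.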
And there you have it. Turns out they both have the same one. That's kind of curious, isn't it? So for both weather stations, the max temperature was 90.14 degrees Fahrenheit. And I'm guessing that corresponds to a nice round number in centigrade that just happened to be the same for both of those stations. They're not too far apart from each other, so it's not too huge of a surprise. And there you have it: filter in action, and also another example of writing a simple Apache Spark script using RDDs. You will find that this isn't a whole lot different in the other APIs, which we'll be talking about real soon. 15. [Activity] Counting Word Occurrences using Flatmap(): So we've seen several examples of using the map operation in Spark so far. Let's talk about flatMap, which is a little bit different. And we have a simple example to illustrate it, where we count up the number of occurrences of each word in a book to show how it works. So let's review what map does. So for example, if we had an input text file that contained "The quick red" on one line, "fox jumped" on another line, "over the lazy" on another line, and "brown dogs" on the last line, we could read that into an RDD called lines where every line of text corresponds to a row of that RDD. And then in this case, we're calling map on the lines RDD, passing it a function to convert each line to uppercase. So what we get is a new rageCaps RDD that just capitalizes every single line, or every single row rather, of that input RDD. So the thing to note here is that there is a one-to-one relationship between the input RDD and the output RDD. When you call map, map takes every input row and converts it in some way to one and only one output row. So we started off with four rows in the lines RDD, and we end up with four rows in the output rageCaps RDD. Map always has that one-to-one relationship. One row comes in, one row goes out. flatMap, however, removes that restriction. 
So with flatMap, you can return any number of rows for an input row. It could be 0, it could be 20, it could be 200, anything you want. So in this different example here, we start with the same input RDD called lines, but now we're calling flatMap. And we're passing in a function that takes in each row, calls it x, and then splits it up based on the space character. So this has the effect of splitting up each row into its individual words. And because we're calling flatMap, every individual word ends up on its own new row in the resulting words RDD. So using flatMap here, we started off with four rows, but we ended up with many more rows in the output RDD, words, where we have one row for every word as opposed to one row for every line. And again, flatMap can output any number of rows you want, including 0. To see this in action, let's do a simple little code sample where we count how many times each word occurs within an entire book. And I'm not trying to sell you this book, even though I wrote it. It has nothing to do with Apache Spark, but it is a book that I own the rights to, so I can use it as an example without paying royalties. So let's go ahead and dive in and see how often each individual word appears in my book. We have a very simple script here to illustrate the use of flatMap. It's called WordCount. So go ahead and open up WordCount in IntelliJ if you'd like to follow along. It's a really simple script, and this is often used as sort of the "Hello World" example for Spark programs. We start off, as always, by declaring our package, importing what we need for dependencies, and declaring the object that we live within. Our main function starts off by setting the log level, as always, and declaring a new SparkContext running on our local PC. Now we get into the meat of it. We call sc.textFile on the SparkContext with a path to the entire text of that book, which lives in the data/book.txt file.
And we call that resulting RDD our input, where every row of that input RDD corresponds to a line of text from the book. And now, just like we saw in the slides, we call flatMap on that, splitting every row by the space character, which more or less gives us the individual words within that book. So now flatMap has generated a new words RDD, where every row of the words RDD is a word in the book, and there can be repeated words in there. Every individual word, even if it occurs more than once, is going to be one row in that RDD, in sequence. And now we're going to use the countByValue action to count up how many times each unique word appears in the words RDD. So that's a one-line way of doing what would otherwise be a pretty complicated operation, right? And the beauty of Spark is that it can actually parallelize that across an entire cluster if you want to, or in our case, every core of our CPU. All we then do is iterate through each result in that word count object that we got back from Spark, and call println on each of those results to print out each result on its own line. So let's go ahead and run it and see what happens. We'll right-click on WordCount, say Run WordCount, and off it goes. And even though this is an entire book, it should make short work of it. There we have it. Interesting. So if you scroll through this, you can see how many times each word appears in my book. The word "cash" appears 18 times, "only" 77 times, "self-employed" only 10 times. Kind of surprising there, because it's a book about self-employment. But there are some sort of unique words here too, like "idea" only shows up once, and "drop-shipping" only shows up once. So cool stuff. At any rate, it worked. That's an example of flatMap in action. But as you look through these results, you can see that there are some issues here that we might want to work on. So let's explore this in subsequent lectures. For example, depending on capitalization, it might treat these as different words.
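The script walked through above could be sketched as follows. This is a minimal reconstruction from the description, not the course file itself: the object and helper names are hypothetical, and the log level is set through sc.setLogLevel here instead of whatever the original script uses.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object WordCountSketch {
  // Split every line on spaces, then count how often each distinct string occurs.
  def countWords(lines: RDD[String]): scala.collection.Map[String, Long] = {
    val words = lines.flatMap(x => x.split(" "))
    words.countByValue() // an action: the counts come back to the driver as a Map
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "WordCountSketch")
    sc.setLogLevel("ERROR")

    val input = sc.textFile("data/book.txt")
    countWords(input).foreach(println) // prints one (word, count) pair per line
    sc.stop()
  }
}
```

Because countByValue returns a local Map, the result has to fit in the driver's memory, which is fine for a single book.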
Was the word "products" really only included once? Well, all that really means is that "Products" only appeared once with a capital P, maybe at the beginning of a sentence or something, right? So there are ways of improving these results to be more interesting. And there are also weird things, like is "open-ended" one word or two? What about punctuation? Should I be stripping off those question marks and commas and periods? "owners," with a comma appears two times, but that's not really telling us about the word "owners". So there's some room for improvement in how we parse out this data. So let's explore that.

16. [Activity] Improving the Word Count Script with Regular Expressions: Okay, so to review, in our previous lesson we did a simple example of using flatMap to count up how many times each word in my book appears. And we very naively just parsed that data by splitting it up based on space characters to try to get individual words, which at first blush you'd think would work. But as you can see in the results here, it doesn't. We end up getting punctuation in there. So for example, "Why" with a leading quotation mark counts as its own unique word. Capitalization also matters here, so "Why" with a capital W would be counted as a different word than lowercase "why". We have things like commas, punctuation marks and capitalization affecting our results. So let's fix that. Let's close out of WordCount.scala and instead open WordCountBetter. And here we're going to use a regular expression to get better results. The only thing really different here is that instead of having that function that splits up the input data based on a space character, we're going to use a regular expression. So in this case we're going to say x.split and set the regular expression to backslash, backslash, capital W, plus, that is, "\\W+". And in regular expression land, that means I want to break things up wherever there's a run of one or more non-word characters. The capital W means anything that isn't a word character.
So regular expressions know what a word is. It's going to filter out all the punctuation and all those special characters and just leave us with words that contain the actual characters. So that's going to get rid of all the issues with punctuation messing up our results. However, we still have the problem of capitalization, right? So "Remember" with a capital R is going to be a different word than "remember" with a lowercase r, even after stripping out any punctuation that might have existed there. To take care of that problem, we're furthermore going to map our results to lowercase. So we see here we start off flat-mapping our input data into individual words, but in this case we're splitting on regular expression words as opposed to just on space characters, which is more robust. And then we do a regular map to take each individual word and map that on a one-to-one basis to its lowercase version. So by normalizing everything to lowercase characters, we can be sure that we're counting all the words the same regardless of capitalization. And then we go off and do countByValue on the resulting lowercaseWords RDD and print those results. So let's see if that gives us something a little bit better. Let's right-click on WordCountBetter and run that. And okay, that looks a little bit better, right? So that's cool. Now we don't have any weird punctuation going on here. Everything is lowercase, so we're getting results that make a little bit more sense, hopefully. So for example, the word "advertising" appears 41 times in the book. That seems believable. "Religious" only once, "cash" 19 times. Let's look for a really popular word here: "that" appears 1292 times. So not too surprising there, that's a very common word. Cool. So it seems to have worked. I think we're getting valid words here. Everything seems normalized. We're not getting messed up by punctuation, so mission accomplished. But this still isn't terribly useful in how you use it, right?
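To see what the regular expression split and the lowercase step do to a single line, here is a small sketch in plain Scala with no Spark required. The sample sentence is made up. The filter(_.nonEmpty) step is an addition that is not described in the lecture: splitting a line that starts with punctuation leaves an empty string at the front, and this drops it.

```scala
object RegexSplitSketch {
  // "\\W+" matches one or more non-word characters (anything other than
  // letters, digits and underscore), so it acts as the separator between words.
  def toWords(line: String): Seq[String] =
    line.split("\\W+").toSeq
      .map(_.toLowerCase)  // normalize capitalization
      .filter(_.nonEmpty)  // extra step: drop the empty leading token

  def main(args: Array[String]): Unit = {
    // Made-up sample line containing quotes, a comma, and mixed case.
    println(toWords("\"Why not?\" asked the owners, twice."))
  }
}
```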
I have to go fishing for common words here. I shouldn't have to do that. Wouldn't it be better if I could sort these results by how often each word appears? That would be more useful. I could very easily find the most common words and the least common words in the book. And that might even give me some insight about what the book is about, or maybe words that I'm overusing or something like that. That could actually have a practical application for writers. So let's move on and make this script even better in our next lecture and actually sort those results so we can get more meaning out of it.

17. [Activity] Sorting the Word Count Results: Okay, so in our previous lesson, we refined our word count example to be a little bit more intelligent about identifying words. So we have a really good output here of the counts of each unique word within our entire book. But it's not terribly useful because it's not sorted by how often each word occurs. I have to go fishing to find out what the most common and least common words in my book are. So that's not very helpful, right? Let's fix that. We want to sort the results based on how often each word occurs, so we can quickly see the most used words in my book. Now, in previous lectures in this course, we would have done that by just taking the final results that were returned from Spark to our driver script and sorting them in Scala. But that's not really the big data way of doing things, right? It would be better if we could sort these on the cluster. If we had so many words that we couldn't actually process them all on one machine, you'd need to do that, right? I mean, that's sort of a contrived example. You're unlikely to have that many words in a book, but you get the idea. It's going to be preferable to do that sorting distributed on the cluster if we can. So let's do it that way and see how that might work. Let's open up the WordCountBetterSorted example here.
It's just a small refinement on the previous WordCountBetter script. It's pretty much the same, except for the name of the script, all the way down to this point here. So we've loaded up our book into an input RDD. We split it out into words using flatMap, so now we have a words RDD containing every word in the book, and then we map that to lowercase to normalize that data. Now, previously we would have used countByValue, right? If we go back to our earlier example, we used countByValue to quickly get the number of times each word occurs in that RDD. But we want to sort it as well, so we need to kind of do it the hard way. So let's break down what's going on in this kind of complicated line here. First we're going to map each x to (x, 1): take each row, each individual word, and map that to a tuple of the word and the number 1. Now, we've seen this trick before, right? What we're saying is that every individual tuple holds that word and represents that it occurred one time in that row. And that's a little trick we can use because it allows us to sum everything up and get the number of times each unique word occurs when we do a reduce operation. And that's what we're doing here. We're saying reduceByKey, adding up all those totals. So let's go through this again. We're transforming every word into a tuple of the word and the number 1. And because this is a tuple with two things in it, it can be considered a key-value RDD, where the key is the word and the value is the number 1. So now we can say reduceByKey, because we basically have a key-value RDD, summing up all the values, okay? So for every unique word, we're going to add up all the values, each of which is just the number 1. When we add up all the 1s for each word occurrence, we end up with a count of how many times that word occurred. Okay, the next thing we want to do is sort it.
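The "hard way" described above, mapping each word to (word, 1) and then summing by key, might be sketched like this. The object and helper names are hypothetical and the tiny in-memory input stands in for the book.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object WordTotalsSketch {
  // Pair every word with a count of 1, then add up the 1s for each distinct word.
  def wordCounts(words: RDD[String]): RDD[(String, Int)] =
    words.map(x => (x, 1)).reduceByKey((x, y) => x + y)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "WordTotalsSketch")
    sc.setLogLevel("ERROR")

    // Made-up stand-in for the lowercase words RDD.
    val lowercaseWords = sc.parallelize(Seq("you", "the", "you", "business", "the", "you"))
    wordCounts(lowercaseWords).collect().foreach(println)
    sc.stop()
  }
}
```

Unlike countByValue, the result here is still an RDD, so further steps such as sorting can run on the cluster before anything is collected.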
So at this point the reduceByKey operation has left us with a wordCounts RDD that contains key-value pairs, again, where the key is the word and the value is the number of times that word occurs. Now, Spark offers a sortByKey function that we can use, but it sorts by the key, and we want to sort by the value, by how often each word occurs. So we need to flip that first. That's all we're doing in this map operation. We're taking the wordCounts RDD and flipping it so that the key-value RDD becomes value-key, if you will. So we're taking the input RDD, which contains (word, count), and flipping that around to (count, word) by transforming each x to (x._2, x._1): the second field, then the first field. Once we have that, we can use sortByKey to sort the results by the number of times each word occurs. So stay with me here, and let's walk through it quickly one more time. All this is the same as before. But in this case we're going to take our lowercaseWords RDD, which has each individual word on its own row. We're going to map that to a key-value pair where the key is the word and the value is the number 1. We do a reduceByKey operation to add up how many times each unique key, each unique word, occurs. And then we need to flip that resulting key-value pair on its head so that we can actually use sortByKey to sort by the number of times each word appears. We then go through and print out the results, kind of like we did before. And we should get something more useful. So let's go ahead and try it. Right-click on WordCountBetterSorted and say Run. And there we have it. It is sorted in ascending order, so the most popular words will be at the end. And that's interesting. So my most common words were "to" and "you", and, well, I'm glad that I used the words "you" and "your" more than "I" and "me". That's a better way to write. You want to connect with your reader and not talk about yourself a lot. So I'm going to call that a win.
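The flip-then-sort step could look roughly like the sketch below. The names are hypothetical and the input counts are made up; only the map to (x._2, x._1) and the sortByKey call come from the description above.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SortByCountSketch {
  // Flip (word, count) to (count, word) so that sortByKey orders by the count.
  def sortByCount(wordCounts: RDD[(String, Int)]): RDD[(Int, String)] =
    wordCounts.map(x => (x._2, x._1)).sortByKey() // ascending by default

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "SortByCountSketch")
    sc.setLogLevel("ERROR")

    // Made-up counts standing in for the output of reduceByKey.
    val wordCounts = sc.parallelize(Seq(("the", 3), ("foam", 1), ("you", 2)))
    for ((count, word) <- sortByCount(wordCounts).collect()) {
      println(s"$word: $count")
    }
    sc.stop()
  }
}
```

Collecting before printing keeps the output in sorted order on the driver; sortByKey also accepts an ascending flag if the most common words should come first.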
But yeah, "a", "in", "the" and so on show up pretty frequently, all the common words that you'd expect to see there. And "business" shows up a lot. So when you drill down into the more interesting words, you can sort of tell what the book's about just by what the most common words are, right? The first word that kind of has some meaning there is "business", and it is a book about business. Scrolling up: "time". You know, we do talk about time management a lot. "Product". Okay, yeah, we talked a lot about how to select what product to sell. So yeah, you can actually get some insights about what the book is all about just from looking at this data now. And if we scroll up to the least common words, we should get some pretty obscure stuff here. "Foam"? When did I use the word "foam"? I wonder where that came from. "SWOT", "quoted". So, yeah, interesting stuff. You can really tell what the book is about just by looking at this data. So we've kind of seen the evolution of the script. And along the way of making this script more useful, we've learned not only how to use flatMap, but also how to use regular expressions and how to make creative use of reduceByKey and sortByKey to get the results you want. Now again, we'll find that this is a little bit more straightforward when using datasets and SQL. But we're getting there. We just want to start off with the basics here and the fundamentals of Spark, which are built on RDDs. And in cases where you need to use RDDs, it's good to know these tricks.

18. [Exercise] Find the Total Amount Spent by Customer: Okay, I think you've learned enough about RDDs and seen enough Scala code at this point to try doing something on your own. So here's a little challenge, a little hands-on activity. I'm going to have you create a very simple Spark program in Scala that will just add up the total amount of money spent by each individual customer ID in a little fake database of e-commerce data.
So I'm going to provide you with a CSV file that contains stuff in this format. Each row has a customer ID, an item ID, and the amount spent on that item. So each row of this fake data represents some transaction, some purchase that a fictitious person made. You're going to write some code to output the grand total of how much was spent by each unique customer ID. So in this particular example, you can see that user ID 44 bought something for $37.19, and something else for $40.64. So the grand total for user 44 is $77.83. And we want to do that for every individual customer in our database. So I'll spell this out for you a little bit, if you're a little bit anxious about it. Don't peek if you do want to try this on your own, but here's some general guidance on the strategy for doing this. You want to start off by splitting every comma-delimited line into its individual fields. And as I said, that's going to be user ID, item ID and amount spent. You will then map each line to a key-value pair of customer ID and dollar amount, and use the reduceByKey operation to add up the total amount spent by customer ID. When you're done, collect the results and print them out. Pretty straightforward. And you should be able to figure this out by studying the previous examples in this course. They're quite similar. Here are a couple of code snippets that might be useful if you are new to Scala. If you want to split a comma-delimited line into its individual fields, this line will do it for you: val fields = line.split(","). And you will also need at some point to make sure that you treat field 0 as an integer and field 2 as a floating point number, using toInt and toFloat. So that little snippet of code is important to remember as well. You do need to explicitly convert these into their actual numeric representation and not just leave them as strings like they come out of the file raw. So good luck.
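The two snippets mentioned above can be tried on a single line in plain Scala before wiring them into Spark. The sample line below is partly made up: the customer ID 44 and the amount 37.19 come from the example, while the item ID 8602 and the helper name are hypothetical.

```scala
object ParseOrderLineSketch {
  // Split one comma-delimited line and convert the fields we care about.
  def parse(line: String): (Int, Float) = {
    val fields = line.split(",")
    val customerId = fields(0).toInt   // field 0 as an integer
    val amount = fields(2).toFloat     // field 2 as a floating point number
    (customerId, amount)
  }

  def main(args: Array[String]): Unit = {
    // 8602 is a hypothetical item ID used only for illustration.
    println(parse("44,8602,37.19"))
  }
}
```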
And before I set you loose, I am going to show you real briefly where to find the data for this and how to create a new object in our project, so you can actually play around with it and get started. So let me show you that real quick and then I'll set you loose. First, let me direct your attention to where the data for this exercise lives. Within your course materials folder, go into SparkScalaCourse and then the data folder. It's the customer-orders file that we're going to be looking at here. Let's open that up and see what it looks like. Now remember, this is comma-separated values. So even though Excel is displaying it as a spreadsheet, it's really individual columns separated by commas that we're reading in here. And this data is completely randomly generated. You can see there's quite a bit of it, 10 thousand rows of data here, but it's all fake. The first column represents the user ID, followed by an item ID and the amount spent on that item. So each line represents a single purchase. And yeah, you might note that the second column isn't really needed for what we're doing, right? We just need the amount spent by customer. The item ID doesn't really come into play. A lot of the time, cleaning your data is going to be a big part of the job, and that's the case in this simple little exercise as well. Now, to get you started on this, let me show you how to actually go about creating a new object in IntelliJ. So I right-click on com.sundogsoftware.spark, say New, Scala Class, and give it some unique name. Stick your name in there just to be safe, to make sure that it's unique. So something like TotalAmountSpentFKane for me, or whatever your name is for you. And double-click Object, because we want an object and not a class. All right, so now you have the boilerplate of a new object here. It has the correct package name because we put it in the right place. And your object is all set up.
And now all you have to do is write some code and test it out. When you're ready to try it out, you can just right-click on that object and build it. And then after that you can actually try to run it and see what happens. And yeah, that's it. So go have at it. It's not a very difficult example, not a hard exercise by any means, but I want you to get your hands dirty and get some comfort in writing Scala code. We'll have some more challenging ones later in the course, of course, so we're starting off easy. So go give that a try. And when you're ready, come back to the next lecture and I'll show you my solution, which is here in the course materials. You might have spotted it already, so resist the temptation to look at it. Try this yourself first.

19. [Exercise] Check your Results, and Sort Them by Total Amount Spent: So did you do your homework? I hope so. Well, if you want to compare your results to mine and compare your code to mine, I included my solution in your course materials as well. So hopefully you haven't peeked at it yet. Just open up the TotalSpentByCustomer object here. And here's how I did it. So let's walk through this code. We start off with the package declaration that you should already have, and importing what we need. We need everything from Spark and Log4j. Hope you didn't get tripped up on including those. And we called our object here TotalSpentByCustomer. This is my solution file. Let's jump down to the main function and start there. So we start off by setting our log level to ERROR, as we've done before. And we set up a new SparkContext. The only thing unique about this is that we set an app name that's different from the other examples. In this case, TotalSpentByCustomer. Now we get into the meat of it all. We load up our text file, which is the customer-orders.csv file that I directed you to earlier. And we're just going to load that up into an RDD named input.
Now we need to parse that data out. So that's why we've written this extractCustomerPricePairs function up here. Of course, you could have named it anything you want. It takes in a whole line of comma-separated values, a String that we'll just call line, and returns a tuple of an Int and a Float. So those will be the key-value pairs that we use later on, where the key is the customer ID, which is the first field, and the value is a floating point number that represents the amount spent on some item by that customer in that transaction. We start off by splitting up the fields in that line based on the comma character using line.split, and we assign that to the fields value. And then we just return the tuple that contains the first field converted to an integer, and the third field converted to a floating point number. And remember again, we start counting from 0 here, so those are fields 0 and 2. So at this point, we've mapped our input into a new mappedInput RDD that contains tuples of customer IDs and amounts spent. And that is actually a key-value pair because it's a tuple of two things, right? So our key is now the customer ID and our value is now the amount spent in that transaction. Now we just have to reduce that down using the reduceByKey operation. And we use this syntax here to keep a running total of the grand total spent by each customer ID. So we're reducing by key, by each customer, and adding up those values as they come in. The function takes in a pair of values associated with the same key, and we say we want to add those up because we want the grand total. And what we get back is the totalByCustomer RDD, which will contain each unique key, each unique customer ID, and the total spent by that customer ID. We can then collect the results back. So that's going to be our action that gets the results and returns them to our driver script running locally here.
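Put together, the solution described above might look something like the sketch below. It follows the names used in the walkthrough (extractCustomerPricePairs, mappedInput, totalByCustomer), but it is a reconstruction under assumptions, not the course's solution file: the wrapper function, the object name, and the file path are hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TotalSpentByCustomerSketch {
  // Turn "customerID,itemID,amount" into a (customerID, amount) pair.
  def extractCustomerPricePairs(line: String): (Int, Float) = {
    val fields = line.split(",")
    (fields(0).toInt, fields(2).toFloat)
  }

  // Hypothetical helper: sum the amounts for each customer ID.
  def totalByCustomer(input: RDD[String]): RDD[(Int, Float)] = {
    val mappedInput = input.map(extractCustomerPricePairs)
    mappedInput.reduceByKey((x, y) => x + y)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "TotalSpentByCustomerSketch")
    sc.setLogLevel("ERROR")

    val input = sc.textFile("data/customer-orders.csv") // assumed path
    totalByCustomer(input).collect().foreach(println)
    sc.stop()
  }
}
```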
And then we can just call foreach with the println function to print each line of those results. So let's run it and see if it works. I'm going to right-click on TotalSpentByCustomer and say Run TotalSpentByCustomer. Off it goes. And there we have it. Very cool. All right, well, it seems to have worked. So for example, user ID 91 spent $4,642.26, and so on and so forth. User 53 spent $4,495. They're all roughly in the same neighborhood because this was all randomly generated data with a uniform random distribution. But that's getting into my data science course, I suppose. Anyway, it worked. So hopefully you have a similar output to what's here. The order may be different, that's okay. But you should be able to see that for a particular user ID, you can search for that user ID and get the same number, hopefully. So if you want to search for user 89 and make sure you got 4851947, 95, that might be a good idea. All right, so I will challenge you further now. You can make this even better, kind of like we did in our word count example. It would be a lot more useful if this was sorted, right? If I was interested in seeing who my big spenders are, I might really want to see this data sorted by amount spent. That way I could easily see who was spending the least amount of money and who was spending the most amount of money on my little fake e-commerce site here. So that is your challenge. Go off and see if you can sort the results by amount spent. For an extra added bonus, see if you can format the output to actually have two decimal places of precision after the dollar sign, so it looks more like real currency. But your only real challenge right now is to sort those results by amount spent. And again, refer back to the WordCountBetterSorted example that we had earlier. It's the same exact trick. So I'm not trying to make you think too hard here.
I just want you to get some practice here and apply what you've learned in this section. As I said, we'll get more challenging soon. So go off and give that a try now. See if you can sort those results by amount spent, and give it a shot. And we'll come back in the next lecture and you can compare your results and your code to my own.

20. Check Your Results and Implementation Against Mine: Okay, so I hope you had a chance to try sorting those results by amount spent so we can get a more useful output there. Let's compare your solution to mine again. So if you want to look at mine, you can find TotalSpentByCustomerSorted. Double-click on that and see how I did it. It's pretty much the same stuff as the original code that we went through in the previous lecture, so I'm not going to reiterate all the stuff that didn't change. All we did change was this bit here. So at this point here we have the total amount by customer. We have reduced our data down to key-value pairs where we have individual customer IDs associated with the amount spent by that customer across the entire database. So just like we did in the sorted word count example, we're going to flip that on its head and take advantage of the sortByKey operation that Spark offers. That way we can actually distribute the sorting of this on the cluster if we want to. To do that, we need to make our values the keys and the keys the values, because we want to sort by the values, the amount spent. So we're creating this flipped RDD. That's just mapping totalByCustomer to take the input key-value pair x, take the second element and make it first, and take the first element and make it second. So it's just flipping the key and the value to make the new key the amount spent and the new value the user ID. We then call sortByKey to actually sort that by the amount spent, that first value there. Then we can collect the results back and print out each line. So let's see if it works.
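The flipped sort described above could be sketched as follows. The names and the sample totals are hypothetical. The formatted println also covers the optional currency-formatting bonus using Scala's f-interpolator, which is one possible approach and not necessarily what the course solution does.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TotalSpentSortedSketch {
  // Flip (customerID, total) to (total, customerID), then sort by the total.
  def sortByAmount(totalByCustomer: RDD[(Int, Float)]): RDD[(Float, Int)] = {
    val flipped = totalByCustomer.map(x => (x._2, x._1))
    flipped.sortByKey() // ascending: the biggest spenders come last
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "TotalSpentSortedSketch")
    sc.setLogLevel("ERROR")

    // Made-up totals standing in for the output of reduceByKey.
    val totalByCustomer = sc.parallelize(Seq((44, 77.83f), (35, 10.0f), (2, 55.5f)))
    for ((amount, id) <- sortByAmount(totalByCustomer).collect()) {
      // "$$" prints a literal dollar sign; %.2f keeps two decimal places.
      println(f"User $id spent $$$amount%.2f")
    }
    sc.stop()
  }
}
```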
Right-click on TotalSpentByCustomerSorted and run it. And we should see who our big spenders are. All right, looks like it worked. So we have in the first column now the amount spent. We can see that the most anyone spent was $6,375.45, and that was by user number 68. So user 68 was our biggest spender. And if we were to scroll up, we can see our cheapest spender, if you will: $3,309 went to user 45. And again, if you do want to push yourself further, you could try to format this output a little bit more nicely. I'm not going to go through that with you, but if you feel like you want even more of a challenge, see if you can print out each line to say more verbosely "user ID whatever spent this much money", and format the money in real monetary format: dollar sign, some integer, a dot, two decimal places, right? And that's not really going to be a Spark problem. It's more of a Scala problem. But if you want to go further, that's one way to do it. But yeah, that's enough for now. So we've learned about RDDs. RDDs are the original API for Apache Spark, going back to Spark 1, and they still have their place for many types of problems. They can offer the best performance and the most flexibility in what you can do. But it's time to move on. So let's go talk about DataFrames and Datasets next, because that's really the more modern way of writing Spark code. And we'll revisit these examples, doing them again using Datasets, in the next section.

21. Introduction to SparkSQL: So in our next section we are introducing Spark SQL. And with it come the more modern APIs of Spark, DataFrames and Datasets. In the world of Scala, we're going to be focusing more on Datasets because that's even more efficient. So this is going to be the more modern interface for Spark, and we're going to focus on it going forward in the course. It is a layer on top of Spark Core.
But if you can think of your problems in terms of a SQL command, which most data analytics problems can be, Spark SQL, DataFrames and Datasets offer a very efficient and easy-to-use solution for getting the answers you want across an entire cluster. So let's dive in and see how it works. We've talked a lot about the RDD interface in Spark so far. That was the original interface in Spark, and everything is built on top of it. Sometimes it is still useful to go back to that and really get down to a low level for the best optimization for simpler problems. But starting with Spark 2, they really started emphasizing the Spark SQL interface in Apache Spark instead. And Spark SQL brings to the table things like DataFrames and Datasets. Let's talk about those. So DataFrames came first. They extended RDDs to a DataFrame object. And a DataFrame is a lot like a table in a relational database. It contains Row objects that contain some sort of structured information. It has a schema, and that way we can store things more efficiently. And because we have a schema and these rows, we can just treat it like a SQL database and run actual SQL queries on our DataFrames. And because it looks a lot like an actual database, you can have a lot of interoperability with databases and database-related file formats. So you can read and write JSON, Hive or Parquet formats. And you can even communicate directly through a JDBC or ODBC interface, or using Tableau. So it can actually look just like a relational database, even though it's Apache Spark running on a cluster. Pretty cool, right? And because it can think of things in terms of SQL queries, that also brings to the table the entire world of query optimization technologies, right? So beyond the usual directed acyclic graph optimizations that happen with Spark, it can also look at the actual SQL queries that you're trying to issue on a DataFrame and do the same things that a relational database would do to optimize that query.
So as such, if you're doing things in sort of a SQL way with a DataFrame, you can sometimes get even more optimization than you would with an RDD. Then Spark came out with Datasets, and Datasets and DataFrames are kind of the same thing. Technically, a DataFrame is just a Dataset of Row objects. And a Row is just a row of stuff, right? So the important point here is that you can also have a Dataset that wraps a given explicit type with an explicit structure, where we know the actual schema upfront at compile time. For example, I could create a Dataset of a Person case class that defines exactly what the fields are that make up a person and what their types are. Or I could create a Dataset that says explicitly I want a String and a Double on each row. And by doing that, it will know what those columns are from the get-go. So unlike a DataFrame, where all I know is that I have a Dataset of Rows and a Row could contain anything until I define it, a Dataset knows up front what types are inside of it. And because of that, the Dataset knows its schema at compile time. This means that type-related errors will be detected when you build your script, as opposed to when you actually run the script. And that's huge, right? Because running a script is often a very expensive operation when you're doing it on a big cluster. So a big advantage of Datasets is that you're going to detect those type-related errors at compile time. And that makes your development a little bit easier. It also leads to better optimization, because we can actually do some optimization while we're compiling instead of at runtime. And that's a big benefit as well. Now the catch with Datasets is that because it's doing all this stuff at compile time, you can only really use them in languages that are compiled. That means Java or Scala in the world of Spark. So this is yet another reason to use Scala instead of Python if you can, even though Python seems to be way more popular.
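As an illustration of a Dataset that knows its types at compile time, here is a minimal sketch. The Person case class, its fields, and the sample rows are hypothetical; the Spark calls used (SparkSession.builder, toDS, toDF, filter with a typed function) are standard Spark SQL API.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object TypedDatasetSketch {
  // Hypothetical case class: its field names and types act as the schema.
  case class Person(name: String, age: Int)

  // A typed filter: a typo such as p.agee would be rejected by the compiler.
  def adults(people: Dataset[Person]): Dataset[Person] =
    people.filter(p => p.age >= 18)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TypedDatasetSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // encoders needed by toDS()

    val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 12)).toDS()
    val untyped: DataFrame = people.toDF() // a DataFrame is a Dataset of Row objects

    untyped.printSchema()
    adults(people).show()
    spark.stop()
  }
}
```

With the untyped DataFrame, a misspelled column name would only surface as an error when the job runs.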
With Python, we're limited to DataFrames because it can't do that compile-time optimization. So with Scala, however, we can be the cool kids. We can use Datasets and we can know what our types are up front. Python, however, is going to be limited to DataFrames. Now, you don't have to go all in with Datasets if you don't want to. You can actually convert an RDD to a Dataset with toDS, and you can go the other way around as well. So it's possible to have the best of both worlds and do some processing using RDDs and do some further processing with Datasets and do whatever makes sense. As we'll see, RDDs are still more useful for some operations. But for SQL-like queries, obviously you're going to want to use Datasets. Datasets really are the new hotness. The trend overall in Spark development is to use RDDs less and Datasets more. And I actually get some students that get kind of angry that I even cover RDDs at all now, but they really do have their place. Datasets, however, are often more efficient. They can be serialized very efficiently. And because of those execution plans that we're determining at compile time, if you're doing a SQL query, that can often be a lot more efficient, like I said, than just using an RDD. Datasets also allow for better interoperability, not just with outside file formats and databases, but also within the various components of Spark itself. So Spark's machine learning library and Spark's streaming engine use Datasets as their primary API instead of RDDs now. So if you want to move your data between these different components of Spark, you're going to be using Datasets as the way of doing it within Scala. And also, sometimes using a Dataset can really simplify your development. If you're just doing SQL operations and you can think about what you're trying to do with your analysis on Spark in terms of a SQL command, you're going to find it's really, really, really easy to do that with a Dataset.
Whereas it might have been much more challenging to do that with an RDD. Now when you're using Spark SQL, including Datasets within Scala, there is some difference in structure here in how you approach your scripts. And we'll see that with some examples shortly. The main thing is that you're going to create a SparkSession object instead of a SparkContext when you're starting up your script. If you're going to be using Spark SQL or Datasets, you need a SparkSession. Think of that as like a database session. And you do need to explicitly stop that session when you're done. Now once you have that session, you can actually get a SparkContext from it, and you can use the session to issue SQL queries on your Datasets. Or you can just use the straight-up Dataset API, kind of like this. So once you do have a Dataset object loaded up, you can do things like show or select or filter or groupBy. And you will notice that this looks a lot like SQL commands, right? So the format should look very familiar if you know SQL. For example, select("someFieldName") does exactly what you think it would do. It selects the someFieldName column from your Dataset and returns that as a new Dataset. Filtering: that's kind of an interesting syntax there, right? So we can actually say filter with someFieldName on my Dataset greater than 200. We can put that expression as a parameter within our filter command and get back a new Dataset that keeps only the rows where that field is greater than 200. groupBy, that works just like it does in SQL as well. So we could group everything on someFieldName. And that's going to be the same thing as like a reduce operation, right? And then we can say .mean to automatically compute the average of each one of those results. So in one line there, I think we just did the entirety of one of our earlier examples using RDDs, right? So that's a good example of how things can be simpler.
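The session lifecycle and the operations just described can be sketched as follows. This is an illustration under assumptions, not the slide's code: the column names (`id`, `name`, `age`, `friends`), the sample rows, and the object name are made up, and it runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DatasetApiSketch {
  // groupBy followed by mean: the one-liner version of an "average by key" job.
  def averageFriendsByAge(people: DataFrame): DataFrame =
    people.groupBy("age").mean("friends")

  def main(args: Array[String]): Unit = {
    // A SparkSession instead of a SparkContext.
    val spark = SparkSession.builder
      .appName("DatasetApiSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for real data.
    val people = Seq((0, "Ann", 33, 385), (1, "Bob", 33, 15), (2, "Cy", 19, 100))
      .toDF("id", "name", "age", "friends")

    people.show()                                  // display the first rows
    people.select("name").show()                   // keep one column
    people.filter(people("friends") > 200).show()  // keep rows where friends > 200
    averageFriendsByAge(people).show()             // average friends per age

    // The same session can also run literal SQL against a registered view.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE friends > 200").show()

    // Going between RDDs and Datasets: toDS one way, .rdd the other.
    val pairs = spark.sparkContext.parallelize(Seq(("a", 1.0), ("b", 2.5))).toDS()
    println(pairs.rdd.count())

    spark.stop() // sessions should be stopped explicitly
  }
}
```

Note that `filter` keeps the rows for which the expression is true, so `people("friends") > 200` returns the people with more than 200 friends.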
And if you do need to do something more low-level, you still can. You could convert that back to an RDD and pass in your own custom map function if you want to as well. So like I said, you can mix and match here and get the best of both worlds with Datasets and RDDs. Datasets open up some other possibilities too. Like I said, you can talk to it through JDBC or ODBC. So it is possible to actually open up a shell to Spark SQL and deal with it just like it's a database, which is kind of awesome, right? I mean, that gives you sort of this horizontally scalable database, if you will, across your entire Spark cluster. How cool is that? And some of the details are listed here on the slide if you want more detail, but I don't want to muddle your brain with this too much because we're not going to be doing this in the exercises. Just know that it's possible. You can also have user-defined functions if you want to. Sometimes this will be easier to do outside of Datasets, but they're here if you need them. So for example, we could create a simple user-defined function by just importing org.apache.spark.sql.functions.udf. We could construct a UDF called square that just contains the expression take x and make it x times x. And then we could, for example, say withColumn and add a new column called square that just calls that square user-defined function. So you can do that too. Just like you can have UDFs in databases, you can have UDFs in Datasets as well. So this will make a lot more sense with some examples. So let's just dive in and spend the rest of this section looking at examples hands-on. So we're going to go back to our fake social network data that we used earlier in the course with RDDs. And I'll show you a couple of ways of using Datasets to explore that data. So first we will query it with an actual SQL command, and then we'll go back and use functions from the Dataset API to query it without using actual SQL commands.
But the code will still look a lot like SQL. And after that, let's go back to our RDD examples and redo them with Datasets, and we'll see which ones are simpler now and which ones are more complex. And that'll be a good way to understand the strengths and weaknesses of both RDDs and Datasets. And you can learn how to actually choose what API to use for a given problem. So let's dive in with the following lecture and go back to our fake social network and play around with it. 22. [Activity] Using SparkSQL: So let's play around with Spark SQL and Datasets and DataFrames with this simple little example here. Open up the Spark SQL dataset example in your materials. And let's walk through what's going on in this one. So we're going back to our fake friends dataset that was used for our friends by age example back in the first section. And we're going to query this data using SQL interfaces instead, using Spark SQL. So we declare a package, as always. Now note that we're just including org.apache.spark.sql. We're limiting ourselves to the Spark SQL APIs for this example. We have Log4j as well. We're creating our object, and we start off creating a case class. What's that all about? So this is a Scala feature that we haven't talked about before. A case class is just a very compact way of defining a class and object definition, if you will. So by saying case class Person, we're saying that we want to define a Person object, and it will contain the following fields with the following types. So if you're familiar with C++, for example, it's kind of like declaring a struct. But the syntax here is a little bit more compact. So we're saying a person consists of four different fields. It contains an id field, and this name is meaningful; it does get kept. And that id is an integer. It contains a name field of type String, an age field of type integer, and a friends field of type integer as well. Okay?
And this will actually be used for constructing the virtual database tables that we'll query using Spark SQL. So when we're done, this will correspond to the column names and the column types in our table. We'll have an id column that contains integers, we'll have a name column that contains strings, and so on and so forth. So with that one line, we've defined the schema for our data. Okay, that's very important. With Datasets, we need to have that schema defined at compile time. We can't just infer it from the data itself. So with that in hand, we'll go into our main function. We set our log level. And now instead of creating a SparkContext, we're creating a SparkSession. Again, you can think of that as similar to a database session. So to create that SparkSession, we say SparkSession.builder to actually build our session. We give it an app name of Spark SQL. We say our master is local[*], same exact syntax as before, saying that we're going to run on your local machine on all CPU cores. And then getOrCreate. That means that we can either create a new SparkSession or reuse an existing one. Now remember I told you that you have to stop these things when you're done. If we skip ahead a little bit here, we see that we do say spark.stop at the end to close out that session. It's possible for these sessions to keep on running beyond the lifetime of this driver script. So if you didn't explicitly stop it, or it stops unexpectedly, that session could still be running on your cluster, and getOrCreate would just reuse that existing session instead of creating a new one in that case. So pretty interesting stuff. Now let's load up our data. So remember our data exists in fakefriends.csv. And we're going to just load that up using spark.read. So spark.read on the SparkSession object will read that CSV file. We're going to tell it that it has a header row. And we're going to ask it to initially infer the schema, taking the column names from that header row.
So the header row information matches the same names that we gave in our case class up here. The header is id, name, age, and friends. And it's important that that matches up in this case. So at this point, after saying spark.read.csv, we have a DataFrame. At this point the schema has been inferred at runtime, right? That's the main difference between a DataFrame and a Dataset. But we still want to use Datasets. We want all the benefits of that compile-time type checking. So I'm going to take that DataFrame that just consists of these Row objects with inferred schemas and convert that to a Dataset with an explicit schema by saying .as[Person]. So that .as[Person] takes that DataFrame that we read from the CSV and converts it to a Dataset with a schema that we know at compile time using the Person case class. With me so far? Okay, it'll make more sense when we run it. First thing we're going to do is print that schema out to make sure that it's what we expected. So at this point we have a schemaPeople dataset. And now we can create a database view on it by saying createOrReplaceTempView. Again, that will either create a new one or replace an existing one with the same name. And we're going to call that view people. This will have the effect of basically creating a database table called people, for all intents and purposes. And we can query that table now by just saying spark.sql. We're calling the SparkSession's sql method with an actual SQL command here. We're going to say select star from people. That's the view that we created from our Dataset. Where age is greater than or equal to 13 and age is less than or equal to 19. So we're creating an actual SQL statement here and returning the results in a new structure called teenagers, a new Dataset. We will then collect the contents of those results and iterate through them, printing them all out, and stop the session when we're done. So let's go ahead and run that and see what happens. Cool, it worked.
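The walkthrough above can be sketched roughly as follows. This is a reconstruction from the narration, not the course file itself: the object name, the sample rows, and the temporary CSV file are assumptions made so the sketch runs on its own, and it sets the log level through the SparkContext rather than Log4j.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Row, SparkSession}

object SparkSqlSketch {
  // Field names must match the CSV header: id, name, age, friends.
  case class Person(id: Int, name: String, age: Int, friends: Int)

  def teenagers(spark: SparkSession, csvPath: String): Array[Row] = {
    import spark.implicits._ // provides the encoder that .as[Person] needs

    val schemaPeople = spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // column types are inferred at runtime
      .csv(csvPath)                  // a DataFrame at this point
      .as[Person]                    // now a Dataset[Person]

    schemaPeople.printSchema()
    schemaPeople.createOrReplaceTempView("people")

    spark.sql("SELECT * FROM people WHERE age >= 13 AND age <= 19").collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("SparkSQL")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Made-up rows written to a temp file; the course reads its own fake friends CSV.
    val path = Files.createTempFile("fakefriends", ".csv")
    Files.write(path, "id,name,age,friends\n0,Will,33,385\n1,Ann,19,2\n2,Bob,18,221\n".getBytes("UTF-8"))

    teenagers(spark, path.toString).foreach(println)

    spark.stop()
  }
}
```

Dropping the `.as[Person]` line leaves `schemaPeople` as a plain DataFrame, and the SQL query still runs the same way.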
So we can see that inferred schema here, actually the explicit schema that we gave it for the Dataset, right? So we can confirm that we have an id, name, age, and friends field of the types we expect: integer, string, integer, and integer. So that's the schema of our schemaPeople dataset that we set up there and read in from that CSV file. And then we have the results from our query here. And we see that we have all the teenagers in our dataset here. These are 19-year-olds and 18-year-olds, and I guess I only generated data from 18 years old on up. So that's the result of our query there. It looks like it worked. Pretty awesome. So we actually executed an actual SQL command on a Dataset on Apache Spark, across an entire cluster potentially. So how cool is that? You have the entire power of the SQL language at your disposal with Apache Spark using Datasets. And I do want to make one more point here before we move on. So I didn't have to use a Dataset here. I could have used a DataFrame. Let me show you how that would work. So if I just comment out this line, at this point my schemaPeople is going to be a DataFrame. It's going to have that schema inferred at runtime. I'm not telling it explicitly with that case class what the schema is. So by commenting that out, I'm losing all the compile-time error checking and compile-time optimizations that I would get with Datasets. But it will still work. It will just work at runtime. So let's go ahead and run that and make sure that I'm not telling you a lie. And you can see here that we got the same answer. So the same query results here of 18- and 19-year-olds that we wanted, and the schema's still the same as well. So it all still works as a DataFrame. It's just that by using a Dataset and giving it that explicit schema, we do get better performance and better compile-time error checking. Let's dive into another example next. 23.
[Activity] Using DataSets: So in our previous lecture, we talked about using SQL commands directly on the fake friends data from our fake social network. Let's look at another way of doing it without actually using SQL commands, but instead using the SQL-like functions available on Datasets. So take a look at this alternate way of doing things. Open up the DataFrames dataset driver script. And let's take a look at this. So the actual setup and loading of our data is going to be identical to last time. We're going to set up a Person case class again that defines the schema of the data that we intend to load up. And again, we have an id that's an integer, a name that's a string, an age that's an integer, and the number of friends, which is also an integer. Within our main function, just like before, we set our log level, we create a SparkSession object here with a given name and master, and we either get or create it. And then again, we read that up from the CSV file using spark.read. And we actually force that to be a Dataset by saying .as[Person] at the end, to convert that implicitly defined schema at runtime to a hard-coded schema by forcing it to the Person case class here. So I kind of glossed over this initially: the import spark.implicits._ line is important. Any time that you have Spark implicitly inferring the schema and converting it to a Dataset like this, you need to make sure that you are importing that package right before you do it. The syntax here is a little bit weird. Normally your import statements would be at the top of your file. But in this case we want it right before we're going to actually use it, because it's imported from the spark session object we just created, and we only want it to apply to this specific scope here. So any time you have a command where the schema's being inferred. And here it's pretty explicit that we're doing that, where we're saying spark.read to use the header in the CSV file.
And we're inferring the schema initially from that CSV header, so we need spark.implicits to actually be able to pull that off. Now when we're done, we're converting that to a Dataset with an explicit schema. But that intermediate step requires spark.implicits. There are other ways of loading Datasets that we'll see later that don't require that intermediate step of an inferred schema, but we'll get there. Anyway, we print out that schema like we did before. And now we have a different way of playing with our data here that we're looking at. So this is the new stuff now. So let's start off by selecting just the name column. One thing we did in our RDD examples that was pretty common was discarding columns of information that we didn't need. And this is one way of doing that using Datasets. We can say people.select("name").show(), and that would only select the name column and nothing more. In this case, we're not going to assign that to a new Dataset. We're just going to call .show to actually force Spark to go out and get that result and display it to us from our driver script. So just like you could say select name in SQL, we're saying select with name in quotes here instead of in a SQL command. Similar syntax, same idea, just a different way of doing it that doesn't require an actual database view to do it. We can also do something similar to our friends by age example, where we're just going to filter out anyone 21 or older. So we could say people.filter(people("age") < 21).show(). So that's the same thing as saying select star from people where age is less than 21 in SQL, but we're doing this more explicitly through functions on the Dataset itself. So this is a little bit weird, right? You know, if you're used to other languages or non-functional languages, the idea of having an expression like this, people age less than 21, and passing that as a parameter to a function might seem odd. But in Scala, you can do that.
It actually works. So it's pretty cool. So we just say people and we pass in the name of the column that we want to filter on, less than 21. And that expression, when passed to the filter function of a Dataset, just works. It does the right thing. It's almost magical. Again, we will show the results, display them, as opposed to storing that into a new Dataset in our example here. Also, as in SQL, we have the groupBy function. And this is basically the same thing as the groupByKey operation that we had with RDDs. So it's going to clump together all of the distinct values of ages and glom them all together. And then we will furthermore call count on that to just get a grand total of how many people exist in each age, and show the results. On one line here, we've done something that was a little bit more convoluted using RDDs, right? If you remember, we would've had to transform things with a map operation to insert a number one for every individual person and then do a reduceByKey operation to add them all up to do this before. But in this case, we can use more of a SQL kind of syntax here and do it all on one line of code. So you can see the power of Datasets here. And furthermore, since we are using Datasets, it can be more efficient as well. So, yeah, just think about this a little bit more. We're going to group by age. So that basically combines together all of the entries in people by unique age values. We count up each one within each age and show the results. And here's another interesting example here. In this case, we're going to select from all of the people. We're going to select the name column, and then we're also going to select people age plus 10. So this is going to display two columns. One is the name column from the original people Dataset. And the second column that we're selecting here is going to be the age column with 10 added to it.
So again, we can have these expressions within function calls in Scala, and it actually works. We can say people age plus 10 and pass that as a parameter to the select function. It's a little bit weird if you're used to other languages, again, but this is how you do it in Scala and Spark, and it actually works. And remember always to stop your session when you're done. Otherwise, it will keep on running and hanging around and using resources that you don't want to be using. And it could confuse you on subsequent runs as well. Never forget to stop your session when you're done. So let's go ahead and see this in action and convince ourselves that it actually works. We'll right-click on DataFrames dataset and say run. Off it goes. All right, and let's walk through what happened here. Let's go up to the top here. I'm going to scroll up. First thing we did was print the schema out, right? And there it is. So we can convince ourselves that it did actually get the correct Dataset structure there. Looks right to me. So we are inferring an id, name, age and friends column of types integer, string, integer, and integer respectively. Our first step here was selecting the name column. And there was this command here, people.select("name").show(). And it did exactly what you'd expect. And it looks a lot like what you would expect to see from an actual relational database output, doesn't it? You know, they're really trying to close the gap between Spark and a relational database, and making it work just the same way with Apache Spark in its newer versions. The difference, of course, is that instead of a single monolithic database server, you have an entire cluster doing this potentially. So you can see that we successfully selected the name column and showed it. So we have this little preview of the first entries. We only show the top 20 rows by default, or otherwise it would just go on forever.
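The four operations from this lecture can be sketched as follows. This is an illustration rather than the course's driver script: the object name and the in-memory sample rows are assumptions (the course loads its fake friends CSV instead), and it runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DataFramesDatasetSketch {
  case class Person(id: Int, name: String, age: Int, friends: Int)

  // people("age") is a Column; comparing it with < builds an expression that Spark evaluates per row.
  def under21(people: Dataset[Person]): Dataset[Person] =
    people.filter(people("age") < 21)

  // Arithmetic on a Column also builds an expression: every age with 10 added to it.
  def tenYearsOlder(people: Dataset[Person]): DataFrame =
    people.select(people("name"), people("age") + 10)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("DataFramesDatasetSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows for illustration.
    val people = Seq(Person(0, "Will", 33, 385), Person(1, "Ann", 19, 2), Person(2, "Bob", 33, 221)).toDS()

    people.printSchema()
    people.select("name").show()         // only the name column
    under21(people).show()               // only rows with age < 21
    people.groupBy("age").count().show() // how many people per age
    tenYearsOlder(people).show()         // name, and age plus 10

    spark.stop() // always stop the session when done
  }
}
```

The `show()` calls are what trigger the work; without them (or another action), Spark only builds up the query plan.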
And our next experiment here was filtering out anyone 21 or older by passing in this filter argument of people age less than 21. And sure enough, the resulting data we got back is only showing people that have ages of 19 or 20 or 18, because those are all the people less than 21 years old in our dataset. So that worked too. Very cool. And we also tried that group by age command here, again, a little bit more fancy. So again, to recap, we grouped everyone by unique age values and we counted them all up and we showed them, and it looks like it worked. So we have eight 31-year-olds, five 65-year-olds, seven 53-year-olds, et cetera, within our dataset. So again, one line of code, very simple, very easy to use. It's highly optimized. And it's a lot simpler than the way we had to do this using RDDs. Finally, we did this make-everyone-10-years-older trick here, where we selected the name column and also invented a new column for people age plus 10. So let's see if that worked. And yeah, it looks like it did. So Will presumably was originally 33, but in the age plus 10 column that we created here, he's now 43. And it actually works. You can pass in mathematical expressions like that and they will actually get evaluated and displayed to you in this manner. When we were done, we stopped the session, and that was it. So there you have it: Datasets using functions instead of actual SQL commands. You can do it either way; it depends on your preference. So there you have it, two different ways of using Datasets. 24. [Exercise] Implement the "Friends by Age" example using DataSets: All right, you've listened to me talk about Datasets long enough. I think it's time to get your hands dirty and try to use them yourselves. So let me give you your first challenge using Datasets. Remember the friends by age example that we did with RDDs? Well, I want you to go back there and do it again, but this time using Datasets instead of RDDs.
So just to review what that example is all about, we've provided a fake friends comma-separated value file that contains data in this format. It's basically a user ID, their name, their age, and the number of friends they have. And your task is to give me the average number of friends by age. So I want to know, on average, how many friends does a 33-year-old have, how many friends does a 55-year-old have, and so on and so forth for every age represented in my fakefriends.csv file. And it's called fake friends because it's all fake data. It was all generated randomly. So I'm not making a value judgment on your friends here. Anyway, some hints on how to get through this. So there is a fakefriends.csv file in your course materials that contains a header row. And as you've seen in our previous examples using Datasets, that's going to be important for allowing Scala and Spark to infer the correct schema for that data. And a few commands here that might come in handy. Remember, we can use the select command to extract given columns from a Dataset and get a new one that just has the data we need. You might recall that when we did this with RDDs, we had a map function that actually did that for us. So this is going to be a lot simpler using a select statement. And you will need some other Dataset functions. So groupBy, which we've looked at before, will basically aggregate things by unique ages, kind of like a reduce operation. show will show the final results and get them back. And one we didn't talk about was avg, for average. So after you've grouped things by age, you will then want to get the average by age. And obviously the column names I put in here are for illustrative purposes. It's up to you to put the right ones in there and put them in the right order and whatnot. But those are the pieces you should need to do this. And you might want to refer back to the DataFrames dataset .scala file for inspiration.
And you don't have to do it this way either. You could just use a straight-up SQL command if you want to as well. So it's kind of up to you what direction you want to go in. So go and give that a try. And when we come back in the next lecture, I'll show you my solution to it and also some ways of expanding on it to do some more interesting things with that data as well. So go give it a shot and see if you can get that working, and then come back and I'll show you my solution. Don't peek at the solution. It is in there in the course materials, so make sure you create a new file that has some unique name, and resist the temptation to peek, guys. Then come back and we'll go over the answer. 25. Exercise Solution: Friends by Age, with Datasets: Okay, so I hope you were able to get through that exercise and implement the friends by age activity using Datasets without too much trouble. If you'd like to compare your approach to mine, here is my solution. Open up the friends by age dataset script here in the course materials. And again, there are many ways of doing this. So if your code does not match mine exactly, that's okay. All that really matters is that you got the same end results. So spot-check a few of the average results that you got for different ages, and if they match up with mine, you're good. So let's walk through what I did here. A lot of this will look familiar. We created a case class called FakeFriends in this case instead of Person, just to mix things up a little bit. This defines the explicit schema of our data coming in from the fakefriends.csv file, which consists of an id, name, age and friends. We've seen this before, just with a different name. First thing we do is set our log level and create a SparkSession, same as we did before. And as we did before, we load up the data/fakefriends.csv file using the inferred schema from the header.
And then we force that DataFrame into a Dataset by saying .as and passing in that case class FakeFriends to define explicitly what its schema is. Now we can check all the types and whatnot at compile time, and do some optimizations at compile time as well. Alright, so the first thing we do is go and select just the columns we need. So as you might recall, the only things that we really care about in this example are the age and the number of friends. The user IDs and the names are irrelevant for this problem. So we're just going to select from that Dataset the age and friends columns, and put that into a new Dataset called friendsByAge. And now all we have to do is say friendsByAge.groupBy on age. So that's going to group everything by unique age values. And then once we have that, we will furthermore say .avg to get the average of the friends column grouped by age, and then .show to display the results. And that's it. We would actually be done at this point. There are some more tricks I want to show you here, so there's more to talk about. But for the challenge as stated, that's all you need. So that's pretty cool, right? In two lines of code, we basically did what we did in the entire example using RDDs before. Let's go back and look at that just for comparison purposes. So here's the friends by age .scala script where we did this using RDDs. And you'll see it's a lot more complicated here. So back here we had to write this parseLine function to go and parse everything out and split things out by commas and extract the fields we want and return a new tuple with the fields that we want. Whereas with Datasets, we can just use the CSV loader that's built in to our SparkSession and use the select statement to select the fields that we want. So much simpler using Datasets. Using RDDs, we had to do all sorts of convoluted things to compute that average, right? We had to go and map things to (x, 1) here.
So we could just count everything up using reduceByKey. And it's a pretty twisted way of doing what we did over here in the Dataset example just using groupBy and avg. And after that reduceByKey, we had to compute the average by hand as well by using mapValues. So this code, I think, makes a lot more sense. It's a lot easier to read. It's a lot simpler, as opposed to what we had to do here using RDDs to do the same thing. So that's a good example of where Datasets can be a lot easier to use, and it might even be faster. But under the hood, it's RDDs at a lower level. And so sometimes it's good to go to that lower level if you really need to tweak things at that level. Let's move on and take this to the next step though. So what if I wanted to sort that data? It turns out that's easy as well. If I wanted to sort these things by age, I could just stick a .sort("age") command there before the show. And that will take the results of my averages by age and just sort them by the age column before displaying them. And furthermore, I could format that more nicely. Our averages will sometimes have a lot of decimal places to them, and it doesn't make sense to have that much precision. I could say agg and then round of avg. This requires a little bit of getting your head around. So we have the average of the number of friends here that we did before. And then we're saying we're going to round that to two decimal places of precision. So that's how the round command works there. And we have to use this agg function because we're aggregating all of the data for that given age that we grouped by, right? So agg means we're going to compute that average across the aggregate of all the values for that age that we selected there. After that rounding is done, we again sort the results by age and show them. And also, we can customize our column names as well. And that's what this example here is doing. So here we're adding in an alias after that aggregate of rounding everything.
And we're going to rename that to friends_avg. And then again, we sort it and show it. So let's walk through this again. After seeing the output, it might make a little bit more sense. Let's right-click on friends by age dataset and run it. And here we go. Alright, so let's walk through each set of results here from the top. Scroll up to the beginning here. Alright, so our first example was the simple one where we just did a group by age and then average of friends and showed it. And that works. So the group by age grouped everything together by unique ages, and then the avg of friends computed the average of the friends column for each age. It's just that easy, and indeed this does work. However, you'll notice that the ages are not sorted, so it's a little bit hard to look things up and get a specific answer from it. Also, the display here is a little bit weird, right? We have all these crazy amounts of precision on the averages that are probably not necessary. So in our next example, we solve the sorting problem. All we did here was to add a .sort with an age parameter to the mix there at the end. And that did work. So we see that the ages are now sorted. We start with 18, 19, 20, 21, etc. And furthermore, we can do it even more nicely. So if we go down to this third example here, in this one we're rounding the results to two decimal places of precision by applying a round function to the average and using agg to apply that to the entire set of results for that age group. So here we have the age and the expression round(avg(friends), 2), which is rounding that average to two decimal places of precision. And you can see that actually works as intended. Very cool. And finally, we gave that a custom column name by using the alias command within that aggregate as well.
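The progression just described (plain average, then sorted, then rounded, then aliased) can be sketched as follows. This is reconstructed from the narration, not copied from the course file: the object name and in-memory sample rows are assumptions, and `friends_avg` is the column name as I understand it from the lecture.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, round}

object FriendsByAgeSketch {
  case class FakeFriends(id: Int, name: String, age: Int, friends: Int)

  // Final form: average friends per age, rounded to 2 places, renamed, sorted by age.
  def averageFriends(friendsByAge: DataFrame): DataFrame =
    friendsByAge
      .groupBy("age")
      .agg(round(avg("friends"), 2).alias("friends_avg"))
      .sort("age")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("FriendsByAgeSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows; the course loads its fake friends CSV and applies .as[FakeFriends].
    val ds = Seq(
      FakeFriends(0, "Will", 33, 385),
      FakeFriends(1, "Ann", 18, 10),
      FakeFriends(2, "Bob", 18, 15)
    ).toDS()

    // Keep only the two columns the problem needs.
    val friendsByAge = ds.select("age", "friends")

    friendsByAge.groupBy("age").avg("friends").show()                               // 1: plain average
    friendsByAge.groupBy("age").avg("friends").sort("age").show()                   // 2: sorted by age
    friendsByAge.groupBy("age").agg(round(avg("friends"), 2)).sort("age").show()    // 3: rounded
    averageFriends(friendsByAge).show()                                             // 4: with an alias

    spark.stop()
  }
}
```

Without the alias, the computed column is named after the whole expression, which is why giving it a short name helps when later steps need to refer to it.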
So by saying .alias, we renamed that column from round(avg(friends), 2) to something more useful, friends_avg. And if you were to save this into a resulting dataset, that would make it easier to refer to that new column that we computed by its name without having to type out that huge convoluted result. So when you're chaining datasets together that way, that's often very useful for having an easier way to refer to those computed results in that column. And there you have it: datasets with our friends-by-age example, done a few different ways and extended in a few different ways. So I hope you learned a few things there. Let's move on to some more examples and get even more familiar with datasets.
26. [Activity] Word Count example, using Datasets: So let's continue to learn about datasets through examples. And again, we'll go back to our RDD examples and adapt our word count exercise to use datasets instead of RDDs. And we'll go straight to the more complicated version that actually uses regular expressions and sorts the results. Now in this example, we'll be using SQL functions to achieve the results that we want. So instead of using flatMap, we're going to use a function called explode that will explode columns into rows. So if a column holds a list of words, explode will put each word into its own row. We'll use the split function to split things up by word. We'll use the lower function in order to convert everything to lowercase. And a few notes here about syntax to be aware of. So when you're passing a column name as a parameter into these SQL functions, the syntax is dollar sign and then the column name in quotes. So for example, split, dollar sign, quote value, comma, and then the regular expression, that is split($"value", ...), would split up whatever text is in the value column based on the word delimiters there. Similarly, filter would work the same way. If we wanted to filter to eliminate any words that are an empty string, we would do it this way.
Filter, dollar sign, and then word in quotes. And note the weird inequality operator here. When you're using filter statements, it's not going to be bang, equal. It's going to be equal, bang, equal (=!=). So that's a little bit of a weird thing. Similarly, the equality operator here would be equals, equals, equals (===). So just something to remember. You will get an error in the compiler if you don't get that right, so that will remind you if there's an issue, but it's kind of a weird, quirky thing there. So again, that's an example of just testing word against an empty string, just two quotes with nothing in between them. Now, it is worth pointing out that datasets really work best with structured data. And when we're just dealing with lines of text, that's not really structured data. So you're kind of fitting a square peg into a round hole here by using datasets, to some extent. Sometimes RDDs are simpler and this might be one of those cases, but you don't just have to use datasets or RDDs. You can use them both together, and we'll talk about that more in a second. One other thing to note, though, is that something else that's weird about using datasets in this sense is that because we don't have structured data coming in, we don't have a schema coming in either. So the schema that ends up getting inferred is kind of arbitrary. We just end up with a DataFrame full of Row objects with a column named value by default for each line of text, because there is no schema. Every row is just a line of text. So there's a default name for that column called value that you just kinda have to know about. And that's how we'll refer to that data going forward in the script. But again, you can use both, and we'll see a way of doing this as well. So remember, you can convert RDDs to datasets. And so sometimes it makes sense to just load your data using the RDD interfaces and then convert it to a dataset to make further processing easier later on. So we'll try that too.
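As a quick illustration of the column syntax and the two quirky operators just mentioned, here is a small sketch. The object and helper names and the tiny in-memory word list are made up; the $"word" syntax, =!=, and === are the Spark Column operators the lecture describes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ColumnSyntaxSketch {
  // Keep rows whose "word" column is NOT an empty string (=!= is "not equal").
  def nonEmptyWords(spark: SparkSession, words: DataFrame): DataFrame = {
    import spark.implicits._ // enables the $"columnName" syntax
    words.filter($"word" =!= "")
  }

  // Keep rows whose "word" column IS an empty string (=== is "equal").
  def emptyWords(spark: SparkSession, words: DataFrame): DataFrame = {
    import spark.implicits._
    words.filter($"word" === "")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ColumnSyntaxSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up data: two real words and two empty strings.
    val words = Seq("spark", "", "scala", "").toDF("word")

    nonEmptyWords(spark, words).show()
    emptyWords(spark, words).show()

    spark.stop()
  }
}
```

The reason for the unusual spelling is that these operators build a Column expression for Spark to evaluate on each row, whereas plain Scala == and != compare the Column object itself and return a Boolean.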
Let's dive in and see what it looks like. So let's open up the WordCountBetterSortedDataset script here and walk through what's going on. So we're gonna do this a couple of different ways here. Let's start off like we always do, just importing all the stuff we need, declaring our package, declaring an object. This time we're going to have a very simple case class for loading up our initial dataset of data. We'll just call it Book. And the only data it has is a string that we're going to have to call value. Remember that; we'll get back to that later as to why it has to be named value. Let's go to our main function, set the log level, create a Spark session just like before. And again, we're going to initially import this as a DataFrame by using spark.read.text with data/book.txt. And in order to implicitly figure out that schema, we need to import spark.implicits._ to do that at the DataFrame level. We then call .as[Book] to force that into a dataset that has a defined schema that we can know about at compile time. What's weird here is that when we inferred that schema implicitly with the DataFrame, there was no schema provided. So all we know is that we have lines of text coming in through spark.read.text. And by default, every row of text is going to be called value. That's the name of that one column that we have. And we just have to know that. So in order for that dataset to match up with the inferred DataFrame schema, we need to name that field value. All right, so we didn't just pick that name arbitrarily. If it were anything other than value, this line would not work. So again, we're getting into how using datasets can be a little bit clunky in some situations when you're not dealing with actual structured data sources coming in from a database or some database format. But once we've done this, we will just move on.
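Here is a minimal sketch of that loading step, under a couple of assumptions: the object name LoadBookSketch and the helper loadBook are made up, and the path data/book.txt is simply how I read the path spoken in the lecture, so adjust it to wherever your copy lives.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object LoadBookSketch {
  // The single field must be named "value" so that it lines up with the
  // default column name that spark.read.text gives each line of text.
  case class Book(value: String)

  def loadBook(spark: SparkSession, path: String): Dataset[Book] = {
    import spark.implicits._ // supplies the Encoder that .as[Book] needs
    spark.read.text(path).as[Book]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("LoadBookSketch")
      .master("local[*]")
      .getOrCreate()

    // Assumed path; change it to match your course materials.
    val input = loadBook(spark, "data/book.txt")
    input.printSchema() // a single string column named "value"

    spark.stop()
  }
}
```

If the case class field were called anything else, say line, the .as[Book] conversion would fail at runtime because Spark could not find a column with that name.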
So now we're going to split that up using a regular expression, like we did before with the split operator when we were using RDDs. And then instead of using flatMap to blow that out into individual rows for each word, we're going to use the explode function. So look at this more carefully. We're going to say input, which is our input dataset that contains a line of text on each row. And we'll say select, explode, split value with a regular expression pattern to break things up into individual words. Okay, so let's start at the middle here. So we start off saying split the value column based on the word delimiter. And again, value is just the name that was assigned automatically to that one column of text that we have. So at this point we're going to have an array of words. We're going to call explode on that to then split that up into a series of individual rows, one for each word. And then furthermore, we'll give that an alias called word, because otherwise the column name would be this big convoluted description of the expression that generated it. So at this point we should have a new dataset that has a word column where every word is on its own row. Now furthermore, it turns out that we end up with a bunch of empty strings when we do it this way, and we want to filter those out so we don't get this big entry in our final results about empty strings being really common. So to do that, we're just going to say .filter. And we're going to filter for where the word column is not equal to an empty string, just two quotes with nothing in between them. So again, note the syntax here: dollar sign and then the column name in quotes. And at this point we had already given an alias to that new column called word, so we can refer to it by the name word in the filter operation that comes after it. Alright, so now we have a new words dataset that has individual words in each row. We've filtered out all the empty words. Next, we're going to convert that to lowercase.
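The select, explode, split, alias, and filter chain described above might look something like the following sketch. The lecture does not read out the regular expression, so the pattern "\\W+" (one or more non-word characters) is an assumption on my part, as are the object name, the helper name toWords, and the sample lines.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{explode, split}

object ExplodeWordsSketch {
  // Turn a DataFrame with one "value" column of text lines into a
  // DataFrame with one "word" column, one word per row.
  def toWords(spark: SparkSession, lines: DataFrame): DataFrame = {
    import spark.implicits._
    lines
      // split gives an array of words per line; explode makes one row per
      // array element; alias replaces the long generated column name.
      .select(explode(split($"value", "\\W+")).alias("word")) // assumed regex
      // Splitting can leave empty strings behind (for example after
      // trailing punctuation), so drop those.
      .filter($"word" =!= "")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("ExplodeWordsSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up lines standing in for the book text.
    val lines = Seq("Hello, world!", "Spark is fun").toDF("value")
    toWords(spark, lines).show()

    spark.stop()
  }
}
```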
And again we're going to say words.select, passing in our lower function. And we're going to apply that lower function to the word column. And we will again give this a new alias and keep the name word. We're just going to basically convert that to lowercase in place. Next, we'll count up the occurrences of each word by using the groupBy operator, and then count. So groupBy will collect together all the unique words, and then count will count up all the occurrences of each. So now we have a new wordCounts dataset that contains a count of each individual word. We're almost done. Now we just need to sort the results by the count column and display them. And note the syntax here. So what's going on here exactly when we say wordCountsSorted.show(wordCountsSorted.count.toInt)? We're just making sure that the show command shows the complete set of results here. By default, it will just show the first 20 rows. We want to show all of the results. So the way show works is that you pass in the number of rows that you want to show. And we want the whole thing. So we're going to call wordCountsSorted.count to get the count of how many rows are in wordCountsSorted. And then we have to explicitly convert that to an integer to make sure that show is getting the data type that it expects. So whenever you want to show the entire contents of a dataset, that's how you would do it. You would just say the dataset name, then .count.toInt, and pass that as a parameter to the show command to make sure that we get every row displayed. So let's go ahead and run this before we move on and just see how that works. So right-click, run WordCountBetterSortedDataset. Off it goes. And there are our results. So let's scroll up. And you can see it worked. So it looks like I have a typo in there. Use three. These are all the words that only appeared once in my entire book. And as we scroll down, we'll see more and more common words as we go. So there's the end of it. Yes.
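The lowercase, count, sort, and show-everything steps from this part of the walkthrough can be sketched as follows. The object name, the helper countWords, and the four sample words are made up for illustration; the Spark calls themselves (lower, groupBy, count, sort, show) are the ones named in the lecture.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lower

object CountWordsSketch {
  // Takes a DataFrame with a "word" column and returns (word, count)
  // rows sorted from least to most common.
  def countWords(spark: SparkSession, words: DataFrame): DataFrame = {
    import spark.implicits._
    // Lowercase "in place" by reusing the same column name as the alias.
    val lowercaseWords = words.select(lower($"word").alias("word"))
    // groupBy + count produces a new column that Spark names "count".
    val wordCounts = lowercaseWords.groupBy("word").count()
    wordCounts.sort("count")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("CountWordsSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up words; "You" and "you" should be counted together.
    val words = Seq("You", "you", "to", "Spark").toDF("word")
    val wordCountsSorted = countWords(spark, words)

    // show() defaults to 20 rows, so pass the full row count (as an Int)
    // to display everything.
    wordCountsSorted.show(wordCountsSorted.count().toInt)

    spark.stop()
  }
}
```

Note that count() on a dataset returns a Long while show expects an Int, which is why the explicit toInt is there.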
So the word "you" is actually the most common word. Pretty unsurprising results, and the same results that we got back with RDDs as well. Now let me show you another way to do it because, like I said, the process of loading that data into a dataset was a little bit clunky. Maybe an RDD is better suited to loading that raw data. So like I said in the slides, we can mix and match. You can get the best of both worlds. So in this alternate approach here, we're loading up the data using the RDD interface. So I'm going to get the SparkContext out of the SparkSession and call the old textFile function on that. So that's just going to load up all the book.txt data into an RDD where every row is a line of text. So now we can just call flatMap on bookRDD and split that out using a regular expression like we did before. And that's arguably a little bit simpler than the way we did it using a dataset. We had to do this weird convoluted thing with split and explode and alias and filter to get that to work using a dataset. So here's an example where an RDD is actually simpler. But when you want to do analysis, datasets are usually the better choice, right? Because you have the entire power of SQL behind you and the better optimization that that offers. So at this point we can convert the RDD to a dataset. And that's what's going on here with toDS. And from that point on, the code is basically the same. The only difference is that we're referring to that value name instead of word, because at this point we're still dealing with the default column name of value that we got when we converted the RDD to a dataset. And that too works. So we already ran this. If we scroll down, we can see that second run where we got the same exact results just using a different way, starting off using an RDD to load the data and a dataset to actually analyze it. So sometimes that can be a useful trick, especially if you're not dealing with a structured data source.
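A rough sketch of that hybrid approach, loading and splitting with an RDD and then switching to a dataset for the analysis, is shown below. As before, the regex "\\W+", the path data/book.txt, and the object and helper names are assumptions rather than quotes from the course script.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lower

object HybridWordCountSketch {
  def countWords(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._ // needed for toDS() and the $"..." syntax

    // RDD side: load raw lines and split them into words with flatMap.
    val bookRDD = spark.sparkContext.textFile(path)
    val wordsRDD = bookRDD.flatMap(line => line.split("\\W+")) // assumed regex

    // Dataset side: an RDD of strings becomes a dataset whose single
    // column gets the default name "value".
    val wordsDS = wordsRDD.toDS()

    wordsDS
      .select(lower($"value").alias("word"))
      .groupBy("word")
      .count()
      .sort("count")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("HybridWordCountSketch")
      .master("local[*]")
      .getOrCreate()

    // Assumed path; change it to match your course materials.
    val wordCountsSorted = countWords(spark, "data/book.txt")
    wordCountsSorted.show(wordCountsSorted.count().toInt)

    spark.stop()
  }
}
```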
And you can see we've got the same answer there in the end. So that's two ways of doing it: using datasets alone, and using a hybrid RDD and dataset approach. So having these tools in your back pocket can sometimes be useful.
27. [Activity] Revisiting the Minimum Temperature example, with Datasets: So using datasets is actually pretty simple, as you've seen, but we're going to quickly go through a few more examples here just to go back and re-implement those RDD examples with datasets for the sake of example. So let's go back to the minimum and maximum temperature exercises that we did and take a look at those using datasets instead of RDDs. There's not a whole lot new in this example. One new thing, though, is that we're going to be providing an explicit schema to SparkSession.read. So instead of inferring it from a CSV file, we're going to tell it explicitly what the schema is as it's reading in the DataFrame. We're still going to have to convert that to a dataset with a case class. But this is another way of importing data from a text file that we'll look at. And we'll also look at using the withColumn function on a dataset to create a new column using our own custom function. So let's dive in and see how that all works. So let's take a look at the original RDD version of this. That's the MinTemperatures script here. And just to refresh your memory, we loaded up a 1800.csv file that contained weather information from the year 1800. We then parsed that out to extract the fields we wanted and converted it to Fahrenheit as we went. And then we trimmed out anything that wasn't a TMIN entry for the actual type of weather entry there with the filter command. We mapped that to station ID and temperature tuples and reduced it by key to keep track of the minimum temperature found for each unique weather station. So let's see how we do this with a dataset instead of an RDD. Click on MinTemperaturesDataset. And here's our new implementation here.
Interestingly, it's more code, not less. So again, for simple things like this that aren't really dealing with database sorts of problems, sometimes an RDD might actually be easier. But let's take a look. So we're going to start off with our package declaration, and we're importing the specific SQL types that we're going to use within our script, the SparkSession and the SQL functions. So instead of importing everything, it's usually a better practice to just import what you need, which is what we're doing here. We start off by declaring a Temperature case class. This is going to be the format that we read in from our actual data file. It has the format of a string-based station ID, a date, which we can interpret as an integer, a measurement type, which is a string, and a temperature, which is going to be a floating point value. We set our log level and create our Spark session as we've done before. And now we're going to do something a little bit new here. We're going to declare a specific schema for that input data. So instead of inferring it from a header, because we don't have a header on this CSV file, we're going to tell it explicitly what those columns mean that it's reading in. Now before, we cheated by saying import the header and infer the schema from the header, but we don't have that this time. So here's an alternative way of doing it. So we're setting up a temperatureSchema value that will be a StructType. And we're going to call the add function on this StructType to add different fields to that structure. So we just say add station ID, which is a string type, and it can be nullable. Date is an integer type, measure type is a string type, temperature is a float type. See, this all matches up with the names that we have here in our case class as well. Same types, same names. So now that we have that, we can import spark.implicits._ and read our data from the CSV file.
This time, instead of saying infer it from the header, we're going to say .schema with temperatureSchema. So that's going to allow us to construct our DataFrame correctly from that CSV data. And then as before, we convert that explicitly to a dataset with a hard-coded compile-time structure that's living in the Temperature case class. So again, you see the difference here. We're actually still working out the schema at runtime, because we're defining that schema as a structure at runtime. Whereas by saying .as[Temperature], at compile time we know what that Temperature object is, and Spark can make more optimizations and give us compile-time errors instead of runtime errors. So it might seem like an extra step that might not be necessary, but we are getting a benefit from it. So now we're going to go back and do the same stuff that we did before with RDDs, but using a dataset. So the first thing we're going to do is filter out anything that's not a TMIN entry. So instead of using filter on an RDD, we just use filter on the dataset. And we use this syntax here, dollar sign again preceding the column name. So we want to test the measure type column and see if it's equal to TMIN. And again, in these filter statements, the syntax is a little bit unusual for equality. It's three equals signs. So that will create a new minTemps dataset that only contains TMIN lines. Next, we want to just select out the columns that we need. There's a bunch of stuff that we don't need, like the specific dates, because we're just looking for the minimum across the entire year. We don't really care when that happened. So we're just going to do a select with a list of the columns that we want: station ID, comma, temperature. That's all that we need to preserve. Now we get into the meat of things. We will group them and find the minimum. So groupBy station ID will group all of the entries together for unique station IDs. And there should only be two of them left in this dataset.
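Here is a minimal sketch of the explicit-schema load plus the filter and select steps covered so far. The exact identifiers (stationID, date, measure_type, temperature, temperatureSchema) are my spelling of the names spoken in the lecture, and the path data/1800.csv, the object name, and the helper names are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.types.{FloatType, IntegerType, StringType, StructType}

object MinTempsLoadSketch {
  // Compile-time shape of each row; names and types mirror the schema below.
  case class Temperature(stationID: String, date: Int, measure_type: String, temperature: Float)

  // Runtime schema for a CSV file that has no header row.
  val temperatureSchema: StructType = new StructType()
    .add("stationID", StringType, nullable = true)
    .add("date", IntegerType, nullable = true)
    .add("measure_type", StringType, nullable = true)
    .add("temperature", FloatType, nullable = true)

  def loadTemperatures(spark: SparkSession, path: String): Dataset[Temperature] = {
    import spark.implicits._
    spark.read.schema(temperatureSchema).csv(path).as[Temperature]
  }

  // Keep only TMIN rows, then only the two columns needed later.
  def stationMinReadings(spark: SparkSession, ds: Dataset[Temperature]): DataFrame = {
    import spark.implicits._
    val minTemps = ds.filter($"measure_type" === "TMIN")
    minTemps.select("stationID", "temperature")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("MinTempsLoadSketch")
      .master("local[*]")
      .getOrCreate()

    // Assumed path; change it to match your course materials.
    val ds = loadTemperatures(spark, "data/1800.csv")
    stationMinReadings(spark, ds).show()

    spark.stop()
  }
}
```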
And then once we have those groups, we will furthermore call min to find the minimum entry within each group, looking at the temperature column and finding the minimum value of temperature for each group of station IDs. Got it. So now we have a minTempsByStation dataset that is just station ID and minimum temperature. We're almost done. The last thing to do is to convert those temperatures to degrees Fahrenheit. And we may as well sort the results while we're at it. So that's what's going on here. We're going to make a new dataset called minTempsByStationF, because we are converting this to Fahrenheit. And that will be taking our minTempsByStation dataset and applying this withColumn function on it. So this is going to create a new column named temperature. This is actually different from the one that we started with here. So withColumn can either replace an existing column or create a new one. In this case, we're actually creating a new one, because with minTempsByStation, all that we get from that groupBy command is the station ID and a new column that's called min(temperature), with temperature in parentheses. So our temperature column that we had originally is actually gone at this stage, and we're going to recreate it now. We're going to make a new temperature column here with the withColumn command. So with withColumn, we're going to call our new column temperature. And it will be constructed using this function. So we saw withColumn before when we talked about user-defined functions. Same concept here. We're just not using an actual function that we defined. We're passing in an expression here explicitly. So let's start in the middle. We start with the min(temperature) column, and that's what this column was automatically named when we created it with the groupBy command. So a groupBy followed by min on temperature will create a min(temperature) column.
And if you were to do a show on that dataset while you're debugging things, you would have seen that column name. Alternately, we could have called alias on that to give it a more explicit name that we could refer to later on, but we didn't. So we're just going to use this default name here. So we're gonna take that and multiply it by 0.1. That's going to convert the raw data format into actual degrees Celsius. And then we'll convert that to Fahrenheit by multiplying by nine-fifths and adding 32. Furthermore, this is going to live inside the round function here, with a parameter of two, meaning we want to round that to two decimal places of precision. Okay, so that's going to create a new temperature column again, with our temperature converted to Fahrenheit and rounded to two decimal places. Furthermore, we will select the data that we need, since we no longer need that min(temperature) column. We just want the converted final Fahrenheit column instead. So we're going to select out the station ID and temperature columns, because that's all we want to display in the end. And finally, we'll sort it by temperature so that we get these results sorted by the minimum temperature that was observed at each station. Finally, we'll collect those results just like you would have done with an RDD. And at this point all of our data is back within our driver script, and we're living within the world of Scala again. So we just go through the Scala code here and iterate through all of the results in our result set. We extract the first and second elements of each row and call them station and temp. And we will furthermore explicitly cast the temperature as a floating point value. We can then format that using the printf-style formatting here. And finally, println the actual results using the substitution operator, like we talked about way back in our Scala crash course. So let's run it and convince ourselves that it works. Right-click, run MinTemperaturesDataset, and off it goes. And it worked. Great.
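The grouping, Fahrenheit conversion, and final printing steps can be sketched like this. The object and helper names and the three in-memory readings are made up; the column name min(temperature) is the default name the lecture describes, and I read the temperature back through Number rather than a direct Float cast only so the sketch does not depend on the exact numeric type Spark picks for the rounded column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.round

object MinTempsConvertSketch {
  // Input: (stationID, temperature) rows with temperatures in tenths of a
  // degree Celsius. Output: one row per station with the minimum in Fahrenheit.
  def minTempsFahrenheit(spark: SparkSession, stationTemps: DataFrame): DataFrame = {
    import spark.implicits._
    // After this, the columns are "stationID" and "min(temperature)".
    val minTempsByStation = stationTemps.groupBy("stationID").min("temperature")

    minTempsByStation
      // tenths of Celsius -> Celsius -> Fahrenheit, rounded to 2 places
      .withColumn("temperature",
        round($"min(temperature)" * 0.1f * (9.0f / 5.0f) + 32.0f, 2))
      .select("stationID", "temperature")
      .sort("temperature")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("MinTempsConvertSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up readings in the same raw units as the course data.
    val stationTemps = Seq(
      ("ITE00100554", -148.0f), ("ITE00100554", -75.0f), ("EZE00100082", -135.0f)
    ).toDF("stationID", "temperature")

    // collect() brings the rows back to the driver as plain Scala objects.
    minTempsFahrenheit(spark, stationTemps).collect().foreach { row =>
      val station = row.getString(0)
      val temp = row.getAs[Number](1).floatValue()
      val formattedTemp = f"$temp%.2f F"
      println(s"$station minimum temperature: $formattedTemp")
    }

    spark.stop()
  }
}
```

As a sanity check on the arithmetic: a raw reading of -148 is -14.8 degrees Celsius, and -14.8 times 9/5 plus 32 is 5.36 degrees Fahrenheit.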
So we have our two weather stations there with their minimum temperatures in Fahrenheit, rounded to two decimal places of precision. Woo, it worked. All right, so if you want a little bit of an extra challenge, you can go ahead and try to modify this to do maximum temperatures instead of minimum temperatures, just to get a little more hands-on. It should be trivial, so I'm not actually even going to provide you with a solution for that. But if you want to play around with this some more, I certainly encourage it.
28. [Exercise] Implement the "Total Spent by Customer" problem with Datasets: So let's wrap up this section introducing datasets with another exercise, another hands-on challenge. What we're gonna do is go back to the exercise from the previous section, where we calculated the total spent by customer in our fake e-commerce dataset. And we're going to revisit that using datasets instead of RDDs. You probably saw that one coming, huh? So to recap, what we're gonna do is add up the amount spent by customer in our dataset here. The raw data coming in is in CSV format, and it contains three columns: a user ID, an item ID, and a price paid for that item. What I wanna do is add up the total amount paid by user ID. So in this simple example again, user ID 44 spent a total of $77.83, user ID 35 spent $78.97, and so on and so forth. So our strategy is going to be somewhat similar to what we did with RDDs. It's just that the technique will be a little bit different. So we will be loading up the data/customer-orders.csv file as a DataFrame. An explicit schema will be the easiest way to do it. And you can choose to convert that explicitly to a dataset if you want to. Obviously that will sometimes lead to better performance, but you don't have to do it. After that, we're going to group that dataset by customer ID and sum up the amount spent by customer ID. And if you want bonus points, not that anyone's really keeping score.
You can round that to two decimal places while you're at it, and sort it by the total amount spent, so we can see our top spenders and cheapest customers as well, and then show the results. And here are a few useful snippets if you want some more hints. Reviewing the previous examples in this section will certainly be helpful. And there is a SQL function called sum that you can use to add things up after you've grouped them together. And specifically, the syntax for rounding that after summing it will look like agg, round, sum, whatever the column name is that you're summing, comma two, to round to two decimal places. So a little cheat there. And yeah, the solution is in the course materials. Please resist the temptation to peek and look ahead. Do try this yourself and see if you can work it out. It's good practice. And I'll come back and walk through my solution in the next lecture.
29. Exercise Solution: Total Spent by Customer with Datasets: I hope you've had some success in doing that exercise and recreating our total amount spent by customer activity using datasets. Let's take a look at my solution and you can compare my approach to yours. Open up the TotalSpentByCustomerSortedDataset script here. And again, you know, your code doesn't have to be exactly like mine. What counts is the end result, really. There are many ways of doing it. So we start off importing the stuff we need, and we create an object called TotalSpentByCustomerSortedDataset and start off with a case class defining the schema that our dataset will be forced into. So we are saying explicitly at compile time that our customer orders dataset is going to contain an integer customer ID, an integer item ID, and a double-precision amount spent. Then we go into my main function here and we set up our Spark session like we always do. And we will create a schema for use with reading in the CSV file as a DataFrame initially.
So we're going to create a customerOrdersSchema, which, you know, looks a lot like some previous code that we looked at, so hopefully you were able to lift that. We'll create a new StructType and add in a customer ID of type integer, an item ID of type integer, and an amount spent of type double. And then we actually read that in from the CSV file here using the schema that we provided. And we will force that into a dataset using the CustomerOrders case class as we did before. Now we get into the meat of the exercise. Here, we create a new totalByCustomer dataset that pretty much does everything. We start by grouping by the customer ID field. And remember, we defined that name up here in our case class. That name has to match the one up here. So by grouping by customer ID, we group together all of the various purchases for each customer. And then this next line is what actually sums those all up. So let's start from the middle and work our way out. We start by summing the amount_spent column. So for every unique customer ID, we will sum up all the amounts spent associated with that customer ID. And that's further wrapped by a round function to round that to two decimal places of precision. We're going to use .agg to aggregate that result into a single column in the final result. And we will give that new column a name by saying .alias, total_spent, to make it easier to refer to. Because we need to give it a name to be able to sort on it, right? So I realize this agg function can be a little bit confusing. Basically, what it's doing is saying that we're going to aggregate these functions together into a single column, okay? And we're also going to give that column a name using .alias as well. So at this point, the structure of our dataset is going to be a customer ID column and a total_spent column.
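One way the solution described above might look is sketched below. The lecture only spells out amount_spent and total_spent; the other identifiers (cust_id, item_id, customerOrdersSchema, the object and helper names) and the sample orders are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{round, sum}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructType}

object TotalSpentSketch {
  case class CustomerOrders(cust_id: Int, item_id: Int, amount_spent: Double)

  // Explicit schema for the header-less CSV; names match the case class.
  val customerOrdersSchema: StructType = new StructType()
    .add("cust_id", IntegerType, nullable = true)
    .add("item_id", IntegerType, nullable = true)
    .add("amount_spent", DoubleType, nullable = true)

  def loadOrders(spark: SparkSession, path: String): Dataset[CustomerOrders] = {
    import spark.implicits._
    spark.read.schema(customerOrdersSchema).csv(path).as[CustomerOrders]
  }

  // Sum per customer, round to two places, name the result, and sort on it.
  def totalByCustomerSorted(orders: Dataset[CustomerOrders]): DataFrame =
    orders
      .groupBy("cust_id")
      .agg(round(sum("amount_spent"), 2).alias("total_spent"))
      .sort("total_spent")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("TotalSpentSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up orders; swap in loadOrders(spark, ...) for the real file.
    val orders = Seq(
      CustomerOrders(1, 10, 10.25), CustomerOrders(1, 11, 5.5), CustomerOrders(2, 12, 3.0)
    ).toDS()

    val result = totalByCustomerSorted(orders)
    result.show(result.count().toInt) // show every row, not just the first 20

    spark.stop()
  }
}
```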
Now we can sort it by just saying .sort with the total_spent column name that we specified in the previous line, and show the results, again using this little trick to get the grand total of how many rows are in the totalByCustomer dataset, converting that to an integer, and passing that in as a parameter to the show function to make sure that we explicitly show the entire dataset and not just the first 20 rows. So let's go ahead and kick that off. And there we have it. It looks like it worked. It looks like the results are sorted by amount spent. So our biggest spender was user ID 68, spending $6,375.45. Very cool. All right. I hope you enjoyed that, and you might have gone even further. I mean, I could have taken this a step further by collecting the results, for example, and using a printf format to actually iterate through every individual row that was returned back and format the output more to my liking. But yeah, if you did that, bonus points for you. But hey, I hope you got through this. So we've had an introduction to datasets and how they work at a fairly high level. There is more to explore, I should add. There's way more to Apache Spark than I can cover in this course. So if you ever want to refer to the documentation for the complete reference on what you can do, you can go to spark.apache.org/docs/latest to get the latest documentation. If you want to drill into everything you can do with the dataset, for example, you can just navigate to org.apache.spark.sql.Dataset, which is going to be the path for that class name. And it will tell you all about it in every intricate detail. If you scroll down, you'll see some examples that are even more useful. And you can see a list of all the methods that are available on the dataset. Most of them will look familiar. We did cover a lot of them. For example, collect and count, and foreach for applying a function to each row.
Head is a way to get the first n rows of the dataset. Show, we've used that a lot, right? Take can also be used to take a certain number of rows from the dataset. There are some other things to explore here as well. You can actually do explain, to explain the actual way that it's going to be executed, which is kind of interesting for debugging purposes. There are ways of persisting it to disk. There's a way to convert it to an RDD. We looked at that a little bit. We can print out the schema. We've done that before. We can actually convert it to a DataFrame as well if we want to, and there's also a write function for storing it to disk. And we'll talk about streaming later; we'll get to that later on. And there are other things here, like various operations like distinct to get just the distinct values in the dataset. We've looked at filter before, and we've talked about flatMap. You can apply that to a dataset as well as an RDD. There's groupByKey, which we've talked about, plus join and intersect. You can do some Boolean operations there, too. As for mapping, you can do that on a dataset just like you could on an RDD, although it's going to be a little bit more cumbersome with a dataset. And yeah, just go ahead and fish through this if you ever want to get a more complete picture of what you can do and what's available. And if you have any questions about the syntax and parameters that are supported on these dataset functions, this is the place to turn to to find out. As you can see, there's a long list of what you can do with a dataset. We've touched on some of the more common things you can do, but as you can see, there's quite a menu here. So we'll continue using datasets throughout the rest of this course. And also note that in your course materials, for every exercise there's both an RDD version and a dataset version. So if you ever want to compare the two, they're all there. We're not gonna go through every single version of every single script.
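To try out a few of the Dataset methods mentioned in that tour of the documentation, here is a small sketch on a made-up dataset of integers. The object name and helper are hypothetical; head, take, distinct, explain, printSchema, rdd, and toDF are all methods on org.apache.spark.sql.Dataset.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetMethodsSketch {
  // Returns the first n values and the number of distinct values.
  def firstAndDistinct(ds: Dataset[Int], n: Int): (Array[Int], Long) =
    (ds.head(n), ds.distinct().count())

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("DatasetMethodsSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(3, 1, 2, 3).toDS() // made-up data

    ds.printSchema()                    // column names and types
    ds.explain()                        // the physical plan Spark will run
    println(ds.take(2).mkString(", "))  // a couple of rows as a local array

    val (firstTwo, distinctCount) = firstAndDistinct(ds, 2)
    println(s"first two: ${firstTwo.mkString(", ")}, distinct values: $distinctCount")

    val asRdd = ds.rdd   // drop down to the RDD level
    val asDf = ds.toDF() // or view it as an untyped DataFrame
    println(asRdd.count())
    asDf.show()

    spark.stop()
  }
}
```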
Like I said, we're gonna focus on datasets from here on out. But if you're ever curious, you can look at both versions. And there are some that we haven't touched on too. For example, our initial RatingsCounter RDD example. There's also a dataset version of that in there if you want to play with it as well. So it's all there for your reference and enjoyment. And let's move on to some more complicated examples in our next section.
30. [Activity] Find the Most Popular Movie: So up to this point in the course, we've been introducing you to Spark using some pretty simple examples. But I want you to see that Spark has a lot of power beyond that as well. So even if you have a more complicated problem, sometimes if you think creatively, Spark can solve problems that you wouldn't have thought it could solve. And we're going to dive into some examples like that in the following section, using some pretty fun examples with movie data and superhero data. So let's dive in and see what you can do with Spark that you might not have thought you could do if you just think a little bit creatively about your problems. Okay, so I know the name of this section is more advanced examples of Apache Spark, but we need to start with something simple and work our way up from it. So to set the groundwork for this next example, we're gonna do something easy. We're going to find the most popular movie in the MovieLens dataset. And to recap, actually, did we talk about MovieLens already? We did, briefly, back with the ratings counter example. So we'll be ingesting the MovieLens 100K dataset here. And to recap the format, it contains these four columns. They correspond to a user ID, a movie ID, a rating, and a timestamp. And to find the most popular movie, all we need to do is find the movie IDs that have the most ratings, right?
So if we assume that somebody who rated a movie watched it, and we define popular as the most watched movie, then by just finding the movie ID that shows up most often in this dataset, we'll find the most popular movie. So let's go off to the code and see how we can do this using datasets. It's pretty straightforward, but again, we're going to build on this to do something a little bit more complicated in the next lecture. So let's explore the PopularMoviesDataset script in your course materials and walk through it. It's not too fancy, but we're going to build on top of it to do something deeper and more complex in subsequent lectures. We start off importing the stuff we need, as usual, and we'll call our object PopularMoviesDataset. We have a case class called Movie. The insight here is that all we're doing is counting up how many times each movie ID appears in our dataset, and to do that, all we need is the movie ID column, right? We don't care about the ratings. We don't care when they were rated. We don't care who rated them. All we care about is that the movie ID appears on each individual row. So that's the only data we actually have to extract, which makes things a little bit simpler. Diving into our main function, we set our log level and create a SparkSession, nothing special there. We create our movies schema for loading the initial DataFrame from the CSV file, although it's actually a tab-separated data file here. But that's okay; we can still use the CSV loader for it, as we'll see. We're going to tell it that the structure of the data is user ID, movie ID, rating, and timestamp, with three integers and a long to represent that data. We'll import spark.implicits._ and load up that CSV using that schema. And as we load it up, note the sep option set to backslash t. 
That's telling the CSV loader that our separator is actually not a comma but a tab character, which is how this data is actually formatted. We pass in the path to the data file, the one that contains the actual ratings information, and then we subsequently force it into a Movie dataset. By doing that, we're casting it down to that single column that consists of only the movie ID and nothing else. So, a little interesting trick there: your dataset does not have to include all of the columns of the DataFrame that you're building it from. And now the meat of it happens in this one single line, and it's very SQL-like in how it approaches the problem. We start off by grouping by movie ID, so we just group together all of the ratings for each movie ID. We then count them all up. And orderBy is basically how you do a sort in SQL, so it does the same thing here. We're basically saying order by the count column, which is going to be created by that count function there, in descending order, so that we get our most popular movies at the top of the list and our least popular at the bottom. After that, we'll show the top ten results, and that should give us our ten most popular movies. So let's kick that off and see what we get. Okay, there we have our results, and it seems like it worked. Apparently the most popular movie is movie ID 50, with 583 distinct ratings associated with that movie. But that's not terribly helpful, is it? How do we know what movie ID 50 is? So an obvious next step to extend this script would be to import the movie name associated with each movie ID, so that we can actually get some insight as to what these movies are. I mean, we could do it the hard way. This data lives in the u.item file in our dataset, and if we open that up to explore what's in there, you can see that it contains all sorts of extra information about each movie ID. 
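The load-and-count steps walked through above can be sketched roughly as follows. This is a minimal sketch, not the course script itself: the object name PopularMoviesSketch, the helper names, the column names, and the data/ml-100k/u.data path are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.desc
import org.apache.spark.sql.types.{IntegerType, LongType, StructType}

object PopularMoviesSketch {
  // Only the movie ID matters for counting, so the case class has one field.
  final case class Movie(movieID: Int)

  // Column names are assumed; the file itself has no header row.
  val ratingsSchema: StructType = new StructType()
    .add("userID", IntegerType, nullable = true)
    .add("movieID", IntegerType, nullable = true)
    .add("rating", IntegerType, nullable = true)
    .add("timestamp", LongType, nullable = true)

  // Reads the tab-separated ratings file with the CSV loader.
  def loadMovies(spark: SparkSession, path: String): Dataset[Movie] = {
    import spark.implicits._
    spark.read
      .option("sep", "\t") // tab, not comma
      .schema(ratingsSchema)
      .csv(path)
      .as[Movie]
  }

  // Group by movie ID, count the rows in each group, most-rated first.
  def mostRated(movies: Dataset[Movie]): DataFrame =
    movies.groupBy("movieID").count().orderBy(desc("count"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("PopularMoviesSketch")
      .master("local[*]")
      .getOrCreate()
    // Assumed location of the MovieLens 100K ratings file.
    mostRated(loadMovies(spark, "data/ml-100k/u.data")).show(10)
    spark.stop()
  }
}
```

Splitting the load and the count into two small functions is just a convenience here, so the counting step can be tried on a few in-memory rows without the data file.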
So for example, movie ID 1 is the movie Toy Story, and it tells you the year it was released and a link to the IMDb data for it. And these ones and zeros correspond to what genres that movie is in. If you go down to movie ID 50, we'll see that our winner is actually Star Wars, the original one from 1977. Not too surprising. But yeah, we don't want to have to look those up by hand, right? So let's come back and talk about how to actually join that with our movie names data, and how to do that in an optimal way so that we can distribute that data to each node on our cluster. 31. [Activity] Use Broadcast Variables to Display Movie Names: So we need to make those results human-readable. Those movie IDs aren't really useful to us as human beings, because we don't know what movies they represent, so there's no intuition to be gained. Let's take care of that problem now. We want to display the movie names and not just the IDs, and to do that, we need to merge in information from that u.item file from the MovieLens dataset. Now, there are many ways of doing this. The most straightforward would be to just set up another dataset that maps IDs to names after parsing that u.item file, and then join that dataset with our ratings dataset by movie ID and join in those movie names that way. From a SQL standpoint, that's how you would approach this problem as well. But that's not the only way to do it. You can do that; in fact, I encourage you to give it a try if you want a little hands-on exercise to practice with. But we're gonna do it a different way, because doing a join and distributing a dataset across the cluster is a pretty heavyweight operation, and it's not totally necessary for this data, right? There aren't that many movies in the world. The table of movie IDs to movie names is not actually that large. 
We can fit all that in memory quite easily on a single computer, so we don't really need to distribute it as a dataset. We could just keep a table loaded up in the driver program and use that to map the IDs to movie names as we're printing them out. That would be fine. But there's another way to do it too: we could let Spark automatically forward that table to each executor when needed. So imagine, if you will, that we needed to have those movie names within the driver script itself and not just for the final output stage. There is a way, using what we call broadcast variables, to take an object, which can be a map of movie IDs to movie names, for example, and forward it automatically to every executor within our cluster to be used as needed. And if the table is massive, it might be important to use these broadcast variables, because that ensures that we only transfer it once to each executor and keep it there. Contrast that with a dataset, where the data might be getting tossed all over the place; that's a more complex operation. So by using a broadcast variable, we can make sure that that table, that map of IDs to names, is only sent across once to each executor, and it stays there until we explicitly get rid of it. With broadcast variables, we can broadcast any object to all the executors so that it's always there whenever it's needed. We just have to call broadcast on the SparkContext object in our driver script to ship off whatever data we want to each executor node. We just have to make sure that it will fit within their memory, obviously. And after that's been broadcast, we can use dot value on that broadcast variable to get that object back and refer to it within the code that's being distributed throughout the entire cluster. And we can use those broadcasted objects however we want to. 
You can use them within map functions on RDDs, or in our case, we want to apply them to datasets, which means we're going to set up a user-defined function so we can use it in more of a SQL-esque manner. So with that, let's dive into the code and see how broadcast variables work. Let's open up the PopularMoviesNicerDataset script and see what that looks like. Again, the more straightforward way of handling this problem would be to load up a dataset of movie IDs to movie names and join that in with a SQL-style operation. But we're gonna do it a little bit differently, because we know that this is a relatively small dataset and we can get away with just shipping it off to each individual executor as our script starts up and referring to it locally within each executor node. So it's just another way of doing it. Sometimes broadcast variables can be useful that way. You don't really see them used that often in practice, to be honest, but it is a tool in your tool chest that I think you should know about. And this also gives us an opportunity to illustrate the use of user-defined functions on a dataset, which is also good to know. So even though this is a bit of a contrived example, it illustrates a couple of important points. Let's dive into the code. We import all the stuff we need, and we set up a full case class for the Movies data type that includes the user ID, movie ID, rating, and timestamp that we'll import from the u.data file. We then create a loadMovieNames function. The purpose of this function is just to create a Scala map that maps movie IDs to movie names. This is just straight-up Scala code, not Spark code. All it's doing is loading up that u.item file using the correct character encoding, by the way, which is a little bit obscure. And we return a map of integers to strings that maps movie IDs to movie names. 
We just load up the file straight from disk, iterate through each line in that file, and split it up based on the pipe character that delimits it. We do a little sanity check to make sure that it's a valid line, and if so, we add an entry to our movie names map that maps the first field, which is the movie ID, to the second field, which is the movie name. We close the file when we're done and return the resulting movie names map. So again, this is not Spark code. This is just straight-up Scala code, reading that data from the local disk, wherever we're executing our driver script from. So let's get into the actual script itself. As usual, we start by setting our log level and setting up a SparkSession object. Now we're going to call that loadMovieNames function to load up that map of movie IDs to movie names, and we're going to call broadcast on our SparkContext interface. We're going to call the resulting broadcast variable nameDict, short for name dictionary. Makes sense, right? So at this point, that map has been loaded up and it has been sent out to all of the executors that are going to be running our Spark application. It will be available to every executor locally for fast lookups of movie IDs to movie names. All right, now we get into the Spark code itself. We're going to be loading up our movies data with the following schema. We load up the data file specifying the tab separator, using the schema that we're providing. Again, this is the exact same code as in the previous example. Then we force it into the Movies case class to get the full contents of that movie data. All right, moving on, we then get the number of reviews per movie ID, just like we did before, with a simple groupBy statement. We group by the movie ID column and count up how many times each individual movie ID occurs. So at this point we have what we had before. 
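The plain-Scala name loader described at the start of this walkthrough might look something like the sketch below. The object and function names are illustrative, and the ISO-8859-1 encoding is an assumption about the u.item file rather than something stated in the lecture.

```scala
import scala.io.Source

object MovieNamesSketch {
  // Turns "id|title|..." lines into an ID -> title map.
  // Lines without at least two pipe-separated fields are skipped.
  def parseMovieNames(lines: Iterator[String]): Map[Int, String] =
    lines
      .map(_.split('|'))
      .collect { case fields if fields.length > 1 => fields(0).toInt -> fields(1) }
      .toMap

  // Reads the file from the local disk of the driver; no Spark involved.
  def loadMovieNames(path: String): Map[Int, String] = {
    // Assumed encoding; adjust if your copy of the file differs.
    val source = Source.fromFile(path, "ISO-8859-1")
    try parseMovieNames(source.getLines())
    finally source.close()
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines in the same pipe-delimited shape.
    val sample = Iterator("1|First Movie|extra", "2|Second Movie|extra", "not a valid line")
    println(parseMovieNames(sample))
  }
}
```

Keeping the parsing separate from the file handling means the sanity check can be exercised on a few strings, as the main method does.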
We have a movieCounts dataset that contains the movie ID and count for each individual movie ID. All right, so now we need a way to transform those movie IDs into movie names within our dataset. Again, this is a contrived example. You could have just as easily done this at the output stage, right? We could have just iterated through the resulting dataset that we got back and looked up those movie names on the driver script locally as we print the results. But we're trying to illustrate the use of UDFs and broadcast variables here, so we're going to do this distributed within the cluster itself. Every executor that's handling some subset of the movie IDs for this problem will individually be looking up the movie titles for the movie IDs it's responsible for. So, a little bit more scalable, right? To do that, we first need to set up a user-defined function, so that we can write a dataset command that creates a new column using this function to generate that column's data. Now, the syntax for this is a little weird, and we haven't actually seen it before. We're declaring what's called an anonymous function here, and in Scala, the syntax for that looks like this. We're basically defining an inline function. We're going to call it lookupName; that's the name of our function. We're saying that it will take an integer and return a string, and we set that equal to the following anonymous function. It takes in an integer named movieID and passes it into this block of code, which just returns the lookup in the name dictionary for that movie ID. Note the dot value: nameDict, again, is a broadcast variable. It's not the actual map. To get the map object back, we have to call dot value on it to retrieve the name dict object. So again, nameDict is the broadcast variable, and to get its contents, we need to call value on it. 
So at this point we have an actual Scala map to use, and we can just pass the movie ID to it to look up what movie name is associated with that movie ID. Next, we need to wrap that function with a UDF. You may note that up at the top here we explicitly imported, here it is, org.apache.spark.sql.functions.udf. That allows us to wrap that function as a user-defined function that we can then use in a SQL setting. So now that we have lookupNameUDF, we can use it, and here's where the magic happens. We'll say moviesWithNames; this is going to be our new dataset that has an additional column called movieTitle. To add that column, we just say movieCounts.withColumn. We'll call the new column movieTitle, and the contents of that column will be generated using lookupNameUDF, the user-defined function that we just defined. The parameter for it will come from the movieID column. The syntax here is to use the col function that comes from sql.functions to pass in the contents of a given column name, in this case movieID. And because we wrapped it with a UDF, it will accept that column and convert each value to the integer that the underlying function expects for you. So this is where it all happens. In a distributed manner across the entire cluster, it's going to use our user-defined function with our broadcast map to look up movie names from movie IDs and generate a new dataset called moviesWithNames. At that point, we can sort it, again across the entire cluster, based on the count field, and then just retrieve the results and show them. In this case, we're going to show the entire thing, because we might want to dig into all the details of it. So we pass in as a parameter the size of the entire list, the count converted to an Int, to get every single row of that dataset. 
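Putting the broadcast variable and the UDF together, a rough sketch of the steps just described could look like this. The object name, the withMovieTitles helper, and the sample data are made up for illustration, and the getOrElse fallback to "Unknown" is an addition that the lecture's version does not mention.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object BroadcastUdfSketch {
  // Adds a movieTitle column by looking each movieID up in a broadcast map.
  def withMovieTitles(spark: SparkSession,
                      movieCounts: DataFrame,
                      names: Map[Int, String]): DataFrame = {
    // Ship the map to the executors as a broadcast variable.
    val nameDict = spark.sparkContext.broadcast(names)

    // Plain Scala function; .value unwraps the broadcast to get the map back.
    // Falling back to "Unknown" for missing IDs is an assumption of this sketch.
    val lookupName: Int => String =
      (movieID: Int) => nameDict.value.getOrElse(movieID, "Unknown")

    // Wrap it so it can be applied to a column.
    val lookupNameUDF = udf(lookupName)

    movieCounts.withColumn("movieTitle", lookupNameUDF(col("movieID")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("BroadcastUdfSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up counts and names standing in for the real ratings and u.item data.
    val movieCounts = Seq((1, 3L), (2, 5L)).toDF("movieID", "count")
    val names = Map(1 -> "Movie One", 2 -> "Movie Two")

    val sorted = withMovieTitles(spark, movieCounts, names).sort("count")
    // Show every row, without truncating long titles.
    sorted.show(sorted.count().toInt, truncate = false)
    spark.stop()
  }
}
```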
And furthermore, we'll say truncate equals false to prevent it from truncating the contents of each individual row. Otherwise, it would chop off some of those movie names to fit within a given width. So let's kick this off and see what happens. And there we have it. We didn't do a descending sort in this case; it's actually just ascending, so the most popular movies will be at the end. And we got the same result: Star Wars is our top movie, followed by Contact, followed by Fargo, followed by Return of the Jedi. You can tell this is definitely a dated dataset. There are more current ones you can get from MovieLens if you want to experiment with them. So just keep in mind this is old data, and don't be surprised that your favorite blockbusters aren't showing up here. But yeah, it works. So again, a bit of a contrived example, but the point was to teach you about broadcast variables and user-defined functions. Again, you can often accomplish what you need to do with a broadcast variable by using a dataset instead, but sometimes a broadcast variable will be more efficient. So it's another tool in your tool chest, and now you've got it. Let's move on. 32. [Activity] Find the Most Popular Superhero in a Social Graph: All right, so in our next few lectures we're going to be looking at a pretty fun dataset. It's based on the Marvel comic book universe. Is it MAR-vel or mar-VEL? I never knew how to pronounce that. But anyway, we're talking about superheroes like Spider-Man and Thor and everybody in that whole world. And it turns out you can think of the world of superheroes as a bit of a social network. The dataset we're gonna work with is kind of interesting. It models superheroes as a network by looking at which other superheroes each hero appeared with in a single comic book. 
So if you had Spider-Man appear in the same comic book as Thor, then we would say there's a connection between Spider-Man and Thor, for example. It turns out there are a lot of these co-occurrences of heroes within the same comic book, and that leads to a really interesting network structure between all these heroes in the Marvel universe. So we have a pretty cool public dataset out there, and it contains two files that we're interested in. One is called Marvel-graph.txt. What we've done is assign an ID to every individual character that appears in these comic books. So the first number that appears on each line is the character that we're talking about, followed by a list of all the character IDs that character has appeared with in other comic books. I have no idea who character ID 4395 is, but let's just pretend that it's Spider-Man. So this would mean that Spider-Man appeared with characters 2237, 1767, 472, and so on and so forth. And then we can look up what names are associated with individual character IDs, or hero IDs if you wish, in the Marvel-names.txt file. That one is very straightforward: it's just an ID followed by the quotation-mark-enclosed name that corresponds to that hero ID. Now, one thing that can trip you up here is that a hero may span multiple lines in the Marvel-graph.txt file. Some superheroes appear so often, and have so many co-occurrences with other heroes, that they can't all fit on one line. In that case, we might have two or more lines that start with the same hero ID. So you can see how this might be interesting from a data processing standpoint: we need to combine those together somehow as we're processing it. Let's talk about how to do that. So our challenge is to find out who the most popular superhero is in the Marvel comic book world. 
And we're going to define popular by who appears with the most other superheroes in comic books. Who gets around the most in the comic book world, basically. So here's our strategy for solving this problem. We'll load up our dataset from the data file, and the first thing we'll do is split off that first number on each line, because that's the hero ID that the line is about. We then simply count up how many space-separated numbers altogether are in that line. We don't really care who the connections are; we're just counting them up, right? All we care about is how many connections that hero has. So we take the grand total of how many things are in that line, and if we subtract one to get rid of that initial hero ID that identifies who it's about, we get the number of connections to that hero ID that are represented by that line of data. With me so far? So we're just gonna count up how many other IDs are in that line for that hero ID, to get a count of how many connections are represented by that line of data. And as we said, we can have multiple lines for a given hero ID, which means we need to combine them together somehow. In SQL we can do a GROUP BY, and on our dataset we can do a groupBy command to combine those all together, and then add up all the connections for hero IDs that were split across lines. At that point, we can just sort by the total connections and pluck off the top result to get the most popular superhero. At the same time, we can have another dataset loaded up from the Marvel-names.txt file and just do a filter on that by the hero ID that we're interested in to get the name of our winner. So with that, let's go off to the code and see how it works. Back in our project, let's open up the MostPopularSuperheroDataset script and walk through what it's doing. 
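To make the counting strategy concrete before looking at any Spark code, here is a small worked example using ordinary Scala collections. The object name, function names, and sample lines are all made up; it only illustrates the arithmetic of count the fields, subtract one, then sum per hero ID.

```scala
object ConnectionCountSketch {
  // One line of the graph file -> (hero ID, number of other IDs on that line).
  def countConnections(line: String): (Int, Int) = {
    val fields = line.trim.split("\\s+")
    (fields(0).toInt, fields.length - 1) // minus one for the leading hero ID
  }

  // A hero can span several lines, so sum the per-line counts for each ID.
  def totalConnections(lines: Seq[String]): Map[Int, Int] =
    lines
      .map(countConnections)
      .groupBy { case (id, _) => id }
      .map { case (id, counts) => id -> counts.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Made-up sample: hero 1 is split across two lines (3 + 2 connections).
    val sample = Seq("1 2 3 4", "2 1", "1 5 6")
    val (topId, topCount) = totalConnections(sample).maxBy { case (_, n) => n }
    println(s"Hero $topId has the most connections: $topCount")
  }
}
```

With this sample, hero 1 ends up with 5 connections and hero 2 with 1, so the program prints "Hero 1 has the most connections: 5".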
We're going to follow that same strategy that we talked about in the slides. We'll start off by importing the stuff we need, as always, and declaring our object. We have two case classes here, because we have two datasets to load up. One is going to be the dataset of superhero names that maps superhero IDs, which are integers, to names, which are strings. And we also have a superhero dataset, which is just going to be a string, because we're not going to worry about structuring that data. All we care about is the actual raw line of input data and how many space-separated things are in it. So we don't need to get fancy here. We're just going to import the actual string of the line itself in its raw form. So we're not really making full use of what datasets can do here, but we don't need to. We're not doing fancy analytics; we're just counting up how many fields are in each row. With that, we'll kick things off. We set our log level and define our SparkSession, and we start out by creating our superhero names schema. This will allow us to load up that Marvel-names.txt file. Because it does not have a header row that we can infer the schema from, we have to tell it explicitly what the schema is. So we're going to say that we have two columns in that file. One is the ID and one is the name, and they will be separated by a space character. Even though we're using the CSV loader, we can use any separator we want, and in this case it's going to be the space character. And you might say at this point, but Frank, what if I have a space in my name? Well, the good thing about the CSV loader is that it's smart about quotation marks. Our names are in quotation marks in that file, so it will actually handle that, which is cool. All right, then we need to load up the Marvel-graph data file as well. We're just going to load that up directly as a superhero data type. 
So there's no need to infer a schema there, because we only have one column; we're just taking in each raw line of data. There's no transformation going on at all. We're just going to load that right up into the superhero dataset, which we're defining as just a plain old string. So here's where the action happens. We take that lines dataset that contains the raw data, and we add a new column called id. All it's going to contain is the very first entry on the line. We're using the split function from SQL on the value column, which is the default name of the column that was imported into the DataFrame, with the space separator. And we extract the very first element, element number 0, from that split operation. So all this is doing is saying: split up that line of raw data, pluck off the first result, and put that into a new column called id. Then we do something similar for everything else. We have another column called connections, which just takes the size of how many things exist in that row when split by spaces. Let's work from the inside out. We start with the value column, which is the default name of the column that the DataFrame loaded. That's all of the raw data. We split that up based on the space character, and then we call size to count up how many of those things exist. So after we split everything up by spaces, how many fields do we have? We then subtract one to take off the hero ID at the beginning of that line of data, and that is our count of how many connections are associated with this line of data. We put that number in our new connections column. Then we group by id. This allows us to combine multiple lines for the same hero ID. And we use our same old aggregation trick here: we sum up all the connections, so any multiple entries for a given hero ID get added together. 
And we'll give that new resulting sum total the alias of connections. So we're introducing a new sum total for each grouped-together hero ID, and we're going to call that connections. That's basically it. We now have a dataset that contains each unique hero ID, because we did a groupBy, and the total number of connections for that hero ID under a connections column. All we need to do now to find the most popular superhero is to sort them, and that's what we're doing here, sorting in descending order. Furthermore, we pluck off the very first result to get the most popular superhero. So now our mostPopular result contains one row, which is the most popular superhero, with one column for the ID and one for the number of connections that hero has. Next, we need to look up the name of the winner so it actually means something to us. So we say mostPopularName: we take that names dataset that we loaded up earlier and filter it based on the most popular ID. mostPopular(0) is going to be that first field, which is the hero ID, and we use that to filter our names dataset for the ID that we're looking for. That gives us back the one row for the ID that we care about. We then select the name column of that result and again pluck off the first result, in the unlikely situation that there's more than one row for the same hero ID in that dataset. At that point, we can just print out the results. Here we pluck off the first and only column of our mostPopularName result and print the name out using the string substitution format. And we say that is the most popular hero, with mostPopular(1), which plucks off the second field of the mostPopular result that we got, the number of connections that hero had. So let's kick it off and see what happens. Right-click, Run. Who do you think the most popular superhero in the Marvel universe is? The answer might surprise you. 
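The dataset pipeline described above can be sketched along these lines. The object, case class, and helper names are illustrative, the sample data is made up, and the trim call is an addition for robustness against stray leading or trailing spaces rather than something described in the lecture.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, size, split, sum, trim}

object SuperheroDatasetSketch {
  // Each raw line of the graph file lands in the default "value" column.
  final case class SuperHero(value: String)

  // id = first field; connections = number of fields minus one, summed per id.
  def connectionsPerHero(lines: Dataset[SuperHero]): DataFrame = {
    val fields = split(trim(col("value")), " ") // trim is this sketch's addition
    lines
      .withColumn("id", fields(0))
      .withColumn("connections", size(fields) - 1)
      .groupBy("id")
      .agg(sum("connections").alias("connections"))
  }

  // Sort descending, take the top row, then look its name up by id.
  def mostPopularName(connections: DataFrame, names: DataFrame): (String, Long) = {
    val mostPopular = connections.sort(col("connections").desc).first()
    val name = names
      .filter(col("id") === mostPopular(0))
      .select("name")
      .first()
      .getString(0)
    (name, mostPopular.getLong(1))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("SuperheroDatasetSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up stand-ins for the graph and names files.
    val lines = Seq("1 2 3 4", "2 1", "1 5 6").map(SuperHero(_)).toDS()
    val names = Seq((1, "HERO ONE"), (2, "HERO TWO")).toDF("id", "name")

    val (name, count) = mostPopularName(connectionsPerHero(lines), names)
    println(s"$name is the most popular superhero with $count co-appearances.")
    spark.stop()
  }
}
```

For the real files, the lines dataset would presumably come from reading the graph file as text, and the names from the CSV loader with a space separator and an explicit schema, as the walkthrough describes.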
It turns out it is Captain America. I didn't see that coming. But it turns out that of all the superheroes in the Marvel universe, Captain America appears most often in comic books with other characters. So he gets around. Go figure, who knew? So there it is; that's using datasets to solve this problem. I do want to direct your attention, however, to the RDD version of the script as well. You'll notice that we kind of had to do some convoluted stuff here to parse that data out, right? We had to add these new columns that did some fairly heavyweight manipulation, and we had to carry around that original line of data all along the way. So although you can do this using just dataset commands, that doesn't always mean it's the right thing to do. In fact, when you look at a lot of online examples of parsing data like this, where you need to do some real cleanup and rearranging of the data coming in, you'll often see that they start off by loading things using the RDD interfaces and then convert that RDD to a dataset to do further analysis on it. And if you look at the RDD version of this, that's the MostPopularSuperhero script here, you'll see it's actually a little bit simpler. Well, I wouldn't say simpler, but it's more straightforward in some ways, at least the parsing of the data. So for example, let's look at how we parse that Marvel-graph file in the RDD example. Here we're just loading that up into an RDD and then using a map function with countCoOccurrences. Let's look at what that does. Pretty straightforward stuff, right? We're just splitting up the line based on spaces and returning the very first element, the hero ID, along with elements.length minus 1. 
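For comparison, the RDD-style parsing can be sketched as below: a map with a small parsing function, followed by a reduceByKey to add up the counts for heroes that span several lines. The object name and the totalsByHero helper are illustrative, and the sample lines are made up.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SuperheroRddSketch {
  // Split on whitespace; return (hero ID, number of other IDs on the line).
  def countCoOccurrences(line: String): (Int, Int) = {
    val elements = line.split("\\s+")
    (elements(0).toInt, elements.length - 1)
  }

  // Parse every line, then add up the counts that share a hero ID.
  def totalsByHero(lines: RDD[String]): RDD[(Int, Int)] =
    lines.map(countCoOccurrences).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("SuperheroRddSketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up lines; a real run would presumably use sc.textFile on the graph file.
    val lines = spark.sparkContext.parallelize(Seq("1 2 3 4", "2 1", "1 5 6"))
    totalsByHero(lines).collect().foreach(println)
    spark.stop()
  }
}
```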
In many ways that's simpler and more straightforward than how we did it with datasets, where we had to do this kind of convoluted withColumn, actually two different withColumn commands, with all these nested SQL functions to get the same result. So arguably, this might actually be more efficient. And once we have that, we can just do a reduceByKey to get the grand total of friends by character. So again, sometimes RDDs can be simpler and even faster, but datasets are there for you too. If you do want to do more complicated data analysis on truly structured data, that's where datasets really shine and their power becomes attractive. But remember, although datasets and DataFrames are kind of the modern way of using Spark, they're not always the best way. Sometimes RDDs are going to be simpler. So you should still feel free to mix and match the two, because sometimes an RDD is the right answer and sometimes a dataset is the right answer. It's not always a dataset. But there you have it: the most popular superhero is Captain America. Did not see that coming. Up next, let's do some more complicated stuff with this dataset and extract some more interesting information out of it. 33. [Exercise] Find the Most Obscure Superheroes: Well, this seems like a good time to have you go off and write your own code for a bit and get some hands-on practice. Your challenge is going to be to find the most obscure superheroes, as opposed to the most popular superhero. And I like finding cheesy stock photo images sometimes. This superhero looks pretty obscure. Maybe it's Office Worker Man, I don't know. It's a little bit disturbing anyway. So the problem is, I want you to list the names of all the superheroes in our dataset that have only one connection. 
And this is going to be a little bit harder than it sounds. You might think you could just take the existing script you have for the most popular superhero and change a maximum to a minimum somewhere, or sort in a different order, and be done with it. But it's not that simple, because the code that I just gave you assumes that there's only one answer, a single most popular superhero. In this case, however, there are actually many superheroes in our dataset that have only a single connection, and you're going to list them all out by name. So that makes things a little bit more complicated. You're going to have to join a dataset at some point with the names dataset in order to print out all of those names together at once. I mean, you could do it after the fact, kind of the way we did it originally, and look up individual names as you print them out. But that's not going to be as efficient as just joining in the names at some point, to get them all in a distributed manner across the cluster. And for extra credit, see if you can compute the actual smallest number of connections instead of just assuming that it's one. If you can figure out how to use the min command to find the smallest number of connections in the entire dataset, instead of just hard-coding it to one, that would be the more principled thing to do, right? It turns out the answer is one; that is the smallest number of connections. There aren't any heroes with zero connections in the dataset; they were apparently filtered out already. So one is the magic number, but it would be better to compute that somehow, right? So here's the overall strategy that you'll want to use. Start with a copy of the MostPopularSuperheroDataset script that we just looked at. If you just copy that script in IntelliJ and then paste it into the package, that will give you an opportunity to rename the script as you copy it in. 
So just call it something like MostObscureSuperheroes or something like that. And that script can be largely unchanged up to the point where we build the connections dataset. That's going to give you this dataset that has every superhero and how many connections they have. And that's still going to be useful for this exercise. But from there on forward, it's going to be completely different. So at that point you're going to want to filter that down to find only the rows that have one connection. And then once you've filtered that down, you want to join those results with our names dataset to join in the names for each superhero in that filtered-down dataset. Now note, we're thinking a little bit about performance here, right? I could have joined in that names dataset before I filtered down to the rows that have just one connection. But that would force my cluster to do a lot more work than it has to, right? What's the point in joining in the names for all these superheroes that have more than one connection? I'm not going to display those. So even though Spark has a lot of automatic optimization, you still have to think a little bit about the results you want and the right order to do things in. All right, so once we have that, we just need to select the names column and show it. And that's pretty much it. And if you do want some extra hints here, here's a few snippets of code that might help you out. Again, if you want to filter a dataset for some column name equaling some value, it would look something like that: just the dataset name, .filter. Remember, the syntax is the dollar sign, then the column name in quotation marks, and then three equals signs and whatever value you want to filter against. Also, to join one dataset with another, this is just like a SQL join operation.
You could just say the dataset name, whatever it is, .join, and then specify the name of another dataset that shares a common column name that you can use to join those datasets together. So in our case, that will be the ID column. And then you say usingColumn equals whatever that common column's name is. And that will join in the fields of the second dataset with the first dataset where that common column matches up. And finally, if you do want the extra credit (I'm not actually keeping track of your score here, this is totally on the honor system), if you want to actually compute the minimum number of connections found in the dataset, it would look something like this. On the dataset, you would use the agg command, and use it to wrap the min command on the column name that you're interested in finding the minimum of. That will return back a dataset. But if you want to actually get that value back into your driver script as a Scala value, you need to first call .first to extract the first row of that result. And there will just be one row. And that will give you back a Row datatype. And that's not going to be useful in and of itself. So we're going to call getLong(0) to extract the first number from that row, which will be that minimum value. So the syntax there is a little bit messy, but that's what you have to do to actually get that number back from the dataset and into an actual value that you can then use in your filter operation. So good luck. Wow, if there was a Broccoli Man, I'm pretty sure he would be on that final list in your results. That is a super disturbing picture. Let's make this go away. Go write some code and we'll come back in the next lecture and I'll show you my solution.
34. Exercise Solution: Find the Most Obscure Superheroes: All right, I hope you had a go at that exercise, and here's my solution to the problem. Again, there are many ways of doing this.
So as long as you get the same results I did, that's what counts. There are many ways of going about this. Do think about how efficient your solution was, though. And we'll talk a little bit about that. So like I suggested, I just copied the most popular superhero script. To do that, I just right-clicked on it and said copy, or just Control-C. And then if you paste it into com.sundogsoftware.spark, that gives you an opportunity to duplicate that script with a new name. And in my case, I duplicated it with the name MostObscureSuperheroes.scala. So let's go ahead and take a look at that and walk through my changes. Most of the script is unchanged. All I did was change the name, like I said, and I also changed the app name, because why not? And like I said, it's going to be the same all the way down to constructing that connections dataset. So here's where we construct the dataset of all the superhero IDs and the number of connections they have. And that number-of-connections column is called connections. So at this point we have a dataset that has an ID and connections. So that's unchanged. We used that same information for finding the most popular superhero. But once we have that, we need to filter that down to the results that have the least number of connections. And I happen to know that the number is one, that the lowest number of connections in my dataset is one. But if I wanted to find that out algorithmically, I could do something like this. I could say val minConnectionCount, and that's going to be a Scala value that just holds some number, equals connections.agg. We just need that to wrap the min SQL function there: min of the column named connections. So at this point, we're saying connections.agg, min, column name connections, and that will give us back a dataset that has a single row in it, which contains one column, which contains the minimum value found in the connections column.
Now at this point we have a DataFrame still. And we actually want to return that back to our driver script as a value, right? So we first need to call .first to convert that dataset to a single row. And then once we have that Row object, we call getLong(0) to extract the first and only field in that row and convert that back to a long integer. So this little complicated expression here computes that minimum value as a dataset, brings it back to the driver script with the first command, and then calls getLong to extract the number from that row of the dataset, more specifically. So now minConnectionCount will be equal to one. All right, so now that we know what our minimum number of connections is, we can filter our connections dataset on it by saying connections.filter, dollar sign connections, equals, equals, equals (remember, three equals; kind of a weird quirk of the syntax here) minConnectionCount, which will be one. And if you did take the most straightforward approach here and assumed that it was one, you could have just said equals, equals, equals 1. And that would work too. So we're taking that filtered dataset and calling it minConnections. So the minConnections dataset now just contains the hero IDs that have one connection. And we still have the connections column hanging around here. Now we need to join in the names of those superheroes so we can actually display them and have human-readable results. And that's what's going on in this line here. We're saying minConnectionsWithNames equals minConnections.join with the names dataset, usingColumn ID. So you'll note that both the connections DataFrame and the names DataFrame have a field called ID. And we're going to use that to join these two datasets together. So for every row in our minConnections dataset, we're going to basically look up the ID on that row and join in the name from the names dataset that has that same ID.
Just like a SQL join command, if you're familiar with that. So at this point we're going to have a dataset that contains not only an ID and number of connections, but also a name column that we joined in from that names DataFrame. So now we can finally print out our results. So first of all, let's print out a little bit of a header, just because we want the results to look neat. You didn't have to do this, of course. We're going to say the following characters have only minConnectionCount connections. In this case that will be 1. And then we just say minConnectionsWithNames, .select name to just select that name column. I don't care about anything else for the final output. And .show to display that as the output of my driver script. Make sense? Pretty straightforward. And again, there's more than one way to do it. Do think about performance, though. This is a pretty good way of doing it. There might be a more efficient way of actually getting that minimum value without basically stopping the world here to figure out what that minimum value is at this point. Because it has to come back to my driver script, and I'm using that value in a subsequent section of my driver script, this whole section above here has to run as one step. It's going to stop on the cluster, wait for that result to come back for the minimum value, and then kick off the subsequent dataset operations after that. That might not be entirely optimal. But if there's a better way of doing that, I haven't figured it out. I think that's unavoidable in this case. But one thing I did do, like I said in the slides, was make sure that I only did that join operation once I had filtered down the connections to the final set of results that I wanted. Joins are expensive operations on Spark. So you definitely want to keep those as small and as limited as you can.
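As a rough sketch of the steps just described (not the course's actual script), here is the filter-then-join pattern on a tiny made-up dataset. The object name, the hero names, and the connection counts are all invented for illustration, and it assumes Spark SQL is on the classpath. The calls themselves (agg with min, first, getLong, filter with ===, and join with usingColumn) are the ones named in the lecture.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.min

object MostObscureSketch {
  // Returns the names of the heroes that have the smallest connection count.
  def obscureNames(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Hypothetical stand-ins for the real connections and names datasets.
    val connections = Seq((1, 5L), (2, 1L), (3, 1L), (4, 12L)).toDF("id", "connections")
    val names = Seq((1, "HERO A"), (2, "HERO B"), (3, "HERO C"), (4, "HERO D")).toDF("id", "name")

    // agg(min(...)) yields a one-row DataFrame; first() brings that Row back
    // to the driver, and getLong(0) pulls out its only field.
    val minConnectionCount = connections.agg(min("connections")).first().getLong(0)

    // Filter first, so the join only has to handle the rows we will display.
    val minConnections = connections.filter($"connections" === minConnectionCount)
    val minConnectionsWithNames = minConnections.join(names, usingColumn = "id")

    minConnectionsWithNames.select("name").as[String].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MostObscureSketch").master("local[*]").getOrCreate()
    obscureNames(spark).foreach(println)
    spark.stop()
  }
}
```

Note that the call to first() is the point where the driver has to wait on the cluster, which is the "stop the world" step mentioned above.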
So that was the method to my madness in doing that join after the filter operation instead of before it. And also note that I'm not iterating through the final results and doing a dataset lookup on the names dataset for every single name in the driver script. That would be even worse. So this is pretty optimal. So let's go ahead and run it and see what we get. MostObscureSuperheroes, run, off it goes. And here come our obscure heroes. The following characters have only one connection. And it's a pretty short list. Berserker II, Blare, Marvel Boy, Clumsy Foulup. You know, there's been a lot of Marvel superhero movies. Somehow I don't think we're going to see one about Clumsy Foulup. That's a pretty obscure character. I haven't heard of any of these guys. Have you? If you have, say so in the comments; I might be interested to hear about it. So yeah, when Marvel starts making movies about these characters, you know they've really hit the bottom of the barrel. All right, well, that was an interesting example there. So I hope you had some success with that exercise and got the same results I did. The order obviously doesn't have to be the same, but you should have the same list of results here in whatever you did. If not, go back and debug what you did and keep iterating on it until you get the right answer. It's good practice. And with that, let's move on.
35. Superhero Degrees of Separation: Introducing Breadth-First Search: So the purpose of this next exercise is to show you that Spark can often be used to parallelize problems that you might not have thought Spark could do. So a lot of times people use Spark for data analytics, and specifically for things that can be expressed as a SQL command. But it can do so much more if you just think creatively about it.
So as an example, let's figure out the degrees of separation between the superheroes in our superhero social network here. To recap what the idea of degrees of separation is, if you're not familiar with it: well, back in Hollywood, there used to be a concept called your Bacon number. And that was how many degrees of separation an actor was from Kevin Bacon. There was this legend that Kevin Bacon was so well-connected and appeared in so many different movies that almost every actor could trace themselves back to Kevin Bacon somehow. And it's true, actually. So how does that work? Let's put that in terms of our superhero data, right? So let's say that, for the sake of argument, Spider-Man appeared in the same comic book as the Incredible Hulk. And the Incredible Hulk in turn appeared in the same comic book as Thor. In that case, we would say that Spider-Man and Thor are two degrees of separation apart, right? Because if you look at Spider-Man's connections, Thor isn't in that immediate set of connections, but somebody that Spider-Man is connected to is connected to Thor. So that's two steps away from Spider-Man, or two degrees of separation. So that's an interesting thing to look at. It turns out that people are more closely connected than you might think. And it turns out that's also true in the world of comic books. So how would we do this? Well, there's a computer science algorithm called breadth-first search that solves this kind of problem for us. And this certainly isn't something where you would just write a SQL command and be done with it to find the answer, right? We want to figure out, for any given node, in this case any given superhero in this graph of connections, how many steps away that superhero is from any other superhero. So how would we do this using Apache Spark? Well, before we talk about the implementation details, let's talk about the algorithm. So to understand this example, you need to understand breadth-first search, or BFS.
This is a computer science algorithms problem. And if you are going to be interviewing for software engineer positions, it's a good thing to know, although it's not directly related to Spark. So if you don't follow the details here, don't get too hung up on it. It's not really relevant to Spark itself. It's just relevant for understanding this example of how Spark can be used to solve problems that you might not have thought Spark could be used for. So imagine this represents our network. Every circle here represents a superhero, and every line between those circles represents a connection between superheroes. So that means that they appeared in the same comic book together. And, you know, sometimes you have many connections coming off of a superhero, and sometimes there are many routes between two superheroes to look at. So the number inside each circle represents the distance of that superhero from whatever superhero we're measuring the degrees of separation from. And initially we don't know the answer, as we are just starting this algorithm. So everything is set to infinity. We assume that they're infinitely far away because we haven't computed it yet. So to start, we pick the superhero that we want to start with and who we want to compute the degrees of separation from. So let's say that's Spider-Man, and we'll call that S here in this example. So to kick off the algorithm, we color that initial node gray, meaning that we need to explore that node; we want to find out more about the connections from that node. And we put in the number 0 because that's the distance from the person we're starting with to that node. Spider-Man is 0 degrees of separation from Spider-Man because Spider-Man is Spider-Man. So what we do in the BFS algorithm is that we go looking for gray nodes in our graph. And for every gray node we find, we explore the connections coming off of that gray node.
And when we do, if the node that we're exploring is white, meaning that it has been unexplored previously, we will color it gray, meaning that we need to explore that node further on the next iteration. And we'll take the node that we just explored and color it black, indicating that we've already explored that node. We don't need to come back to it again. We increment the current count of distance. So it was 0, plus one is one. So now we know that R and W are one degree of separation from S. Make sense? So we started by exploring that initially gray node S and colored it black because we're done with it. And for all of its connections, if they were white, we make them gray now and put in the current count of distance from that initial node. So in the next pass, again, we look for the gray nodes and explore their connections. So let's start with W. We'll break out its connections. Those were both white originally, and now we're going to color them gray and increment the distance again. So now we're two degrees of separation from S. We'll do the same thing for the connection from S to R: explore its one connection to V and color it gray. And again, put a two in there because we're currently at the iteration of distance two from S. Now that we've explored R and W, we're going to color those black, meaning that we should not revisit those again in the future. Next, we'll start with T. Again, we need to go through all the gray nodes and explore their connections. Now T is connected to U, which was white, so we'll now color that gray and increment the distance count again, to three now. Now note that T is also connected to X and W, but those are not white. We're going to try to preserve the darkest color here as we go through. So we're not going to bother revisiting X again. It's already marked gray. We're definitely not going to visit W again because that's marked black. That means we've already done that one. We don't need to process that one at all.
So we can mark X as being explored now; we can mark that black now because we've already determined that we're done with it. And moving on, we also need to explore R's side; we haven't finished that one yet. So we'll go and check off V as being black now as well. That was gray originally. So now the next step is to just color that black, meaning that we're done exploring it. Now we just have two more gray nodes to explore here. And they don't have any newly unexplored nodes associated with them. So we're just going to color them black and be done with it. So the overall algorithm here, to recap: we start off by coloring everything white with an unknown distance. Our initial node is gray, and for each iteration, we look for gray nodes and explore their connections. And for every white node coming off that gray node, we will mark it as gray for future exploration. If it has already been colored gray or black, we leave it alone; we're always preserving the darkest color there as we go. And furthermore, we are preserving the lowest distance metric that we see in these nodes as we go. So we're never going to do something like go from U back to T and say that T is now a distance of four because we backtracked. That wouldn't make sense, right? So we always preserve that lower distance metric as we're traversing the graph. So it is kind of a hard thing to wrap your head around. And again, if you didn't quite follow that, don't worry about it too much. This is not a course about algorithms, it's a course about Spark. I'm just trying to convey how it's possible to use Spark to solve problems that maybe you wouldn't think Spark was suited for. You can actually parallelize this problem pretty well if we think about it in a creative way. So let's talk about that implementation in Spark next and make this happen on our Spark cluster.
36. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark: So let's talk about how to frame this as a Spark problem and actually distribute breadth-first search across our entire cluster, potentially. So the first thing we need to do is to represent our data coming in as these nodes that we talked about in the previous lecture. So for example, an input line might be in this format: the first number is some hero ID, followed by a list of the other heroes that hero has appeared with in other comic books. So we want to structure that in a way that represents one of these nodes in breadth-first search. So for example, we could structure this as the following tuple of tuples, if you will, where the first element is the hero ID that we're talking about, the one that's basically represented by this node, followed by a list of all the other hero IDs that are connected to this hero. So imagine a virtual line between 5983 and 1165, 3836, et cetera, et cetera. So these are all the connections that are connected to our hero. And that itself is within its own tuple, to keep that all self-contained. The next element in our tuple will represent the distance associated with this node from the character that we're measuring distance from. And initially, we set that to infinity. There's no way to really represent infinity here, so we'll just set it to a really large number, in this case 9999. 9999 will be our stand-in for infinity as the initial distance value. And we need to associate a color. Remember, our nodes could be white, gray, or black, where white represents an unexplored node, gray represents a node that needs to be explored, and black is a node that's already been explored. So initially every node is white. So that's an example of how we can represent these lines in a format that is consistent with our concept of nodes in a breadth-first search.
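To make that node format concrete, here is a minimal sketch in plain Scala, with no Spark needed. The type alias names follow the data types the lecture describes later (an array of integers, an integer, and a string); the object name, the Infinity constant, and the describe helper are made up for illustration, and the example node lists only the two connections spelled out above.

```scala
object BfsNodeSketch {
  // (connections, distance, color)
  type BFSData = (Array[Int], Int, String)
  // (heroID, data)
  type BFSNode = (Int, BFSData)

  // Stand-in for "infinitely far away", as in the lecture.
  val Infinity = 9999

  // Hero 5983 with two of its connections; the real input line lists more.
  val example: BFSNode = (5983, (Array(1165, 3836), Infinity, "WHITE"))

  // Take a node apart again with a nested tuple pattern.
  def describe(node: BFSNode): String = {
    val (heroID, (connections, distance, color)) = node
    s"hero $heroID: ${connections.length} connections, distance $distance, color $color"
  }

  def main(args: Array[String]): Unit =
    println(describe(example))
}
```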
So now, actually going through that breadth-first search algorithm can be thought of as a map and reduce operation. The mapping function will take care of converting those lines into the BFS node format that we just talked about. And that code might look like this. So we just need to split it up by spaces, convert the hero ID to an integer and the connections to integers, and build up an array of connections that are all the connections associated with that person. Set the initial color to white, and set the initial distance to 9999. And if the hero ID is the one we're starting from, that's our initial case: color that one gray with a distance of 0. But everything else should be white with a distance of infinity as our initial case. And we just return our tuple structure that we talked about, with the hero ID, and then furthermore the array of connections and the distance and the color. Now, it's maybe a good idea to talk about the history of distributed computing here a little bit. So note that we're using a MapReduce paradigm to sort of think about this problem. Where does that idea come from? So way back when Google was first dealing with large-scale data analytics and the need to process data that could not fit on a single machine, they invented MapReduce. And MapReduce was pretty much limited to map functions and reduce functions. That's what it did, right? And we've seen examples of mappers and reducers in previous examples in this course, but back then, that was pretty much the only trick that we had up our sleeves. Hadoop was built on top of the concepts that Google came up with. And Hadoop also has something called MapReduce. It's actually called MapReduce, one word, and it's an essential component of Hadoop itself. The problem is that MapReduce wasn't very smart about optimization, and that's where Spark came in. So Spark initially was just sort of a faster alternative to MapReduce.
So that's where the initial RDD interface of Spark came from. You'll notice that RDDs have a map function and a reduce function available to them. And that's why: because they were trying to replicate the functionality of Hadoop MapReduce. Now later on, over time, by the time we got to Spark version 2, Datasets and DataFrames had been introduced. DataFrames came first, actually. And that's when we started to say, hey, let's think about these in terms of SQL operations. SQL was kind of becoming the common language between data analytics software packages at that time. So even for things that were distributed and technically weren't suited to relational database operations, SQL was still becoming the common syntax between all these different platforms. So we saw this evolution from MapReduce, to Spark that was sort of built as an extension to make MapReduce better, to Spark becoming more of a SQL-style interface. But not everything is a SQL problem, right? So we still have that lower-level functionality of MapReduce available to us that we can use to solve more general-purpose problems such as this one. I can't write a SQL command that says compute the distance in a graph between any two nodes. But I can frame that as a MapReduce problem. So that's what we're going to do here. Again, it's about choosing the right tool for the job. Even though we can do this with a dataset, and I'll show you that, in this case using RDDs and their lower-level mapping and reducing functionality is going to be the more straightforward approach. Okay, off of my platform; let's talk about the implementation some more. So just like we walked through in the slides for breadth-first search, we're going to go through things iteratively. And for every step, every distance, we're going to look for all the gray nodes and expand those gray nodes out, color the nodes we're done with black, and just keep incrementing the distances as we go.
So you're going to see a big loop that just iteratively goes through that graph over and over and over again, looking for gray nodes and processing those gray nodes accordingly, based on the rules that we defined in the BFS algorithm. All right, so again, we're just framing this as a map and a reduce job. So every time we iterate through our algorithm, the mapper will increment the distance by one. And it's going to look for gray nodes. If it finds a gray node, it's going to look at all the connections of that gray node and color them gray for future exploration. And initially those new nodes will have no connections, because they need to be explored to build up those connections on the next iteration. The gray node it just processed will then be colored black. And it will copy the node itself back into the results to make sure that we don't lose it. The reducer's job will be to deal with all those cases where we have multiple paths going into the same node, many connections. So we need to combine together all the nodes that exist for the same hero ID. And if there is a case where there are multiple connections, we want to preserve the connection with the shortest distance and the darkest color. And as it's reducing things, it will preserve that list of connections from the original node to make sure we don't lose those on the next iteration. So that's how it all works from an algorithmic standpoint. But how do we know when we're done? So here's where we are going to introduce a new concept called the accumulator in Spark. So an accumulator is basically a shared counter across all the executors in your cluster. So if you need to have sort of a commonly held counter, if you will, or a flag of some sort, an accumulator can do that. It allows many executors to increment some shared variable across the entire cluster. So here's an example of the syntax. I could say hitCounter is a long accumulator named Hit Counter. And by default its initial value will be 0.
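Here is a small sketch of that accumulator idea in isolation. It assumes Spark core is on the classpath; the object name, the countHits helper, and the list of IDs are invented for illustration. The accumulator calls themselves (longAccumulator on the SparkContext, add, and value) are part of Spark's API.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HitCounterSketch {
  // Counts how many times `target` shows up, using an accumulator that
  // tasks on the executors add to and the driver reads afterwards.
  def countHits(sc: SparkContext, target: Int): Long = {
    val hitCounter = sc.longAccumulator("Hit Counter") // starts at 0

    // Made-up hero IDs; in the real job these would come out of the graph.
    val ids = sc.parallelize(Seq(1, 14, 7, 14, 3))

    // foreach is an action, so the adds below are actually executed.
    ids.foreach(id => if (id == target) hitCounter.add(1))

    hitCounter.value // read back on the driver
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HitCounterSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    println(s"Hits: ${countHits(sc, 14)}")
    sc.stop()
  }
}
```

One thing to keep in mind with this design: updates made inside a transformation such as map or flatMap only land once some action forces that transformation to run, so the driver should check the accumulator after an action has completed.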
So as we're going through our algorithm, our iterations are going to be distributed throughout our entire cluster. And any individual executor might hit upon the character that we're interested in. So when that happens, that executor needs to somehow signal back to the driver script as a whole: hey, we're done. We actually got a hit here. So we can increment the hit counter accumulator in that case to indicate that it actually found the character that we're trying to find the distance to from our original character. So when we're done with each iteration, we just check if the hit counter is greater than zero. And if it is, we know that we're done. And sometimes we'll get a hit from multiple directions, and the hit counter might be more than one. But that's how we know that we're done. Those executors can signal back to the script, hey, I got a hit. And it can keep track of how many hits by having this shared accumulator across all the executors. So that's a lot to wrap your head around. Let's go off to the code and see what it looks like.
37. [Activity] Superhero Degrees of Separation: Review the code, and run it!: So let's dive into the code. Open up DegreesOfSeparation, not the dataset one yet, just DegreesOfSeparation. And we'll walk through this code. Now again, the point of this exercise is just to show you that you can do complex things with Spark; you're not just limited to SQL-style operations. So if you do want to do, like, traditional data analytics, great, use SQL commands. But for things like this, you might need to get down to the lower level of Spark and go back to the RDD interface. And that's okay; that's available to you as well. And you know, that sort of creative thinking is really what companies like Amazon and Google and Apple are going to be paying people the big bucks for, right? It doesn't take someone really special to write a SQL command and figure out how many website hits you had in the past hour.
But it does take someone special to figure out how to take a problem that has never been solved before using Apache Spark and solve it. And here's an example of doing that. So if you don't follow the code here, if you don't follow the algorithm, don't get too stressed out about that. That's not really the point of this activity. The point is just to show you that it is possible to do complicated things in Spark, and parallelize them across a cluster, that you may not have thought possible. In this case, we're sort of thinking more in a MapReduce paradigm than in a SQL paradigm. And that's okay. So diving in, we import all the stuff, we create our object, great. Now the first thing we do is define what character we're starting from and what character we're looking for. So the purpose of this script is going to be to find the degrees of separation between Spider-Man, which is character ID 5306, it turns out, and the character ADAM 3,031, who is frankly the most obscure character I could find in the dataset. At least, I never heard of them. I have no idea who that is. And his character ID is 14. So what we're trying to do is find the degrees of separation between Spider-Man and ADAM 3,031 here. And we talked about using an accumulator called hitCounter. And that's going to be how our executors signal back to say, hey, I actually found ADAM 3,031 while traversing the graph from Spider-Man. So that's going to be sort of our shared counter that says, hey, I got a connection here. Next we set up our datatypes here. So we have a BFSData type that is an array of integers and an integer and a string; that will represent a list of connections, the distance, and the color associated with the node. And the node itself will have the hero ID associated with that node and the data associated with it. So a BFSNode is going to represent one of those circles in our slides that represents a certain hero ID, and the connections that that hero has, the distance from Spider-Man.
And finally, the color of that node, which is going to be white, gray, or black. Let's skip down to the main function and work backwards from there. So down here somewhere, we have a main function. We start off by setting the log level and we set up a new SparkContext. We're not using a SparkSession because we're not using the dataset interface in this example; we're just going to be using RDDs. And for that we just need a SparkContext. We initialize our accumulator and give it a name. And the first thing we do is call createStartingRdd. So this is creating those initial conditions of our graph. So let's see what that does. Where is createStartingRdd? Very simple, right? So all we're doing here is loading up Marvel-graph.txt into a raw input file RDD. And then we call map with convertToBFS to convert that data to BFS node format. Let's look at convertToBFS. It splits the line up into individual fields, plucks off the first one and makes that the hero ID, then plucks off all the subsequent fields here and populates an array of integers, which is going to represent all the connections from that hero to other heroes. We will set its default color to white and its default distance to 9999, which is our representation of infinity, unless you're Spider-Man, in which case we color you gray with a distance of 0, meaning that we want to start by exploring the Spider-Man node, and the degrees of separation from Spider-Man to himself is 0. We then return back that BFS node structure that we talked about, which is a hero ID and then a tuple containing the array of connections, the current distance, and the current color associated with every node. So initially everything is going to have a white color, a distance of 9999, and all the connections associated with each hero ID read in from that text file, except for Spider-Man, who's going to be gray with a distance of 0. All right, so we've got that initial case there, set up. What happens next?
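Reconstructed from that walkthrough, a parsing function along these lines might look like the sketch below. This is not the course's actual file, just a plain-Scala approximation written from the description; the object name and the sample input lines are made up, while the start character ID 5306 is the one given in the lecture.

```scala
object ConvertToBfsSketch {
  // (connections, distance, color) and (heroID, data), as described above.
  type BFSData = (Array[Int], Int, String)
  type BFSNode = (Int, BFSData)

  val startCharacterID = 5306 // Spider-Man, per the lecture

  // Turns one whitespace-separated input line into a BFS node.
  def convertToBFS(line: String): BFSNode = {
    val fields = line.trim.split("\\s+")
    val heroID = fields(0).toInt
    // Every field after the first is a connected hero ID.
    val connections = fields.drop(1).map(_.toInt)
    // The starting hero is gray at distance 0; everyone else starts
    // white at the stand-in for infinity.
    val (color, distance) =
      if (heroID == startCharacterID) ("GRAY", 0) else ("WHITE", 9999)
    (heroID, (connections, distance, color))
  }

  def main(args: Array[String]): Unit = {
    val (id, (conns, dist, color)) = convertToBFS("5306 14 27 99")
    println(s"$id -> ${conns.mkString(",")} distance=$dist color=$color")
  }
}
```

In the Spark version, a function like this would be passed to map on the RDD of input lines, so every line is converted in parallel.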
So now we go through this iteration, like we said, and we picked ten as an arbitrary upper bound. It turns out you never have anyone who's actually ten degrees of separation apart in this particular dataset, so that's more than enough. We print out how many iterations we've gone through. So this represents the distance that we're looking at from Spider-Man at this step. And the first thing we do is the map operation. So we call flatMap with the bfsMap function on that iterationRdd. We started off by setting iterationRdd to that initial state. We then call flatMap on iterationRdd. And flatMap, as you may recall, has the potential to create new nodes, right? So let's see what's happening in bfsMap. There's bfsMap. All right, so basically we extract the character ID and the associated data, and we break out that data as well to extract the array of connections, the distance, and the color from the node that we're mapping. Now since this is a flatMap, we can return an array of BFS nodes, more than one if we want, to add to our new RDD. And as we're going, we're looking for gray nodes every time, right? The gray nodes are the ones that need to be explored. So if we encounter a gray node, we process it. We explore all of its connections and we create a new node for each connection. So we have a new character ID, a new distance, which is the distance plus one from the node that we started from, and the new color will be gray. And if that is actually the character we're looking for, if this turns out to be ADAM 3,031, then we're going to increment our hit counter and say, hey, we found the guy, we're done. We actually found a node where this person is connected to Spider-Man. Then we create a new node based on the new distance and new color that we computed and add that to our list of results. So we've created all these new gray nodes that we need to explore from the connections of the gray node that we just found.
And we take the gray node that we just explored and color it black, indicating that we're done with it. Okay. We just return the results and we're done. So again, if you don't understand the BFS algorithm, don't get too hung up on it. That's not really the point of this course, but I just wanted to show you what you can do. All right, so we've done the map operation. We then print out our progress as we go and we say we've processed however many values so far. And if we actually did get a hit, we're going to say, hey, we found the character we're looking for, and here is the actual total of how many hits we got from that iteration. It's possible that more than one executor was actually processing that character at the same time, and they might have given you multiple hits coming from different directions, and that's okay. We've got the grand total printed out here if so. Now before we iterate further, we need to combine data together. We might have generated new nodes for the same character ID as we went. As we explored all those connections, we might have branched out in multiple different directions and come back to nodes that we already processed, for example. So to handle that case, we need to do a reduceByKey operation. The reduceByKey will pull together all of the new nodes that we might have created for the same character ID. And the bfsReduce function's job in life is just to preserve the minimum distance and the darkest color in that case. So that's all that's going on here. I'm not going to go through the code here in much detail because it's not important. But that's the purpose of this function: just to handle those cases where we have more than one connection coming into the same node, and in that case, preserve the minimum distance and the darkest color associated with that particular node. So yeah, let's see if it actually works. Let's right-click on DegreesOfSeparation and run it. Pretty complicated job, right?
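If the map and reduce steps are hard to picture, the following sketch runs one BFS iteration on ordinary Scala collections. It is not the course script: a plain counter stands in for the Spark accumulator, groupBy plus reduce stands in for reduceByKey, and the four-node graph is invented for illustration. The rules themselves (expand gray nodes, color them black, keep the minimum distance and darkest color) follow the description above.

```scala
import scala.collection.mutable.ArrayBuffer

object BfsStep {
  type BFSData = (Array[Int], Int, String) // (connections, distance, color)
  type BFSNode = (Int, BFSData)            // (characterID, data)

  val targetCharacterID = 14
  // Stand-in for Spark's accumulator; fine on a single machine only.
  var hitCounter = 0L

  // Expand a gray node into new gray nodes for its connections, then mark it black.
  def bfsMap(node: BFSNode): Array[BFSNode] = {
    val (characterID, (connections, distance, color)) = node
    val results = ArrayBuffer[BFSNode]()
    var newColor = color
    if (color == "GRAY") {
      for (connection <- connections) {
        if (connection == targetCharacterID) hitCounter += 1
        results += ((connection, (Array[Int](), distance + 1, "GRAY")))
      }
      newColor = "BLACK" // this node has now been explored
    }
    results += ((characterID, (connections, distance, newColor)))
    results.toArray
  }

  private val darkness = Map("WHITE" -> 0, "GRAY" -> 1, "BLACK" -> 2)

  // Merge two records for the same character: keep every known connection,
  // the minimum distance, and the darkest color.
  def bfsReduce(a: BFSData, b: BFSData): BFSData = {
    val connections = (a._1 ++ b._1).distinct
    val distance = math.min(a._2, b._2)
    val color = if (darkness(a._3) >= darkness(b._3)) a._3 else b._3
    (connections, distance, color)
  }

  // One iteration: flatMap, then combine by key (a local stand-in for reduceByKey).
  def step(nodes: Seq[BFSNode]): Seq[BFSNode] =
    nodes.flatMap(n => bfsMap(n).toSeq)
      .groupBy(_._1)
      .map { case (id, group) => (id, group.map(_._2).reduce(bfsReduce)) }
      .toSeq

  // Invented graph: 5306 -> 1 -> 14, so the target is two steps away.
  val sampleGraph: Seq[BFSNode] = Seq(
    (5306, (Array(1, 2), 0, "GRAY")),
    (1, (Array(14), 9999, "WHITE")),
    (2, (Array[Int](), 9999, "WHITE")),
    (14, (Array(1), 9999, "WHITE"))
  )

  def main(args: Array[String]): Unit = {
    var nodes = sampleGraph
    var iteration = 0
    while (hitCounter == 0 && iteration < 10) {
      iteration += 1
      nodes = step(nodes)
    }
    println(s"Iterations run: $iteration, hits: $hitCounter")
  }
}
```

The shared mutable counter is the one piece that would not survive on a cluster, which is why the lecture uses an accumulator for that job.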
But Spark can make short work of it. Here we go. Already done. So, we actually processed 218,067 connections there, but we found that even this obscure character is only two degrees of separation from Spider-Man. Well, it turns out that even fictitious superheroes in comic books are very well connected. And this actually holds true in the real world. It turns out if you look at a real social network like LinkedIn, or IMDb's database of connections of actors who acted together in the same movie, you get similar results. People are way more connected than you might think. It's kind of interesting. So look at that. It worked. Pretty cool. Again, this is just an example of how, by using the lower-level functionality of Spark, you can sometimes solve problems that you wouldn't think Spark was necessarily suited for. It's not just about solving SQL problems. You can do more stuff with it than that. So remember you have that power at your disposal. That's not to say that you couldn't have framed this in terms of datasets and SQL operations. You can, and we've done that. So let's briefly look at the dataset version of this script. And it does some pretty weird stuff. I mean, trying to do MapReduce without using MapReduce: it's possible, but it's not going to be efficient. So if we look at this, I'm not going to go into too much detail on the script here, but you can see that we're doing some pretty convoluted stuff to do the same things that we did in the RDD script. Even creating those initial conditions here, just constructing that node structure, ends up being this complicated operation of withColumn and SQL functions to try to replicate what we did using, you know, straight-up Scala code in our map and reduce functions. And we have all these special cases for our starting character. We basically have to do these filters and search through the entire dataset looking for that special case, instead of just handling it inline.
As we explore each node, looking for gray nodes, we have to do a filter operation and everything to extract those gray nodes, select the information we need, filter looking for the target character ID, and handle that specially. So it does the same thing. Wow, that's complicated code to preserve the darkest color. To be fair, it was pretty complicated in our RDD code as well, but that didn't involve doing a join operation. Good Lord. And all this really fancy SQL is probably going to be pretty heavyweight operations. So even though we can, if we think about this hard enough, frame it as a SQL problem, it's not the most efficient way to think about it. And just to prove my point, let's run this. And we'll sort of count how many seconds it takes for this version of the script to run once Spark spins up. So here we go. 1, 2, 3, 4, 5, 6, 7. It took about seven seconds to get that answer there. We got the same answer, so that's good. It worked, but it took seven seconds to get that answer. In contrast, let's run the RDD version of that. One, two... not even three seconds, right? So that was actually more than twice as fast as the dataset version. So again, it's about choosing the right tool for the job. Datasets are awesome at handling sort of traditional data analytics problems. If you can frame your problem in terms of a SQL command easily, a dataset probably will give you the best performance. But if you find yourself having to really stand on your head to try to frame what you're trying to do in terms of SQL commands, it's not going to end well, right? And sometimes RDDs are the better tool. So in this case, by going to a lower level, using Spark's lower-level RDD interface, we've gotten much better performance than by using the dataset API.
But if we were really doing something that was well suited to a SQL-style command, implementing that using datasets, and their functions that look a lot like SQL commands, allows more optimizations and would actually be faster for problems like that. And in the real world, you know, that's probably the more common problem you'll be given with Apache Spark. Usually you are trying to do things like answer questions about log data coming in. How many errors did I get in the past month, or whatever it is? For things like that, datasets are awesome. But if you're trying to do something at a lower level, remember RDDs are there for you too, and you have that functionality available to you. 38. Item-Based Collaborative Filtering in Spark, cache(), and persist(): So as another interesting example of complicated things you can do with Spark that you might not have thought you could do with Spark, let's talk about item-based collaborative filtering. This is basically a recommender engine algorithm. It's the way that Amazon used to do things like "people who bought this also bought", or recommend things to you based on your past purchases. The high-level idea is that we can take a look at the stuff that you've bought in the past, look for similar items to the stuff you bought, and then recommend those items to you as things you might be interested in. And we're going to do this in the context of movies using the MovieLens dataset. So the fundamental thing we need is some measurement of movies that are similar to each other based on user ratings. So that's what we're going to do here. And while we're at it, we're also going to introduce the concept of caching your datasets to improve performance. So, similar movies. This is a little screenshot, an older one, from the GroupLens website. They actually have a UI where you can rate movies, tell them explicitly the movies that you've liked, and it will recommend new movies to you based on those ratings. So one way to do it is to take a look at all the movies that you've liked, and for all of those movies, look at the similar movies to those movies.
And those similarities will be based on other users who like the same movies that you like. Okay? So the basic algorithm here is that we're going to start by finding every pair of movies that were watched by the same person. Our objective here is to find all the movies that are similar to every other movie. Okay? To do that, we're going to first break out every pair of movies that were watched by the same person. So if user A and user B both watched the same two movies and liked them, those two movies would be a pair. We can then take together all the users that liked those pairs of movies and measure the similarity of their ratings across all of them who watched both of those movies. So based on every user who watched that pair of movies and liked it, how similar are those movies in terms of ratings? Then we can group those all together by the movie ID and sort them based on their similarity strength, which is something that we compute based on those ratings. And then we have a sorted list of movies that are similar to every other movie, just by that simple algorithm. So this is just one way to do it, mind you. There are many other techniques out there as well, but this one's simple, and sometimes simple is better. It produces good results if you have enough data. Let me walk you through the algorithm in more of a graphical manner here. So imagine that Ann here watched both Star Wars and The Empire Strikes Back, which is the sequel to Star Wars, and she liked them both. And then Bob here, the guy with a mohawk, also watched Star Wars and The Empire Strikes Back and liked them both as well. So based on this information, we can say that we have a pair of movies here, Star Wars and The Empire Strikes Back, and they had these two users in common who both liked them. So for that pair, we know that user Ann and user Bob both liked them, and let's imagine they both gave them five stars. So based on that, we can compute that these are very similar movies based on that user rating behavior.
Of the three people in our dataset, two of them loved this pair of movies, right? So that would yield a very good similarity score, however we might choose to measure that. Now let's imagine Ted here at the bottom comes along and he only watched The Empire Strikes Back. Maybe he just moved here and came from a country where Star Wars isn't a big thing. That's okay. But somehow he doesn't know that the first movie, Star Wars: A New Hope, actually existed. So how do we recommend to him what else he might be interested in watching based on the information that we have here? Well, we can say, okay, we know, Ted, you liked The Empire Strikes Back. And we know from our previous computation that the most similar movie to The Empire Strikes Back is Star Wars: A New Hope, the first movie. And again, that similarity was based on the rating behavior of the other users in our dataset. So that's it at a high level. Basically, we can recommend movies that are similar to the movies that you've loved based on the rating behavior of the other people in our dataset. And that's item-based collaborative filtering in a nutshell. So ultimately that recommends a good movie to Ted that he maybe didn't know about before. Now we're going to focus in this activity on just the movie similarity portion of it. So we're not going to actually be doing full-blown recommendations for people, but we are going to be computing that dataset of movie similarities: what movies are similar to other movies based on user behavior. That's kind of the hard part, right? Because once you have that data, you can make recommendations to people very quickly based on the stuff that they've liked before. So you might say to yourself, how the heck can we do this in Spark? Spark is kind of built around a SQL paradigm these days, or maybe MapReduce at best. How do you frame this as a problem that fits into that framework and something you can parallelize? Well, you can; you just have to think about it creatively.
And again, you know, being able to think about these problems in a creative manner is what they pay you the big bucks for. So it's an example of how you can use this technology to solve a problem you might not have thought it was suited for. Here's the general algorithm that we're going to use in our Spark example. We'll start by selecting the user IDs, movie IDs, and ratings. In this example, we don't really care about the timestamps, although you could imagine a world where you use that information as well. We then find every movie pair that was rated by the same user. And that might sound like a tricky thing to do, but it turns out a self-join operation, which is very much a database kind of thing to do, will give you exactly that. Basically you take the set of user, movie, and rating columns and join that on itself based on the user ID. And that will give you every possible pair of movies that were rated by the same person. So at that point we can reframe our self-joined dataset to have movie1, movie2, rating1, and rating2 columns that will tell us the movie pairs and the ratings associated with each. And again, at this point, this will be done for every unique user who rated those two movies together. We can then go through and filter out, while we're at it, any duplicate pairs, because a pairing of movie1 to movie2 is the same thing as movie2 to movie1. So we'll have some heuristic in there to make sure that we only capture one of those two variations. Once we have that, we can go through every user that rated that pair of movies and compute the cosine similarity score based on that information. I don't want to get into the math of that because it's not really relevant to Apache Spark development. But if you're curious about how cosine similarity works, there is another course that I have on recommender systems that gets into more depth on it.
But at a high level, if you take a look at all the user ratings for a movie, you can think of that as sort of a multi-dimensional vector. And a cosine similarity score is basically a measure of the angle (strictly speaking, the cosine of the angle) between the two vectors for two different movies, based on the user ratings that they have in common. So once we have that, we can group everything by the movie1, movie2 pairs and make sure that we aggregate everything together for a given movie pair, computing that cosine similarity across all the users who rated that pair of movies. So at this point, we have a dataset that contains every possible pairing of movies and the similarity score between those pairs of movies. We can then just filter and sort and display the results. So we can group those together by each individual movie ID, filter out things that are below some sort of similarity score threshold to make sure that we're only capturing strong similarities, sort them by that similarity score, and display the final results. And that will ultimately give us the similarity score from any given movie to any other movie. So before we dive into the code, let me introduce the idea of caching datasets, because this is going to come up in this example. Oftentimes when you're doing complicated things like this, you end up querying or using a dataset more than once. And if you find yourself doing that, you should explicitly cache it if you can. Otherwise, there's a chance that Spark might end up reevaluating that entire dataset all over again a second time, just to do that second operation on it. If you want to make sure that doesn't happen, you can use .cache() on a dataset after you're done computing it to make sure that it sticks around in memory, so any subsequent operations or analysis that you do on that dataset happen as quickly as possible. There's also .persist(). Persist will optionally let you cache that not just in memory but also to disk.
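As a rough illustration of the two calls, here is a small sketch that caches one dataset and persists another with an explicit storage level. It assumes the spark-sql library is on the classpath and runs with a local master; the data, column names, and object name are made up. One detail worth checking against the Spark documentation for your version: as far as I know, cache() on an RDD is shorthand for persist(StorageLevel.MEMORY_ONLY), while for Datasets the default storage level is MEMORY_AND_DISK.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingSketch {
  // Returns how many made-up similarity rows score above 0.95.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    // Invented rows standing in for an expensive intermediate result.
    val scores = Seq((1, 2, 0.98), (1, 3, 0.91), (2, 3, 0.99))
      .toDF("movie1", "movie2", "score")

    // cache() marks the dataset to be kept once the first action computes it.
    val strong = scores.filter($"score" > 0.95).cache()
    val howMany = strong.count() // first action: computes and caches
    strong.show()                // second action: can reuse the cached rows

    // persist() lets you choose the storage level explicitly.
    val spillable = scores.persist(StorageLevel.MEMORY_AND_DISK)
    spillable.count()

    // Release the storage when the data is no longer needed.
    strong.unpersist()
    spillable.unpersist()
    howMany
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CachingSketch")
      .master("local[*]")
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

Caching only pays off when a dataset really is reused; for a result that is consumed once, it just occupies memory.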
So if a node fails or something, you can recover from that point and just pick up where you left off. So cache will be in memory; persist can optionally be persisted to disk as well. And with that, let's jump off to the code and see what it looks like. 39. [Activity] Running the Similar Movies Script using Spark's Cluster Manager: All right, let's jump into the code here for computing movie similarities using Spark. So open up the movie similarities dataset script here. And this is a case where using a dataset will outperform using an RDD. The most complicated part of this whole algorithm is that self-join operation, where we try to find every unique pair of movies that were rated by the same person. And that is a very SQL kind of thing to do, right? A self-join operation. So this actually fits well into the ideas of using a dataset in Spark SQL. It is possible to do this with RDDs as well, but in this case, datasets do outperform RDDs, which is usually the case. So we're going to focus on datasets for this one. If you are morbidly curious about how to do this with RDDs, there is a movie similarities script available here as well that just does it using RDDs instead. But let's focus on datasets. So what do we have here? Well, we start off by importing all the stuff we need, as usual. And we declare a movie similarities dataset object that contains several case classes. These define the various formats that we're going to read in and use for the datasets along the way throughout our algorithm. So we have a movies dataset that is just mirroring the format of the raw data that we're reading in. It consists of a user ID integer, a movie ID integer, a rating integer, and a long timestamp. We're also going to load up a movie names dataset so we can very easily display the human-readable name of a given movie ID. That's just going to map movie IDs to movie titles.
We will also have a movie pairs dataset along the way, and this embodies a unique pair of movies. This is sort of the output of that self-join operation, where we get every pair of movies that was rated by the same person. And that will consist of both the movie1 and movie2 movie IDs and the ratings associated with each of those movies from a given user. So a movie pairs row will consist of a pair of movies and their ratings by a single user who watched them both. Got it? And then we'll also have a movie pairs similarity dataset at some point that contains a pair of movies, movie1 and movie2, a similarity score in double-precision format, and the number of pairs that were involved in computing that similarity score. So that's going to reflect how many users actually watched this pair of movies and rated them together. Let's skip ahead to the main function and work from there. We start by setting our log level and creating a SparkSession object, as usual. We then create a movie names schema, which will be used later on for structuring our dataset to look up movie titles from movie IDs. It just says our movie ID will be an integer, and it will have a movie title for every movie ID that is a string. We will then declare a schema for the movie ratings data itself as well. That's going to have user ID, movie ID, rating, and timestamp, in the structure that we've seen before. We'll then load up those movie names. As a first step, we import spark.implicits, because we're loading up a DataFrame initially and then converting that to a dataset. So to implicitly infer the schema of a DataFrame, even though we're providing it with one, you still need to use spark.implicits for that step. So we tell it that this particular data file, u.item, is separated by a pipe delimiter. It is in the ISO-8859-1 character set.
And it uses the movie names schema that we're explicitly providing, because there is no header row in this file that we can use to infer it from. And when we're done with that, we will explicitly convert it to a dataset using the movie names case class for even better performance under Spark. We then load up the movie ratings data itself, again as a dataset. Same idea here. This one's actually tab-separated, using the movies schema that we defined above. And we will convert that as well to a dataset using the movies case class. So at this point we have a movie names dataset, and we have a movies dataset that represents all the individual user ratings. We will then select out the information we care about, just to prevent us from carrying around information we don't need. So our ratings dataset will now be just the user ID, movie ID, and rating, leaving aside the timestamp column, because we don't actually need that information for what we're doing here. Now here's where it gets interesting. This is that self-join operation we talked about before. So we're going to have a movie pairs dataset. Its intent here is to contain every pair of movies rated by the same person. The way we're doing that is, first of all, we're going to take our ratings dataset and give it an alias of ratings1 so we can refer to it easily. And we will join that with another copy of the ratings dataset that's referred to as ratings2. So we're going to do a join between ratings1 and ratings2, which are really the same thing. They're both pointing to the ratings dataset. That's why we call this a self-join: we're joining the ratings dataset on itself, and we will join it based on the user ID column. Okay, so we're saying that we're going to join only if the user ID in ratings1 matches the user ID in ratings2. So this will have the effect of pairing together all the movies that were rated by the same person. Got it?
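For anyone who wants to see the shape of that self-join, here is a minimal sketch using the Dataset API. It assumes spark-sql is on the classpath and a local master; the case classes, the object name, and the three users' ratings are illustrative stand-ins, not the course's files. The extra movie-ID ordering condition is one possible version of the duplicate-removal heuristic mentioned in the algorithm overview.

```scala
import org.apache.spark.sql.SparkSession

object SelfJoinSketch {
  // Hypothetical row types, named for illustration only.
  case class Rating(userID: Int, movieID: Int, rating: Int)
  case class MoviePair(movie1: Int, movie2: Int, rating1: Int, rating2: Int)

  // Made-up ratings: users 1 and 2 rated movies 50 and 172; user 3 rated only 172.
  val sampleRatings: Seq[Rating] = Seq(
    Rating(1, 50, 5), Rating(1, 172, 5),
    Rating(2, 50, 5), Rating(2, 172, 5),
    Rating(3, 172, 4)
  )

  def moviePairs(spark: SparkSession, ratings: Seq[Rating]): Seq[MoviePair] = {
    import spark.implicits._
    val ds = ratings.toDS()

    ds.as("ratings1")
      // Join the dataset to itself on userID; the ordering on movieID keeps
      // (50, 172) and drops both (172, 50) and a movie paired with itself.
      .join(ds.as("ratings2"),
        $"ratings1.userID" === $"ratings2.userID" &&
          $"ratings1.movieID" < $"ratings2.movieID")
      // Rename the columns so the result is easier to work with.
      .select(
        $"ratings1.movieID".alias("movie1"),
        $"ratings2.movieID".alias("movie2"),
        $"ratings1.rating".alias("rating1"),
        $"ratings2.rating".alias("rating2"))
      .as[MoviePair]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SelfJoinSketch")
      .master("local[*]")
      .getOrCreate()
    moviePairs(spark, sampleRatings).foreach(println)
    spark.stop()
  }
}
```

User 3 contributes no rows because a single rating cannot form a pair, which mirrors the Ted case from the earlier walkthrough.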
You kind of have to understand how join operations work in SQL, and this isn't really a SQL class, but the upshot is, that's what you get out of this expression here: every pair of movies that were rated by the same person. And furthermore, while we're at it, we're going to enforce that the movie ID for ratings1 will be less than the movie ID coming from ratings2. And we're doing this just to prevent duplicates. So again, we don't want to have a separate entry for ratings1 paired with ratings2 as we would for ratings2 with ratings1. By doing this, we just make sure that we're only capturing one unique pairing there. Once we have that, we can then rename things to make it easier to use. So we will rename ratings1.movieID to movie1, ratings2.movieID to movie2, ratings1.rating to rating1, and ratings2.rating to rating2, again, just to make it easier to work with. And furthermore, we will explicitly make this a dataset using the movie pairs case class that we defined above. Now we have at our disposal that movie pairs dataset that contains pairs of movies and their ratings for every unique user that rated that pair of movies. Given that, we can now call computeCosineSimilarity to construct our movie pairs similarities dataset, which will contain, for every pair of movies, how similar they are to each other based on all the users that rated those movies together. So before we go further and talk about that cache operation, let's see what computeCosineSimilarity does. That's up here, somewhere. There it is. So there's a little bit of fancy math going on here. I don't really want to get into the cosine similarity metric itself too much, because that's not really anything to do with Spark programming. Again, if you want to learn more about that, you can check out my course on recommender systems. But at a high level, we're creating three new columns here: xx, yy, and xy.
Those are going to hold x squared, y squared, and x times y from the formula that we use for computing cosine similarity. Again, it's just basically an angle between two virtual vectors in this user-movie space. We can then calculate the similarity scores thusly. Basically, we use that new dataset called pairScores, which has those extra terms appended to it. And then we call agg to aggregate together all the entries for every given movie pair there, using the following expression. So for all of the movie pairs, for all the users that rated those two movies together, we're going to go across them all with that agg function. We will sum up the xy column and call that the numerator. That's just going to be the numerator of our expression for computing cosine similarity. And then for the denominator, that will end up being the square root of the sum of the xx column and the square root of the sum of the yy column multiplied together. And finally, we will have numPairs, which is just going to count up how many entries of the xy column exist. That's just a shortcut for figuring out how many users actually rated this pair of movies together. And that's information that we need to compute the actual similarity score, which is what's going to land in this result dataset here; we just add a new score column to that as well. What we do is first make sure that we're not going to be dividing by 0. We check explicitly to make sure the denominator is not 0. If it's not, then we just divide the numerator by the denominator; otherwise we just have a null result there. And at this point we have our actual similarity score computed based on the cosine similarity metric. We then just whittle this down to the columns that we care about, which are going to be movie1 and movie2, the similarity score between them, and the number of pairs supporting that score.
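The arithmetic behind that aggregation is easier to check on a tiny example, so here is a sketch in plain Scala collections rather than Spark. The function and object names are made up for illustration. It uses the standard cosine formula, which divides the sum of x times y by the product of the two square roots, and it returns None where the lecture's version would produce a null.

```scala
object CosineSketch {
  // Each element is one user's ratings for a movie pair: (rating1, rating2).
  // Returns (score, numPairs), or None when the denominator would be zero.
  def cosineSimilarity(ratingPairs: Seq[(Double, Double)]): Option[(Double, Int)] = {
    // Numerator: sum of x * y over every user who rated both movies.
    val numerator = ratingPairs.map { case (x, y) => x * y }.sum
    // Denominator: sqrt(sum of x^2) times sqrt(sum of y^2).
    val denominator =
      math.sqrt(ratingPairs.map { case (x, _) => x * x }.sum) *
        math.sqrt(ratingPairs.map { case (_, y) => y * y }.sum)
    val numPairs = ratingPairs.size
    if (denominator != 0) Some((numerator / denominator, numPairs)) else None
  }

  def main(args: Array[String]): Unit = {
    // Two invented users: one rated the pair (3, 4), the other (4, 3).
    // Numerator = 12 + 12 = 24; denominator = sqrt(25) * sqrt(25) = 25.
    println(cosineSimilarity(Seq((3.0, 4.0), (4.0, 3.0)))) // score is 24 / 25 = 0.96
    println(cosineSimilarity(Seq.empty))                   // None
  }
}
```

Because star ratings are never negative, scores from this formula fall between 0 and 1, which is part of why the script can use a threshold as high as 0.97.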
And we will force that into a dataset as well, using the movie pairs similarity case class above, and return the result dataset, which again at this point just contains movie1, movie2, score, and numPairs. Whew! All right, back to where that was called. So we now have this dataset of all the similarity scores between every unique pair of movies. And we're probably going to use that more than once. So let's go ahead and cache that so that we'll have it in memory and handy to go, no matter what we're going to do with it later on. Not only are we going to use this to display our results, but you could imagine we could actually make a real item-based collaborative filtering system out of this by keeping that movie pair similarities dataset in memory. We can then take the set of everything that any new user has liked, expressed interest in, or rated highly, whatever you want to use as an indication of interest, and then hit that movie pair similarities dataset to very quickly get back all the similar movies to the movies a person liked. So again, this would be very important to cache if you were building a real recommender system here. But all we're going to do here is just try to get the results of the top similarities for a given movie. So we're going to check that we passed in an argument here. The idea is that we're going to pass in, as an argument to this script, a movie ID that we're interested in seeing all the similarities to. And furthermore, we're going to set some thresholds here. We're going to say that unless there's a 97% similarity score between two movies, we're not going to consider it similar enough to be interesting. And we will also say that you need to have at least 50 users in common that rated both of these things together. So that's sort of the minimum support that we need to have confidence that this is a reliable result. You don't want to be making a recommendation based on what two people said.
Ideally, you want many people that agree with each other to give you a better result. So these thresholds end up being important in getting quality results, and they're rather arbitrary, but we'll get back to that. So we'll apply those filters here. We will filter the movie pair similarities dataset, not just to enforce those thresholds, making sure that the score is greater than our score threshold and numPairs is greater than the co-occurrence threshold, but we're also going to filter it down to the movie ID that we're interested in. The movie that we passed in as an argument is the movie that we want to see similar movies to. So we enforce that movie1 is equal to movieID or movie2 is equal to movieID. We don't really know if it's going to be on the movie1 or the movie2 side; it could be either, depending on the order of the movie IDs, right? So if either of the movies in the movie pair is the one that we're interested in, we're going to let that through, and we'll also check that our thresholds for quality are met as well. Once we have that, we will sort descending based on that score column to get the most similar movies to that movie, and take the top 10. We would call this a top-N recommender in recommender system parlance, because we're taking the top 10 results and making those our recommendations for this movie. And then we print it out, and we just say "top 10 similar movies for" whatever that movie name is, to make that human-readable, again using that movie names dataset that we loaded up a long time ago for that movie ID. And for every result that we get back, we iterate through it and we extract the similar movie based on whichever movie ID is not the one that we passed in as a parameter. We print out that result along with its score and the strength, based on the number of pairs that supported that score. Whew! So hey, it looks like this should work.
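Here is a compact sketch of that filter, sort, and take-the-top-N step, done on an ordinary Scala collection so it runs without Spark. The case class layout follows the columns named in the lecture (movie1, movie2, score, numPairs) and the default thresholds are the lecture's 0.97 and 50, but the function name and all of the sample rows are invented for illustration.

```scala
object TopSimilarSketch {
  case class MoviePairsSimilarity(movie1: Int, movie2: Int, score: Double, numPairs: Long)

  // Returns (similarMovieID, score, numPairs) for the best matches to movieID.
  def topSimilar(
      sims: Seq[MoviePairsSimilarity],
      movieID: Int,
      scoreThreshold: Double = 0.97,
      coOccurrenceThreshold: Double = 50.0,
      n: Int = 10): Seq[(Int, Double, Long)] =
    sims
      // Keep pairs that involve our movie on either side and pass both thresholds.
      .filter(s =>
        (s.movie1 == movieID || s.movie2 == movieID) &&
          s.score > scoreThreshold &&
          s.numPairs > coOccurrenceThreshold)
      // Most similar first, then keep the top n.
      .sortBy(s => -s.score)
      .take(n)
      // The similar movie is whichever ID is not the one we asked about.
      .map { s =>
        val similarID = if (s.movie1 == movieID) s.movie2 else s.movie1
        (similarID, s.score, s.numPairs)
      }

  // Made-up similarity rows, not real MovieLens results.
  val sample: Seq[MoviePairsSimilarity] = Seq(
    MoviePairsSimilarity(12, 56, 0.99, 120),
    MoviePairsSimilarity(56, 98, 0.98, 80),
    MoviePairsSimilarity(56, 100, 0.99, 10), // too few co-raters
    MoviePairsSimilarity(56, 101, 0.90, 200), // score too low
    MoviePairsSimilarity(1, 2, 0.999, 500)    // does not involve movie 56
  )

  def main(args: Array[String]): Unit =
    topSimilar(sample, 56).foreach { case (id, score, strength) =>
      println(s"movie $id\tscore: $score\tstrength: $strength")
    }
}
```

Passing different values for the two threshold parameters is a quick way to see how sensitive the result list is to them.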
You know, it was a lot to talk about, but if you look at the code for all of it, it's really not that much code, right? It's not really that bad. I mean, there's some funky stuff going on for sure that you need to wrap your head around, with these more complicated expressions like this aggregate here or the self-join operation. But once you get past that, it's not that much code. So let's go ahead and run it and see what happens. Before we run it, though, we need to pass that parameter of what movie ID we want to get results back for, right? To do that, we showed you that little trick earlier in IntelliJ. Right-click on movie similarities and we're going to say Create Movie Similarities. This will create a run configuration that we can set up explicitly. And that has a slot here for program arguments. So you can put in whatever movie ID you want here for any movie you're interested in; let's say 56, whatever that happens to be. And we'll say OK. And now up here we have the movie similarities run configuration that we just defined, and we can hit the play button to run it. So let's kick that off and see what happens. Off it goes. We loaded our movie names, and now it's off actually computing those similarities. And there we have our results. So that's pretty good. That was a reasonably small amount of time for a very complicated operation. And doing a self-join on a big dataset is no small feat, mind you. And yeah, we got back the top results for Pulp Fiction. It turns out that's what movie ID 56 is, and it's kind of a gritty movie. And that came back with more gritty movies. So it actually seemed to work. I've never seen Smoke before. I don't actually like Pulp Fiction, so I'm not sure I actually want to go watch that myself, but Reservoir Dogs, Donnie Brasco, True Romance: these are all reasonable results for similar movies to Pulp Fiction based on other user ratings. So there you have it.
If you were going to build "people who liked this movie also liked" on Netflix or something like that, you now know how to do that, and you could actually scale that up using Apache Spark to handle a massive amount of ratings or a massive number of movies, because you can actually throw a whole cluster at this now. So there you have it: an example of doing movie similarities and item-based collaborative filtering, at least the first half of it, using Apache Spark. And as you'll see later on, there's actually a built-in function in the machine learning library for Apache Spark that does something similar, but it doesn't actually generate results that good with the MovieLens dataset. So sometimes just using the off-the-shelf tools isn't good enough, and you need to go back and sort of be inventive and implement new algorithms using Spark that maybe haven't been seen before in Spark. Again, that's what they're going to pay you the big bucks for, guys. But that's a good example and we're going to wrap up on that one. Before we do though, I'm going to challenge you to actually make this better. So let's talk about that in our next lecture. 40. [Exercise] Improve the Quality of Similar Movies: So it turns out that movie similarities is kind of near and dear to my heart for a couple of reasons. First of all, most of what I did during my time working at Amazon.com was working in the field of collaborative filtering and their user recommendation systems. I spent a long time trying to improve those systems there, and it was fun stuff. It really is interesting work. Also, I ran IMDb for a while. So that's sort of that intersection of collaborative filtering and movies that pervades a whole lot of my courses. So my challenge for you is to make the results better. And there's really no right or wrong answer to this, right? The thing with recommendations is that they're often rather qualitative in nature. I mean, you can measure them based on how people react to them in the real world.
But at the end of the day, you kind of have to judge for yourself if they're good or not. So my recommendation, see what I did there, recommendation, my recommendation to you is to go find a movie that you are passionate about, find that in the MovieLens dataset, and look up its movie ID. And if there's a movie that you're really familiar with, you'll have a good intuitive feel as to what good similar movies to that movie might be. So start off by running the script that we just had for that movie ID that you know and love so well. And judge for yourself if those are good recommendations, and think about how those could be better. So here are some ideas of things you can do to modify that script to improve the quality of those recommendations further. One idea would be to just discard all the bad ratings. So if somebody rates a movie one or two stars, do we really want that influencing our measure of how similar a movie is? One sort of pitfall of this algorithm is that movies that are similar in terms of everyone hating them would still show up as similar movies. So that's not necessarily a good thing if you're trying to recommend good movies to people that they might want to watch. So that's one idea. Just introduce another filter that gets rid of any bad ratings right from the get-go and see what difference that makes to the outcome. You could also try different similarity metrics. You'd have to look these up. Right now we're using cosine similarity, but there are alternate metrics out there as well, like the Pearson correlation coefficient, the Jaccard coefficient, or just straight-up conditional probability. So you might want to look those up and try to implement one instead of the cosine metric and see if that does better or worse. You can also play with those thresholds, and that's probably the easiest thing to do. Maybe you should have more minimum co-raters or a higher minimum score for making it into the final cut.
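Two of those ideas, dropping bad ratings up front and swapping in the Jaccard coefficient, can be sketched in a few lines of plain Scala. This is only an illustration on in-memory collections, not the course's Spark code; the Rating case class, the SimilarityIdeas object, and the 3-star cutoff are assumptions made for the example.

```scala
// Plain-Scala sketch of two of the suggested experiments.
// Rating and SimilarityIdeas are hypothetical names; the 3-star cutoff is an assumption.
final case class Rating(userID: Int, movieID: Int, rating: Int)

object SimilarityIdeas {
  // Idea 1: discard low ratings before computing any similarities
  def dropBadRatings(ratings: Seq[Rating], minRating: Int = 3): Seq[Rating] =
    ratings.filter(_.rating >= minRating)

  // Idea 2: Jaccard coefficient = |users who rated both| / |users who rated either|
  def jaccard(ratings: Seq[Rating], movieA: Int, movieB: Int): Double = {
    val usersA = ratings.filter(_.movieID == movieA).map(_.userID).toSet
    val usersB = ratings.filter(_.movieID == movieB).map(_.userID).toSet
    val unionSize = usersA.union(usersB).size
    if (unionSize == 0) 0.0
    else usersA.intersect(usersB).size.toDouble / unionSize
  }
}
```

Note that Jaccard ignores the rating values entirely and only looks at who rated what, so combining it with the bad-ratings filter changes its meaning to "users who liked both" over "users who liked either".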
That will have a trade-off in coverage though. You know, if you have more obscure movies, you might not have enough data to actually make a recommendation if you set those thresholds too high. Or you could invent your own new similarity metric that takes the number of co-raters into account. So maybe the number of people that watched the pair of movies is in itself an indication of how good those movies are. Maybe just the fact that they're popular and a lot of people have seen them both is something you should take into account. So maybe you can normalize that somehow and introduce that into your similarity metric as well. And if you really want to get ambitious, you could even pull in genre information from the u.item data file in the MovieLens dataset, which contains an array of zeros and ones that indicates what genres a movie belongs to. And maybe you could filter out movies that are in different genres, or boost the similarity scores for movies that have a lot of genres in common. So, some general ideas there on how you might make the results better. Again, there is no right answer here. I can't show you, like, the correct answer to this activity. I just want you to go and play with it and see if you can improve the results. And if you do, I'd be very curious to hear about what you did, either in the comments or the Q&A or whatever mechanism the platform you're watching this course on gives you for providing feedback. So have at it. It's actually a fun activity and a very important one too; a lot of e-commerce depends on making recommendations. So if you can learn how to do that, it's a good thing. Go off and play with that, and then we'll come back in the next section and talk about scaling things up. 41. [Activity] Using spark-submit to run Spark driver scripts: So far in this course we've been developing on your local desktop PC because, well, that's cheap and easy, right?
You don't want to be spending a bunch of money as you learn this stuff, necessarily, but it's time to scale things up. So in this next section we'll talk about running Spark on a real cluster, be that a cluster that your company owns or a cluster you might be renting on a service like Amazon Web Services. There are a few special considerations when you're writing your driver scripts for use on a cluster, and certain ways that you're going to be deploying your code and packaging your code and running your code when it's on an actual cluster. So let's get into those details, and this should arm you with the knowledge you need for actually using Spark in a real production setting at large scale. So far in this course, we've been running our Spark applications on our desktop within the IntelliJ environment. And that's all well and good for development. But if you want to actually run a Spark application in the real world, you're probably going to be deploying it to a cluster somewhere and not within IntelliJ. You're going to want to be able to fire this off from some sort of cron job or some sort of management system that will kick off your Scala application on some sort of a schedule, right? So how do we do that? Well, let's get started by talking about how to package and deploy your application and run it from outside of IntelliJ, just from a command line somewhere. So in the real world, on a cluster we'll be running our scripts with something called spark-submit. So when you install Apache Spark, it comes with an application called spark-submit. And its job is to read in a JAR file that contains your compiled Spark application and distribute that out to your entire cluster to be run. And it can do all of this outside of IntelliJ. It's all completely standalone. Now before you do this, there are some things you need to make sure of. First of all, make sure that you didn't leave any paths to your local file system inside of your script, right?
So in our examples so far we've been referring to files that exist on our local file system on a relative path from where our project is. Now in the real world, you want to make sure that your data files are accessible to whatever node on your cluster might be running your application, right? So generally speaking, your data will be deployed on some sort of a shared file system. Maybe it's HDFS, maybe it's Amazon's S3, something, but it's not going to be the local file system, because when you're distributing this code, that local file system won't necessarily be available to you. So first things first, make sure any paths to files point to a distributed file system, or at least to some file that is accessible wherever your script might be running from. Then we're going to package up our Scala project into a JAR file somehow. And there are a couple of ways of doing this. For now, we're going to get started by just adding a JAR artifact in IntelliJ to export our actual application code itself to a JAR file. This has limitations though. If you do have dependencies in your script beyond the stock Spark libraries, that's going to be an issue; we need to think about how to distribute those dependencies. And later on we'll talk about using SBT to do that. We've actually been doing that all along; you just didn't know it. And once we have a JAR file, we can use spark-submit to execute that driver script outside of the IDE. The format of it is pretty simple. You just type in spark-submit --class, followed by whatever your class name is that you want to run that has your main function. If you do have dependencies on other JAR files, you can use --jars to specify where those can be found. And you can also use --files to automatically place files alongside your application. So for small files that might be little lookup files or something like that, you can get away with using --files for that.
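The shape of that command line can be sketched as a small Scala helper that assembles the argument list, with the application JAR path going last. The flags --class, --jars, and --files are the spark-submit options just described (the latter two take comma-separated lists); the SparkSubmitCommand object and every class name and path passed to it are hypothetical placeholders.

```scala
// Sketch only: builds the argument list for a spark-submit invocation.
// SparkSubmitCommand is a hypothetical helper, not part of Spark.
object SparkSubmitCommand {
  def build(
      mainClass: String,
      appJar: String,
      extraJars: Seq[String] = Nil,
      files: Seq[String] = Nil
  ): Seq[String] = {
    // --jars and --files each take a single comma-separated list
    val jarsOpt =
      if (extraJars.isEmpty) Seq.empty[String] else Seq("--jars", extraJars.mkString(","))
    val filesOpt =
      if (files.isEmpty) Seq.empty[String] else Seq("--files", files.mkString(","))
    // Options come first; the JAR containing the main class goes last
    (Seq("spark-submit", "--class", mainClass) ++ jarsOpt ++ filesOpt) :+ appJar
  }
}
```

Joining the result with spaces gives the line you would type at a prompt, for example spark-submit --class com.example.HelloWorld app.jar when there are no extra JARs or files.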
And finally, the path to the JAR file itself that contains the code you want to execute. So with that, let's give it a try. We'll actually execute our HelloWorld example, going way back to the beginning of the course, using spark-submit outside of IntelliJ. So I'll walk you through how to do that. So before we can try running our Spark application outside of IntelliJ, first we need a Spark environment to run it within. So we're going to sort of set up a standalone Spark environment on our desktop PC here, which will simulate the same environment that we might have when we're running this on a server in the cloud somewhere. So let's open up our web browser and head on over to spark.apache.org. And this is where you're going to download the latest version of Apache Spark itself. You're also going to need 7-Zip if you're on Windows, or some utility that can decompress .tar.gz files. So if you're on Windows and you don't already have a utility that allows you to decompress a .tar.gz file, I recommend installing 7-Zip to take care of that first. So back to Apache Spark. We'll go to Download Spark, and we are using Spark 3 in this course. So I will choose that Spark release for now. And we want the pre-built version for Apache Hadoop 2.7 or whatever it is. And let's hit Download Spark 3. We'll use the suggested mirror site and wait for that to come down. It's only about 200 megabytes, so it will only take a few seconds here. Once that's done, we'll use 7-Zip on Windows to decompress it. Or if you're on Mac or Linux, you can just go to your command line and say, you know, the usual gunzip and then tar -xvf, whatever command you need to decompress that. I'm sure you're familiar with .tgz files already if you're on Mac or Linux. All right, looks like that downloaded. So let's go to our downloads folder and take a look at it. Since I installed 7-Zip, I can just right-click on it and go to 7-Zip and say extract.
And I believe I have to do it again. Yeah. So that extracted the gzip file to a tar file. That tar file in turn needs to be decompressed as well. And inside here we should have Spark itself, all pre-compiled for us. It's built using Java, so it is actually platform independent. So we can get away with running this on Windows for the most part. As you'll see, there are some glitches. So let's go ahead and Control-A and copy all of that, Control-C. And I'll go to my C drive and create a new folder called spark and open that up and paste it in there. And if you're on Mac or Linux, of course, you'd just be doing this using mkdir and the cp command to copy these files where you want them. Just make sure you remember where you put them. All right, so we have Spark 3 installed. That wasn't hard, right? So let's actually create a JAR file and see if we can run it using this new version of Spark that we have installed. So back to IntelliJ. Let's see here. So what we're going to do is go to File and say Project Structure. And we're going to go to Artifacts and click the plus sign. And we'll say JAR, and we'll start with an empty JAR. First, let's give it a name. Let's call it SparkCourse. And we need to tell it what we want to put in this JAR. So let's open up this SparkScalaCourse directory here. And you can see there are all the dependencies that we imported from Spark itself and log4j and everything that it depends on. But since we already have a Spark environment installed, we don't need to package up all those dependencies. We just need the code for our script itself. And that's going to live in this SparkScalaCourse compile output. So this will be the compiled bytecode of our code, the actual stuff under com.sundogsoftware.spark. So let's go ahead and double-click that. And we've added that to our JAR file. Let's also click on Include in project build to make sure it actually gets created. And we'll click OK.
And we'll click the Build icon again to force it to build that. All right, so you can see along the way it built SparkCourse.jar under C:\SparkScalaCourse, in out, artifacts, SparkCourse. So that's our JAR file that contains our code. Let's go looking for it. If we go to C:\SparkScalaCourse, there's our out directory, artifacts, SparkCourse, and there's SparkCourse.jar. So 399 kilobytes. That seems like about the right size for all of the compiled bytecode in our entire project here. So that includes every class that we have here. So let's go ahead and kick it off and see if we can use it. Now, as I said, you do need to make sure that any file paths are going to be valid still. So you'll see in HelloWorld, we do have a relative path here to data/ml-100k/u.data. So for this to work, I need to make sure that I'm running this from the right location where that path will be accessible. Again, if this were a real application on a real cluster, I would probably want to make sure that data was on some sort of a shared file system instead. But for the sake of illustration, running on our desktop here, we'll keep it that way. Let's open up a command prompt. And again, on Mac or Linux, you'd just use a terminal prompt. Let's navigate to our course materials folder. So for me that's C:\SparkScalaCourse\SparkScalaCourse. And from here we have that relative data path available to us. So that's good. So now let's actually execute the spark-submit command from our standalone environment of Spark that we installed. So as you recall, for me that was C:\spark\bin\spark-submit. The next thing we need to pass in is the class name that we want to execute. And so we're going to say --class com.sundogsoftware.spark.HelloWorld. And now we need to give it the path to the actual JAR file that we want to submit to spark-submit.
And that will be, for me, C:\SparkScalaCourse\SparkScalaCourse\out\artifacts\SparkCourse\SparkCourse.jar. OK, so this should work. What it's going to do again is take that JAR file of my compiled bytecode and actually pass that into Spark itself using spark-submit. It will look for that class name that I specified and try to execute it. And again, if we were on a real cluster here, spark-submit would kick everything off. It would distribute that code across the entire cluster, kick off our cluster manager, and do everything it needs to distribute that and make sure that it actually runs successfully. So let's hit Enter and see what happens. Again, note that we are outside of IntelliJ entirely here; we're using an entirely different standalone Spark environment. This is quite analogous to what you'd be doing in the real world. All right, so we have some scary-looking error messages, but ignore that for the time being. If you look at the top here, we did get our output. So it says Hello world, the data file has 100,000 lines. It actually worked. That's awesome. Now, pay no attention to these error messages. I know that sounds really hand-wavy, but this is actually a Windows-specific bug in Scala itself that's been around forever. There is an issue with file permissions on the temp directories that Spark uses on Windows, and that's what it's complaining about. In the real world, no one really runs production Spark jobs on Windows, so no one's ever bothered to fix this. So we're just going to ignore it for now. If you do this on Linux, you won't see that. By and large, you will be running Spark jobs on Linux in the real world, but the process will be the same. You'll still use spark-submit to kick it off. Even if you're on a giant cluster, the only difference is that it will run and you won't get these weird error messages about file permissions when you're done.
So there you have it: actually using spark-submit in a brand spanking new Spark environment using Spark 3, using our compiled bytecode that we generated from IntelliJ. But we're not actually running within IntelliJ anymore. So we've actually gone beyond the bounds of our IDE, which is important for real-world deployments. 42. [Activity] Packaging driver scripts with SBT: So wouldn't it be great if we could package up everything that we need, all of our dependencies and everything, into a single JAR file and just distribute that to the master node of our cluster and kick it off? Well, that's what SBT, or the simple build tool, lets you do. So let's talk about SBT in more depth and how we can package up your script and all of its dependencies together using it. So if you're familiar with Java, you might be familiar with Maven. You can sort of think of SBT as being like Maven for Scala. And what it does is manage your library dependency tree for you. So if you have a script that depends on some Scala or Java library, some JAR file, it will automatically go out and figure out not only where to get that from and how to package it into your ultimate JAR file that you are compiling, but also any dependencies that that package has too. Often you have these complicated trees of dependencies in these JAR files, and whatever package you are depending on in turn depends on other packages, which in turn might depend on other packages. So keeping track of all that by hand gets out of hand pretty quickly, but SBT will manage that complexity for you. It will automatically figure out what packages you need for everything to actually run, and gather those together for you and package them up. So that's really what SBT is for. It makes life a lot easier if you have a lot of dependencies, or if you have a library that has a lot of dependencies of its own. It's a lot easier than tracking down those dependencies by hand.
And it's a lot easier than passing a ton of options to the --jars command-line parameter when you're running spark-submit. So instead of passing a bunch of specific dependencies on the command line with spark-submit, we can package those all up into the JAR file itself so that we don't have to actually remember what those are and actually enter them on the command line. To get SBT: it's free, it's open source. We can get it from scala-sbt.org, and we'll show you that shortly. Using it is pretty simple. You just have to set up a directory structure that looks like this. So at the top level somewhere, you will have a project directory where it will compile stuff, and a src directory, which as you might guess is where your source will go. Under src there should be a main directory, and under main there should be a scala directory. And inside that scala directory is where you're going to put the actual Scala files that you want to be compiled. SBT will manage compiling that for you and packaging it up into your JAR file together with any dependencies that you specify. To use it, like I said, you just put your source files in the source folder. Pretty straightforward. Put it in the right place. And then in your project folder, we're going to create a little assembly.sbt file that contains just one line that looks like this. Now, 0.14.10 might change over time. That's still pretty current. There is a 0.15.0 out there now, but it's not widely released yet. So for now we're going to stick with the version 0.14 here of the sbt-assembly plugin. But that's all you have to do. That just tells SBT that we're going to be using this plug-in called sbt-assembly. And its job in life is to create that self-contained JAR file that we want. The real heart of it though is the SBT build file. And that is build.sbt, which should be placed at the root of your SBT directory tree there, alongside the src and project directories.
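The slide with that one-line file isn't reproduced in this transcript. For the sbt-assembly plugin at version 0.14.10, the line in project/assembly.sbt would typically look like the following; treat it as a sketch based on the plugin's published coordinates rather than a copy of the course file. The comments restate the directory layout described above.

```scala
// project/assembly.sbt
//
// Layout described above (build.sbt sits at the root):
//   build.sbt
//   project/assembly.sbt
//   src/main/scala/<your .scala files>
//
// Enables the sbt-assembly plugin, which adds the "assembly" task.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
```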
So here's an example of what one might look like. We specify the name of the package that we're creating, a version number, whatever you want it to be, the organization that's associated with this package, and the Scala version that it depends on. This is important to get right. Remember that different versions of Spark will require different versions of Scala. So in this example, we're specifying a library dependency of org.apache.spark, spark-core. OK, so this is telling us that this popular movies script that we have in Scala depends on the spark-core package from Apache Spark. And we are specifically specifying version 3 of Spark here. So because we know that Spark 3 requires Scala version 2.12, that's why we have Scala version 2.12.3 up there, or whatever the latest version might be that you're using. Now this is important to get right. For example, in the next lecture we're going to upload our JAR file to Amazon's Elastic MapReduce service, which as of this recording does not support Spark 3 yet. They support Spark version 2.4.5, which requires Scala version 2.11. So in that example, we're going to specify spark-core version 2.4.5 instead of 3, and Scala version 2.11-something. So you've got to make sure those match up or else it won't work right. Note that libraryDependencies here is actually a sequence, so we can have more than one thing in there if we just have a comma-separated list of lines there. So in addition, we might have Spark SQL or third-party libraries even. Whatever it is that you need for your script to run, you would just list it here, and then SBT would go out and get it and use it for compiling your code and also for packaging that into your final JAR file. Now one thing we want to talk about in particular here is that provided clause there. So provided means that we can assume that that package will be pre-installed wherever we're going to be running this from.
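Since the slide itself isn't in the transcript, here is a minimal sketch of a build.sbt along the lines just described. The project name, version number, organization, and the exact Scala and Spark patch versions are assumptions chosen for illustration; check which Scala release your Spark version was built against before copying them.

```scala
// build.sbt (sketch; names and exact version numbers are assumptions)
name := "PopularMovies"

version := "1.0"

organization := "com.sundogsoftware"

// Must match the Scala line your Spark release was built for (2.12.x for Spark 3)
scalaVersion := "2.12.10"

// libraryDependencies is a sequence, so more entries can be added, comma separated.
// %% appends the Scala binary version (for example _2.12) to the artifact name.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0" % "provided"
)
```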
So because I'm going to be deploying this to a cluster that has Spark installed already, there's no need for me to include spark-core within the JAR file itself, because that's going to be available to the system as a whole already. So by saying provided, that means that I do require this package in order to compile my code, but I don't need to package it into my final JAR file because it will already be present wherever I'm going to be running this from. If I wanted to actually include spark-core in the JAR file itself, I would leave off the provided, and that would bundle it all into a truly self-contained JAR file. But since I already have Spark installed where I'm going to be running it from, we don't need that. We can leave that out with the provided flag. So as another example, let's say that I need to depend on Kafka. That's actually not part of Spark. I could have another line there in the library dependencies that says org.apache.spark, spark-streaming-kafka, and whatever the version of that package is that you need. Now in that case, I would not say provided, because I know that's not pre-installed on the system that I'm going to be running it from. So that would actually package that Spark Streaming Kafka JAR into the ultimate JAR file that I'm going to be deploying. So once you have everything in place, all you have to do is run sbt assembly from the root folder, and it will go off and work its magic. It will go off and compile your scripts. It will package it all together after gathering all the dependencies that it needs. And you'll find the final JAR file under target/scala- whatever version of Scala you told it to build against. And then you have a JAR file that you can do whatever you want with. And the beauty again is that it is completely self-contained. As long as you can get that JAR file to the master node of your cluster that you're going to be running it from.
All you have to do is say spark-submit, the path to that JAR file, and it's done. You don't need to specify the class, no JAR dependencies, nothing. That's it. So let's go and try it out. All right, so first let's review that script that we're actually going to be packaging up here. So if we go back to IntelliJ and look for MovieSimilarities1MDataset, that's what we're going to package up here and ultimately send off to Amazon's Elastic MapReduce to actually run on a real cluster. So let's walk through what's different here. Not a lot really, but there are a few things to talk about. First of all, for the movies.dat file, note that we're passing this in as a specific path to movies.dat. That's going to be expected to be alongside where we're running spark-submit from. It's going to assume that that file is present there on the local file system. Now, like we talked about before, that's not usually a great plan. Like if you have a distributed file system where you can get that from, that's going to be better. But in our case, movies.dat is a pretty small file, so we can pass that around ourselves without a whole lot of trouble. However, generally speaking, you would want to have that on some sort of a distributed file system instead of relying on that file being present everywhere that you're going to run this script from. We are, however, making sure that we have the ratings.dat file on a much larger file system here. So that's going to be on Amazon S3. That's what the s3n prefix there refers to. So I have an S3 bucket named sundog-spark that contains the ratings.dat data file there. And because that is truly big data per se, well, really we could manage that on a single machine, but for larger datasets like that, you generally want to have that on some sort of a distributed file system for sure. So in that case, we did take the trouble to make sure that is available on Amazon S3.
And we're not going to assume that that's on the local file system. Now, this is a somewhat different format than we saw before for the ml-100k dataset. So let's talk a little bit about that too. Let's go to grouplens.org. And if we go to datasets, we can learn more about that specific 1 million dataset. Or there's a 1 billion dataset now too, well, if you really want big data, but this is the one that we're using in this example. So if you go to README.txt, it tells you about the file format in more detail. Let's scroll down. So the ratings file description, for example, is telling you that the format is UserID, MovieID, Rating, Timestamp, but in this case it's not tab separated. It's actually separated by these pairs of colons. So that's important to realize. The movies file information is important to know too; that is also delimited by double colons. A little bit unusual. So you've got to make sure you understand the format that you're dealing with before you write your code. If we go back to IntelliJ, we can see that we did specify that double colon as the separator in both cases. And we are specifying the character set for the movie names as well. That is also ISO-8859-1. So that's one difference, just the format and the path that we're using for those data files for the MovieLens 1 million dataset. We're switching to the 1 million dataset here just to illustrate actually operating on big data. So this is a big enough task that doing it on a single machine might be challenging. So we're tackling that now. As for what else is different? Well, not much. The rest of the code is pretty much the same. So we just had to make sure that we're getting our data from the right place and that it's in the right format. And apart from that, it's pretty much the same code that we saw back in the movie similarities dataset example earlier in the course. So let's close this out now that we're done talking about it.
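To see what that double-colon format means in practice, here is a small plain-Scala sketch that parses one line of ratings.dat by hand. It is not the course's Spark code (which sets the separator on the reader); MovieRating and MovieLens1M are hypothetical names, and the sample line in the usage note is only an illustration of the UserID::MovieID::Rating::Timestamp layout.

```scala
import scala.util.Try

// One row of ratings.dat: UserID::MovieID::Rating::Timestamp
final case class MovieRating(userID: Int, movieID: Int, rating: Int, timestamp: Long)

object MovieLens1M {
  // Splits on the two-character "::" delimiter; returns None for malformed lines
  def parseRating(line: String): Option[MovieRating] =
    line.split("::") match {
      case Array(u, m, r, t) =>
        Try(MovieRating(u.trim.toInt, m.trim.toInt, r.trim.toInt, t.trim.toLong)).toOption
      case _ => None
    }
}
```

For example, MovieLens1M.parseRating("1::1193::5::978300760") yields a rating of 5 by user 1 for movie 1193, while a tab-separated line in the old ml-100k format comes back as None, which is a quick way to catch pointing the script at the wrong dataset.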
First thing we need to do is download SBT itself. So if you head over to scala-sbt.org, you should see a download button somewhere. I'm sure this website will change over time, but hopefully you can find it. For Windows, there is a Windows installer, so let's go ahead and grab that. If you're on Linux or macOS, you can get a self-contained package here in either zip or .tgz format, whatever you prefer. And you can just decompress that and be done with it. You'll find the sbt executable within there. If you're on Ubuntu and you prefer to use a package manager, you can head over to the documentation for SBT and it will tell you about it. Just go to this link here and that will walk you through how to make sure you have the right repository in place and how to install it using apt-get. But I'm on Windows, so I'm just going to run the Windows installer and be done with it. Yeah, yeah, I know. Run anyway. Let's walk through it. And the nice thing is that this should put SBT in my path for me, so I don't have to worry about where it got installed to. All right, so that's out of the way. Next, let's actually get our SBT directory so we can have something to package up. So if you head on over to http://media.sundog-soft.com/SparkScala/sbt.zip. OK. Go ahead and get that and decompress it, however you decompress things on your OS. And you should have an sbt folder that looks like this. So let's just move that someplace where we won't lose it. I'll cut that and let's just put it on my C drive. All right, so now we have an sbt folder, and if we look in the contents of it, we see what we expect to see, right? There's that project directory. And inside the project directory is assembly.sbt. Let's examine that. And it does contain that one line that we talked about earlier, just specifying that we are going to be using the sbt-assembly plug-in with that version of it.
And under the src directory, we have a main and a scala directory, which contains the MovieSimilarities1MDataset.scala file that we just looked at. And finally at the top directory here we have our magical build.sbt file. And if we examine that, we can see that we have a name that matches our class name, whatever version number we want to give to it, our organization name, the Scala version that we depend on, and the versions of Spark Core and Spark SQL that we depend on for this script. Now again, in this example, Amazon EMR is currently on version 2.4.5. So that's what I've specified here, because that's where I'm going to be deploying this to in the end. And I know that Spark 2.4 relies on Scala version 2.11. So that's the method to all that madness. And again, note that I'm saying provided because I know that Spark Core and Spark SQL are going to be pre-installed on my Amazon EMR cluster. I do not need to bundle Spark itself into my final JAR file. Now this is sort of an uninteresting example because I don't have any third-party dependencies here. So in the end, all that I'm really going to be packaging up is the JAR file for my script itself. But this comes in very useful if you have a more complicated dependency tree, right? So for the sake of example here, we're not doing anything super complicated, but this is how you would manage your dependencies if you did have a third-party package that was not pre-installed on the system you'll be running it from. You'd just list those here in addition, making sure that it's comma separated, and you would not say provided on those specific ones where you do want them to be bundled up. So let's see if it works. Let's open up a command prompt, and I'll do this with administrator permissions just to be safe. So if I go down in Windows, that would be under Windows System, Command Prompt. I'll right-click on that and say More, Run as administrator. And let's navigate to that sbt folder. There it is.
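For reference, a build.sbt matching that description might look roughly like this. It is a reconstruction from the narration, not the file shipped in sbt.zip: the version string, organization, and the 2.11.12 patch release are assumptions, and the commented-out Kafka line is only there to show where a bundled (non-provided) dependency would go.

```scala
// build.sbt for the EMR deployment (sketch reconstructed from the narration)
name := "MovieSimilarities1MDataset"

version := "1.0"

organization := "com.sundogsoftware"

// Spark 2.4.x is built against Scala 2.11; the exact patch release here is an assumption
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Pre-installed on the cluster, so compile against them but do not bundle them
  "org.apache.spark" %% "spark-core" % "2.4.5" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.4.5" % "provided"
  // A dependency that is NOT on the cluster would be listed without "provided"
  // so that sbt-assembly bundles it, for example (add a comma after the line above):
  // "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.5"
)
```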
Now we can just type in sbt assembly. And off it should go. And it will actually need to go out and retrieve a Scala environment and Apache Spark and everything that it needs to actually compile that. So it's kind of magical how SBT can have this whole self-contained environment that it builds from scratch. And there we have it. So it looks like it succeeded. Let's take a look and see what we have here. So if we go in here, we now have a target directory. Let's see what we have in there. And there's a scala-2.11 directory, and inside there is our JAR file. Very cool. So it looks like it worked: MovieSimilarities1MDataset-assembly-1.0.jar, just sitting there waiting for us to use it. So in our next lecture, let's use it. 43. [Exercise] Package a Script with SBT and Run it Locally with spark-submit: Okay, so at this point I've taught you how to use spark-submit on a local installation of Apache Spark to run your scripts. And I've also shown you how to use SBT to bundle up your script into a self-contained JAR file. So your challenge is to put these two things together. I want you to use SBT to bundle up a JAR file and then run it locally using spark-submit on your local desktop. As for the strategy here, choose any script you want. I don't really care, but I'm going to choose the MinTemperaturesDataset script. It really doesn't matter which one you want to play with. And what you're going to have to do is modify the build.sbt file in your sbt directory tree and make sure you're using the same version of Spark that you installed locally for using spark-submit. And we also need to make sure that we're using a compatible version of Scala for that version of Spark. So this is going to involve doing a little bit of research online. You're going to have to figure out what version of Spark you're using and what version of Scala is compatible with that version of Spark.
And once you've done that, furthermore, you need to figure out what the current release of Scala is for that major revision of Scala. So I don't want to give you too much guidance here, because being a developer in the real world is often mostly about doing this sort of research yourself and figuring out new problems yourself. You're not going to be able to go to your boss or your colleagues to figure this stuff out for you. I mean, you can, but they're going to find that pretty annoying and you probably won't last long if you do that all the time. So I want you to get some practice in figuring this stuff out yourself. Once you have, though, it should just be a matter of making sure your script is in the right place in the build tree for SBT, and also making sure that you have the correct build.sbt put together for the right version of Spark and spark-submit. At that point, we can just use sbt assembly to build it and make sure that we run spark-submit from the directory that the script assumes. As you recall, most of our scripts assume that there is a local data directory that contains the data that we want on the local file system. So we need to make sure to execute that script from the right place, where that data subdirectory exists. And with that, go off and give it a shot. And in the next video, I'll walk you through how I did it. 44. Exercise solution: Using SBT and spark-submit: So let me walk you through how I've gone about packaging up the MinTemperaturesDataset script using SBT and running it locally using spark-submit, outside of IntelliJ. So the first thing you always want to do is take a look at the script before you package it up and make sure that there's nothing in there that you need to change before you run it in a different environment, right? So first of all, note that we're using local[*].
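For reference, the local[*] setting usually lives on the session builder. The snippet below is a minimal sketch rather than the course script itself: the object and application names are hypothetical, and it assumes the Spark SQL library is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only.
object MasterSettingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("MasterSettingSketch")
      // "local[*]" runs Spark inside this one process, using every core on this machine.
      // A master set in code takes precedence over spark-submit's --master flag,
      // so this line would need to go before shipping the script to a real cluster.
      .master("local[*]")
      .getOrCreate()

    // Print the master the session actually ended up with.
    println(spark.sparkContext.master)

    spark.stop()
  }
}
```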
Now if I was actually running on a real cluster, I'd probably want to take that out, because on a cluster I want to be using every core and every machine in my cluster, not just the local machine that I'm running the driver script on. But since my intent is to still run this just on my local PC, that can remain unchanged in this case. Also pay attention to file paths. This is assuming that we are going to data/1800.csv for our data file there. And that implies that I'm going to have to have a data folder on my local file system that is alongside the location where I'm invoking this script from using spark-submit. So I could change that to an absolute path if I wanted to. I could change that to some shared file system if I wanted to. But again, since I'm just running on my local file system, I'll just make sure that I remember to run this from alongside that data folder where I have it already installed. So I'm not going to actually change anything here, but you always want to be cognizant of those sorts of issues before you package up a script that you've been playing with locally and ship it off for use someplace else. It would be really horrible if I forgot to take out that master line there, right, and actually shipped it off to a cluster. I'd have this huge cluster and I wouldn't be taking advantage of it at all in that case. So always be careful of that sort of thing. So let's start with the easy part. Let's copy that script itself and put it in sbt. So let's minimize that. In my SparkScalaCourse folder, let's find that file. It's under src/main/scala/com/sundogsoftware/spark. And what was it? MinTemperaturesDataset.scala. I'm going to go ahead and copy that, go to my sbt directory, and go into src/main/scala. And I'll paste that in here and remove the MovieSimilarities1MDataset file that we used in the previous activity.
So now I need to make sure that I'm using the right version of Spark and the right version of Scala for this to be able to run on my local Spark environment. So let's go back to the top of sbt here and I'm going to edit that build.sbt file. Use whatever editor you want. For me, I have something called Notepad++ installed. Anything works, right? Any text editor will do the job here. So we need to specify what version of Spark we're using and what version of Scala. And these need to be compatible, not just with each other, but also with what you have installed for Spark on your local system. Note that we're saying that Spark itself is provided, so it's not going to package up Spark itself in our resulting JAR file. It's going to use whatever version of Spark is present on my file system here. So I need to make sure this is building against the correct version of Spark. Let's start with that. So let's go back to my Spark installation and just double-check. So I installed Spark into C:\spark. And if I look at the RELEASE text file there, it tells me I'm using Spark 3.0.0. Okay, so that's one piece of the puzzle. Let's go back to my build.sbt file under sbt. And we will change Spark Core to 3.0.0. And as I'm using datasets, I do need the Spark SQL package as well, so we'll leave that in there too. And you also want to think about what other dependencies you have within Spark itself at this point, or outside of Spark. And in this case there aren't any more. But if I was packaging up, say, a machine learning script, I might want to include Spark MLlib, or Spark Streaming if I was doing some streaming script. So make sure you have any dependencies that you need there. All right, now we need to figure out what version of Scala is compatible with that version of Spark. So to do that requires a little bit of research. Let's fire up a web browser. And I'll bring over a new, clean window here. So let's go to spark.apache.org.
And if we look at Spark 3.0, go to the Download Spark area here, and we can see that it tells us what version of Scala is compatible with what version of Spark. So it says: note that Spark 2.x is pre-built with Scala 2.11, except for version 2.4.2, which is pre-built with Scala 2.12. But Spark 3.0, that's us, is pre-built with Scala 2.12. Okay, so I know I need Scala version 2.12. Is that specific enough? Well, let's look. I want a minor version as well. Okay. Well, we'll change the 11 to 12, but what comes after the 12? I don't know. Let's do some more research. So again, you kind of have to be resourceful sometimes. So let's just search for Scala 2.12 and see if we can figure out what the current release of that is. Looks like that's pointing me to scala-lang.org, which is the official home of Scala. And while I don't want it, it looks like 2.13 is actually out, but again, I need to use 2.12 for that version of Spark. Let's see if the release notes tell me anything. Okay. So apparently the latest release of 2.12 is 2.12.12. All right, let's go with that. So back to my build.sbt file, we will specify 2.12.12 and I will save this. At this point, we should be ready to package this up. Let's go ahead and open up a command prompt. Clean things up a little bit here. So let's see. I'll go to the Start menu. Of course, on Mac or Linux, you would just open a terminal. No big deal. This is going to be under Windows System, Command Prompt, and we want to run that as an administrator just to be safe. All right, so let's navigate to that sbt folder. And we will just type in sbt assembly. And hopefully it will do the right thing. Okay, it did something. Looks like it built against Scala 2.12, so that sounds right. Let's go ahead and see if we have that resulting JAR file. Let's go into target, scala-2.12. Oh, I forgot to change the name of it. Alright, fine. Well, let's go back to sbt and edit that build.sbt file again.
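For reference, here is a sketch of the build.sbt this walkthrough is converging on, with the name already corrected as the next step describes. The Spark version should be whatever your own RELEASE file reports (3.0.0 is assumed here), and the organization string is likewise an assumption.

```scala
// build.sbt (sketch for a local Spark 3.0.x install; versions must match your own setup)
name := "MinTemperaturesDataset"   // controls the name of the assembled JAR

version := "1.0"

organization := "com.sundogsoftware"   // assumed organization string

// Spark 3.0 is pre-built with Scala 2.12; 2.12.12 is the patch release found in the walkthrough
scalaVersion := "2.12.12"

libraryDependencies ++= Seq(
  // "provided": the locally installed Spark supplies these at run time
  "org.apache.spark" %% "spark-core" % "3.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "3.0.0" % "provided"
)
```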
See, I forgot to change the name. So really this should be, what was this thing again, MinTemperaturesDataset. All right, let's do it again. It should be quicker now that it's downloaded all those dependencies. Alright, let's see what we have now. Target, scala-2.12. There it is, MinTemperaturesDataset-assembly-1.0.jar. So let's copy that and put it alongside that data directory where I want to run it from. So Control-C; obviously you could use the copy command from a command prompt if you wanted to as well. And we'll go into C:\SparkScalaCourse. And inside there, there's the data directory that the script assumes is local relative to it. Let's go ahead and paste that in here. And now we should be able to navigate to there and run it with spark-submit. So let's change our directory to C:\SparkScalaCourse\SparkScalaCourse. And let's kick it off with spark-submit and see what happens. So spark-submit's path was under C:\spark\bin\spark-submit. Make sure it's there, okay. And we should be able to pass in MinTemperaturesDataset-assembly-1.0.jar. I'm just using the Tab key to autocomplete that. And it should work. And we did get a bunch of errors on shutdown. But again, that is normal. That's a Windows thing. We did actually get our output though. So look, it worked, cool. We got our minimum temperature for those two weather stations. And we have a successful packaging of that script with SBT, and we ran it locally using spark-submit, taking care with the version of Spark that we had installed and the version of Scala that that version of Spark depends on. So hopefully you had some success with this as well. If not, I think watching this video probably showed you where things went wrong. So go back and try it again if you need to. And with that, let's carry on.
45. Introducing Amazon Elastic MapReduce: So, the moment we've been waiting for: we're going to actually run our 1 million movie ratings script on a real cluster using Amazon's Elastic MapReduce service and Hadoop. Lots of buzzwords there. Let's talk about it a little bit more before we actually do it. Let's talk about how distributed Spark actually works. So the same scripts you've been using to run these Spark jobs locally on your own PC can be used on a cluster without much modification. It's kind of up to spark-submit and Spark itself to figure out what cluster manager you're running on top of (that might be Spark's built-in cluster manager, it could be Hadoop's YARN, it could be Mesos) and to integrate with that to actually distribute the work of all your mappers and reducers as well as it can across the cluster that you have available to you. So basically, the Spark driver script is running on your master node, your driver, okay? And that communicates with your cluster manager to actually distribute the work that's in that driver script out to different executor nodes, the workers, okay? And the cluster manager is responsible for dealing with failures of individual nodes and collecting the results back together to get them back to your driver script when it's done. Now, a few other spark-submit parameters we should talk about. And first I should note that on a lot of clusters, a lot of these settings are going to be pre-configured for you automatically. So if you don't specify anything in your script explicitly for what the master is going to be, and it's not being specified on the command line, there is also a configuration file within Spark that can be set up to set all of these things for you automatically. And for example, if you set up a cluster on Amazon Elastic MapReduce, a lot of these things will be set up for you in an optimal manner already. But sometimes you run into issues where things don't complete.
You run out of resources, things start timing out, and you need to tweak these things a little bit to get things to run more reliably, so you need to know they exist. The first option we want to talk about is --master. And that's not something you can tweak. It just has to be set to the right thing for what kind of a cluster you have. So if you're running on a Hadoop cluster and you want to take advantage of Hadoop's YARN cluster manager, that will be set to yarn. If you want to use Spark's standalone cluster, you would set that to the host name and port of your master node on your Spark cluster. Mesos works similarly. And again, if you have a SparkConf or anything in your script itself that overrides that, it will ignore what's on the command line. So the hierarchy is: whatever is in your script, then what's on the command line, and then what's in the configuration files for Spark. So never forget to double-check your scripts to make sure that you're not hard-coding a given master. For example, if you have that local[*], that will override the master option here. And if you were to run that script on a cluster, it wouldn't take advantage of the full cluster. As for managing the usage of resources on your cluster, there are options for that as well. --num-executors will specify how many executors you want to use. By default, that's only two. So if you're running on a larger cluster that has more than two nodes, you'll want to increase that setting. Again, usually that will be set for you by somebody else, by the administrator, but it's something to be aware of, and make sure that it is in fact set somewhere. --executor-memory manages how much memory is available to each executor. And of course you want to make sure that that does not exceed the physical memory available to each individual executor node. If you're running on a cluster in the cloud, those are often virtual machines that have less memory than you might think.
So make sure you are aware of the memory available to your script on each executor. And you can also look at --total-executor-cores. If you have multiple cores on your virtual nodes, then you might want to tweak that to actually put an upper limit on how many cores your script can consume. Okay, Amazon Elastic MapReduce, so that's what we're going to use in this course. It's a quick and easy way to spin up a Hadoop cluster, and you can actually tell it to pre-install Spark on it for you as well, with everything automatically configured. So it's a very easy way to get started and run your script on a real cluster where you just rent time and pay for what you need. And that's kind of the whole premise of Amazon Web Services. You just rent time and pay for the computing resources that you actually need for whatever you're doing. So you're charged basically by the instance hour: how much time you're spending on how many computers of a given type. And you're also charged for any network I/O and any storage space and any storage I/O as well. So you pay for what you use. And usually it's not a whole lot. I think I spent about 30 bucks in actually putting this course together in terms of AWS charges. But do be careful. I do recommend just watching me do this for now unless you've got some corporate account or something where it's not your money on the line. Because if you mess up, it's very easy to forget to terminate your cluster when you're done. And if you do that, your cluster will just keep on running forever even though you're not using it. And you're going to be billed for all that time, and you might not even realize it until you see a credit card charge for $1000. I don't want that to happen to you. I don't want to be responsible for that. So, you know, if you do want to fiddle around with EMR, remember to terminate your clusters when you're done.
If you don't, your bank account's not going to like you and you're not going to like me, so just don't go there. Okay. So with that, let's talk about actually running on a cluster. We've talked about a few of these points already. But again, what EMR sets up for you is a Hadoop cluster, and you can run Spark on top of the YARN component of Hadoop. So people kind of conflate Hadoop and Hadoop YARN. Sometimes I hear a lot of people talk about how Spark is faster than Hadoop, but it's not really one or the other. Okay? What they really mean is Spark is faster than MapReduce, which is a way of running distributed jobs on Hadoop. But Hadoop itself is just a technology for managing a cluster. And one component of Hadoop is YARN, the cluster manager, which Spark can run on top of just fine. Okay, so remember the different pieces: there's a Spark driver, there's a cluster manager, and then there's the actual hardware itself. Hadoop is just filling in that little middle piece for you. So Hadoop and Spark are not mutually exclusive, which is a common misconception. Okay? One other thing I want to point out in terms of best practices: because running on a real cluster is expensive, and these are expensive resources that you're potentially dealing with here, you always want to make sure you're doing your development and testing locally on your own PC first, okay? Or some desktop computer or some single computer that you have access to that doesn't cost a lot of money. And a way to do that often is to use a subset of your data just to develop with. So if you're dealing with a big dataset that you can only manage on a cluster, consider using just a piece of that dataset to develop and test with. And that way you're more likely to have a successful run when you're actually renting time on the cluster itself. You really want to minimize the amount of time you're working on the cluster if possible.
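One simple way to carve out a development subset, sketched here in plain Scala with no Spark involved: keep every Nth line of a text file. The object and method names are hypothetical, and the whole file is read into memory, so this only suits files that fit comfortably on one machine.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Hypothetical helper, for illustration only.
object DevSubset {
  /** Copies every `keepEvery`-th line of `source` into `target`; returns the number of lines kept. */
  def writeSubset(source: Path, target: Path, keepEvery: Int): Int = {
    require(keepEvery >= 1, "keepEvery must be at least 1")
    // ISO-8859-1 maps every byte to a character, so line content is copied through
    // unchanged whatever the file's real encoding is.
    val kept = Files.readAllLines(source, StandardCharsets.ISO_8859_1).asScala
      .zipWithIndex
      .collect { case (line, index) if index % keepEvery == 0 => line }
    Files.write(target, kept.asJava, StandardCharsets.ISO_8859_1)
    kept.size
  }
}
```

For example, `DevSubset.writeSubset(Paths.get("ratings.dat"), Paths.get("ratings-dev.dat"), 100)` (with `java.nio.file.Paths` imported, and both file names hypothetical) would keep roughly one percent of the lines to develop against locally.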
Okay, so in terms of getting set up, you need to start off by creating an Amazon Web Services account. And I'll just assume you can figure out how to do that, since again, I just want you to watch me do this right now. The next step will be to create an EC2 key pair so that you have the ability to actually log in to your cluster in a secure manner once you've spun it up. And you'll need a way of logging into that actual virtual machine at some point, using something like PuTTY on Windows. You need some sort of a terminal to be able to connect to these machines and actually run your script and download the things you need to them. So let's get started and actually see how it works. 46. Creating Similar Movies from One Million Ratings on EMR: So let's do this. Let's actually generate movie similarities based on 1 million user movie ratings and do it for real. So just follow along here and I'll show you how it works using Amazon's Elastic MapReduce service. Now to get things set up, I've already loaded some stuff into Amazon's S3 service, which is basically a distributed file store that you rent space on. All right, so it's kind of like HDFS, but it's the Amazon version of it. That's one way to think about it. So I've already created what's called a bucket in S3 called sundog-spark. And I've uploaded a few things I'm going to need. One is the MovieSimilarities1M.jar JAR file. And this is the same thing that I generated using SBT earlier in the course. I just gave it a slightly different filename. So it has my self-contained Spark driver script bundled up into a JAR file. Okay, so that's going to be on S3, so I can quickly copy it onto my cluster once I've spun up my cluster to run it on. The other thing I've done is I've moved the MovieLens 1 million ratings dataset in here as well. So I've created an ml-1m folder here on my sundog-spark S3 bucket.
And that's going to make sure that this distributed file system of S3 that contains my 1 million movie ratings is also accessible to my entire cluster. And inside that we have the different files that make up the dataset. ratings.dat is the actual movie ratings themselves, and movies.dat is all the metadata associated with the actual movies. So our strategy, if you recall, is going to be to run that JAR file from the master node of our cluster. And it's going to have movies.dat located alongside it, so it can actually build up that lookup table of movie IDs to movie names when it's outputting the results. But all the cluster needs is the ratings data. So let's go and set things up and make a cluster. So we have our nice shiny MovieLens 1 million JAR file ready to deploy to our cluster. But first we need a cluster to deploy it to. To create one, I'm going to use the AWS Elastic MapReduce service. Now this does cost money. So if you do not like spending money or you don't have an AWS account, you probably just want to watch here and not actually follow along yourself. Okay? But I already have an AWS account and a little bit of a budget to play with. So I'm going to go ahead and click on EMR here, or just type in Elastic MapReduce under Find Services. And go ahead and spin up a new cluster here. So let's create a new cluster and let's call it Spark Scala Fun. I don't know, call it whatever you want. And we're going to say here that we're using the latest version of EMR, one that's not a beta at least. And this is going to be using Spark. We want to select the Spark application here. And you can see that right now they're offering Spark 2.4.4. They haven't had the guts to offer Spark 3 quite yet. But by the time you're watching this video, maybe that'll be an option as well. But that's why we packaged our JAR file specifically for Spark 2.4.4 and Scala 2.11, because we knew that that's what this cluster is going to have installed on it.
For hardware configuration, we can stick with the defaults here. This is actually going to spin up a Spark cluster of three nodes. Typically it's Spark running on top of Hadoop, and that's going to be all m5.xlarge instances. These are not cheap; they are not free. So even if you have a brand new account and you're thinking about the free tier, this is not free tier hardware. So again, this costs real money to spin up. It cost me about 30 bucks to do this when I was putting all these exercises together. So again, if that makes you squeamish, don't do this, just watch. You also need to specify an EC2 key pair. I already have one created that's called Sundog EC2. And if you need to create a new one, you can just follow that link to learn how to create your own brand new EC2 key pair. So you get a public and a private key you can use to actually connect to this cluster later on. You'll need that to actually sign in to the master node and kick off the script. Default permissions are fine. Let's go ahead and hit Create Cluster. And off it goes. So that's going to go off and provision all the hardware we need. It will take a few minutes to actually get that hardware and get everything set up and bootstrapped on it. So we'll come back when that's done getting set up. Okay, so after about five or ten minutes, we see that we now have our master and core machines in the running state here. And my cluster is just waiting for me to do something. So cool, my cluster is ready. So let's do something with it. To do that, first I need to connect to it somehow. So you can see here that under Master public DNS, it gives me the externally available address of the master node that I'm going to run my script from. And if I click on this SSH link, it'll tell you exactly how to connect to it. So for Windows you can use something called PuTTY for a terminal program. That's what I'm using.
And if you need to install it, there's a handy-dandy download link there for you. And it tells you exactly how to connect using that. If you're on Mac or Linux, they have instructions for you as well. But I'm on Windows. So what we're going to do is copy that address so I can quickly get it in there, and open up PuTTY. Paste that in for the host name. And then I need to specify my PPK file, my private key, to actually log in to that. So remember, I specified while I was setting up the cluster that I was going to use the Sundog EC2 key. And that's where I saved that file there. So now I should be able to just hit Open. And there we are, logged in to our master node. Now, depending on your security settings, you might actually get a timeout at this point. So if you're trying to figure out why you can't connect, and no matter what you try with your firewall settings or whatever you still can't get through, odds are it's blocked on the server side. So if you do run into that, quick tip: you can click on the security group here for the master node. And once you're in there, you can click on the inbound rules here and make sure you have an SSH port open. So in this case, I had to actually manually add an SSH rule opening TCP port 22 to the IP address that I'm connecting from. So if you are having trouble connecting, that's probably what you need to do. But let's go back to where we were, back to the EMR console. And anyway, we are now logged in. So now we can start doing some fun stuff. So first things first, let's see where we are. We should have a little home directory here under the hadoop user. The first thing I'm going to do is copy over our actual driver script in the JAR file that we created using SBT earlier. So let's go ahead and copy that over from S3. Every EMR and AWS EC2 node has a set of utilities built in called aws that you can use to actually copy stuff from S3 and so on. So I can type in aws s3 cp, let's see, s3://sundog-spark. Can't spell sundog-spark.
And what did I call the file again? MovieSimilarities1M.jar. And copy MovieSimilarities1M.jar to here. Alright, and you can see that worked. The other thing I need is the movies.dat file, so I can actually construct the lookup table of movie IDs to movie names. And again, I put that in the ml-1m subfolder. So let's copy that over as well: aws s3 cp s3://sundog-spark/ml-1m/movies.dat to the local directory. So there we have it. Alright, I've expanded the window a little bit here so we can see our results better. And all I need to do now is type in spark-submit and the name of the JAR file, MovieSimilarities1M.jar. And if you remember right, this script requires a command line parameter of the movie ID we're interested in finding similarities for. So I happen to know that Star Wars is 260 in the 1 million ratings dataset. Let's kick it off. All right, so we see a few introductory info messages here. And one of the first things the script does is turn off anything but error messages. So we shouldn't see anything else other than progress messages as it actually breaks this up and spreads it out to the cluster. So we're basically waiting for it to get to that first action command and create the DAG. It just loaded the movie names and it is now spreading it out. So right now, as we speak, we are actually computing similar movies for Star Wars using 1 million ratings on a little cluster of three machines. Pretty cool stuff. We're already in stage 2, and remember, the stages are broken up by areas of the DAG that are split up by data shuffle operations. And those stages in turn are broken up into tasks. So you can see right now, here in stage two, we're chugging through 32 tasks. There'll be another stage after this with 100 tasks, for the 100 partitions that we set up. And in about five minutes or so this will be done.
It doesn't take all that long. That's kind of the power of a cluster: chugging through a million ratings and every possible permutation of every possible pair of movies and then filtering out the results we want is a lot of work, but it will do it fairly quickly. So let's come back in five minutes here and I'll show you the results. All right, we're almost done here. There it is. Awesome. So we have the top similar movies for Star Wars: Episode IV, A New Hope, based on 1 million real movie ratings. And this dataset's a little bit newer, from 2003. So again, we're not going to see any current movies, but the results look pretty darn reasonable. We have Star Wars: Episode V, The Empire Strikes Back, Raiders of the Lost Ark, and Return of the Jedi. It's kind of odd that Raiders of the Lost Ark was rated higher than Return of the Jedi. But to be honest, I think I might have enjoyed Raiders of the Lost Ark more than Return of the Jedi. So that maybe isn't as crazy as it sounds. And the original Indiana Jones movie is a very good recommendation for someone who enjoys Star Wars. And then we start getting into other big movies that were good. The Terminator, The Matrix, there are some pretty good ones in here. Back to the Future, The Princess Bride, it all appeals to sort of that geeky demographic. Very cool. Hey, Monty Python and the Holy Grail, even. These are actually pretty darn reasonable recommendations. Cool, and there you have it. So that's movie recommendations, similar movies to Star Wars, using 1 million movie ratings, run on an actual cluster using Hadoop YARN and Apache Spark. That's what it's all about. Now, last step: do not forget to terminate the cluster when you're done. So I'm going to exit out of this, but that's not enough. I need to go back to the EMR dashboard and, in my cluster, hit Terminate. Yes, I'm sure I'm done with it. Now, if you don't do that, the bill's going to keep on running for you.
That cluster is still running, and even though you're not doing anything with it, you're still going to be billed for the time on that cluster, for all three of those computers, or even more if you set up more. So again, if you're doing this on your own dime and actually following along, please remember to terminate your cluster when you're done. I really don't want to hear about it if you get a bill for $1000 at the end of the month from Amazon, as opposed to $10, which is about what this cost. All right. Make sure that it terminates successfully, and then it's safe to close out of it. But with that, woo hoo, congratulations, that's kind of the culmination of what we've been doing here. We actually ran a real Spark big data job on a real cluster successfully and got some really useful results out of it. So cool stuff. Let's talk in a little bit more depth about troubleshooting Spark jobs and some of the finer points of running and tuning things. 47. Partitioning: So one thing I really glossed over was this line here in the MovieSimilarities1MDataset.scala file that we did not talk about. This is different from our original movie similarities dataset script. It's this line here: repartition, with numPartitions equal to 100. So we applied that to our self-joined DataFrame before converting it to a dataset. So what's that all about? What's repartitioning about? Why do I have to do that? Well, this is a really, really heavyweight operation here. Doing that self-join across a million ratings is going to blow up really fast, and that's more than you can really do on a single machine. So in this case we have to tell Spark, hey, you really want to split this operation up. And that repartition parameter tells it how many ways you want to split that operation up. So let's talk about that in a little bit more depth. So Spark is usually pretty magical. It will automatically do the right thing to distribute your job across an entire cluster.
But sometimes you have to give it some hints, and sometimes you need to think a little bit about how your data will be partitioned across the different executors on your cluster. Now, running that movie similarity script as it was, without that repartition call, might not work at all on its own. That self-join is a really, really expensive operation, and Spark isn't always smart enough to distribute that on its own the right way. That is really the most demanding part of this script. And we need to give Spark some hints to make sure that we don't run out of resources by asking a single executor to do more than it can do. So we do that by explicitly calling repartition on that DataFrame. And if you were using RDDs, there's a partitionBy function that does the same thing on RDDs. If you use that before running a large operation that benefits from partitioning, it will make sure that things are split up in a way that makes sense. Now, some operations that benefit from partitioning include join, which is what we're doing in this example. Also cogroup, groupWith, join, leftOuterJoin; basically any kind of join, or groupBy operation, or reduceBy operation, or combineBy operation. Also, lookup can benefit from it as well. So if you're calling any of these methods on a very large dataset, you might want to think about explicitly partitioning it using the repartition function. And when you do that, the operations will preserve that partitioning in the results as well. So you'll get your results back broken up by those partitions too. So keep that in mind. How do you choose a partition size though? Where did that number 100 come from? Well, if you have too few partitions, that's not going to take full advantage of your cluster, right? So if I have fewer partitions than I have executors, that's going to leave some executors idle on this task. So I definitely want at least as many partitions as I have executors on my cluster. Otherwise I'm just wasting resources, right?
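To make the repartition call described above concrete, here is a minimal Scala sketch. It is not the course's actual script: the object name `RepartitionSketch`, the helper `moviePairs`, the column names, and the tiny inline ratings are all assumptions made for illustration, and it assumes Spark 3 is on the classpath. It shows a self-join of a ratings DataFrame followed by an explicit `repartition`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RepartitionSketch {

  // Pairs up every two movies rated by the same user (a self-join), then
  // explicitly splits the result into `numPartitions` partitions.
  // Column names are hypothetical stand-ins for the ratings schema.
  def moviePairs(ratings: DataFrame, numPartitions: Int): DataFrame =
    ratings.as("r1")
      .join(ratings.as("r2"),
        col("r1.userID") === col("r2.userID") &&
          col("r1.movieID") < col("r2.movieID"))
      .select(
        col("r1.movieID").as("movie1"), col("r2.movieID").as("movie2"),
        col("r1.rating").as("rating1"), col("r2.rating").as("rating2"))
      .repartition(numPartitions)

  def main(args: Array[String]): Unit = {
    // local[*] is only here so the sketch runs on one machine; on a real
    // cluster you would leave the master out of the driver script.
    val spark = SparkSession.builder
      .appName("RepartitionSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up ratings: (userID, movieID, rating).
    val ratings = Seq((1, 50, 5), (1, 172, 4), (2, 50, 4), (2, 172, 5), (2, 133, 1))
      .toDF("userID", "movieID", "rating")

    val pairs = moviePairs(ratings, 100)
    println(s"Movie pairs: ${pairs.count()}")
    println(s"Partitions: ${pairs.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

On data this small, 100 partitions is pure overhead; the number only starts to pay off when the joined data is too large for a handful of executors to hold.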
So to make that real: if I'm taking the self-join operation that we're using in this script, and I'm saying repartition(2), and I have five nodes in my cluster, then three of those nodes are gonna go unused. That's pointless, because I'm only going to be distributing that work amongst two partitions. But if you have too many, that results in too much overhead from shuffling all the data around. So moving data around your cluster is also an expensive operation. You don't want to be doing that too much either. So you kinda have to strike this sweet spot where you're taking full advantage of the executors that you have on your cluster, but not having a ridiculous number of partitions where the overhead of managing it all becomes prohibitive. So in general, you want at least as many partitions as you have cores, or executors that fit within the available memory for your cluster. 100 is usually a reasonable place to start for large operations. That's what we're using here, and it works. It's not a crazy high number that's going to have a ton of overhead. And it's probably a number that's more than the number of executors that you have on a typical cluster. So that's usually a good starting point. But if you did have more than 100 executors, obviously you'd want to use an even higher number there. So that's what partitioning is all about. Again, if you're using a very large join operation, or groupBy or reduceBy, that can benefit from explicit partitioning. Sometimes that can make the difference between your job succeeding or running out of memory. So if you do run into weird issues where your script is failing and running out of resources even though you have a huge cluster available, this might be why. You might need to go back and think about, do I need to explicitly partition these operations using the repartition command?

48. Best Practices for Running on a Cluster: Let's get into a few more nitty-gritty details of running on a cluster.
First of all, make sure you always avoid specifying a configuration for Spark in the driver script itself. That would include specifying your master configuration. Remember, we normally put in master set to local[*] to say that you want to run locally on your machine. Obviously, you would not want to do that on a real cluster. You want to use the entire cluster, not just one CPU core, or even multiple CPU cores on a single system. What you want to do, though, is use the defaults that Elastic MapReduce set up instead, if you're running on Elastic MapReduce. And this is going to hold true on most clusters that you might be running on where Spark is pre-installed. Odds are Spark will already be configured out of the box on your cluster to have the right default configuration. And you also want to be careful about any command line options that you pass into spark-submit from your master node. You can override those that way. The way it works is that anything in your driver script takes priority. After that, anything that you pass in as a command line argument to spark-submit would take priority. And finally, the configuration files on the cluster itself come last in the chain of command there, if you will. So generally speaking, again, your cluster will be set up with the correct configuration out of the box. And this is true for Elastic MapReduce as well. So you generally don't want to be specifying configurations in the driver script itself or on the command line. You're usually better off just letting the configuration do its magic for you. Obviously, if you're a system administrator and it's your job to set up that configuration, then you do have to think about that. But as an application developer, generally you will not. However, there are situations where you need to tweak things. If you do find that your executors are failing in a large job, maybe you need to adjust the memory that each executor has.
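The priority order described above (driver script first, then spark-submit options, then the cluster's own configuration files) can be modeled in a few lines of plain Scala. This is only a sketch of the idea, not how Spark resolves settings internally: the object `ConfigPrecedenceSketch`, the function `effectiveConf`, and the sample values are hypothetical, though `spark.master` and `spark.executor.memory` are real Spark property names.

```scala
object ConfigPrecedenceSketch {

  // With `++`, entries from the right-hand map win, so the order below
  // runs from lowest priority (cluster defaults) to highest (driver script).
  def effectiveConf(
      clusterDefaults: Map[String, String],
      commandLine: Map[String, String],
      driverScript: Map[String, String]): Map[String, String] =
    clusterDefaults ++ commandLine ++ driverScript

  def main(args: Array[String]): Unit = {
    // Hypothetical values, chosen only to show the override order.
    val clusterDefaults = Map("spark.master" -> "yarn", "spark.executor.memory" -> "4g")
    val commandLine     = Map("spark.executor.memory" -> "1g")
    val driverScript    = Map("spark.master" -> "local[*]")

    effectiveConf(clusterDefaults, commandLine, driverScript)
      .foreach { case (key, value) => println(s"$key = $value") }
  }
}
```

In this made-up scenario, a master left hard-coded in the driver script silently wins over the cluster's YARN default, which is why a leftover local setting in a script is worth hunting down before submitting to a real cluster.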
So if you're seeing error messages that suggest that you're running out of memory on your executor nodes, well, you have to do something about that. One way is to just increase the amount of memory that each executor has allocated to it. So for example, you could say spark-submit --executor-memory 1g, and that would explicitly allocate one gigabyte of RAM to each executor. Of course, that assumes that you have enough RAM available on each node in your cluster to pull that off. But, you know, that's sort of a brute force approach, right? Maybe what you really should be doing is thinking more about partitioning and splitting up your data in a more efficient manner. Also, you can specify a cluster manager on the command line. For example, you can say --master yarn if you wanted to explicitly run using the YARN cluster manager on a Hadoop cluster. But again, this is probably going to be set up for you by default. And on Elastic MapReduce it is. It will just automatically infer that the default master is the YARN cluster manager, because Elastic MapReduce is running on Hadoop, which has the YARN cluster manager. So you don't really need to do anything there, but you can specify a master from the command line if you want to have that flexibility. And some more best practices here, kind of a reminder. Again, it's very important to make sure that your scripts and your data are someplace where EMR, or whatever cluster you are running on, can access them. If you're on EMR, you're probably going to be using AWS's S3 service. That's the Simple Storage Service. And if you do that, you can just use the s3n URL prefix to specify the path to your S3 bucket. And you just have to make sure that your file permissions make them accessible to your EMR cluster. That can sometimes be a tricky thing with AWS in general, those permissions. I mean, the easiest thing to do is to just make your bucket publicly accessible.
But if there's sensitive data in there, then you need to think about more explicit permissions where you tie things specifically to where your EMR cluster is running. And I don't want to get into the specifics of AWS security in this course, that's a whole other topic, but it is possible. Once you have everything where it needs to be, you can then spin up your cluster using the AWS console, or if you want to use the API, that's fine too. And remember, as soon as you spin up that cluster, that's when the clock starts on billing. It doesn't matter that you're not using it; you're gonna get billed by the hour there. And for a cluster that contains large machines, and many of them, that can get extremely expensive, extremely quickly. So make sure you know what you're doing there. Once you have that cluster spun up, you gotta rush, because time is money, right? So you gotta get the external DNS name for that master node. Log in to it using the hadoop account and the private key file that you used when setting up the cluster. Once you're in, copy over your driver's JAR file and any files that you need for the driver script itself. You can use the aws s3 cp command to copy stuff that you uploaded into S3 ahead of time, and then run spark-submit, and hopefully it works. And again, remember to terminate your cluster when you're done. If you forget to shut down your cluster after your job is done, you are going to get a very, very, very nasty surprise on your credit card bill at the end of the month. And it is possible, by the way, with EMR to set things up to automatically run a script as soon as it spins up and automatically shut down as soon as it's done. So once you have the bugs worked out of your script and you can run it reliably, you can set things up so that as soon as you spin up that cluster, it will automatically kick off your job and spin that cluster down when you're finished with it.
In some situations that makes sense. In other situations, your company might have enough money to just keep that cluster running all the time and keep it accessible to its developers 24/7. But for most of us, we don't have that kind of luxury available to us. So you do need to think about making sure that you're watching how long you're running your cluster for. And the best way to minimize this cost is to debug and work out the kinks in your script locally first, before you try to run it on an actual cluster. That's what we've been doing throughout this course. So if you have a large dataset that you can't run on one machine, use a subset of that data initially to work out the kinks in your application. And again, just be sure that any data files that you need are going to be accessible to every executor node on every machine that might be in your cluster. Anything that might actually be referenced or read from a distributed executor needs to be accessible beyond just where your driver script is running from. So, so long as you're careful about that stuff, and you're again careful about not specifying configuration that's specific to running on a single computer in your final driver script, you should be able to get things pretty close to running reliably before you start experimenting on the cluster itself.

49. Troubleshooting, and Managing Dependencies: So let's go into a little bit more depth about troubleshooting your jobs once they are running on a cluster. Unfortunately, it is a bit of a dark art. It can be very difficult to track down these sorts of errors when they're running within the complexity of Spark itself, and furthermore within the complexity of being distributed across multiple nodes. Now, your master will be running a console on port 4040 that will let you see some information about what's going on, and that can sometimes lead to some insights about why your job is failing.
But if you see error messages in your logs, it's gonna be kinda hard to track them down. For one thing, you're going to see this deep call stack in your log files if you do have an error, because remember, you're running your Scala code that gets compiled down to Java bytecode. And through all those layers of compilation, it's easy to kinda get lost in where things are running. So typically you'll see things dying deep within the framework of Spark itself. If you look hard enough, though, eventually you should see an error message or some sort of exception that leads you to the actual problem in your script, but you're not always that lucky. However, let's take a look at the master console here to see what information it gives us. And again, this itself can be difficult, because on Elastic MapReduce, it's virtually impossible to actually connect to that from outside of your cluster. So if you want to connect to that console from your desktop that's running outside of your EMR environment, that's a tough thing to set up. It can be done through things like proxy hosts and things like that. But it's tough. If you have your own cluster, though, running on your own network, life's going to be a lot easier in that respect; those security issues should be something you can work through a little bit more easily. So let's take a look at the console running on our own local machine as a way to get started. And we'll sort of explore the information that it gives you. All right, so let's explore that Spark console I talked about. Now unfortunately, it's only going to be available while Spark itself is running. So if I'm not actually running some constant database interface to Spark or something like that, I'm only going to be able to get at it while my Spark script is actually executing.
So for the sake of illustration, let's kick off a really meaty Spark script here that will run for a few minutes, so we'll have time to explore it. In the course materials, there's a movie recommendations dataset local script. And this is set up to use the ml-1m dataset locally on your machine here. So you don't have to run this along with me necessarily, but if you do want to, you'll see that I installed the ml-1m dataset here in the data folder of my course materials. And this script is meant to go use that. So once I fire this off, that will run for a couple of minutes trying to figure out the most similar movies to Star Wars. And while it's doing that, we can explore the console. And I've got a web browser here ready to go for that, set up to 127.0.0.1:4040. 127.0.0.1 is the IP address of our local host. And since I'm running Spark locally on my desktop here, that's going to be the IP address that I will hit for the Spark console. And 4040 is the port that it runs on. So I'm going to kick that off. And as soon as Spark is running, I will reload this page and we should see a console. So let's go back to IntelliJ and we'll kick that off. And I'll give it a little bit of time to spin up. We wanna make sure that it's actually had time to fire up its executors. I think we're there. So let's go back to our dashboard here and reload it. And I'm going to go ahead and open up each of these in a new tab so we can explore those. And if you go to Jobs, you can see that we can drill into any individual job ID here. And you can do things like visualize the directed acyclic graph. That's pretty cool. You can do the same thing from Stages. Again, dig into any stage that you want. This one, let's take it for the sake of example. That one hasn't actually started yet, so let's try the first one. And you can see this gives us a lot more information as well. Again, there's a DAG visualization available here too.
So this can just keep on running and doing its thing while we explore this at this point. So let's go back to the Jobs tab here. You can see that it gives you sort of a graphical representation of the event timeline. So we launched the driver at this point, and it's been trying to execute that final take action, figuring out what it needs to do to fulfill that final action that we have in the driver script to get the top ten movie similarities to Star Wars. So that shows you the progress as it's going through all the tasks to accomplish that. And if you go to the Stages tab here, you can see it shows you your progress along those lines. And remember, stages represent each point within the execution where it needs to shuffle data around. So it's good to get some visualization into how many stages are really involved in fulfilling the job that you want, because more stages is bad. Every time you have a stage, it has to shuffle data throughout the cluster. And that's a very expensive operation. So if you see fewer stages, that's a good thing. And again, you can drill into an individual stage. I think it actually finished, so we can actually look at that. Let's go ahead and kick that script off again. Okay, we've got some results to satisfy our curiosity. Hey, that worked pretty good. So our similarity collaborative filtering algorithm worked pretty nicely. It said the top result for Star Wars Episode 4 is Star Wars Episode 5, just like we had in our slides. So that's cool to see. Raiders of the Lost Ark, Star Wars Episode 6, Indiana Jones. All great recommendations for Star Wars. But let's kick that off again so we can play with the console some more. Alright, so back to the Stages tab. I think that's where we were playing. Again, you can click on an individual stage. I should reload this to find out more about what's going on in it. And it tells you all sorts of stuff, like how long it's taking for shuffling that data around.
Remember, stages are associated with shuffling data. We can visualize the directed acyclic graph that it came up with. So that's pretty cool to see. So if you're trying to optimize the performance of your job, viewing this can give you some insight into how it's actually breaking down what you've asked it to do. And sometimes that can lead to a better way of structuring your driver script to be more efficient. What else can we look at? Storage is probably gonna be empty. I'm not storing anything. Environment gives you some information about the runtime environment. Let's refresh that as well. And you can see that we're running under JDK 14 on my C drive, using that Java version, that Scala version, just some general information. We can also look at system properties. So if you wanna get more information about the environment that your script is running on, you can do that there. Under Executors, this shows you how many executors are running, and this itself is interesting. So only a single executor is actually running right now. So that tells me, for example, that even though I have multiple CPU cores on my machine here, it's not actually using them. So that might be nature's way of telling me that I haven't written things in such a way where it can be parallelized, or there might be something wrong with my configuration that's leading to that issue. If I was running this on a real cluster and I saw only one executor running throughout the entire run of this, that would probably give me a pretty strong hint that something is really wrong with the configuration on my cluster that I should go figure out. So that's sort of an overview of the stuff that you can see using the Spark console. With that, let's talk in a little bit more depth about troubleshooting once you have it. Now, when things do go wrong, your logs in standalone mode will be available in that web UI as well.
But if you're running on top of the YARN cluster manager, like we are doing with EMR, those logs are going to be distributed throughout each individual node in your cluster. Now, you can collect them after the fact using yarn logs -applicationId, followed by the application ID. You can figure out the application ID of your job that's running there. So if you're running in the standalone Spark cluster manager, life can be a little bit easier. They'll be right there in the UI for you. But if you're running on top of Hadoop, things get complicated for actually finding those logs. Again, it just re-emphasizes the importance of debugging these issues locally on your own machine before you try to debug them on a cluster where it's harder to get that information. Now, while your driver script is running, it itself will log errors like executors failing to issue heartbeats. So that's a good sign that your executor is failing and crashing for some reason on some remote node. So sometimes your driver script's output can give you some clues as to what's happening. If you're seeing errors like that, it probably means you're asking too much of each executor. Maybe you need more of them, maybe you just need more machines in your cluster to handle the job that you're doing. Maybe you need to give more memory to each executor. We covered how to do that in the previous lesson. That can just be an extra command line parameter on spark-submit. Or maybe you need to use partitionBy or repartition to demand less work from your individual executors by using smaller partitions. So throwing more machines at it, or more memory, is kind of a brute force solution to problems like this. But sometimes, if you just explicitly use repartition to break up the job into smaller pieces, you can get by with less hardware or less memory. It will just take more time to finish the job. So often that's actually the right thing to do. So don't just, you know, go throwing more hardware at the problem if you don't need to.
If you are doing large join operations in your script, or large groupBy operations or reduce operations, you can probably solve these problems just by judicious use of the repartition command to break things up into smaller chunks. And also remember, again, your executors are not necessarily running on the same box as your driver script. So you can't assume that data in memory is going to be accessible across the entire cluster. You need to make sure that that's the case. If you do need to share data outside of your RDDs or datasets, you're going to have to use broadcast variables to do that. You can't just store that locally within your driver script and expect it to work. If you need some sort of a Java or Scala package that's not preloaded on your cluster, make sure that you're bundling it into the JAR that you're executing, using sbt assembly. Or you can use --jars with spark-submit to add individual libraries that are on the master node, and that will automatically distribute them to the executors. But that can get really complicated really fast, because often the JARs you want to include themselves have JAR dependencies, and you have to specify all those dependencies along with it as well. It's a lot easier to just use sbt to package it all together into the JAR itself. And you know, really just trying to avoid using obscure packages that you don't really need in the first place is probably the best advice. Remember, time is money on your cluster, and if you're not wasting time fiddling with the issues about distributing those JAR packages throughout your executors, well, all the better. So if you can get by just using the core Spark packages and little more, that's probably going to be the best strategy for simplifying things. Remember, simple is good when it comes to software engineering in general. And with that, I think we've talked about running on a cluster in depth. So let's move on to our next section.

50.
Introducing MLLib: This next section is pretty exciting for me because I like machine learning a lot. That's a big part of what I've done in the past. And we're going to talk about how to use machine learning on Spark. And this is super exciting, because the power of machine learning algorithms is generally limited to a single machine. A lot of times people just develop these on a single PC running a notebook somewhere, right? But if you have a massive dataset, what do you do? Well, Spark actually has a solution for many popular machine learning algorithms. So let's dive in and see how Spark's ML library allows you to apply DataFrames across an entire cluster on Spark to solve complicated machine learning problems. Pretty powerful stuff. Let's dive in. Let's talk about Spark's machine learning library, which is amazingly enough called Spark's ML library. Creative name. It's really cool because it allows you to distribute the processing of massive datasets and apply sophisticated machine learning algorithms to them at massive scale. So there are easier ways of applying these algorithms and techniques to data on a single machine, just using Python code and Jupyter notebooks. But when it comes time to actually distribute this stuff across an entire cluster because you have a truly massive dataset, well, those tools that are so commonly used kinda fall apart, and we need to turn to something like Spark ML to be able to handle these sorts of tasks at large scale. It's going to be a little bit hard to talk about what ML does without having some background in machine learning. And if you are curious about these algorithms, I definitely encourage you to go and take a dedicated machine learning class to learn what they're all about. I will give you a very high level overview of them here, as best as I can do in a single slide at least.
So the basic capabilities of Spark's ML package are, first, feature extraction. That includes something called TF-IDF, that's term frequency, inverse document frequency. And those are analytics that are mostly used for search. So it allows you to figure out what terms are most relevant to a document. It's a very easy way of doing that. It also has some basic statistics stuff built in. If you wanted to just compute correlation coefficients or chi-square tests and things like that, it can do that at scale. And then we get into the actual machine learning algorithms themselves, which are more interesting. So it can do linear regression and logistic regression. Linear regression is just basically fitting a line to a set of data. So for example, you could imagine that you have a dataset that maps people's heights to their ages. And by fitting a line to the data that you've observed, you can actually fit new data points to that line to predict their actual value. So if I have a line that defines age as a function of height, I can actually punch in new heights that the model has never seen before and predict their age just by extrapolating it across that line. That's linear regression. Logistic regression is kinda the same idea under the hood, but it's for classification purposes. So if I want to predict if somebody's a Democrat or a Republican, for example, you might use logistic regression to do that. If I want to predict if a car is a sports car or a sedan based on some attributes, that might be another example of logistic regression, where again, we're applying a very simple mathematical function to predict classifications as opposed to actual values, which you'd be doing with linear regression. It can also do support vector machines. Those are another way of doing classification that's a little bit more sophisticated, if you will. It works in these higher-dimensional spaces and can find more complex divisions between classifications that aren't necessarily linear.
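To make the line-fitting idea from the linear regression description concrete, here is a small plain-Scala sketch of ordinary least squares on a single feature. It is not Spark ML's implementation and does not run on a cluster; the object name `LineFitSketch`, its `fit` and `predict` functions, and the height and age numbers are all made up for illustration.

```scala
object LineFitSketch {

  // Ordinary least squares for y = slope * x + intercept.
  // Returns (slope, intercept).
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.size == ys.size && xs.size >= 2, "need at least two paired points")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val covXY = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX  = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(varX != 0.0, "all x values are identical; no line can be fit")
    val slope = covXY / varX
    (slope, meanY - slope * meanX)
  }

  // Plug a new x into the fitted line.
  def predict(model: (Double, Double), x: Double): Double =
    model._1 * x + model._2

  def main(args: Array[String]): Unit = {
    // Invented observations: heights in centimeters, ages in years.
    val heights = Seq(100.0, 120.0, 140.0)
    val ages    = Seq(4.0, 7.0, 10.0)

    val model = fit(heights, ages)
    println(s"slope = ${model._1}, intercept = ${model._2}")
    println(s"predicted age at 130 cm: ${predict(model, 130.0)}")
  }
}
```

Spark ML's regression classes do the same kind of fitting over DataFrames with many features, distributed across executors; the sketch only shows the underlying idea of fitting a line and reading predictions off it.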
It can also do Naive Bayes classification. The poster child of an application for Naive Bayes is spam classification. So a very common example of Naive Bayes is to feed it a corpus of emails and have a training set that's tagged as spam and not spam. And you can train that Naive Bayes classifier to take new emails that it's never seen before and predict whether they're spam or not based on that Naive Bayes model. It can also do decision trees distributed across an entire cluster, which is really cool. I like decision trees; they're just fun to look at. They're what they sound like. It's basically this tree of decisions. So basically you'll start with some decision like, I don't know, is the temperature outside above 72 degrees or below 72 degrees? If it's above 72 degrees, maybe you go down to another decision block. And maybe this next decision in the decision tree is, is it raining or not? And maybe this could ultimately lead you to a decision as to whether you should go outside and play today, if it's a nice day or not, right? That's a very simple example of a decision tree. Obviously, you can get much more complicated, and it can work in arbitrary situations where you have many features that are numerical, that can be compared against some set values. And by making decisions in sequence among those different features of your data, you can arrive at a decision with a decision tree. K-means clustering is just a way to cluster data together based on their attributes, based on how similar things are to each other, by very simple means. This is what we call unsupervised learning, where we're not actually training a model based on known values. We're trying to just cluster things together based on their natural properties and how similar things are to each other. And it can also do principal component analysis, known as PCA for short, and singular value decomposition, SVD for short. These are what we call dimensionality reduction techniques.
So sometimes you'll have this problem where you have data that has a lot of different features to it. You know, different things that we're measuring for different objects. And it causes what we call the curse of dimensionality. This is basically a sparse data problem where you have too many features. These are ways to boil those down to the most salient features. Maybe they're synthetic features that didn't even exist at all in the real world. But it's a way of boiling higher-dimensional feature data down to lower-dimensional feature data that makes it a little bit easier to work with. And finally, it can do recommendations using a technique called Alternating Least Squares. And we're going to dig into that in some more depth very soon. ALS is cool. There was actually a competition a long time ago that was sponsored by Netflix to develop a better recommendation system than their own. And one of the winning entries there, part of its solution was using ALS. So that is also available for you within Spark ML. Now, it is sort of a limited list of what you can do. I mean, there are countless more machine learning algorithms out there, but not all of them really lend themselves well to being distributed across a cluster. So still to this day, there are a lot of machine learning algorithms out there where we're just vertically scaling them. We can only run them on a single machine. And we are kinda back to the old days of big monolithic databases, where the only way to do these things is to have a really big machine to run them on, with lots of GPUs and lots of memory and all that good stuff. But Spark's kind of paving the way to doing horizontal scaling of machine learning. And it's really exciting. Now, there was a previous API that existed in Spark 1 and in Spark 2 called MLLib. And that was a lower-level API that used RDDs and some specialized data structures to perform these algorithms.
Now in Spark 3, that's been deprecated, and parts of it don't work anymore at all. And frankly, the maintainers of Spark don't care, because it's just in maintenance mode at this point. So they really want you to move to the new ML library. The difference is mainly that it uses DataFrames for everything. So instead of passing around RDDs and these specialized data structures, you're going to be using DataFrames and datasets, which is really the API that they want to be pushing everybody toward. The benefit of this is that by using DataFrames in all of the different components of Spark, you can pass those DataFrames around from one component to another seamlessly. So that's kinda their bigger vision. They want to be able to take a DataFrame that comes out of Spark ML and maybe, you know, feed it to Spark SQL and maybe off to GraphX. Who knows, right? That interoperability becomes exciting by having this general-purpose DataFrame. That's sort of the lingua franca of all these Spark components, including machine learning. If you want more depth, there's a book called Advanced Analytics with Spark, published by O'Reilly. It's probably a little bit out of date right now. I don't know if they have an updated version for the new ML library, but it's a pretty good resource. Anything from O'Reilly is usually a good read. The more timely examples are always going to be in the Spark SDK itself. There's an examples directory in the SDK. And if you dive into there, you'll find an example of every one of those algorithms actually being used. And again, for more depth on the actual algorithms and how to use them and how they work, I will refer you to a different course, not this one, on machine learning in general. With that, let's put it into practice. In our next lecture, we're going to actually generate movie recommendations using that ALS module that we talked about in Spark ML. And it's really exciting because it's really, really easy to use.
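As a rough idea of what an ALS recommendation script can look like with the DataFrame-based Spark ML API, here is a minimal sketch. It is not the course's script: the object `ALSSketch`, the helper `recommendFor`, the column names, and the handful of inline ratings are hypothetical, and the hyperparameter values are placeholders rather than tuned choices. It assumes Spark 3 with the spark-mllib module on the classpath, and it builds the ratings DataFrame inline instead of loading a data file.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ALSSketch {

  // Trains an ALS model on (userID, movieID, rating) rows and returns the
  // top `howMany` recommendations for one user.
  def recommendFor(ratings: DataFrame, userId: Int, howMany: Int): DataFrame = {
    val als = new ALS()
      .setMaxIter(5)      // hyperparameter: number of iterations (placeholder value)
      .setRegParam(0.01)  // hyperparameter: regularization (placeholder value)
      .setUserCol("userID")
      .setItemCol("movieID")
      .setRatingCol("rating")

    val model = als.fit(ratings)

    // One-row DataFrame holding just the user we want recommendations for.
    val users = ratings.select("userID").where(col("userID") === userId).distinct()
    model.recommendForUserSubset(users, howMany)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is for trying this on one machine only.
    val spark = SparkSession.builder
      .appName("ALSSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Invented ratings: (userID, movieID, rating).
    val ratings = Seq(
      (0, 50, 5.0f), (0, 172, 5.0f), (0, 133, 1.0f),
      (1, 50, 4.0f), (1, 172, 5.0f), (1, 181, 5.0f),
      (2, 133, 5.0f), (2, 50, 2.0f), (2, 181, 1.0f)
    ).toDF("userID", "movieID", "rating")

    recommendFor(ratings, 0, 3).show(false)

    spark.stop()
  }
}
```

With only a few ratings the recommendations themselves mean little; the point is the shape of the API: configure an `ALS` object, call `fit`, then ask the fitted model for recommendations.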
So recommender systems are really complicated things, and Alternating Least Squares is using this matrix factorization method. It's some very hardcore math under the hood here. But to actually use it on Spark and actually distribute it across an entire cluster, this is all the code you need. Like, isn't that awesome? So let's walk through this a little bit. All we're doing here is loading up a data file that's going to be our MovieLens ratings dataset from u.data. We then map that into a ratings RDD using the lower-level APIs because we're doing kinda simple transformations here. And then we convert that to a DataFrame to make it compatible with Spark ML. We then just create a new ALS object from the ML library. We set a couple of what are called hyperparameters. The dirty little secret of machine learning is that a lot of these algorithms are only as good as the hyperparameters that you set on them. So the maximum number of iterations and the regularization parameter are two examples of hyperparameters for the ALS algorithm. And most machine learning algorithms have several of them. And how well they work depends on those parameters. And the dirty little truth is that it's largely still a matter of trial and error to find the parameter values that work best for a given problem. So we like to think that we understand these models very deeply, but in reality, a lot of it's still guesswork. But once we have those values in there to start with, at least we can create a new ALS object. We tell it where the user ID, movie ID and rating information is in our DataFrame. And then we can just call fit on that ALS object. And that will go off and train a model using all of that MovieLens data. And at that point we can just use that model to make predictions for new users that we've never seen before. So it's just that easy. Let's play with it. Let's go run it. 51. 
[Activity] Using MLLib to Produce Movie Recommendations: So let's have a go at using the Alternating Least Squares recommender algorithm that's built into MLlib, which is part of Spark. And we're going to look at the movie recommendations ALS dataset script for this one. Now before we get into this, what we've done to test this out is create sort of a fake persona of a user whose tastes I can at least understand a little bit. That will give me some qualitative sense as to whether or not the recommendations are good. So to do that, I've actually gone into my ml-100k dataset and edited the u.data file. And what I've done is added in three lines for a fake user 0. User 0 did not exist previously in this dataset. And this is a very simple person here. Basically, I've said that this user likes movie ID 50 and movie ID 172 a lot. He gave them both five stars. And those correspond to Star Wars Episode 4 and Star Wars Episode 5: A New Hope and The Empire Strikes Back. So we've established that this person is a really big Star Wars and science fiction fan. And in contrast, I've also said that for movie 133, which is Gone With the Wind, that was only one star. So I've created a very simple persona here. Someone who loves Star Wars and science fiction presumably, but really isn't a fan of old romantic dramas, right? And so that's someone that I can relate to myself. So that gives me sort of a personal feel or sense as to whether these results are going to be good or not. Now there are ways of objectively measuring how good a set of recommendations is. But often it is sort of a qualitative thing. You can try to measure whether or not your algorithm can predict whether someone would have actually liked a movie that they rated previously. But at the end of the day, recommendations are about trying to recommend things that a person hasn't seen before. And it's hard to measure whether or not that is a good result. 
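To make the fake persona concrete, here is a minimal plain-Scala sketch of what those three added lines could look like and how they parse. It assumes the usual MovieLens `u.data` layout (user ID, movie ID, rating, timestamp, tab-separated). The `Rating` case class, its field names, and the timestamp value are illustrative placeholders, not taken from the course script.

```scala
// Sketch only: assumes the MovieLens u.data layout of
// userID <tab> movieID <tab> rating <tab> timestamp.
// The timestamp below is a made-up placeholder value.
object FakeUserRatings {
  // Hypothetical record type; the timestamp column is deliberately dropped.
  final case class Rating(userID: Int, movieID: Int, rating: Float)

  // Parse one tab-delimited line, keeping only the first three columns.
  def parse(line: String): Rating = {
    val fields = line.split('\t')
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat)
  }

  // The three lines described for fake user 0.
  val fakeUserLines: Seq[String] = Seq(
    "0\t50\t5\t881250949",  // Star Wars: five stars
    "0\t172\t5\t881250949", // The Empire Strikes Back: five stars
    "0\t133\t1\t881250949"  // Gone With the Wind: one star
  )

  def main(args: Array[String]): Unit =
    fakeUserLines.map(parse).foreach(println)
}
```

Dropping the fourth column at parse time mirrors the idea that the timestamp does not matter to the recommender.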
So for the sake of demonstration, we're just gonna keep this evaluation qualitative. I'm going to see if the recommendations that I get out of this make sense for somebody who loves Star Wars and hates Gone With the Wind. So let's walk through the code itself and see what it looks like. Pretty easy to use actually. So we're importing ml.recommendation, everything underneath that, and the usual stuff as well. We set up our object and we have two case classes here. One is for the movie names lookup that we're just going to be using to look up movie names for movie IDs to get human-readable results at the end. And we also have our little dataset for ratings that includes user IDs, movie IDs and ratings. Note that we're omitting the timestamps because they don't actually matter for this algorithm. Let's go down to the main function here. We start by setting our log level and setting up a SparkSession as we always do, and we start off by loading up that movie names dataset. We provide a schema that says that it has integer movie IDs and string movie titles. And while we're at it, we set the movie schema as well for the actual ratings data that we're gonna read in. We import spark.implicits because we're going to be loading up DataFrames initially and not a dataset. We read in the u.item file that contains those movie names using the movie names schema. And we cast that to the movie names dataset using the movie names case class. I'm glossing over that because we've done this a million times already earlier in the course, right? We then actually collect that back into Scala in a names list object. So at this point, we've actually used Spark just to load that data up and parse it, and then copied that back into our driver script itself. Now we can get away with this because there just aren't that many movies in the world, right? We can safely fit that in memory within our driver script. 
So that's not a crazy thing to do, actually. So we're just going to dump that back into a Scala map here to look those up when we're displaying these at the end. However, the movie ratings data we do want to keep distributed and keep on our cluster as a dataset. And that's what we're doing here: loading that up using the schema we provided and the tab delimiter. And we're going to cast that as a rating dataset using the rating case class that we defined above. And again, this is only going to extract those first three columns: the user ID, movie ID, and the rating itself. We're going to leave that timestamp off. After that, take a look at how easy it is to use MLlib. So we set up our model here. We just say new ALS, for Alternating Least Squares, and that's coming out of the MLlib recommender algorithm library there. And we set a few hyperparameters. So we set our maximum iterations to five. We set our regularization parameter to 0.01. And you might wonder where we got these numbers from. Well, I just kinda plucked them out of the air initially based on examples that I saw online. You'll find that in practice, sort of the dirty little secret of machine learning in general is that it's all about hyperparameter optimization. And that's a very fancy way of saying we're just gonna guess at what the right number is until we get it right. So it's a very common practice to just have these huge jobs that try many, many, many different values of these hyperparameters, like max iterations and regularization parameter, and just see which ones yield the best results experimentally. There's often no real guidance on what the best value should be for a given dataset, and it will vary wildly from dataset to dataset. So that problem of figuring out the right parameters here is really at the core of getting good machine learning results in practice. 
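The "try many values and see which works best" idea can be sketched in a few lines of plain Scala. This is only an illustration of the search loop, not the course's code: `gridSearch`, its parameter names, and the `evaluate` callback are hypothetical, and `evaluate` stands in for "train a model with these settings and measure its error on held-out data".

```scala
object GridSearchSketch {
  // Try every (maxIter, regParam) pair and keep the one with the lowest error.
  // Returns (best maxIter, best regParam, error at that pair).
  def gridSearch(
      maxIters: Seq[Int],
      regParams: Seq[Double]
  )(evaluate: (Int, Double) => Double): (Int, Double, Double) = {
    val trials = for {
      maxIter  <- maxIters
      regParam <- regParams
    } yield (maxIter, regParam, evaluate(maxIter, regParam))
    trials.minBy(_._3)
  }

  def main(args: Array[String]): Unit = {
    // Toy stand-in for a real train-and-score step; its minimum is
    // placed at maxIter = 10, regParam = 0.1 purely for demonstration.
    def toyError(maxIter: Int, regParam: Double): Double =
      math.abs(maxIter - 10) + math.abs(regParam - 0.1)

    val (bestIter, bestReg, bestErr) =
      gridSearch(Seq(5, 10, 20), Seq(0.01, 0.1, 1.0))(toyError)
    println(s"best maxIter=$bestIter regParam=$bestReg error=$bestErr")
  }
}
```

In a real job each call to `evaluate` would be a full training run, which is why these searches tend to become the "huge jobs" described above.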
The other thing at the core is just cleaning your data, but this is not a machine learning course, so I'm not gonna get too much into that. We tell it what the column names are for the user ID, the item ID, and the rating. That's all that the algorithm needs. So we just need to point it at what those dataset columns are actually named, so it knows where to retrieve that information from. At that point, all we have to do is say .fit on that model, passing in that ratings dataset. And that's it. That will actually run that algorithm on our data and construct a model that can predict recommendations for new users of movies they might like. And it's kind of a black box at that point, right? So the ALS algorithm is off doing something. We know it's using Alternating Least Squares, and that's about it, to create a model that can take in a new user ID and recommend movies that that user might like based on what they've rated. So let's do that. Let's go ahead and pull out from the command line argument the user ID of someone that we want to get recommendations for. In our case, we'll pass in the user ID from the command line. We then just create a really simple DataFrame that contains nothing but that user ID. And we pass that into the model's recommendForUserSubset method, passing in that fake user DataFrame of a single user that we want results for. And we could actually pass in multiple users and get back recommendations for many people at once if we wanted to. And we're saying we just want the top 10 recommendations for each user that we're passing in with that users DataFrame. At that point, we have our results, and we just have to display them. And oddly, that's actually the hardest part of all this: figuring out how to extract that information from the recommendations DataFrame that we got back. 
So that's what we're doing here in this little loop. We go through each recommendation in the recommendations DataFrame that we got back from recommendForUserSubset, and we extract the user ID and the recommendations themselves. We have to actually tell Scala what this thing is. So we have to explicitly cast that result as a mutable WrappedArray of Row type, which is what we're getting back from the DataFrame. We can then iterate through each individual recommendation result, extract the movie, and extract the rating. And then we can translate that movie ID to a movie name using our getMovieName function and print out the results. Finally, we stop the session when we're done and we see what happens. A quick look at the getMovieName function: it just runs a filter operation on the movie names array that we passed in, tries to find the movie ID that matches the one that we're looking for, and returns the movie title associated with it. Again, this isn't actually using a dataset. This is actually just an array that we got back from the dataset through the collect operation. So let's run it and see what happens. Right-click on that. We need to set up a parameter for it though. So we're going to edit the run configuration for movie recommendations and set the program argument to 0, because I want user ID 0's recommendations. Now I can hit OK, and I should be able to select that now and run it. All right, it's off training that model now based on those 100 thousand ratings, and there's our top 10 results. So yeah, these results are kinda weird. It's saying that for someone that loves Star Wars and doesn't like Gone With the Wind, my top result is some movie from 1994 I've never heard of, but it's called The Endless Summer 2. I'm thinking that's not a science fiction movie. Pretty sure To Be or Not to Be from 1942 is not either, nor is ball2. I mean, there are some things in here that kinda make sense: Lost in Space, Transformers. 
I guess those make sense for a Star Wars fan, but I might have just gotten lucky. These kinda seem like random recommendations to me. So the results here don't really make sense to me. Well, let's explore why that might be. So as we mentioned before, these algorithms tend to be very sensitive to the hyperparameters that you choose. So often it takes a lot of work to find the optimal set of parameters for the dataset that you're using to get good recommendations out of it. So one technique is to use something called train/test, where we set aside a portion of our ratings data and use that for testing purposes. So we hold out some users from our dataset, train the model on the remaining users, and see if our resulting model can successfully predict what movies the people in the test set actually enjoyed: what they liked, what they rated well. And you can use that to get sort of a quantitative measure as to how good the recommendations are. And once you have that, you can try different combinations of hyperparameters and see which combination yields the best results and try to tune things that way. But this gets dodgy with recommendations. I mean, is a good recommendation really measured by your ability to predict things that someone already watched? Or is it a good recommendation if it's something they haven't seen yet? And using historical data, there's no way to know that, right? So recommender systems are kind of this weird case where it's hard to measure the results, really, apart from doing live experiments with real people. But I spent some time doing that. So I did try different parameter values. I could not get better results out of this thing no matter what I did. So I'm not even convinced that this thing is working properly internally. At least for this dataset, this algorithm just plain does not work well. 
And the lesson to be learned here is that putting your faith in a black box like this can be a dodgy proposition. You need to understand what's going on in the algorithm and understand whether or not it fits the data that you're giving it. It may be that our dataset is just too small. It may be that there's something about it that's not particularly well suited to this algorithm. I tried some obvious things like normalizing the ratings to a range of 0 to 1. That didn't help either. Sometimes that's what you need to do, but we actually got much better results just using our movie similarity example earlier in the course, right? If we just use item-based collaborative filtering using cosine similarity metrics like we did before, we would have gotten much better recommendations with a much simpler approach. And that worked. So ALS is a more complicated algorithm, but complicated isn't always better. This is a good example of that. And so, you know, always measure the results, and make sure you're getting the results you expect. So don't just blindly trust the results when you're trying to do data analysis on big datasets, because small problems in the algorithms can become big ones as you throw more data at them. And very often, the real issue is the quality of your input data. Now in our case, that's not the issue. The data that we're putting in has already been cleaned and scrubbed. So it's high-quality data. But often the saying "garbage in, garbage out" applies here, right? So if you're passing in unfiltered data that contains a bunch of spurious information from robots and things like that, or people that aren't real people, or people trying to game the system, that's going to screw up your results too. But this is getting more into the practice of machine learning than anything else. The lesson here, though, is that even though Spark contains some exciting algorithms, they're not always going to be the right fit for your data. 
So don't just blindly put your trust into what Spark offers in MLlib. However, MLlib is still really useful in general. We kinda started off with a bad example there, because ALS is kind of a dodgy thing, for the MovieLens dataset anyway. But the rest of the algorithms are more straightforward and they will do what you expect. So let's shift to a higher note and show Spark's MLlib something that it's better suited for. 52. Linear Regression with MLLib: So let's shift gears to a more reliable algorithm in Spark's ML library, the linear regression algorithm. It's pretty simple stuff. What is linear regression? Well, if you're new to the world of machine learning, that might be a new term for you. In a nutshell, linear regression is just fitting a line to a dataset. And once you have that line, that's sort of a best-fit line for a set of observations. You can use that line to predict new values for things you've never seen before. For example, in this little illustration here we have measurements of people's weights and their heights plotted against each other. So you can imagine trying to predict someone's height based on their weight once you actually have this line, that red line that's fitted to those observational data. Basically you take all the observed data points of the thing you're trying to predict, in this case height, based on some attribute, which in this case would be weight. And if you plot all those together and find the line that sort of fits that data the best, you can then use that line's equation to predict the heights of new people just based on their weight. So this ends up with some sort of a slope-intercept formula here for the line, right? So basically we're saying we have a slope of 0.60 plus a y-intercept of 1.23. And we can use those attributes to make predictions going forward. Why did they call it regression? It's kinda confusing to be honest. 
Regression kind of implies that you're going backwards in time, but that's not really what it's all about. There is some history behind that term, but try not to get too hung up on that. Basically, linear regression is just fitting a line to historical data. Maybe you can think of that as the regression part, and then using that line to predict new data points for new things you haven't seen before. How does it work? Well, usually it uses something called least squares. It's a mathematical technique for minimizing the squared error between each point and the line. So when we fit a line to those observed data points, for every point, there's going to be some error between where that point really is and where the line says it should be. And our objective in creating that line is to minimize the squared error between every point and that line. It's squared because that way positive or negative error doesn't make a difference. It's all the same thing. So it's kind of interesting mathematically when you're thinking of how this works in two dimensions, actually. So if you remember the slope-intercept equation of a line, that's y equals mx plus b, where m is the slope and b is the y-intercept. That's going way back to high school there for a lot of you. It turns out that the slope can be expressed as the correlation between the two variables times the standard deviation in y divided by the standard deviation in x. And that's a bunch of statistical mumbo jumbo to many of you. But I just think it's kind of interesting how some of these statistical concepts, like standard deviation, have a real mathematical meaning that can be used for something as simple as fitting a line to data. So it's not that standard deviation is some arbitrary cutoff that someone came up with. It actually has a real mathematical meaning, which for me is interesting at least. And you can also compute the intercept as just the mean of y minus the slope times the mean of x. So that's another way to compute that. 
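Those two formulas (slope = correlation × standard deviation of y ÷ standard deviation of x, and intercept = mean of y − slope × mean of x) can be checked directly in plain Scala. This is a minimal sketch for the two-dimensional case only; the object and function names are made up for illustration and none of this is Spark code.

```scala
object SimpleLeastSquares {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population standard deviation (divides by n).
  def stdDev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }

  // Pearson correlation coefficient between xs and ys.
  def correlation(xs: Seq[Double], ys: Seq[Double]): Double = {
    val mx = mean(xs)
    val my = mean(ys)
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / xs.size
    cov / (stdDev(xs) * stdDev(ys))
  }

  // Returns (slope, intercept) of the least-squares line y = m * x + b.
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    val slope = correlation(xs, ys) * stdDev(ys) / stdDev(xs)
    val intercept = mean(ys) - slope * mean(xs)
    (slope, intercept)
  }

  def main(args: Array[String]): Unit = {
    // Points that lie exactly on y = 2x + 1, so the fit should recover
    // a slope near 2 and an intercept near 1 (up to floating-point rounding).
    val (m, b) = fit(Seq(1.0, 2.0, 3.0, 4.0), Seq(3.0, 5.0, 7.0, 9.0))
    println(s"slope=$m intercept=$b")
  }
}
```

Using the population form of the standard deviation in both places keeps the ratio consistent; the sample form (dividing by n − 1) would give the same slope as long as it is used everywhere.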
So given those mathematical insights, it's pretty straightforward to actually fit a line to a set of data in two dimensions, if you're just dealing with x and y. Things get a little bit more complicated when you add more dimensions into the mix though. So it's kind of a fabricated example to just be predicting some attribute based on some other attribute, like height based on weight. In the real world, you're usually dealing with multiple attributes. So you might be trying to predict someone's height based on their weight, where they live, how old they are, whatever it is, right? So there's usually more than one thing. And that gets into all this multidimensional weirdness. And one way of dealing with that is stochastic gradient descent. It's another algorithm that you see a lot in the world of machine learning and also in deep learning. And basically, I don't want to get into the details of how it works, but it kind of looks for these contours in higher dimensions, if you will. So it iteratively finds the sort of best-fit lines across this multidimensional space for you. So same concept, just a little bit more complicated. But using spark.ml, it's not complicated at all. You don't have to worry about how it works because it does it all for you. All you have to do is say val and set some value to a new linear regression model, and set a bunch of hyperparameters like the regularization parameter, the elastic net parameter, the maximum number of iterations, and the convergence tolerance. And as always, machine learning's dirty little secret is that there's not always a good way to guess at what these values should be for a given dataset. Often you arrive at the right values just through trial and error. But once you have a linear regression model set up and configured, you just train the model and make predictions using tuples that consist of a label and a vector of the features that you're using to predict that label. 
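To make the "label plus feature vector" idea and the iterative fitting a bit more concrete, here is a small plain-Scala sketch. Note the swap: it uses full-batch gradient descent rather than the stochastic variant mentioned above, because the batch form is shorter and deterministic. The `LabeledPoint` case class here is a local stand-in defined for this example, and `stepSize` and `maxIter` are illustrative hyperparameter names.

```scala
object GradientDescentSketch {
  // One training example: the value to predict plus its feature vector.
  final case class LabeledPoint(label: Double, features: Vector[Double])

  // Prediction = intercept + dot(weights, features).
  def predict(weights: Vector[Double], intercept: Double, features: Vector[Double]): Double =
    intercept + weights.zip(features).map { case (w, x) => w * x }.sum

  // Batch gradient descent on mean squared error.
  // Assumes `data` is non-empty and all feature vectors have the same length.
  def fit(data: Seq[LabeledPoint], stepSize: Double, maxIter: Int): (Vector[Double], Double) = {
    val n = data.size.toDouble
    val dims = data.head.features.size
    var weights = Vector.fill(dims)(0.0)
    var intercept = 0.0
    for (_ <- 1 to maxIter) {
      // Signed error of the current model on every point.
      val errors = data.map(p => predict(weights, intercept, p.features) - p.label)
      // Gradient of the mean squared error for each weight and the intercept.
      val gradW = Vector.tabulate(dims) { j =>
        data.zip(errors).map { case (p, e) => e * p.features(j) }.sum * 2.0 / n
      }
      val gradB = errors.sum * 2.0 / n
      // Step downhill.
      weights = weights.zip(gradW).map { case (w, g) => w - stepSize * g }
      intercept -= stepSize * gradB
    }
    (weights, intercept)
  }

  def main(args: Array[String]): Unit = {
    // Four points generated from y = 1 + 2*x1 - 3*x2 (two features).
    val data = Seq(
      LabeledPoint(1.0, Vector(0.0, 0.0)),
      LabeledPoint(3.0, Vector(1.0, 0.0)),
      LabeledPoint(-2.0, Vector(0.0, 1.0)),
      LabeledPoint(0.0, Vector(1.0, 1.0))
    )
    val (w, b) = fit(data, stepSize = 0.1, maxIter = 2000)
    println(s"weights=$w intercept=$b")
  }
}
```

The step size and iteration count play the same role as the hyperparameters discussed above: too large a step diverges, too few iterations stops short of the best fit.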
So in this case, the label would be the value you're trying to predict. We usually frame that as our y-axis. Features are on your x-axis or the other axes. Those are the things you are using to try to make that prediction from. So basically you train the model with a bunch of known points. And then you try to predict new y label values for given Xs, or feature values, using that line that the model created. And you don't have to worry about what that line is. You can get it if you want to, but you know, it's like any other model: you just create the model, and once it's trained, you use that to make predictions. It's all very straightforward. And it can, as we said, work with more than two dimensions if you have multiple features, and that's where the power of it really comes in. Now there are some gotchas using SGD linear regression, which is what this is using. One is that it doesn't handle feature scaling well, so it assumes that your data is similar to a normal distribution in how it's laid out. So if it's not, you usually should scale that data down to fit within a range that's more comparable to your standard bell curve here. And you want to make sure that all your features are similarly scaled as well. So if you have one feature that's huge and another feature that's a small value, you need to scale everything down to a similar scale in a similar range for the algorithm to work well. So that means that you're going to need to scale your data down and then back up again when you're done. So you'll train your model based on this scaled feature data. And then you have to remember to scale it back up, using whatever the inverse of that relationship is, when you're actually displaying the final values that are predicted by this model. Also, the algorithm assumes that your y-intercept is 0 unless you call fit intercept true on it. 
If you're dealing with a situation where you don't think your best-fit line will pass through the origin, 0, 0, you need to call fit intercept true to get good results. So let's try it out. What we're gonna do is fabricate some fake data for average page speeds and revenue generated from session data on an online store. This is something we really had to do at Amazon. There was a hypothesis many, many, many, many, many years ago that page load time had a very direct relationship to how much money the customer spent. And we now know that's true, and that's sort of common wisdom throughout the industry now. But back then we had to figure that out. So we're going to figure that out again, with fake data. So we're going to know that there is a relationship in there because of how we fabricated it. But given that, we're going to try to build a linear regression model using Spark ML and see if we can predict revenue based on page speed using this linear model. So let's dive in and give it a go. 53. [Activity] Running a Linear Regression with Spark: So let's walk through our example of using linear regression in Spark and see if it works. Open up the linear regression dataframe dataset script. And before we dive into the code, let's look at the data that we're dealing with first. That's always a good first step, right? So if we go into our course materials and go to the data folder, there's a regression.txt file here. Let's just familiarize ourselves with what's in it. Now as you recall, what we're trying to do is predict how much people spend based on page speed. So the thing we're trying to predict is amount spent, and the feature that we're trying to predict that from is page speed. So in this first column, these numbers represent the amount spent, but it's normalized and scaled down. So this is normalized to fit into a bell curve distribution. That's why you're seeing things like negative values here. 
Even though you wouldn't actually see a negative amount spent unless somebody got a refund or something, right? So like we said in the slides, you need to make sure that you're scaling your data down into that consistent range, and then remembering to scale that back up again when you're done. That second column is going to correspond to our feature data. So in that case that's going to be the page speeds. Again, we've normalized this and scaled it down to what would fit into a bell curve distribution. So these numbers by themselves are not raw values. They have been scaled down, and that's the sort of thing we call feature engineering in the world of machine learning. It's a very important step in getting good results. A lot of machine learning models do require that your features and your labels are all scaled down into similar ranges, and sometimes even more specifically, centered around 0 or between 0 and 1. You kinda have to read the documentation for the algorithm you're using to make sure that your data is in the right format. Otherwise, it won't work. You'll get really weird results and you'll be wondering why. So don't forget that. All right, so let's dive into the code. So our regression schema is going to correspond to our two columns of double-precision information. There we have the label, which again corresponds to our amount spent, and features_raw, which corresponds to our page speeds. And again, these are scaled down to a normal distribution. All right, we set up our logger settings and we spin up our SparkSession, nothing new there. And we define a regression schema that matches that case class that we just talked about. So we can import that in from disk using a DataFrame interface. So we just say spark.read with a comma separator using that schema. And we load it up from the data/regression.txt file that we just looked at and then cast it to a dataset using the regression schema case class. Now things get interesting. 
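The "scale down, then scale back up" round trip can be sketched in plain Scala. The transcript does not say exactly which scaling was applied to regression.txt, so z-score standardization (subtract the mean, divide by the standard deviation) is an assumption here, and the `Scaler` type and sample amounts are made up for illustration.

```scala
object ScalingSketch {
  // Remembers the statistics needed to undo the scaling later.
  final case class Scaler(mean: Double, stdDev: Double) {
    // Raw value -> standardized value (centered on 0, unit spread).
    def scale(x: Double): Double = (x - mean) / stdDev
    // Standardized value -> raw value (the inverse of scale).
    def unscale(z: Double): Double = z * stdDev + mean
  }

  // Learn the mean and population standard deviation from the raw data.
  // Assumes the values are not all identical (stdDev must be non-zero).
  def fitScaler(xs: Seq[Double]): Scaler = {
    val m = xs.sum / xs.size
    val sd = math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
    Scaler(m, sd)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical raw amounts spent.
    val amounts = Seq(10.0, 20.0, 30.0)
    val scaler = fitScaler(amounts)
    val scaled = amounts.map(scaler.scale)
    // Below-average amounts come out negative after scaling.
    println(s"scaled=$scaled")
    println(s"restored=${scaled.map(scaler.unscale)}")
  }
}
```

This is why negative "amounts spent" show up in the file: a negative scaled value just means below average, and `unscale` is the step that turns a model's scaled prediction back into a real amount.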
So the format that's expected by this algorithm in the ML library is very specific, and oftentimes you need to refer to the sample code that comes with Spark to figure out exactly what it wants. That's how I went about getting this code, to be honest. So what it expects is that you will create a VectorAssembler object, which is also part of the machine learning libraries. And you want to set its input columns as an array of your feature columns. Now in our case, we only have one feature column, and it's named features_raw. So we just pass in an array with that one column name, features_raw. If you had more than one feature, if you had a multidimensional problem, you would pass in the additional features here as well. And then that VectorAssembler will output into a new column, something called features. And that's actually what we're going to be passing into our model. So once we have this assembler, this VectorAssembler object, we're going to call transform on it, feeding in that dataset of our raw data. And it will transform that into that features output column. We'll select from it the label column and the features column that are produced. So now we have a DataFrame called df that just contains labels and features. And labels will correspond again to our amount spent and features to the page speed. Again, all scaled down to a uniform scale. All right, so now that we have that, what we're going to do to measure the performance of this algorithm is to split that data into two sets. We're going to set aside half of that data for training our model. And we'll set aside the other half for testing the model. So the idea here is that we will train this model using just half of the data that we have. And then using that trained model, we'll see how well it can predict the known correct values on the other half of the data. 
And we will measure that error to get a sense of how good our model really is. That can inform us as to whether we need to try different hyperparameters or not, for example, or clean our data better, right? So we extract the first split of that from randomSplit and call that the training DataFrame. And the second split there is going to be the test DataFrame. So randomSplit just splits up a DataFrame using an array of percentages that you give it, in this case 50/50. We take that first resulting DataFrame and call it the training DF, and the second resulting DataFrame and call it the test DF. And that is just based on randomly assigning every row of df into one or the other of these DataFrames. So now we have everything we need to actually create our linear regression model. And we'll go ahead and create it. And the linear regression, we'll call it lir. And we set all the hyperparameters. Again, I think I just plucked these out of the documentation and the samples, as places where you might want to start from. Again, in the real world, you'd want to iterate on these and find what combination of parameters yields the best results. Once we have our object, we call fit on it, passing in our training DataFrame, that half that we set aside for training the model. And then we take that trained model and make predictions with it. So we take our model and we call transform on it, passing in our test DataFrame. And that will create a new full predictions DataFrame that will contain the predicted amount spent based on the page speeds seen in the test DataFrame. Now what's interesting here is that we can then compare our predicted amount spent to the actual amount spent that's living in our test DataFrame, in the labels that we have there. Note that I'm caching it here. That's not strictly necessary for this simple example, but if you were going to do things using that full predictions dataset repeatedly, you'd want to cache that to speed that up, right? 
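The split-then-measure idea can be sketched in plain Scala without Spark. Two assumptions to flag: the `randomSplit` function here is a local stand-in that only imitates the spirit of Spark's method (each row is assigned independently at random, so the halves are only roughly equal), and root mean squared error is just one common choice of metric; the transcript does not name a specific one.

```scala
import scala.util.Random

object HoldoutSketch {
  // Randomly assign each row to the training set (with probability
  // trainFraction) or to the test set. Sizes are only approximately
  // proportional, because every row is decided independently.
  def randomSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    val tagged = rows.map(row => (row, rng.nextDouble() < trainFraction))
    (tagged.filter(_._2).map(_._1), tagged.filterNot(_._2).map(_._1))
  }

  // Root mean squared error over (prediction, label) pairs.
  // Assumes the sequence is non-empty.
  def rmse(predictionsAndLabels: Seq[(Double, Double)]): Double = {
    val squaredErrors = predictionsAndLabels.map { case (p, l) => (p - l) * (p - l) }
    math.sqrt(squaredErrors.sum / predictionsAndLabels.size)
  }

  def main(args: Array[String]): Unit = {
    val rows = (1 to 10).map(_.toDouble)
    val (train, test) = randomSplit(rows, trainFraction = 0.5, seed = 42L)
    println(s"train size=${train.size} test size=${test.size}")

    // Hypothetical (prediction, label) pairs from a test set.
    println(s"rmse=${rmse(Seq((1.0, 1.0), (3.0, 1.0)))}")
  }
}
```

A lower RMSE on the held-out half suggests the model generalizes better, which is the number you would compare across different hyperparameter settings.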
So once you have a trained model and a set of predictions based on it, generally speaking, you'll probably want to cache those results so you can use them repeatedly, right? In this case, we're just going to pluck out the predictions and labels by selecting them from the full predictions DataFrame. And we will call that prediction and label. And then we just collect that back to the driver script and print them all out. So we can just take a look at what the predicted and actual values were for each point. You can probably guess what a good exercise for the reader here would be, and that would be to actually do the work of comparing those predicted values to the actual values and actually measuring that error from the test DataFrame. But for now we're just going to look at the results, because this isn't really a course on machine learning. It's a course about using Spark. So let's right-click on linear regression dataframe dataset and run it. And it worked. Again, we didn't do the work to know if these are reasonable predictions or not. But these are predictions. So you can see the output here. Again, these are scaled down by whatever scaling factor we used when we preprocessed the data. So in order to make good use of this, we would have to scale it back up again. But it looks like it worked. These look like real values and within a reasonable range of values. So there you have it: linear regression on a dataset using Apache Spark. So again, the power of that is that you can take a truly massive dataset and perform linear regression on it and create a model using the full power of the whole cluster. So the sky's the limit here, pretty much literally. All right, so that's machine learning in Spark. In a nutshell, you'll find that most of the models work in a similar manner. You just create a model object with a bunch of hyperparameters. You fit that model to some training data. 
And then you call transform on that model with the data that you want to make predictions for. And it's pretty easy to use. So that's it. Let's move on. 54. [Exercise] Predict Real Estate Values with Decision Trees in Spark: So for your next challenge, we're going to use Spark with some real-world data. We're going to try to predict real estate values using decision trees, which is another machine learning technique that, like linear regression, can be used for regression purposes, and Apache Spark to be able to scale it up if we had to. So we're basing this on a set of data that has the price per unit area (they measure area in pings where this is coming from) based on a bunch of attributes of houses. And this comes from Taiwan, from New Taipei City. It's real data, and here's the necessary credit for who came up with that dataset. If you want to explore more about the dataset itself and read more about its format, follow that second link there, which goes to the UCI data repository. That's a very useful repository of machine learning datasets that you can use for experimentation, messing around, and learning. So that's a good resource there. Kaggle is another one too, if you're looking for more sources of datasets with real-world data. So this is what the data looks like. I've actually cleaned it up a little bit for you, just to give it more reasonable column names and to convert it to CSV format, to make your life a little bit simpler for this exercise. It consists of several columns of data in CSV format. The first column, just called No (number), is the number associated with that house. Then for each house we have the transaction date when it was sold, the age of the house, the distance to MRT (that's actually the distance to the local public transportation system), the number of nearby convenience stores, which I guess is a big deal in Taiwan it seems, and the latitude and longitude of the house itself.
And finally, the thing we're trying to predict, the price of unit area. Of course that's in a local currency, so don't read too much into what that actually means. Your challenge is to predict the price per unit area based on the house age, the distance to public transportation, and the number of nearby convenience stores. Just to make this a little bit easier, I'm going to throw away some of the data that would be more difficult to analyze, like the actual location and things like that. Obviously, the number of the transaction doesn't really matter, nor does the date for our purposes. So we're going to keep it simple and just deal with numerical data here. Your strategy will be to use a DecisionTreeRegressor instead of a LinearRegression object from spark.ml. They work very much the same way. If you want, you can go look at the Spark documentation online, but it's pretty much the same syntax, just a different name. Whereas linear regression works by computing slopes and intercepts for lines, a decision tree works with a different strategy. Basically, it's constructing a tree of decisions, where at each point in the tree it asks: is this attribute less than or greater than some value? It works its way down through these binary decisions to arrive at a final prediction of what your label will be. So it's just a different way of getting the same sort of result at the end of the day. The reason that we're going with decision trees is twofold. One, just to expose you to yet another ML algorithm that's available in Spark. And also because decision trees are less sensitive to having data that's in different scales. We don't have to worry about scaling all of our data down to be normalized and in the same range to get the best results with a decision tree. It can handle that a lot better.
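To make that "tree of binary decisions" idea concrete, here's a small plain-Scala sketch, with no Spark involved, of how a trained regression tree turns a row of features into a prediction. The tree itself is hand-built: the thresholds and leaf values are invented for illustration and don't come from the real dataset, and this is not how Spark represents its trees internally.

```scala
object DecisionTreeSketch {
  // A regression tree is either a leaf holding a prediction,
  // or a split that compares one feature against a threshold.
  sealed trait Node
  final case class Leaf(prediction: Double) extends Node
  final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  // Walk down the tree: go left when the feature is <= threshold, else right.
  @annotation.tailrec
  def predict(node: Node, features: Vector[Double]): Double = node match {
    case Leaf(p) => p
    case Split(f, t, l, r) =>
      if (features(f) <= t) predict(l, features) else predict(r, features)
  }

  // Hypothetical tree. Feature indexes:
  // 0 = house age, 1 = distance to MRT, 2 = number of convenience stores.
  val tree: Node =
    Split(1, 1000.0,
      Split(2, 5.0, Leaf(40.0), Leaf(55.0)),
      Split(0, 15.0, Leaf(30.0), Leaf(20.0)))
}
```

Notice that each split only compares a single feature against a threshold, which is why the features don't need to be on the same scale as each other.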
So to get started, start off with a copy of the LinearRegressionDataFrameDataset Scala class that we looked at earlier. That'll be a very useful starting point; by just modifying that a little bit, you should be able to pull this off. Also note that we have a header row in that CSV file, so there's no need for you to hard-code a schema for reading it in. We can just have the Spark reader do that automatically if we tell it that the header exists. Some useful snippets for getting through this: first of all, in our previous example, we only had a single feature column, and you don't have to have just one. You can actually have multiple input columns in a VectorAssembler. So remember you can also say setInputCols with an Array of columns, not just a single column, if you have multiple features like we do here. And remember when reading the file in, instead of specifying an actual schema, you can just say header true and inferSchema true, and it will read it all in. As long as those header column names match what you have in your case class, everything will work out just fine. The other thing about DecisionTreeRegressor: you're not going to need all those extra hyperparameters that we specified on the linear regression. Decision trees have a different set of hyperparameters, but it's okay to just not specify them and go with the default values. That's fine. So you can strip off all that extra code from the linear regression example. Also note there is a setLabelCol function, and that will allow you to specify a label column that is not named label. In our case it is not named label; it's named price of unit area or something like that. So that should be about all you need. So go off and give it a go, and in the next lecture I'll walk you through my solution and see how close you got. 55.
Exercise Solution: Predicting Real Estate with Decision Trees in Spark: All right, let's walk through my solution for predicting the price of a house per unit area based on some rather unusual attributes of those houses in Taiwan. As always with machine learning, you should start by studying the nature of the data that you're given. For our purposes, since this data has already been cleaned, we mainly just want this around as a handy reference to those column names, because matching those exactly is going to be very important as we write our little script to load that up and convert it to the format that our algorithm expects. And you know, it's always a good idea to just double-check the data and make sure there's nothing unexpected in it. We are assuming that these are all numbers, so do a quick scan just to make sure that the numbers in each column all appear to be in more or less the same range, and that there aren't any weird missing values or values in the wrong format, like random string values here and there. That will save you some trouble going forward. This dataset, however, has already been cleaned for you, so the hard part of machine learning is already taken care of in this case. With that, let's go to the code. So I did just start by copying the LinearRegressionDataFrameDataset.scala file, and I called it RealEstate.scala instead. Main changes here: first of all, we are using a different algorithm for regression. We're importing org.apache.spark.ml.regression.DecisionTreeRegressor this time, just like I gave you as a hint in the slides. If you wanted to use a linear regression still, you could; it would actually still work. But again, we just want to expose you to a new one, and decision trees are less sensitive to having different ranges of values in the features and the labels. So we declare a new RegressionSchema here. And as you can see, it's a lot more complicated.
All of our features have different names, and there's more than one of them now. So we actually have each of the columns in that input data that's coming in, and we've mapped them thusly. We have the column named No, which stands for number, which we're treating as an Int; the transaction date, which we're going to call a Double; house age, a Double; et cetera, et cetera. These names all just correspond to the column names in the CSV header itself. All right, moving on. Nothing different there. When we actually load in the file, we're going to say option header true and option inferSchema true, instead of giving it a hard-coded schema. That will allow us to take advantage of that header row in the CSV file. So that makes life a little bit simpler there. We don't have to code up that schema twice, which is kind of annoying when that happens. Next, we construct our VectorAssembler as we did before. The difference here is that instead of a single feature (I think we called it raw features, or features raw, originally), we actually have three different features in this example: house age, distance to MRT, and number of convenience stores. Again, we're matching that up to the case class names up here. So the problem statement was to predict the price of unit area based on house age, distance to MRT, and number of convenience stores. We're throwing away the additional columns; we don't need them. We're just setting the input columns to the ones that we care about, that we actually want to use as features to make a prediction from. We will call the output column features when we're done; that's fine. And we will furthermore select the price of unit area column, because that's going to be our labels. So at this point, what we're going to end up with is a column called price of unit area, which will be the thing that we're trying to predict, the labels, and a features column, which just contains a list of features.
So that column will contain, in list format, the house age, the distance to MRT, and the number of convenience stores. That's just the format that our model expects, so we have to conform to that. As before, we split this up into a training set and a testing set. And now we're going to construct our model, which is a DecisionTreeRegressor. I suppose I should have changed the name of the variable to something other than lir, but whatever; whatever you want to do is fine. We did explicitly set the features column and label column names here. features is the default, so technically I didn't have to do that, but the default for the label is label, not price of unit area. So I did have to explicitly tell my model that the name of my label column is price of unit area for that to work. After that it all works the same way. So we just say fit, passing in our training DataFrame, and we make a set of predictions by calling transform with our test data. And as before, we just go through and select the columns out. We did change the column name of the labels here again, and we collected that back, iterated through it, and printed them all out. What this will do, as the comment says, is print out the predicted value and the actual value for each house in our test dataset. So let's see if it works. Right-click on RealEstate and run. And we have output. Cool. So just thumbing through it offhand, a lot of these look pretty reasonable. We predicted that one would be 52 whatevers per whatever, and it's really 55. For the most part, the model worked pretty well. I mean, there are obviously ways of quantifying how good the model is using metrics like RMSE, but this is not a data science or machine learning course, so I'm not going to get into that. It did get some of them wildly wrong, though. For this one, it predicted a value of 36, but the actual value was 117.
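Pulling the walkthrough above together, here's a minimal sketch of what such a script might look like. It is not the course's actual RealEstate.scala: the column names (HouseAge, DistanceToMRT, NumberConvenienceStores, PriceOfUnitArea, and so on) are assumed spellings of the names mentioned in the lecture, and the CSV is a synthetic file written to a temporary path just so the example can run on its own.

```scala
import java.nio.file.Files

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.sql.SparkSession

object RealEstateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("RealEstateSketch")
      .master("local[*]")
      .getOrCreate()

    // Synthetic stand-in for the real CSV; header names are assumptions.
    val header = "No,TransactionDate,HouseAge,DistanceToMRT," +
      "NumberConvenienceStores,Latitude,Longitude,PriceOfUnitArea"
    val rows = (1 to 60).map { i =>
      val age = (i % 30).toDouble
      val dist = 100.0 * (i % 12)
      val stores = i % 10
      val price = 50.0 - 0.3 * age - 0.01 * dist + 1.5 * stores
      s"$i,2013.0,$age,$dist,$stores,24.97,121.54,$price"
    }
    val path = Files.createTempFile("realestate", ".csv")
    Files.write(path, (header +: rows).mkString("\n").getBytes("UTF-8"))

    // Let Spark pick up column names and types from the header row.
    val data = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path.toString)

    // Combine the three feature columns into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("HouseAge", "DistanceToMRT", "NumberConvenienceStores"))
      .setOutputCol("features")
    val df = assembler.transform(data).select("PriceOfUnitArea", "features")

    val Array(trainingDF, testDF) = df.randomSplit(Array(0.5, 0.5))

    // Default hyperparameters; only the column names are set explicitly.
    val dtr = new DecisionTreeRegressor()
      .setFeaturesCol("features")
      .setLabelCol("PriceOfUnitArea")

    val model = dtr.fit(trainingDF)
    val predictionAndLabel = model.transform(testDF)
      .select("prediction", "PriceOfUnitArea")
      .collect()

    // Print the predicted and actual value for each test row.
    predictionAndLabel.foreach(println)
    spark.stop()
  }
}
```

To run it against the real data, you'd swap the temporary file for the path to the course's CSV and make sure the column names match its header exactly.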
So clearly there is more to what sets the value of a place besides how close to a convenience store you are and things like how close you are to the nearest train station. So it's not too surprising that it's not an awesome model, because we didn't give it a lot of awesome data. But it works surprisingly well, actually; by and large, it got pretty close on most of these. So, interesting results. So there you have it; hope you got through it. Again, this is only one way to do it. If you took a different approach, that's fine too. Just make sure you've got similar output and we'll call it good. 56. The DStream API for Spark Streaming: An increasingly common use case for Spark is dealing with streaming data. To this point in the course, we've talked about analyzing offline datasets in a batch manner. So we have this pile of data, and we want to analyze it or process it somehow. But in the real world, data never stops coming in, right? So this is where Spark Streaming comes in. It allows you to ingest streams of data in real time. For example, you might have a fleet of servers generating log information, getting funneled into your Spark cluster and being analyzed as it comes in. Maybe you're transforming that data and storing it somewhere else in some other database. Maybe you're aggregating it over time and looking for things you want to be alerted on as they happen. Either way, streaming is a very important use case in the business world these days. So let's dive in and see how Spark Streaming can be used to analyze data as it's created and take action on it as need be. So in this section, we'll introduce you to Spark Streaming. This is a very common use case for Apache Spark these days, where we're dealing with streaming sources of data and not just transforming a bunch of data all at once in a batch process, but continually monitoring incoming data and doing something to it in real time as it's received.
So a common example of that would be processing log data coming in from a website or a server, or maybe a whole fleet of servers that power a website. That data never stops coming in, right? So there's no really simple set of log files that you can point to and say, hey, go analyze this chunk of log data. It's always coming in; there's new information being gathered all the time. And maybe you want to aggregate and analyze that over some given interval or window of information, right? Maybe you want to set up some sort of alarm where, if you're seeing a bunch of errors on your website, someone gets notified automatically. So instead of just waiting for a nightly analysis of that data, we're analyzing that data continually as it comes in and doing something with it. That data doesn't have to come from a raw file. It can come from some port that's receiving TCP data. It can come from the Amazon Kinesis service. It can come from HDFS or some other distributed file system. It can come from Kafka, from Flume, and all sorts of other data sources too. So Spark Streaming can integrate with a bunch of different systems, not just text files, and analyze that data as it comes in. And it might not just be about analyzing and aggregating data. It might be about transforming it and just sending it somewhere else too. Maybe all Spark Streaming is doing in your case is taking incoming log data, extracting the information you care about, structuring that data, and then storing it in a database somewhere. That's a perfectly valid use case for Spark Streaming as well. Another nice thing about Spark Streaming is that it has a checkpointing feature. So if your stream does go down, if your cluster goes down for some reason, it has an easy way to pick up from where it left off automatically. That's a nice feature for sure. So let's start talking about the old DStream API for Spark Streaming.
Basically, Spark Streaming has a storied history, if you will. The original interface was called DStreams, and it was based on the old RDD interface. Because it was based on RDDs, it was centered around what we call micro-batches. So it just dealt with these chunks of data that were represented as RDDs, and it processed each little chunk all together. So technically speaking, it wasn't true streaming. It wasn't actually operating on data on a field or row-by-row basis. It was chunking things up into what we call micro-batches. Now in the real world, that's usually not really a problem; if you have a latency of a few seconds as opposed to immediate processing, for most applications that doesn't actually matter. But that was one of the main things that caused them to move away from this interface over time. We'll cover it for historical reasons here, but we will go on to the more modern Structured Streaming API after this. Anyway, here's an example of what DStream code might look like. You would set up a StreamingContext as opposed to a SparkContext. That Seconds(1) there means that it's going to process things one second at a time, so each micro-batch will contain one second's worth of new data that we're going to process as part of our script. We can then say stream.socketTextStream, and that's going to tell it to connect to port 8888 and read the text data being sent on that TCP port. And then all it's going to do is filter those incoming lines for lines that contain the word error. If it sees a line that contains the word error, it will print it out. Once you've set that up, you just have to call stream.start to actually kick it off and stream.awaitTermination to wait for someone to stop it. So it's a little bit magical, a little bit weird to wrap your head around this paradigm, right? You have to remember that we're not processing a single chunk of data.
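The slide being described isn't reproduced in this transcript, but a minimal sketch of that kind of script, using the standard DStream API, might look like the following. The object name, the app name, and the "localhost" host are assumptions; the port and the one-second batch interval come from the description above.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ErrorLineMonitor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("ErrorLineMonitor")

    // Each micro-batch holds one second's worth of incoming data.
    val stream = new StreamingContext(conf, Seconds(1))

    // Read lines of text from a TCP source on port 8888 (host is assumed).
    val lines = stream.socketTextStream("localhost", 8888)

    // Keep only the lines that contain the word "error" and print them.
    val errors = lines.filter(_.contains("error"))
    errors.print()

    // Nothing runs until start() is called; then wait until it's stopped.
    stream.start()
    stream.awaitTermination()
  }
}
```

To try it locally, you'd need something feeding text to that port before the script connects, for example a tool such as netcat running in another terminal.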
This directed acyclic graph that we're constructing here to process this data, to take in data, filter it, and print it out if it passes that filter, is going to be applied repeatedly, for every one second's worth of data that's received, automatically. We don't have to write a loop somewhere that says keep doing this every second. Spark Streaming does that for you. So the syntax here may seem a little bit strange and magical at first, because it kind of just works on the stream. You don't have to think about that bit of it too much, but it does work. So at a high level, that's how it works. Some gotchas: remember, your RDDs are only going to contain one little chunk of incoming data. Now, what you can do is windowed operations. With a windowed operation, you can combine results from multiple batches over some sliding time window automatically. There are functions like window, reduceByWindow, and reduceByKeyAndWindow. Those window functions allow you to do a reduction over some period of time going back into the past. So I could say go back for the past minute, go back for the past hour, go back for the past day, whatever it is, and reduce my information down based on that window. That window does not need to correspond to your batch size; it can be much larger, and Spark will automatically keep track of those RDD results, those micro-batches, and apply your operation over the window of time that you've defined. So it automatically keeps that information around so you can analyze it as you go. Also worth talking about is updateStateByKey. This allows you to maintain state that spans across many batches as time goes on. So if you need to keep track of some sort of running state as you process this information across windows or across batches, you can do that with updateStateByKey.
A good example of that would be running some sort of counter, an ongoing count of some event over time since you started the script. So if you need to keep track of something over the script's whole run, as opposed to over some window of time, updateStateByKey lets you do that. So let's dive into a simple example here. We're going to run a Spark Streaming script that just monitors live tweets from Twitter, and we will keep track of the most popular hashtags as we receive new tweets. To do this, we first need to set up a Twitter developer account. Now, some people have trouble doing this; it depends on what country you're in, how you fill out the form, and what kind of mood Twitter is in, quite honestly. So I'm not sure you really want to go through this yourself. You can, if you want to: just go to apps.twitter.com and apply for a developer account there if you do want to follow along. Honestly, I think you should just watch me do this next activity because, as I said, DStreams are kind of dated now. If you're just taking this class for the purpose of getting a certification, I can promise you it's not going to be on the certification exam, so you might just want to watch me do the following activity instead of following along. But if you do follow along, you can just go to apps.twitter.com and sign in for a developer account. Then you can create the API keys and access tokens that you need to actually query Twitter and pipe that into Spark Streaming. To store those credentials, you'll need to create a twitter.txt file. I'll show you where that is when we walk through it. It's just going to have the name of each consumer key and access token setting, along with its value, one per line, and our driver script will read that in and use it to authenticate with Twitter. Our overall strategy here: once we have that Twitter stream established, we're just going to extract the messages themselves. There's a bunch of metadata that Twitter gives you as well.
But we're going to call TwitterUtils.createStream, which uses a third-party library that I've included with the course to actually connect to Twitter and make it look like a DStream. Then we'll call tweets.map to extract the text field from each tweet by calling status.getText. So at this point we'll have a new RDD, a micro-batch if you will, that contains nothing but the status messages from each tweet that's received. So for example, maybe one tweet says "Vote for #BoatyMcBoatface." Someone else might say, a vote for I Like Big Boats and I Cannot Lie, or "No, vote for #WhatIceberg." And someone else might say, "What are you, crazy? #BoatyMcBoatface all the way." If you're wondering what the heck I'm talking about here, this was a meme several years ago. Basically, I forget what navy it was, but they ran an online contest to let the Internet name their new ship. And of course that was a horrible idea. The winner, of course, was Boaty McBoatface, which they did not use. But I think they stuck it on some little submarine or something that was on the ship. Anyway, in this example we're trying to keep track of the most popular hashtags, right? So in this case we would have #BoatyMcBoatface showing up as the number one result with two appearances, and #WhatIceberg would be the second result with one appearance. So we've established our stream of status messages. Next, we can use flatMap to split that out into its individual words. By calling flatMap on that statuses micro-batch, we blow that out into a bigger RDD that has a row for every single word that appears. So, for example, the message "Vote for #BoatyMcBoatface" would just be broken up into three separate rows: Vote, for, and #BoatyMcBoatface. And since all we care about are the hashtags, that's the only thing we're measuring.
We can then call filter to filter out any rows that do not begin with the hashtag symbol. That'll reduce things down to just the hashtags that are showing up. At that point we can do our old trick from way back in section one or section two, whatever it was, where we convert those to an RDD of key/value pairs, where the key is the actual hashtag and the value is the number one. That will allow us to later do a reduce operation to count them all up, which is what's going on here. And we're going to do that over a sliding window to make things interesting. So the reduce operation will add up all of those ones and give us a total count of how many times each hashtag appears. But we're using a stream here, so we're not going to say reduceByKey. We have to say over what period of time. To do that, we'll say reduceByKeyAndWindow to perform that reduce operation over a given window and slide interval. So we're going to say reduceByKeyAndWindow, basically adding everything up. When you pass this in, you explicitly give it a function for adding elements and a function for removing elements as the window moves. That's what's going on in this syntax here. And we're going to say that we're going to apply this over a 300-second window, that's the past five minutes, with a one-second slide. So we're going to slide our window over every second and look at the past five minutes' worth of data. What this is going to do is recompute that reduction, that count of hashtags, every second, going back five minutes. And we should get a result that looks like that if that was our actual sample data. So finally, we just need to sort and output the results. We will call transform to sort that RDD based on the second field, which is the count, and print them out. And that should pretty much be it. So let's go and see it in action. Note that we do have some extra libraries that we're using, some third-party libraries. I'll show you how that's set up.
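If the add-and-remove business with the sliding window feels abstract, here's a plain-Scala sketch of the same idea, with ordinary collections standing in for the DStream and its micro-batches. Nothing here is Spark API: the object and function names are made up for illustration, and real Spark does this bookkeeping for you inside reduceByKeyAndWindow.

```scala
object HashtagWindowSketch {
  type Counts = Map[String, Int]

  // One micro-batch of tweet texts -> per-hashtag counts:
  // split into words, keep hashtags, pair each with 1, then sum per key.
  def countHashtags(tweets: Seq[String]): Counts =
    tweets
      .flatMap(_.split(" "))
      .filter(_.startsWith("#"))
      .map(tag => (tag, 1))
      .groupBy(_._1)
      .map { case (tag, pairs) => tag -> pairs.map(_._2).sum }

  // Merge two count maps key by key, dropping keys whose count hits zero.
  def combine(a: Counts, b: Counts)(op: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).iterator
      .map(k => k -> op(a.getOrElse(k, 0), b.getOrElse(k, 0)))
      .filter(_._2 != 0)
      .toMap

  // Slide the window one batch: add the batch entering the window,
  // subtract the batch that has just fallen out of it.
  def slide(window: Counts, entering: Counts, leaving: Counts): Counts =
    combine(combine(window, entering)(_ + _), leaving)(_ - _)

  // Sort by count, highest first (ties broken alphabetically), keep top n.
  def topHashtags(counts: Counts, n: Int): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (tag, c) => (-c, tag) }.take(n)
}
```

The point of having both an add and a subtract function is that each slide only touches the batch entering the window and the batch leaving it, rather than re-adding five minutes' worth of batches every second.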
And yeah, let's just dive into the code and see if it works. Could be interesting. 57. [Activity] Real-time Monitoring of the Most Popular Hashtags on Twitter: All right, so let's walk through that example of using the old DStream API to find out the most popular hashtags on Twitter right now. If you do want to follow along, you can try. If you head over to apps.twitter.com or developer.twitter.com, you'll land on the developer portal for Twitter. If you don't already have a developer account, it will walk you through the process of applying for one. Again, I'm not really sure it's worth your time to go through that. They're making it harder and harder to get a developer account these days, because so many people are using them to do evil things with bots and whatnot. But if you do want to go through it, if you just say that you're doing it for educational purposes for an online course and list this course's name, they'll probably give you one pretty quickly. So if you do want to do that, go ahead and apply for an account. Once you have that, you'll be able to create a set of access keys. So let's look at what this looks like. If you click on Keys and Tokens, that will bring you to a place where you can get both your consumer keys, the public and the secret one, and also a set of access tokens for an application. Once you have those, you're going to open up the twitter.txt file in your course materials and put them in there. So mine looks like this. Obviously I've blurred out the actual keys because I don't want you using mine. But just paste your consumer key, consumer secret, access token, and access token secret in here for your own account if you are following along. Then the code will just pick that up and use it to authenticate with Twitter.
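The course's own setup function isn't shown in this transcript, but the idea of reading a credentials file and handing it to the Twitter client can be sketched roughly like this. The file layout (a setting name, a space, then its value on each line) and the function names are assumptions; the twitter4j.oauth. prefix is the system property namespace that the Twitter4J client reads its OAuth settings from.

```scala
object TwitterCredentialsSketch {
  // Parse lines of the assumed form "name value", for example
  // "consumerKey abc123". Blank and malformed lines are skipped.
  def parseCredentials(lines: Seq[String]): Map[String, String] =
    lines
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.split("\\s+"))
      .collect { case Array(name, value) => name -> value }
      .toMap

  // Expose each setting as a system property for the Twitter client.
  def applyCredentials(creds: Map[String, String]): Unit =
    creds.foreach { case (name, value) =>
      System.setProperty("twitter4j.oauth." + name, value)
    }
}
```

In a real script you'd read the lines with something like scala.io.Source.fromFile and then call applyCredentials before creating the stream. This sketch trims whitespace and skips blank lines defensively, which the course script may not do.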
Do be careful, though. Make sure that you don't have any extra lines and don't have any extra spaces before or after these things; that will mess things up. That's the number one issue that people have with this one. Once you have that, let's go back to IntelliJ and open up PopularHashtags, and let's walk through what's going on here. Again, this is using some older APIs, DStreams and RDDs, so we have a separate function here to set up our log level, which does that in a little bit more of a verbose manner. No big deal. In our main function, the first thing we do is call setupTwitter. All that does is open up that twitter.txt file and extract all of those authentication keys. It just sets the system properties that Twitter's client will need to allow you to connect to Twitter. Under the hood, this is using third-party libraries that I've included with the course materials. This is probably a good time to mention that this is all built using SBT under the hood in IntelliJ. If you actually go to build.sbt here, you can see the packages that we're including for the course materials. So here we're saying explicitly that we're using Scala version 2.12 and Spark version 3, and we're importing Spark Core, Spark SQL, Spark MLlib, and Spark Streaming. This is how it's all magically worked so far through the course. This is how we've had Spark available to us even though we didn't install Spark explicitly: SBT went out and got it for us automatically. We are also getting the twitter4j-core and twitter4j-stream packages. These are just Twitter JAR files out there that allow you to stream in Twitter data from the Twitter API. Once you've set those properties, it will authenticate for you. So by setting this in our library dependencies, SBT automatically retrieves the correct version of the Twitter libraries to allow us to connect and stream that data for the given version that we want here.
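For reference, a build.sbt along the lines described might look something like this. Only "Scala 2.12", "Spark 3", and the list of packages come from the lecture; the project name and the exact patch versions below are placeholders I've assumed, so check them against the file in your own course materials.

```scala
// build.sbt (sketch; name and exact version numbers are assumptions)
name := "SparkScalaCourse"

version := "0.1"

scalaVersion := "2.12.12"

val sparkVersion = "3.0.0"

libraryDependencies ++= Seq(
  // %% appends the Scala binary version (_2.12) to the artifact name
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  // Twitter4J is a plain Java library, so a single % is used
  "org.twitter4j" % "twitter4j-core" % "4.0.4",
  "org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
```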
There's also a separate library that will convert that Twitter stream from the Twitter API into a DStream for Spark Streaming, and we just include that in the libraries included in the project here. If you go up here to the lib folder, that's what that dstream-twitter JAR file is there for. It's been explicitly built against the specific versions of Spark, Scala, and Twitter4J that we're specifying in that SBT file. The way SBT works is that if there's something in the lib folder, it just automatically picks it up as sort of a local dependency. So that's how it all works under the hood. Back to the code, though. So setupTwitter has set up all the authentication that it will need to connect. Once we have that, we can set up our streaming context with a one-second batch size of data coming in. We'll set up our logging to set the log level. Then we call TwitterUtils, which is calling into that local library that we included there, and we call createStream on it to create a DStream from Twitter using that third-party library. We just pass in our streaming context and an optional parameter that we don't need. Once we have that, we can just use status => status.getText as our map function on that DStream. All that's doing is extracting the text field from each tweet. We don't care who tweeted it. We don't care what time they tweeted it. All we care about is the tweet itself, and that's what's going into our new statuses DStream. Once we have that, we use flatMap to blow that out into individual words after splitting on the space character, and we filter out anything that doesn't start with a hashtag. So now we're just left with the hashtags that appeared within the tweets that are coming in through that stream. We then map those to tuples of the hashtag itself and the number 1.
And then we can use reduceByKeyAndWindow to check every second, going back five minutes, to add up all those ones for each individual hashtag and reduce by that hashtag key. Again, this is a form that we haven't seen before with reduceByKey. Sometimes it wants both an operation for adding something in and an operation for taking something out. So in this case, if you add something in, we add it; if you take something out, we subtract it. Usually that's going to be the case, but there are weird use cases where you want to do something more complicated. Once we have that, we just sort it and display the results, printing out the top 10. All right, to kick it off, we set a checkpoint directory. We didn't talk about that in the slides very much, but this is setting a checkpoint folder on my C drive. If you're on some other operating system, you'll want to change that to a path that makes sense on your OS, since that wouldn't work on Linux or Mac, of course. You could change it to /Users/ followed by whatever your username is, your home directory if you want, then /checkpoint, and that would work as well. That's where it will store the data that it needs to pick up where it left off if something bad happens to this script unexpectedly while it's running, so it can resume from that checkpoint. Once we have that set up, we just start our streaming context and let it do its thing. It's just going to continually do what we told it to do, working out every second what the most popular hashtags over the past five minutes are, and it will keep on doing that until somebody explicitly terminates the script. So let's see if it works. Click on PopularHashtags, right-click, and run. And here it comes. Every time I run this, I'm reminded that Twitter is a global platform. We see an awful lot of foreign-language stuff and cultural references that mean nothing to me. Across the entire world.
We're looking at every tweet that's going out, publicly at least, and adding up the hashtags that appear most often. So let's let this run for a little while and see what interesting results we get. Let's expand this a little bit so we can see more of it. There we go. So right now BTS is trending as the top hashtag. I have no idea what that means. I'll have to go look that up later. It stands for something culturally significant at the moment that people are excited about. People are excited about ice cream. Ice cream has taken over. Ice cream just caught up for number one, very exciting. Hey, who doesn't like ice cream, right? IP3, whatever the heck that is. Still together. Very dynamic, right? Ice cream, though, holding its own. It's like a war between ice cream and BTS. BTS is pulling ahead. Anyway, you could run this all day if you want to, and I guarantee you you'll see something different if you're running this yourself, because it changes every day, every moment. It's whatever the world is talking about right now. And fortunately, I don't think we see anything offensive in here. Usually you do. Although to be fair, I don't know what BTS is or what that foreign-language word is that keeps popping up either. Still together, I don't know what that refers to. Man, I gotta get out more. But anyway, there you have it. We can go ahead and just cancel this. Hit the red button here to stop that. Looks like after that time, the winner was, whatever that means in whatever language that is. Maybe someone in the comments can tell me. I don't know. Maybe I don't want to know. It could be something nasty. But it's a pretty cool, fun example, right? So we've actually used the DStream API here to go and connect to real Twitter feeds and keep track of the most popular hashtags as we go over time. And if I were to run this for a full five minutes, we'd get a full five minutes' worth of data.
And we would then have a sliding window of the past five minutes going forward as we continue to run this script. So it might be a fun thing to run for a while and just see what it does. So there you have it, kind of a real-world example using real-world data. That's the fun part, but like we said, DStreams are kind of the old way of doing it. So let's dive into the new way of doing things with Structured Streaming next.
58. Structured Streaming: So as I mentioned, DStreams were the original streaming API for Apache Spark. These days, people use something else. It's called Structured Streaming. Sometimes you still run into libraries that expect DStreams, like that Twitter library that I was just using. So it's still useful to know that it's out there and how it works. But you'll find that Structured Streaming is the more modern API for streaming in Spark. And it's also easier to use. The idea is that it uses Datasets as its primary API instead of RDDs, or DStreams that look like RDDs. And like I said before, much of Spark is going the way of using Datasets in Scala, or DataFrames in Python, as its primary API. And the beauty of this is it's pretty elegant for streaming, because you can imagine a Dataset that just keeps getting appended to forever, and you just query it whenever you want to, just like any other Dataset. So the streaming part is just that we keep adding more and more rows to that Dataset in real time as new information is received. And that makes things a lot easier. It means you can use these Datasets much like any other Dataset. So there's actually not a whole lot for us to talk about in this lecture. Once you've set things up, you just use this like any other Dataset. All the other stuff that we've learned in the course applies exactly the same way to a streaming Dataset as it would to a Dataset that's read from a batch process. And the other nice thing is that by doing this, streaming is now truly real-time.
This was a real sticking point with a lot of people. And for a while people were saying that Spark Streaming was inferior to other streaming platforms because of that. Well, now they can't say that anymore, because with Structured Streaming, streaming is now truly real time. As new data is received, it will be immediately appended to that Dataset and you get access to it right away. So we're no longer working in terms of these micro-batches of, like, tiny little RDDs that contain one second's worth of data. We don't have to think about that. That level of thought is no longer required in our code. Setting it up is super easy. All we need to do is say spark.readStream instead of spark.read when we're setting up a DataFrame or a Dataset. So, for example, if you wanted to just read in JSON files from a logs bucket on Amazon S3, you could just say spark.readStream.json with the path to the S3 logs, and that would just sit there and monitor that logs bucket in S3 all day long, 24/7, looking for new JSON files to read in and parse. And it would just keep on adding every new line, every new JSON record that was found there, into that input DataFrame. And then you can do whatever you want to it. You can group it by, you know, some action. You can do a window just like we saw with DStreams. So if you do want to specify a window of time, you can just pass that extra window parameter there to specify the period of time over which you want to go. And you can count it up and write the stream out wherever you want to. In this case, we're going to format it to a JDBC connection and just write it into a MySQL database somewhere, just turn around and stick that in a database. So that's really all the streaming-specific code you would see: just the act of actually specifying that window, establishing the stream, and where you're writing the stream to.
Apart from that, it's just a regular old DataFrame or Dataset, depending on how you're using it, and everything you would normally do with a DataFrame or Dataset applies here as well. So we're going to do a little example here of streaming log files. I included an Apache access log file in your course materials. And what we're going to do is just set up a little directory in our course materials, and we'll copy that log file in and see if our streaming application picks it up and processes it. So to do that, we're just going to say spark.readStream.text, because these are just plain text files, log files, with a directory path to the logs folder within my course materials. At that point, I can just use SQL operations to parse out the data from those log lines using regular expressions. You could use that with a map operation if you wanted to as well. And then just use groupBy to group things together, over some window if you want to, and stream out the output, in this case to our console, but it could just as easily go to a database or anything that you want. So let's dive in and stream some logs. Not that kind of log.
59. [Activity] Using Structured Streaming for real-time log analysis: All right, so let's play with Structured Streaming, again, kind of the modern way of doing streaming these days in Spark. Let's go and open up the StructuredStreaming script here and see how it works. First of all, you see this is actually pretty simple. I mean, apart from the actual regular expressions for parsing out that log data, the Spark code itself is not a whole lot. It's pretty concise. So let's walk through it here. All right, so we import the stuff we need, create our object here, set the log level, create a Spark session. So far nothing is different. The first thing that's different, though, is that instead of saying spark.read.text, we're going to say spark.readStream.text. That's it.
That says we're going to make this an ongoing stream of information, as opposed to just reading in a block of information from a static text file. And the path that we pass in here is actually a directory path. In this case it's going to be on our local file system, but you could just as easily monitor an S3 bucket or some distributed file system if you wanted to as well. So it's just going to sit here watching my data/logs directory for new text files. And as new text files are discovered there, it will append each line of that text file into the accessLines DataFrame. OK, it's that simple. Now the rest of this complicated code here is really just parsing that data out. Apache access logs are a fairly weird format, and it takes some pretty complicated regular expressions to extract the information out of them and turn that unstructured log file data into structured information. So I'm not going to get into the details of how these regular expressions work, but they are meant to extract the information that we need from each field of that Apache access log. And we apply those regular expressions here within this big old select statement. So we take that accessLines DataFrame and we call select on it. Now by default, each new line of text going into accessLines will be in a column called value. So for every one of these examples, we're doing a regexp_extract on the value column. That means it's going to take the raw line of data and apply a regular expression to it, in this case the host expression, which is meant to extract the host name. And we're going to give that an alias of host. So this will basically create a new host column that contains the host extracted from that line. And we do the same thing for the timestamp, for the method, for the endpoint, for the protocol, for the status code, and the content size. At this point, you can do whatever you want, right?
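To give a feel for what that parsing step does to one raw line, here is a small plain-Scala sketch using scala.util.matching.Regex rather than Spark. It is not the course script: the script applies separate expressions per column with regexp_extract, whereas this sketch assumes a single combined pattern for the common Apache log layout, and the sample line, the pattern, and the AccessRecord type are all made up for illustration.

```scala
import scala.util.matching.Regex

object AccessLogParseSketch {
  // One structured record per log line; the field names mirror the columns described above.
  final case class AccessRecord(
      host: String,
      timestamp: String,
      method: String,
      endpoint: String,
      protocol: String,
      status: Int,
      contentSize: Long)

  // Assumed layout: host ident user [timestamp] "METHOD endpoint protocol" status size ...
  val LogLine: Regex =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-)""".r

  // Returns None for lines that do not fit the assumed layout.
  def parse(line: String): Option[AccessRecord] =
    LogLine.findFirstMatchIn(line).map { m =>
      AccessRecord(
        host = m.group(1),
        timestamp = m.group(2),
        method = m.group(3),
        endpoint = m.group(4),
        protocol = m.group(5),
        status = m.group(6).toInt,
        contentSize = if (m.group(7) == "-") 0L else m.group(7).toLong)
    }
}
```

Each capture group here corresponds to one of the aliased columns in the select; in the streaming version the same kind of pattern is evaluated per row of the value column instead of per string.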
So in our simple example here, all we're going to do is groupBy the status and count it up over time. So we're going to keep a running count over time of how many times each status code appears in these log files. So it's going to sit there, again, waiting for new log files to get dropped into the data/logs directory, and it will group by, over time, how many times each status code appears. We have to tell it where to put that information, so we construct what's called a query. We just say status counts DataFrame, that's our final count DataFrame there, and we're going to write that stream out with the following output mode. We're going to say we want the complete output mode, with the format going to the console. And we'll give that query a name called counts. And we'll just call start to actually start that output stream. OK. So we defined our input stream up here with spark.readStream.text. That's what it's going to monitor. We defined a series of operations that led ultimately to the status counts DataFrame that's being computed from that input stream. And then we create an output stream here called query. We then just wait for termination. It will just run forever until we stop it explicitly. And when somebody does that, we call stop on the session to wrap things up. So let's kick it off and see if it works. First we'll just execute it. Wait for it to spin up. Initially there's nothing in that directory, so we need to put stuff in there for it to actually do something interesting, right? So let's do that. Let's go to our course materials here and go into data. And you'll see there's an access log file here that contains an actual Apache access log from one of my websites. Let's copy that, open up the logs directory, and paste it in there. And what should happen is our script should pick that up. I can hear my CPU fan running, so it's doing something. And it worked. So it picked up that new file and counted up all the error codes.
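The walkthrough above can be condensed into the following sketch. It is a cut-down approximation rather than the course script: it extracts only the status column, the status regular expression is an assumption of mine, and the object name is made up. The data/logs path, the complete output mode, the console format, and the counts query name follow the description above. It assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_extract}

object StatusCountsSketch {
  // Assumed pattern: a three-digit status code right after the quoted request.
  val statusExp = "\"\\s(\\d{3})\\s"

  // Works the same on a static or a streaming DataFrame that has a
  // single string column named "value" (one raw log line per row).
  def statusCounts(lines: DataFrame): DataFrame =
    lines
      .select(regexp_extract(col("value"), statusExp, 1).alias("status"))
      .groupBy("status")
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StatusCountsSketch")
      .master("local[*]")
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Watch the directory; every line of every new file becomes a row.
    val accessLines = spark.readStream.text("data/logs")

    val query = statusCounts(accessLines).writeStream
      .outputMode("complete") // re-emit the whole table of counts on each update
      .format("console")
      .queryName("counts")
      .start()

    query.awaitTermination()
    spark.stop()
  }
}
```

Keeping the aggregation in a function that takes a DataFrame is a design choice for the sketch: it shows that nothing in the grouping logic is streaming-specific, and it lets you try the logic on a small static DataFrame before pointing it at a directory.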
So we can see that we actually do seem to have a 500 error code here that's showing up an awful lot. So this would be alerting me to a real problem on my website if this were a real application, right? Maybe I could have some sort of a threshold on the number of 500 errors and have it alert me if that exceeds some threshold over time. Let's see if it works as more data is thrown into it. So let's go back to data and make a copy of our access log file. I'm just going to copy and paste. So now we have an access log copy. Let's copy that into our logs as well. And what we should see is all of these numbers double. Batch 1, we have another batch of information that came in. And sure enough, all these numbers have doubled, because I just copied that same log file in again. So there you have it. Let's hit the X button here to stop the stream. And yeah, that's Spark Streaming in action. Pretty cool stuff. So as you can see, Structured Streaming is really easy to use. If you can use a Dataset, you can use Structured Streaming, and you're not limited to processing data in a batch fashion anymore. You can process it as it comes in, in real time, which is very exciting. All right, that's Spark Streaming in a nutshell. Let's move on.
60. [Exercise] Windowed Operations with Structured Streaming: So this next exercise is really interesting. I want you to keep track of the top URLs viewed in my Apache access logs. So instead of displaying the count of status codes over time, I want you to display the count of the top URLs that were viewed from the log data instead. And to make it even more interesting, I don't want you to measure this from the beginning of time. I just want you to look back over the past 30 seconds and just tell me what the top results were within the past 30-second interval. Now to do this, you need to introduce the concept of windowed operations in Spark Streaming. So let me introduce that real quickly.
A windowed operation is just looking back over some period of time. So for example, if I want to only consider events that happened in the past 10 minutes, I'd have a 10-minute window. And the slide interval defines how often we evaluate that window. So say we have a 10-minute window from 12:00 to 12:10. If we had a slide interval of five minutes, that means we would evaluate that window at 12:05, 12:10, 12:15, 12:20, each time looking at the most recent 10-minute window that we're within. To make that a little bit more concrete, let me show you a diagram that comes straight from the Spark documentation here. In here we have a 10-minute window with a five-minute slide interval. So we're starting this thing at 12:00 in this hypothetical example. So we have windows that run from 12:00 to 12:10, from 12:05 to 12:15, from 12:10 to 12:20, and so on and so forth. So every five minutes we have a new 10-minute window being defined. And over time we aggregate the results throughout these windows, and we keep adding the results of those windows to our result table as we go. So in this example, we have a stream of animal names coming in, apparently. So at 12:02 and 12:03, we have cat dog and dog dog come in. So at 12:05, we have a slide interval that gets hit. So that's going to look at the 12:00 to 12:10 window, because that's just our first window that we happen to have since we started running. And what we have so far is one cat and three dogs that have been received. At 12:07, we get an owl and a cat. So when we run our next slide interval at 12:10, we're now going to look at both the 12:00 to 12:10 and the 12:05 to 12:15 windows. Now it turns out that our cat actually got added into that 12:00 to 12:10 window. So we're going to add that to our results table and update that. And we're also going to create a new window in our results table for 12:05 to 12:15 that we're going to keep track of. And it just keeps going on and on, right?
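The bookkeeping in that diagram can be sketched in a few lines of plain Scala, with no Spark involved. The function below works out which windows a timestamp falls into, and the second one builds the result table of counts per window and word. Times are minutes since midnight, windows are assumed to start on multiples of the slide interval, and all names are made up for illustration. One thing to be aware of: this function returns every aligned window containing the timestamp, so a 12:02 event also lands in an 11:55 to 12:05 window that the walkthrough, which starts at 12:00, doesn't mention.

```scala
object SlidingWindowSketch {
  // Minutes since midnight, e.g. at(12, 7) for 12:07.
  def at(hour: Int, minute: Int): Int = hour * 60 + minute

  // All [start, end) windows containing time t (t >= 0), earliest first.
  // Windows are assumed to start on multiples of the slide interval.
  def windowsFor(t: Int, windowMinutes: Int, slideMinutes: Int): Seq[(Int, Int)] = {
    val latestStart = t - (t % slideMinutes)
    Iterator
      .iterate(latestStart)(_ - slideMinutes)
      .takeWhile(start => start + windowMinutes > t)
      .map(start => (start, start + windowMinutes))
      .toList
      .reverse
  }

  // The result table: how often each word was seen in each window.
  def countsPerWindow(
      events: Seq[(Int, String)],
      windowMinutes: Int,
      slideMinutes: Int): Map[((Int, Int), String), Int] =
    events
      .flatMap { case (t, word) =>
        windowsFor(t, windowMinutes, slideMinutes).map(window => (window, word))
      }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
}
```

For the 12:07 events, `windowsFor(at(12, 7), 10, 5)` gives the 12:00 to 12:10 and 12:05 to 12:15 windows, which is why the owl and the cat update one existing row set and create a new one.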
So at 12:15, we're now going to get another new window from 12:10 to 12:20 that we're going to look at. And we'll go back and update any previous windows as needed to account for the new data. So we have another dog and another owl that came in in between there. And if you look closely, you can figure it out. But basically the idea is the window is the period of time over which you're aggregating stuff, and the slide interval is how often you are evaluating that. The code for doing this is pretty straightforward. You just say groupBy if you wanted to do some sort of aggregation on a column, in this case the column we are grouping by, given by the name of that column. And syntactically you would say window, and then specify a col with the column name that represents the timestamp defining when this thing happened, right? So the windowed code in Spark Streaming is not automatically keeping track of when events were received. It's going to be looking at a specific column in your data that's being ingested. OK, so keep that in mind. You're going to be windowing on data that's in the stream itself, not on when that was actually received by Spark Streaming. So it's a subtle but important difference, especially in our example here, because we're using very old Apache logs that have timestamps from years and years ago. So that's going to affect how we actually approach this problem. Then you say comma, windowDuration equals whatever window duration you want. And that's just specified in plain English, like 10 minutes, 30 seconds, whatever you want there. I think there's a one-month limit on that, though. And the slideDuration is specified in plain English as well. So syntactically, that's what you need to know on how to do windowed aggregations. So your challenge, again: modify the StructuredStreaming script that we just worked on to keep track of the most viewed URLs. And that's actually called endpoint in this particular script. Same thing.
And I want you to compute that with a 30-second window and a 10-second slide. Now, you're going to need some little snippets of code in order to do this. First of all, again, I'll just reiterate that snippet of code on how to actually syntactically specify that window. So that's what that code would look like. That's not the entire line that you'll need in the actual script, but it's a good part of it. And grouping by is not enough. You also need to count up how many are in that group, right? So you've seen how to do that in previous examples. Also, like I said, you're going to have an issue because the Apache sample logs I gave you are very old. They are years old. And that's not going to work with a 30-second window, right? So if you're looking back 30 seconds on a log that was capturing data from three years ago, that's not going to work out so well. So what you can do is fabricate a new column in your dataset that is the current time that it was ingested. And the syntax for that is that second block there: logsDF.withColumn will add a column called eventTime that's equal to the current timestamp. So that will give you a new column you can use for the current time. That will be a little bit more useful for this exercise. And finally, you're going to want to make sure that you're ordering the counts in descending order. So at some point you'll have to say orderBy, col count, or whatever you call that count column, dot desc, to specify you want that in descending order, so you see the top results first and not vice versa. So go off and have at it. This is a little more challenging and might require you to do a little bit more creative thinking or looking stuff up online. But hey, that's how the real world works. So give it a go. And when we come back, I'll show you how I did it.
61. Exercise Solution: Top URL's in a 30-second Window: All right, so let's do something a little bit more fancy with Structured Streaming and take a look at my solution for keeping track of the top URLs viewed from our Apache access logs using a 30-second sliding window, instead of just from the beginning of time. So I just started with a copy of the StructuredStreaming.scala script like I advised you to do. And not a whole lot changes until we get down to this line here on line 42. Now, as I mentioned, this is a very old Apache access log that I have given you within the course materials. And if we're going to be doing sliding windows of time from the past 30 seconds, well, we need to fabricate some new event times that are more current, right? So that's all this line here is doing. It's taking our logsDF DataFrame and appending a new column using withColumn, called eventTime, that is set to the current timestamp whenever that data is actually ingested. And we're going to turn around and call the resulting DataFrame logsDF2. All right, now things get interesting. So let's keep a running count of those endpoints. In our nomenclature here, an endpoint is basically the same thing as a URL. So we have a GET method followed by the endpoint, which is going to be the URL on our website that someone is hitting. So to do this, we're going to use that same syntax that we talked about in the slides for establishing that 30-second window with a 10-second slide duration. We just say logsDF2.groupBy, window, col, and the column name here is eventTime. That's the timestamp that we're going to be windowing on. OK? We specify a window duration of 30 seconds, and we check in and update those windows every 10 seconds. And the second parameter for the groupBy command is going to be col endpoint. So we're going to be grouping by the endpoints, grouping all those common endpoints together.
And then once you have those grouped within that window and on that slide duration, we're going to count them up. So what this one line of code is doing is counting up how many of each URL was encountered every 10 seconds, looking at the current 30-second window that we're in, right? Now, that's not enough. I also want to sort these, so I see the most popular URLs. I don't just want to see an unordered list. So I'm furthermore going to take that endpoint counts DataFrame and sort it by saying dot orderBy, col count, count being the name of the count column that gets created by that count command, dot desc, to say that I want it in descending order. So the most popular URLs are at the top, and that is it. The rest of the code is basically the same. We output it to the console there and kick it off, basically. And a little aside here, one last note: if you did want to write this out to a database or a file or something like that, you would just set the output mode to something else and the format to something else. So if you look up the documentation there, you can see how you can write out to a database or to a text file or a Parquet file or whatever you want. It doesn't have to be to the console. It's going to be much more practical to actually store this data somewhere, right? And what we're storing is that results table. So we're going to get this table of results for every window, updated for every slide duration. So the result we're going to see is a table that keeps track of the results that we see within each window, each 30-second window that we have. So let's see if I'm actually lying to you or not. Let's run this and see what happens. So, top URLs, run. We'll allow that to spin up, and I'm going to make that window big so we can see what's going on. All right, so it's sitting there waiting for data. I'm going to go back to my logs here in my course materials. And you can see I actually have three copies of my access log here to play with.
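The three pieces of the solution described above (the fabricated eventTime column, the windowed groupBy with a count, and the descending sort) fit together roughly as in this sketch. It is an approximation, not the course script: the object and function names are made up, and the input is assumed to be a DataFrame that already has a string column named endpoint. It assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, current_timestamp, window}

object TopEndpointsSketch {
  // `logs` is assumed to already contain a string column named "endpoint".
  def topEndpoints(logs: DataFrame): DataFrame =
    logs
      // The log's own timestamps are years old, so stamp each row with the ingestion time.
      .withColumn("eventTime", current_timestamp())
      // 30-second windows on eventTime, re-evaluated every 10 seconds, per endpoint.
      .groupBy(
        window(col("eventTime"), windowDuration = "30 seconds", slideDuration = "10 seconds"),
        col("endpoint"))
      .count()
      // Most-hit endpoints first. On a streaming DataFrame, sorting after an
      // aggregation like this needs the "complete" output mode.
      .orderBy(col("count").desc)
}
```

On a streaming DataFrame the result would then be written out the same way as before, for example with writeStream, outputMode("complete"), and format("console"). Because each timestamp falls into three overlapping 30-second windows when the slide is 10 seconds, the same endpoint shows up in several window rows of the result table.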
Let's start with one. We've got to copy that and throw it in my logs folder here. So since we're streaming in real time, it should pick up the existence of that new data and go ahead and update everything. It's chugging away right now. You can hear my CPU fan going. And there we have it. So we can see that at those 10-second slide intervals, we're computing that window value each time, and we're getting a consistent result because we haven't actually been running for more than 30 seconds yet. So let's go ahead and throw some more data in there just to make sure that it works as expected. So note here we're seeing Batch 0. I should now see Batch 1. And there's Batch 1. Interestingly, it's the same number as before, right? Because more than 30 seconds have elapsed since that first batch. So that first batch of information actually expired. It fell off the end of my window. So if we weren't windowing, we would've seen all those numbers double, right? But they didn't, because that 30 seconds of data expired. I don't think 30 seconds have passed quite yet, so let's see if I can actually get a larger number there. If I'm quick enough, we'll put in a third copy there. And here we see Batch 2. I wasn't quite quick enough. It fell off the end there of the 30-second window. But if you do have your own version of this running, feel free to play with it or adjust those window times to be a little bit more friendly to you and how you're working. And a good way to get a feel for how windows and slide intervals work is to experiment with it and just play around with it like we're doing right now. So you've learned about Spark Streaming now and how to handle data in real time as it comes in.
62. GraphX, Pregel, and Breadth-First-Search with Pregel.: So in this section we'll introduce GraphX, which is an API for dealing with graphs of information, kind of like our network of superheroes earlier in the course.
GraphX is kind of a neglected stepchild of Spark, to be honest. It hasn't seen a whole lot of development lately and it's still stuck on the old RDD API. There is a newer version of it in development called GraphFrames, but it's not ready for prime time yet. So for now, just for completeness, I'm going to talk about GraphX in its current form, which does use the RDD interface, but you'll see it's still quite powerful. And using it does feel a little bit more like SQL at times. So let's dive in and see how GraphX can help you solve massive distributed problems using graphs of information. Let's dive briefly into the world of GraphX, the last core component of Spark itself. And when we talk about GraphX, we're not talking about line graphs or charts or anything like that. We're talking about graphs in the computer science sense. So for example, our social network of superheroes that we saw earlier in the course. That's an example of the graph we're talking about, where we have vertices that in that case represent individual superheroes, and edges between those vertices that represent relationships between them. GraphX is kind of cool, but it's really only useful for some specific things. So by itself, it can't actually answer the questions that we were answering in the code that we wrote for analyzing our superhero network. But it can do things like measuring connectedness, degree distribution, average path length, triangle count, sort of these high-level measures of the graph as a whole. It can do things like count up all the triangles in the graph and apply the PageRank algorithm to it. So I think that's really the driving force behind GraphX itself. It's most useful for implementing something like PageRank, which is obviously an important use case. And that's also a problem that involves massive scale, of course, where the power of Spark comes in handy. So it's kind of made for that more than anything else.
You can also do things like joining graphs together and transforming those graphs very quickly in a distributed manner. But for things like our degrees of separation example, where we're trying to figure out how many degrees of separation a superhero is from Spider-Man, you're not going to find built-in support for operations like that. However, it does support the Pregel API for traversing a graph. And that allows you to write your own code and develop your own algorithms that sort of live within GraphX. And that gives you the flexibility to do those more complicated things that you might dream up. They just don't come out of the box. You have to think through it and think creatively, as we had to do when we did this using DataFrames or using RDDs. So GraphX introduces a couple of new data types, the VertexRDD and the EdgeRDD, as well as the Edge data type. And that's how we represent the vertices and the edges between them that make up a graph. Now GraphX is a little bit of a holdout. It's still written on the RDD API, even though Spark itself has been trying to migrate more and more toward DataFrames and Datasets. Quite honestly, GraphX has kind of fallen by the wayside. It hasn't really seen a lot of active development. Yet it's still a core piece of Spark itself, so I'm covering it here. But you're seeing it more and more being replaced by other things, or just not being used much at all. The real world, it turns out, doesn't really have that much use for GraphX, and as a result, it's kind of been neglected, in all honesty. So GraphX is still built on RDDs, yet another reason to learn RDDs. There is an alternative package called GraphFrames that is built on the new DataFrame API, but it's not really released yet. It's at version like 0.8 last time I checked. So at some point we might see GraphFrames replace GraphX in Spark, but for now we have the RDD-based GraphX to work with.
And you'll find that GraphX code looks a lot like any other RDD Spark code for the most part. And actually once you have a graph built, dealing with it looks an awful lot like Spark SQL anyway, so it's not as bad as it sounds. Creating a VertexRDD is pretty straightforward, really. You just have to return a tuple that includes a vertex ID, some unique numerical identifier, as its first field, and whatever data you want to associate with that. In this little snippet of code here that we're taking from our example, you're seeing that we're actually wrapping that in an Option. And that's how we deal with null values in Scala. We didn't really talk about that before, but you see that we're defining an Option of a VertexId and a String, and that means that we have the option of returning nothing. So you'll see that in the case where we have a valid result, we're returning Some with a tuple that consists of a vertex ID and the data associated with that vertex. If we have an invalid entry there, we return None. And that just means there is no result. It's basically the Scala equivalent of null. And this is useful because if you're calling flatMap on an RDD and your function returns None, that just gets discarded, and that's OK. So that's a way of dealing with the case of not returning anything out of a flatMap operation. In the case of our data, we have some data lines that are invalid. It turns out that any hero ID above 6486 is not a real character, so we need to discard those. And that's the case where we return None. Creating an edge is also pretty straightforward. All you do is create an Edge object containing the nodes that it connects. So in this example here, you can see that we're creating an edge between a given hero ID that starts the beginning of a line in our data file for the Marvel superhero dataset, and we construct a new edge between that and every superhero that hero is connected to, as defined by that line of text.
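Here is a small plain-Scala sketch of the Option and flatMap pattern just described, so it can be tried without Spark. The local VertexId alias and Edge case class are stand-ins that mimic the shape of the GraphX types; the two line formats (an ID followed by a quoted name, and a space-separated list of IDs) are assumptions about the data files; and the function names are made up. The 6486 cutoff comes from the explanation above.

```scala
import scala.util.Try

object HeroParsingSketch {
  type VertexId = Long // stand-in; GraphX's own VertexId is also a Long

  // Stand-in for GraphX's Edge: two vertex IDs plus an attribute.
  final case class Edge(srcId: VertexId, dstId: VertexId, attr: Int)

  // Assumed name-line format: 5306 "SPIDER-MAN/PETER PAR"
  // Some((id, name)) for a usable line, None for anything to discard.
  def parseName(line: String): Option[(VertexId, String)] = {
    val fields = line.split('"')
    if (fields.length > 1)
      Try(fields(0).trim.toLong).toOption
        .filter(_ <= 6486L) // IDs above 6486 are not real characters
        .map(id => (id, fields(1)))
    else None
  }

  // Assumed graph-line format: the first ID is the hero, the rest are its connections.
  def makeEdges(line: String): List[Edge] =
    line.trim.split("\\s+").filter(_.nonEmpty).toList.map(_.toLong) match {
      case origin :: others => others.map(dst => Edge(origin, dst, 0))
      case Nil              => Nil
    }
}
```

Calling `lines.flatMap(HeroParsingSketch.parseName)` on a collection of lines keeps only the Some results and silently drops the None ones, which is the behavior the flatMap on the RDD relies on.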
So an edge, very straightforward. It's just an Edge object that consists of two vertex IDs and some additional information that you might want to associate with that as well. Pretty straightforward again. And to construct a graph, again, that's straightforward. You just construct a Graph object, and you construct it with the list of vertices that you want to have in it and the edges between those vertices. And that's pretty much it. You'll probably want to cache that graph because you're probably going to want to do a bunch of operations on it. And caching it ensures that it will remain in memory, which can help Spark optimize things when you do stuff with that graph later on. So doing stuff with a graph is also pretty straightforward once you have it constructed, although sometimes it's useful to see some sample code to start from. For example, if you want to take the top 10 most connected heroes, we could call graph.degrees to get those degrees of connectedness and join those with the vertices themselves, sort the given results by the field that corresponds to the number of connections that they have in descending order, take the top 10, and you're done. So that's an easy way to figure out who the most connected superheroes are in just one line of code once you've constructed that graph object, so a little bit easier. The syntax is a little bit harder to follow here, so the jury's out, I think, on whether that's actually a more straightforward way of coding it, but it works. So that's one way of using GraphX. But as we said before, it's a lot more flexible when we introduce the Pregel API on top of GraphX. So let's talk about that next and how we can actually extend GraphX to duplicate the results that we got in our degrees of separation example earlier in the course.
63. Using the Pregel API with Spark GraphX: So let's explore in more depth how the Pregel API gives us more flexibility.
On top of GraphX, we can actually recreate what we did with the breadth-first search algorithm in finding the degrees of separation between any two superheroes in our graph of Marvel superheroes that appear together in comic books. And the way Pregel works at a high level is basically that every vertex has the ability to send messages to all of its neighboring vertices. The graph is then processed in iterations called supersteps. And at each superstep, three things happen. Messages from the previous iteration are received by each vertex. Each vertex then runs a program to transform itself based on those messages. And then each vertex sends messages to other vertices, if it wants to, to be picked up in the next superstep. And if you remember how we implemented the breadth-first search algorithm, you're probably already thinking, hey, that maps pretty well to how that algorithm works. And in fact it does. So we can initialize our graph pretty simply here. If you remember right, we just start off by setting all the distances to infinity except for the starting point that we're measuring the distances from. And for that one we're going to set the distance to 0, of course, because the distance between something and itself is 0. We can do that with one line of code in GraphX. We can just say graph.mapVertices and check to see if the ID is equal to the ID of the hero that we want to start from, the place that we're measuring from. If so, we set the initial distance to 0. Otherwise we set that vertex attribute to positive infinity. And then we take advantage of that messaging capability to fan out from that initial node. So we start from our initial starting point there, and we have to add a little bit of code to define what happens as we send out those messages during that messaging step. So in this case, we're looking for nodes that are not positive infinity, ones that we've already processed.
And for those, we're going to fan out to their destination IDs, their neighbors, passing along the attribute of that initial node plus 1. So that's going to increment the distance count as we go and send that next round of messages for that superstep. So essentially, Pregel will work its way through this graph, fanning its way out. Along the way it looks for nodes that are not infinity, and it goes ahead and fans those out further, incrementing the distance as it finds its way out. So Pregel kind of handles a lot of the mechanics of traversing the graph for us, which is handy for something like this. Now, don't get too hung up if you're not following this. You're probably never going to have to do this exact same algorithm in the real world, right? The bigger point here, though, is that if you have a graph of information that you need to process in Spark, and the built-in capabilities of GraphX don't do what you want, sometimes the Pregel API will let you do it if you just think a little bit more creatively and deeply about it. It's just another tool to have in your tool chest. The other thing that we have to do in BFS is preserve the minimum distance at each step. So we want to make sure we're not backtracking on ourselves and getting these erroneously high distances between nodes because we found these longer paths to get there. So we can write a vertex program that will check for that, and it will preserve the minimum of the distance it receives and the distance it already has. So as a vertex receives those messages from its neighboring nodes, if it's getting a distance that's actually larger than what's in there already, it will preserve the smaller of those two distances. And furthermore, we do a reduce operation as well. So if we actually have multiple messages received for the same vertex in the same pass, that will catch that case and preserve the minimum distance there as well. So putting it all together is pretty straightforward.
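If the superstep mechanics feel abstract, this rough sketch simulates them with plain Scala collections instead of GraphX, so it runs without Spark. Treat it as an illustration under assumptions, not the course script or the GraphX implementation. The names `vprog`, `sendMsg`, and `mergeMsg` echo the three roles described above, while the `Edge` case class, the `bfs` helper, and the tiny sample graph are hypothetical. It also simplifies things: every reached vertex re-sends on every pass, and the loop stops once a pass changes nothing.

```scala
object PregelSketch {
  // Local stand-ins; not the GraphX classes.
  type VertexId = Long
  final case class Edge(srcId: VertexId, dstId: VertexId)

  // Vertex program: keep the smaller of the stored distance and the incoming one.
  def vprog(id: VertexId, dist: Double, msg: Double): Double = math.min(dist, msg)

  // Send step: a vertex with a finite distance offers (distance + 1) to its neighbor.
  def sendMsg(e: Edge, dist: Map[VertexId, Double]): Iterator[(VertexId, Double)] = {
    val d = dist(e.srcId)
    if (d != Double.PositiveInfinity) Iterator((e.dstId, d + 1)) else Iterator.empty
  }

  // Reduce step: several messages to one vertex in the same pass collapse to the minimum.
  def mergeMsg(a: Double, b: Double): Double = math.min(a, b)

  def bfs(vertices: Set[VertexId], edges: List[Edge], root: VertexId,
          maxSupersteps: Int): Map[VertexId, Double] = {
    // Initialization: 0 at the starting vertex, infinity everywhere else.
    var dist: Map[VertexId, Double] =
      vertices.map(v => v -> (if (v == root) 0.0 else Double.PositiveInfinity)).toMap
    var step = 0
    var changed = true
    while (changed && step < maxSupersteps) {
      // Gather this pass's messages and merge those aimed at the same vertex.
      val inbox: Map[VertexId, Double] = edges
        .flatMap(e => sendMsg(e, dist))
        .groupBy(_._1)
        .map { case (id, msgs) => id -> msgs.map(_._2).reduce((a, b) => mergeMsg(a, b)) }
      // Run the vertex program on every vertex that received a message.
      val next = dist.map { case (id, d) =>
        id -> inbox.get(id).map(m => vprog(id, d, m)).getOrElse(d)
      }
      changed = next != dist
      dist = next
      step += 1
    }
    dist
  }

  // Made-up graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, 4 -> 5; vertex 6 is unreachable.
  val sampleEdges = List(Edge(1, 2), Edge(1, 3), Edge(2, 4), Edge(3, 4), Edge(4, 5))
  val result: Map[VertexId, Double] = bfs(Set(1L, 2L, 3L, 4L, 5L, 6L), sampleEdges, 1L, 10)
}
```

In this toy graph, vertex 4 gets two messages in the same pass, one via vertex 2 and one via vertex 3, which is exactly the case the reduce step handles. Vertex 6 never receives a message, so its distance stays at infinity.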
It's not a whole lot of code, really. So we can just define our initial graph in GraphX with the distance set to 0 where we're starting; otherwise it's infinity. We then call initialGraph.pregel, and that one function call allows us to define all the attributes of Pregel that we need. So we pass in that vertex program that makes sure we always preserve the minimum value received for distance there. We set up that messaging system where we say we're going to look for non-infinite nodes and fan our way out from them, incrementing that distance as we go. And finally, we have that reduce operation that says if you have more than one message coming in at once, again, we're going to preserve the minimum distance that's coming in. So those three components of Pregel are passed in as a single function call to the Pregel API there. Again, first we have the vertex program, then we have the messaging function, and then we have the reduce operation. And that's all you need to do. So let's go out and actually see if it works.

64. [Activity] Superhero Degrees of Separation using GraphX: So let's dive in and use the power of GraphX and Pregel to solve our degrees of separation problem. So open up the GraphX script here and we'll take a look. Let's see what's going on here. It is a little bit simpler than the RDD-based implementation, but there's still a fair amount of code to talk about here, right? So let's go through it. So let's jump down to the main function and start there. We set up our SparkContext this time. Remember, GraphX is still built on the RDD API, so we're not using a SparkSession here. We're using a SparkContext instead. Again, there's a new version called GraphFrames in development, but it's not really officially released yet. So again, GraphX is a little bit neglected, but GraphFrames hopefully is on the way. However, GraphX is still useful, and as you can see, it's not that hard to use.
So we start off by reading in the Marvel-names file here, and we call the parseNames function here with a flatMap. You can see that's all that's going on here. It's returning a tuple of the vertex ID, which is our superhero ID, and a string that corresponds to that ID's name. Nothing fancy. We just split the line up based on the quote character there to extract the name. And if we have more than one field, which is just a little sanity check on the data there, we trim that initial field down to strip off any extraneous spaces, convert it to a long integer, and check that it's a valid ID. If so, we return that integer along with the name of the hero that's associated with that ID. And again, we're using the Option format here to say that None is a valid result. So by saying Some, that means that this function returns a valid result, but we can also return None. It could be that this line represents an invalid hero ID or improperly formatted data. And in that case, by returning the value None as part of our Option there, that tells flatMap that we just want to ignore that line and not add anything to the resulting RDD. Next, what do we do? We build up our edges. So now we're going to parse the Marvel-graph.txt file. And as you recall, that's just a list of lines that have a hero ID followed by a list of all the IDs that that hero has been seen with in the same comic book. So makeEdges takes care of that. Not a whole lot going on here either. Basically it takes that entire input line as a string, and it returns a list of edges that consist of integers that correspond to actual vertex IDs for each hero. So we're going to build up a ListBuffer here and populate that list with Edge objects. So this is basically the format that GraphX is going to expect from us. We split up that raw input line into individual fields. And from those fields, we extract the first one, which is going to be our origin, the hero that we're talking about.
And then for each subsequent field, we create a new Edge object that consists of the vertex ID of the superhero that we're starting from and the vertex ID of each connection that that hero has. We then return the resulting list of edges back to our main script here. So now we can build up the graph itself, which is pretty straightforward. We just construct a new Graph object, giving it the list of vertices and edges that we specified. And we're just passing in those vertices as tuples, basically, where the first value is the vertex ID. That's a valid thing to do here. That just works. And we cache the resulting graph, because presumably we're going to do a lot of work with that graph and we want to keep it in memory. As we saw in the slides, we can quickly compute the top 10 most connected superheroes with this one line of code. We just call graph.degrees and join it with the entire list of vertices, meaning that we want to get the connections for every vertex that we know about, and sort the results by the field that corresponds to the number of connections. Take the top 10, print them out, that's it. And now we can actually do breadth-first search using Pregel, which is actually not that hard. So we start off by defining the ID of our root vertex here. We happen to know that vertex ID 5306 corresponds to Spider-Man, and we can iterate through, computing the distances from everyone to Spider-Man. Now, we start off by initializing the graph. Again, we just do a mapVertices operation here, checking each vertex and seeing if its ID is equal to Spider-Man's. If so, we set the initial distance to 0; otherwise we set it to infinity. And now in this one command, we set up the entire Pregel world, if you will. We call initialGraph.pregel and we pass in a set of functions, starting, first of all, with our vertex program.
So again, its purpose in life is just to maintain the minimum distance as messages are received, choosing between the incoming message and the current value of a given node. We also define our sendMessage function, like we talked about before. That's going to propagate out to all the neighbors with the distance incremented by one. So for any node that does have a current distance associated with it, we fan ourselves out, incrementing that distance as we go. And finally, we define a reduce operation that will preserve the minimum value of those messages received by a vertex if multiple messages are received by one vertex in the same pass. So we have two different bits here that make sure that we always maintain the shortest distance that we encounter as we traverse the graph. And this little snippet here defines how we traverse the graph and increment that distance by one with each pass. At that point, we can go out and just print out the first 100 results there. We can say bfs.vertices.join(verts), again just saying we want to evaluate that graph on all of our superheroes. Take the first 100 and print them out, and we'll see what it looks like. And if we want to specifically recreate the results from our degrees of separation exercise that we did using RDDs without GraphX, we can just do a little filter operation there to pluck out the result for superhero ID 14, which is the character ADAM 3,031, and print out that line specifically. So let's kick this off and see if it works. GraphX, run. And what's cool is that it's pretty quick too. So once it gets going, we shouldn't have to wait too long at all. Off it goes. And we have our top 10 most connected, and we have our degrees of separation. So if I remember right, that's a lot quicker than our implementation just using straight-up RDDs. So GraphX is obviously doing some pretty cool optimizations there under the hood as well. Let's check our results, make sure they make sense. So I'm going to scroll up.
In the top 10 most connected, the top result is still Captain America. That's what we saw earlier in the course, so that checks out. And then degrees of separation from Spider-Man for everyone. Interestingly, it's one or two for everyone, I think. You know, that legend of everybody being connected to Kevin Bacon is probably true, right? Oh, looks like Boom is actually pretty far out at three. He's more obscure, even more obscure than ADAM 3,031, which again comes back as two degrees of separation from Spider-Man. That's also the same result we got before. So that's pretty exciting. We actually got the same result. It works. And this is a little bit more of a principled and straightforward way of computing that graph operation for degrees of separation, using GraphX and the Pregel API. And with that, we've covered all the core components of Spark. So congratulations, with GraphX, we're wrapping things up. So let's talk about where to go from here.