Apache Spark 3 with Scala: Hands On with Big Data! | Frank Kane | Skillshare

Apache Spark 3 with Scala: Hands On with Big Data!

Frank Kane, Founder of Sundog Education, ex-Amazon

53 Lessons (7h 24m)
    • 1. Introduction, and Getting Set Up (16:19)
    • 2. [Activity] Create a Histogram of Real Movie Ratings with Spark! (14:39)
    • 3. Please Follow Me on Skillshare (0:16)
    • 4. [Activity] Scala Basics, Part 1 (12:52)
    • 5. [Exercise] Scala Basics, Part 2 (9:41)
    • 6. [Exercise] Flow Control in Scala (7:18)
    • 7. [Exercise] Functions in Scala (8:47)
    • 8. [Exercise] Data Structures in Scala (16:38)
    • 9. Introduction to Spark (8:40)
    • 10. The Resilient Distributed Dataset (11:04)
    • 11. Ratings Histogram Walkthrough (7:33)
    • 12. Spark Internals (4:42)
    • 13. Key / Value RDD's, and the Average Friends by Age example (12:21)
    • 14. [Activity] Running the Average Friends by Age Example (7:58)
    • 15. Filtering RDD's, and the Minimum Temperature by Location Example (6:43)
    • 16. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum (10:10)
    • 17. [Activity] Counting Word Occurrences using Flatmap() (8:59)
    • 18. [Activity] Improving the Word Count Script with Regular Expressions (6:41)
    • 19. [Activity] Sorting the Word Count Results (8:10)
    • 20. [Exercise] Find the Total Amount Spent by Customer (3:37)
    • 21. [Exercise] Check your Results, and Sort Them by Total Amount Spent (4:26)
    • 22. Check Your Results and Implementation Against Mine (3:26)
    • 23. [Activity] Find the Most Popular Movie (4:29)
    • 24. [Activity] Use Broadcast Variables to Display Movie Names (8:52)
    • 25. [Activity] Find the Most Popular Superhero in a Social Graph (14:10)
    • 26. Superhero Degrees of Separation: Introducing Breadth-First Search (6:52)
    • 27. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark (5:53)
    • 28. Superhero Degrees of Separation: Review the code, and run it! (10:41)
    • 29. Item-Based Collaborative Filtering in Spark, cache(), and persist() (8:16)
    • 30. [Activity] Running the Similar Movies Script using Spark's Cluster Manager (14:13)
    • 31. [Exercise] Improve the Quality of Similar Movies (2:41)
    • 32. [Activity] Using spark-submit to run Spark driver scripts (6:58)
    • 33. [Activity] Packaging driver scripts with SBT (13:14)
    • 34. Introducing Amazon Elastic MapReduce (7:11)
    • 35. Creating Similar Movies from One Million Ratings on EMR (11:33)
    • 36. Partitioning (5:07)
    • 37. Best Practices for Running on a Cluster (5:31)
    • 38. Troubleshooting, and Managing Dependencies (9:08)
    • 39. Introduction to SparkSQL (7:08)
    • 40. [Activity] Using SparkSQL (7:00)
    • 41. [Activity] Using DataFrames and DataSets (6:38)
    • 42. [Activity] Using DataSets instead of RDD's (7:23)
    • 43. Introducing MLLib (9:18)
    • 44. [Activity] Using MLLib to Produce Movie Recommendations (14:35)
    • 45. [Activity] Linear Regression with MLLib (5:55)
    • 46. [Activity] Using DataFrames with MLLib (8:30)
    • 47. Spark Streaming Overview (9:53)
    • 48. [Activity] Set up a Twitter Developer Account, and Stream Tweets (12:44)
    • 49. Structured Streaming (4:17)
    • 50. GraphX, Pregel, and Breadth-First-Search with Pregel (10:38)
    • 51. [Activity] Superhero Degrees of Separation using GraphX (8:59)
    • 52. Learning More, and Career Tips (4:15)
    • 53. Let's Stay in Touch (0:46)

About This Class

New! Updated for Spark 3.0!

"Big data" analysis is a hot and highly valuable skill, and this course will teach you the hottest technology in big data: Apache Spark. Employers including Amazon, eBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think, and you'll be learning from an ex-engineer and senior manager from Amazon and IMDb.

Spark works best when using the Scala programming language, and this course includes a crash course in Scala to get you up to speed quickly. For those more familiar with Python, however, a Python version of this class is also available: "Taming Big Data with Apache Spark and Python - Hands On".

Learn and master the art of framing data analysis problems as Spark problems through over 20 hands-on examples, and then scale them up to run on cloud computing services in this course.

  • Learn the concepts of Spark's Resilient Distributed Datasets

  • Get a crash course in the Scala programming language

  • Develop and run Spark jobs quickly using Scala

  • Translate complex analysis problems into iterative or multi-stage Spark scripts

  • Scale up to larger data sets using Amazon's Elastic MapReduce service

  • Understand how Hadoop YARN distributes Spark across computing clusters

  • Practice using other Spark technologies, like Spark SQL, DataFrames, DataSets, Spark Streaming, and GraphX

By the end of this course, you'll be running code that analyzes gigabytes worth of information – in the cloud – in a matter of minutes. 

We'll have some fun along the way. You'll get warmed up with some simple examples of using Spark to analyze movie ratings data and text in a book. Once you've got the basics under your belt, we'll move to some more complex and interesting tasks. We'll use a million movie ratings to find movies that are similar to each other, and you might even discover some new movies you'll like in the process! We'll analyze a social graph of superheroes, and learn who the most "popular" superhero is – and develop a system to find "degrees of separation" between superheroes. Are all Marvel superheroes within a few degrees of being connected to Spider-Man? You'll find the answer.

This course is very hands-on; you'll spend most of your time following along with the instructor as we write, analyze, and run real code together – both on your own system, and in the cloud using Amazon's Elastic MapReduce service. 7.5 hours of video content is included, with over 20 real examples of increasing complexity you can build, run and study yourself. Move through them at your own pace, on your own schedule. The course wraps up with an overview of other Spark-based technologies, including Spark SQL, Spark Streaming, and GraphX.

Enroll now, and enjoy the course!

"I studied Spark for the first time using Frank's course "Apache Spark 2 with Scala - Hands On with Big Data!". It was a great starting point for me,  gaining knowledge in Scala and most importantly practical examples of Spark applications. It gave me an understanding of all the relevant Spark core concepts,  RDDs, Dataframes & Datasets, Spark Streaming, AWS EMR. Within a few months of completion, I used the knowledge gained from the course to propose in my current company to  work primarily on Spark applications. Since then I have continued to work with Spark. I would highly recommend any of Franks courses as he simplifies concepts well and his teaching manner is easy to follow and continue with!  " - Joey Faherty

Transcripts

1. Introduction, and Getting Set Up: Hey, I'm Frank Kane, and I spent over nine years at amazon.com and imdb.com making sense of their massive data sets. I want to teach you about the most powerful technology I know for wrangling big data in the cloud today: that's Apache Spark, using the Scala programming language. Spark can run on a Hadoop cluster to spread out massive data analysis and machine learning tasks in the cloud, and knowing how to do that is a very hot skill to have right now. We'll start off with a crash course in the Scala programming language. Don't worry, it's pretty easy to pick up as long as you've done some programming or scripting before. We'll start with some simple examples, but work our way up to more complicated and interesting examples using really massive data sets. By the end of this course, you'll have gone hands-on with over 15 real examples, and you'll be comfortable writing, debugging, and running your own Spark applications using Scala. Some of them are pretty fun. We'll look at a social network of superheroes and use that data to figure out who is the Kevin Bacon of the superhero universe. We'll also look at a million movie ratings from real people and actually construct a real movie recommendation engine that runs on a Spark cluster in the cloud using Amazon's Elastic MapReduce service. We'll also do some big machine learning tasks using Spark's MLlib library, and we'll do some graph analysis using Spark's GraphX library. So give it a try with me. I think you'll be surprised at how just a few lines of code can kick off a massive, complex data analysis job on a cluster using Spark. So let's get started. The first thing we need to do is install the software we need, so let's get that out of the way right now. If you'd like to follow along with the hands-on activities in this course, let's get started by installing all the software you need. That will include a JDK, Spark, and the Scala IDE with Eclipse. Some people ask, "Frank, why can't you just give us an instance in the cloud somewhere that has this all set up for us already?" Well, the answer is that running Spark clusters is not free, but it's pretty easy to actually install Spark on your own desktop, and the advantages are that it's free, and it will still work even if you're not on the Internet. So let me walk you through the steps involved. We'll start by installing a Java development kit, taking care to install the correct version for the version of Scala we want to use. We'll then install Spark itself, which is surprisingly easy, set up a few environment variables that we need for it to work properly, and do a little bit of an extra step that we need on Windows. We'll then install the Scala IDE, which comes with a copy of Eclipse that will let us edit our code in this course in a graphical manner. If you'd like to follow along, head on over to sundog-education.com/spark-scala and you'll find written setup steps there, if you want to follow along in written form instead of trying to keep up with me in this video. You will also find steps there for Mac and Linux users, because I'm about to go through this on Windows in the rest of this video. Most of the steps are pretty similar, but things could be a little bit different, so if you are on Mac or Linux, be sure to refer to that web page for instructions that are specific to your platform. Let's dive in. Here's that web page I mentioned at sundog-education.com/spark-scala.
This has the written setup steps for every platform, so if you want to follow along at your own pace, you can do so here. While you're here, you'll find links to join our Facebook group for the site, if you'd like to collaborate with your fellow students, and you'll also find links for following us on social media on Twitter, Facebook, or LinkedIn, so take advantage of that as well. I encourage you to explore the rest of our site, too. But with that, let's get started. The first thing we need is a Java development kit, because Scala actually executes as Java bytecode; that's what it compiles down to. So we need a JDK in order to run Scala code. Either type "JDK" into Google or just follow this link here, and it will take you to where you can download the latest Java. But you probably don't want the latest Java. It seems like they have had a new major release of Java every week lately, and the world of Scala just hasn't kept up with it. So let's make sure we're installing the right version for the version of Scala that we'll be using. In this course, we're going to be using the Scala IDE, so if you take a peek over at scala-ide.org and follow the link to download the IDE (we're not going to download it just yet, but I want to check its requirements), you can see that it says very explicitly that it requires JDK 8, not 13, which is the current version. So be sure to check there and make sure we're getting the right version of the JDK. Now that I know I need JDK 8, I can head back to the Java download page and look for JDK 8. There it is, Java SE 8, whatever build it is, and we want to go ahead and download the JDK. We will need to agree to the license agreement and then select the installer for whatever operating system you're using; for me, that is Windows 64-bit. Oracle's been getting more strict about their licenses lately, so we actually need to create an account. Go ahead and take care of that if you need to, and once you've signed in with your Oracle account, the download will commence. We just have to wait for that to come down, run the installer, give it whatever permissions it needs, and walk through the installation steps. Now, it's generally a good idea to avoid spaces in paths when you're dealing with software like Scala and Spark that was written for Linux, so let's go ahead and change that install path from the long path that includes the space in "Program Files" to just C:\jdk; that should do. Remember that path, because we'll need it later. We also want to change the path for the JRE to one that does not have a space in it; let's change that to just C:\jre. All right, so we have Java 8 successfully installed. We can close out of this, and next let's install Apache Spark. To do that, we'll head on over to spark.apache.org and click on the big friendly "Download Spark" button. This course has been updated to Spark version 3, so go ahead and choose whatever the current release of Spark 3 is. Right now, that's the preview release, because we are ahead of the curve here. You want a version pre-built for Apache Hadoop 2.7, or whatever version they offer, and just use that download link to actually download Spark itself. That should bring you to a page to select a mirror site to do the actual download. I'll just go with the one they suggest here. That's fine, and that should come down pretty quickly.
Now, you might notice that this is in a .tgz format. That is a Unix style of compression that Windows doesn't necessarily have built-in support for, so you might need to install an application that can actually expand .tgz files first if you don't have one already. One that I recommend is WinRAR, and you can get that from rarlab.com. Just go to the downloads page and get whatever version you need for your operating system; that will be able to expand the .tgz file once you've installed it. Another alternative is something like 7-Zip, which would work as well. If you have either of those programs installed, you should be able to decompress this file. So we've downloaded the Spark 3.0 .tgz file; let's go take a look at it. Since I've already installed WinRAR, I can just right-click on it and say Extract to this directory. Now, we probably don't want to keep that in our Downloads directory, so let's put it some place where we'll actually remember where it is and not accidentally delete it. I clicked into that extracted directory, and we see that there's another subdirectory within it. Inside there are the actual contents of Spark itself; this is Apache Spark. Let's select the entire thing (I'm just going to hit Ctrl-A and then Ctrl-C to copy it), and now I'm going to go back to This PC, to my C: drive, and create a new folder that's just called "spark". We'll go into that folder and copy in the contents of the Spark distribution that we just downloaded. All right, so now we have Spark installed on our PC. We do need to do a couple of extra things, though, for Windows specifically. Spark, as you saw, was built for a specific version of Hadoop, but we are not running Hadoop on our local desktop here, so we need to kind of trick Spark into thinking that Hadoop is present when it really isn't. To do that, go back to the setup page and you'll see a link to winutils.exe that I've posted there. Go ahead and click on that to download it, and what we're going to do is move it into a C:\winutils\bin folder that we have to create. So, first of all, let's take a look at it in our Downloads folder. I'm going to open up a new File Explorer window, go to our C: drive, create a new folder called "winutils", and within that winutils folder create another new folder called "bin". Then let's copy that winutils.exe file from our Downloads folder and put it into the winutils\bin folder that we just made. All right. That's not quite enough, though. We also need to trick it into thinking it has the necessary permissions to run. So let's open up a Windows command prompt. To do that, you can just go to your Start menu, go down to Windows System specifically, and open up Command Prompt. Let's start by cd'ing to the directory that we just installed winutils into, so that's cd C:\winutils\bin. Do a dir just to make sure you're in the right place; there's our winutils.exe file. And we need to make a tmp\hive directory in order to fake out Windows into thinking we're on a Linux system with Hadoop installed. So let's say mkdir C:\tmp\hive. Remember, this is all Windows-specific stuff; if you're on Mac or Linux, you don't need to do any of this. And now we can use that winutils executable to actually change the permissions on that directory to what they need to be.
So we'll say winutils.exe chmod 777 C:\tmp\hive. All right, that's all of the Windows-specific stuff we need to do for Spark. The next thing we need to do is configure Spark to not spam us with a bunch of informational messages all the time when it runs, so let's take care of that. Go back to your C: drive, go to our spark directory, and go into the conf directory; this is where the configurations are. We're going to take the log4j.properties.template file, click on it, and edit the file name to remove the .template part, which makes it an actual properties file that gets applied to Spark itself. With that changed, I can open it up. You might need to right-click and say Open With if you don't already have it associated with some sort of text editor. What we're going to do is look for the line that says log4j.rootCategory=INFO and change that INFO to ERROR. That way I don't see a bunch of warning and informational messages when I'm running my Spark applications, because that's annoying. Save that and close out of it. All right. Next, we need to set up some environment variables. Typing "system" into search is probably the quickest way to get there; look for the System control panel, open it up, go to Advanced System Settings, and from there click on Environment Variables. We need to set up a few new variables under the user environment variables. In the top pane, click on New, and we'll call this first one SPARK_HOME. This just tells the system where Spark is installed, and as you recall, we put that in C:\spark. We will also set up a JAVA_HOME (make sure it's all caps), and this will be the path we installed the JDK to; if you recall, we installed that to C:\jdk. Finally, we will set up a Hadoop home, HADOOP_HOME; that's where we installed winutils, which is C:\winutils. Lastly, we need to modify our PATH variable so that the executables can be found. If you scroll down to Path, you should have one already; if not, you can create one. Go ahead and edit it, and we'll add a couple of new entries to that environment variable. The first one will be %SPARK_HOME%\bin, which is the path to the Spark executables, and we'll create another one, %JAVA_HOME%\bin, which tells it where to find the Java executables. Click OK, hit OK on the environment variables and OK on the system properties, and we can close out of the System control panel at this point. Next, let's install the Scala IDE itself. As you recall, we went to that page earlier; it's at scala-ide.org. Follow the download button and download the IDE for your operating system; for me, that is Windows 64-bit. Take a look at that file and you'll see it's just a zip file. It's not really an installer, just a collection of files. So go ahead and right-click on it to extract it, and you'll see that we have an eclipse folder, because all the Scala IDE is, is a copy of Eclipse with a plugin installed for Scala. So let's move that someplace a little bit safer than the innards of our Downloads folder.
I'm going to select that eclipse folder, hit Ctrl-X to cut it, and move it to our C: drive and paste with Ctrl-V, so we can clean up our Downloads at this point and close out of everything. Now let's open up a Windows command prompt. Again, go down to Windows System, and there you'll find Command Prompt. We do need to run this as an administrator to make sure it has the appropriate permissions, so make sure you right-click on Command Prompt, go to More, and say Run as Administrator; that's very important. Give it the permissions that it wants, and let's run some Spark code. First, let's cd into C:\spark, which is where we installed Spark, and from here let's do a dir to see what we have. What we're going to do is write a really simple Spark application that just counts up how many lines are in a file, and we have this handy-dandy README.md file here that we can use for experimentation. To actually get into writing some Spark code, type in spark-shell, and you should be seeing a scala> prompt like this. If not, that means you messed something up along the way somewhere; all it takes is one little missed character or a missed percent sign in an environment variable. So if you're not seeing this, go back and check your prior steps. If you are seeing it, congratulations, you did everything right. Now, a word of warning: if you do need to go back and adjust your environment variables, be sure to close out of this command prompt and open a new one, again remembering to do it with administrator privileges. Command prompts only pick up changes to environment variables when you open new ones. But if you got to this point, we can actually start writing some Spark code. Let's start by typing val rdd = sc.textFile("README.md"), with a capital F in textFile. What this will do is load up that text file into what's called an RDD, and we'll talk about that a lot more later on in the course. That worked, so now we can say rdd.count(), and that should fire off a Spark job to count up how many lines are in that text file. The answer is 109. It's just that simple. So congratulations, you just ran your very first Spark application using Scala on your own local PC. That's pretty cool, and it's free. Even better, let's do something more exciting in the next lecture.
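For reference, that entire spark-shell session boils down to just these two lines, typed at the scala> prompt (where sc, the SparkContext, is already provided for you). The count of 109 is what appears in the video; the exact number depends on the README.md shipped with your Spark release.

    // Load README.md from the current directory (C:\spark) into an RDD of lines
    val rdd = sc.textFile("README.md")

    // Fire off a Spark job to count the lines in that file (109 in the video)
    rdd.count()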
2. [Activity] Create a Histogram of Real Movie Ratings with Spark!: So let's do something a little bit more interesting and actually use that Scala IDE that we installed earlier. What we're going to do is take a real-world data set of movie ratings and analyze it using Spark. We'll break it down into how many one-star, two-star, three-star, four-star, and five-star ratings exist in this data set, as if we were going to generate a histogram for it. To get started, let's create a directory to keep the project for this course in. I'm on Windows, so I'll open up a File Explorer, go to my C: drive, and create one there: click on New Folder and call it "SparkScala". You can put this wherever you want; if you're on a Mac or Linux, wherever you normally create new directories within your profile is fine. Let's open that up so we have it handy. All right, now let's go get that data. Go to grouplens.org, and this will take you to the website for the MovieLens data set. This is real data collected by real people about their ratings for lots and lots of movies, and they keep it updated to this day. This group is actually instrumental in research on recommender systems, but that's another topic for another course. Click on the datasets tab and you'll see what they have to offer. We're going to use the 100K data set, so scroll down to the older data sets and the MovieLens 100K data set. We're choosing that because we're going to be running this locally on your desktop and not on a big cluster, so we don't want the data to be too big at this point. It's also a nice, stable release where we know what to expect. Go ahead and click on the ml-100k.zip link and let that come down; it's pretty small. Open it up and decompress it: right-click and Extract All. Let's take a look at what's inside. If we crack open that ml-100k folder, you can see the data set itself contains a bunch of different files. The movie ratings data itself is in the u.data file, which is just a text file where each line represents a rating: a given user, a movie, the actual rating value, and a timestamp for that rating. There are other files in here as well, such as u.item, which contains information that maps movie IDs to their actual titles, so you can read the data in a human-readable format and understand what those identifiers really mean. Anyway, let's take this folder: go back up a level and copy the ml-100k folder (select the folder and hit Ctrl-C), then go back to the folder that we made for the course, under C:\SparkScala, and paste it in there with Ctrl-V. All right. Next, let's get the course materials, which include all the scripts and data that we're going to use within this course, except for MovieLens; we just don't have the license to redistribute the MovieLens data as part of the course itself. For everything else, head on over to media.sundog-soft.com/SparkScala/SparkScala3.zip, and pay attention to capitalization there; there are capital S's, and that does matter. No typos; it all matters. That should come down; open it up, Show in Folder, and we'll right-click and extract that as well. All right, so this is where we're going to pluck things out of as we go through the course, so just remember where this directory full of all the course materials is so you can come back to it later. As we go through the course, we're going to be importing scripts and data from this folder into the project itself, so just remember where it is for now.
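Since u.data will come up throughout the course, here is a quick, hypothetical illustration of the tab-separated format just described. The sample line below is made up for the example, but the field order (user ID, movie ID, rating, timestamp) matches the description above.

    // A hypothetical line from u.data: user ID, movie ID, rating, timestamp, separated by tabs
    val line = "196\t242\t3\t881250949"

    // Split on the tab character; field 2 (zero-indexed) is the rating
    val fields = line.split("\t")
    val rating = fields(2)   // "3"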
Next, let's open up that Scala IDE we installed. I'm done with the browser for now. Let's go to where we installed it; that was under C:\eclipse, as I recall. To make life easier, you probably want a shortcut to this handy, so you can right-click and drag it to your desktop if you want and create a shortcut there, so you can get to it more easily going forward. Go ahead and open that up, and for our workspace we're going to specify the directory for the class project that we just made earlier. Hit Browse and find that SparkScala directory we made; that's going to be under This PC > C: > SparkScala. Click on Launch. All right, let's maximize this window so we have more space to look at, and let's go ahead and create a project for our course that we'll work within. Follow along with me carefully here; every little step matters, every little detail. Let's start by going to File > New > Scala Project. We need to give this project a name, how about SparkScalaCourse, and click Finish. Next, let's right-click on that and create a package within it. So right-click on SparkScalaCourse, and then we'll say New > Package. Even though it says it's a Java package, it will really be Scala under the hood; like we said, Scala compiles to Java bytecode, so it's all the same thing as far as Eclipse is concerned. We need to give this package a name, and if you're familiar with Java naming conventions, you usually take the domain name of whatever organization you're in and flip it around as the start. So I'm going to set the name of this package to com.sundogsoftware, because that's a website I own, plus .spark, which will be the name of the package itself. Click Finish, and now we have a package within our SparkScalaCourse project. Next, we need to actually put some stuff into this package. Right-click on the package we just made and say Import, select General, open that up, select File System, and click Next. Now we're going to browse to the materials for the course, wherever we downloaded those to. That's still sitting in my Downloads folder; just go to wherever you put it. There it is, SparkScala3, and it's in that subdirectory there. Open it up, and you should see the list of all the scripts and data files for the course that we can now choose from to import into our package. We're going to be working with the RatingsCounter.scala file here, so find that, check it off, and click Finish. All right, so now we can open that up and have a look at the code. Just double-click on the RatingsCounter.scala file that we just imported. There's not much to it; we'll walk through what it does later, but for now just trust that this is Spark code that counts up how many ratings of each type we have. Now, let me call your attention to these nasty-looking red X's. These do mean there is some sort of a problem preventing this code from compiling that we need to resolve, and you can always go to the Problems tab, open them up, and see what they are. It turns out that the immediate issue is that we have not actually included the packages for Spark itself in this project. So we have all this Spark code, but it has nothing to link against; it doesn't know where Spark lives yet. Let's go take care of that. Right-click on the SparkScalaCourse project, go to Properties, and from here click on Java Build Path. Now we need to add all the JAR files for Spark itself, so click on the Libraries tab and say Add External JARs. Navigate to where we installed Spark (for me that's under C:\spark), navigate to the jars folder, click in there and hit Ctrl-A to select everything, and import all of the JAR files for Spark into this project. We could hit Apply, or just hit Apply and Close; that's easier.
And now, if we give it a second, you'll see that those red X's went away. That's cool. It means everything has compiled successfully, and we should be able to run this thing now. Very cool. However, there is one more thing that might go wrong, depending on when you're running this and what versions of Scala and Spark come out in the future. If you are seeing error messages about an incompatible version of Scala, let me tell you where to go to check on that. If you read the release notes on different releases of Spark, they will say it is compatible with a given version of Scala. With Spark 3, it currently supports Scala 2.12, but maybe there'll be a Spark 3.5 or something down the road that takes Scala 2.13 or something, and you might see something about an incompatible version. If you do, just take note of what version it wants in that error message, and change the project to use that version of Scala instead. All you need to do is go back to the project properties (right-click on SparkScalaCourse and click on Properties), and from there go to Scala Compiler. This is where you can specify which version of Scala you want to use. If I wanted to override it, I could say Use Project Settings and change it to the specific version of Scala I needed for the version of Spark that I installed. It turns out that Spark 3 works just fine with the default setting here of the latest 2.12 bundled Scala, so we're good; we can leave it the way it was. Moving on, let's actually run this thing. To do that, go to the Run menu and say Run Configurations. I'm kind of doing this the hard way just to illustrate things, but we'll do it this way for now. Select Scala Application, double-click that, and give it a name; let's call this, I don't know, RatingsCounter. We need to specify what the main class is; that will just be com.sundogsoftware.spark followed by whatever the main class of this project is, in this case RatingsCounter. Type that all in exactly as shown, and then we can just hit Run and see what happens. Holy cow, it worked! That's awesome. So how do we interpret this output? That was pretty quick, too, right? It chugged through those 100,000 ratings using Apache Spark, and if we were on a cluster, it could have actually parallelized that computation across different nodes in the cluster; if we had a much larger data set, that might have been beneficial. We see here that we had 6,110 one-star ratings, 34,134 four-star ratings, for example, and 21,201 five-star ratings. Already we're getting some meaningful information out of this data: we can see that people really reserve one-star ratings for the worst of the worst (it has to be pretty bad for people to give it one star), and the most common rating is actually a four-star rating, with five stars not being as common as you might think. So people weren't quite as generous as they might be in other settings when rating movies on the MovieLens website. Pretty cool stuff. So there you have it: you successfully ran your own Spark application using the Scala IDE, so it's all nice and pretty. Now let's walk through what's going on here at a high level. We'll go into this stuff in more depth later in the course, but just to get your feet wet, as it were, let's take a look at what's going on here.
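Here is a minimal sketch of the kind of script being walked through below. It is reconstructed from the description, so treat the exact names and the ../ml-100k/u.data path as assumptions; the real RatingsCounter.scala in the course materials is the authoritative version.

    package com.sundogsoftware.spark

    import org.apache.spark._
    import org.apache.log4j._

    /** Count up how many of each star rating exists in the MovieLens 100K data. */
    object RatingsCounter {
      def main(args: Array[String]) {
        // Only log errors, so we don't get buried in log spam
        Logger.getLogger("org").setLevel(Level.ERROR)

        // Run locally, using every core of the machine; name the job RatingsCounter
        val sc = new SparkContext("local[*]", "RatingsCounter")

        // Load each line of the ratings data into an RDD
        // (path is an assumption; point it at wherever you copied ml-100k)
        val lines = sc.textFile("../ml-100k/u.data")

        // Each line is user ID, movie ID, rating, timestamp, separated by tabs;
        // extract field 2, the rating, into a new RDD
        val ratings = lines.map(x => x.split("\t")(2))

        // Count up how many times each rating value occurs
        val results = ratings.countByValue()

        // Sort by rating value and print each (rating, count) pair
        val sortedResults = results.toSeq.sortBy(_._1)
        sortedResults.foreach(println)
      }
    }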
First, we declare the actual name of the package that we're living within, and we import all the Spark libraries that we depend on for this job; for us, that's just Spark, SparkContext, and the log4j classes. Now, a quick note here: we're starting off doing things at a lower level in this course, and we're going to work our way up to more modern, higher-level ways of writing Spark code. But I do think it's beneficial to understand things at a lower level, given that Spark 3 still supports these APIs; they're there for a reason, and sometimes it makes sense to go lower level if you're doing something simple like this. So we have our main function here, which lives within our RatingsCounter object, and that's the code that gets executed. We start off by setting our logger to only log error messages. That could be redundant with the change we made to our log4j properties file, but there are settings where that properties file doesn't get read in time, so we explicitly say here that we don't want a bunch of log spam as well. We then create a SparkContext object, and this means that my SparkContext, my Spark environment if you will, is going to be running on my local machine here, and that star means go ahead and use every core of the machine that's available. So it actually could be parallelizing this work across the four cores of my CPU, even though I don't have multiple machines and a real cluster. And we give it a name, RatingsCounter. Now, with one line of code, we use that SparkContext object to load up our data set; the u.data file, remember, is the ratings data itself. Then we call the map function on the resulting object, which we call an RDD, and we'll talk about RDDs in much more depth shortly. For now, just know that you can think of it as sort of a database or a data frame. We're going to take that set of data that we imported from the raw text file and map it to a specific format. Right now, the lines RDD just contains individual lines of text; we need to split them up into their individual fields. That's what this map function does, using what's called a lambda function: it takes each line as a string, splits that string up based on tab characters (tab delimiters), and extracts field number 2, taking that as the rating value for that line. So with one line of code, we've taken all of the raw input from our MovieLens ratings data, extracted the ratings column from it, and created a new RDD called ratings that just has that ratings data. At that point, we can just say ratings.countByValue(), which uses a built-in function to count up how many times each distinct value occurs within that RDD. After that, we just sort the results and print them out, and that's all there is to it. Again, we'll talk about these operations in much more depth, but I just want you to start to wrap your head around what Spark code looks like and how it works. So again, congratulations: you got your first real Spark application running within the Scala IDE. Cool stuff. 3. Please Follow Me on Skillshare: The world of big data and machine learning is immense and constantly evolving. If you haven't already, be sure to hit the follow button next to my name on the main page of this course. That way you'll be sure to get announcements of my future courses and news as this industry continues to change.
4. [Activity] Scala Basics, Part 1: So here's the plan. First, I'm going to teach you a little bit about the Scala programming language; you need that before we can actually start writing Spark code in Scala. First things first: if you already know Scala, that's great, you can skip this whole section. But if not, stick with me to learn some of the basics. All right, let's learn Scala. Just to set expectations: you're not going to be a Scala expert at the end of watching four videos with me. What I'm really trying to do here is get you familiar with the syntax of the Scala programming language and introduce some of the basic constructs: how do I call a function in Scala, how does flow control work, what are some basic data structures I might use with Scala, and to show you enough Scala that it's not going to look scary and intimidating as we go through the rest of the course. With that, let's talk about Scala at a high level first. First of all, why learn Scala? Well, you've probably never heard of it before; maybe you have, but you certainly probably don't know it well. It's mostly used for Spark programming, and it is uniquely suited for Spark because it's structured in a way that lends itself to doing distributed processing of data over a cluster; you'll see why a little bit later on. It's also what Spark itself is built with, so by learning Scala you'll get access to all the latest and greatest Spark features as they come out, and it can usually take a pretty long time for those features to trickle down to, say, Python support within Spark. It's also going to be the most efficient way to run Spark code: by using Scala, you will have the fastest and most reliable Spark jobs that you can possibly create, and I think you'd be pretty surprised at just how much faster and how much more reliable the same Spark job written in Scala is compared to, say, the same Spark job written in Python. So even though it might be tempting to stick with the language you already know, learning Scala is worth the effort, and it's really not that hard. The truth is, the same Spark code in Scala and Python looks very similar at the end of the day. Now, Scala itself runs on top of the Java virtual machine; it just compiles down to Java bytecode and gets run by the JVM. One nice thing about that is that you also have access to all of Java: if there's a Java library you want to pull into your Scala code, you can do that, so you're not limited to what's in the Scala language itself. You can reach down to the Java layer and pull up a Java library you want to use, and we'll do that later in this course, for example, for dealing with regular expressions in a more intuitive manner than we could otherwise. Another key point about Scala is that it is focused around what is called functional programming, where functions are sort of the crux of what we're dealing with: functions get passed to other functions and chained together in ways you might not be used to. But this is really how Spark works at a fundamental level: we take an abstraction over a chunk of data, and we assign it a function to do some processing on that data. Functional programming in Scala makes that very intuitive to do from a language standpoint.
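As a tiny illustration of that "functions get passed to other functions" idea (the names below are just made up for the example):

    // A function literal (a lambda) that doubles a number...
    val doubleIt = (x: Int) => x * 2

    // ...passed as an argument to another function, map, which applies it to every element
    List(1, 2, 3).map(doubleIt)   // List(2, 4, 6)

    // Spark's RDD.map works the same way conceptually: you hand it a function,
    // and Spark applies that function to your data, potentially across a cluster.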
So with that, let's dive in. First, a little bit of an administrative note: as we go through these introductory Scala lectures, I'm going to attach the actual source code files to each lecture, assuming you're on a platform that supports that sort of thing, so feel free to just download those as we go through the videos. Now, if you downloaded the entire big chunk of all the course materials at once in the previous lecture, the same files are in there too, so if you want to just import them from wherever you saved those course materials, that's fine; either way works. Just don't get confused by that. With that, let's take a look at some actual Scala code. All right, let's learn some Scala, shall we? The first thing I want you to do is go to the resources for this lecture and download the LearningScala1.sc file. That's a Scala worksheet file; just download it to wherever you download your stuff, your Downloads folder, and hit pause while you go take care of that, please. I'll stay here and wait for you. Done? All right, let's move on. I've opened up the Scala IDE here by opening up my Eclipse shortcut on the desktop, and what I want you to do next is go to the File menu and say New > Scala Project. Let's call this project LearningScala, just like that. Now, we do actually need a separate project for these worksheet files; while the reasons are complicated, basically there's a conflict with the Scala version we used in the other project, so just go with it. Go ahead and right-click on that LearningScala project, hit Import, go to File System, and select whatever folder you download stuff from your browser into, wherever you downloaded that LearningScala1 file. For me, that's my Downloads folder. Select LearningScala1.sc and hit Finish, and if we open up LearningScala we should have LearningScala1.sc. Double-click on that, and here we go. What we're looking at is what's called a Scala worksheet. It's kind of like a notebook in Python, in that you can interactively enter commands and see them evaluated right inline. You wouldn't go and write a real script this way, but it's a good way to get familiar with the language and just experiment and try different commands, and you can immediately see the effect they have. What's going on here is that you have a line of code, and after you evaluate it, which happens automatically whenever you save the worksheet, it will tell you over here on the right the actual output from the Scala interpreter that resulted from that command. So it's a very good way to see the code you're typing in and the result that code has, immediately. Pretty cool stuff. So let's dive in and take a look at some of this Scala code. Every language has variables, right? And variables can be mutable or immutable. Mutable means they can be changed; immutable means they can't be changed. In other languages, that might be a constant or a final variable, depending on the language we're talking about. Now, in Scala they encourage the use of immutable variables as much as possible, and those are called values, with the keyword val. So the syntax here for declaring a string value is a little bit unusual; let's look at what's going on. We have val, meaning that it's an immutable constant; we're going to name this immutable constant hello; and it's going to be of type String.
Then I set its value to "Hola". So: val hello: String = the value. That syntax is val, meaning it's immutable; hello, the name of the variable; a colon; then the type; equals; whatever the value is. Okay, so that's how we declare an immutable value in Scala. Now, just to show you that it actually works, let's change that to, I don't know, what's French for hello? Bonjour. Watch what happens when I hit Ctrl-S to save it: this all gets interpreted again. Unfortunately it jumps to the bottom of the file whenever I do that, but if we scroll back up, you can see it now says hello: String = Bonjour, and we can print out the value of that string with a println command. In Scala, that's how you print out a line of output: println, parentheses, hello, and that value name will print out with the value Bonjour. So there you have it; very simple, right? Now, although Scala encourages using immutable values, you can also use mutable variables if you need to; sometimes you just can't get around it. If you need to do that, use the var keyword instead of val. Here we have an example where we have helloThere, which is initially set to the value we had in hello, which you remember we set to Bonjour. So here we are declaring a variable called helloThere and assigning it the value of hello, referring to the value up here that contains the string Bonjour. With me so far? It makes sense if you've done any programming before; you can probably follow what's going on. At this point, the value of the helloThere variable is set to the string Bonjour, because that's what my hello value was set to. Now, because it's a variable, I can actually change it, so I can say helloThere = hello plus the string " There". I'm changing the value of helloThere from Bonjour to Bonjour There, and you can see when I print it out that, indeed, it does contain Bonjour There. Just for fun, we can set hello back to "Hello" up here, watch everything get re-evaluated when I save it, and now it says Hello There instead. Okay, so you've got it: basically there are vals, which are immutable, and vars, which are mutable, and the syntax in Scala is a little weird for declaring things: you have the variable name, a colon, the type, equals, whatever you're assigning it to. That's a little backwards from most languages; it takes a little getting used to, but you'll get the hang of it. All right. One thing I want to point out is that in functional programming we're using a lot of functions, and the danger of that is that if you have a function that quietly changes the value of something behind your back, that can lead to bugs and unexpected behavior. So in functional programming languages like Scala, they really encourage the use of immutable constants whenever possible. You want to be using vals whenever you can instead of vars, so instead of using a mutable variable and modifying it as you go, create new immutable constants and build them up over time.
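To summarize the val/var distinction just described, here is a compact worksheet-style recap using the same Hola/Bonjour example values from above:

    val hello: String = "Bonjour"     // immutable: can never be reassigned
    var helloThere: String = hello    // mutable: can be reassigned later

    helloThere = hello + " There"     // fine, because it's a var
    println(helloThere)               // Bonjour There

    // hello = "Hola"                 // would not compile: hello is a val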
Here's an example of how that works. Let's say I start off with an immutableHelloThere that is initially constructed from the concatenation of whatever's in my hello string and the string " There". So initially, I'm just putting "Hello There" into my immutableHelloThere; you see here, I'm doing the combination on the right-hand side and putting it into the final value, which will never be changed. Now, you can chain these together. Let's say I want to add something else: I want a new immutable val called moreStuff, and I can set that to immutableHelloThere plus "Yeah" (I'm just making this up as I go). Let's see what that does. So now I have "Hello There Yeah", and basically I still have two values here, one called immutableHelloThere and one called moreStuff, but I never changed the value of immutableHelloThere and I never changed the value of moreStuff after they were initially created. That way, you always know what's in a given name; it doesn't change, and that makes coding a lot safer. If I need to go back to that value, I can just refer to it by its original name, and that's the best practice for dealing with functional programming. Now, one other thing I want to point out: notice that I didn't actually declare the type here. Scala can implicitly figure it out in a lot of cases, so I didn't have to say val immutableHelloThere: String, because it knows that I'm sticking together two strings, and that produces a string object in return. In cases where there's really no ambiguity over the type, Scala can infer it for you automatically, and you don't have to use that colon-and-type-name syntax all the time. Okay. Some other types: we've only been dealing with strings so far, but all the usual suspects are here. If you want an integer, that's the Int type with a capital I. Boolean values can be true or false; lowercase are the keywords there. If you want a single character, basically one of 256 characters, an 8-bit value representing an ASCII character, you can use the Char type. Double-precision floating point is Double, single precision is Float, and notice that we have the usual f suffix here indicating a single-precision floating-point constant; Scala is a little pickier than other languages in enforcing that you use it, so if you are using single-precision constants, make sure you remember to put that f at the end. Double-wide integers are Long, and we also have an 8-bit value called Byte, just like in Java, which ranges from -128 to 127. You'll find a lot of these types very analogous to the types in Java, the reason being that Scala is built on top of the Java virtual machine, so there's a lot of transparency between the two languages. All right, moving on. If you want to take a break now, that's fine; go get a drink, hit pause, it's all right. But let's talk about printing stuff out. A common thing you need to do in any language is output human-readable stuff, and it's very simple in Scala. If you just want to tack a bunch of stuff together, you can use the plus operator to do that. You can see I have a bunch of different types of information up here, and I can just push it all together with the plus sign and create one big giant string. So the println here is a mess: starting with a string, I'm concatenating on the number one, a Boolean value, a character value, a double-precision value, and a long value all together, and when we print that out, we get exactly what you would expect: "Here is a mess", followed by all the stuff jammed together in the form of a string, and that all just happens automatically. All right.
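Here are the basic types and the string concatenation from this part of the worksheet, collected in one place (the names and values are illustrative, following the examples above):

    val numberOne: Int = 1
    val truth: Boolean = true
    val letterA: Char = 'a'
    val pi: Double = 3.14159265
    val piSinglePrecision: Float = 3.14159265f   // note the required f suffix
    val bigNumber: Long = 1234567890L
    val smallNumber: Byte = 127                  // Byte ranges from -128 to 127

    // Concatenating mixed types into one big string with +
    println("Here is a mess: " + numberOne + truth + letterA + pi + bigNumber)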
Scallop supports that, too, but the syntax is a little bit strange. So what you do, if you want to do formatting of that nature, you need to prefix your string with the character F. And that expands to a little macro that says, I'm going to use a print F style format within this strength. So here was saying print line F, meaning that I'm going to be formatting the text in here somehow, and I'm gonna say pies about that. I have a dollar sign followed by the variable name that I'm going to be formatting, followed by a percent sign and then the print F format for how I want to display that. So in this case, I have point free F, meaning that I want to display a floating point value with three decimal points of precision on the right hand side of the decimal point. And if you're not familiar with that sin tax, you can go look up, print death on the interwebs and get more details but pretty common tool for formatting your output if you need to make sure that it's human readable and only has a reasonable amount of precision. If we run that, we see sure enough pies about 3.142 So that took the value of pie. Single precision, which is 3.14159265 applied this formatting string to 8.3 F, and that made sure that we only have three decimal points places of precision to the right of the decimal point. Here's another example. You can also use, uh, use the same trick for putting zero padding on the left of numbers. Sometimes that's useful for sorting strings, for example, so in here we have the F prefix again dollar signed by the variable name number one. In this case, it's a an integer, which is the number one, and we're using a percent sign, followed by the formatting string, which is 05 d. The D means that it is a integer value, and the 05 means that I want five places of precision with zeros patting it in if necessary , and the output from that is with five decimal places. 00001 Okay. All right. Another trick we can do is actually substitute in variables as part of our string as they're being constructed. So for that, we use the s prefix. And then anything that's preceded by a dollar sign will just be interpreted as a variable or value name. So we have these s prefix. I can use the s prefixed. Use variables like dollar sign number one that will pull in the integer value one up here, dolla. Sign truth. We should pull in. True. All signed letter A. We should pull in the character A and sure enough, it works. When we evaluate that we yet I can use the s prefix two variables, like one drew a and you can also do fancy or stuff. You can do more than just reference values and values and variables. Here is an example where we actually evaluate an expression. So again, that dollar sign means I'm about to evaluate something, and it might be a variable or value name, or it might be a whole expression. So if you put something within curly brackets that basically says I got a little bit of code here that I want to execute. You can put whatever you want in there. It might be like a little tiny little function even. But in this case, we're just gonna add one plus two together. And sure enough, it works when we actually run that we get the S prefixes and limited variables. I can include any expression like three. So dollar sign brackets one plus plus two just got converted to the value three inside that string automatically. So very handy tools for producing output when you're putting stuff out or writing it to a file or what have you? All right, the next section gets a little bit weird. 
All right, the next section gets a little bit weird, so if you want to rest your brain a little and let that previous stuff sink in, that's okay. Let's move on. Now, if you're familiar with regular expressions, they can be a powerful tool for pattern matching strings and extracting information out of structured, formatted data. Let's say you have, for example, an Apache access log that has a very specific format; you might use a regular expression to extract fields out of a line of that log information, and we do that later in this course quite a bit, actually. So, for example, let's say I start with a string that says "To life, the universe and everything is 42", and we're going to call that string theUltimateAnswer. Read Douglas Adams if you don't know what I'm talking about. Now, the syntax for handling regular expressions in Scala is really weird, okay? It's going to blow your mind; it's strange, and there's nothing I can do about it. Sorry, you just have to live with it. What we have is this strange triple-quote syntax followed by .r, and that basically means "I want to construct a regular expression out of the stuff in between the triple quotes". If you're familiar with regular expressions, you'll know what this all means: .* means skip over a bunch of characters, then look for a space, and then we're going to look for one or more digit characters (that's what the \d+ means), followed by a bunch of other characters. Basically, this says "go look for the first number in this string"; I'm trying to suck the 42 out of this string. Now, regular expressions are a whole topic in themselves; there are multiple O'Reilly books on the subject, and you can do some very fancy and very complicated things with them. If you want to learn more about regular expressions, go look on the Internet or try the Introduction to Regular Expressions book from O'Reilly; that's probably a good start. But odds are you've seen these before, and if you haven't, you can usually find the one you need on the Internet; I'll provide the ones you need in this course for logs, for example. So don't get too hung up on the syntax. But like I said, the way Scala handles it is pretty weird. We have this regex that we created using the .r operator, and to apply it (it's really weird how this works), you take that regex and pass in where you want to put the result that you extract, so we're going to try to pull out that number from the string and put it into a new value called answerString. Then we use the equals sign to say "I want to run this on the string called theUltimateAnswer". This syntax really defies human logic; you just have to go with it. It's something about Scala that I just don't like, but keep this worksheet around for reference, so if you do need to do some regular expression stuff later on, you'll have this to look at, because it really doesn't make intuitive sense at all. Anyway, what that does is apply the regular expression to the string, extract out the number 42, and stick it in the string answerString. The last thing we need to do, because that is a string but we really want to treat it as a number, is call toInt on that string to extract the integer value 42; toInt will convert any string to an integer.
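Here is that regular-expression example collected into one place. The exact wording of the input string is approximate, but the pattern-matching syntax is as described above:

    // Our input string (wording approximate)
    val theUltimateAnswer: String = "To life, the universe and everything is 42"

    // Triple quotes avoid escaping; .r turns the string into a Regex object.
    // .* skips characters, (\d+) captures the first run of digits.
    val pattern = """.* (\d+).*""".r

    // The strange-looking extraction syntax: apply the pattern to the string
    // and capture the matched group into answerString
    val pattern(answerString) = theUltimateAnswer

    // Convert the captured string to an integer
    val answer = answerString.toInt   // 42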
And, as you might expect, there's also toFloat, toDouble, and whatever other conversion operators you might need as well. Notice that I didn't actually have to put empty open and close parentheses on that call — that's not necessary — and when we print the answer, we get 42. OK? So, honestly, this is about as weird as Scala gets; if you made it through that, it's all downhill from here. It gets easier, as promised. Booleans work much as you would expect. If you just want to compare two things: is one greater than two? Well, no, the answer is false. Is one less than two? That will give you true. One thing that's a little bit ambiguous in some languages is what one ampersand versus two ampersands, or one pipe versus two pipes, does when you're doing AND or OR comparisons. You can use either one in Scala. The only difference is that if you use a single ampersand, both expressions get evaluated and then ANDed together, whereas a double ampersand can bail out early: if isGreater were false, it wouldn't even bother evaluating isLesser at all — it goes left to right, only as far as it needs to. So for performance you'd usually go with the double ampersand, but for safety a single one might be better, because you might have some expression in there that you assume gets run, and it might not if you're using a double ampersand. Either way, here you get the same answer — you can't be greater and lesser at the same time — so both of these produce a result of false. String comparisons are also something that works differently in different languages. In Scala it makes sense: when you compare two strings using the == operator, it actually compares the contents of the strings and not the string objects themselves. So I set one string equal to "Picard", and I have another string called bestCaptain that's also set to "Picard". If I test picardString == bestCaptain, it compares the contents of those strings and says true. And, well, I don't want to get all nerdy on you and debate the merits of the various Star Trek captains, so we'll leave that as an exercise for the reader, shall we. All right, so that's lesson one of our Scala crash course — the basic stuff. All we did was cover the syntax for declaring mutable and immutable variables, we looked at some different Boolean operators, some ways of printing out information, and applying regular expressions, and you have what you need to do a little exercise here, so let's make it real. While you're in here, like I said, you can just add code, hit save, and it will get evaluated right away, so play around with it. My challenge to you is to write some code that takes the value of pi, doubles it, and then prints it within a string with three decimal places of precision to the right. You should have enough examples within this worksheet to do that pretty easily. This isn't meant to be difficult; I just want to get you hands-on with some Scala code, and get you comfortable with the fact that, yes, you can write Scala yourself and actually make it run. So please do that exercise — even though it does seem trivial, it will be helpful — and then we'll move on to the next lesson. 6. [Exercise] Flow Control in Scala: All right, moving on. This next topic is going to be a little bit easier, a little bit more straightforward: we're just going to talk about flow control in Scala.
So to get started, go ahead to the resources for this lecture and download the LearningScala2.sc file to wherever you download stuff, and I will sit here and wait for you. So hit pause and take care of that, please. All right, let's move on. Right-click on the LearningScala project here and say Import, go to File System, browse to wherever you downloaded your stuff — for me, that is C:\downloads — select LearningScala2.sc, and hit Finish to import it into our LearningScala project. Double-click on it, and you should see this. So let's just talk about flow control: how do we control the flow of our code in Scala? It's all pretty intuitive. There are a few syntactical differences from other languages, but by and large it should look very familiar to whatever language you're already familiar with. For example, you can use if/else syntax: if one is greater than three, println "Impossible!", else println "The world makes sense." And sure enough, that will output "The world makes sense." Now, I could also put braces around this if I wanted to make it more than one line — that's OK; you see that down here, where I'm actually splitting it out to be a little bit more readable. Often in Spark code you see people jam things onto one line just because they can, and that's kind of a common attribute of functional programming, I'm afraid, but either way works. There's no special else-if keyword; you'd just chain another if after the else if you wanted to. And that's all there is to if and else: it works the way you would expect. Now, matching. Let's say you want something like a switch statement in other languages, where you have a value and different cases that do something different depending on what value that variable has. In Scala that's called a match statement — instead of switch, we use match — but other than that it's pretty similar. Let's say we start with a constant value, number, that's set to the number three. "number match" will be the syntax here: I want to match on the value of number. For case 1, you have this little equal-sign-greater-than thing that sort of says: associate this function with that case — it could actually be a whole expression if I wanted to, but in this example we'll keep it simple. So case 1 is mapped to println("One"), case 2 is mapped to println("Two"), case 3 is mapped to println("Three"), and then this underscore is basically your default, your catch-all case: case _, meaning it's something that's not 1, 2, or 3, will println "Something else". But in this case our number is three, so it falls through to case 3 and prints "Three", which indeed works. So again, there's nothing really complicated going on — it's just like a switch statement in other languages with slightly different syntax: we have match following the value name, which is a little bit backwards from other languages, and we have this little mapping operator, =>, to actually map cases to what you want them to do. So that's the syntactical trickery there.
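As a quick reference, here's a condensed sketch of those two constructs (the worksheet's actual strings may differ slightly):

```scala
// if / else works just as you'd expect.
if (1 > 3) println("Impossible!") else println("The world makes sense.")

// match is Scala's switch statement; the underscore case is the catch-all default.
val number = 3
number match {
  case 1 => println("One")
  case 2 => println("Two")
  case 3 => println("Three")
  case _ => println("Something else")
}
```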
For loops: nothing special about them, well, a little bit syntactically. Look at what's going on here. We have for x <- 1 to 4, so basically we want to iterate through the values one through four, inclusively. The only weird thing here is that little <- syntax, which is basically a range operator in Scala. It says: take the value x within the range 1 to 4 — that creates a list of 1, 2, 3, 4 — and iterate through it in this for loop. So we will take here the square of each value: for each value one through four, we multiply it by itself and print the result, and we see one times one is 1, two times two is 4, three times three is 9, four times four is 16. So the for-loop syntax itself is exactly what you'd expect, but we have this range operator — that's something new. So remember, that's how you specify a range of values in Scala. While loops: absolutely nothing interesting there; they work exactly the same as in other languages. Let's say I start with the variable x set to 10, and I want to keep looping while x is greater than or equal to zero, subtracting one from it as I go and printing the value. There's really nothing fancy there at all; it does exactly what you'd expect: we start with 10, print the value, then decrement, and keep going until we hit zero — so 10, 9, 8, all the way down to zero, just like in any other language. And just like while loops, we can also do do-while loops, also just like any other language — there's really nothing Scala-specific going on. We start with the value 0, do: print the value and increment it by one, and do that while x is less than or equal to 10. So instead of counting down from 10 to 0, we're going to count up from 0 to 10 using a do-while loop. Again, if you have any prior programming experience at all, I'm sure you've seen this all before, and there's really nothing special about how it works in Scala. Now, one thing I do want to talk about is expressions in Scala, and one thing that is a little bit different is when we have a block of code like this: let's say I'm starting with the value x equal to 10, and then I'm going to add 20 to it. Now, in other languages that wouldn't actually do anything, because we're not explicitly returning anything from that expression. But in Scala, this has an implicit value: whatever the last expression is becomes what the block as a whole evaluates to, so I can actually take this expression and use it as the value 30. Take a look at this: I can say println of this expression — val x = 10; x + 20 — that expression gets evaluated, and the number we're left with at the end is 30, so the whole thing gets translated into the value 30, which is what actually gets printed out here. OK, so we're touching on an aspect of functional programming here, where expressions can be used as values in their own right. That takes a little bit of wrapping your head around, so stare at those two lines a little bit longer and make sure that sinks in for you.
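Here's a compact sketch of those loops and the block-as-expression idea (condensed from the narration; details may differ from the worksheet):

```scala
// for loop with the <- range operator.
for (x <- 1 to 4) {
  val squared = x * x
  println(squared)                       // 1 4 9 16
}

// while: count down from 10 to 0.
var x = 10
while (x >= 0) { println(x); x -= 1 }

// do-while: count up from 0 to 10.
x = 0
do { println(x); x += 1 } while (x <= 10)

// A block is itself an expression whose value is its last line.
println({ val y = 10; y + 20 })          // prints 30
```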
OK, now let's get your hands dirty with a simple example — you probably see this in a lot of simple job interviews: print the Fibonacci sequence, in whatever language you want; in this case I want you to do it in the Scala programming language. It's not hard: all you have to do is write a little loop that goes through the numbers zero through ten, and the expected output should look something like this, where every time you go through the loop you're adding in the sum of the two numbers before it. So basically, you need to write a loop that keeps track of the previous two values and sums them up as you go — a little exercise in some flow control and looping in Scala for you. Have at it and have some fun with it. I just want you to get your hands dirty and, again, get some confidence that you can actually write Scala code that runs. All you have to do is write your code in here, and every time you hit Ctrl-S to save, it will get evaluated for you and you'll see the result right alongside your code. So give that a try, see how it goes, and on to the next lecture. 7. [Exercise] Functions in Scala: Moving on. Like I said, Scala is a functional programming language, so now it's time to talk about functions, which are obviously very important in Scala. So go to the resources for this lecture and download the LearningScala3.sc file — hit pause while you take care of that. And just like before, we're going to import it into our project: right-click on the LearningScala project, go to Import, then File System, select the folder where you downloaded your stuff, whatever that might be, select LearningScala3.sc, hit Finish, and you should see it in your project. Double-click it and take a look; it should look like this. All right, so let's talk about functions in Scala — a very important lesson here. Like I said, they're kind of the building blocks of functional programming. The syntax for a function in Scala looks like this: you start with def, which basically means I'm declaring a function of some sort, followed by the function name, in this case squareIt — I'm making a function that just squares some value. And the syntax, again, is backwards from what you might expect in other languages: we start with a parameter x that is of integer type — we say that as x: Int — and then the return value of this function is an integer as well, so we have another colon after the function name declaration and the parameters, followed by the type that is returned by the function. So again, this is totally different from what other languages might do. Usually you would see something like int squareIt(int x) instead; in Scala it's squareIt(x: Int): Int. It just takes some getting used to, but make sure you understand what's happening there: def means we have a function, followed by the function name, our parameters, each with a colon that specifies its type, and then finally another colon followed by the type that is returned by the function. And then another thing that's a little bit different is that we have an equal sign between that and the actual braces of the function body, because we're basically taking this expression and assigning it to the squareIt name. OK, so a little bit different there too. But inside, all we're going to do is say x times x. And again — this is the expressions idea to remember from the previous lecture — we don't have to explicitly return anything: whatever the last value computed in that expression is, is what gets returned. So this is all we need for a function that actually squares a value. Let's take a look again: squareIt takes in an integer, returns an integer, and applies this expression to it, x times x, to actually get the square. Now, we can do something similar here for cubing: again, def cubeIt takes in an integer x, returns an integer, and uses the expression x times x times x to compute what gets returned. So let's see it in action.
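In worksheet form, those definitions look roughly like this (a sketch; the actual LearningScala3.sc may differ slightly):

```scala
// def name(param: Type): ReturnType = { body }; the last expression is the return value.
def squareIt(x: Int): Int = {
  x * x
}

def cubeIt(x: Int): Int = {
  x * x * x
}

println(squareIt(2))   // 4
println(cubeIt(2))     // 8
```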
We do println(squareIt(2)) — you just call the function like you would in any other language — and that will return 4. And if you println(cubeIt(2)), that will give 8, because two cubed is eight and two squared is four. All right, nothing too crazy yet; it all kind of makes sense, right? Things get a little bit weird now. Remember I said before that up here we're basically assigning this expression to that function name, squareIt? So basically, every function in Scala is kind of its own little object, and you can pass these functions around to other functions as if they were objects themselves. That makes a lot more sense with an example. Let's make a new function called transformInt, and what this function does is take a function that transforms an integer — it could be any such function — and apply it to some integer. So our parameters here, in this example, are going to be x, an integer we want to transform, followed by a function that transforms an integer into another integer. So we're taking in here, as a parameter, a function that expects as input one parameter that is an integer and returns an integer — that's what this Int => Int syntax means: it takes in an integer and outputs an integer as its result. And the return value of this function itself is an integer, set equal to f(x): we take the function f that was passed in and pass in the parameter x that was passed in, and what gets evaluated is that function f applied to that value x. So let's see how that works. For example, we can call transformInt with the value 2 and the function cubeIt. This will pass the value 2 into x and the function cubeIt into the f parameter; transformInt will then take the function cubeIt, pass x into it, and return the result. That gives us back 8, because it passes 2 into the cubeIt function and returns 8. We print out the result, and there it is. Now, for example, we could change that to be squareIt instead — let's just make sure it works; hit save — and sure enough we get 4 now, because now I'm passing 2 to the squareIt function. So, you know, you might want to take a little break here and make sure that sinks in. This is a very fundamental and very important concept of functional programming: you can pass functions around as parameters themselves. I mean, in other languages this is still possible to do, but you don't see it as often. OK, so it's very important that you understand how that's working. Stare at these little lines of code here and make sure you understand what's happening there before you move on.
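To make that concrete, here's a hedged sketch of a higher-order function along the lines of the one described (it reuses the squareIt and cubeIt definitions from the earlier sketch):

```scala
// transformInt takes an Int and a function from Int to Int, and applies the function.
def transformInt(x: Int, f: Int => Int): Int = {
  f(x)
}

println(transformInt(2, cubeIt))    // 8
println(transformInt(2, squareIt))  // 4
```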
OK? Now, you don't even need to have a separate function declared for this trick to work: you can actually declare these functions inline as well. And if you've worked with me on my other Spark class in Python, this is very similar to how a lambda function in Python works. It also goes by the name of anonymous functions, or function literals — that's a popular term in Scala — and you're going to see a lot of this, so make sure you understand the syntax of what's going on here. So here I'm calling transformInt, and instead of passing in a named function that I already declared, I'm just passing in the expression itself, so I don't even need to have a separate function at all. I can say I'm going to pass in the parameter 3, and I'm going to pass in this expression that takes a value x and transforms it using the expression x times x times x, and this becomes my entire function definition that I pass into that function. So instead of having a separate cubeIt function, I can embed what cubeIt does right here within this call. This is perfectly valid syntax in Scala: I can say I have some value x that gets transformed using x => x * x * x, and that actually works. You can see here that if I pass the value 3 into transformInt along with this function, it will pass 3 into x and execute that function on it to get the number 27 as a result. OK, let's look at it again. Here's another example using the same transformInt, which just takes as parameters some integer and some function whose only requirement is that it transforms an integer into another integer; this time I'm taking x and transforming it to x divided by two, so in this example 10 gets transformed into 5. OK, one more time: we can do a more complicated expression here. Instead of just a simple one-liner, we can actually do a multi-line expression as well. In this example, we take the value 2 and pass it into this function that transforms x with an expression that actually does a couple of steps: it assigns a value y equal to x times two, then squares that resulting value, and that becomes the end result of the expression. So we take 2, assign y to two times two, which is four, and then multiply four times four to get 16, and that is indeed what we get back. Make sure you understand what's going on there: you're going to see this syntax a lot in the world of Scala and in the world of Spark in general, so it's very important to understand it. It's pretty easy to eyeball it, but getting a really intuitive understanding of what's going on — that's what's important. OK, so let that sink in, and once it has sunk in, I'll give you a little bit of a challenge. A very simple exercise here: every string has a .toUpperCase method that you can call on it, which just transforms the contents of that string to uppercase characters. So play around with functions: write a function that actually converts a string to uppercase, and try to actually execute that function on some strings of your own and make sure it works. Then actually get rid of that separate function entirely and do it inline, just like we did here with function literals. OK, so I want you to try both: doing this with a named, separately defined function, and also with a function literal that you just write inline. So go do that to get some practice with functions — and hey, when you're done, you can say you've done functional programming. Congratulations, that's something for the résumé.
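For reference, here is one hedged sketch of how that exercise could be approached — treat it as a possible shape of the answer, not the official solution, and note that the transformString helper is just something made up here for illustration:

```scala
// Named function version of the exercise.
def upperCaseIt(s: String): String = s.toUpperCase
println(upperCaseIt("picard"))                          // PICARD

// Inline function-literal version, passed to a hypothetical helper.
def transformString(s: String, f: String => String): String = f(s)
println(transformString("janeway", x => x.toUpperCase)) // JANEWAY
```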
All right, let's move on to our final lecture in our Scala boot camp series. 8. [Exercise] Data Structures in Scala: All right, we're down to our last lecture in our Scala crash course. This one is going to be about data structures, and it's a very important lecture — we use these a lot, so don't skip this one if you don't know them already. As before, go to the resources for this lecture and please download the LearningScala4.sc file. I'll wait for you; hit pause while you go take care of that. And when you have that, go on as before: right-click on the LearningScala project and say Import, then File System, browse to wherever you downloaded your stuff, and select the LearningScala4.sc file. Open it up, and you should see a little something like this. So let's talk about what's going on in here. OK, a really common data structure you see in Scala is called a tuple, and you see this in Python and some other languages as well. Basically, it's a collection of stuff that is treated as its own object. You can think of a tuple as basically an immutable list — a list of objects that you cannot change once they have been created — and they're very common in Scala. So how do we declare one? You just put the items in parentheses and separate the contents by commas. For example, we can create a tuple called captainStuff that contains the strings "Picard", "Enterprise-D", and "NCC-1701-D". So maybe this is a tuple that represents information about a Star Trek captain and contains individual fields with different information about that captain — it's sort of an easy way to pass around semi-structured data as its own entity. And if we print that out, we can see it gets converted to a string with the same syntax: the different contents of the tuple within parentheses, separated by commas. So captainStuff itself is a single object, but it contains three different values, these three different strings. It's a good way to stuff different bits of information into one thing, which comes in handy a lot with Spark, as you'll see later on. How do I get stuff out of it? Let's say I want to extract values out of that tuple. The syntax here, again with Scala, is a little bit weird: you put a dot after the tuple name, then an underscore and the number of the field you want to extract. And the thing you have to remember is that it is a one-based index: the first value is number one, the second value is number two, the third value is number three. A lot of other things in computer science start counting at zero — this is not one of those times; we're starting with one here. So, for example, if I print captainStuff._1, that extracts the first field of my tuple, which contains the string Picard; ._2 gets the second one, which is Enterprise-D; and ._3 gets the third element of the tuple, which is NCC-1701-D. OK, with me so far? Nothing too challenging here, but again, you just have to get used to the syntax.
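Here's a small sketch of that tuple, with the one-based field access (values per the narration):

```scala
// A tuple: an immutable, fixed-size collection treated as a single object.
val captainStuff = ("Picard", "Enterprise-D", "NCC-1701-D")
println(captainStuff)       // (Picard,Enterprise-D,NCC-1701-D)

// Fields are one-based, not zero-based.
println(captainStuff._1)    // Picard
println(captainStuff._2)    // Enterprise-D
println(captainStuff._3)    // NCC-1701-D
```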
Now, a special case is a key/value pair: basically a tuple that contains two things, and often you'll use this as sort of a key-to-value mapping. Let's say I want to create a tuple that represents a captain and the ship they command. I can use this -> syntax to do that: it just creates a tuple that maps Picard, as the key, to the value Enterprise-D, and all that does is give me back a tuple containing Picard and Enterprise-D. I could have declared it the same way we did up here, just by having it within parentheses separated by a comma, but this is sometimes a more readable format: it sort of implies to the reader of your code that you are, in fact, associating a value with a key, and that that's your intent in creating the tuple. You de-reference it the same way you would any other tuple: picardsShip._2 extracts the value from that key/value pair, because it's the second element, and gives you back Enterprise-D, whereas ._1 — let's try it — gives me back the key, which is Picard. Scroll back up to where we were... there we are. One more thing about tuples: they don't all have to be the same type. You can actually mix and match different types of data within a tuple — here's an example where you have a string, an integer, and a Boolean all jammed into one tuple — and that's a perfectly OK thing to do; it comes in handy sometimes. So that's tuples. Moving on — take a little break if you need to, let that sink in, review it... we're good. All right, lists. They're a lot like tuples, but a list is an actual collection object, so it has more stuff you can do with it — a little bit more heavyweight, but a little bit more flexible. So, kind of like a tuple, you can declare a list as just a list of comma-separated values, and under the hood it's going to build a singly linked list to represent that information. Here we have a shipList value that is a List of Enterprise, Defiant, Voyager, and Deep Space Nine — so four different ships from four different Star Trek series in this example. And the syntax for dealing with a list is different from a tuple: for example, if I want to extract a value from that list, I can just use parentheses and the index number I want to extract. So now, in this case, we're not starting with one — we're starting with zero. OK? You've got to remember that; it's very easy to make bugs around this stuff. In this case, shipList(1) will extract the zero-based element of the list at index 1: zero is Enterprise, one is Defiant, so shipList(1) gets me back Defiant. OK, I know it's confusing, but it's something you just have to remember with Scala. Now, .head will give you the first element of the list, and .tail will give you all the remaining elements of the list after the head. OK, so in other languages, tail might return the last element — that's not how it works here: tail gives you a sub-list of the list that just excludes the first element, and sometimes that's useful if you're iterating through a list and need to extract one item, then keep working with the remainder of that list in the next iteration. OK, now let's say we want to iterate through an entire list and do something to every value within it — that's something you need to do pretty often. So remember that range operator we introduced back when we talked about flow control? Well, the same trick applies to lists: we can say for (ship <- shipList), and that will actually produce a generator that goes through every ship in the shipList, assigns it to the ship variable, and we can print out each ship one at a time that way. So in this code, the key is that little <- operator that goes through the shipList and extracts each individual element into a new value called ship, which we can then print out within this expression — that will go through and print out each individual ship in our ship list.
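A quick sketch of the key/value tuple and those basic List operations (names follow the narration; the mixed-type tuple's values are just illustrative):

```scala
// Key/value pair: the -> syntax just builds a two-element tuple.
val picardsShip = "Picard" -> "Enterprise-D"
println(picardsShip._2)     // Enterprise-D
println(picardsShip._1)     // Picard

// Tuples can mix types freely.
val aBunchOfStuff = ("Kirk", 1964, true)

// Lists: zero-based indexing, head, tail, and iteration with <-.
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
println(shipList(1))        // Defiant
println(shipList.head)      // Enterprise
println(shipList.tail)      // List(Defiant, Voyager, Deep Space Nine)
for (ship <- shipList) println(ship)
```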
More fun with lists here: we can also use the map function to actually transform a list using some expression. Let's take a look at what's going on there. Say I have my shipList of ship names — Enterprise, Defiant, Voyager, Deep Space Nine. I can call .map to actually apply an expression, some function, to every element of that list one at a time. So in this case, I'm actually defining a little function inline that takes a string called ship and maps it to ship.reverse, which just reverses the characters in each string. So what that does is say: I want to take this function that reverses a string, map it onto the shipList, and return a new list called backwardShips that contains the backwards names of all the ships. OK, so stare at that line for a little bit until it sinks in — and the output is exactly what we want: we have a list of all the ships with their names backwards. OK, so that's what the map function does: it just applies some function that transforms each element of the list into a new list. And we can go through and print out those results, too, just using that same trick with the <- operator, and sure enough we get the backwards versions of each ship name printed out one at a time. OK, with me so far? So that's mapping — that's the map function. Does mapping sound familiar? MapReduce: same concept, same exact concept. The only difference is that we're not actually doing this in a distributed manner; we're just using Scala's built-in operators on a list to actually map one list to another. Now, I said MapReduce — you can't talk about map without reduce, and sure enough there's a reduce method on lists as well. So just like we can use map to apply a transformation to every individual element of a list, we can use the reduce function to actually combine the contents of a list together. Here we're going to start with a numberList that's just a list of the numbers 1, 2, 3, 4, and 5, and I can call reduce on that list with a function that tells it how to combine these things together. Basically, what reduce does is go through the list, and as it gets each pair of items, it uses the function we provide to combine them together. So basically we're saying: as new values come in from the list, add them up. As I get two values that are processed together, I add them up, take that result, add in the next value, take that result, add in the next value, and so on and so forth until we're done. So basically, this is a way of saying: I want to take the entire list, sum all of its values, and give me back the result. And what that does is work through the list one pair of values at a time: 1 plus 2 is 3, 3 plus 3 is 6, 6 plus 4 is 10, 10 plus 5 is 15, using that reduce function we provided — and sure enough, the answer we get back is 15, and we can print out that result: sure enough, 15. And hey, that's MapReduce. The only difference is that in a real MapReduce application you would be using functions that actually work on a cluster. Same concept, though.
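Here's a compact sketch of those two operations on plain Scala lists (shipList is repeated so the snippet stands alone):

```scala
// map: transform every element into a new list.
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
val backwardShips = shipList.map((ship: String) => ship.reverse)
for (ship <- backwardShips) println(ship)

// reduce: combine the elements pairwise into a single result.
val numberList = List(1, 2, 3, 4, 5)
val sum = numberList.reduce((x: Int, y: Int) => x + y)   // 1+2=3, +3=6, +4=10, +5=15
println(sum)                                             // 15
```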
Filter can also be used — and if you've looked at Spark or MapReduce before, that might also look familiar. You can use it to create a sub-list based on some filter function you define. So in this example, we're going to say: I have a new list called iHateFives, and I'm going to take my numberList that contains the numbers 1, 2, 3, 4, 5 and apply a filter that tests whether x does not equal five. So I have here a little inline function — a function literal, if you will — that takes an integer x and tests if x is not equal to five, and the Boolean result of that function determines whether or not each individual list element survives and gets passed into the resulting list. So what goes on here? We start with 1, 2, 3, 4, 5, we apply this filter function to it, which allows everything but the five to pass through, and you can see that we end up with a list 1, 2, 3, 4, where five has been excluded. We do the same thing again here with a little bit of a more compact syntax that you often see — same idea, but here we're using a syntactical shortcut: in a function literal you can use an underscore to just say "whatever it is" is not equal to three. That sort of implicitly removes the "x => x" part and simplifies it down to: hey, I know I only have one parameter coming in; I want to take that parameter, whatever it is, and just compare it to three. And you can see that works just as well: if I apply this filter function, which makes sure that each element, represented by that underscore, is not equal to three, I end up with 1, 2, 4, 5, where the three has been omitted. So, you know, you do see this a lot in Scala, where people want to save keystrokes at the expense of readability, but you need to understand it when you see it. OK. Now, Spark has its own map, reduce, and filter operations that can actually happen in a distributed manner; the only difference is that they operate not on list structures but on something we call RDDs. We'll get to that later, but it's the same exact concept. So if you understand how this map and reduce function and this filter function work — hey, you understand MapReduce now, and you also understand a good part of how Spark works. It's all the same idea; it's actually not that complicated. A few more list tricks here. Let's say I want to concatenate two lists together: we have our 1-through-5 numberList, and I want to concatenate a new list called moreNumbers containing 6, 7, 8. The ++ operator will combine two lists together, and that gives us back the combined list 1, 2, 3, 4, 5, 6, 7, 8. OK, a few more tricks: every list has a .reverse method that will give you the reversed list, so 5, 4, 3, 2, 1, and a .sorted method that will return a sorted version of that list. So here I took my reversed list, which was 5, 4, 3, 2, 1, called .sorted on it, and that returns a new sorted list that's back to 1, 2, 3, 4, 5. Let's look at distinct: let's say I take my numberList and append it to itself, numberList ++ numberList, and I get 1, 2, 3, 4, 5, 1, 2, 3, 4, 5. If I just want the distinct values in that list, I can call .distinct and just get back 1, 2, 3, 4, 5, so each value only appears once. OK, we also have a .max function that will find the maximum value in a list, and sum will sum them all up, if they're numbers that can be summed. And there's also a .contains function that can be used as a quick test to see if a list contains a given value: for iHateThrees, where we filtered out the number three, contains(3) will return false, because it has everything but three. All right, that's lists. Take a break here and let it sink in, because we're going to move on to maps now.
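And here's a sketch of filter plus those extra list tricks, condensed into one runnable block:

```scala
val numberList  = List(1, 2, 3, 4, 5)

// filter: keep only elements passing the test; the underscore is shorthand for the parameter.
val iHateFives  = numberList.filter((x: Int) => x != 5)   // List(1, 2, 3, 4)
val iHateThrees = numberList.filter(_ != 3)               // List(1, 2, 4, 5)

val moreNumbers   = List(6, 7, 8)
val lotsOfNumbers = numberList ++ moreNumbers             // concatenation: 1..8
val reversed      = numberList.reverse                    // List(5, 4, 3, 2, 1)
val sorted        = reversed.sorted                       // List(1, 2, 3, 4, 5)
val distinctVals  = (numberList ++ numberList).distinct   // List(1, 2, 3, 4, 5)

println(numberList.max)           // 5
println(numberList.sum)           // 15
println(iHateThrees.contains(3))  // false
```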
All right, maps. Maps are just like maps in other languages — sometimes they're called dictionaries. Basically, they're a lookup table from given keys to values. The way to declare a map in Scala looks something like this: let's say I want to create a map that maps Starship captains to the ship they commanded. So here we have a Map where Kirk maps to Enterprise — we have this -> syntax mapping keys to values — Picard is associated with Enterprise-D, Sisko with Deep Space Nine, and Janeway with Voyager. And we can do a lookup very simply here by putting the key name in parentheses to extract the value associated with that key, so shipMap("Janeway") returns Voyager. OK, very easy to use. Now, one complexity is that when you're dealing with maps, you need to deal with missing keys. Right? So what if I did a lookup on, say, Archer? OK, he wasn't in my original list of captains — my list of keys for this map. So what happens if I try to extract Archer? Well, that would generate an exception. One way to protect against that is to make sure the map actually contains the key you're looking for first: if I do that test, shipMap.contains("Archer") will return false, because shipMap does not have a key for Archer, and at that point I can say, well, I know that key doesn't exist, so I'm not going to try to extract the value for it, because I know that would lead to an error. Another way you can deal with it is by using this Try syntax — just like try/except blocks or try/catch blocks in Java; it's a similar concept. What we're going to do is use the util.Try function that's built into Scala to actually extract the value for Archer from shipMap. What's going to happen is that will generate an exception, because there is no key for Archer in shipMap, but instead of erroring out, we wrap it in this util.Try, and getOrElse says: I want to actually get the value of this expression, but if it generates an exception, here is my fallback value of "Unknown". And if I go ahead and execute that, you can see that when I try to do shipMap("Archer"), it does generate an exception, getOrElse handles that exception, and we get back the string "Unknown" in that case. So it's a safe way of accessing a map when you're not sure whether the key is actually in it or not.
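In code, that looks roughly like this (a sketch following the narration):

```scala
import scala.util.Try

val shipMap = Map(
  "Kirk"    -> "Enterprise",
  "Picard"  -> "Enterprise-D",
  "Sisko"   -> "Deep Space Nine",
  "Janeway" -> "Voyager"
)
println(shipMap("Janeway"))           // Voyager
println(shipMap.contains("Archer"))   // false

// Try wraps the risky lookup; getOrElse supplies a fallback if it threw.
val archersShip = Try(shipMap("Archer")).getOrElse("Unknown")
println(archersShip)                  // Unknown
```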
All right, those are all the main data structures we need to deal with in Scala. There are others, and we'll maybe touch on one or two more of them as we go through the course. Obviously, there's a lot more to Scala than we've talked about here — we haven't even really gotten into object-oriented Scala, and some of the more mundane details of the language might come up from time to time — but as new concepts come up through this course that you might need, I'll introduce them as we go. This is all you need to get started; these are the basics of Scala. If you do want to learn more, there is a good book called Learning Scala from O'Reilly that I recommend — I don't get any money if you buy it or whatnot, but it's a really good reference, and it does go into all those little details I talked about if you want to learn more. But that's enough to get you started, and I think that's enough for now, so let's start to dive into some actual Spark next, shall we? And congratulations — you now know Scala! Woo! 9. Introduction to Spark: So we've learned a little bit of Scala; let's talk about Spark, and then we'll start to put it all together. So, at a high level, what is Spark all about? Well, if you go to the Spark website, it describes itself as "a fast and general engine for large-scale data processing," and that's pretty much what Spark does. It is basically a framework that you can use for programming the distributed processing of large data sets. It contains functions that let you import data from a distributed data store, like an HDFS file system or S3 or something like that, and it provides a mechanism for very simply and very efficiently processing that data — assigning functions that transform that data or perform various actions upon it. And it can do this on a cluster using the same code you used to run it on your own desktop: the same code can be run on your own desktop for development purposes, as well as on a real cluster of computers that can be scaled out as large as you want. So the whole power of horizontal scaling is available to you: if you have a massive data processing job, all you need to do is throw more hardware at it until you can actually handle the whole thing. That's kind of the whole idea behind Spark: it lets you do things you could never do on a single computer, and it is highly scalable. The high-level architecture of Spark looks like this: you develop a driver program — all the scripts we're going to be writing in this course are driver programs — and they're built around one object called the SparkContext, which sort of encapsulates the actual underlying hardware you're running on. Now, on top of that, you are running some sort of cluster manager. There is one provided by Spark, or you can run it on top of a Hadoop cluster if you want and use Hadoop YARN; Apache Mesos is an option, too. That cluster manager is then in turn responsible for distributing the work defined by your driver script among multiple nodes. Every node — every machine — you run on might have an executor process, which has its own cache and its own list of tasks, and Spark can split your data up among multiple executors. So if you have a massive data set, it might take one little chunk of it and put one chunk on each of those different nodes in your cluster for parallel processing, and then your cluster manager figures out how to recombine all that data and pass it on to the next step if necessary. But the beautiful thing is that all this stuff on the right just happens automagically, for the most part. All you really concern yourself with as a Spark programmer is primarily the logic of how you're going to process this data; Spark and the cluster manager are then responsible for figuring out how to distribute it in an efficient manner. Now, that's not to say you're totally off the hook for thinking about how it gets distributed — we'll talk about tweaking these things later on — but for the most part, it just works. Why use Spark as opposed to something else? Well, a lot of people compare it to Hadoop MapReduce, and MapReduce was kind of the first technology that came out for doing distributed processing of data on a cluster. Spark claims to run up to 100 times faster than MapReduce if you're running in memory, or 10x faster on disk. It is pretty fast — your mileage may vary, of course, and these are sort of best-case scenarios — but Spark is generally preferred over MapReduce for that reason: it is generally faster, it's more modern, and it has a little bit more capability to it.
In MapReduce, you'll find yourself writing a lot of things from scratch and kind of wrapping your head around how to fit a problem into the MapReduce coding paradigm, whereas in Spark it's generally a lot easier. The reason it's so much faster is that Spark uses something called a directed acyclic graph — a DAG engine — to optimize its workflows. One thing that's kind of neat about Spark is that nothing actually happens until you hit a command that says "I want to collect the results and do something with them," and once Spark sees an action like that, it will go back and figure out the optimal way to combine all of your previous code together and come up with a plan that's optimal for actually producing the end result you want. So it's a very different paradigm from how MapReduce code works, for example, and that's kind of the secret to its efficiency. A lot of people are using Spark in the wild, and I'm sure there are many more than just this list. There's a little website at the bottom here you can look at if you want to see the latest list, and I'm sure a lot of big companies just aren't telling people how they're doing stuff internally — there's a lot of secrecy in the corporate world — but those who have fessed up include Amazon, eBay, NASA, TripAdvisor, Yahoo, and many, many others. I'm sure it's also used very widely in the financial industry, and pretty much anyone who has big data is doing something with Spark these days. It's not really that hard — I think you'll be surprised at just how small these Spark driver scripts are that we're going to write. It's a very powerful and very concise language that we're going to be dealing with, and you can code in Python, Java, or Scala. Again, I'm recommending Scala because it's going to give you the most efficiency and probably the tightest code for dealing with Spark programs, but if you do want to stick with Java or Python, that's also an option. And Spark is really just built around one main concept: the resilient distributed dataset. This is basically an abstraction over a giant set of data. You just take these RDDs, you transform them, and you perform actions on them, and that's really all there is to programming in Spark. It's just a matter of trying to figure out the right strategy for getting from point A to point B, where you have a set of input data and a desired set of results — but the actual code in between is generally surprisingly simple. Now, Spark itself consists of many components. Spark Core kind of deals with the basics of dealing with RDDs — transforming them, collecting their results, tallying them, and reducing things together — but there are also some libraries built on top of Spark to make some more complex operations even simpler. One is Spark Streaming. That's actually a technology built on top of Spark that can handle little micro-batches of data as they come in in real time, so you can process a stream of data from a fleet of web servers, for example, or maybe a ton of sensors from an Internet of Things application, as they come in, one second at a time, and just keep updating your results as you go in real time — it just runs forever. Pretty cool stuff.
Spark SQL allows a SQL-like interface to Spark, and you can even open up what looks like a database connection to Spark, so you can be using Spark SQL to run SQL-like queries on massively distributed data sets that you've processed using Spark, which is a pretty powerful thing sometimes. MLLib lets you do machine learning operations on massive data sets. It's a little bit limited in what it can do today, but it's always getting better, so if you need to do things like a linear regression, or even recommend items based on user behavior, MLLib has built-in routines to do that automatically and distribute it across a cluster, so you can deal with truly massive data sets and perform machine learning on them in a very efficient manner. Finally, GraphX. We're not talking about the types of graphs you see in slide presentations, where revenue is going up and to the right or whatever — this is about graph theory, graph networks. Think about, say, a social network, where we have a bunch of individuals connected to each other in various ways; that's the kind of graph GraphX is talking about. It provides a framework for getting information about the attributes and properties of that graph, and also a generalized mechanism called Pregel, which we'll look at later in the course, that will let you do pretty much anything you want to that graph in a very efficient and, again, distributed manner — so you can even throw a massive social network data set at it and wrangle that into whatever information you need. So, all more ways in which Spark is a very powerful technology. So why are we using Scala? You know, it's probably not a language you know, and asking you to learn a new language is kind of a tall order, I realize. But again, Spark itself is written in Scala, so if you want to take advantage of the latest and greatest Spark features, Scala is the best choice: it's going to give you the best performance, the best reliability, and it's also going to be the simplest code to look at at the end of the day — there's not a whole lot of boilerplate we need to add to Scala code in Spark to get it to work. So it's very simple to work with once you know the basics, and it's really not as hard as you think. Another good thing is that if you have done Spark programming before in Python, you'll find that Scala code looks a lot like Python code in the context of a Spark job. Look at these two examples here that do the same exact thing in Python and Scala: the syntax is slightly different — we have val qualifiers, whereas in Python you don't need to declare that, and we're using a List structure here, and the format for lambda functions is slightly different — but by and large it's going to look very familiar. So if you have done some Spark programming in the past with Python, going to Scala is not going to be too much of a leap, actually. 10. The Resilient Distributed Dataset: So now we're going to go through a little bit more theory here before we go back to some code, so bear with me. This is a very important concept with Spark: the resilient distributed dataset, basically the core object you're going to be using throughout all of your Spark development. So it's an important concept to understand; let's dive in a little bit. RDD — we'll call them RDDs a lot in our course here and throughout our work — stands for Resilient Distributed Dataset.
Basically, it abstracts away all the complexity of trying to manage the fault tolerance and distributed nature of the processing that happens with these objects. You can think of an RDD as basically an encapsulation around a very large data set that you can then apply transformations and actions to, and all you as a programmer need to worry about is the functions that perform those transformations, or which specific actions you want to perform on this data. The RDD abstracts away all the complexity that Spark handles for you: it makes sure it's fault-tolerant and resilient, so that if one node goes down in your cluster it can still recover from that and pick up from where it left off. It makes sure it's distributed, so it worries about all the details of how it actually splits up your data and farms it out to the nodes on a cluster. And fundamentally, it is a data set — at the end of the day, an RDD is just a giant set of data, basically row after row after row of information, and that could just be lines of raw text, or it could be key/value information, as we'll see later on. So where do RDDs come from? Well, when you create a Spark driver script, you create what's called a SparkContext object. If you're running just in the Spark shell, you get one automatically, called sc, but when you're writing code, you explicitly create a SparkContext, and you will tell the SparkContext: OK, this is the kind of cluster I'm running on, this is the name of my job, and whatnot. But once you have this SparkContext, it is what creates RDDs for you. So, let's take a look at some examples. In this first example, we're basically taking a hard-coded list of integers — 1, 2, 3, 4 — and using the parallelize method to create an RDD out of it. That would actually be sc.parallelize, if you were calling your SparkContext sc, for example. Now, that's not a terribly useful thing to do, because if you can actually hard-code a list of integers, well, that's not really big data to begin with, is it? Why are you using Spark in the first place? But it can be useful for setting up test cases, or just developing your scripts with a small set of data initially. You can also — and this is more common — load up data from some file system, and it might be your local file system, or it could be a distributed file system, which is more likely in a real-world scenario. In this example, we're just getting a big text file off our hard drive, but you could also use s3n:// if you had it stored on Amazon S3, or hdfs:// if it was stored on a distributed HDFS file system on a Hadoop cluster, instead of file://. So if you have a big chunk of data somewhere sitting on a distributed file system, textFile can let you load that. You can also integrate with Hive. Hive is basically another bit of the Hadoop ecosystem that lets you store big data in sort of a more structured way, and if you have a Hive database, you can actually construct a HiveContext, given a SparkContext, and then use that HiveContext to execute SQL queries on your Hive data — and that will return back an RDD that you can then perform Spark operations on. And there are lots of other data sources you can integrate with Spark and create RDDs from, including JDBC, Cassandra, HBase, Elasticsearch, JSON, CSV files — all sorts of different stuff. If there's a somewhat popular data store format out there, odds are there's already a way to integrate it with Spark and create an RDD from it.
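As a hedged sketch of the two most common starting points — note that the app name and file path below are placeholders, not the course's actual files:

```scala
import org.apache.spark.SparkContext

// "local[*]" runs locally using every CPU core; the second argument is just a job name.
val sc = new SparkContext("local[*]", "RDDExamples")

// From a hard-coded collection (handy for tests and small experiments):
val nums = sc.parallelize(List(1, 2, 3, 4))

// From a text file; the path could also be an s3n:// or hdfs:// URI on a real cluster:
val lines = sc.textFile("../some-data-file.txt")
```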
So, once you have an RDD, what do you do with it? Well, there are basically two kinds of things you can do: you can transform an RDD — basically apply some function to the data in an RDD to create a new RDD — or you can apply an action to it, where you say: OK, I want to somehow reduce this RDD down to the end results I want to actually retrieve at the end. Let's talk about transformations first. The most common operation you can do on an RDD is the map function — we'll look at examples of this later on, so just stick with the concepts for now. What map does is apply a function to an entire RDD. So if you have an RDD full of data and, let's say, you want to extract one field from it, or perform some parsing on it, or somehow multiply everything in it, you'll just pass a function into map that says: this is how I want to transform every row of my RDD. It will go through your RDD one line at a time, apply your function to each line, and produce a new RDD that's been transformed — and under the hood it will magically distribute that, if necessary, and deal with fault tolerance and whatnot. flatMap: very similar concept. The difference there is that you don't necessarily have a one-to-one relationship between the rows of the RDD that you start with and the transformed RDD. With flatMap, you can look at an individual line and say: OK, I'm not actually going to output that at all to the new RDD I'm creating, or maybe I want to output multiple lines for this one line in my new RDD. So if you don't have a one-to-one relationship between the rows of the RDD you're starting from and the one you're transforming it into, flatMap is the one to use. The rest are pretty self-explanatory. filter just takes a Boolean function that tells it: I want to take out everything in this RDD that doesn't fit some criterion. It's a very good way to trim down an RDD initially and just pare it down to the data you actually need — and the sooner you can do that, the better, because the less data you have to push around in Spark, the faster it will perform and the more reliable it will be. distinct just returns distinct rows, so if you have duplicate rows in your RDD, it can get rid of them automatically. sample creates a random sample from an RDD, which is also useful for testing: a lot of times you want to develop your script on your desktop, so you take a little random sample of a larger data set so you can run it quickly on your desktop and iterate more quickly on your code. And finally, you can perform set operations on RDDs: unions, intersections, subtractions, and Cartesian products between any two RDDs, producing a resulting RDD from those operations.
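For a feel of what those look like in code, here's a hedged sketch that reuses the sc and lines values from the earlier sketch (remember, none of this executes until an action is called):

```scala
// map: one output row per input row.
val lengths = lines.map(line => line.length)

// flatMap: zero or many output rows per input row.
val words = lines.flatMap(line => line.split(" "))

// filter: keep only the rows that pass a Boolean test.
val nonEmpty = lines.filter(line => !line.isEmpty)

// distinct and sample:
val uniqueWords = words.distinct()
val tenPercent  = lines.sample(withReplacement = false, fraction = 0.1)
```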
Let's look at an example of using the map transformation. Say, again, that we're just taking a hard-coded list, 1, 2, 3, 4. What we're doing here is taking our SparkContext object, sc, and calling parallelize on it with that hard-coded list, and that produces a new RDD that we're calling rdd — real creative, huh? That RDD just contains four rows: 1, 2, 3, and 4. Now let's see how map works here. Getting into functional programming a little bit with Scala: we're taking the RDD object and calling map on it, and the parameter we're passing it is actually a function. This function is just saying: take as input the value x and return the value x times x. So by passing in this function that squares every row in the RDD, we get back a transformed RDD, squares, which contains the rows 1, 4, 9, and 16. So it's an important concept to grasp: basically, you're passing in a function here to map that gets applied to every individual row in your RDD and creates a new RDD as the result. OK, so we're seeing a couple of different functional programming paradigms here. One is the concept of having a little self-contained function used as a parameter, and also notice that we're using vals here and not vars, which is also good practice for functional programming: although map creates a new RDD, it doesn't modify the original RDD — it returns a new one that can remain constant and immutable. All right, so just to reiterate that point: passing in this little function of x, with the funny little arrow and x times x, is syntactically exactly the same thing as doing something like this: if we had a separate function named squareIt that takes as input an integer x parameter and returns an integer that is x times x, we could call rdd.map and pass in the name of that function. The inline version is just a more compact way of handling very simple transforms that can be done as one-liners. Now, if you do have a more complicated transformation you want to pass into map, it might make sense to actually split it out into its own function, like we're doing here, but for simpler transformations, like just squaring everything in an RDD, the inline syntax is going to be a lot simpler to use. OK, let's talk about actions you can perform on an RDD. Here are some of the more popular actions — there are certainly more that you can look up if you want to. collect is what actually takes the results of an RDD and sucks them back down to your driver script; at that point you can actually manipulate those results, print them out, save them out somewhere, whatever you need to do. But remember, in Spark, nothing actually happens until you call an action on an RDD. When you call collect or one of these other actions, it will go back and figure out the optimal path for actually producing the results you want, and at that point it constructs a directed acyclic graph, executes it in the most optimal manner it can on your cluster — or on your desktop, whatever you're running on — and then returns the results. So all of these actions are what trigger something to actually happen in Spark. Again: collect just collects all of the data in the RDD and passes it back to your driver script, where you can do stuff with it. Some other examples: count, if you just want to get a count of how many rows are in an RDD, is a quick way to do that. countByValue will look at the unique values in your RDD and give you back a count of how many times each value appears — remember, an RDD can contain duplicate rows, that's perfectly fine, and if it does, you can use countByValue to get a count of how many of each unique row value you have. take will just take the first n results from an RDD, so if you just want to take a little peek at what's in there for debugging purposes, that can be a useful thing to do: just take, say, the first 10 rows of the RDD to make sure the format looks right and print them out — a common thing. top: similar idea. And reduce is probably one of the more common things to do with an RDD: it lets you combine together the values in an RDD using a function you define. Maybe you want to add them all up, maybe you want to multiply them together, maybe you want to smush them together in some other way — reduce is what will do that for you, and it can be a very powerful thing. (And for key/value RDDs there's a related reduceByKey that combines all the values associated with a given key; we'll go into that in more depth soon.)
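Here's a hedged sketch of those actions on a small RDD — again assuming the sc value from the earlier sketch; the duplicate 2s are only there so countByValue has something to count:

```scala
val rdd = sc.parallelize(List(1, 2, 3, 4, 2, 2))

val everything = rdd.collect()        // pull every row back to the driver
val howMany    = rdd.count()          // 6
val tally      = rdd.countByValue()   // Map(1 -> 1, 2 -> 3, 3 -> 1, 4 -> 1)
val firstFew   = rdd.take(3)          // first 3 rows
val topTwo     = rdd.top(2)           // the 2 largest values
val total      = rdd.reduce((x, y) => x + y)   // 1+2+3+4+2+2 = 14
println(total)
```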
top() is a similar idea. And then there's reduce(), probably the most common way to boil an RDD down: it lets you define a function that tells Spark how to combine values together. Maybe you want to add them all up, maybe you want to multiply them together, maybe you want to smush them together in some other way. We'll go into this in more depth soon, because when you have key/value pairs in an RDD, there's a version of this that combines together all the values associated with a given key, and that can be a very powerful thing.

So I want to repeat the point: nothing actually happens to an RDD until you call an action on it. That's when your DAG, your directed acyclic graph, gets constructed, Spark figures out the optimal way to produce the results you want, and things actually kick off and get executed on your cluster. This can be a little confusing when you're debugging: you might have a bunch of transformations, like maps, and a print line in your script to sort of see what's going on at that point, and it might not do what you expect, because that map() call hasn't actually caused anything to get executed yet. Nothing gets executed until you later call an action like collect(), reduce(), or count().

And with that, you've got the basics of RDDs. It's going to sink in a lot more once we look at some examples, so up next, let's take another look at that ratings-counter example we made earlier in the course and try to understand it in more depth, now that we understand these basic concepts of Spark, Scala, and RDDs.

11. Ratings Histogram Walkthrough: So now that we have some of the basics behind us, let's go back and take another look at that ratings-counter program we wrote back in lecture 2, way back at the beginning of the course, break it down, and try to understand in a little more depth what's going on. If you open the Scala IDE back up, you should still have it there, and if you want to run it again, go to the Run menu, go to Run Configurations, make sure RatingsCounter is selected, hit Run, and off it will go. Just to remind you what this script does: it goes through 100,000 real movie ratings and counts up the distribution of the different rating scores. You can see here we have 6,110 one-star ratings, 21,000-some-odd five-star ratings, et cetera.

So let's dive in and understand how this actually worked. It's not a whole lot of code, so keep it up on your screen and follow along with me, and we'll go through it one line at a time. Let's look at the first few lines. The first thing we're doing is declaring what package this code lives under: package com.sundogsoftware.spark. Remember, that's the package name we settled on while creating our project back in lecture 2, and it's just a unique way of segregating our code so it doesn't collide with other people's libraries. Next, we import the specific libraries we need for this script: everything we need from the Spark library, specifically the SparkContext object.
We also import the Log4j package, because we're going to do some things with the log level, and that's actually the next line in the code. I'm not going to go into much depth on it, but we're just using Log4j to only print out error-level messages.

The first bit of Spark code we run is creating the SparkContext. Remember, the SparkContext object is what we use to create our RDDs, so it's kind of the first thing we need to make. This line says we're going to create a new SparkContext object called sc, which is an immutable SparkContext. The "local[*]" part means we're running on our local machine, not on a cluster at all, but the star indicates that we can use all of the cores on our CPU to distribute the work locally if we need to. So if you have a multi-core CPU, Spark can still distribute the job across those cores. The second parameter is just a name for the SparkContext itself; we're naming it RatingsCounter so that if we're looking in the Spark UI for progress on the job, we know what to look it up under. It labels this context in the UI. This job runs too quickly for us to actually watch it there, but later in the course we'll look at an example of doing that.

So what happens first? We take that SparkContext object we created and call textFile() on it with a path to where the MovieLens 100k ratings data set lives. The u.data file is what contains the actual ratings data, and textFile() imports it one line at a time into a new RDD called lines. So lines is a data set where every row contains one line of the input data in u.data. What does that actually look like? Here's a little sample of what's in that file: the format is basically user ID, movie ID, rating, and timestamp. The way to interpret each line is, for example: user ID 196 watched movie ID 242, rated it 3 stars, at this timestamp, and so on and so forth.

OK, so we've imported the raw data we're starting with, and it lives in an RDD called lines. Now, the next thing we want to do, and this is generally good practice, is throw away all the information we don't need. A lot of the time and resources a Spark job requires depend on how much data it has to push around, so the sooner you can get rid of data you don't need, the better, and that's what we're doing here. We call map() on that lines RDD to transform it into a new RDD called ratings, which parses each line and extracts only the field we care about: the third field, which is the actual rating value itself. Remember, our end result is just the breakdown of how many times each rating occurs, so we don't care about the users, the movies, or the timestamps. All we care about are the rating values.

Let's look at what's going on in that line. We're calling map() to transform the lines RDD, passing in a function that says: take every individual line, call it x, treat it as a string, split it into a list based on tab characters (because I happen to know this file is tab-delimited), and then extract field number 2 from that list. Remember, we start counting at zero, so field 0 would be the user ID and field 1 would be the movie ID.
And field 2 is the rating; that's the one we care about. So this function gets applied to every line of the input file, and we get back our ratings RDD, where only the ratings survive: each row is just the rating value for one individual rating. In this simplified example, we start with a handful of input lines and end up with just the list 3, 3, 1, 2, 1, because those are the rating star values for each row of input data. With me so far? Don't be afraid to pause and re-watch things. This is all very important to understand; we're going through the real fundamentals of Spark programming here.

So now what? The end result we want is a count of each rating and how many times it occurs, and Spark has a handy function called countByValue() that will do that for you. Remember, countByValue() is an action, so when we call it, Spark actually goes off, computes its directed acyclic graph, and starts chugging away on all the data, distributing it if it can. countByValue() takes an RDD but gives you back a plain Scala Map object that maps each value to how many times it occurred. We started with an RDD that just contains rating values, so countByValue() gives us back a Scala Map we can do whatever we want with, mapping each rating value to its count. In this little example, rating 3 occurred twice, rating 1 occurred twice, and rating 2 occurred only once. Makes sense: we had two threes, two ones, and one two.

The last thing we have to do is display our results in a nice way. To do that, we convert our results Map to a sequence, which is something we can actually sort, and we sort it by the first field, the rating value itself. Then we take that resulting sequence and call foreach() on it to apply println to each row, and that's what prints out each individual row of the results in sorted order. Our final results here are: a rating of 1 occurred twice, a rating of 2 occurred once, and a rating of 3 occurred twice in this very simplified example. It's not really that complicated, but it does illustrate just how simple it can be to perform big distributed operations with just a few lines of code in Spark, which is pretty cool stuff. It's really just that easy. If you need to watch that again, please do; go stare at the code some more, play around with it, and then we'll talk a bit more about what's really going on underneath the hood.

12. Spark Internals: Let's talk a little bit about what's going on under the hood with that example we just ran, the histogram of ratings from the movie ratings data. It's a good thing to understand what's actually happening, not just for academic reasons, but because you want to make sure you're structuring your Spark scripts in such a way that they perform in an optimal manner, and having some idea of what's going on underneath can help you do that.
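For reference as we dig into the internals, here is a condensed, hedged sketch of the ratings-counter flow we just walked through. The object name, the relative path to u.data, and the exact layout are placeholders; your project's file may differ slightly.

    package com.sundogsoftware.spark

    import org.apache.spark.SparkContext
    import org.apache.log4j._

    object RatingsCounterSketch {
      def main(args: Array[String]): Unit = {
        // Only show ERROR-level log messages.
        Logger.getLogger("org").setLevel(Level.ERROR)

        // Run locally, using every core of this machine.
        val sc = new SparkContext("local[*]", "RatingsCounter")

        // Each row of u.data is: userID <tab> movieID <tab> rating <tab> timestamp
        val lines = sc.textFile("../ml-100k/u.data")

        // Keep only field 2, the rating value, from each line.
        val ratings = lines.map(x => x.split("\t")(2))

        // countByValue() is an action: this is where the job actually runs.
        val results = ratings.countByValue()

        // Sort by rating value and print each (rating, count) pair.
        results.toSeq.sortBy(_._1).foreach(println)
      }
    }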
We'll see, as we get into more complicated examples later on, how we can apply this knowledge. So let's look at that little ratings-counter example we ran in the previous lecture. What actually happened when we called countByValue() on that RDD? That was the last step in our script, and it was an action, which caused Spark to go back and figure out an execution plan for how to actually get the results we asked for. Spark keeps track of all the RDDs we've chained together and how they connect to each other, and based on that information it constructs a directed acyclic graph. In this case it might be a very simple one: we start with a textFile() command that imports a bunch of raw data into an RDD, we then map() that RDD to parse out the information we care about, which is just the ratings themselves, and finally we call the countByValue() action to total up the number of each rating type.

That's the execution plan, and you can picture how it gets manifested: say we have five lines of ratings data, and these get piped down to where they're mapped to extract just the ratings themselves, and finally we add them all up. An important thing to realize is that with map() there's a one-to-one relationship between each input row and each output row of the RDD, so everything can stay partitioned the same way at that step. These are all just parallel lines where we're taking a bunch of data and transforming it, so it can happen very easily in a distributed manner. But what happens when we call countByValue()? We might have to shuffle things around, and this is where it gets a little more complicated: some of these rows have to get combined together into a final result, and some values have to get moved around to reach it. That's what we call a shuffle operation in Spark, and it can cause Spark to push a bunch of data around your cluster, which can be a very expensive operation. So a lot of times, when you're optimizing Spark jobs, you want to be thinking about how to minimize shuffle operations and keep things parallelizable when you're mapping.

Once Spark has the execution plan, it looks at the places where data needs to be shuffled and uses those as boundaries to delineate stages. Spark will break this job up into, say, two stages: the first stage is the textFile() where we read in the data, plus the map() that extracts the data we need, which can all run together in parallel, so it's one stage. But the countByValue() command needs things to get shuffled, so that gets handled as a second stage. Stages are created based on chunks of processing that can be done in parallel without shuffling things around. When you actually run a distributed Spark job, the output will give you indications of which stage it's running, and that's what it's talking about. Then, once we have our stages, Spark splits those stages up into tasks, and that's where things actually get distributed.
Notice that we have these parallel lines of transformations going on: maybe Spark will put a couple of them on one node of your cluster, a couple on another node, and another on yet another one. Tasks break up the parallelizable work into discrete pieces that can be processed individually and in parallel. Makes sense? So: we start with an execution plan, the execution plan is broken into stages based on work that can be processed together in parallel without a shuffle involved, and then stages get broken up into tasks that are distributed to individual nodes on your cluster. That's all there is to it, and then Spark just goes and does it. It's up to your cluster manager at that point to get the data where it needs to be, and to collect it all back where it needs to be in the end, in your driver script. The machines start running and you get back your results. At a high level, that is how Spark works internally, and it's good to understand, because when we're trying to optimize things later on we'll be talking about things like explicitly partitioning your data and structuring things in a way that reduces the number of shuffles. Keep that in mind. All right, let's move on to another real-world example next.

13. Key / Value RDD's, and the Average Friends by Age example: Let's talk about key/value RDDs, a very common thing to use, so we need to understand how they work. Basically, any time you have to aggregate information down by some key, you're going to be using key/value RDDs. For example, consider a case where we need to figure out the average number of friends, broken down by age, in some fake social network. Imagine we have a data set that, for each person, has their age and how many friends they have, along with whatever other information. Now, if I want to figure out the average number of friends broken down by age, I need to somehow aggregate all of the friend counts for every given age. In this case, the key is the thing we want to aggregate on, the age, and the values are the things we're aggregating, the friend counts. We want to aggregate together all the friend counts broken down by age and end up with: what's the average number of friends a 20-year-old has, for the key 20? What's the average number of friends a 33-year-old has, for the key 33? And so on for all the other ages.

So how does that work? Syntactically, all we need to do is represent each row not as a single value, but as a tuple of two values. In our previous examples we had RDDs where every row represented a line of input text, or an integer, or something like that; a row can also be a tuple, and if that tuple contains two objects, Spark automatically treats it as a key/value pair, and there are some special things it can do with that sort of information. Look at this example: we're taking an RDD that contains, say, one piece of information per row, and transforming it into a key/value RDD, and all the transformation function does is map each row x to the tuple (x, 1). That creates a key/value pair of x, the original value, as the key, and the value 1, for each row. And that's it; there's nothing special beyond that.
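As a tiny, hedged illustration of that pattern (assuming rdd is any existing RDD):

    // Any RDD of two-element tuples is a key/value RDD: here the key is the
    // original value x and the value is just the number 1.
    val keyValueRDD = rdd.map(x => (x, 1))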
You now have a key/value RDD simply by virtue of storing tuples with two things in them. You can store more complicated structured data as well: you can have a tuple where the key is some value and the value is itself a more complex tuple, so you can nest information and structure it as deeply as you want. But at the end of the day, as long as the top level has two things, a key and a value object of some sort, that counts as a key/value RDD.

Now, what can you do with a key/value RDD that you couldn't normally do? One very common operation is reduceByKey(), which lets you provide an arbitrary function for combining values together, grouped by key. Look at the syntax in this example: we're taking a pair (x, y) and transforming it to x + y. This is a little confusing, so stay with me: x and y here do not represent a key/value pair; they represent two values for the same key that need to be combined. reduceByKey() is saying, "give me a function that takes two values for the same key and defines how they get combined in my new RDD." Say I just want to add up all the total friend counts for everyone who's 20 years old. For the key 20, the key never changes; it gets carried over to my new RDD automatically. What does change is the value, as I aggregate together all of the values for key 20. This function says: given two values for the same key, how do I combine them? In this case I want to add them, and Spark will take every value found for a given key and add them all up, two at a time. Why two at a time? Remember, this might be distributed, so we can't just take everything at once and add it all together in one big function; it has to be done piecemeal. So we need to tell Spark: given a running total and some new value, how do I combine them? That's what this function does. It will make more sense when we get to some code.

We can also do groupByKey(). If you don't want to aggregate the values for a given key, you can group them all together into a big old list instead, and that's what groupByKey() does. If I just want to collect together all of the values seen for a given key, groupByKey() will return an RDD of key/value pairs where each key is a unique key and the value is a list of all the values associated with that key. We can also do sortByKey(), which just sorts an RDD by its key values, very straightforward. And you can retrieve RDDs of just the keys or just the values of a key/value RDD; the keys() and values() functions do that for you.

Some more things you can do: you can actually do SQL-style joins if you have key/value pairs. If you think about it, a key/value RDD is a lot like a NoSQL database, and we can do some database-like things now that we have more structured data. You can do joins of various flavors, and we'll look at some examples later in the course; it's a little too complicated an issue for right now, but just know that it's possible to do SQL-style joins when you have key/value RDDs to work with.
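Here's a quick, hedged sketch of those key/value operations, using a small hypothetical RDD of (age, numFriends) pairs made up just for illustration; it assumes a SparkContext named sc.

    val pairs = sc.parallelize(List((33, 385), (33, 2), (20, 100)))

    val totals  = pairs.reduceByKey((x, y) => x + y)  // (33, 387), (20, 100)
    val grouped = pairs.groupByKey()                  // (33, [385, 2]), (20, [100])
    val sorted  = pairs.sortByKey()                   // rows ordered by age
    val ages    = pairs.keys                          // 33, 33, 20
    val friends = pairs.values                        // 385, 2, 100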
One more thing: if you're just going to transform the values in a key/value RDD and leave the keys alone, it's more efficient to use mapValues() and flatMapValues() instead of plain map() and flatMap(). So if you're mapping just the values of a key/value RDD, use mapValues() or flatMapValues(); it's simpler from a syntactic standpoint and also more efficient.

Let's dive into an example to make this all real. It's nice to talk theory, but it doesn't sink in until you actually do it, right? So imagine we have some social network data to play with, and each row of raw input data looks something like this: a bunch of comma-delimited rows consisting of a user ID number, the user's name, the user's age, and the user's number of friends. You can probably tell from this fake data that I'm one of those weird Star Trek fans you hear about and probably do your best to avoid. What we're trying to do, again, is find the average number of friends for a given age. In this fictitious data, Will is a 33-year-old, and so is Jean-Luc; Will has 385 friends (he's super popular), while Jean-Luc is a little socially aloof, so he only has 2. We want to figure out how to combine these two guys together, because they both have the same age, and compute their average number of friends, and do the same for the remaining people in the data set as well.

So let's look at some code. How do we get started? As always, the first thing you want to do is parse your input data and throw away what you don't need, and that's what our parseLine function does. Given a line of input data, it splits it up based on the comma characters in it and accesses fields number 2 and 3; remember, those are actually the third and fourth fields, because we start counting at zero. We convert field 2 to an integer and interpret it as the age, and field 3 also gets interpreted as an integer and stored as the number of friends for this person, and we return a tuple, a key/value pair, of an age and a number of friends. Notice we've completely discarded the user ID and the user name from the input data, because they're irrelevant to the problem we're trying to solve: all we want to know is the average number of friends for a given age; who these people are doesn't matter.

Once we're done with this, we call sc.textFile() to load in the raw data, and then we map() it using the function we just wrote, and out comes a key/value RDD where the keys are the ages of each individual and the values are the number of friends for each individual. We haven't aggregated anything yet; all we've done is create a key/value RDD in preparation for the operation we need to do.

So now what happens? Well, things get a little fancy here. Take a look at that big, scary line of code. It's a little intimidating, but it's common practice in Spark scripts to chain things together in this manner. If you look at it, we're taking the key/value RDD of ages and numbers of friends, and calling mapValues() on it to just transform the values of each row.
And then we're chaining yet another operation on top of that with a .reduceByKey(). So it's saying: start with our key/value RDD of individuals' ages and numbers of friends, map the values using this function here, and then take the resulting RDD from that step and apply reduceByKey() to it as well. We're basically doing two steps in one, which is a pretty common thing to see, but let's break it down into its components and talk about what's going on.

The first part of this statement is rdd.mapValues(), using a transformation function of x to (x, 1). All we're doing is transforming each number of friends into a tuple that contains the number of friends and the value 1. What's the method to our madness? Why am I doing that? Well, we need to end up with the average number of friends, remember, and to compute an average you need two numbers: the grand total, and the number of instances you added up, and you divide those two to get the average. So, thinking ahead a little bit, I'm going to transform all of these key/value pairs of age and number of friends into a slightly more complicated format, where the key is still the age, but the value is now a tuple itself: the number of friends and the value 1. You can see what happens when I start to add these all together: if I add together both fields of these tuples for each given key, I end up with both the grand total of how many friends all the 33-year-olds had, and also the number of 33-year-olds, because I've added up all those ones for each individual I saw. That's what I'm trying to do here, and that's what reduceByKey() does in the next clause of this expression.

reduceByKey() takes two of these values at a time. Remember, in reduceByKey(), x and y are not a key/value pair; they're two different values, and in this case those values are two tuples, where the first element is a number of friends and the second is the number 1. I apply a function that adds them together: the result is the sum of the two friend counts coming in, plus the sum of the two ones that come in. As we keep a running total over the entire data set, we end up with the grand total of how many friends, in sum, all the 33-year-olds have, and how many 33-year-olds there are in our data set. And of course it does that for every unique key automatically. Makes sense? It's a little complicated, probably the most complicated thing we've done so far in the course, so if you need to go back and stare at this slide some more, don't be afraid to do so. I won't judge.

All right, the next thing we need to do is use that information to compute the averages themselves. At this point we have key/value pairs where the keys are the ages and the values are tuples of the grand total number of friends and the number of people of that age. All we need to do now is divide those two to get the averages, and that's what this mapValues() call does: for a given value, which is one of these tuples, just take the first field and divide it by the second field to get our final result. So, in this particular example, for 33-year-olds we end up with the key 33.
We take its value and map it so that we just divide those two fields: 387 divided by 2 works out to 193.5, and we end up with a final result of 33-year-olds having an average of 193.5 friends. Not sure what half a friend looks like in reality, but let's not go down that line of thought; it's a little disturbing. Finally, we need to collect our results, and remember, nothing happens until we actually call an action, so we collect() the results into a plain old Scala object, then sort them and apply println to each individual entry to get the final output. With that, let's go take a look at the actual code.

14. [Activity] Running the Average Friends by Age Example: So let's make it real and actually run that friends-by-age script on some data, and actually use Spark for it. I've opened up two different folders on my hard drive, and if you want to follow along, please do the same. One is the C:\SparkScala folder that I created earlier in the course; that's the directory we're using for the course itself, for all the data, scripts, and projects we'll be running. And over here I have the course materials I downloaded earlier, which are in my Downloads\SparkScala folder. The first thing you need to do is get a copy of the data we generated for this exercise, which is the fakefriends.csv file. If you click on it and preview it, you can see it has the format I described in the previous lecture: a user ID, followed by a user name, the user's age, and the number of friends. We fabricated a bunch of different people at random, using a library of fictitious names of various Star Trek and Deep Space Nine characters, because I'm a nerd. But again, all we really care about are the ages and the numbers of friends; that's the information we're trying to aggregate. So go ahead and copy the fakefriends.csv file over into the project directory.

Now, the script we're going to use is in FriendsByAge.scala, but we have to import it into our package. Open up the Scala IDE, make sure your workspace is set to your workspace for this course, and up it should come. Open up the SparkScalaCourse project, open the source folder, and select the com.sundogsoftware.spark package. Right-click on it, select Import, General, File System, then browse to the download directory, which in my case is Downloads\SparkScala, select the FriendsByAge.scala script, and import it into the package. There we have it. Now I can open up that package, double-click FriendsByAge.scala, and take a look at the code, which is pretty much what we described in the previous lecture. But let's review what's going on, just for the sake of letting it sink in. We have our usual boilerplate stuff up at the top, where we declare the package we're in and import the Spark libraries and the Log4j package that we need. We create a FriendsByAge object, and within it we have two functions. One is our main function, where all the action happens, and then we also have this parseLine function, whose job is just to take our input data and extract the information we actually care about.
In this case, that's the age and the number of friends for each individual in our input data. Again: we take each line of raw input, split it up based on commas (because it's a comma-separated values file), extract the age from field number 2 (the third field, since we start counting at zero) and convert it to an integer, extract the friends count from field number 3 (the fourth field) and convert that to an integer, and return a tuple of the age and the number of friends for each individual. That becomes the key/value pair for the key/value RDD that results from this operation.

Putting it into action, we go to our main function. We set our logging level to ERROR to cut down on the log spam, and create our SparkContext to run locally using every core, naming this context FriendsByAge. We call textFile() on the SparkContext to load the fakefriends.csv file we just copied into a lines RDD. At this point it's not a key/value RDD; it's just an RDD where every value is a line of raw input. Then we apply our parseLine function to that RDD: we call lines.map(), passing parseLine in as the parameter, and that applies parseLine to every individual input row and gives us back a key/value RDD of two integers, an age and a number of friends. The key is the age; the value is the number of friends.

And here's that big scary line again that we talked about previously, where we apply two steps in one. First, we take those key/value pairs of ages and friend counts and map the values so that each value isn't just a raw number of friends, but a tuple of the number of friends and the number 1. That just allows us to add up both the total number of friends and the total number of people for a given age, and that's what reduceByKey() does: given two of these values, which are tuples of friend counts and people counts, it adds them together one component at a time. We add together every friend count and every 1 for each individual, and end up with our resulting totalsByAge RDD, which contains one row for each unique age we've seen, with a tuple of the total number of friends for that age and the total number of people we saw of that age (from adding up all those ones). That gives us the numbers we need to compute the averages: by mapping those resulting values, dividing the first element of each tuple (the total number of friends) by the second (the total number of people), we end up with the average, and that goes into our averagesByAge RDD. We collect the results, sort them, and print out each line.

Let's see if it works. To do that, we need to create a run configuration for this particular class; remember, FriendsByAge was the name of our object. Go to Run, Run Configurations, and create a new Scala application. Double-click on that; we'll call it FriendsByAge. The project is already correct, SparkScalaCourse, but the main class is going to be com.sundogsoftware.spark.FriendsByAge. Now we can hit Run and see what happens. Hey, it worked! There you have it. You can't read anything into this data, because it was all randomly generated; this is not real data. But it could be.
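If you'd like to see the heart of that script in one compact snippet, here is a hedged sketch of the core chain, assuming rdd already holds the (age, numFriends) pairs produced by parseLine. (This version uses a Double for the average; the course script may compute it slightly differently.)

    // (age, numFriends) -> (age, (numFriends, 1)), so we can total up both the
    // friend counts and the number of people for each age.
    val totalsByAge = rdd.mapValues(x => (x, 1))
                         .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

    // Total friends divided by total people gives the average for each age.
    val averagesByAge = totalsByAge.mapValues(x => x._1.toDouble / x._2)

    // Nothing runs until this action; then sort the results and print them.
    averagesByAge.collect().sorted.foreach(println)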
You know, maybe you're working at a big social network company and you have something similar. To interpret these results: for example, the average 18-year-old in our fake data had 343 friends, the average 40-year-old had 250 friends, and the average 57-year-old had 258 friends. It would be pretty interesting to see what that looks like with real data, but that shows you how to do it, and there you have it: a working key/value RDD example. Feel free to play around with it, mess with it, try changing things, and get familiar with it.

It's one thing to watch me write code and walk through it, but it's another thing to do it yourself and get your hands on it, so I encourage you to do so. Here's a little challenge for you: go back to that friends-by-age example and modify it to show the average number of friends by first name instead. It should be pretty straightforward; it's just a matter of extracting a different field from the source data and using that as your key (and obviously you won't be converting names to integers). The objective is to get your hands on this code, start modifying it yourself, and build some confidence that you can go in and fiddle with Spark Scala code and actually make something work. So give that a try: find the average number of friends by first name instead of by age, which would be an interesting thing to find out in the real world too, and I'll see you in the next lecture.

15. Filtering RDD's, and the Minimum Temperature by Location Example: Let's talk about the filter() operation on RDDs and how you can use it to strip out information you don't need from an RDD and save Spark from doing unnecessary processing. We'll apply this to a real example using some real weather data: we're going to take a bunch of weather data from Europe and find the minimum temperature for a given location for an entire year, the year 1800 to be precise. Let's talk about how that works.

Filter functions are pretty straightforward: you pass in a function that returns a Boolean value, and that true-or-false value dictates whether or not a given row of the RDD gets retained. If a row fails the condition, it's discarded from the RDD. Now, our weather data has different kinds of information in it, and there's a field that indicates what kind of data a given line of input represents: it might be a minimum temperature for the day, a maximum temperature for the day, or a precipitation amount for the day. All we're interested in for this example is the overall minimum temperature for the year for a given place, so we don't care about the maximum-temperature lines or the precipitation lines; we can throw those away, and this filter() operation is the way to do that. So let's imagine we have an RDD that already contains parsed-out weather records. We can call a filter function that takes a given input line and checks whether its second field, the one that indicates the data type, is equal to "TMIN". If it is, the row is kept; if not, it's discarded. The resulting RDD, containing only TMIN lines, then goes into the new minTemps RDD we've created.
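A minimal sketch of that filter step, assuming parsedLines is an RDD of (stationID, entryType, temperature) tuples as described above:

    // Keep only rows whose entry type is "TMIN"; everything else is discarded.
    val minTemps = parsedLines.filter(x => x._2 == "TMIN")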
So again: you use filter() on an RDD to provide a Boolean function that dictates whether or not a given entry gets retained. Let's make it real and look at some actual data. We have some real weather data from the year 1800 for a couple of locations, and this is the format: first is a weather-station identifier, followed by a date, then what kind of data the line represents, and the associated value. For example, this weather-station identifier (I forget whether it represents Paris or Prague; it's one of them) says that at this station, on January 1st, 1800, the maximum temperature was -7.5 degrees Celsius. The data format is a little strange: values are actually tenths of a degree Celsius. For that same place on that same day, the minimum temperature was -14.8 degrees Celsius. Likewise, we have a precipitation line, which we don't care about (all we care about are the minimum-temperature lines), and then a maximum temperature for another weather station and a minimum temperature for that other station, all on the same day. For the problem we're trying to solve, we only care about minimum temperatures, so we're going to take in all these lines of input data, parse them out, and throw away everything that isn't a TMIN entry.

How does that work? We start by reading in the raw data and parsing it into something more structured. We take our SparkContext and call textFile() to import the comma-separated value list of the year 1800's weather data, and then we apply a parseLine function as a mapper on that raw data. What does parseLine do? It splits each line into individual fields based on the comma characters and parses them out: the first field is the station ID; field 2 (actually the third field) is the entry type, which will be TMIN, TMAX, or PRECIP; and the temperature gets extracted from the fourth field, field number 3, converted to a floating-point numerical value. Remember, that's in tenths of a degree Celsius, so to convert it back to Celsius we multiply by 0.1, and then to convert from Celsius to Fahrenheit we multiply by 9/5 and add 32. So this reads in the temperature value and converts it to Fahrenheit for us. At that point we return a tuple containing three values: the station ID, what kind of entry it is, and the temperature.

The next thing we want to do is discard all the data that isn't a minimum temperature. To find the overall minimum temperature for a location for the entire year, we're going to look at the minimum of all the minimum temperatures reported for that location. So, with the filter operation we looked at before, we look at every input line, and if that second field, which indicates the data type, is equal to "TMIN", we keep it; otherwise we throw it away. At this point we have a new RDD called minTemps that contains parsed information, but only TMIN data types. Now we have an RDD that contains three things: a station identifier, TMIN, and a temperature, and that TMIN is no longer meaningful, right? Everything in the RDD has TMIN in it, so let's just get rid of it. Again: the less data we have to deal with, the better.
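Here is a hedged sketch of that parseLine function as just described; the field positions match the 1800.csv layout discussed above, and the name is just what the course uses for a parsing helper.

    def parseLine(line: String): (String, String, Float) = {
      val fields = line.split(",")
      val stationID = fields(0)
      val entryType = fields(2)
      // Raw values are tenths of a degree Celsius; convert to Fahrenheit.
      val temperature = fields(3).toFloat * 0.1f * (9.0f / 5.0f) + 32.0f
      (stationID, entryType, temperature)
    }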
Getting rid of it is what the next line of the script does: it takes the minTemps RDD and applies a mapping function that converts each tuple of three data points into a tuple of two data points. And you know what a tuple of two data points is: a key/value RDD. So at this point stationTemps is a key/value RDD, where the key is the station identifier (the location, whether that's Paris or Prague or whatever it is) and the value is the minimum temperature seen for that day.

Finally, we do a reduceByKey() operation, and it's a little different from the ones where we added things up or computed averages: instead, we use the min operator to preserve the minimum value seen as we go through all of the minimum temperatures for a given station ID. stationTemps, to remind you, has one row for every day in our data set for a given station ID, with that day's minimum temperature. We aggregate everything together by station ID, the key, and keep the minimum minimum temperature seen across all the minimums reported for that station throughout the year. That's what this reduce function does: for two given TMIN values for a given key, it preserves the smaller of the two, and by doing that, it finds the minimum overall temperature reported for that station for the year.

Finally, we just need to collect the results and print them out. We call collect() to transform our RDD back into a Scala object we can iterate through, which we do after sorting it. There are more compact ways of doing this; I'm doing it in three different lines just for clarity. We extract the first element of each key/value pair and assign it to the station ID, the temperature is in the second, and then we format that temperature so it only has two decimal places and print out the final result for each station we run into. Let's go run it in the next lecture and see what we come up with.

16. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum: So instead of just talking about this code, let's actually run it. As before, I have two windows open: one is my C:\SparkScala folder, where I keep all the materials and projects for the course as we go, and the other is my Downloads\SparkScala folder, where I downloaded all the course materials. Let's go to the 1800.csv file; that's our actual weather data, real weather data from the year 1800 for a couple of cities in Europe. You can click on it and preview the contents: it's what I described earlier, a weather-station identifier, a date, what kind of entry it is (TMAX, TMIN, or precipitation), and the actual temperatures, which in the case of TMAX and TMIN are represented as tenths of a degree centigrade. Go ahead and copy that and paste it into the folder you're working from.

Now open up the Scala IDE again, make sure the workspace is set to where you just pasted that file and where your other projects are, then open up the project and the source folder, click on the com.sundogsoftware.spark package, right-click on it, and say Import. Go to General, File System, hit Next, and browse to where we downloaded the course materials, which for me, as usual, is under Downloads\SparkScala.
We're going to select the MinTemperatures.scala file, hit Finish, and then we can open up the package and take a look at it. Let's close all the other files at this point so things don't get too messy. This is the same code we talked about in the previous lecture, but let's take another look, because it sometimes makes more sense when you see it all together. As usual, we start with the package name, and we import the things we need; you'll see we're importing the min function here as well as the usual suspects. We create a MinTemperatures object that contains our Spark driver script. We have our main function, where everything happens, and we also created a function called parseLine that does exactly what it says: it parses a line of raw input from that 1800.csv file and transforms it into a tuple of station ID, entry type, and temperature, the station ID being where the measurement was taken, the entry type being one of TMIN, TMAX, or PRECIP, and, in the case of TMIN or TMAX, a temperature field in Fahrenheit. All this code does is split the line on commas, extract the station ID from the first field, the entry type from the third field, and the temperature from the fourth field, and while it's at it, convert that temperature from tenths of a degree centigrade to Fahrenheit.

So let's dive into main. We set the log level to ERROR so we don't get a bunch of log spam, and create a SparkContext that runs on our local machine using every CPU core, named MinTemperatures. First things first, we load our raw data file, 1800.csv, into a lines RDD, then we map it using our parseLine function, which converts it into more structured data: a tuple on each row of station ID, entry type, and temperature. Next, we filter out anything that isn't a minimum temperature, because, again, our goal is to find the minimum temperature across an entire year of data for each station ID. We pass in a filter function that takes a given line and looks at the second field, the entry type: if it's equal to "TMIN", it lives; if not, it dies. We end up with a minTemps RDD that contains only TMIN entries.

The next thing we do is strip out that middle entry-type field, because it's no longer meaningful: everything in here is a TMIN, so why carry that around everywhere? Again, you want to strip out all the data you can, to minimize the amount of data that needs to be pushed around during a shuffle operation. That's what this map function does: it takes the tuple of three values and converts it to a tuple of two values, and guess what, that makes it a key/value RDD, where the key is the station ID and the value is the minimum temperature seen for a given day. So at this point we have a stationTemps key/value RDD with keys that are station IDs, in string format, and values that are the minimum temperatures seen for every day of the year at that station. Finally, we call reduceByKey() to find the minimum minimum temperature for each station ID.
What this does is take that stationTemps key/value RDD, look at every minimum temperature reported for every day of the year for each station, and preserve the minimum found overall for that station: as two temperature values get compared to each other, we just keep the smaller of the two. What we end up with is the answer we want, this minTempsByStation RDD, which is just keys of station IDs followed by the minimum temperature seen across the entire year for that station. We then collect the results, sort them, extract them, format them the way we want, and print them out.

Let's go see if it actually runs. Go to the Run menu, create a run configuration, double-click Scala Application, and name it MinTemperatures; the main class is com.sundogsoftware.spark.MinTemperatures (double-check that's the correct object name). Let it run and find out if it works... it's doing something... and there we have it, it works! For this station ID, the minimum temperature reported was 7.7 degrees Fahrenheit, to be precise, and for the other station ID, the minimum temperature found in the year 1800 was 5.36 degrees Fahrenheit. Now, as I said, these are Paris and Prague, as I recall, so they should be pretty similar, and they're not that far apart, so that makes sense. It passes our sanity check, and you should always do that with your end results, to make sure there isn't a potential bug you could have caught.

So that's our example of using filter(), and another example of using key/value RDDs. At this point in the course, I think you're about ready to start writing some code of your own, or at least start fiddling with it a little. So here's my challenge to you: what if you were to modify this script to instead find the maximum temperature found for a given station ID throughout the year? What changes would you make? Go ahead, pause and try it. Remember, instead of TMIN we'll be looking for TMAX entries, and instead of preserving the minimum temperature found, we'll be preserving the maximum. Also look for the places where we talk about minimum temperatures in the output and change them to say maximum temperatures, things like that. It's a very simple way to get your feet wet with some Spark development. Hit pause, take a crack at it, and when you're ready, come back and we'll look at how I did it. Ready? OK, let's move on.

We're going to go back to the package, right-click on it, and import MaxTemperatures.scala from where we downloaded everything; again, for me that's under Downloads\SparkScala. Import it, double-click on it, and take a look at what I did. Nothing too crazy here: all I did was change the object name from MinTemperatures to MaxTemperatures. The parsing works exactly the same way, but now we're filtering out everything that isn't a TMAX entry, and the other relevant change is that instead of calling min in our reduceByKey() operation, we're calling max, to find the maximum maximum temperature throughout the entire year for each station. I also just changed the wording of the output to reflect the results we want. So let's go ahead and run that and see if it works.
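If you want to sanity-check your own attempt first, here is a hedged sketch of the two lines that actually differ, assuming the same parsedLines and stationTemps names as in the minimum-temperature script:

    import scala.math.max

    // Keep TMAX entries instead of TMIN...
    val maxTemps = parsedLines.filter(x => x._2 == "TMAX")

    // ...and keep the largest temperature seen per station instead of the smallest.
    val maxTempsByStation = stationTemps.reduceByKey((x, y) => max(x, y))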
Compare that to what you did; hopefully you did about the same thing and got the same results, but if you want to check for sure, let's see if your end results match mine. Go to Run, Run Configurations, and create one called MaxTemperatures; the main class is going to be com.sundogsoftware.spark.MaxTemperatures. Run that, and in this case we end up with an answer of 90.14 degrees for both stations. That seems a little coincidental, but I checked, and it really is the correct result. So if you ended up with an answer of 90.14 for your code that finds the maximum temperature for each weather station in the year 1800, you did it right. Congratulations, good job! We'll do more and more hands-on stuff as we get through the course, but there you have it for now.

You might ask yourself: why are there only two weather stations in here? Well, there wasn't a whole lot of good record-keeping in the year 1800, it turns out. If you get data from later years, there will be more information in it, so if you want to track that data down and play with bigger data sets, feel free to do so. Otherwise, there you have it: some practice with filtering, and with key/value RDDs yet again, using real weather data. If you'd like to challenge yourself, here's an idea: go back and modify this code to figure out which day had the most precipitation for each location we have precipitation data for. The general concept of the code will be the same; you just parse out the data a little differently and treat it a little differently. So have at it: see if you can figure out which day had the most precipitation in the 1800.csv data file we have. Give that a shot, and I'll see you in the next lecture.

17. [Activity] Counting Word Occurrences using Flatmap(): We touched on this a little bit earlier, but let's go into more depth about the difference between the map() and flatMap() operations in Spark. To illustrate it, we'll look at an example of counting up the occurrences of each word in a real book.

To review: the map() operation transforms one RDD into another using some function you provide, and there's always a one-to-one mapping between the RDD you start with and the RDD you end up with. Every row in the RDD you start with results in exactly one row in the transformed RDD when you're using map(). As an example, say we start with an RDD that contains four lines of information, and each row contains a little snippet of the sentence "The quick red fox jumped over the lazy brown dogs." Say we have a little snippet of code that loads that up into an RDD called lines, and for some reason we want to create a highly distributed, scalable Spark job that converts everything to rage caps. We pass in a function that converts each row to uppercase and returns the result into a rageCaps RDD. So map(), in this example, takes every single line of input data, converts it to uppercase, and returns a new RDD with a one-to-one relationship between the number of input lines and the number of output lines: we end up with "THE QUICK RED FOX JUMPED OVER THE LAZY BROWN DOGS," really loudly yelling at you. It's all capitals now.
In contrast, flatMap() can generate many lines from one input line, or even zero lines; sometimes you just want to throw data away as it comes in. Let's look at a more interesting example. We start with the same lines RDD, but this time, instead of converting it to rage caps, we're going to split it up into individual words. We want to take an RDD of sentences, or snippets of sentences, and convert it into an RDD where each row contains a single word, so we end up with many more rows in the resulting RDD than we started with. flatMap() can do that because, instead of returning a single value, the function you pass into it returns a list of values, and that list can contain any number of results, or maybe no results at all. In this example, we pass in the function x => x.split(" "). What split() does, given a string, is return a list of strings delimited by the character you pass in, so this says: take whatever string you give me and give back a list of the strings separated by spaces. flatMap() then takes each element of that list and produces a new row out of it, so the resulting words RDD contains one row for every individual word, or more specifically, every string separated by a space, in the input data. We started with four lines of data, but we end up with however many words were in that data to begin with.

So that's the main difference between map() and flatMap(): map() is a one-to-one relationship between the input rows and the output rows, whereas flatMap() can have a one-to-many relationship, or one-to-one, or one-to-none, whatever is appropriate for what you're doing, as the sketch below shows. A quick note for people who are a little more advanced in Scala: you can also use the Some and None constructs to do a flatMap(); if you return Some and None, None gets interpreted as no values at all, kind of like returning an empty list. There will be examples of that later in the course.

To make it real, let's write a script that puts flatMap() into practice; then it will make a lot more sense. We're going to count words in a book, and just for copyright reasons, I'm providing the text of a book that I wrote. I'm giving you the text for free as part of the course, as a source of natural-language data we can mess with. So let's write a script that uses flatMap() to count the occurrences of each individual word found in this real book.

As before, I have two folders open on my desktop: one is the directory we're working out of for the course, which for me is C:\SparkScala, and over here I have the folder I downloaded all the course materials into, which for me is in my Downloads folder under another SparkScala folder. The first thing to do is select the book.txt file, and if you preview it, you can see it contains the text of a book I wrote that is completely unrelated to this course; it's just a source of a bunch of text data we can play with. Go ahead and copy that and paste it into the C:\SparkScala folder that we're working out of.
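Before we open the IDE, here is the minimal sketch contrasting the two operations, assuming lines is an RDD where each row is a snippet of text:

    // map(): exactly one output row per input row.
    val rageCaps = lines.map(x => x.toUpperCase)

    // flatMap(): each input row can become zero, one, or many output rows.
    // split(" ") returns an array of words, and every element becomes its own row.
    val words = lines.flatMap(x => x.split(" "))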
All right. Next we're going to start up the Scala IDE. Make sure you're in the correct workspace (it should be the same folder we're working from) and open up the SparkScalaCourse project, then go into Source and the com.sundogsoftware.spark package, which contains all the code we've written so far in this course. Right-click on that package and say Import, from the general file system, and select the course materials folder, which for me is going to be under Downloads\SparkScala. You should find WordCount.scala in there; select that and import it into the project, and let's take a look at it. Double-click on WordCount.scala and examine the code. It's very simple; we're just trying to illustrate the use of flatMap instead of map, and it's going to be pretty much what we talked about in the slides. As usual, we have the usual code for declaring our package name and importing the libraries we need from Spark and log4j. We call this object WordCount, and in its main function we set the error level like we always do for logging, and then we create a SparkContext much like we've done before, again running locally in this example using all CPU cores, named WordCount. Now we input the entire book into an RDD called input on this next line, and the way the book is structured, there's actually one paragraph per line, so every row of the input RDD is going to contain one paragraph of text from that book. But what we're trying to do is get a count of how many times each word appears in the book, and that's what we do here with flatMap. Just like in the slides, we're calling flatMap on the entire text of the book, where every row represents a paragraph, and passing in a function that splits those paragraphs up by space characters, giving us more or less a list of the words in that paragraph. This returns a list, and every element of that list becomes a new row in the words RDD. So flatMap is breaking every paragraph out into individual words, and taking every word in that list and creating a new row in a new RDD that we're calling words. Now we can use our friend countByValue to count up how many times each distinct word appears in that words RDD, and then we just print out the results by applying println to each element of the final wordCounts result we got back from countByValue, which is a dictionary of words and how many times they occurred. So let's give it a try. Go to Run, Run Configurations, make a new Scala application called WordCount, and the main class is com.sundogsoftware.spark.WordCount. Let's run it and see what happens. Off it goes. Pretty cool, it actually worked. You can see a sort of list of words that appeared; for example, the word "physical" showed up 17 times in my book, the word "commute" only showed up once, and there are some other popular words in here. "The", of course, is a very common word in this book. But if you look at this, you can see there are some issues with our results. First of all, it's not sorted, which makes this not a very useful piece of output, because I can't really see what the most common words are very easily. Another thing is that there is punctuation included in these words, since we're only splitting on space characters.
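A minimal sketch of that script, assuming book.txt sits one folder up from where the project runs (paths may differ on your setup):

    package com.sundogsoftware.spark

    import org.apache.spark._
    import org.apache.log4j._

    object WordCount {
      def main(args: Array[String]): Unit = {
        Logger.getLogger("org").setLevel(Level.ERROR)
        val sc = new SparkContext("local[*]", "WordCount")

        // Each line of the file is one paragraph of the book.
        val input = sc.textFile("../book.txt")

        // Split each paragraph on spaces; flatMap makes each word its own row.
        val words = input.flatMap(x => x.split(" "))

        // countByValue is an action returning a Scala Map of word -> count.
        val wordCounts = words.countByValue()

        wordCounts.foreach(println)
      }
    }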
So, for example, "row," with a comma is counted as a different word than just "row" without it, and it might be a different word from "row." with a period. That's an issue we should address. There's also some sort of special Unicode thing going on here that isn't being dealt with properly either. And the other thing is that we have different capitalization: for some reason I have the word IDEAS in all caps, probably from some section title, but that's going to be counted as a different word than lowercase "ideas". So there's a lot of room for improvement in this script, and that's what we're going to do over the next couple of lectures: explore ways to take this basic implementation of counting up the words in a book and make it something more useful over time by making it a little bit more complex. Let's do that in our next lecture. 18. [Activity] Improving the Word Count Script with Regular Expressions: So let's make that word count script even better. If you remember from the last lecture, one of the big problems we had is that the words we were extracting out of the book weren't always words; sometimes they had punctuation attached to them, and there was uppercase versus lowercase stuff going on. So we ended up with situations where a word followed by a comma would be treated as a different word than a word followed by a period or a word followed by a space, because all we were doing before was breaking up strings of paragraphs by space characters and taking anything in between the spaces as a word. We can do better. There are natural language processing toolkits you can get, one's called NLTK for example, that will automatically apply very complex algorithms for normalizing words, figuring out synonyms of words in different cases, and things like that. But let's not go there just yet; let's just do something a little bit better that's probably good enough for our purposes, and that's going to be using something called regular expressions. You've probably come across regular expressions in some other programming you've done in other languages; it's basically a compact language for defining ways of splitting up strings into substrings using a set of compact rules. So this gives us a somewhat more robust means of identifying words within a string, taking that punctuation into account. Let's take a look at what that code looks like using Scala and Spark. Back in the Scala IDE, we still have our previous script, where we just split all of our paragraphs on space characters and treated everything within the spaces as individual words, and that produces okay results, but it's not great. Just looking at this right here, we have "entrepreneurs," followed by a comma being counted as its own word; not okay. "Parking?" with a question mark is going to be a different word from "parking" without one. Let's at least fix that, shall we? So go up to the com.sundogsoftware.spark package, right-click on it, choose the Import from the general file system option, and again browse to where we downloaded the course materials; for me that is C:\Downloads\SparkScala. You should see a WordCountBetter.scala file; select that, import it in, and take a look at it. It's about the same amount of code, but this time we're using something called a regular expression. How does that work?
It's actually quite simple in Scala. We're doing the same sort of basic work here: setting up our log level, setting up our SparkContext, and reading in the book, one paragraph per line. Then we're calling flatMap again to split up our paragraphs into individual words, but instead of just splitting on a space character, we're splitting on this regular expression. If you call split with two backslashes and an uppercase W in the string, you have what's called a regular expression. There are entire books on regular expressions; if you're not familiar with them, I encourage you to go take a course on them or pick up an O'Reilly book on the subject. But the short story is that they give you a syntax for more complicated ways of splitting up strings, and in this case we're using the syntax \W+. The uppercase W matches anything that isn't a word character, and the plus means one or more of them, so splitting on that breaks the string up wherever there's punctuation or whitespace and gives us back just the words. This returns a list of words that takes punctuation into account, and gives us back a new RDD of words that is a little bit more robust than what we started with before. Now, the other thing we're going to do is normalize everything to lowercase. If you see my original results, "IDEAS" and "Stick" in other cases would have been counted as different words than lowercase "ideas" or "stick", or even cases where just the first letter was capitalized at the beginning of a sentence. So let's fix that too: we're going to convert everything to lowercase so that these get counted as the same word and case doesn't matter. That's all this map function is doing, and in this case we do have a one-to-one relationship between the words we extracted from each paragraph and the lowercase versions of each word, so we use map instead of flatMap for this one. From that point, we do the same thing we did before: call countByValue to get the number of distinct occurrences of each word that we ended up with after it's been split out into words and converted to lowercase, and print out the results. So let's go ahead and run that and see what happens. Run, Run Configurations, create a new Scala application; we're going to call this one WordCountBetter, and the class is going to be com.sundogsoftware.spark.WordCountBetter. Let's run that and see if we do indeed get better results. Off it goes, and we're done. Sure enough, everything's been converted to lowercase, which is a good thing, and I'm not seeing any weird punctuation either; I'm not even seeing any weird Unicode characters or anything. So this is looking a lot cleaner, just by using this little regular expression instead of a space character and normalizing everything to lowercase. In addition to working on an example and making it better and better by iterating on it, this is also an example of where you really need to look at your output data and ask yourself: is this really the result that I expect or want? A lot of times, your initial attempt at writing an algorithm for solving a problem in Spark doesn't quite yield the right results in the end, and you really need to look at your results and ask yourself: does this make sense? Are there edge cases I'm not dealing with? Is my data somehow unclean?
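As a minimal sketch of the improved splitting and normalization we just walked through (the file path and variable names are just illustrative):

    // Read the book, one paragraph per line.
    val input = sc.textFile("../book.txt")

    // \\W+ matches one or more non-word characters, so splitting on it
    // strips out punctuation and whitespace and leaves just the words.
    val words = input.flatMap(x => x.split("\\W+"))

    // Normalize case so "Ideas", "IDEAS", and "ideas" count as the same word.
    val lowercaseWords = words.map(x => x.toLowerCase())

    val wordCounts = lowercaseWords.countByValue()
    wordCounts.foreach(println)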
I mean, data cleansing is often just as much work, if not more, of a data scientist's job as actually processing the data. What we're really doing here is cleaning up our input data before we process it with countByValue, and the whole point of this lecture really is to stress how important that can be. You need to make sure you have good, clean data without a lot of outliers in it that might mess you up, and we're doing the right thing here. Anyway, you can see things are looking pretty clean here. The word "technical" appeared 11 times, the word "addresses" appeared seven times, "product" 182 times. Wow, I talk about products a lot; in fact, the book is mostly about building products, so that's cool stuff. We're getting closer here for sure, but we're still not quite where we want to be. This would be a whole lot more useful if the results were sorted, so I could actually see what the most popular words and the least popular words were more easily. And if I were to just sort this as it is, it's going to sort alphabetically by word. So how do we sort by the word count instead? Well, there are a few tricks we can do for that; let's explore them in our next lecture. 19. [Activity] Sorting the Word Count Results: So let's solve that last problem with our word count script and sort it in a useful way. We want to be able to see very easily what the most popular words in my book are, and what the least popular words are. Now granted, we could just do this all within Scala as part of the final map we get back from countByValue. If we assume there just aren't enough words in the English language for that to be a problem in terms of fitting in the memory of one machine, that might be an okay thing to do. But let's pretend, just for the sake of education, that there are so many different words in this book that I can't be sure they will fit in memory, and I actually want to do this on a cluster in a distributed manner. How do we do that? Well, the problem with countByValue is that it returns a Scala map back to the driver, mapping words to their number of occurrences. If we want to instead return an RDD that we can keep on the cluster, we basically have to reinvent how countByValue works, returning an RDD instead of a map. That's what this little snippet of code does. It says we're going to take our RDD where every row contains a lowercase, normalized word and do two things to it. First, we map each word to a key/value pair: take each individual word we run across and convert it to a tuple where the key is the word itself and the value is the number one. So, for example, if we come across the word "product", we convert that to a key/value pair of "product" and the value one. Now we can use reduceByKey to count up all the occurrences of each unique word. reduceByKey looks at all the values associated with a given key, in this case a given word, and combines them however we say; in this case, we say add them all up together. So we add up all the ones associated with each word, and by doing that we end up with the final count of how many times each word occurs, and that ends up not in a Scala map but in a new RDD called wordCounts.
Okay, so that's how step one works: we're basically going to count up the occurrences of each word, but return it in an RDD so we can do further distributed processing on it. So now what? Well, if we were just to sort things at this point, it would be sorted based on the words, and I don't really care about looking up alphabetically how many times each word occurs; I want to see the most popular ones at the top and the least popular ones at the bottom, or vice versa. To do that, let's flip these pairs around. Instead of having a key of the word and a value of the count, let's make it the other way around: make the first value of the tuple the count and the second value the word. That way, when I sort these things, they end up sorted by the count instead of by the word alphanumerically. Then I can just use sortByKey on the key/value RDD that I have, and that will sort based on the key, which in this case is the numeric count of occurrences of each word. That's all this does. So what we're doing here is taking that wordCounts key/value RDD, which starts off as word and number of occurrences; this mapping function flips it around so that the second field becomes the first and the first becomes the second, which makes it count and word instead of word and count. Then we can call sortByKey to sort things the way we want and display the results. Let's go take a look at the code and actually run it. Back in the Scala IDE, you can see we're continuing to iterate on this problem. We started off with a very naive solution that just split up words based on the space character; we moved on to a better solution where we normalized that data a little better, were more intelligent about how we split things up into words, and normalized them all to lowercase. The last step in our journey here is to actually sort this in a useful manner. So let's go make that happen. Click on the com.sundogsoftware.spark package, right-click, and import from the file system wherever we saved the course materials. Oh hey, it actually remembered it; that's cool. You should see a WordCountBetterSorted.scala; select that, import it, and let's open it up and take a look. You can see we've implemented the code that we had in the slides, and things are the same up until this point. We have our RDD of lowercase words, where every row of the RDD is a normalized word from our book, and here again we transform each word into a key/value pair where the key is the word and the value is the number one. Then we use reduceByKey on that result to add up all the occurrences of each word, adding up all the one values associated with each unique word, and that final result ends up in the wordCounts RDD, which we can continue to process on the cluster. To sort it the way we want, we first flip the pairs around so that the key is the count and the value is the word itself, and then we use sortByKey to sort the resulting RDD by those key values, which are the counts. At that point we collect the results back into the driver script; collect is an action, and then we just iterate through each result and print it out.
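A minimal sketch of that distributed count-and-sort, assuming the lowercaseWords RDD from the previous step:

    // Reinvent countByValue as RDD operations so the data stays on the cluster.
    val wordCounts = lowercaseWords.map(x => (x, 1)).reduceByKey((a, b) => a + b)

    // Flip to (count, word) so sortByKey orders by frequency, then collect the results.
    val wordCountsSorted = wordCounts.map(x => (x._2, x._1)).sortByKey().collect()

    // Print as "word: count", flipping back for readability.
    for (result <- wordCountsSorted) {
      val count = result._1
      val word = result._2
      println(s"$word: $count")
    }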
Now, to make it a little more readable, we actually want to print out word and then count at the end, not count and word, sort of flipping it back to its original order as we iterate through the results and print them. So we extract the first field and save that into something called count, save the second field into the word value, and print out the results the way we want, which is just substituting the strings for word and count into our final result on each line. So let's go ahead and run that and see how it looks; now the results will be a little more enlightening, shall we say. Go up to Run, Run Configurations, create a new Scala application, call this one WordCountBetterSorted, and set the main class to com.sundogsoftware.spark.WordCountBetterSorted. Let's run it and see what happens. Woo, all right, that's more like it. Interesting. So the most popular word in my book is the word "you"; I must have you in mind, even more popular than the usual suspects like "to" and "of". All the usual suspects are here for popular words in any given book. Let's scroll up a little bit and see where things get a little more interesting: "is", "and", "that", yeah, whatever, all the most common words in the English language. "Business" is probably the first kind of interesting word in the book, and it is indeed a book about business. Go a little further and we find that "product" shows up again as a popular word, along with "people", "work", "home", "customers". So just by looking at the most popular words in the book, you can get a pretty good feel for what the book is really about, and it's actually kind of enlightening for me to see which concepts and words are most frequent and most important to what I'm trying to convey in this book. You can actually get some insights out of that data, and that was kind of our goal, so that's pretty cool stuff. So there you have it: our journey is complete, for now, for the word count example. This is usually a very simple example that people use as sort of the hello world application of Spark, but we've taken it to the next level here, and I hope you learned a little bit along the way, mostly about the need to normalize your data and to think creatively about getting the final result that you want while keeping it distributed. So there you have it: a pretty good example of doing a word count on my book. If you do want to get your hands on it and start fiddling with it yourself, here's an idea for how to do that: how about introducing a stop list of words? Say you want to filter out all the common words in the English language, like "is", "and", "the", whatever you can think of, because those aren't really going to tell you much about the content of the book itself. See if you can use the filter operation and write a filter function that checks a list of stop words and filters those out early on in the processing. That's a little challenge for you if you're up for it. Either way, I'll see you in the next lecture. 20. [Exercise] Find the Total Amount Spent by Customer: So at this point I think you've learned enough, and you have enough examples, to actually go off on your own and write your own Spark script from scratch. Here's a little challenge for you. What I'm going to do is give you some sample data that represents some fake e-commerce data.
What I'm going to give you is a comma-separated value list, where each row of the input data represents a customer ID, an item ID, and the amount spent by that customer on that item. All I want you to do is create a simple script that adds up the amount spent by each customer; basically, we're going to reduce down, by customer ID, how much each customer spent in total. Pretty simple, and we're starting off easy here, but I want you to get some confidence actually writing Spark scripts from scratch using Scala, because you're probably new to Spark and you're probably new to Scala as well. As a high-level strategy, to point you in the right direction: start off by parsing that input data; split it up based on those comma characters into individual fields in a list. Then you're going to want to map each of those lines to key/value pairs, where the keys are the customer IDs and the values are the dollar amounts, and then you can just use reduceByKey to add up all the amounts spent by each customer ID. Call collect and print out the results when you're done. So at a high, logical level, that's the approach you should go after; it's up to you to turn that into code. Be sure to look at the previous examples we've covered in this course; there are a lot of useful examples you can emulate. A couple of useful snippets you might find handy: remember, to split a line based on commas, you just use line.split with the character you want to split on, and remember that if you need to output a tuple, like a key/value pair, and you want to convert those values to a specific data type, you can use toInt and toFloat on the resulting fields to do so. Keep that in your back pocket, and I'll get you started by setting up the project for you. Let's go take a look at how you might get going, and then I'm going to turn you loose. So, to get you started on your assignment: if you go to the place where you downloaded the course materials, you'll see a customer-orders.csv file, and if you click on that, you can see the format of it; again, it is customer ID, item ID, and amount spent. You only need two of those fields for this assignment. Go ahead and copy that and paste it into where you are working from; for me, that is C:\SparkScala. All right, so your data is in the right place. Remember, it is called customer-orders.csv when you're loading that file up. Go back to the Scala IDE; you should have a nice little library of examples here to look at, so I suggest reviewing those examples, understanding how they work, and thinking about how they might apply to the problem at hand. Remember, what we're trying to do here is add up the amount spent by each customer. When you're ready to get started, just right-click on the package and, instead of Import, select New Scala Object. Give it a name like, I don't know, PurchaseByCustomer, whatever you want to call it, and that will give you a big, scary, empty object to fill in. So again, look at these earlier examples to get started. Obviously you're going to want to import the stuff you need as one of the first things you do, add a main function, and what you do within that main function is what counts. So, good luck, have a go at it, and in the next lecture I'll show you my solution to the problem, and you can compare it to your solution. But do try this on your own.
It's important to get some practice with this stuff, so go do your homework, and I'll see you in the next lecture. 21. [Exercise] Check your Results, and Sort Them by Total Amount Spent: All right, did you do your homework? Hopefully, at this point, you have a working script that you wrote on your very own that actually counts up the amount spent by customer in our customer-orders.csv file. So let's go ahead and take a look at how I did it and compare it to your approach. If you did get stuck, that's okay; take a look at my solution here, and maybe you can go back and give it another try without looking at my solution, just to get you started. Hopefully you saw the similarities between this problem and the ratings counter and word count problems we've already solved; a lot of inspiration could be had from those. To take a look at how I did it, go ahead and right-click on the package and import my solution from the file system; if you go to your downloads for the course materials (come on, work with me here), you should see TotalSpentByCustomer.scala. That was my solution; let's open it up and take a look. You can see that at a high level I wrote a little function here that's used to map my input data into the two fields I care about. Remember, what I'm trying to figure out is the total amount spent by customer, and our input data contains three fields: the customer ID, the item ID, and the purchase amount. I don't care about item IDs at all; they don't factor into my final result whatsoever. So all this mapping function is doing is taking an input line, splitting it up by commas, and then extracting the two fields I do care about into a key/value pair, where the key is the customer ID from field zero and the amount spent is the dollar amount from field two. That gets used down here in our main function. We do the usual thing to get rid of log spam and set up a SparkContext like we always do, we load our input data from the raw CSV file into an input RDD, and we apply the function we just wrote to convert that input data into key/value pairs called mappedInput, where again the key at this point is the customer ID and the value is the amount spent. At this point it's a simple matter of using reduceByKey on the resulting RDD, where we tell it to add up all the individual prices for a given customer. Remember, the key here is the customer and the value is the price; reduceByKey will add up all of the prices for a given customer, given this addition function, and then we have an RDD that contains the results we want, totalByCustomer. All we have to do at that point is collect it and print out the results, one line at a time. So let's go ahead and run that: Run, Run Configurations, Scala application, TotalSpentByCustomer, com.sundogsoftware.spark.TotalSpentByCustomer, and run. And there we have it. This is all randomly generated, fictitious data, so you can't really read much into it, but you can see it worked: we have a list of all the customer IDs and the amount each customer ID spent in total, and given the even distribution of the random data, they're all in roughly the same ballpark, so that looks about right. Very cool. Now, you're not done yet. Maybe that was a little bit too easy for you; if so, I'm going to challenge you a little bit further.
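A minimal sketch of my approach, assuming customer-orders.csv sits in the working folder (the exact path may differ from your setup):

    package com.sundogsoftware.spark

    import org.apache.spark._
    import org.apache.log4j._

    object TotalSpentByCustomer {

      // Split "customerID,itemID,amount" and keep only (customerID, amount).
      def extractCustomerPricePairs(line: String): (Int, Float) = {
        val fields = line.split(",")
        (fields(0).toInt, fields(2).toFloat)
      }

      def main(args: Array[String]): Unit = {
        Logger.getLogger("org").setLevel(Level.ERROR)
        val sc = new SparkContext("local[*]", "TotalSpentByCustomer")

        val input = sc.textFile("../customer-orders.csv")
        val mappedInput = input.map(extractCustomerPricePairs)

        // Add up every purchase amount for each customer ID.
        val totalByCustomer = mappedInput.reduceByKey((x, y) => x + y)

        totalByCustomer.collect().foreach(println)
      }
    }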
So if you did have trouble with this example, go back and try it again after you've seen the actual code you can refer to. I do want you to try to get something running from code you actually typed in yourself; if nothing else, copy this file over, just to make sure you have the capability of writing code from scratch and running it here. Once you have some comfort with what's going on so far, let's take it to the next level. What I want you to do is extend this to sort the results, so we can see who the biggest spenders are and who the lowest spenders are. A little hint: remember that word count example we did, where we ended up sorting the results to show the most frequent words and the least frequent words in my book? The same concept applies here. So go back and study that example and figure out how you can apply that same technique to our purchase data, our e-commerce data. I want to get things sorted so that I see my cheapest customers first and, as you get toward the end of the list, my biggest spenders; so sort the final list by amount spent using those same techniques. When you're ready, go off and do that, then come back to the next lecture and I'll show you my solution, and you can compare my solution to yours. So go do your homework; again, I'll see you in the next lecture. 22. Check Your Results and Implementation Against Mine: Okay, hopefully by this time you have successfully extended our previous little exercise to sort the results by amount spent, so we can take a look at our cheapest customers and our biggest spenders pretty easily. If not, you should go do that before you look at my solution here. Good? Okay, let's go ahead and right-click on the package like we usually do, and we will import my solution from the file system, which is going to be called TotalSpentByCustomerSorted. Hopefully you haven't peeked at it. All right, let's take a look, and you can see it is in fact applying the same general idea that we did in the word count example when we sorted by word frequency. We've taken what we started with in TotalSpentByCustomer.scala and just changed the last few lines of it. So up to this point, things are the same: we've parsed out our input data and split the input into key/value pairs where the keys are customers and the values are amounts spent, and we reduced that by adding up all the amounts spent for each individual customer. At this point we have a totalByCustomer RDD that contains key/value pairs of customer IDs and how much they spent. Now we need to sort that the way we want, and we're going to do the same trick of flipping things around so that the thing we want to sort by is in our key. This map function does just that: it takes our customer ID to amount spent pairs and flips them around to be amount spent and customer ID, so the new key is the amount spent. Then we can call sortByKey on that to sort by amount spent, and it's just a matter of collecting the results and printing them out. So let's give that a try. Run, Run Configurations, yet another Scala application, this one called TotalSpentByCustomerSorted, and the class name is com.sundogsoftware.spark.TotalSpentByCustomerSorted. Let's run it and make sure it works. Well, how about that? It seems to have done so. Now, I didn't actually go to the trouble of flipping the output order back when I printed it.
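For reference, a minimal sketch of that sort step, building on the totalByCustomer RDD from the sketch above (it prints the pairs as amount then customer ID, without flipping them back):

    // Flip to (amountSpent, customerID) so sortByKey orders by the amount.
    val flipped = totalByCustomer.map(x => (x._2, x._1))

    // Ascending sort: cheapest customers first, biggest spenders last.
    val totalByCustomerSorted = flipped.sortByKey()

    totalByCustomerSorted.collect().foreach(println)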
If you wanted to do that and you did, then bonus points for you; congratulations. I just took the easiest approach here, because I didn't want to make it too hard for you, but you can see that it is, in fact, sorted by the amount spent, and our biggest spender in this fictitious data set is customer ID number 68, whoever that is. And I'm not sure if we have enough buffer in here to actually capture the cheapest... yes, we do: whoever customer ID 45 is, they're our cheapest customer, who only spent about $3,300 in total. That's actually a fair amount of money, but anyway, it's an illustrative example. So if you got similar results, and those are your cheapest and biggest spending customers, congratulations, you did it right. And if not, go back, compare your code to how I did it, and try to see where you went wrong; hopefully you can take a look at my implementation here, compare it to how you did it yourself, and find any issues, or use mine as a reference. If you have gotten this far, congratulations: you have now written Spark Scala code on your own successfully, and that's a pretty big milestone. Go ahead and pat yourself on the back. With that under your belt, you should be feeling pretty good, so let's move on to some more complicated examples and see what Spark is really capable of. We'll do that in the next section. 23. [Activity] Find the Most Popular Movie: So let's start doing some more interesting stuff with Spark and Scala. We're going to revisit the MovieLens data set that we started the course with, and we're going to start with a simple example where we just try to find the most popular movie, the most rated movie in that data set. But we're going to build on that to do more complex things and illustrate more and more features of Spark. So let's start our process now. Just to review the data format of the MovieLens ratings data: the data file has a format where every line consists of a user ID, a movie ID, a rating, and a timestamp. The way to interpret each row is: this user ID rated this movie with this rating at this time. So, for example, user ID 196 watched movie 242, gave it three stars, and did it at whatever that timestamp represents; that's actually an epoch-seconds format, and there's a way to convert it to a real date, but it's not important because we're not using that data. So let's step back and think a bit. If I want to figure out the most popular movie, what I need to figure out is how many times each individual movie was rated overall. I don't really care about the users, so I can throw that away; I don't care about the times, so I can throw that away; and I don't even care about the ratings themselves. All I care about is the fact that a movie was actually watched, that it actually appears on a given line, so that makes things a little simpler. So let's go look at the code and talk through it. Let's hop back into the Scala IDE and take a look at some code to find the most popular movie. Right-click on our package here and import, from the class materials that you downloaded earlier, PopularMovies.scala; open that up, double-click it, and take a look. As I said, this is a very simple example, reminiscent of some of the stuff we've done earlier in the course, but we're going to keep building on it to do more interesting things; ultimately we're going to build movie recommendations using this data, so we've got to start somewhere.
So we have all of the usual boilerplate stuff: import the packages we need, define our main function, set the logging level, build a SparkContext. Nothing too interesting so far, but here we go with the actual meat of the problem. The first thing we do is load up the u.data file, which contains the movie rating data itself, and remember that contains user ID, movie ID, rating, and timestamp. But all I care about is that a movie ID actually appears on that line; the only information that's actually relevant to me is the movie ID. So let's extract that with this map call, with a little inline function where we split up the line based on tab characters, because I know it's tab-delimited, and I take field number one, which is the movie ID. Remember, we start counting from zero here, so it goes user ID, movie ID, rating, timestamp; movie ID is field one. And we're actually making this into a key/value pair where the key is the movie ID and the value is the number one. You can probably see where this is going, because this is similar to the other examples we've done where we do counts. So what we have at this point is a movies RDD that contains key/value pairs of movie IDs and the value one, and then we just call reduceByKey to add up all the ones for each individual unique movie ID, which gives us a count of how many times each movie appeared in the data set. Finally, we just call sortByKey to sort the results, collect them, and print them out. Not too complicated; let's go ahead and do it. Run, Run Configurations, Scala application, PopularMovies, com.sundogsoftware.spark.PopularMovies, and run. Bang, it's amazing how fast this runs once it actually gets going. So there you have it: we have a list of the most popular, most viewed, most rated movies in our data set, and it turns out the winner is movie ID 50. But that's not very helpful now, is it? What does movie ID 50 mean? Well, it turns out there's a u.item file in the MovieLens data set that maps movie IDs to movie names and all sorts of other information about the movies. So to make this really useful and get some interesting data out of it, we're going to have to somehow merge that data in. Let's talk about how that works in the next lecture, and we'll learn what the most popular movie actually is. 24. [Activity] Use Broadcast Variables to Display Movie Names: So let's talk about making the results of that most popular movie script more human readable. We learned that the most popular, or more specifically the most rated, movie in our data set was movie ID 50, but what the heck is movie ID 50? Well, it turns out there's a u.item file in the MovieLens data set that does map IDs to names, and much more information as well. And we could just load up that table in memory, keep it in a big Scala map object or something, and use it to print out the results in a prettier format at the end, within the driver program itself. Or, if you wanted to pass that around to different mapping functions or whatnot, you could pass it in and have Spark automatically forward that map to all the executor nodes. But let's imagine that this table was truly massive, and it's a little bit sketchy whether it fits into memory on just one machine. We also want to make sure it's not transferred across the network more than once, and this is where broadcast variables come in.
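Before we get into broadcast variables, here is the counting logic we just walked through, as a minimal sketch; I flip the pair before sorting so the output is ordered by count, and the file path is just an assumption about where you placed the ml-100k data:

    // Each line of u.data is: userID \t movieID \t rating \t timestamp
    val lines = sc.textFile("../ml-100k/u.data")

    // Keep only the movie ID (field 1) and pair it with a 1 for counting.
    val movies = lines.map(x => (x.split("\t")(1).toInt, 1))

    // Count how many times each movie ID was rated, then sort by that count.
    val movieCounts = movies.reduceByKey((x, y) => x + y)
    val sortedMovies = movieCounts.map(x => (x._2, x._1)).sortByKey()

    sortedMovies.collect().foreach(println)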
This gives us a means of taking a chunk of data and explicitly sending it to all of the nodes in our cluster, so that it's there and ready whenever any node needs it. The syntax for doing this is pretty straightforward: you just call broadcast on the SparkContext object, and you can use that to ship off whatever data you want to all the different nodes in your cluster. Then, within your driver script code, you just use .value on the object returned by the broadcast function to retrieve that information back and refer to it within your script. By using this pattern, you can send arbitrary data across your entire cluster and make sure it's available whenever it's needed, without having to retransmit it all the time. Okay, so let's go take a look at the code for a better version of our most popular movie script and see how that works. Hop back to the Scala IDE. To recap, in our previous lecture we created this script to find the most rated movie in the MovieLens data set that we downloaded earlier in the course, and we learned that the most popular movie was movie ID 50, with 583 distinct ratings in the data set. And of course there are lots of movies that only got one rating, like 1494, which is apparently a very obscure movie ID. But what does it all mean? How do we tie in that movie data, and how do we use broadcast variables to do it? Let's go ahead and import the more sophisticated version of the script: right-click on the package, import from the file system, and look for PopularMoviesNicer in the course materials; import that into our package, double-click it, and let's take a look at how it works. All right, so the first thing we need is a function that actually reads the u.item file, which maps movie IDs to all the metadata about those movies, and loads it into a structure we can use. Specifically, we're going to build up a Map object in Scala that maps movie IDs to movie names. To do that, we've written this little loadMovieNames function. We know that there's some funky character encoding in this file, because there are a lot of foreign movies with weird accents and things like that, so we're going to explicitly say that we're using UTF-8 encoding and deal with it in a certain way; if you ever have UTF encoding issues, this might be a nice little snippet of code to have handy for yourself. With that out of the way, we create a Map object that maps Ints to Strings and initialize it as an empty map; that's what this line is doing here. We start off by opening up the u.item file; this is all just using standard Scala library stuff, we're not really doing Spark stuff at this point, and we get every individual line of the u.item file and iterate through it one line at a time. Now, we know this particular data is pipe-delimited, so we split each line by a pipe character, confirm that there is in fact data in that row that meets our criteria (basically that it's not a blank line), and we add to the movieNames map this information here: we take the first field, which represents the movie ID, and map it to the second field, which is the movie name. Note that we're converting the ID to an integer explicitly to save a little bit of space, because we know that's an integer field, whereas field one is left in its string form.
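A minimal sketch of a loadMovieNames helper along those lines (the file path is an assumption about where the ml-100k data lives on your machine):

    import scala.io.{Codec, Source}
    import java.nio.charset.CodingErrorAction

    def loadMovieNames(): Map[Int, String] = {
      // Handle the odd characters in u.item by forcing UTF-8 and replacing bad bytes.
      implicit val codec: Codec = Codec("UTF-8")
      codec.onMalformedInput(CodingErrorAction.REPLACE)
      codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

      var movieNames: Map[Int, String] = Map()
      val lines = Source.fromFile("../ml-100k/u.item").getLines()
      for (line <- lines) {
        val fields = line.split('|')   // u.item is pipe-delimited
        if (fields.length > 1) {
          movieNames += (fields(0).toInt -> fields(1))
        }
      }
      movieNames
    }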
And what we return is just a plain old Scala Map object that maps movie IDs to movie names. Okay, with that out of the way, let's see what we're doing in main. The usual setup stuff is here, but once we kick things off, we're going to create a broadcast variable so that that map of IDs to movie names can be available to our entire cluster all at once. We call sc.broadcast with whatever the loadMovieNames function returns, which is that map of integer movie IDs to string names, and that gives us back a nameDict object that has been broadcast to every node on the cluster, or will be when the time comes. Keep in mind, again, this data isn't so massive that this is strictly required for this particular use case, but what you're trying to do is avoid retransmitting that data more than once; this is an illustration of the general technique, and there's more than one way to do it. All right, next we load up our movie rating data itself. Everything here is the same as it was before: we split out the data so that we extract the movie IDs paired with the number one, and add up all the ones for each movie ID. At this point, we have a movieCounts RDD that maps movie IDs to how many times each movie was rated. Now things get a little bit different. While we're at it, let's sort things, and again we're flipping things around so that we have count and movie ID instead of movie ID and count, and we sort by key. At this point we have our original results of the counts and the movie IDs, sorted by count, so the last item on our list will be the most popular movie, movie ID 50. Now what we're going to do is use a mapping function to convert all of those movie IDs to their corresponding names. That's what this little bit of code is doing: we're calling map on the sortedMovies RDD, and we map x to nameDict.value. Remember, nameDict was our broadcast variable object, and we call .value to retrieve what's inside that broadcast variable, which is the map we stuffed into it originally. So we call nameDict.value to retrieve that dictionary, look up the movie ID, which in this case is the second field of the sortedMovies RDD, and alongside that we keep the count. This gives us back a new set of results that consists of the string-based movie name followed by the movie count, which should be a little more readable. Let's go ahead and run that and see how it works: Run Configurations, Scala application, PopularMoviesNicer, com.sundogsoftware.spark.PopularMoviesNicer (my ability to type decreases as the day goes on). Let's run that, drum roll please, and it turns out the most popular movie was, not too surprisingly, Star Wars from 1977. Now, remember this data set goes back to 1998, so all the big blockbusters after that, like Titanic and whatnot, didn't exist yet. As of 1998, Star Wars was the most rated movie of all time on the MovieLens website, with Contact, Fargo, and Return of the Jedi not far behind. And if you go back and look at the more obscure titles, well, they are in fact obscure: Il Mostro, Coldblooded, Nemesis 2... I've never heard of any of these; probably only one person rated them. So there you have it, an example of using broadcast variables to make sure that information is available to an entire cluster when needed.
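Putting the broadcast pieces together, as a sketch; sortedMovies here is the (count, movieID) RDD described above, and loadMovieNames is the helper sketched earlier:

    // Ship the ID -> name map to every executor once, up front.
    val nameDict = sc.broadcast(loadMovieNames())

    // Inside the distributed map, read the broadcast value back with .value
    // and swap each movie ID for its human-readable name.
    val sortedMoviesWithNames = sortedMovies.map(x => (nameDict.value(x._2), x._1))

    sortedMoviesWithNames.collect().foreach(println)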
Now again, I want to stress that there is more than one way to do this. I could have just loaded up that map and applied it as I was printing out the results within the driver script itself. This is really only necessary when you need to have that data broadcast to every node on your cluster; so if you're performing some sort of distributed map or flatMap operation that requires that data, you want to think about using a broadcast variable. And that's what we did here with this map function: we actually used map to do the conversion of movie IDs to movie names in a distributed manner, instead of converting them as we were printing out the results. Either way would have worked in this example; I'm just trying to illustrate the concept of broadcast variables here, so I hope that makes sense. Go back and stare at the code a little bit more if not; broadcast variables are pretty easy to use, and they can be pretty powerful. So there you have it. Let's move on to another example. 25. [Activity] Find the Most Popular Superhero in a Social Graph: So things are going to get kind of fun right now. We're going to start dealing with a data set about superheroes and their connections to each other, believe it or not, and as before with the MovieLens data set, we're going to start off with some simple examples and work our way up to more interesting problems later in the course. For now, I just want to introduce the data and how to play around with it. Believe it or not, someone actually went out and looked at every single Marvel comic book and kept track of all the superheroes that appear together within the same comic book, and what we're doing is treating those co-appearances within comic books as social connections. So the fact that, for example, the Hulk and Spider-Man appeared in the same comic might imply that they're friends, if you will, and we can build up these complicated graphs of who appeared with whom in different comic books. In this example, maybe the Hulk appeared with Iron Man and the Hulk appeared with Thor, but Thor only appeared with Iron Man, and those were different comic books; and maybe Spider-Man and the Hulk are connected by appearing in the same comic book, but let's pretend that Spider-Man and Thor never actually appeared in the same comic book. It's a totally made-up social graph, but it gives you a very simple example of what a social graph looks like. And just like we can construct something like this in the context of superheroes who appeared with each other in the same comic books, you can also apply this to big social networks like Facebook or what have you; same idea, networks of friends look like this, and they can have complicated shapes and structures. So playing with this data can be pretty interesting, but for now let's do something simple and try to figure out who the most popular superhero is based on the data we have. Let's take a look at what the data looks like. There are two files that come with this data set that we're going to work with: one is called Marvel-graph.txt, which contains the actual social graph itself in kind of a strange format, and then there's Marvel-names.txt, which just maps superhero IDs to their human-readable names.
Okay, so looking at Marvel-graph.txt, every line is just a big stream of numbers, and the way to interpret it is that the first number represents a given superhero, and all the subsequent numbers represent the superheroes that appeared with that superhero in other comic books. So again, the first number is special; that's the hero we're talking about, and it's followed by a list of all the heroes that appeared with that hero. To map those hero IDs to names, we can use the Marvel-names.txt file; for example, we can see that Spider-Man is 5306, and he appears with whoever 4395 is in this example. Spider-Man is pretty popular, but let's find out if he is actually the most popular. It's a pretty simple, straightforward problem. Our high-level strategy will just be to parse that input one line at a time, and since we don't really care about the actual individual connections for this problem, we just want to find out who is the most popular, so all we care about is the total number of connections for each superhero. We look at each line of input data, extract the superhero ID, that first number we talked about, and then just store the count of the total number of other superheroes that appear with that superhero. Now remember, these can span multiple lines, so the same superhero ID might be broken up onto two or more different lines, and we need to combine them together somehow; reduceByKey will allow us to add up all the individual lines for a given superhero into one final result. From there, we use the same sorting trick we've used before: we flip things around so that the key is the count of how many friends you have and the value becomes the superhero ID, and then we can just call max on the resulting RDD to find who has the most friends. And friends, again, is a proxy for co-appearances in comic books in this example. Then we look up the name of the winner from Marvel-names.txt and display the result. So off to the code; let's see how that works, run it, and find out who is the most popular Marvel superhero. All right, let's get our hands dirty with the Marvel superhero social network data. As before, I have two windows open up here: one is my working folder that I'm using for the course, where all of my projects and data for the course are going, and then I have my Downloads folder, where I downloaded the course materials for the entire course. The two files we care about for data are Marvel-graph.txt and Marvel-names.txt. If you click on them and preview them, you'll see that they are as advertised: Marvel-graph is just a list of lines that have a bunch of numbers on them, where each number represents a superhero ID, the first number being the hero we're talking about, followed by a list of that hero's connections; and then we have Marvel-names.txt, which maps hero IDs to their human-readable names, where the names are enclosed in quotation marks. Okay, let's go ahead and copy those into our working folder here, under C:\SparkScala, fire up the Scala IDE, and we'll import the necessary code to figure out who is the most popular superhero. We'll do our usual dance: open up the project, open up Source, open up the package, right-click on the package, and import from the file system, from where we downloaded the course materials.
Select MostPopularSuperhero.scala, open that up, and take a peek. All right, there's a little bit more complicated stuff going on here, surprisingly; a lot of it is because we're dealing with two different data files that have different formats we have to handle. So let's skip down to the main function, where things get interesting, and we'll work back to the other functions from there. The usual setup stuff: we set the log level and create our SparkContext. Now we're going to read in the Marvel-names.txt file, and the first thing we do is build up an RDD that can map superhero IDs to superhero names. This is yet another way of doing this sort of conversion: in another exercise we looked at building up a Scala Map object that we just used as we displayed the end result, and that would work fine here as well, and we've also looked at using broadcast variables to load up that map ahead of time and broadcast it to each individual node in our cluster, and that works too. But you can also build up an RDD, and that will automatically be available to every node in your cluster as well; so this is a third way of dealing with this sort of problem, just for illustration purposes. To do this, we're going to call flatMap on the names RDD with the parseNames function that we defined above. Just to refresh what it looks like: in Marvel-names.txt, every row is a number, a space, and then, within quotation marks, the name of that character. Now, maybe there's some invalid data in here somewhere; maybe there are some empty names, maybe there are some blank lines. In fact, I think there are, so we need to deal with all these edge cases as well, and that's where flatMap comes in. If there's a possibility of a line we cannot successfully parse, I want something like flatMap, so we have the option of returning nothing for a given line, so it doesn't end up in our RDD at all. That's why we're calling flatMap instead of map. Let's take a look at what parseNames does. Remember, this is getting called by flatMap on the raw input data: the input is a string from the Marvel-names.txt file, and the output is going to be an Option, a Scala Option of a tuple of a superhero ID and the superhero name. Now, we haven't talked about Options before. Basically, an Option is a Scala construct for saying you could have data, or you might not have data. Other languages have the concept of a null value or a nil value or something of that nature; Scala doesn't have that. Instead, it has the concept of an Option that wraps a value: you can have an actual value returned as a Some, which is a subclass of Option, or you can return a None value, which is also a subclass of Option. So by returning an Option, you can return a Some, which actually contains data, or a None, which contains no data. When you're using flatMap, what happens is that if you return a Some object, it will construct a row in the new RDD based on what's inside that Some; but if you return None, it will say okay, there's nothing to do here, and it won't create a row at all in the resulting RDD. With me so far? I hope so. Let's look at the code; it makes a little more sense if you just take a close look at what's going on.
Now, it's a little tricky how I'm going to parse it. If you look at the actual input data itself, what I want to do is extract that superhero ID and the name, and one way I can do that, to kind of cheat a little bit, is to treat the quotation mark character as the delimiter. If I split this string up on quotation marks, what we end up with is that the ID followed by a space ends up being one field, and then the name itself, which is between the next two quotation marks, becomes our second field. So even though it doesn't really look like quote-delimited data, by treating the quotation mark as the delimiter I'm getting pretty close to what I want, and that's what I'm going to do here. The first thing I do is split the string on a quotation mark, and that gives me back, hopefully, more than one field. If I actually end up with a successfully parsed line that has an ID and a name, I'm going to have more than one field and I can continue on; I take that first field's value, which is the hero ID followed by a space, call trim to get rid of that space, and then convert it to an integer value. So that's basically plucking out the superhero ID and converting it to an Int. Then I take the second field, which is what appears between the next set of quotation marks, and use that as the name, and that gets wrapped in a Some object, indicating that I did in fact have valid data to return for that line. Now, if I have a blank line, or a line that's missing a name, or maybe the quotes are all messed up, I won't have more than one field, and in that case I just return None, which tells flatMap: ignore this line, don't do anything with it, there's nothing here to see. So after this line, we end up with a new names RDD that is just a key/value RDD mapping superhero IDs to their names; that might come in handy later. All right, let's look at the social graph itself. The next thing we do is load up the Marvel-graph.txt file into a lines RDD, and then we call map on it using this countCoOccurrences function. What we want to get, as we talked about in the slides, is a new RDD that maps hero IDs to how many connections they have. And remember, this can be split up onto multiple lines, so we're going to have to reduce that later on, but for now let's just convert each line of input to a hero ID and how many connections it has. You can see all we're doing here is using another regular expression, kind of like we did back in the word count example, but instead of \W+ we're using \s+. That lowercase s plus is a regular expression that means split on whitespace, wherever whitespace is; it might be multiple spaces, it might be tabs, it might be tabs and spaces, I don't care, just split it up on whitespace, please. What that gives me back is a list of strings that represent the hero ID and all of the superheroes connected to that superhero. But all I care about is that first entry, the hero ID we're talking about, which is element zero converted to an integer, followed by the number of connections, which is going to be the total length of that list minus one, to subtract off that first hero ID. With me so far? I hope so. Don't be afraid to go back and re-watch things if you're losing me here, but it's pretty straightforward.
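As a sketch, the two parsing helpers described here might look like this; field positions follow the file formats described above, and the exact names are illustrative:

    // Marvel-names.txt lines look like: 5306 "SPIDER-MAN/PETER PARKER"
    def parseNames(line: String): Option[(Int, String)] = {
      val fields = line.split('\"')   // treat the quote character as the delimiter
      if (fields.length > 1) {
        Some((fields(0).trim().toInt, fields(1)))
      } else {
        None   // flatMap simply drops lines we can't parse
      }
    }

    // Marvel-graph.txt lines look like: heroID friendID friendID ...
    def countCoOccurrences(line: String): (Int, Int) = {
      val elements = line.split("\\s+")   // split on any run of whitespace
      (elements(0).toInt, elements.length - 1)
    }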
All right, so at this point we have a pairings RDD that is key/value pairs of hero IDs to numbers of connections. Since these can span multiple lines, we do need to do a reduceByKey to add them up. So if we have more than one line for the same superhero ID, this reduceByKey operation will combine them together by adding them, and that gives us our RDD of total friends by character. What we have at this point is key/value pairs of hero ID to the grand total number of connections that hero has. Now, to sort it properly, we're just gonna flip it and make the key the number of connections and the value the hero ID. Now, I could just sort that and print out the results, but all I'm really trying to find out is the person who has the most friends, right? So rather than sort and print everything out, I'm just gonna find the maximum value. And if you call max on a key/value RDD, it will find the maximum based on the key, and that's why we had to flip it there. That will give us back the result of the most popular superhero. All we need to do at that point is call lookup on our names RDD to figure out the name of that person, and we can print it out. And there's where we actually print out the result with string substitution, where we put in the name of the result and how many co-appearances that hero had. (A condensed sketch of these steps appears just below.)

So, wow, that was a lot of code to talk about, but let's just run it and get our results. The suspense is killing me. Run, Run Configurations, Scala Application, MostPopularSuperhero, com.sundogsoftware.spark.MostPopularSuperhero, see what happens. And we get an answer real quickly. And the winner is... Captain America. I would have guessed that I'd get someone like Spider-Man, but it turns out Captain America gets around. He shows up in a lot of comic books. He's a good guy, he deserves it, right? All right, so that's a simple example of using the Marvel data set. We're going to do some more interesting stuff with it next: we're gonna figure out degrees of separation and find out who is the Kevin Bacon of the Marvel superhero universe. We'll do that next. So if you want to get hands-on with this code and fiddle with it, as always, I encourage you to do so. Here's an idea: instead of just printing out the most popular superhero, print out the top 10 most popular, and maybe the top 10 least popular. Give that a try.

26. Superhero Degrees of Separation: Introducing Breadth-First Search: So let's do something a lot more complicated with that superhero social network data. What we're gonna do is find the degrees of separation between any two superheroes, and to do that we're going to introduce a concept called breadth-first search, which is a computer science algorithm, and illustrate how you can use Apache Spark to implement what might not seem, at first, to be something that lends itself to distributed processing. But through some creative thinking, you can take even complex algorithms like this and make a Spark application out of them. So maybe you've heard the story that the actor Kevin Bacon is six degrees away from any other actor in Hollywood. That is, if you look at the people that Kevin Bacon has appeared with in other films, and the people that those people appeared with in other films, and so on and so forth, everyone's within six degrees of Kevin Bacon. And I can tell you, having worked at imdb.com, that's actually not true. The number is much smaller than six.
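Before diving into breadth-first search, here's the condensed sketch of the most-popular-superhero pipeline promised above. It picks up from the parsing sketch earlier, so pairings and namesRdd are assumed variable names rather than necessarily the ones in the course code.

```scala
// pairings: RDD[(Int, Int)] of (heroID, connections on one input line)
// namesRdd: RDD[(Int, String)] of (heroID, heroName)

// A hero's connections can span multiple input lines, so add them all up.
val totalFriendsByCharacter = pairings.reduceByKey((x, y) => x + y)

// Flip to (numConnections, heroID) so that max() compares by connection count.
val flipped = totalFriendsByCharacter.map(x => (x._2, x._1))

// max() on an RDD of pairs uses the tuple ordering, which compares the first
// element (here, the connection count) first.
val mostPopular = flipped.max()

// lookup() returns all values stored under that key; take the first as the name.
val mostPopularName = namesRdd.lookup(mostPopular._2).head

println(s"$mostPopularName is the most popular superhero with ${mostPopular._1} co-appearances.")
```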
People are really, really well connected, more so than you might think. The same is true of the superheroes in our superhero social network; you might be surprised at just how closely everyone's connected to, say, Superman. To get a feeling for what I mean by degrees of separation, let's take a look at this sample social network again. So in this example, the Hulk and Spider-Man are one degree of separation apart from each other, because they have a direct connection. But Iron Man is two degrees of separation from Spider-Man, because we have to go through the Hulk to find Iron Man. To make it more concrete: the Hulk and Spider-Man may have appeared in the same comic book, but let's say that Spider-Man and Iron Man never did, and Iron Man and the Hulk did. Therefore they're two degrees of separation apart. Take Thor, for example. If we take this longer path, we might say that he's three degrees of separation away, but we always talk about the shortest path. So in the case of Thor, he would be two degrees of separation, because they're connected by one person, the Hulk. Okay, two steps to get to Thor.

So how do we do that? Well, we need to use a search algorithm called breadth-first search. What we have here is basically a network graph, in computer science terms. So imagine every one of these circles represents a superhero in our social graph, and these lines represent the connections between them, you know, the people that appear together in the same comic books. So let's pretend this is what a piece of the social graph looks like in our superhero social network. It illustrates different superheroes as circles, the lines represent connections between the superheroes, and the number in the middle, in this case infinity, represents the distance from some given superhero that we start with. So let's say, for example, we're gonna start with superhero S; maybe that represents Spider-Man. What we want to end up with is a graph that indicates how many degrees separated from Spider-Man every other node in this graph is.

To do that, we need some sort of approach, some strategy. Let's start off with a couple of basic ideas. First of all, we're gonna come up with the idea of maintaining the state of a given node, and it can be one of three colors. Now, white means that a node is completely unexplored by our algorithm. So we're starting off with everything being white as our initial state, because nothing has been explored yet. Then we have the other colors: gray, meaning that it needs to be explored, and black, meaning that it has been fully explored. Okay, so the color represents the state of each superhero as we search through this graph. White means it hasn't been touched at all, gray means it needs to be touched, and black means we're done with it. And inside each one, we're going to keep track of the degrees of separation from some given character that we started with. Okay, that's pretty much all there is to it.

So let's see how that works. Again, in our initial state, we're gonna start off saying we want to measure everyone's degrees of separation to this node, node S; maybe it's Spider-Man, who knows. So to do that, we're going to color this first node gray, meaning that it needs to be explored, and the initial degrees of separation will be zero, because Spider-Man is zero degrees away from Spider-Man, because Spider-Man is Spider-Man. Makes sense so far?
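To make the color-and-distance bookkeeping concrete, here's a tiny plain-Scala sketch of the idea, including one expansion pass of the kind walked through next. This is just an illustration of the concept on an in-memory map; it is not the Spark implementation the upcoming lessons build, and all the names here are made up.

```scala
// Plain-Scala illustration of the BFS bookkeeping: each node carries a color
// (its exploration state) and a tentative distance from the starting hero.
object Color extends Enumeration {
  val White, Gray, Black = Value   // unexplored, needs exploring, fully explored
}
case class NodeState(color: Color.Value, distance: Int)

// Initial state: everything is White with an "infinite" distance, except the
// starting hero, which is Gray with distance 0.
def initialStates(allHeroIds: Set[Int], startId: Int): Map[Int, NodeState] =
  allHeroIds.map { id =>
    if (id == startId) id -> NodeState(Color.Gray, 0)
    else id -> NodeState(Color.White, Int.MaxValue)
  }.toMap

// One expansion pass: every Gray node colors its non-Black neighbors Gray with
// distance + 1 (keeping the smaller distance if one is already there), then
// turns Black itself. Repeating this until no Gray nodes remain gives the
// final degrees of separation.
def expand(states: Map[Int, NodeState], edges: Map[Int, Set[Int]]): Map[Int, NodeState] = {
  val grayNodes = states.filter { case (_, s) => s.color == Color.Gray }
  grayNodes.foldLeft(states) { case (acc, (id, state)) =>
    val withNeighbors = edges.getOrElse(id, Set.empty[Int]).foldLeft(acc) { (m, neighbor) =>
      val old = m(neighbor)
      if (old.color == Color.Black) m
      else m.updated(neighbor, NodeState(Color.Gray, math.min(old.distance, state.distance + 1)))
    }
    withNeighbors.updated(id, NodeState(Color.Black, state.distance))
  }
}
```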
So what that gray indicates is that we need to explore all the connections of this node, and we'll do that next. By reaching out on the tendrils of that gray node, we're going to color these new connections gray, meaning that we need to explore them as well, and in the process I'm going to increment the degree count from 0 to 1 and store that in these newly explored nodes. So we've gone from 0 to 1, and we've propagated that out to all the connections of the original node. Now we've colored that original node black, meaning that we're done with it, since we've already explored its connections, and these two nodes are gray, meaning that they need to be explored further. Now, if we had a situation where there was already a number in one of these connections, we would keep the lowest connection count, kind of like we showed before with Spider-Man and Thor being connected through the Hulk. So we always maintain the shortest distance that we find, and the darkest color.

So we then go off to explore this node that was gray. We are now going from 1 to 2 in our depth, and we will go out to its connections, make them a two, because we're incrementing that, and color them gray. We do the same for this connection over here, and that gets explored out and a two goes there. And now that we're done with these two nodes, they get colored black, so we know we're done with them. So that's another iteration of the BFS algorithm. Basically we keep iterating through this process of exploring connections, coloring them gray, incrementing the degree-of-separation count, and then marking the node that we just processed black. We just do this over and over and over again until we're done.

So this is another iteration done. Now these three nodes all need to be explored, so we'll see how that works here. These get branched out and we propagate the next level, three, out to there. This guy doesn't have any connections, so he's just gonna turn black; we're done with that branch. And finally we'll explore any unexplored connections of these guys. There aren't any, so they end up going black as well, and that is our final graph of degrees of separation from node S. So the way to interpret this: if you look at any given node, it already has the answer in there of how many degrees of separation it is from S. And you can see we are one degree between S and W, but from T to W, for example, it is two. And again, we preserved the shortest value there; you know, you could actually get there through three hops if you really wanted to, but two is the shortest path. And that's what breadth-first search does; that's how it works. You know, that's actually a pretty fancy-pants computer science algorithm that they ask you about on job interviews at big, fancy computer-science-y companies, right? So you now understand breadth-first search. Congratulations. Not that complicated.

So how do we translate that into a Spark problem? Well, that's the challenge, right? B