Apache Spark 3 with Scala: Hands On with Big Data! | Frank Kane | Skillshare


Apache Spark 3 with Scala: Hands On with Big Data!

Frank Kane, Machine Learning & Big Data, ex-Amazon


Lessons in This Class

64 Lessons (8h 50m)
    • 1. Introduction

    • 2. Installing the Course Materials

    • 3. Introduction to Apache Spark

    • 4. [Activity] Scala Basics

    • 5. [Exercise] Flow Control in Scala

    • 6. [Exercise] Functions in Scala

    • 7. [Exercise] Data Structures in Scala

    • 8. The Resilient Distributed Dataset

    • 9. Ratings Histogram Example

    • 10. Spark Internals

    • 11. Key / Value RDD's, and the Average Friends by Age Example

    • 12. [Activity] Running the Average Friends by Age Example

    • 13. Filtering RDD's, and the Minimum Temperature Example

    • 14. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum

    • 15. [Activity] Counting Word Occurrences using Flatmap()

    • 16. [Activity] Improving the Word Count Script with Regular Expressions

    • 17. [Activity] Sorting the Word Count Results

    • 18. [Exercise] Find the Total Amount Spent by Customer

    • 19. [Exercise] Check your Results, and Sort Them by Total Amount Spent

    • 20. Check Your Results and Implementation Against Mine

    • 21. Introduction to SparkSQL

    • 22. [Activity] Using SparkSQL

    • 23. [Activity] Using DataSets

    • 24. [Exercise] Implement the "Friends by Age" example using DataSets

    • 25. Exercise Solution: Friends by Age, with Datasets.

    • 26. [Activity] Word Count example, using Datasets

    • 27. [Activity] Revisiting the Minimum Temperature example, with Datasets

    • 28. [Exercise] Implement the "Total Spent by Customer" problem with Datasets

    • 29. Exercise Solution: Total Spent by Customer with Datasets

    • 30. [Activity] Find the Most Popular Movie

    • 31. [Activity] Use Broadcast Variables to Display Movie Names

    • 32. [Activity] Find the Most Popular Superhero in a Social Graph

    • 33. [Exercise] Find the Most Obscure Superheroes

    • 34. Exercise Solution: Find the Most Obscure Superheroes

    • 35. Superhero Degrees of Separation: Introducing Breadth-First Search

    • 36. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark

    • 37. [Activity] Superhero Degrees of Separation: Review the code, and run it!

    • 38. Item-Based Collaborative Filtering in Spark, cache(), and persist()

    • 39. [Activity] Running the Similar Movies Script using Spark's Cluster Manager

    • 40. [Exercise] Improve the Quality of Similar Movies

    • 41. [Activity] Using spark-submit to run Spark driver scripts

    • 42. [Activity] Packaging driver scripts with SBT

    • 43. [Exercise] Package a Script with SBT and Run it Locally with spark-submit

    • 44. Exercise solution: Using SBT and spark-submit

    • 45. Introducing Amazon Elastic MapReduce

    • 46. Creating Similar Movies from One Million Ratings on EMR

    • 47. Partitioning

    • 48. Best Practices for Running on a Cluster

    • 49. Troubleshooting, and Managing Dependencies

    • 50. Introducing MLLib

    • 51. [Activity] Using MLLib to Produce Movie Recommendations

    • 52. Linear Regression with MLLib

    • 53. [Activity] Running a Linear Regression with Spark

    • 54. [Exercise] Predict Real Estate Values with Decision Trees in Spark

    • 55. Exercise Solution: Predicting Real Estate with Decision Trees in Spark

    • 56. The DStream API for Spark Streaming

    • 57. [Activity] Real-time Monitoring of the Most Popular Hashtags on Twitter

    • 58. Structured Streaming

    • 59. [Activity] Using Structured Streaming for real-time log analysis

    • 60. [Exercise] Windowed Operations with Structured Streaming

    • 61. Exercise Solution: Top URL's in a 30-second Window

    • 62. GraphX, Pregel, and Breadth-First-Search with Pregel.

    • 63. Using the Pregel API with Spark GraphX

    • 64. [Activity] Superhero Degrees of Separation using GraphX






About This Class

New! Updated for Spark 3.0!

“Big data" analysis is a hot and highly valuable skill – and this course will teach you the hottest technology in big data: Apache Spark. Employers including AmazonEBayNASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think, and you'll be learning from an ex-engineer and senior manager from Amazon and IMDb.

Spark works best when using the Scala programming language, and this course includes a crash-course in Scala to get you up to speed quickly. For those more familiar with Python however, a Python version of this class is also available: "Taming Big Data with Apache Spark and Python - Hands On".

Learn and master the art of framing data analysis problems as Spark problems through over 20 hands-on examples, and then scale them up to run on cloud computing services in this course.

  • Learn the concepts of Spark's Resilient Distributed Datastores

  • Get a crash course in the Scala programming language

  • Develop and run Spark jobs quickly using Scala

  • Translate complex analysis problems into iterative or multi-stage Spark scripts

  • Scale up to larger data sets using Amazon's Elastic MapReduce service

  • Understand how Hadoop YARN distributes Spark across computing clusters

  • Practice using other Spark technologies, like Spark SQL, DataFrames, DataSets, Spark Streaming, and GraphX

By the end of this course, you'll be running code that analyzes gigabytes' worth of information – in the cloud – in a matter of minutes.

We'll have some fun along the way. You'll get warmed up with some simple examples of using Spark to analyze movie ratings data and text in a book. Once you've got the basics under your belt, we'll move to some more complex and interesting tasks. We'll use a million movie ratings to find movies that are similar to each other, and you might even discover some new movies you might like in the process! We'll analyze a social graph of superheroes, and learn who the most “popular” superhero is – and develop a system to find “degrees of separation” between superheroes. Are all Marvel superheroes within a few degrees of being connected to Spider-Man? You'll find the answer.

This course is very hands-on; you'll spend most of your time following along with the instructor as we write, analyze, and run real code together – both on your own system, and in the cloud using Amazon's Elastic MapReduce service. 7.5 hours of video content is included, with over 20 real examples of increasing complexity you can build, run and study yourself. Move through them at your own pace, on your own schedule. The course wraps up with an overview of other Spark-based technologies, including Spark SQL, Spark Streaming, and GraphX.

Enroll now, and enjoy the course!

"I studied Spark for the first time using Frank's course "Apache Spark 2 with Scala - Hands On with Big Data!". It was a great starting point for me,  gaining knowledge in Scala and most importantly practical examples of Spark applications. It gave me an understanding of all the relevant Spark core concepts,  RDDs, Dataframes & Datasets, Spark Streaming, AWS EMR. Within a few months of completion, I used the knowledge gained from the course to propose in my current company to  work primarily on Spark applications. Since then I have continued to work with Spark. I would highly recommend any of Franks courses as he simplifies concepts well and his teaching manner is easy to follow and continue with!  " - Joey Faherty

Meet Your Teacher


Frank Kane

Machine Learning & Big Data, ex-Amazon


Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.





1. Introduction: And I spent over nine years at Amazon.com and IMDb.com making sense of their massive datasets, and I want to teach you about the most powerful technology I know for wrangling big data in the cloud today. That's Apache Spark. Using the Scala programming language, Spark can run on a Hadoop cluster to spread out massive data analysis and machine learning tasks in the cloud, and knowing how to do that is a very hot skill to have right now.

We'll start off with a crash course in the Scala programming language. Don't worry, it's pretty easy to pick up as long as you've done some programming or scripting before. We'll start with some simple examples, but work our way up to more complicated and interesting examples using real, massive datasets. By the end of this course, you will have gone hands-on with over 15 real examples, and you'll be comfortable with writing, debugging, and running your own Spark applications using Scala.

And some of them are pretty fun. We'll look at a social network of superheroes and use that data to figure out who is the Kevin Bacon of the superhero universe. We'll also look at a million movie ratings from real people and actually construct a real movie recommendation engine that runs on a Spark cluster in the cloud, using Amazon's Elastic MapReduce service. We'll also do some big machine learning tasks using Spark's MLLib library, and we'll do some graph analysis using Spark's GraphX library. So give it a try with me. I think you'll be surprised at how just a few lines of code can kick off a massive, complex data analysis job on a cluster using Spark. So let's get started. The first thing we need to do is install the software we need, so let's get that out of the way right now.

2. Installing the Course Materials: So let's get everything set up, including Java and IntelliJ and all the course materials that we need for the entire course.
What we're gonna do is start by going to our own website here, which will direct you to the course materials, where you can download all the project files and the data that you need for this course. We'll go ahead and get that installed on your system. Then we'll install a Java Development Kit if you don't already have one; we just need to make sure that we have a JDK between versions 8 and 14 installed on your system. Odds are, if you're a developer, you already do. After that, we'll install the IntelliJ IDEA Community Edition. It's a free development environment that we can use for Scala and for Spark. And the beauty of it is that it also integrates something called sbt, so it's gonna handle all the dirty work of actually installing Apache Spark for us. On Windows, we do have one extra step: we need to kind of fake out Windows to think that it's running Hadoop, and I'll show you how to do that. It's not too hard. And finally, we'll set up our project in IntelliJ and run a simple little HelloWorld program in Apache Spark to make sure it's all working. Let's dive in and I'll walk you through all of it.

So let's start by getting everything set up that you need for this course. Head on over to media.sundog-soft.com/SparkScala.html — pay attention to capitalization, every little letter counts — and you should reach this page here that contains everything you need to get going. But most importantly, let's install the course materials: all the scripts that you need to actually get through this course hands-on. Go ahead and click on this link here, to media.sundog-soft.com/SparkScala/SparkScalaCourse.zip. If you are typing that in by hand for some reason, be sure to pay attention to capitalization. Once that downloads, we'll go ahead and unzip it. On Windows, I can just go ahead and right-click that and say Extract All.
On a Mac or Linux, of course, you would just go to a terminal prompt and use the unzip command. What we should get is a SparkScalaCourse folder within a SparkScalaCourse folder — that's correct, that's what we want. And within that second-level folder are all the materials themselves, the whole project for this course. So first of all, let's move this someplace where we're not going to lose it. I'm taking that top-level SparkScalaCourse folder and I'm going to move it to someplace safe. Let's put it on my C drive. All right, so now on my C drive I have a SparkScalaCourse folder, and within that is another SparkScalaCourse folder. On Mac or Linux, of course, you would not have a C directory; you would put it in your home directory — just someplace where you're not going to lose it.

All right, so next we need to get some test data here. And unfortunately, the license terms of the data that I like to use don't let me include it myself, so you're gonna have to go and download that yourself. That's the MovieLens dataset — there's a set of 100,000 movie ratings there that we're going to play with throughout this course. So you can use this handy-dandy link to get it: files.grouplens.org/datasets/movielens/ml-100k.zip. Go ahead and download that. And if for some reason the grouplens.org website is down — that happens from time to time — you can usually find the ml-100k file on Kaggle if you need to. Let's go ahead and decompress that as well: right-click, Extract All. Again, just use the unzip command on Mac or Linux. And the resulting ml-100k folder should contain this stuff. We're gonna take that folder here and copy it. And I'm going to go back to my course materials folder that I just created, which for me was C:\SparkScalaCourse. And in the other SparkScalaCourse directory under that, there's a data folder.
Go into the data folder, and that's where I want to put my ml-100k folder. All right, so this is how things should look at this point, whatever operating system you're on: you want a SparkScalaCourse folder; within that, there should be another SparkScalaCourse folder; within that should be a data folder; within that should be an ml-100k folder; and within that should be all of this test data. Okay, so make sure that all looks right, or else you're going to run into weird problems and you're not going to know what's going on.

Once you're sure that's fine, let's go back to our instructions here. The next step is to install IntelliJ IDEA, which is going to be our IDE for developing in this course. Now, I used to actually tell people to install Eclipse and the Scala IDE, but it seems like IntelliJ is winning the battle there against Eclipse, so I'm gonna have you install IntelliJ now instead. Now, in order to run Scala code, you first need a JDK, and anything between versions 8 and 14 will do for this course. If you do need to get a JDK, there's a handy-dandy link here to do it. You can just head on over to oracle.com/java and go ahead and get the JDK 14 download for your operating system. For me, that's going to be Windows x64. You will need to accept their terms and wait for that to download. Looks like there's a little security warning here — it's fine, I trust it. And once that comes down, we'll go ahead and install it. Obviously, on Linux or Mac you'll probably use an alternative means of getting Java; in fact, you probably already have Java installed if you're on Linux or Mac, so this is probably somewhat of a Windows-specific thing. We'll go ahead and go through the installer here. And one thing on Windows is that when you run Linux code like Apache Spark on Windows, sometimes it gets confused when you have spaces in your path. So that space between "Program" and "Files" could actually be a problem.
Let's go ahead and change that just to be safe. I'm going to save this instead into a C:\jdk-14 directory. We'll let that do its thing; it shouldn't take too long. And we're done. All right, we're done with that site — back to our instructions.

So now we can install the IntelliJ IDEA Community Edition. That's going to be our actual IDE. Let's go ahead and click on that link. And we want the Community Edition — the free one, the open-source one; we don't need the Ultimate one. So go ahead and download that for whatever your operating system is; you'll see they offer that for Windows, Mac, and Linux. This is about half a gig, so it'll take a few seconds to come down. Come back when that's done. All right, the installer downloaded, so let's go ahead and kick it off. Pretty standard installer here; let's go ahead and just walk through it. If you do want desktop shortcuts or file associations, you can do that — totally up to you. I'm just going to leave these as is. And that's fine too. It takes about a minute to install, so I'll just come back when that's done. All right, looking good. Let's go ahead and hit the Run button to actually launch it. I don't want to import any existing settings. And personal preference here: if you like a dark theme or a light theme — I like a light theme, so I'll select that; do whatever you want, though.

And now we're going to install plugins. Maybe the only plugin that we really need is the Scala plugin, and unfortunately I'm not seeing it offered here, so I'm just going to move on. I'm not seeing it offered here either. So if you do see the Scala plugin, go ahead and take the opportunity to install it. I did not, though, so I have to do it the hard way — which isn't that hard. I'm just going to click on the Configure button here from the welcome screen and select Plugins, and from there I can find the Scala plugin. Let's go ahead and install that. Fine. All right, and we can restart the IDE to pick that up.
And now one more thing: we need to tell it which JDK to use. So go back to the Configure menu here and go to Structure for New Projects. And if you need to select a JDK here, select the one that we just installed — that's going to be 14 for us — and hit OK.

Now, there is one more step that we need to do that's only for Windows; if you're on Mac or Linux, you can ignore this next step. We need to sort of trick Windows into thinking that Hadoop is running on it, and to do that — well, it's a little bit clunky. The instructions are on your course materials page here under the "Windows only" section; just follow those instructions. Go ahead and create a C:\hadoop\bin directory. So I'm gonna go to my C drive, create a new folder called hadoop, and within that hadoop folder, I'm gonna create another folder called bin. Now I'm gonna go back to my course materials, which are under C:\SparkScalaCourse, and you'll see a winutils.exe file there. I'm gonna go ahead and copy that and paste it into C:\hadoop\bin. Next, I need to set up a couple of environment variables. The easiest way to do that is to just go to your Windows search bar down here and type in "environment variables" — just "env" is probably enough — and select "Edit the system environment variables," which will take you to the System Control Panel. From here, you can hit the Environment Variables button. We're going to create a new one called HADOOP_HOME, all caps, and the value will be C:\hadoop. We also need to edit our PATH environment variable. If you don't have one, you can make one, but you probably have one already. So I'm just going to edit the one that I have and add an additional path to it by double-clicking here and typing in %HADOOP_HOME%\bin. Hit OK, and OK again, and OK again. All right, so now we're ready to actually try and import the project for the course itself. Let's go back to IntelliJ.
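For reference, the Windows-only steps just described can also be sketched from a Command Prompt instead of the GUI. This is only a sketch under the assumptions above (course materials in C:\SparkScalaCourse, winutils.exe shipped with them); note that `setx` expands %PATH% immediately, which is one reason the GUI editor used in the lesson is the safer route:

```shell
:: Windows only: fake out a Hadoop installation for Spark.
:: Create C:\hadoop\bin and copy winutils.exe from the course materials into it.
mkdir C:\hadoop\bin
copy C:\SparkScalaCourse\SparkScalaCourse\winutils.exe C:\hadoop\bin

:: Tell Spark where "Hadoop" lives (user-level environment variable).
setx HADOOP_HOME C:\hadoop

:: Add the bin folder to PATH. setx expands %PATH% right now, so re-run
:: this if you later move the Hadoop folder.
setx PATH "%PATH%;C:\hadoop\bin"

:: Restart IntelliJ (or any open terminal) to pick up the new variables.
```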
Now, before we load up that project: it's always a good idea to restart your applications after changing environment variables. So if you are on Windows, go ahead and close out of IntelliJ IDEA and restart it; it should be in your Start menu. And now we're going to click on Open or Import, and we want to navigate to the course materials folder — SparkScalaCourse, and the SparkScalaCourse folder inside of that. That's our actual project for the course itself. Hit OK and let it do its thing; it's going to automatically try to make sense out of what's in that folder. I don't want a tip of the day. And if we're lucky, it will all just work. Okay, I don't see anything too alarming. Let's hit the build icon just to make sure it builds successfully — that's this little hammer icon up here; you'll also find it in the Build menu if you prefer. And it looks like it worked.

So let's give it a shot. Let's go ahead and open up the SparkScalaCourse folder here for the project, then open up the source folder, and under that open up main, then scala, and com.sundogsoftware.spark. These are all the scripts for the course here, so all we have to do is pick one and see if it works. I included a very simple HelloWorld script, so let's double-click on that. You can see here it's not doing a lot, but it is actually using Apache Spark, so it will actually verify that you have everything configured properly and set up the right way. It's just going to set up a SparkContext and load up the data file inside our MovieLens dataset that we installed earlier — so this will also make sure you have that in the right place as well. All it's going to do is spin up an Apache Spark job to count the number of lines in that file. A very complicated way of doing that, but it will verify that everything works. And then, when it's done, it will print out "Hello world!" The data file has, hopefully, 100,000 lines, because that is the 100,000-ratings dataset. So let's see if it works.
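The actual HelloWorld script ships with the course materials, but as a rough sketch of what a script like it does under the hood — the object name and data-file path here are assumptions that mirror the description above, not necessarily the course's exact code:

```scala
import org.apache.spark._
import org.apache.log4j._

object HelloWorld {
  def main(args: Array[String]): Unit = {
    // Quiet down Spark's logging so the output is readable.
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Run Spark locally, using every core of this machine.
    val sc = new SparkContext("local[*]", "HelloWorld")

    // Load the MovieLens ratings file we installed under data/ml-100k.
    val lines = sc.textFile("data/ml-100k/u.data")

    // Kick off a (trivially small) Spark job to count its lines.
    println(s"Hello world! The u.data file has ${lines.count()} lines.")

    sc.stop()
  }
}
```

If everything is wired up correctly, counting the ml-100k ratings file this way should report 100,000 lines.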
Just right-click on HelloWorld and say Run 'HelloWorld'. At this point, there's a really good chance that you're going to get a "class not found" error. If you do, it's just a bug in IntelliJ; if you quit IntelliJ and relaunch it, it should clear it up. And that should kick off Spark. You'll see a few warnings; they're safe to ignore, though. And now it's actually running — and it worked. So there you have it: "Hello world! The data file has 100,000 lines." If you see that, congratulations — you set up Spark, Scala, IntelliJ, and Java all successfully, and everything's working. And now all we have to do is go through all the rest of these scripts throughout the rest of the course, talk about what they do, and learn along the way. If you did not see that output, though, go back — you probably missed a little spot somewhere; there's always some little thing. And feel free to post in the Q&A or comments of this course to get help if you need it. But hopefully that'll work for you, and we can move on and start learning.

3. Introduction to Apache Spark: So let me introduce you to Apache Spark at a high level, and just talk about how it works and what it's for, real quick. The official description of Spark is that it's "a fast and general engine for large-scale data processing," and, well, that's a pretty good description. Basically, the idea is that you can write a very simple script, potentially, that describes how you want to transform or analyze a huge amount of data, and Spark will figure out how to distribute that work across an entire cluster of computers for you. So it's an engine for figuring out how to parallelize the processing of your data. You still tell it what you want done — you know, "I want to take in a bunch of log files, extract some information, and put it somewhere else." Fine.
Spark will go and figure out how to do that across your entire cluster and make it happen as quickly as possible, using the resources of tens or even hundreds of individual machines to do it. So the key to this is its scalability. Again, you just write one single driver program, as we call it. It's just a simple script, written in either Scala or Python or Java, that tells Spark what you want to do to your data. It's then Spark's problem to figure out how to parallelize that and make it scale out across an entire fleet of computers. The key insight here is that you're not limited to the computing power of one machine. With Apache Spark, you can take a massive dataset that you couldn't hope to process on a single PC and actually distribute that processing across an entire fleet of computers, in parallel, at the same time. This sort of divide-and-conquer approach is how we can process massive datasets and handle what we call big data.

From an architecture standpoint, your driver program is just something you write. Like I said, it's potentially a very simple script, and that gets handed off to a cluster manager of some sort. You need some sort of system that orchestrates your entire cluster of computers, and that might be a Hadoop cluster, in which case Hadoop's YARN cluster manager would be coming into play. That's going to be worrying about how to spin up the resources you need, how to distribute that work, and where to put the different jobs in the most optimal place. It thinks about things like: how do I run the code in the place where the data is most accessible? So if my data, for example, is split out on a distributed file system, the cluster manager might say, "Okay, I'm going to run the code that processes that chunk of the data on that same machine, to make it run even faster." However, you don't need to use Hadoop. Spark has its own built-in cluster manager as well.
So if you just want to run Spark in a standalone environment, you can do that too — just install Spark on every machine in your cluster and configure it properly, and it will just work. Spark can run on its own; it doesn't necessarily need to run on top of Hadoop, although it can. Sometimes you'll want to run other Hadoop applications on the same cluster and set up more complex pipelines of operations, so there can be advantages to actually running on top of Hadoop — but you don't have to. The individual machines are different nodes, as we call them, and these will be running different executors. Every executor process that gets distributed throughout your cluster has its own cache, and it has its own tasks that are operating on your data. And you can see, with all the arrows here, that pretty much everything is talking to everything else. Your driver program sends out commands to the cluster manager, and also directly to the executors when needed. The executors are talking to each other and synchronizing amongst themselves. And of course, the cluster manager is talking to all of those executor processes as well, trying to orchestrate what gets run where, and then collating those results back together to give you your final result when it's all done. So that's the Spark architecture at a very high level.

Why is Spark so popular? Well, Spark has pretty much replaced Hadoop MapReduce, because it can be up to 100 times faster when it's running in memory. So if you have enough memory in your cluster, that's a realistic approximation; if you're actually reading data directly from disk, it will still be about 10 times faster. Why is it so much faster than MapReduce? Well, it's because of what we call a directed acyclic graph engine, or DAG engine. Basically, it's going to look at the workflow that you've described in your driver script, and it will optimize that for you automatically.
In contrast, in MapReduce you're kind of wedged into a single way of processing data: you have to explicitly map all of your data in parallel, and then define some way of reducing that data back into a final answer. The DAG engine, though, can be a little bit more flexible; it can organize that workflow in a more complex and potentially more optimal manner. And because Spark is memory-based, that gives it a huge advantage as well. So it's fast, and it's also very easy to use.

It's also hot — it's very widely used. This is actually a very old list of people using Spark, and there are many, many more people using it now. But the point of this slide is just to show you that it's proven technology. It's being used by very large corporations; it's a very mature technology; it's been out for a while. You know, the new features in Spark are kind of slowing down a little bit, at least in the open-source world, and that's okay, because it pretty much does everything you need to do, and it does it quite reliably at this point. So for distributed data processing, Spark is a mature technology, and it's very widely adopted.

It's also not that hard. You have your choice of writing your code in Python or Java or Scala. Obviously, in this course we're going to focus on Scala, and we'll talk about why in a moment. But it's easy to use: if you know SQL — Structured Query Language, the same language you use for interfacing with any relational database — you're going to feel right at home, because Spark has features called Spark Datasets and Spark DataFrames that operate very similarly to SQL statements. And you can even give it SQL commands directly, through a feature called Spark SQL. So if you know SQL, you can use Spark; it's just that easy. But not everything is a SQL problem. Not every data analysis or transformation can be defined through a SQL command.
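To make that SQL-style interface concrete, here's a minimal sketch. The file name, column names, and query are assumptions for illustration, not from the course materials:

```scala
import org.apache.spark.sql.SparkSession

object SparkSQLSketch {
  def main(args: Array[String]): Unit = {
    // A SparkSession is the entry point for the DataFrame / SQL APIs.
    val spark = SparkSession.builder
      .appName("SparkSQLSketch")
      .master("local[*]")
      .getOrCreate()

    // Treat a CSV of ratings as a table (schema is inferred from the file;
    // "ratings.csv" with userID/movieID/rating columns is a hypothetical input).
    val ratings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/ratings.csv")
    ratings.createOrReplaceTempView("ratings")

    // Plain SQL, which Spark parallelizes across the cluster for you.
    spark.sql(
      """SELECT movieID, COUNT(*) AS numRatings
        |FROM ratings
        |GROUP BY movieID
        |ORDER BY numRatings DESC""".stripMargin
    ).show(10)

    spark.stop()
  }
}
```

The same query could be written with DataFrame method calls (`groupBy`, `count`, `orderBy`) instead of a SQL string; both go through the same optimizer.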
And if you do want to get to a lower-level API, that's available too. The original API for Spark is called the Resilient Distributed Dataset, or RDD for short. We're going to go into a lot more depth about how that works in a moment here. But with the RDD API, you can get lower level, and sometimes that can give you even better performance; it also gives you more flexibility in what you can do. For most common data transformation or analysis tasks, though, you can probably define what you want as a SQL command, and more often than not you'll be using Datasets, DataFrames, or the Spark SQL API.

From a software architecture standpoint, this is how Spark is laid out — and this goes back to Spark's original architecture; the lines have kind of blurred on a few of these in recent years. At its core is, well, Spark Core, and that's where RDDs live and the like, right? So that's kind of the underlying engine of Spark itself, and you can go directly to Spark Core — we'll see that in action. But there are also higher-level APIs built on top of Spark Core to make your life easier for specific tasks. One is Spark Streaming, which is obviously a very powerful technology for ingesting data in real time, or near real time. You could imagine, for example, having a fleet of servers out there running your website that are feeding data into Spark, through Spark Streaming, from their log files, continuously. And in real time, Spark can monitor that data, look at a window of that data over time, give you analytics over that window, and take some action based on it. As a simple example, let's say you want to have some sort of an alarm on 500 errors on your website. You can have a Spark Streaming system set up so that the logs are being streamed into Apache Spark, and in real time it's counting up how many 500 errors there are over the past hour, the past minute, whatever you want to monitor, and taking some action if that exceeds some threshold.
And obviously much more complex operations are available as well. Maybe a more common application would be transforming that log data and putting it somewhere else. So I could have a Spark Streaming process that ingests data from my logs and transforms it into some format that maybe Elasticsearch wants to see, or something like that. We also have Spark SQL, and that exists to let you interact with Spark using SQL commands, so you can treat Spark just like a giant database that's distributed in nature. So if you can define your data in terms of a table structure, which you usually can, and you can define the problem you want to solve in terms of a SQL command, which you probably can, you can just use Spark SQL to define what you want it to do, and Spark will figure out how to parallelize that across an entire fleet of computers. So that's really exciting, right? It gives you all the flexibility of a relational database, but you're not limited to one machine anymore. You can actually horizontally scale that database. Now, it used to be you had to choose between, like, NoSQL databases if you wanted distributed computing, and a big monolithic relational database if you didn't. And there are still some limitations here, mind you: doing big joins is still not going to be very efficient in a horizontally partitioned server environment. But you can do it if you want to, right? So it's kind of the best of both worlds. Now, like I mentioned, the lines are blurred in some cases here. So these more modern APIs that we're going to look at later in Spark, using DataFrames and Datasets, they are also very similar to SQL in their structure and how they're used. So, you know, do you consider DataFrames and Datasets part of Spark SQL or Spark Core? Again, the lines are kind of blurry there, but the SQL-based interfaces are kind of becoming the predominant way of using Spark. We also have MLlib, Spark's machine learning library.
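As an aside, the Spark SQL idea from a moment ago can be sketched as a tiny driver program: define your data as a table, ask your question in SQL, and let Spark work out the parallelism. This is just a hedged sketch, not code from the course: it assumes you have the spark-sql dependency on your classpath, and the table name, column names, and data here are made up for illustration.

```scala
// A minimal sketch of "treat Spark like a giant distributed database".
// Assumes the spark-sql library is available; names and data are hypothetical.
import org.apache.spark.sql.SparkSession

object SparkSQLSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("SparkSQLSketch")
      .master("local[*]")   // run locally, using every core on this machine
      .getOrCreate()

    import spark.implicits._

    // A tiny table of (name, age) rows. In real life this would come from
    // files or a database, partitioned across a cluster.
    val people = Seq(("Picard", 59), ("Riker", 35), ("Data", 26))
      .toDF("name", "age")
    people.createOrReplaceTempView("people")

    // Describe WHAT you want in SQL; Spark figures out HOW to parallelize it.
    val adults = spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name")
    adults.show()

    spark.stop()
  }
}
```

The same query would run unchanged whether the `people` table lives on one laptop or is partitioned across a fleet of machines; that's the appeal being described here.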
And if you do want to do distributed machine learning on Apache Spark, you can do that too. It's a somewhat limited set of algorithms, although they're most of the ones that you would need in practice. So we'll look at that in a later section of the course as well. And that's really exciting, right? Because if you have machine learning that you want to run on a massive dataset, no longer are you limited to what you can do on a single machine. There are some algorithms that, to this day, are hard to scale out, but Spark has figured it out for many of the most popular machine learning algorithms that you might want to be using. And finally, there's GraphX. I don't want to talk about that too much; it's kind of fallen by the wayside. GraphX is not about, you know, charts and graphs or plotting little lines and stuff like that. It's about graphs in the computer science sense. So we're talking about networks of information. For example, a social network, where you have users that are connected to other users, is a graph in that sense. And GraphX can do things like, you know, analyze those graphs of information, tell you attributes about them, and let you sort of iterate through them in a distributed manner. But GraphX has, again, kind of fallen by the wayside. It hasn't really been well maintained lately, and there are newer alternative APIs these days that are more popular. We'll talk about that more at the end of the course. In this course, we are using the Scala programming language. Why? Why are we using Scala? That's kind of an obscure language, isn't it? Well, there's a few reasons. One is that Spark itself is written in Scala. So by writing your scripts in Scala, you're kind of getting closer to how Spark itself is written and optimized. So, you know, that can potentially lead to better performance. The other thing is that Scala is what we call a functional programming language.
And as such, it's really a good fit for distributed processing. Scala really enforces that you write your code in such a way that your functions can be distributed across an entire cluster, whereas other languages like Java and Python don't really try to force you into that. So by writing your Spark driver scripts in Scala, you're more likely to be writing code that can be parallelized safely and easily. It also gives you fast performance. Scala compiles down to Java bytecode, so at the end of the day it's running on the JVM, the Java virtual machine, and that's pretty darn fast on most systems. Obviously, Java will also give you fast performance, because that will also compile down to Java bytecode. But contrast that to writing your Spark scripts in Python, which you can do. You have to go through another layer there, right? That Python code needs to be somehow transformed into Java bytecode at the end of the day. So writing in Scala just gets you a little bit closer to that ultimate lower level where your code will actually be running. Now, to be fair, Python is pretty darn fast in Spark these days, so the difference is not as big as it used to be, but there's still a small difference. The other advantage of Scala, if you want to put it against Java, is that it's easier to use. There's going to be a lot less code you have to write, a lot less boilerplate stuff than you would have to write if you're coding in Java. Java has a lot of overhead associated with it in terms of how you actually compile that code and distribute it and stuff like that. It's a lot easier in Scala, it turns out. And like I said, in comparison, Python is slower. It's not as slow as it used to be; you're still going to get a little bit of an edge with Scala, but that speed comparison has been closing over time. But what are the downsides to Scala? Well, one is that you might not know Scala yet. You know, it's not a very common language.
So you're going to have to go learn the basics of how Scala works, but it's not as hard as you think. For example, let's take a look at this little snippet of code. We're doing the same thing here in Python and in Scala. We're just going to write some code to square the numbers in a dataset. Pretty simple stuff. So if you look at the Python version and the Scala version, they're not that different, right? Syntactically there's little things, like, you know, you have to declare in Scala that it's an immutable constant that you're using by saying val. The syntax for defining a list of stuff is slightly different. The syntax for lambda functions is a little bit different, but it's the same idea, right? So Scala has kind of a weird syntax. Sometimes things can be a little bit backwards, and we're going to talk about that. Don't worry about it. But at the end of the day, it doesn't look that much different from Python code in the context of a Spark driver script. And with that, let's actually dive into a crash course in Scala, if you need it. In this next section, we're actually going to go into the basics of Scala. What's different about it? What's weird about it? I do expect that you're going to have some prior experience in writing code in some scripting or programming language. I'm not here to teach you how to program from scratch, guys; that would be a different course. But if you do have some Python under your belt, or C or Java or something, I think you can pick up Scala pretty quickly. So in this next section, if you need it, we have a little bit of an introduction to Scala that will demystify the syntax for you. And as we go through the course, you'll see lots and lots of examples of using Scala, and I think it will just sink in by looking at it enough and seeing enough examples. So let's dive into our Scala crash course if you need it. If you don't, feel free to skip the next section and we'll just dive right into how Spark works. 4.
[Activity] Scala Basics: Hi, I'm Frank Kane, and welcome to my office. We're going to start by doing a little crash course on the Scala programming language itself. Now obviously, if you're already familiar with Scala, you can skip this section, and that's fine. But if you're new to Scala and you've had some programming experience before, you'll find this section very helpful for understanding the code that we're going to be looking at throughout this course. It's just enough to be dangerous, right? So don't expect a comprehensive introduction-to-Scala course here in this section, but it's enough to get you through this course at least, and through the examples that we'll go through in this section and the examples later in the course. I think you'll end this course with a pretty good understanding of how Scala works, and even how to write your own Scala code. However, if you're new to programming altogether, this isn't going to be enough for you. I would encourage you to go and find an introductory course on Scala that goes into more depth first, and then come back to this one. But for the rest of you, let's plow ahead and learn Scala. All right, let's learn Scala. Just to set expectations, you're not going to be a Scala expert at the end of watching four videos with me. So what I'm really trying to do here is just get you familiar with the syntax of the Scala programming language and introduce some of the basic constructs. How do I call a function in Scala? How does flow control work? What are some basic data structures I might use with Scala? I want to show you enough Scala code that it's not going to look scary and intimidating to you as we go through the rest of the course. So with that, let's talk about Scala at a high level first. First of all, why learn Scala? Well, you've probably never heard of it before. Maybe you have, but you certainly probably don't know it. It's mostly used for Spark programming.
But it is uniquely suited for Spark because it's really structured in a way that lends itself to doing distributed processing of data over a cluster, and you'll see why a little bit later on. It's also what Spark itself is built with. So by learning Scala, you'll get access to all the latest and greatest Spark features as they come out, and it can usually take a pretty long time for those features to trickle down to, say, Python support within Spark. And it's also going to be the most efficient way to run Spark code itself. So by using Scala, you will have the fastest and the most reliable Spark jobs that you can possibly create. And I think you'd be pretty surprised at just how much faster and how much more reliable the same Spark job written in Scala is compared to, say, the same Spark job written in Python. So even though it might be tempting to go off and stick with the language you already know, learning Scala is worth the effort, and it's really not that hard. Truth is, the same Spark code written in Scala and in Python looks very similar at the end of the day. Now, Scala itself runs on top of the Java virtual machine. It just compiles down to Java bytecode and gets run by the JVM. So one nice thing about that is that you also have access to all of Java. If there's a Java library you want to pull into your Scala code, you can do that. So you're not limited to what's in the Scala language itself; you can actually reach down to the Java layer and pull out the Java that you want to use, too. And we'll do that later in this course, for example, for dealing with regular expressions in a little bit more of an intuitive manner than we could otherwise. Another key point about Scala is that it is focused around what is called functional programming, where functions are sort of the crux of what we're dealing with. Functions get passed to other functions and chained together in ways you might not be used to. But this is really how Spark works at a fundamental level.
We basically take an abstraction over a chunk of data and we assign it a function to do some processing on that data. And functional programming in Scala makes that very intuitive to do from a language standpoint. All right, so let's just jump right into the deep end of the pool and sink or swim with Scala. We're just going to write some code, see what happens, and get your hands dirty. Now, I didn't actually provide you with a copy of this code I'm going to be going through, because there's actually value in typing it yourself to make it kind of sink in. So let's start by creating what's called a new Scala worksheet. This is going to give us an interactive environment where we can just sort of experiment with Scala code and evaluate it interactively. So go to your File menu in IntelliJ and say New Scala Worksheet, and we'll call this one learning Scala one. And in this first lecture, we're just going to talk about the syntax and structure of the Scala language, because it's a little bit weird compared to other languages out there. So first of all, if you want to write a comment, you can just do a double slash like you would in many other languages. And first we're going to talk about values. So, values are immutable constants. That's an example of a comment line there. Now, in other languages we have the concept of variables. You know, it's a very universal thing in programming to assign some value to a named variable, right, and use that throughout your code. Now, in Scala there's two different kinds: what are called values, which are immutable, and variables, which are mutable. And in Scala you want to stick with values as much as possible. So here's an example of how to define one. We could say val, for value, hello, colon, String, equals, quote, Hola. And if you want to actually execute that, we can just hit the little play button here. Or you can see there's also a keyboard shortcut of Control Alt W that I'll use from now on.
And it's creating an environment to execute that in right now, and there we have it. So you can see that it actually executed that command and assigned the value Hola to a string called hello. So let's spend some time talking about the syntax here, because it's kind of backwards from many other languages, right? So we start off by saying this is a value. That means that we're defining an immutable constant: once we actually define what hello is, we can't change it ever again. Then we're going to call this value hello, and after the colon, we have to declare what type it is. So we're saying that hello is a String type. That's backwards from most other languages, right? Usually you'll see, like, string hello, but in Scala it's hello, colon, String. And then we assign it a value. Nothing too weird there, just the string Hola in quotation marks. Okay, so that makes sense, right? It's a little bit backwards, but you get used to it pretty quickly. Now let's talk about variables. So variables are mutable. That means you can actually change them after you've defined them. To define a variable, it's the same thing; you just use a var instead of a val. So we can say var helloThere, which is also a String, and we'll set that to hello. We're going to assign it the value of the immutable constant hello. So I'm going to say Control Alt W, and you can see there that helloThere has been assigned the string value Hola, because that was stored in our immutable value hello. Makes sense so far. But helloThere is a variable; we can change it. It's not stuck being Hola. So we could say something like helloThere equals hello plus space there, and Control Alt W. And you can see that we've actually modified helloThere to now contain the string Hola there. So as you can see, variables can be changed; values, however, cannot. We could also use the println command to print out that value explicitly.
Println helloThere, which does exactly what you would think: it prints out the value of that variable on a line. Control Alt W, and there it is. All right, so you've seen some basic stuff here. First of all, the concept of values and variables: values, again, are immutable; once you define them, you cannot change them. Whereas variables are mutable; you can change them after you define them. And also note the syntax here of declaring values and variables. It's val or var, the identifier's name, colon, the type, and then equals whatever you want to set it to. Again, that's backwards from a lot of other languages. And just to show you what happens, and to prove to you that values cannot be changed, let's change that var to a val and see what happens if we try to execute this again. Control Alt W. You can see that we've got an error here now: reassignment to val. We can't say helloThere equals hello plus there, because you cannot change a val; a value is immutable. Let's change that back to var and execute it again. Control Alt W. All right, so far so good, right? Now, why do we have this distinction between values and variables in Scala? Well, it's because this is what we call a functional programming language. Scala is kind of centered around the idea of passing functions around and potentially running them in parallel. That's why it's such a good match for Apache Spark. And the reason that we want to stick with immutable constants whenever we can is to avoid a bunch of thread safety issues and to sort of head them off at the pass. So imagine you have a function that has a variable in it that it can change, and you pass that variable into many, many threads. What happens if one thread is trying to change that variable at the same time that another thread is trying to change it to something else? The results become undefined, right? So we avoid a lot of these race conditions by trying to use immutable constants whenever possible.
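The steps above can be collected into one worksheet-style sketch. The names mirror the ones used in the lecture; the commented-out line shows the reassignment that the compiler rejects.

```scala
// Values vs. variables, recapping the rules above.
val hello: String = "Hola"          // val: an immutable constant
var helloThere: String = hello      // var: a mutable variable

helloThere = helloThere + " there"  // fine: vars can be reassigned
println(helloThere)                 // prints: Hola there

// hello = "Bonjour"                // would NOT compile: "reassignment to val"
```

Note the declaration order in both lines: the keyword, then the name, then a colon and the type, then the initial value.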
If our functions are only acting on and processing immutable data, we don't have to worry about all of those thread safety and race condition issues. Now, this doesn't limit you as much as you might think. You can still get a lot done just using values, immutable constants. So for example, if I did want to do the above operation and construct the string Hola there from the value hello, I can still do that using values. I could say something like val immutableHelloThere equals hello plus there, right? And I could print out that result. Control Alt W. And that works, because I'm defining immutableHelloThere on a single line here: I'm taking a previously defined immutable value, adding another immutable value to it, and assigning the result to a new immutable value. With the variables, we did it a different way: we started off by setting helloThere to hello, and then in another operation we added the string there to it. So as long as we do everything in one line, in one atomic operation, we're still sticking to the rule of using values whenever possible. All right, so we've seen the String data type here in action, right? There are many other data types available to you in Scala, so let's talk about data types. For example, we could say val numberOne, whoops, if I type right, numberOne, colon, Int. So an Int is exactly what you think it is: an integer, a whole number. We could also say val truth is a Boolean, and we'll set that to true. Note that in Scala, the true and false constants are all lowercase. There are languages where you would capitalize true and false, but not in Scala. So that's just a Boolean value, true or false. We can also have characters. So we can say val letterA is a Char type, a single character, a single ASCII value. We also have double-precision numbers, of course. So we can say pi is a Double, and set that to 3.14159265 or whatever it is.
And we can also represent a single-precision floating-point value with val piSinglePrecision, and we'll declare that as a Float, which is a single-precision floating-point value. And we'll set that to 3.14159265f, where the f suffix means floating point, single precision. Let's Control Alt W to see what we have so far. It's working. So you can see that all of these values have been defined as expected. Let's keep going. There's also a Long data type. Let's say val bigNumber; we'll define that as a Long, and we'll set that to some big integer, 1234567890 or whatever you want. And we can also say a single-byte number. We can say val smallNumber, declare that as a Byte, equals 127. So a Byte is basically a number that's crammed into a single byte, so it can only represent numbers from negative 128 to positive 127, or if it were unsigned, 0 to 255. Control Alt W to execute those. Okay, looking good. Moving on, let's talk about how to actually print and display your data and format your output. That's always an important thing, right? So let's say we want to concatenate a bunch of strings together, or a bunch of values together, and print them out. Obviously, you want to be able to view the results of your Spark programs. So this is how you would do that. We could say println, Here is a mess. And the secret here is you can just use the plus operator to concatenate stuff together and print it out as a big string. And it doesn't just have to be strings, either; it can be any data type. It will implicitly convert it to a string. So I can say, here's a mess, plus numberOne, plus truth, plus letterA, plus pi, plus bigNumber, and that should work. Control Alt W. There it is. Crammed all together, because I didn't insert any spaces between everything, but it works. Note also, by the way, there is no semicolon or anything at the end of the lines here. It just assumes that every new line is a new command, basically.
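Collected in one place, the data type declarations above look like this in a worksheet:

```scala
// The basic Scala data types covered above, one declaration per line.
val numberOne: Int = 1                      // integer, a whole number
val truth: Boolean = true                   // true/false are lowercase in Scala
val letterA: Char = 'a'                     // a single character
val pi: Double = 3.14159265                 // double-precision floating point
val piSinglePrecision: Float = 3.14159265f  // the f suffix marks single precision
val bigNumber: Long = 1234567890L           // the L suffix marks a Long literal
val smallNumber: Byte = 127                 // one signed byte: -128 to 127

// The + operator concatenates anything into one big string, no semicolons needed:
println("Here is a mess: " + numberOne + truth + letterA + pi + bigNumber)
```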
So there's no need to explicitly terminate your lines of code in Scala. What if you want to do printf-style formatting? If you're coming from a background of, like, C or C++, you might be familiar with the printf command, which allows you to put in formatting hints for how to actually display numerical data, or insert strings and other data into an existing string. Here's how that looks in Scala. We can say println, f, which means we want printf-style formatting, quote, Pi is about, dollar sign, piSinglePrecision, percent, point 3, f. Okay, so let's break that down a little bit. First of all, note that we have that dollar sign there. That's indicating that we have a variable name, or value name rather in this case, following that dollar sign. So dollar sign piSinglePrecision is going to insert the value of piSinglePrecision. And percent point 3 f means that it's a floating-point value that we're going to display there; that's what the f means. And percent point 3 means that after the decimal point, we only want three digits of precision displayed. So let's go ahead and hit Control Alt W. And you can see that it did exactly what we said: Pi is about 3.142. It just displayed those three digits following the decimal point there, because that's all we wanted. That can be handy if you are displaying a double-precision or even a single-precision number that has more precision than you need. Let's see how that works with integers. For example, we could say println, f, again just indicating printf-style format, quote, Zero padding on the left, dollar sign, numberOne, percent, 0, 5, d. All right, so this time we're saying that we're going to insert the value of the numberOne value, and then percent 05d: the d just means that it's a number, an integer, and percent 05 means that I want to have at least five digits to the left of the decimal point. So let's see what that does. Control Alt W.
And we get 00001, giving us the five digits of padding on the left that we wanted. That can be useful when you're trying to align output in columns, right? So that's also a handy trick sometimes. Also, if you just want to substitute variables into a string without actually specifying the formatting, that's easy to do as well. For example: println, s, for substitution, quote, I can use the s prefix to use variables like, dollar sign numberOne, dollar sign truth, and dollar sign letterA. All right, Control Alt W. And it worked. So you can see it's a very easy way to just insert value names into your string there, without having to use the concatenation operator unnecessarily. Just a different way of doing it. Okay, what else can we do? We can include expressions in our print commands. Let me show you how that works: println, s. The s prefix isn't limited to variables; I can include any expression, like, dollar sign, open curly bracket, one plus two, close curly bracket. And it automatically completed that line for me. So the key here is those curly brackets. If you have a curly bracket set after a dollar sign, it's actually going to evaluate the expression inside those curly brackets and print out the result of that as part of the string. Control Alt W. And if we scroll over, we should see that it printed out the number three. So that's a neat trick, right? Those curly brackets after a dollar sign will actually evaluate an expression within a println command, if you use the s prefix. There are also regular expressions. If you're familiar with those, that's a powerful tool for actually munging your string data. Let's see how that works. So let's start off with a string that is val theUltimateAnswer; we'll define that as a String. Again, we're getting back to how to define a value. And we'll set that to: Life, the universe, and everything is 42. And if you recognize that reference, then you, too, are a fan of The Hitchhiker's Guide to the Galaxy. Welcome.
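Before we go on with the regular expression, here's the string formatting from above gathered into one sketch: the f interpolator for printf-style formatting, and the s interpolator for plain substitution and expressions.

```scala
val piSinglePrecision: Float = 3.14159265f
val numberOne: Int = 1

// f interpolator: printf-style format specifiers follow the variable name.
println(f"Pi is about $piSinglePrecision%.3f")        // three digits after the decimal point
println(f"Zero padding on the left: $numberOne%05d")  // at least five digits wide

// s interpolator: plain substitution, and ${...} evaluates any expression.
println(s"I can use the s prefix to use variables like $numberOne")
println(s"The s prefix isn't limited to variables, like: ${1 + 2}")
```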
All right, so we have this string, and what we want to do is write a regular expression that will extract the answer to the ultimate question of life, the universe, and everything. So we're going to set up a regular expression to extract that number from the end of that string. We can say val pattern equals triple quotes, one, two, three, and then we're going to write in a regular expression to extract the information that we want from that string: dot, star, space, parenthesis, square bracket, backslash d, end square bracket, plus, close paren, dot, star, and then those three quotes again, dot r. Okay, so the dot r means this is a regular expression that we're defining here. And going over how regular expressions work is probably out of scope for this course, but they are a very useful tool for things like extracting information out of log files and things like that. A quick breakdown of what's going on here: the dot star means to match anything in that string, followed by a space. And then within the parentheses is the thing that we're trying to extract from that pattern. The brackets and the backslash d mean I want to extract a number, and any number of digits; that's what the plus sign means. Followed by any other characters. Okay, so by looking for a bunch of characters, followed by a space, and then a number, followed by anything else, that's going to pull that 42 out of that string. Let's hit Control Alt W just to make sure that's working. All right, cool. So let's take a look over here. We have that string we defined, Life, the universe, and everything is 42. And now we've defined a matching regular expression object that consists of that regular expression. Okay, so that's what happened with this line here. Now, to apply that regular expression to the string, it's very simple. We can just say val, pattern, parentheses, answerString, close parentheses, equals theUltimateAnswer. All right, so the syntax there is a little bit weird.
It means we're going to take the regular expression that we defined in pattern, and we're going to assign the output of that to answerString. So the syntax here is basically saying: I want to take what's in these parentheses and transfer the result to what's in these parentheses. Okay, that's one way of thinking about it. And we're going to apply the ultimate answer to that pattern. So the syntax there, again, is a little bit backwards from how you might be thinking about things in other languages. And then we can just print out what that answer is. First, let's convert it to an integer. We can say val answer equals answerString.toInt. So that's just showing you how to actually convert one type to another. It would help if I typed answerString correctly. And then I can print that out: println, answer. All right, so we've defined a string; we've defined a regular expression to extract information from that string; we defined a statement to actually apply that regular expression to the string and store the result somewhere, and we're calling that answerString. We've converted that from a string to an integer, and then we're going to print that integer out. Control Alt W. And it worked. So you can see that we extracted the string 42, we converted that to the integer 42, and printed it out when we were done. All right, moving on. Let's talk about Booleans. Really easy: they work exactly as you would expect. So for example, we could say val isGreater, and set that equal to one greater than two. What do you think that will come out to? Is one greater than two? No, the answer is false. So that works pretty much the way you would expect. We could say val isLesser equals one less than two; that's true. We could say val impossible equals isGreater, ampersand, isLesser. And you can see that we can use a single ampersand there; that's fine. But we could also say a double ampersand. Let's see what those do. Control Alt W.
So these actually aren't the same thing here. Like in C or C++, a double ampersand is actually a logical AND, whereas a single ampersand is a bitwise AND. And the only reason this works at all is because isGreater and isLesser can evaluate down to zeros and ones, and it still ends up working out, and we just implicitly convert that to true or false. But if you're trying to do a logical operation, which is what we're really trying to do here, you should be using the double ampersand. That's the logical Boolean operator. So they do give you the same result in this case, but it's really better form to use the double ampersand there. Works the same way with OR: if you want to say isGreater or isLesser, that works; that would be true, presumably. Yep. So you get the idea: Booleans work pretty much as you would expect from other languages. Let's play with it some more. We could say val picard, which is a String, and set that equal to Picard. And we can say val bestCaptain, another value, just with a different name, also a String, equals Picard. And we can say val isBest, declare that as a Boolean, and we'll set that to picard equals equals bestCaptain. So let's go ahead and execute that. Control Alt W. It does what you might think. So equals equals here is actually going inside those strings and comparing the values of the strings. That means we want to actually compare the values of those two things and see if they're identical. It's not comparing the objects themselves, or the addresses of the objects; it's actually going into the strings and comparing them to each other. So if you want to compare two strings, just use the equals equals operator. That can be something that's kind of weird in some languages, so it's important to point that out here. And if you do want to debate in the Q&A whether Picard is the best captain, I welcome that discussion. All right.
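And here's the whole regular-expression exercise from earlier, collected into one runnable worksheet-style sketch:

```scala
val theUltimateAnswer: String = "Life, the universe, and everything is 42"

// Triple quotes avoid escaping the backslash; .r compiles the string to a regex.
// .* matches anything, then a space, then ([\d]+) captures one or more digits.
val pattern = """.* ([\d]+).*""".r

// Pattern matching applies the regex and binds the captured group:
val pattern(answerString) = theUltimateAnswer
val answer = answerString.toInt  // convert the captured String to an Int
println(answer)                  // prints: 42
```

If the string didn't match the pattern at all, that pattern-binding line would throw a MatchError at runtime, which is worth knowing before you point this at messy log data.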
If you want to get your hands dirty and play around with this stuff, do some more stuff. Keep going, guys. For example, you could write some code that takes the value of pi and doubles it, and then prints it within a string, with three decimal places of precision to the right. Okay, so actually I'm going to paste that in here as a little challenge for you. And it's really easy. I'm not even going to give you the answer; you guys can talk in the Q&A if you want to discuss the actual solution. But there's my challenge to you: go and apply what you just learned. Yeah, just write a little snippet of code that takes the value of pi we defined, multiplies it by two, and prints it out within a string with three decimal places of precision to the right. Everything you need to accomplish that should be above here. A simple way to get some hands-on exposure. So get your hands dirty, play around a little bit more, and then we'll move on to the next chapter of learning Scala. 5. [Exercise] Flow Control in Scala: So, moving on. If you want to save what you've done so far, you can just hit Control-S and close that out, just to keep it around for future reference if you want. And let's make another Scala worksheet, and we'll call this one the creative name of learning Scala. And this time we're going to talk about flow control. So let's see how if-else statements work. They work exactly the same way as they do in other languages; nothing too weird here. For example, we could say: if one is greater than three, println impossible, else println the world makes sense. So exactly like every other language there: if some expression is true, you do this expre