Apache Spark 3 with Scala: Hands On with Big Data!

Frank Kane, Machine Learning & Big Data, ex-Amazon


Lessons in This Class

    • 1. Introduction (1:28)
    • 2. Installing the Course Materials (13:54)
    • 3. Introduction to Apache Spark (14:26)
    • 4. [Activity] Scala Basics (25:58)
    • 5. [Exercise] Flow Control in Scala (9:28)
    • 6. [Exercise] Functions in Scala (9:08)
    • 7. [Exercise] Data Structures in Scala (22:28)
    • 8. The Resilient Distributed Dataset (11:30)
    • 9. Ratings Histogram Example (11:27)
    • 10. Spark Internals (1:59)
    • 11. Key / Value RDD's, and the Average Friends by Age Example (10:42)
    • 12. [Activity] Running the Average Friends by Age Example (4:51)
    • 13. Filtering RDD's, and the Minimum Temperature Example (5:54)
    • 14. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum (11:35)
    • 15. [Activity] Counting Word Occurrences using Flatmap() (5:46)
    • 16. [Activity] Improving the Word Count Script with Regular Expressions (3:44)
    • 17. [Activity] Sorting the Word Count Results (6:35)
    • 18. [Exercise] Find the Total Amount Spent by Customer (4:30)
    • 19. [Exercise] Check your Results, and Sort Them by Total Amount Spent (5:09)
    • 20. Check Your Results and Implementation Against Mine (3:00)
    • 21. Introduction to SparkSQL (9:44)
    • 22. [Activity] Using SparkSQL (7:05)
    • 23. [Activity] Using DataSets (8:33)
    • 24. [Exercise] Implement the "Friends by Age" example using DataSets (2:40)
    • 25. Exercise Solution: Friends by Age, with Datasets (7:22)
    • 26. [Activity] Word Count example, using Datasets (10:37)
    • 27. [Activity] Revisiting the Minimum Temperature example, with Datasets (9:00)
    • 28. [Exercise] Implement the "Total Spent by Customer" problem with Datasets (2:10)
    • 29. Exercise Solution: Total Spent by Customer with Datasets (6:28)
    • 30. [Activity] Find the Most Popular Movie (5:24)
    • 31. [Activity] Use Broadcast Variables to Display Movie Names (11:19)
    • 32. [Activity] Find the Most Popular Superhero in a Social Graph (12:18)
    • 33. [Exercise] Find the Most Obscure Superheroes (5:14)
    • 34. Exercise Solution: Find the Most Obscure Superheroes (6:44)
    • 35. Superhero Degrees of Separation: Introducing Breadth-First Search (7:14)
    • 36. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark (7:59)
    • 37. [Activity] Superhero Degrees of Separation: Review the code, and run it! (12:55)
    • 38. Item-Based Collaborative Filtering in Spark, cache(), and persist() (7:59)
    • 39. [Activity] Running the Similar Movies Script using Spark's Cluster Manager (14:48)
    • 40. [Exercise] Improve the Quality of Similar Movies (3:54)
    • 41. [Activity] Using spark-submit to run Spark driver scripts (11:43)
    • 42. [Activity] Packaging driver scripts with SBT (15:06)
    • 43. [Exercise] Package a Script with SBT and Run it Locally with spark-submit (2:04)
    • 44. Exercise solution: Using SBT and spark-submit (9:04)
    • 45. Introducing Amazon Elastic MapReduce (7:11)
    • 46. Creating Similar Movies from One Million Ratings on EMR (11:33)
    • 47. Partitioning (4:18)
    • 48. Best Practices for Running on a Cluster (6:25)
    • 49. Troubleshooting, and Managing Dependencies (10:59)
    • 50. Introducing MLLib (9:55)
    • 51. [Activity] Using MLLib to Produce Movie Recommendations (12:42)
    • 52. Linear Regression with MLLib (6:58)
    • 53. [Activity] Running a Linear Regression with Spark (7:47)
    • 54. [Exercise] Predict Real Estate Values with Decision Trees in Spark (4:56)
    • 55. Exercise Solution: Predicting Real Estate with Decision Trees in Spark (5:47)
    • 56. The DStream API for Spark Streaming (11:28)
    • 57. [Activity] Real-time Monitoring of the Most Popular Hashtags on Twitter (8:51)
    • 58. Structured Streaming (4:03)
    • 59. [Activity] Using Structured Streaming for real-time log analysis (5:33)
    • 60. [Exercise] Windowed Operations with Structured Streaming (6:04)
    • 61. Exercise Solution: Top URL's in a 30-second Window (5:44)
    • 62. GraphX, Pregel, and Breadth-First-Search with Pregel (6:51)
    • 63. Using the Pregel API with Spark GraphX (4:29)
    • 64. [Activity] Superhero Degrees of Separation using GraphX (7:07)


600 Students, -- Projects

About This Class

New! Updated for Spark 3.0!

"Big data" analysis is a hot and highly valuable skill – and this course will teach you the hottest technology in big data: Apache Spark. Employers including Amazon, eBay, NASA JPL, and Yahoo all use Spark to quickly extract meaning from massive data sets across a fault-tolerant Hadoop cluster. You'll learn those same techniques, using your own Windows system right at home. It's easier than you might think, and you'll be learning from an ex-engineer and senior manager from Amazon and IMDb.

Spark works best when using the Scala programming language, and this course includes a crash course in Scala to get you up to speed quickly. For those more familiar with Python, however, a Python version of this class is also available: "Taming Big Data with Apache Spark and Python - Hands On".

In this course, you'll learn and master the art of framing data analysis problems as Spark problems through over 20 hands-on examples, and then scale them up to run on cloud computing services.

  • Learn the concepts of Spark's Resilient Distributed Datasets

  • Get a crash course in the Scala programming language

  • Develop and run Spark jobs quickly using Scala (see the minimal sketch after this list)

  • Translate complex analysis problems into iterative or multi-stage Spark scripts

  • Scale up to larger data sets using Amazon's Elastic MapReduce service

  • Understand how Hadoop YARN distributes Spark across computing clusters

  • Practice using other Spark technologies, like Spark SQL, DataFrames, DataSets, Spark Streaming, and GraphX
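
To give a feel for what those Spark jobs look like, here is a minimal sketch in the spirit of the Ratings Histogram example from lesson 9 (the object name and file path are illustrative, and assume the MovieLens data layout used in the lessons):

    // Count how many times each star rating appears in the MovieLens data.
    import org.apache.spark._

    object RatingsCounter {
      def main(args: Array[String]): Unit = {
        // Run locally, using every core of this machine.
        val sc = new SparkContext("local[*]", "RatingsCounter")

        // Each line of u.data is: userID <tab> movieID <tab> rating <tab> timestamp
        val lines = sc.textFile("data/ml-100k/u.data")

        // Extract just the rating field (the third column).
        val ratings = lines.map(_.split("\t")(2))

        // Count occurrences of each rating and print them in ascending order.
        ratings.countByValue().toSeq.sortBy(_._1).foreach(println)

        sc.stop()
      }
    }

A handful of lines like this is enough to fan a counting job out across every core of your machine, or, with a different master setting, across an entire cluster.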

By the end of this course, you'll be running code that analyzes gigabytes' worth of information – in the cloud – in a matter of minutes.

We'll have some fun along the way. You'll get warmed up with some simple examples of using Spark to analyze movie ratings data and text in a book. Once you've got the basics under your belt, we'll move to some more complex and interesting tasks. We'll use a million movie ratings to find movies that are similar to each other, and you might even discover some new movies you like in the process! We'll analyze a social graph of superheroes, and learn who the most "popular" superhero is – and develop a system to find "degrees of separation" between superheroes. Are all Marvel superheroes within a few degrees of being connected to Spider-Man? You'll find the answer.

This course is very hands-on; you'll spend most of your time following along with the instructor as we write, analyze, and run real code together – both on your own system, and in the cloud using Amazon's Elastic MapReduce service. 7.5 hours of video content is included, with over 20 real examples of increasing complexity you can build, run and study yourself. Move through them at your own pace, on your own schedule. The course wraps up with an overview of other Spark-based technologies, including Spark SQL, Spark Streaming, and GraphX.

Enroll now, and enjoy the course!

"I studied Spark for the first time using Frank's course "Apache Spark 2 with Scala - Hands On with Big Data!". It was a great starting point for me,  gaining knowledge in Scala and most importantly practical examples of Spark applications. It gave me an understanding of all the relevant Spark core concepts,  RDDs, Dataframes & Datasets, Spark Streaming, AWS EMR. Within a few months of completion, I used the knowledge gained from the course to propose in my current company to  work primarily on Spark applications. Since then I have continued to work with Spark. I would highly recommend any of Franks courses as he simplifies concepts well and his teaching manner is easy to follow and continue with!  " - Joey Faherty

Meet Your Teacher


Frank Kane

Machine Learning & Big Data, ex-Amazon


Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.


Level: Beginner



Transcripts

1. Introduction: I spent over nine years at Amazon.com and IMDb.com making sense of their massive datasets, and I want to teach you about the most powerful technology I know for wrangling big data in the cloud today. That's Apache Spark, using the Scala programming language. Spark can run on a Hadoop cluster to spread massive data analysis and machine learning tasks out across the cloud, and knowing how to do that is a very hot skill to have right now. We'll start off with a crash course in the Scala programming language. Don't worry, it's pretty easy to pick up as long as you've done some programming or scripting before. We'll start with some simple examples, but work our way up to more complicated and interesting examples using real, massive datasets. By the end of this course, you will have gone hands-on with over 15 real examples, and you'll be comfortable writing, debugging, and running your own Spark applications using Scala. Some of them are pretty fun. We'll look at a social network of superheroes and use that data to figure out who is the Kevin Bacon of the superhero universe. We'll also look at a million movie ratings from real people and actually construct a real movie recommendation engine that runs on a Spark cluster in the cloud, using Amazon's Elastic MapReduce service. We'll also do some big machine learning tasks using Spark's MLlib library, and some graph analysis using Spark's GraphX library. So give it a try with me. I think you'll be surprised at how just a few lines of code can kick off a massive, complex data analysis job on a cluster using Spark. So let's get started. The first thing we need to do is install the software we need, so let's get that out of the way right now.

2. Installing the Course Materials: So let's get everything set up, including Java and IntelliJ and all the course materials that we need for the entire course. What we're gonna do is start by going to our own website, which will direct you to the course materials where you can download all the project files and the data that you need for this course. We'll go ahead and get that installed on your system. Then we'll install a Java Development Kit if you don't already have one; we just need to make sure that we have a JDK between versions 8 and 14 installed on your system. Odds are, if you're a developer, you already do. After that, we'll install the IntelliJ IDEA Community Edition. It's a free development environment that we can use for Scala and for Spark, and the beauty of it is that it also integrates something called sbt, which is gonna handle all the dirty work of actually installing Apache Spark for us. On Windows we do have one extra step: we need to kind of fake out Windows into thinking that it's running Hadoop, and I'll show you how to do that. It's not too hard. And finally, we'll set up our project in IntelliJ and run a simple little "hello world" program in Apache Spark to make sure it's all working. Let's dive in, and I'll walk you through all of it.

So let's start by getting everything set up that you need for this course. Head on over to media.sundog-soft.com/SparkScala.html (pay attention to capitalization; every little letter counts) and you should reach this page here that contains everything you need to get going. Most importantly, let's install the course materials: all the scripts that you need to actually get through this course hands-on.
Go ahead and click on this link here to media.sundog-soft.com/SparkScala/SparkScalaCourse.zip. If you are typing that in by hand for some reason, be sure to pay attention to capitalization. Once that downloads, we'll go ahead and unzip it. On Windows I can just right-click it and say Extract All; on a Mac or Linux, of course, you would just go to a terminal prompt and use the unzip command. What we should get is a SparkScalaCourse folder within a SparkScalaCourse folder. That's correct; that's what we want. Within that second level of folder are all the materials themselves, the whole project for this course. So first of all, let's move this someplace where we're not going to lose it. I'm taking that top-level SparkScalaCourse folder and moving it someplace safe; let's put it on my C drive. All right, so now in my C drive I have a SparkScalaCourse folder, and within that is another SparkScalaCourse folder. On Mac or Linux, of course, you would not have a C directory; you would put it in your home directory, just someplace where you're not going to lose it.

All right, so next we need to get some test data. Unfortunately, the license terms of the data that I like to use don't let me include it myself, so you're gonna have to go and download it yourself. That's the MovieLens dataset: a set of 100,000 movie ratings that we're going to play with throughout this course. You can use this handy-dandy link to get it: files.grouplens.org/datasets/movielens/ml-100k.zip. Go ahead and download that, and if for some reason the grouplens.org website is down (that happens from time to time) you can usually find the ml-100k file on Kaggle if you need to. Let's go ahead and decompress that as well: right-click, Extract All; again, just use the unzip command on Mac or Linux. The resulting ml-100k folder should contain this stuff. We're gonna take that folder and copy it, and I'm going to go back to the course materials folder that I just created, which for me was C:\SparkScalaCourse. In the other SparkScalaCourse directory under that, there's a data folder. Go into the data folder, and that's where I want to put my ml-100k folder. All right, so this is how things should look at this point, whatever operating system you're on: you want a SparkScalaCourse folder; within that there should be another SparkScalaCourse folder; within that should be a data folder; within that should be an ml-100k folder; and within that should be all of this test data. Okay, so make sure that all looks right, or else you're going to run into weird problems later and you're not going to know what's going on.
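
To recap the layout just described (folder names come from the walkthrough above; on Mac or Linux, substitute your home directory for the C drive):

    C:\SparkScalaCourse\           the top-level folder you moved somewhere safe
        SparkScalaCourse\          the project materials themselves
            data\
                ml-100k\           the unzipped MovieLens test data goes here
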
Once you're sure that's fine, let's go back to our instructions here. The next step is to install IntelliJ IDEA, which is going to be our IDE for developing in this course. Now, I used to tell people to install Eclipse and the Scala IDE, but it seems like IntelliJ is winning the battle against Eclipse, so I'm gonna have you install IntelliJ now instead. In order to run Scala code, you first need a JDK, and anything between versions 8 and 14 will do for this course. If you do need to get a JDK, there's a handy-dandy link here to do it: just head on over to oracle.com/java and get the JDK 14 download for your operating system. For me, that's going to be Windows 64. You will need to accept their terms and wait for that to download. Looks like there's a little security warning here; it's fine, I trust it. Once that comes down, we'll go ahead and install it. Obviously, on Linux or Mac you'll probably use an alternative means of getting Java; in fact, you probably already have Java installed if you're on Linux or Mac, so this step is probably Windows-specific. We'll go ahead and go through the installer here. One thing on Windows: when you run Linux code like Apache Spark on Windows, sometimes it gets confused when you have spaces in your path, so that space between "Program" and "Files" could actually be a problem. Let's go ahead and change that just to be safe; I'm going to install this instead into a C:\jdk14 directory. We'll let that do its thing; it shouldn't take too long. And we're done. All right, we're done with that site; back to our instructions.

So now we can install the IntelliJ IDEA Community Edition, which is going to be our actual IDE. Let's go ahead and click on that link. We want the Community Edition, the free, open-source one; we don't need the Ultimate one. Go ahead and download that for whatever your operating system is; you'll see it's offered for Windows, Mac, and Linux. This is about half a gig, so it will take a few seconds to come down; come back when that's done. All right, the installer downloaded, so let's go ahead and kick it off. It's a pretty standard installer; let's just walk through it. If you do want desktop shortcuts or file associations, you can add those; totally up to you. I'm just going to leave these as is. And that's fine too. It takes about a minute to install, so I'll just come back when that's done. All right, good. Let's go ahead and hit the Run button to actually launch it. I don't want to import any existing settings. And it's personal preference whether you like a dark theme or a light theme; I like a light theme, so I'll select that, but do whatever you want. Now we're going to install plugins. Maybe the only plugin that we really need is the Scala plugin, and unfortunately I'm not seeing it offered here, so I'm just going to move on. I'm not seeing it offered here either. If you do see the Scala plugin though, go ahead and take the opportunity to install it. I did not, though, so I have to do it the hard way, which isn't that hard: I'm just going to click on the Configure button here on the welcome screen and select Plugins, and from there I can find the Scala plugin. Let's go ahead and install that. Fine. All right, and we can restart the IDE to pick that up. Now, one more thing: we need to tell it which JDK to use. Go back to the Configure menu here and go to Structure for New Projects, and if you need to select a JDK here, select the one that we just installed (that's going to be 14 for us) and hit OK.
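
As an aside, this is roughly how the sbt integration mentioned earlier spares you a manual Spark installation: the project's build file declares Spark as a dependency, and sbt downloads everything for you. Here is a minimal sketch of such a build.sbt (the version numbers are illustrative; use whatever the course materials specify):

    name := "SparkScalaCourse"
    version := "0.1"
    scalaVersion := "2.12.12"

    // sbt fetches these from Maven Central; no manual Spark install required.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "3.0.0",
      "org.apache.spark" %% "spark-sql"  % "3.0.0"
    )
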
Now, there is one more step that we need to do that's only for Windows; if you're on Mac or Linux, you can ignore this next step. We need to sort of trick Windows into thinking that Hadoop is running on it, and doing that is a little bit clunky. The instructions are on your course materials page under the Windows-only section; just follow them. Go ahead and create a C:\hadoop\bin directory: I'm gonna go to my C drive, create a new folder called hadoop, and within that hadoop folder, create another folder called bin.

Now I'm gonna go back to my course materials, which are under C:\SparkScalaCourse, and you'll see a winutils.exe file there. I'm gonna go ahead and copy that and paste it into C:\hadoop\bin. Next, I need to set up a couple of environment variables. The easiest way to do that is to just go to your W
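
For reference, the environment-variable setup being described here is the standard winutils configuration; this is a sketch of the typical values, and the course materials page has the authoritative instructions:

    HADOOP_HOME = c:\hadoop                    new system environment variable
    PATH        = %PATH%;%HADOOP_HOME%\bin     append winutils' bin folder to PATH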