# Apache Spark 3 with Scala: Hands On with Big Data!

#### Frank Kane, Founder of Sundog Education, ex-Amazon

53 Lessons (7h 24m)
• 1. Introduction, and Getting Set Up

16:19
• 2. [Activity] Create a Histogram of Real Movie Ratings with Spark!

14:39

0:16
• 4. [Activity] Scala Basics, Part 1

12:52
• 5. [Exercise] Scala Basics, Part 2

9:41
• 6. [Exercise] Flow Control in Scala

7:18
• 7. [Exercise] Functions in Scala

8:47
• 8. [Exercise] Data Structures in Scala

16:38
• 9. Introduction to Spark

8:40
• 10. The Resilient Distributed Dataset

11:04
• 11. Ratings Histogram Walkthrough

7:33
• 12. Spark Internals

4:42
• 13. Key / Value RDD's, and the Average Friends by Age example

12:21
• 14. [Activity] Running the Average Friends by Age Example

7:58
• 15. Filtering RDD's, and the Minimum Temperature by Location Example

6:43
• 16. [Activity] Running the Minimum Temperature Example, and Modifying it for Maximum

10:10
• 17. [Activity] Counting Word Occurrences using Flatmap()

8:59
• 18. [Activity] Improving the Word Count Script with Regular Expressions

6:41
• 19. [Activity] Sorting the Word Count Results

8:10
• 20. [Exercise] Find the Total Amount Spent by Customer

3:37
• 21. [Exercise] Check your Results, and Sort Them by Total Amount Spent

4:26
• 22. Check Your Results and Implementation Against Mine

3:26
• 23. [Activity] Find the Most Popular Movie

4:29
• 24. [Activity] Use Broadcast Variables to Display Movie Names

8:52
• 25. [Activity] Find the Most Popular Superhero in a Social Graph

14:10
• 26. Superhero Degrees of Separation: Introducing Breadth-First Search

6:52
• 27. Superhero Degrees of Separation: Accumulators, and Implementing BFS in Spark

5:53
• 28. Superhero Degrees of Separation: Review the code, and run it!

10:41
• 29. Item-Based Collaborative Filtering in Spark, cache(), and persist()

8:16
• 30. [Activity] Running the Similar Movies Script using Spark's Cluster Manager

14:13
• 31. [Exercise] Improve the Quality of Similar Movies

2:41
• 32. [Activity] Using spark-submit to run Spark driver scripts

6:58
• 33. [Activity] Packaging driver scripts with SBT

13:14
• 34. Introducing Amazon Elastic MapReduce

7:11
• 35. Creating Similar Movies from One Million Ratings on EMR

11:33
• 36. Partitioning

5:07
• 37. Best Practices for Running on a Cluster

5:31
• 38. Troubleshooting, and Managing Dependencies

9:08
• 39. Introduction to SparkSQL

7:08
• 40. [Activity] Using SparkSQL

7:00
• 41. [Activity] Using DataFrames and DataSets

6:38
• 42. [Activity] Using DataSets instead of RDD's

7:23
• 43. Introducing MLLib

9:18
• 44. [Activity] Using MLLib to Produce Movie Recommendations

14:35
• 45. [Activity] Linear Regression with MLLib

5:55
• 46. [Activity] Using DataFrames with MLLib

8:30
• 47. Spark Streaming Overview

9:53
• 48. [Activity] Set up a Twitter Developer Account, and Stream Tweets

12:44
• 49. Structured Streaming

4:17
• 50. GraphX, Pregel, and Breadth-First-Search with Pregel.

10:38
• 51. [Activity] Superhero Degrees of Separation using GraphX

8:59
• 52. Learning More, and Career Tips

4:15
• 53. Let's Stay in Touch

0:46

## Project Description

This class is full of many interesting hands-on activities, involving the analysis of movie ratings and connections between superheroes! But here's one more challenge you can try after completing the course:

Write a Spark script that analyzes the one-million-rating dataset from MovieLens we used in the course. Let's figure out what the worst movie ever made was!

But we don't want a movie that has only one rating, which happens to be one star, to be the "winner." Start by producing a list of the movies sorted by average rating, which isn't hard. Then bring the number of ratings into the sort as well, so that movies that have a bad rating and also a large number of ratings are the ones that show up first.
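As a starting point, here is a minimal RDD-based sketch of that first step. It assumes the ratings file uses the MovieLens 1M layout `UserID::MovieID::Rating::Timestamp`, and the path `data/ml-1m/ratings.dat` and the object name `WorstMoviesSketch` are placeholders, not names from the course. Treating "lowest average first, most ratings first among ties" as a single composite sort key is one reading of the challenge, not the only one.

```scala
import org.apache.spark.SparkContext

object WorstMoviesSketch {

  // Parse one "UserID::MovieID::Rating::Timestamp" line into (movieID, (rating, 1)).
  def parseLine(line: String): (Int, (Double, Int)) = {
    val fields = line.split("::")
    (fields(1).toInt, (fields(2).toDouble, 1))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "WorstMoviesSketch")

    // ASSUMPTION: adjust this path to wherever your copy of the dataset lives.
    val lines = sc.textFile("data/ml-1m/ratings.dat")

    // Sum the ratings and count them per movie in a single pass.
    val totals = lines
      .map(parseLine)
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))

    // Turn (sum, count) into (average, count).
    val stats = totals.mapValues { case (sum, count) => (sum / count, count) }

    // Composite key: lowest average first, then most ratings first.
    val sorted = stats.sortBy { case (_, (avg, count)) => (avg, -count) }

    sorted.take(20).foreach(println)
    sc.stop()
  }
}
```

Each printed row has the shape `(movieID, (averageRating, numberOfRatings))`. Mapping the IDs to titles is left out here to keep the sketch short.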

You'll probably still need to scroll past a lot of spurious 1-star results, however - so next, implement a filter that removes any movies that have fewer than, say, 10 ratings. That should filter out obscure films that we just don't have enough data for. 10 is an arbitrary cutoff; you may find yourself playing with that number.
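The cutoff can be sketched with the DataFrame API from the SparkSQL lessons, where a minimum-count filter is a single line. This is only one possible approach. The file path, the column names (`movieID`, `avgRating`, `numRatings`), and the `MinRatings` constant are assumptions made for illustration, and the input is assumed to follow the MovieLens 1M `UserID::MovieID::Rating::Timestamp` layout.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, count}

object WorstMoviesFiltered {

  // Arbitrary cutoff, as the text says; tune it and see how the list changes.
  val MinRatings = 10

  // Parse one "UserID::MovieID::Rating::Timestamp" line into (movieID, rating).
  def parseRating(line: String): (Int, Double) = {
    val fields = line.split("::")
    (fields(1).toInt, fields(2).toDouble)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("WorstMoviesFiltered")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // ASSUMPTION: adjust this path to wherever your copy of the dataset lives.
    val ratings = spark.read
      .textFile("data/ml-1m/ratings.dat")
      .map(line => parseRating(line))
      .toDF("movieID", "rating")

    // Average rating and number of ratings per movie.
    val stats = ratings
      .groupBy("movieID")
      .agg(avg("rating").as("avgRating"), count("rating").as("numRatings"))

    // Drop movies with fewer than MinRatings ratings, then list the worst first.
    val worst = stats
      .filter(col("numRatings") >= MinRatings)
      .orderBy(col("avgRating").asc, col("numRatings").desc)

    worst.show(20)
    spark.stop()
  }
}
```

Keeping the cutoff in a named constant makes it easy to experiment with different values.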

You'll also face the challenge of the output being split up across the various cores that are processing this data. You can try just using "local" instead of "local[*]" to get around that, but it would be even better to devise a way to merge the results together - either with a script, or by keeping track of a global "winner" in a shared variable. (Broadcast variables are read-only on the executors, so an accumulator or a reduction back to the driver is the more natural fit for this.)
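One way to end up with a single global "winner" regardless of how many cores are used is to reduce the per-movie statistics down to one record on the driver. The sketch below does that with `reduce` instead of a shared variable, which is a substitution on my part and not the approach the text names. The `MovieStats` alias, the `worse` helper, and the three sample records are hypothetical stand-ins for the output of your real pipeline.

```scala
import org.apache.spark.SparkContext

object SingleWinnerSketch {

  // (movieID, (averageRating, numberOfRatings))
  type MovieStats = (Int, (Double, Int))

  // Return the "worse" of two movies: lower average wins, then more ratings,
  // then lower movie ID so the choice does not depend on evaluation order.
  def worse(a: MovieStats, b: MovieStats): MovieStats = {
    val (idA, (avgA, countA)) = a
    val (idB, (avgB, countB)) = b
    if (avgA != avgB) { if (avgA < avgB) a else b }
    else if (countA != countB) { if (countA > countB) a else b }
    else if (idA <= idB) a else b
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "SingleWinnerSketch")

    // MADE-UP sample values for illustration only; replace with your real stats RDD.
    val stats = sc.parallelize(
      Seq[MovieStats]((1, (1.5, 40)), (2, (1.2, 12)), (3, (1.2, 300))),
      numSlices = 3
    )

    // reduce combines results from every partition into one record on the driver.
    val winner = stats.reduce(worse)
    println(winner)

    sc.stop()
  }
}
```

If you would rather keep the full sorted list in one output file, calling `coalesce(1)` on the sorted RDD before `saveAsTextFile` is another option, at the cost of funnelling the final write through a single partition.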

What looks to be the worst movie ever?