2021 Edition - PySpark - Python with Spark for absolute beginners | Engineering Tech | Skillshare

2021 Edition - PySpark - Python with Spark for absolute beginners

Engineering Tech, Big Data, Cloud and AI Solution Architect

Lessons in This Class

14 Lessons (53m)
    • 1. Introduction

    • 2. What is Spark?

    • 3. Running Python Spark 3 on Google Colab

    • 4. Spark for data transformation

    • 5. DataFrame concepts

    • 6. RDD concepts

    • 7. Python basics

    • 8. PySpark - Creating RDDs

    • 9. Python functions and lambda expressions

    • 10. RDD - Transformation & Action

    • 11. PySpark - SparkSQL and DataFrame

    • 12. UDF

    • 13. Joins

    • 14. Bonus - PySpark Hadoop Hive development environment using PyCharm and Winutils

About This Class

Learn Python with Spark and practice in the Google Colab environment without setup hassles. You will learn the following in this course:

1. Python Basics

2. Spark RDD, DataFrame

3. Advanced Spark concepts like UDF and joins

4. PySpark Hadoop Hive development environment setup using PyCharm and Winutils

Meet Your Teacher

Engineering Tech

Big Data, Cloud and AI Solution Architect


Hello, I'm Engineering.


1. Introduction: Welcome to the PySpark (Python with Spark) course for absolute beginners. You'll be learning Python and Spark from scratch, and we do not expect you to have any prior knowledge of either technology. You'll learn Python basics, Spark RDDs (Resilient Distributed Datasets), and the newer DataFrame API. We'll also cover some advanced concepts like UDFs and joins. This is a completely hands-on course, and you'll apply everything you learn to solve a real-world use case: bank prospect marketing. A bank has received prospect data from different sources, and it needs your help in cleansing the data and doing some transformations. You will apply your learning to help the bank target the right customers. We are going to use Google Colab for development, so you can be up and running in minutes without worrying about any environment hassles. So let's dive in and learn Python and Spark from scratch. Thank you.

2. What is Spark?: Let's now dive into Spark. Apache Spark is one of the most popular technologies for large-scale data processing. The core concepts of big data that we have looked at until now are distributed storage and distributed computing. We have seen how to store data in a distributed manner using the Hadoop Distributed File System (HDFS), and how to do parallel processing using MapReduce. We did not write MapReduce programs, but we used Hive queries, which in turn use MapReduce to process data stored on HDFS. MapReduce stores intermediate results on disk; Spark improved on this by storing intermediate results in memory. Spark can also write to disk if it runs out of memory. With Spark, you can do large-scale data processing at a much faster pace. Spark can do not just batch processing; it can also handle real-time streaming data. Additionally, with resource management responsibility moving to a new component, YARN, Hadoop allowed non-MapReduce jobs to work with HDFS data.
Spark soon became a great alternative to MapReduce for parallel processing of distributed data stored in HDFS. Instead of writing mappers and reducers, you write Spark computational tasks, and YARN runs those Spark tasks as jobs on the Hadoop cluster. The main reason for Spark's popularity is the abstraction it provides for distributed processing. Data processing can happen on multiple nodes of a Spark cluster. Spark has its own cluster manager, or it can use the Mesos cluster manager or run on the Hadoop platform. Spark on Hadoop is a popular architecture in the big data ecosystem: Hadoop solves the problem of distributed storage, and Spark does large-scale data computation in a distributed manner. Spark can work on Hadoop. It can also work with relational databases and NoSQL databases. It can run on different cloud storage systems like AWS S3 and Azure Blob. It can also run on a simple file system. This is the definition you'll find on the Spark website: it is a unified analytics engine for large-scale data processing. You can read a dataset using Spark, do all kinds of filtering and aggregation, and then create a new dataset that can be consumed by other applications. And with Spark storing intermediate results in memory, it can do parallel processing at a much faster pace than MapReduce. This diagram depicts the various components of Spark. The Spark core engine interacts with the cluster manager. For end-user programming, Spark has libraries like Spark SQL or Spark DataFrame for data transformation. Spark also has libraries for machine learning, graph processing, and real-time data processing. Spark is written in Scala. For end-user programming, it supports Python, Java, and Scala. Scala is the language of choice for big data and Spark. If other applications, like machine learning, are written in Python and you want to integrate those applications with Spark, then it's better to use Python Spark, or PySpark, for your computational work.
You can also use Java as a programming language for Spark, but that is less common in the real world. People from a Java background can easily learn Scala and get started with Spark programming using Scala, which has all the latest libraries. You do not need to learn everything there is in Scala to be a Spark Scala programmer, which you'll see in this course. Let's understand how Spark works. All Spark programs are managed by a driver program, which runs on the master node. In a multi-node cluster environment, the driver program runs on the master node, and it creates several workers that run on the worker nodes, or DataNodes. The driver program creates a SparkSession, which is the entry point to any Spark application. Once you have the SparkSession, you will have access to all Spark libraries and functions.

3. Running Python Spark 3 on Google Colab: Let's look at how to install Spark on Google Colab. Google Colab is a free online environment for Python programming. You can do Python Spark programming on Google Colab by installing Spark and setting the Java home (JAVA_HOME). Spark is written in the Scala programming language, and Scala requires a Java JDK in the environment. Let's now look at the Google Colab environment and see how we can get started with Python Spark programming. Go to colab.research.google.com. You can log in to this website using any Gmail ID. Let's cancel this. From the home page, you can click on File, New Notebook. At the time of this recording, Colab supports Python 3 notebooks only. Colab is like a Jupyter Notebook environment with some visual customization. You can simply type code and see the output by hitting Shift+Enter, and you can continue your programming in subsequent cells. You can also click this arrow icon to see the output. A cell can be moved up or down using these arrows, and deleted as well. You can also add a text cell to put in some comments.
In Colab, most of the Python libraries are pre-installed, but if you notice something is missing, you can install it using the pip install command. pip install is the Python command to install any library. You type an exclamation mark, pip install, and then whatever library you want to install. Colab is a Linux environment. To type a Linux command in a Colab notebook, you have to prefix it with an exclamation mark. You can type !ls and see the list of files available under this directory. By default, Colab provides you a sample_data directory under which you can store your data. You can also store data files directly under your user directory. You can click on this Files link here to upload a file from the local environment, or you can do a wget and pull a file from a GitHub repository. Let's try to pull retailstore.csv from a GitHub repository. We'll click on the Raw tab and copy the link. Then let's do wget and the file URL. The file has been copied. You can verify this by typing the ls command. Let's now see how to do Spark programming using Colab. First, we'll go to spark.apache.org to copy the Spark download link. You can click on Download and select the latest version. If you want any of the older versions, you can click on the archive link at the bottom of the page. Let's pull the latest version, 3.1.1, which is built for Hadoop 2.7. Click on this Download Spark link. It will take you to another page, and from there you can copy the download link. Let's copy this link. We'll copy the file to the Colab environment using wget. Let's do a listing now. The file has been copied. We can unzip it using the tar command, specifying the file name. This will create a directory for Spark. Let's check it out. The spark-3.1.1-bin-hadoop2.7 directory has been created. Let's list it. We can see bin, some example files, and other Spark files.
Spark is built using Scala, and Scala requires a JDK. Spark 3.1.1 is compatible with Java 11, and Java 11 is already available in the Colab environment. We can verify this by going to the /usr/lib/jvm directory. Here we can see that the Java 11 OpenJDK is available. We need to ensure that JAVA_HOME is set in our environment variables. In Python, you need to import the os (operating system) library. After that, you set JAVA_HOME by pointing it to the Java JDK directory, and then set SPARK_HOME by pointing it to the newly created Spark directory. After setting these environment variables, you need to install findspark. Let's do that using the pip install command. findspark is a Python library for finding Spark in any environment. If you're completely new to Python, don't worry about these commands right now. Simply execute them so that you have the environment ready to get started with Spark programming. Every time you want to do Spark programming in Colab, you just need to execute these commands again, and you should be good to go. Now let's import findspark. Then we'll initialize findspark. After that, from pyspark.sql we need to import SparkSession. Once we have initialized findspark, the PySpark libraries will be available in Colab. Then we import SparkSession. Next, we need to create a SparkSession, which is the entry point to any Spark application. The way to create a SparkSession is with the command SparkSession.builder.master. You then specify how many cores or CPUs you want; we'll select star, which uses all available cores or CPUs in the Colab environment. There are several optional parameters as well, which we'll look at later. Just run this command and you will have the SparkSession. In the Dataproc console, the SparkSession was already available, but here in Colab we had to execute these steps to reach the point from which we can start our Spark programming.
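The setup steps described above can be sketched roughly as follows. This is a minimal sketch, not the instructor's exact notebook: the two directory paths, the helper names `configure_spark_env` and `create_session`, and the application name are assumptions for illustration, and findspark is assumed to have been installed with pip beforehand.

```python
import os


def configure_spark_env(java_home, spark_home):
    """Point JAVA_HOME and SPARK_HOME at the given directories."""
    os.environ["JAVA_HOME"] = java_home
    os.environ["SPARK_HOME"] = spark_home
    return os.environ["JAVA_HOME"], os.environ["SPARK_HOME"]


def create_session(app_name="my-spark-program-1"):
    """Create a local SparkSession that uses all available cores."""
    # Imported inside the function so the environment variables are set
    # first, and so this cell still loads where Spark is not installed.
    import findspark
    from pyspark.sql import SparkSession

    findspark.init()  # locates Spark through SPARK_HOME
    return (
        SparkSession.builder
        .master("local[*]")  # "*" means all available cores
        .appName(app_name)
        .getOrCreate()
    )


# Assumed Colab locations; adjust them to match your own directory listing.
JAVA_HOME = "/usr/lib/jvm/java-11-openjdk-amd64"
SPARK_HOME = "/content/spark-3.1.1-bin-hadoop2.7"
configure_spark_env(JAVA_HOME, SPARK_HOME)
```

Once Spark has been downloaded and unpacked, calling `spark = create_session()` in the next cell should return the session used in the rest of the lessons.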
To do Python Spark programming, you really don't need to set up a cluster. You can do it in Google Colab. Once you are comfortable programming in a notebook environment like Colab, you can take your code and run it in a cluster environment, which we'll see later in the course. Let's give this file a name, my Spark program one. In Colab, you can download a file by clicking on File, Download. You can either download it in the IPython Notebook format (.ipynb), IPython Notebook being the older name for Jupyter Notebook, or you can download it as .py. If you want to take this program to another Python environment or to the cluster, you would download it as .py. The .ipynb notebook will only work where you have a Jupyter Notebook environment. Now that we have a SparkSession, let's create a sample DataFrame. We already have a CSV file in the Colab environment, which we pulled from GitHub. Let's create a Spark DataFrame from the CSV file using the same command we tried on Dataproc. We'll do spark.read.option, specify that header is true, and then specify the file name. Again, we'll be learning all these concepts from scratch: what a DataFrame is, how to create a DataFrame, and how to do parallel processing. For now, just to get a flavor of how Spark works, try out this command, and you will see that a DataFrame is created from the CSV file and displayed here. Here we are reading a simple file, but this command would also work for a large file in a big data world. You can also create a DataFrame using the spark.createDataFrame command. You specify some values as key-value pairs, which again we'll learn about later in the course. For now, just specify some values in the format shown here and execute this. This will create a DataFrame from the key-value pairs captured within the curly braces. So this is how we can get started with Spark programming on Colab. If you're new to Python Spark, Colab is a great tool to get started.
You do not need a cluster to do Spark programming.

4. Spark for data transformation: Until now, we've seen how to run Spark on a Hadoop cluster. We have also seen how to read a CSV file from the file system and create a DataFrame. With Spark, you can read data from many different data sources, like Hive tables or CSV files, and create a DataFrame. Once you have a DataFrame, you can transform it to create another DataFrame and then do a series of transformations. You can also read intermediate data from different data stores, blend it, and then create a final dataset that can be persisted to another data store. This is how Spark is primarily used in the real world: for large-scale data transformation. Transformation involves filtering data, aggregating and sorting with operations like groupBy and orderBy, deriving new columns from existing columns, replacing null values, handling missing records, and various other things. You would use Spark as your main transformation engine to process high volumes of data. You can do the same operations using traditional technologies, but when the data volume is really high and you want to process it at a much faster pace, you would choose Spark as the technology. Next, we'll dive into DataFrames and understand what they are under the hood and how they do parallel processing.

5. DataFrame concepts: DataFrames are the main way you work with Spark 2.0 or higher versions. You read data from different sources, like DB2, Oracle, a CSV file, or a relational database, and then store it in a DataFrame to do different kinds of transformations. DataFrames are created using SparkSession, which is the entry point to any Spark application and is the main way you work with Spark 2.0 or any higher version. A DataFrame is like an in-memory Excel sheet. It holds the data in row-and-column format and provides various methods to analyze the data and apply transformations to it. In the previous lab, we created a DataFrame from the retailstore.csv file.
So in this DataFrame there are several rows and columns. Each row in the DataFrame represents one record, and each row has multiple columns, which are different attributes of the data. Once the data is loaded into a DataFrame, you can do all kinds of transformations, like filtering, aggregation, and replacing null values. You can drop a column, add a new column, or derive another column from existing columns. You would use a Spark DataFrame to do these kinds of transformations, but on a very high volume of data. Now, how does the processing of a large volume of data happen in a distributed manner behind the scenes? For that, we need to understand something called the RDD, or Resilient Distributed Dataset, which is the core building block of a Spark DataFrame.

6. RDD concepts: The RDD, or Resilient Distributed Dataset, is the fundamental building block of Spark. If you're working with Spark 2 or Spark 3, you might be using DataFrames or Datasets. But under the hood, DataFrames and Datasets are stored as RDDs, which are still the fundamental building block of Spark. RDDs are nothing but data collections stored in a distributed manner across a cluster of computers. The DataFrame that we created earlier is stored as Resilient Distributed Datasets under the hood. So how does it work behind the scenes? Data is split into multiple datasets and stored in a distributed manner across the cluster of computers. Spark stores these distributed datasets in the memory of individual worker nodes and does the processing in memory, which makes it really fast compared to MapReduce. Since the data is stored in a distributed manner, the dataset is called a Resilient Distributed Dataset. If a few nodes go down, Spark can still do the processing, because lost partitions can be reconstructed from the data on the remaining nodes and from the recorded sequence of transformations (the lineage). That's why it's called resilient. RDDs are immutable.
And so are DataFrames, which are built on top of RDDs. You cannot modify an RDD, but you can create an RDD, apply some transformations, and create another RDD. RDDs can be created using SparkContext, which was the entry point of a Spark application in the 1.x versions. In Spark 2.x or higher, SparkSession is the entry point, and SparkContext is encapsulated within the SparkSession. But if you have to work with RDDs, you need to get a SparkContext and then get going with RDD programming.

7. Python basics: Before we dive into PySpark programming using RDDs or DataFrames, let's learn some Python basics. You don't need to learn everything that is out there in Python to get started with PySpark programming. You need to understand basic data types, how to write loops, and how to read data from a file and write to a file. Those concepts should be sufficient for you to get started with Spark. Later in the course, you'll understand how to do more real-world programming using Python, like logging, error handling, configuration management, and unit testing. For now, we'll look at basic Python concepts and then move to PySpark programming. Let's go to Google Colab and create a new notebook. Python supports all kinds of arithmetic operations, like addition, subtraction, and multiplication. In Python, you don't need to declare variable types; Python is dynamically typed. You can simply say a = 3, then b = 4, and then a + b. In a notebook or REPL environment, you will see the output when you write an expression like this. But in a Python program, to display something you have to use the print function. You can print a + b, and it will get printed. Similarly, you can do other arithmetic operations, like multiplication or division, and see the output. Python supports both single quotes and double quotes for strings. Let's write another one within single quotes.
You can have a single quote within double quotes, or double quotes within single quotes, and Python will not complain. To concatenate two strings, we can simply add the two strings. It has to be two. And we can see the output here. Like any programming language, Python has a lot of built-in methods for strings and other data types. For example, we can convert this to uppercase, and it gets printed in uppercase. A list is a collection of elements. You declare the collection of elements within square brackets. This is similar to arrays in other programming languages. Now we can declare a list. A list can have one data type, or it can have multiple data types. To grab an element from a list, you need to specify the index number. The index number starts at 0 and is incremented by one. Let's declare another list. This list has four elements; the first is at index 0, the second is at index 1, and so on. From the end, the index number starts at -1 and decreases by one. To grab 7.3, you can use index 3 or index -1. If we use -2, it will print abc. You can also grab abc by specifying index number 2. In Python, you can write a conditional by writing if, then the condition, and then a colon. Then you can write whatever code you want, for example a print, within the if block. The block ends where the indentation ends. If we write another print outside the block, not indented, it is considered outside the if block and gets printed regardless. Let's change this to three. Both lines are getting printed. You can have multiple conditions with elif and else. Let's add another elif, and finally we'll have an else block. Whichever condition is satisfied, that particular block gets executed. Let's change this to 16. This one is true, so this block gets executed. Let's see an example of a for loop: for i in range(10), print(i). range(10) gives us 10 elements, starting from 0 up to 9.
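The list indexing, conditional, and loop ideas above can be tried out with a short sketch like this one. The list values and the `describe` helper are illustrative assumptions, not the exact values typed in the recording.

```python
# A list can mix data types; these values are illustrative.
items = [10, 4, "abc", 7.3]

first = items[0]          # indexes count up from 0
last = items[-1]          # negative indexes count back from the end
second_last = items[-2]


def describe(n):
    """Return a label for n using if / elif / else."""
    if n < 5:
        return "small"
    elif n < 15:
        return "medium"
    else:
        return "large"


# range(10) yields the ten integers 0 through 9.
numbers = []
for i in range(10):
    numbers.append(i)

print(first, last, second_last)
print(describe(3), describe(16))
print(numbers)
```

Only the indented lines belong to each `if`, `elif`, `else`, or `for` block; the first line that returns to the outer indentation level is outside the block.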
To write to a file in Python, you can use this syntax: with open, the file name, and the mode as w, which is for writing. You can write some content. Let's see if the file got generated. We can see that the file got generated, and we can view it in Colab using the cat command. Or you can click here to see the file and download it. You can append to this file by using append mode. Now let's view the file again. To overwrite, you can use w mode again. Finally, to read a file, you can use read mode and read the content.

8. PySpark - Creating RDDs: Let's see the practical side of RDDs. Most likely you will not be working with RDDs, which are the older APIs of Spark. Still, it is always good to have some understanding of RDDs, because they are the core building block of Spark. Also, in the real world, you might encounter applications written in an older version of Spark that use RDDs. So it's always good to know RDDs, though most of the operations you intend to do in Spark can be performed using DataFrames and Spark SQL, which are the newer APIs of Spark. An RDD is a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. These partitions of data are stored in the memory of individual worker nodes, not on disk but in memory, and that's why Spark is fast at processing data in parallel. To create an RDD, we need a SparkContext, not a SparkSession, though the SparkContext is also encapsulated within the SparkSession. Let's first create a SparkContext and then get going with RDD programming. We'll go to Colab and create a new notebook. Let's call it PySpark RDD. First, we need to install Spark the way we did earlier. These are the same steps that we executed earlier to install Spark in a Colab environment: get the Spark installer from the Spark website, untar it, set JAVA_HOME and SPARK_HOME, then install findspark and initialize it.
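Going back to the file handling covered at the end of the Python basics lesson, the write, append, overwrite, and read modes can be sketched as below. The file name and the text written are hypothetical, and a temporary directory is used so the sketch does not touch any existing files.

```python
import os
import tempfile

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

with open(path, "w") as f:   # "w" creates the file or overwrites it
    f.write("first line\n")

with open(path, "a") as f:   # "a" appends to the end of the file
    f.write("second line\n")

with open(path, "r") as f:   # "r" opens the file for reading
    appended = f.read()

with open(path, "w") as f:   # "w" again replaces the previous content
    f.write("fresh start\n")

with open(path, "r") as f:
    overwritten = f.read()

print(appended)
print(overwritten)
```

The `with` statement closes the file automatically when the indented block ends, which is why no explicit close call appears.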
You can keep this code in the first cell of every notebook where you intend to do PySpark programming. Every time you re-execute this program, Spark will get installed, and you'll be able to proceed with your PySpark programming. If Spark is already installed, that is not an issue; it will reinstall. Let's run this. Next, we'll import SparkSession and also SparkContext, which is required for RDD creation. Next, we'll create a SparkSession. Then, from the SparkSession, we'll create the SparkContext. Now that the SparkContext is created, we can start creating RDDs. RDDs can be created from a collection of elements, or from a dataset such as a CSV file, a table, or other data sources. Let's create an RDD from a collection of elements. SparkContext has a parallelize method with which you can create an RDD from a list of elements. Let's create a list of elements. Now, to know whether this is an RDD or not: in Python, we can find the type of any object by using the type function. Simply write type and put the object within the parentheses, and you'll know the type. Now that we have an RDD, we can extract all the elements from it by calling collect. Since this is a very small RDD, it takes little time. But in the real world, when you have a huge dataset of billions of records, you need to use this carefully, because it will try to get all the elements from the RDD, which might take a lot of time. So this is how we can create an RDD from a collection. Let's create an RDD from a CSV file. We've already worked with retailstore.csv. Let's pull that file to the local environment from the GitHub repository. Now that we have the CSV locally, let's see how to create an RDD from it. We'll declare a variable, my CSV RDD. SparkContext has a textFile method with which you can create an RDD from a text file. Now let's call collect the way we did earlier. Collect will display all the records, including the header. We can check the type of this as well.
This is an RDD. There are various other RDD methods. For example, we can call first, and it will display the first row of the RDD. An RDD also has a method called take, with which you can display a set of rows. Here it displays the first three rows of the RDD. You can also loop through the RDD by declaring a for loop: for line in, then call collect, and then you can print each line. And let's print something else. So this is how we can create an RDD from a collection or from a file.

9. Python functions and lambda expressions: Before we see the practical side of the transformations and actions that can be performed on RDDs, we need to understand how functions and lambda expressions work in Python. Let's see some examples. Create a new notebook in Colab. In Python, we declare a function using the def keyword. For example, def my function, then a colon, then print something. You can then simply call the function, and the function will get executed. Let's see another example, calculate sum. It will take two parameters and return the sum. Now we can simply pass arguments and get the output. Let's print it. We can also store it in another variable and use it. Python functions can return multiple values. We'll return both the sum and the product of two variables, and store them in two different variables. You specify the output variables in the same order as in the return statement, and then call the function. It has to be a multiplied by b. Now we'll print both. You can see that the sum and the multiplication output are getting printed. So this is how we can create functions in Python. Python also has something called a lambda expression, to shorten the code. You can write something like myVar = lambda x: x + 100. This is a lambda expression. Now, to use it, you can say print(myVar(10)). So instead of writing a function to calculate the sum, using this lambda expression you can get the output in a single line.
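The function and lambda examples from this lesson can be sketched as follows. The names `calculate_sum`, `sum_and_product`, and `add_hundred` and the argument values are assumptions chosen for illustration; the lesson itself uses `myVar` for the lambda.

```python
def calculate_sum(a, b):
    """Take two parameters and return their sum."""
    return a + b


def sum_and_product(a, b):
    """Return two values at once; Python packs them into a tuple."""
    return a + b, a * b


# Unpack the results in the same order as the return statement.
total, product = sum_and_product(3, 4)

# A lambda expression is a one-line anonymous function.
add_hundred = lambda x: x + 100

print(calculate_sum(3, 4))
print(total, product)
print(add_hundred(10))
```

Assigning a lambda to a name mirrors what the lesson does; in larger programs a `def` is usually preferred for named functions, and lambdas are passed directly to calls such as `map`.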
Let's see how to convert the function that we wrote earlier to a lambda expression. A lambda expression can take multiple parameters but always returns one value. Now we can call it with two arguments, such as 4 and 5, and print the output. A lambda function is an anonymous function. It can take any number of arguments but can only return one output. Lambda functions are mainly used to shorten the code. Next, we'll see how to combine lambda expressions with other operations, like map and reduce, to do complex operations in a single line of code.

10. RDD - Transformation & Action: Now that you have created an RDD, let's see what kinds of operations we can perform on it. There are two types of operations we do on RDDs: transformations and actions. With a transformation, we produce a new RDD from an existing RDD. When we call an action, that's when we see the result of the RDD computation. The examples that we saw in the previous lab, like collect, count, and take, are examples of actions. As mentioned earlier, RDDs are immutable; they cannot be modified. However, you can take an RDD, apply a transformation, and store the resulting data in another RDD. The other thing to note is that Spark does lazy evaluation. Spark will not do any processing until an action method is called. When we created an RDD earlier, Spark did the processing only when an action method like collect, count, or first was invoked. Until then, Spark does not do any processing. You can create an RDD and do multiple transformations, but only when you collect or persist to disk does the execution happen. That is lazy execution. Let's see some examples of transformations we can do with RDDs. The very first transformation that we will see is map. It acts on one element at a time, performs some operation, and creates another RDD with the same number of rows as the parent RDD. This is how you use the map transformation on an RDD.
Specify the RDD name, then call map, and then pass the function that needs to be executed. This function can also be a lambda expression, which we'll see shortly. We've created my CSV RDD from the data stored in retailstore.csv. Say we have a transformation requirement to convert all Male to M. We can do that using the map function. Let's see how that would work. We'll call the map function with a lambda expression that replaces all the Male with M. Now let's run this and see the output. We can store the output in another RDD and display it. At this point, though we have defined the transformation, the actual operation will not happen. It will happen only when we call an action method, like collect, count, or any other action method. Here we can see that all the Male values have been changed to M. map acts on one element at a time and produces an output with the same number of rows as the parent RDD. Next, let's look at the transformation called filter. With filter, as the name suggests, you can pick an RDD, apply a filter operation, and get a resulting RDD. For example, give me all the male customers, or give me all the customers of a certain age. You take the RDD, call the filter method, and pass in a function or lambda expression. To get all the females from the RDD that we created in the previous step, we'll use a lambda expression like this: give me all the rows where Female is in the element. Let's run this. Again, we'll call collect; that's when the operation will be performed. Spark does lazy evaluation; until you call an action method, it will not do anything. We can see all the rows with Female in them. We can do other actions, like getting the count. There are five female customers in that RDD. Next is flatMap. It works the same way as map, but it can return more elements than the parent RDD. Let's understand it through an example. We'll take the female customers RDD and then create a new RDD in which each element is a word.
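The map and filter steps described above can be sketched like this. The sample lines, their column layout, and the helper names are assumptions, since the transcript does not show the contents of the CSV file. Python's built-in `map` and `filter` are used here as a local stand-in for the RDD methods so the lambdas can be tried without Spark; `run_on_spark` shows how the same lambdas would be passed to an RDD, given a live SparkContext.

```python
# Hypothetical sample lines; the real file's columns may differ.
lines = [
    "Age,Salary,Gender,Country,Purchased",
    "18,20000,Male,Germany,N",
    "19,22000,Female,France,N",
    "20,24000,Female,England,Y",
]

to_short_gender = lambda line: line.replace("Male", "M")
has_female = lambda line: "Female" in line

# Local stand-in: built-in map/filter do no work until list() consumes them.
mapped = list(map(to_short_gender, lines))
females = list(filter(has_female, lines))


def run_on_spark(sc, path):
    """Apply the same lambdas to an RDD; `sc` must be a live SparkContext."""
    rdd = sc.textFile(path)
    mapped_rdd = rdd.map(to_short_gender)   # transformation: nothing runs yet
    female_rdd = rdd.filter(has_female)     # transformation: nothing runs yet
    # collect() and count() are actions, so the work happens here.
    return mapped_rdd.collect(), female_rdd.count()


print(mapped)
print(len(females))
```

Note that `map` keeps one output row per input row, while `filter` can only keep or drop rows, which is why the filtered result is shorter.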
We can write a lambda expression that splits the line by comma and gives us all the words. Let's store it in words and display it. Now we can see a new RDD with each element being a word. And this will certainly have more records than the parent RDD. We can do words.count to see how many elements there are in the new RDD. There are 25 words, and hence 25 elements, in the new RDD. Set operations are a type of transformation with which you can combine two RDDs. You can do a union, which combines all the elements from both RDDs, or an intersection, which gives the common elements of both RDDs. Let's see some examples. We'll create two RDDs. The first one will contain the letters A to E, and the other will contain some other letters. We can see that there are some common letters between the two RDDs. You can simply combine the two using the union method: rdd1.union(rdd2). Then let's display it. We can see all the elements from both RDDs. If you want to get only the distinct ones, you can do that using the distinct method and then doing a collect. We can see that only the distinct elements are getting printed. So this is how we can combine two RDDs using union. Similarly, you can do an intersection, and that will display only the common elements, which are C and D. Traversing an RDD is a costly operation. Typically, you would combine multiple transformations, capture them in a function, and invoke that function from a map transformation. Let's see how that works. We have this my CSV RDD, which contains all the rows from the CSV file. Let's write a function that does multiple transformations: for example, converting all the Male to 1 and Female to 0, and converting all the N to 0. If somebody has not purchased, we'll convert that to 0; otherwise, we'll convert it to 1. And we'll convert the country to uppercase. These are some examples, but you can try out other operations as well. Now we can call this function from a map transformation.
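A function that bundles the several changes just described might look like the sketch below. The column order (age, salary, gender, country, purchased) and the names `transform_row` and `transform_on_spark` are assumptions for illustration; the real file may be laid out differently, and a header row would need to be skipped or handled separately.

```python
def transform_row(line):
    """Apply several column-level changes to one CSV line in a single pass."""
    # Assumed column order: age, salary, gender, country, purchased.
    age, salary, gender, country, purchased = line.split(",")
    gender = "1" if gender == "Male" else "0"      # Male -> 1, Female -> 0
    purchased = "0" if purchased == "N" else "1"   # N -> 0, Y -> 1
    return ",".join([age, salary, gender, country.upper(), purchased])


def transform_on_spark(sc, path):
    """Run transform_row over every line; `sc` must be a live SparkContext."""
    transformed = sc.textFile(path).map(transform_row)  # lazy transformation
    return transformed.collect()                        # action runs the job


print(transform_row("18,20000,Male,Germany,N"))
print(transform_row("20,24000,Female,England,Y"))
```

Because all the changes happen inside one function, the RDD is traversed once rather than once per change.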
We can say my CSV RDD dot map and simply pass this function. We'll call the resulting RDD my CSV transformed. Again, the operation will be performed only when we call an action. We call collect now, and we can see that all the transformations have been applied. All the males have been converted to 1 and females to 0, N purchases to 0 and Y purchases to 1, and all the countries are in uppercase. So this is how we can combine multiple operations, capture them in a function, and call it from map.
To recap all the transformations: map applies a transformation on each line and creates an RDD with the same number of elements as the parent RDD. Filter applies a filter criterion and gives a resulting RDD. flatMap works on each line but typically produces an RDD with more elements than the parent one. And with the set operations, union or intersection, you can combine two RDDs.
You have seen many examples of actions, like collect, count, take and first. There is another type of action called reduce. It is an action that aggregates, like sum, min, max. Those kinds of aggregations you can do using reduce. Let's see how that works. Say you have an RDD with a certain number of elements, let's say four elements. Using reduce, you can get the sum of all the elements. The way to do that is to pass a lambda expression which takes two parameters and returns their sum. This works recursively. First it takes 10 and 20 and gets the sum, then it takes the sum of 10 and 20 and adds it to 30, and so on. So this is how reduce works recursively. Now let's see reduce in action. We'll create an RDD from the four elements and then call reduce to get the sum of all the elements. We get 100, which is the sum of all the elements.
11. PySpark - SparkSQL and DataFrame: Programming using RDDs was complex, and the developer community wanted a change. Spark came up with Spark SQL to greatly simplify Spark programming.
Spark SQL is a library that is built on top of Spark Core and Spark RDD, and it supports SQL-like operations. You can interact with the underlying data through SQL queries using the Spark SQL libraries. Spark SQL works through a SparkSession, which provides a standard interface to work across different data sources. You can work with either the Spark SQL or the Spark DataFrame APIs. When you use Spark SQL, the output can also be stored in a Spark DataFrame, and further processing can be done using the DataFrame-based or Spark SQL APIs. For the most part, Spark is used for large-scale data transformation. You can read data from one source, apply a series of transformations, blend in other data, and then finally store the output in another data store, from where it can be picked up for further analysis. Let's now look at the Spark DataFrame and Spark SQL APIs, which became the primary interface from Spark 2 onward. Whether you are using Spark 2 or Spark 3, you'll be primarily working with Spark SQL or Spark DataFrame.
Let's go to Colab and create a new notebook. We'll call it PySpark SQL DataFrame. In the very first cell we'll put the code to install Spark, as we have done earlier. Let's run it. This will install Spark and create a SparkSession. It's complete now. Let's now copy the same retail store CSV to the Colab environment. It has been copied. SparkSession has a read.csv method with which we can create a DataFrame. Let's do that. By using the type function, we can check the type of customer df. It is a DataFrame. We can call show on the DataFrame to see all the records in it. We can also limit the number of records in the display. Let's display three only. We can get statistical info about the DataFrame using describe. You need to do describe followed by show to get the info. We can see the summary here with some stats: the number of records, the mean value of each field, the standard deviation, and the min value and max value, wherever applicable.
Gender and country are string type, so we did not get anything for the mean and standard deviation. From our DataFrame, we can select a particular column by simply using select and specifying the column name. You can also store the output of this operation in another DataFrame and then display that. Here we have stored the output of the select operation in country df, and that is also of type DataFrame. With a DataFrame, we can easily perform aggregation operations like groupBy: specify the DataFrame, call the groupBy method, and specify the field on which you want to apply the group by. Then you can do a count. So there are three records for Germany, four for France, and three for England. Similarly, let's do a groupBy on the gender field. There are five males and five females in the customer df DataFrame.
With the Spark SQL library you can do programming using the Spark DataFrame APIs, or you can use the Spark SQL API. To work with Spark SQL directly, you need a temporary table. Let's create a temporary table called customer from the customer df DataFrame. We can also convert the temp table back to a DataFrame by using the Spark SQL library and then doing a select star from that table. You can see that the type is a DataFrame. Now that we have a temp table, we can do all kinds of SQL operations on it by using the Spark SQL library. Now let's see how to use Spark SQL to do different kinds of transformations. We'll try a simple filter operation: get all the customers that meet a given condition. Now that we have a customer temp table, we can fire a select query on it using Spark SQL and get a DataFrame from it. Here we are doing a direct dot show. Or we can get a DataFrame and then call dot show on it to see the result. You can also do the filter operation directly on the DataFrame. For example, you can do customer df dot filter, and within quotes you can specify the filter criteria, and the result would be a DataFrame.
Let's store the output in another DataFrame called filter df and display the records. So you can either do different operations directly on the DataFrame, or you can create a temp table and then write select queries on that table to filter or aggregate the data. You can use whatever you are comfortable with: Spark SQL provides both a DataFrame API and a SQL API. You can do different aggregations with a DataFrame easily. For example, if you want to group by gender and then get the average salary and max age, you can do that using this syntax. You may not remember all the syntax. Whenever you have a particular transformation requirement, a filter or an aggregation, you can look up the Spark documentation and get the exact syntax. Conceptually, understand that Spark DataFrame has all kinds of libraries for doing different kinds of data operations. This is what you'll be mostly doing with Spark, but with a very large dataset, either stored in a Hadoop cluster or on some other big data platform, like cloud storage.
You can fetch multiple columns from a DataFrame easily: specify the columns you want in the select method. Adding a new column to a DataFrame will be a very common requirement in the real world. In some cases you'll be adding a date. In some cases you'll be taking one column and then deriving another column from it. For example, you can take one column like salary, get twice the value by multiplying it by two, and then store that in a new column, double salary. DataFrame provides the withColumn method with which you can add new columns easily. DataFrames are immutable: the customer df DataFrame will not be changed after this operation. We can validate that by doing a dot show here. This displays the original DataFrame. You can always store the output of an operation in another DataFrame and use that for further processing. Let's call it new customer df. Now this DataFrame has the new column. You may also want to add a column with a fixed value for each row.
You can use the Spark SQL lit function for that. Let's see how that works. You can give a column name and specify the value using lit. You can write lit and whatever constant value you want. For example, we can capture a year. lit is a function from the Spark SQL functions library. You need to import it, from pyspark.sql.functions import lit, and use that to add a column with a fixed value for all the rows. You can easily rename a column using the DataFrame withColumnRenamed method. Here we have renamed salary to income.
You can combine different operations in a single statement. For example, get me all the customers with salary greater than 30000 and get their age. You can also do the same thing using Spark SQL. Let's try that. Select star from customer where salary greater than 30000. Put it within quotes, and then do a dot show. This customer is the temp table that we created earlier from the DataFrame. Let's run it. We get the same output. So either you can write SQL queries like this, or you can do operations directly on the DataFrame. You can also select multiple fields after doing a filter. The same can be done using the Spark SQL libraries. Now let's see how to do multiple filter operations on a DataFrame. Here we can say salary greater than 30000 and, let's say, less than 40 thousand, and we can combine the two using an and operation. We'll change the second value to 40. So this is how we can easily write multiple filter operations. The same thing you can try with Spark SQL also, by having an and clause.
Using Spark SQL functions, you can do various aggregate operations easily. If you want to count how many distinct countries there are, use the countDistinct function on the country column and get the output. You can also use an alias to change the name of the output column. To get an average value, you can simply use the avg function. For example, you can do the average of salary to get the average salary of the records you have in the DataFrame. Another example: let's do average age.
You can also determine the standard deviation using the stddev function. Let's get the standard deviation of age also. Spark SQL has a format_number function. Import format_number. Let's first create a DataFrame with the standard deviation of salary. We can see that it has many decimal places. Now, using format_number, we can restrict the number of decimal places. Simply say format_number, the std column, and specify how many decimal places you want, and the number will get formatted.
Many times you'll have a requirement to sort data. You can use the orderBy method to sort data on a particular column. It has been ordered by salary. By default, it's sorted in ascending order. To sort in descending order, you can use desc, and for that you can use the col function. You have to import col from pyspark.sql.functions. Now it has been sorted in descending order: we specified the column and then used desc. You can use the same syntax for ascending also, though by default it is ascending.
You can easily drop a column by calling drop with the column name. Let's store the result in a new DataFrame. Again, when you drop, the original DataFrame will not get impacted. The resulting data can be stored in another DataFrame, which will show you the data without the dropped column. Our data has many rows with null values. To drop all the rows which have null values, you can use na.drop, which drops any row with a null value in any of the columns. This kind of operation is very common in the real world. When you get a dataset that has certain null values, you want to cleanse the data by dropping all the records which have null values. There are other ways to handle that also, but one of the common ways is to get rid of all the null records. You can do that easily by using DataFrame dot na dot drop. Instead of dropping the null values, you can also replace them with some other value. For example, replace all the null values with a new value.
We can see that wherever we had null values, those have been replaced with the new value. These are some examples of transformations you can do using Spark SQL and Spark DataFrame. Depending on your use case, you'll be doing one or multiple transformations and then finally storing the resulting dataset in another data store, from where it can be accessed by other applications. Thank you.
12. UDF: Welcome back. In this lab we are going to look at UDFs, or user-defined functions. Open the PySpark UDF and join notebook that is attached to the resources section, log in to the Colab environment, and load the notebook. Or you can create a new Python 3 notebook and start typing. Install Spark; these are the same steps that we have performed earlier. Check that Spark is installed correctly by creating a dummy DataFrame. Copy the store customer CSV file to the Colab environment. Copy the store transaction CSV file to the Colab environment. You can find the path by clicking on the tab, and then you can get the path to the file. Check if the files are copied. Yes, we can see store customers and store transaction. These two files have been copied to the Colab environment. Now let's create a DataFrame from the store customer CSV called customer df. We can see some sample records here: customer ID, age, salary, gender, country. This DataFrame has 7 thousand records. Now let's load the store transaction CSV file into the transaction df DataFrame. These are some sample records. There are about 1 million transactions that have been captured in this file. So we have 700 customers and 1 million transactions.
Let's now try to understand UDFs, or user-defined functions. If you notice that some operation you want to perform is not available in the built-in functions, you can define your own UDFs, or user-defined functions. Let's see how we can extract the year from the date field using user-defined functions.
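A minimal sketch of such a year-extraction UDF is shown here. It assumes the date is a dash-separated string with the year as the third part (for example 21-12-2019), and the column names Date and Year and the function names are hypothetical rather than taken from the class notebook.

```python
def extract_year(date_str):
    """Split on '-' and return the third element (index 2), assumed to be the year."""
    return date_str.split("-")[2]


def add_year_column(df, date_col="Date"):
    """Wrap the plain function as a Spark UDF and add a Year column."""
    # Imported here so extract_year can be tried without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    year_udf = udf(extract_year, StringType())
    return df.withColumn("Year", year_udf(col(date_col)))
```

Keeping the logic in a plain Python function makes it easy to check on a single string before wrapping it with udf and applying it to a large DataFrame.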
Import udf from pyspark.sql.functions. Then let's write a UDF to extract the year from the date field. It will read the date field, split it on dash, and create a list of three elements. Then we read the third element, at index location 2, which is the year, 2019 or whatever value is there. So this function will take a string and give us the third element after splitting it on dash. Now we can use this UDF to extract the year from the date field. Let's add a new column called year, extract the value from the date field, and display it. As you can see, the year has been extracted and captured in the new column. UDFs are custom functions, or user-defined functions, with which you can do lots of operations if those are not supported by the built-in Spark functions. Thank you.
13. Joins: Welcome back. In this lab we are going to look at join operations. Let's look at a use case: we want to understand which country's customers are spending how much. For that we'll have to combine the two datasets we have, the customer dataset and the transaction dataset. The customer dataset has country, customer ID, salary, age and gender, and the amount spent by each customer is captured in the transaction dataset: customer ID, product ID, amount. So to understand which country's customers are spending how much, we'll have to combine these two datasets. This is how we perform join operations in Spark: you take one DataFrame, join it with the other DataFrame, and specify the field on which the join will be performed. Here we are saying join the customer DataFrame with the transaction DataFrame where the customer ID matches. By default this will be an inner join, which gives us the common records between the customer DataFrame and the transaction DataFrame. We'll look at the other join types shortly. If you perform this operation and then print the resulting DataFrame, country spend details, we can see that both DataFrames have been combined.
We have the customer details and we have the transaction details. Note that customer ID is captured twice. Now, to get the country-wise spend, we can simply perform an aggregation operation: group the data on country, aggregate on amount, and get the sum. This is the amount for each country.
Let's now understand the other join types in Spark. For that we'll copy the small datasets, store customer mini and store transaction mini, to the Colab environment. Let's see if the files are copied. We'll create a DataFrame, customer df mini. This has six records, with customer IDs 1, 2, 5, 6, 7 and 8. Similarly, let's create a transaction DataFrame mini. This has records for customer IDs 1, 2, 3, 4, 5 and 6. You can notice that some of the customer IDs from the customer dataset are not present in the transaction dataset. Similarly, some of the customer IDs from the transaction dataset are not present in the customer dataset. Using these two DataFrames we'll look at some of the other joins we can perform in Spark. By default, the join is an inner join, which you just saw in the previous example. When you take the two DataFrames and join them, you get the common records. So 1, 2, 5 and 6 are the common records across the two datasets, the customer and transaction mini datasets. Instead of leaving it blank for the default, you can also specify how equals inner. This is how you specify the join type. In Spark the default is the inner join, so you'll get the same records, 1, 2, 5 and 6.
Now let's look at the left join. In a left join, you will get all the records from the customer DataFrame, that is, the DataFrame on the left-hand side, and you will get null values for the records which are missing in the transaction DataFrame, the right DataFrame. For example, we have 1, 2, 5 and 6, which are common across the two datasets, so we have records from both the customer and transaction datasets.
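With the customer IDs just listed, which IDs show up under each join type can be worked out with plain Python sets. This is only an illustration of the matching logic, not Spark code. The join_frames helper shows the corresponding PySpark call, where the key column name CustomerID is a hypothetical name and not necessarily the one in the class files.

```python
customer_ids = {1, 2, 5, 6, 7, 8}      # IDs in the mini customer data (left side)
transaction_ids = {1, 2, 3, 4, 5, 6}   # customer IDs in the mini transaction data (right side)

# Customer IDs that appear in the result of each join type.
join_keys = {
    "inner": customer_ids & transaction_ids,      # only the common IDs
    "left": customer_ids,                         # every ID from the left side
    "right": transaction_ids,                     # every ID from the right side
    "outer": customer_ids | transaction_ids,      # IDs from either side
    "left_semi": customer_ids & transaction_ids,  # common IDs, left columns only
    "left_anti": customer_ids - transaction_ids,  # left IDs with no match on the right
}


def join_frames(customer_df, transaction_df, how="inner", key="CustomerID"):
    """The matching PySpark call; 'key' is an assumed column name."""
    return customer_df.join(transaction_df, on=key, how=how)
```

Passing the key as a column name keeps a single key column in the output, whereas joining on an equality expression between the two DataFrames keeps both, which is why the customer ID showed up twice earlier.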
Customer IDs 7 and 8, on the other hand, are present in the customer DataFrame but not in the transaction DataFrame. In the left join, all the records from the left DataFrame get printed, and null values get printed for the missing records from the right DataFrame. Similarly, in a right join, all the records from the right DataFrame get printed, and null values get printed for the missing ones from the left DataFrame. So 3 and 4 are missing in the customer data, and you can see null values for those. You can also do a full outer join, which will give all the records from the two DataFrames, and wherever a record is missing, you'll get null values. 7 is missing in the right DataFrame, and 8 is missing in the right DataFrame. Similarly, customer ID 3 is missing from the left DataFrame. So in outer join and full outer join, you are getting all the records from both DataFrames.
There is something called left semi join. It is like an inner join with only the left DataFrame values displayed. So you have the matching records from the left DataFrame, and you are not seeing anything from the right DataFrame in a left semi join. Left anti join gives the rows in the left DataFrame that are not present in the right DataFrame. So this is left DataFrame minus right DataFrame, and you'll see data only from the left DataFrame.
Let's look at some of the advanced concepts of inner join. Instead of equals, you can also do greater than. The join will be performed, and you'll see the records where this criterion is met: get the customer ID from the left DataFrame which is greater than the customer ID in the right DataFrame. So you can see, for example, that 2 is greater than 1. Similarly, you can do a less than operation on inner joins. So these are some techniques for joins in Spark. You can combine multiple DataFrames and do inner, outer, left or right joins, depending on your use case. Thank you.
14. Bonus - PySpark Hadoop Hive development environment using PyCharm and Winutils: We'll do Python Spark Hadoop programming using PyCharm.
You have to first install Python, then PyCharm. After that, the JDK is required, because Spark is built using the Scala language and Scala requires the JDK. After installing the JDK, we set up Spark. Then, to do Hadoop Hive programming on Windows, you need winutils, which has to be added to the Spark installation directory. After that, you can get going with Spark and Hadoop programming using Python. Let's see that in action.
Let's go to the official Python website and download Python. We'll download the latest version, which is 3.9. Once downloaded, click on the installer. Add Python to the path and install. Once installation is complete, go to the command prompt and verify. Python 3.9 is available on this machine.
Let's now install PyCharm, which is one of the most popular IDEs for Python development. Go to the JetBrains website and download the Community Edition. Once downloaded, click on the installer. You can select the 64-bit launcher shortcut, which will be created on your desktop. Once finished, launch PyCharm. Click OK on do not import settings. Let's create a new project. We created a directory and will specify that path. Let's give the project a name. It takes some time to create a virtual environment. You'll get a nice sheltered environment to manage your project dependencies. By default, you get a main.py. Click on this green arrow to run it. You can see the sample code output, Hi PyCharm, in the console. Let's add another print statement. We'll add a breakpoint. Simply click here, and then pick debug. It runs up to that point. It printed up to there. Then you can click Step Over to go to the next line. So this is how you can debug Python programs in PyCharm.
Java is required for Spark. Let's install the JDK. Go to the Oracle website and download Java. First check that Java is not already installed: go to the command prompt and type java -version. Java is not installed. Let's download it from the Oracle website.
We'll download the 64-bit version for the Windows machine; you can choose the right one for your operating system. Oracle will prompt you to sign in. If you do not have an Oracle ID already, then create one. After that you can download and simply click and install. Now you can go and verify that Java is installed. Also make sure Java is added to the path. Go to Environment Variables, edit Path, and add C:\Program Files\Java, whatever JDK version you installed, up to the bin directory.
Next, let's go to the Spark website and download Spark 3. We'll download 3.2.1, which is pre-built for Hadoop 2.7. Go to the mirror site and download. Let's unzip it using 7-Zip. Let's copy this directory to the C drive; you can copy it to any directory. We'll name it spark3. Now go to the Windows environment variables. Let's set C:\spark3 as HADOOP_HOME, and C:\spark3 also as SPARK_HOME. To work with Hadoop on Windows, you require winutils.exe, which you can download from our GitHub repository. You need to copy that to the spark3 bin directory. We also need to add this to the path. Go to Path and add the spark3 bin directory to the Path environment variable. So HADOOP_HOME should be C:\spark3, SPARK_HOME should be C:\spark3, and Path should contain C:\spark3\bin. You should also have the winutils file within the bin directory. After that, you can open a command prompt and type pyspark. You can see that Spark 3 has been installed on this machine.
To run PySpark from the command line, you need to go to the environment variables and add PYSPARK_DRIVER_PYTHON as python. Now go to the command prompt and type pyspark. Let's have a simple integer list. From pyspark.sql.types, import IntegerType. Now let's create a DataFrame from the list, with my list and the schema as IntegerType. Now let's do df.show.
We can see the DataFrame values getting printed to the console.
Let's now connect PyCharm to PySpark. Go to File, Settings, then select the project and go to Project Structure. Click on Add Content Root, select C:\spark3 and then python, and click Apply. With that content root added, we have C:\spark3\python, and we can apply that. Click OK. Go to File, Settings, Project Interpreter, click on plus, and search for pyspark. Click on Install Package. PySpark has been successfully installed. Now import pyspark and run it.
Let's now create a list here. We'll import IntegerType, and from pyspark.sql import SparkSession. As you keep typing, PyCharm will prompt you with the available classes or variables. Now let's create a Spark session: builder, app name, getOrCreate. Next, spark.createDataFrame with my list and IntegerType. Let's store that and do df.show. We need to put parentheses here. We need to go to the run configuration. Go to Edit Configurations, select Environment, and add SPARK_HOME, which should be C:\spark3, and PYTHONPATH, which should be C:\spark3\python. Now run it. You can see that the values are getting printed.
Let's now save this DataFrame to a Hive table. We'll enable Hive support and then getOrCreate. Now from the DataFrame we can create a temp table, temp table one. Next we'll create a Hive table from that temp table: spark.sql, create table as select star from temp table one. Let's run it now. We need to give a parenthesis here and run it again. The program completed successfully: exit code zero, which means success. You can go to the project folder and under that see the spark-warehouse directory, where the table has been created. We did not give any name to the table, so it picks this table name, and under that you'll see part files. That is how Hadoop Hive stores files in the HDFS file system, and you are seeing the same thing here in the Windows environment. You can open the files and see the content of the DataFrame.
That is the integers 1, 2 and 3 split across multiple files. One file has one entry, and the other file has two entries. This is how you can do Hadoop and Spark programming on your Windows machine.