2021 Edition - Spark Scala coding framework , best practices and unit testing with ScalaTest | Engineering Tech | Skillshare

2021 Edition - Spark Scala coding framework , best practices and unit testing with ScalaTest

Engineering Tech, Big Data, Cloud and AI Solution Architec

37 Lessons (2h 56m)
    • 1. Spark Scala Introduction

    • 2. What is Spark?

    • 3. Installing JDK

    • 4. Installing IntelliJ IDEA

    • 5. Adding Scala Plugin to IntelliJ

    • 6. Scala Hello World

    • 7. Scala Basics

    • 8. Hello World Spark Scala

    • 9. Configuring HADOOP HOME on Windows using Winutils

    • 10. Enabling Hive Support in Spark Session

    • 11. Installing PostgreSQL

    • 12. psql command line interface for PostgreSQL

    • 13. Fetching PostgreSQL data to a Spark DataFrame

    • 14. Organizing code with Objects and Methods

    • 15. Implementing Log4j SLF4J Logging

    • 16. Exception Handling with try, catch, Option, Some and None

    • 17. Reading from Hive and Writing to Postgres

    • 18. Reading Configuration from JSON using Typesafe

    • 19. Reading command-line arguments and debugging in IntelliJ

    • 20. Writing data to a Hive Table

    • 21. Scala Case Class

    • 22. Scala Unit Testing using JUnit & ScalaTest

    • 23. Spark Transformation unit testing using ScalaTest

    • 24. IntelliJ Maven tips and assertThrows

    • 25. Throwing Custom Error and Intercepting Error Message

    • 26. Testing with assertResult

    • 27. Testing with Matchers

    • 28. Failing tests intentionally

    • 29. Sharing fixtures

    • 30. Exporting the project to an uber jar file using Maven Shade plugin

    • 31. Signing up for GCP

    • 32. Cloudera QuickStart VM Installation on GCP

    • 33. Running Spark2 with Hive on Cloudera QuickStart VM

    • 34. Uber Jar Spark Submit on Cloudera QuickStart VM

    • 35. Doing Spark Submit locally

    • 36. Source code and resources on Github

    • 37. Thank you and preview of our PySpark course

About This Class

This course will bridge the gap between your academic and real-world knowledge and prepare you for an entry-level Big Data Spark Scala developer role. You will learn the following:

  • Spark Scala coding best practices

  • Logging - log4j, slf4j

  • Exception Handling

  • Configuration using Typesafe config

  • Doing development work using IntelliJ, Maven

  • Using your local environment as a Hadoop Hive environment

  • Reading and writing to a Postgres database using Spark

  • Unit testing Spark Scala using JUnit, ScalaTest, FlatSpec and assertions

  • Building a data pipeline using Hadoop, Spark and Postgres

Prerequisites:

  • Basic programming skills

  • Basic database knowledge

  • Big Data and Spark entry level knowledge

Meet Your Teacher

Engineering Tech

Big Data, Cloud and AI Solution Architec



1. Spark Scala Introduction: Welcome to this Spark Scala coding framework and best practices course. We have a whole section dedicated to Spark testing, and we'll also be covering Spark Structured Streaming in this course. You will learn what matters in the real world: how to structure the code, how to do logging, how to do exception handling, and how to do unit testing using JUnit and ScalaTest. You'll understand how to read configuration from properties files while working with one of the most popular big data technologies, Spark Scala. You'll start with a Hello World program and understand step by step how to build a production-ready Spark Scala application. This course will teach you how to interact with Hadoop and with relational databases like PostgreSQL from a Spark application. It will take you from your academic background to a real-world developer role, and you'll be learning all these concepts while building a data pipeline. You do not need to have any prior Spark Scala knowledge. However, if you have some big data or Hadoop background, that will definitely help you succeed in this course. Let's dive in and get started.

2. What is Spark?: Big data has two key concepts: distributed storage and distributed computing. Instead of storing a huge volume of data in a single database, you split the data and store it on multiple machines in a distributed manner, and you take the processing to where the data is. Before Spark came into the picture, Hadoop MapReduce was the most popular technology in the big data ecosystem. In Hadoop, you store the data in the distributed file system and then run MapReduce programs to process that data. MapReduce writes intermediate data to disk to perform its calculations, and Spark improved on that by storing data in memory. The Spark website claims it is 100 times faster than MapReduce. Spark on Hadoop is a very popular architecture in the real world.
You store the data in the Hadoop HDFS file system, in Hive tables, and you do all kinds of processing, such as transformations and aggregations, using Spark. Spark can also access data from relational databases, do the processing, and store the data back into relational databases. Spark can even work on a plain file system. Spark can run on various cloud environments and access files from AWS S3, Azure Blob Storage, and various other storage systems. Spark has its own cluster manager, and when it is running on Hadoop it can also leverage the YARN cluster manager.

The Spark DataFrame and Spark SQL libraries are the main way we work with higher versions of Spark. With a DataFrame, you read data from various sources and store it in a tabular manner, and then you perform transformations and aggregations on the data. Once the data is processed, you store it back into another data store. Spark SQL simplifies Spark programming further by providing a SQL-like interface, using which you can work with different data sources through SQL queries. Let's first create a development environment using IntelliJ and Maven. We'll learn some Scala basics and then we'll dive into creating DataFrames using Spark Scala.

3. Installing JDK: Let's search for "Java download" or "JDK download"; the search results will take you to the Oracle site. Before installing, let's make sure Java is not already installed. Go to the command prompt on Windows, or the terminal on Mac or Linux, and run java -version. As you can see, the command is not recognized, so Java is not installed on this machine. Let's search for Java SE Development Kit. Accept the license agreement and download the version for your operating system. I'll download the 64-bit Windows version. Oracle will prompt you to sign in. If you do not have an Oracle ID, create one; you just need to provide basic information like email and name, and you should be good to go. You can log in using your email ID and password. Once signed in, you should be able to download the installer. Go to your download folder and execute it. Click Next. You can leave the default directory or change it to some other directory; I'll leave it as the default. Once the installation is complete, you can go to the command prompt again and check the Java version. As you can see, Java has been installed on this machine.

4. Installing IntelliJ IDEA: Welcome back. Let's now install IntelliJ IDEA. It's a very popular IDE for Spark Scala development in the real world. Search for "IntelliJ download". You can go to the JetBrains site, download the Community Edition, and, once it has downloaded, run the installer. You can keep the default folder and install it. Once installation is complete, go to the desktop, look for the IntelliJ IDEA icon, and launch the IDE. You can also find it from the Windows search bar. Thank you.

5. Adding Scala Plugin to IntelliJ: We'll now add the Scala plugin to IntelliJ IDEA. Go to Configure and click on Plugins. In the Marketplace tab, search for Scala. Select the Scala plugin and click Install. We'll be using Maven, which you'll see shortly. Now restart IDEA, or simply close it and open it again.

6. Scala Hello World: After installing the Scala plugin, we have to restart the IDE. Next we'll write a Scala Hello World program using IntelliJ IDEA. Click on Create New Project, and in the next screen select the project type as Maven. We'll be using Maven for dependency management in this Spark Scala course. You do not need to have any background in Maven; we'll be explaining in detail how to use Maven for dependency management, and also how to package your code using Maven, later in the course. Make sure you select Maven, not Scala, while creating the new project. IntelliJ will automatically pick up the latest JDK from your machine. In the next screen, you have to give a name to the project. Again, select the location where you want to save your project.
Choose any location on your machine. GroupId is the package name. You can keep the default or use a different package name; I'll change it to one of my own. Click Finish. It will take some time to import all the libraries. You can manually click Import Changes or Enable Auto-Import, and at the bottom you will see "Importing Maven projects". Once that is done, you will see the project view, or you can press Alt+1 to go to the Project view. Expand the project folder. Under src/main you will see a java folder; that is the default folder name in IntelliJ. We can change it to scala using Refactor > Rename.

You can right-click on the src directory and add new files, but you won't see any option to add Scala files until you have added Scala framework support. Right-click on the project name and select Add Framework Support. Then select Scala. If you don't see any library here, click Create and then download the Scala library. We'll be working with Scala version 2.11.8, which works well with the JDK and Spark 2.4. Once you have downloaded it, select that Scala version and click OK. Now you can again expand the src folder, select the scala directory, and right-click on it, and you will see the option to add Scala classes. Pick Scala Class.

Now we'll add an object, because the starting point of a Scala application is a singleton object, not a class. So let's select Object and give it a name; I'll call it ScalaDemo. Now you can go to the object, simply type main and hit Tab, and IntelliJ will create the default main method. Let's now write a simple println statement that will print a sentence to the console. You can run the program using the green arrow at the object level or at the main method level. You can see that "Hello World, my first Scala program" is getting printed in the console. You can also go to a Scala worksheet and practice with a REPL-style interface. There you do not have to create objects or methods.
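
As a rough sketch of the Hello World object built in this lesson: the object name ScalaDemo comes from the lesson, while the separate greeting method is only an assumption added so the printed text can be checked without running main.

```scala
// Illustrative sketch only; the greeting helper is hypothetical.
object ScalaDemo {
  // Text printed by the program, kept in its own method for easy inspection
  def greeting: String = "Hello World, my first Scala program"

  // Entry point: a singleton object with a main method, not a class
  def main(args: Array[String]): Unit = {
    println(greeting)
  }
}
```

Running the object with the green arrow in IntelliJ should print the greeting to the console.
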
You can simply type code and see the output. Select Scala Worksheet and give it a name. You can write code here and hit the green icon, and you will see the output in the right-hand panel. Let's declare two variables, a and b, and calculate their sum. Click on the green arrow and you'll see the output here. This is how you can practice using a Scala worksheet.

We'll now go back to the main object that we created earlier and add a few more lines of code. As a beginner, you can choose to work in the Scala worksheet or, for a more real-world programming experience, you can create objects, create methods, and write your code there. If there is a compilation error, you can just click on it in the console and IntelliJ will take you to that particular line. Then you can fix the error and rerun the program. So this is how you can get started with Scala programming using IntelliJ. You need to ensure the Scala plugin is added, and also that Scala framework support is added to the project you're working on. After that, you can get going with Scala programming.

7. Scala Basics: We'll learn some of the Scala basics in this lab. Let's create a new project. The type will be Maven. We'll give it a name; let's call it ScalaBasics. The location is where the project will be stored. We'll also set a package name. Open the project and press Alt+1 to go to the project view. Enable Auto-Import or import all the changes. Now let's go to the main folder and rename the java folder: click on Refactor > Rename and call it scala. Now we need to add framework support. Right-click on the project, choose Add Framework Support, select Scala, and select 2.11.8; that is the version we'll use for all our programming. We can see that both the JDK and the Scala libraries are available under the project. We'll have to make this folder the source root, and after that we will be able to add a new Scala class. Sometimes when you rename java to scala, you have to make it the source root again.
Let's create a new object, ScalaBasics. In Scala, all execution starts from an object which has a main method. Let's print a line and execute it. It's running fine, and we are all set to carry on with Scala programming in this project.

Let's add a new function to understand some of the Scala variable concepts. Scala has two kinds of declarations: one is var, a variable, and the other is val, a value. A var can be modified, but a val is constant. So let's try to modify both a and b. We get an error because b is a val, which cannot be modified. Let's add simple println statements and call this function from the main function. You can see the "reassignment to val" error, so a val cannot be modified. vals are the main thing in Spark Scala, because we want our programs to scale well in a distributed environment and we do not want values to get modified accidentally.

Let's add another function and declare some variables. There are two ways you can declare values or variables in Scala. You can skip the type, and Scala will detect the type automatically, or you can explicitly specify whether it's an integer, a string, or whatever. For a, Scala automatically infers that it's an integer type, but for b we have enforced the integer type. So if you try to assign a char or a string value to b, it will give an error, because it's expecting an integer in that variable. Similarly, you can declare other data types, such as Double or Float, or any other data type you want. Either put a colon and declare the variable's type, or let it be inferred from the value automatically. We declared c and d as Double types. If we specify 5, it will automatically become 5.0 because the variable is of type Double. Let's print and execute. Nothing is printed, because we did not invoke this method from the main method; the variable declaration function has to be invoked from the main function. We'll comment out the other one.
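
The variable rules covered so far can be sketched as follows. The object and method names here are assumptions made for illustration; only the var/val behaviour and the type declarations come from the lesson.

```scala
// Illustrative sketch; VariableSketch and its method names are hypothetical.
object VariableSketch {
  // A var can be reassigned, a val cannot
  def varVersusVal(): (Int, Int) = {
    var a = 10       // type inferred as Int
    val b: Int = 20  // type stated explicitly
    a = a + 1        // fine: a is a var
    // b = 21        // would not compile: "reassignment to val"
    (a, b)
  }

  // Double values, one inferred and one declared
  def doubles(): (Double, Double) = {
    val c = 2.5        // inferred as Double
    val d: Double = 5  // the literal 5 is widened to 5.0
    (c, d)
  }

  def main(args: Array[String]): Unit = {
    println(varVersusVal()) // prints (11,20)
    println(doubles())      // prints (2.5,5.0)
  }
}
```

Uncommenting the reassignment of b reproduces the compile error shown in the lesson.
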
Now we can see that c and d are printing double values.

We can declare an array like this. In Scala, an array is a sequence of elements, starting with index 0, then index 1, and so on. You can put a colon and say Array of type Int; that's another way of declaring arrays, where you explicitly specify what data type the array holds. To read an element you have to use parentheses. So let's change them all and print their values. The index number starts with 0 and gets incremented by one. Similarly, we can have an array of String type, and let's print the values. We can see all the values of the second array.

We can declare a list in Scala like this. A list again contains a sequence of elements, but a list can have different data types in a single list. In a list, the index number starts with 0 and gets incremented by one. You can see we were able to print the different values. Let's put in a space to make it clearer. Here we can see the three values, so a list can hold a mixture of data types. List has various methods available. Let's look at one of them: .size invokes the size method and gives us the size of the list, that is, how many elements there are.

There is another data type called a tuple. You declare a sequence of elements within parentheses. Unlike a list or an array, you don't have to specify the tuple explicitly; if nothing is specified, it becomes a tuple. You can grab elements using ._1, ._2, and so on. The index number in a tuple starts with 1 and gets incremented by one; it doesn't start with 0 like an array or a list. You typically use a tuple with the Map data type. Map is a key-value data type in Scala. You can have any number of keys and values stored in a map, and then you can grab an element by specifying the key name. Let's print key 2. We are able to display the value for key 2.

Let's now look at some more examples. We'll create a new function to demonstrate how loops work in Scala.
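
Before moving on to loops, the array, list, tuple and map declarations from this part of the lesson can be summarized in a short sketch. All names and sample values below are assumptions chosen for illustration.

```scala
// Illustrative sketch; names and values are hypothetical.
object CollectionSketch {
  val numbers: Array[Int] = Array(1, 2, 3)          // element type stated explicitly
  val names = Array("Spark", "Scala")               // inferred as Array[String]
  val mixed: List[Any] = List(1, "two", 3.0)        // a list holding mixed types
  val pair = (1, "Spark")                           // parentheses alone make a tuple
  val courses = Map(1 -> "Spark", 2 -> "Big Data")  // key -> value pairs

  def main(args: Array[String]): Unit = {
    println(numbers(0))                   // arrays are indexed from 0, using parentheses
    println(names(1))
    println(mixed.size)                   // number of elements in the list
    println(s"${pair._1} ${pair._2}")     // tuple fields start at _1
    println(courses(2))                   // look a value up by its key
  }
}
```

Note that a list holding values of different types is typed as List[Any], so the compiler no longer knows the type of each individual element.
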
By default, the method returns Unit, which means it doesn't return anything. Let's change it to the String data type, so this method will return a string. We can declare a string, and in Scala the last statement or last expression gets returned, so s gets returned. If the last statement is not of the String data type, it gives an error. This gives an error because we're not returning anything, so the last line has to be of the String data type in this case. Whatever is on the last line is what a Scala function returns. Run it again, and this time it should run fine. You can read the return value in the calling method. Let's print whatever the loops function returns.

You can pass one or more parameters to the function. Let's pass a String-type variable, str, and use it to do some operation. Let's pass it from the main function. We say "for loop demo done" and then append whatever string we are receiving in the function. These are really simple examples, but they give you a sense of how we can do programming using Scala.

This is how you write an if condition in Scala: an if condition, an else if for another condition, and an else. The else if is optional. You can put in any number of conditions, and whichever condition is satisfied, that particular block will be executed. Scala works like Java, and its syntax is also very close to Java. The best way to learn a programming language is by practicing, so look at these examples and then try to come up with more scenarios and practice; that's how you master any programming language. Just change the conditions and a different block will be executed.

This is a for loop. A less-than sign and a dash, <-, is the symbol that you use. Then loop from 1 to 10 and do some operation in the body of the loop. Syntactically, it's slightly different from Java. And it will execute the loop. Let's correct the typo. So the loop got executed.
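
A minimal sketch of the ideas above, covering a String return value, an if / else if / else chain, and a for loop from 1 to 10. The object name, method names, thresholds and message text are assumptions for illustration, not the code typed in the lesson.

```scala
// Illustrative sketch; LoopSketch and its methods are hypothetical.
object LoopSketch {
  // The last expression of each branch is the return value; no return keyword needed
  def describe(n: Int): String = {
    if (n < 5) "small"
    else if (n < 10) "medium"
    else "large"
  }

  // Takes a String parameter, loops from 1 to 10 inclusive, and returns a String
  def loopDemo(str: String): String = {
    var total = 0
    for (i <- 1 to 10) {
      total = total + i
    }
    "for loop demo done " + str + ", total = " + total
  }

  def main(args: Array[String]): Unit = {
    println(describe(7))            // prints medium
    println(loopDemo("from main"))  // prints for loop demo done from main, total = 55
  }
}
```

Changing the declared return type of loopDemo back to Unit, or ending it with a non-String expression, is a quick way to see the compile errors described in the lesson.
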
There is another way to write a for loop. You can say 1 until 5, or until 9, or whatever value; it will execute as long as that condition is valid. You can also take a list and iterate through the list in a for loop. You can see the output. There is also a while loop. The way it works is: while some condition is satisfied, do some operation. In Scala you will mainly be using a for loop in most of your programming; the while loop is not so widely used, but for your knowledge, this is the syntax of the while loop. You can have semicolons in Scala, but typically they are not used as a coding practice. So these are some basics of the Scala programming language. Thank you.

8. Hello World Spark Scala: Let's now understand how to do Spark Scala programming using IntelliJ. Before that, let's increase the font size; you can go to File > Settings and change the font. Create a new project: File > New > Project. We'll select the type as Maven. The default JDK is selected; click Next and give the project a name, such as SparkHelloWorld. You can also change the package name. Go to the project folder; you can press Alt+1 to see the project view. We'll rename the java folder to scala, as we have done earlier, using Refactor > Rename. Let's add Scala framework support: go to Add Framework Support, select Scala, select 2.11.8, and click OK. Let's enable Auto-Import. Now you can create a Scala object, which will be the starting point of our application. Create a main method (type main and press Tab) and print a simple line. It's running fine.

Next we'll create a Spark session. Before we can create a Spark session, we have to go to the Maven repository and add the required dependencies. So search for the Spark dependencies in the Maven repository. We'll add the Spark Core library; select the version that is stable with our JDK. Let's add a dependencies section in pom.xml; we manage the dependencies of Maven projects through pom.xml. We added the spark-core dependency, version 2.4.3 for Scala 2.11.
Let's also add the Spark SQL dependency. Again, select 2.4.3 for Scala 2.11, copy the Spark SQL dependency, and paste it into the pom.xml file. These are the two dependencies we need to get started with Spark Scala programming.

Let's now create a Spark session. We have the required dependencies in the pom.xml file, so IntelliJ will automatically pick up the SparkSession class and ask whether to import the required library. Let's give our app a name and call getOrCreate. Now the Spark session got created. Let's try to create a dummy DataFrame. We'll create a sequence, and from the sequence we'll create a DataFrame. It will have just two entries: 1, Spark and 2, Big Data. We'll use spark.createDataFrame on the sample sequence to convert it to a DataFrame. Let's add some additional print statements. Now let's run it. The Spark session got created, the DataFrame is created, and you can see the output as well. We can change the heading by specifying the column names; let's add course id and course name. And now it's getting printed with the heading.

So this is how you can do Spark Scala programming using IntelliJ. You add the required dependencies to the pom.xml file; through pom.xml, a Maven project manages all its dependencies. You can also add external libraries or JAR files. After that, you create a Spark session and get going with Spark programming.

9. Configuring HADOOP HOME on Windows using Winutils: Download the winutils file from our GitHub repository (FutureX Skill, Spark Scala). We need this file to do the Hive setup on a Windows machine. Simply extract it and copy it to a folder. We created a folder under the C drive called winutils, under that we created a bin folder, and under that we stored winutils.exe. Now go to your Windows environment variables and edit the system environment variables to set HADOOP_HOME, which will point to the winutils folder, that is, C:\winutils.
So now you can do Hadoop Hive programming on your Windows machine from IntelliJ. You need to make sure HADOOP_HOME is set, and that the winutils file is there under the HADOOP_HOME bin folder. This file is also available in other repositories online; you can search for it and download it. You can also go to our GitHub repository, FutureX Skill, and download it from there. Thank you.

10. Enabling Hive Support in Spark Session: Welcome back. Let's now understand how to do Spark Scala Hive programming using IntelliJ. Create a new project. Select the project type as Maven. Give the project a name and choose the location where you want to store the project. Change the package name. Before this, you should have set your HADOOP_HOME, which I explained in the previous video. Now press Alt+1 to go to the project view and change the java folder name to scala. Next, we need to add framework support for Scala; we have explained these steps in our other videos. Select Scala, select 2.11.8. Create a new Scala class; the type should be Object, and give the class any name. Type main and press Tab, and you'll get the default main method. Let's copy the Spark session creation code from the older project. We'll just run it to make sure everything is okay.

Let's enable Hive support. To do Spark and Hive programming, we have to create a Spark session and enable Hive support. We also need to set the Hadoop home directory, which will be C:\winutils. If required, you can specify the warehouse directory explicitly; by default it will be picked up from the spark-warehouse location. When you move from your local environment to another environment, you'll have to make sure you have the correct warehouse directory. Now let's run it. It gave an error because it's not able to instantiate the SparkSession. Let's see why: the Hive classes are not found, because we have not added any dependencies for Spark Hive support. So let's go to the Maven repository and search for Spark Hive.
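
For orientation, the Hive-enabled session this lesson builds up to might look roughly like the sketch below once the spark-hive dependency is on the classpath. The object name, app name, column names and the local[*] master are assumptions made for illustration; the C:\winutils path and the spark-warehouse location follow the lesson.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch; SparkHiveSketch, the app name and local[*] master are hypothetical.
object SparkHiveSketch {
  val HadoopHome = "C:\\winutils"        // folder that contains bin\winutils.exe
  val WarehouseDir = "spark-warehouse"   // warehouse location, stated explicitly here

  def createSession(): SparkSession = {
    // Set the Hadoop home before the session is built so winutils can be located
    System.setProperty("hadoop.home.dir", HadoopHome)
    SparkSession.builder()
      .appName("HelloSparkHive")
      .master("local[*]")                              // assumed: run inside the IDE
      .config("spark.sql.warehouse.dir", WarehouseDir)
      .enableHiveSupport()                             // needs spark-hive on the classpath
      .getOrCreate()
  }

  def main(args: Array[String]): Unit = {
    val spark = createSession()
    val sampleSeq = Seq((1, "Spark"), (2, "Big Data"))
    val df = spark.createDataFrame(sampleSeq).toDF("course id", "course name")
    df.show()
    spark.stop()
  }
}
```

Without the spark-hive dependency, the call to enableHiveSupport fails at runtime with the "Hive classes are not found" error seen in the lesson.
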
Select the Spark Project Hive artifact and select our version, the one we have been using. Copy and paste the dependency into your pom.xml file, and make sure the scope is changed to compile. Now run it. The Spark session got created, and you can see that the warehouse directory is being set to the spark-warehouse directory. But we get an error here. You'll have to uncomment the line which sets the directory explicitly. Uncomment it and move it into the SparkSession creation section. The Spark session got created, and we're able to print the DataFrame values.

Let's now write it to a Hive table. We can simply do df.write; the format is CSV, and we can specify the folder where the file will be stored. Now run it. We can go to the project folder and see that the sample sequence folder got created, and under that we can see the part file. If you get an error while running this, you might have to set permissions for your temp folder. You can do that by simply firing this command at your Windows prompt, so that you have the required permissions to execute Hive programs using winutils. Thank you.

11. Installing PostgreSQL: We'll install Postgres in this lab. Search for "download PostgreSQL" and click the download link. We can download the installer. Select the right version for your operating system. I'll select one of the earlier releases; usually you pick a version that is slightly older so that it's more stable. There is no need to select Stack Builder; we don't need that. The other three options are fine. Then search for pgAdmin and go to the pgAdmin interface. Give any password; I'll give admin. Now you are in the Postgres admin interface, and here you can create databases, create schemas, and create tables. Let me create a database, futurex. By default you get the postgres database, but you can create additional databases. Under futurex we'll create a schema; let's call it the futurex schema. Under the schema we can create tables.
You can create a table using a script, or you can use this interface. Let's use this interface. We'll name the table futurex course catalog and then add some columns. We'll add course id, which is of an integer type; it's the primary key, and it's not null. Let's add course name. It will be of type character varying so that it can take any number of characters; again, not null. But there is only one primary key, course id. Let's have another field called author name, and that will also be character varying. Next, we'll add course section, which will contain the course structure, and that type can be JSON. We can store a JSON string in this field. Finally, let's add a date field. So our table is now ready. You can go to the SQL tab and see the script.

Let's create an insert script. By default you get an insert script with all the column names. Let's add some values. You can go to an online JSON editor, where you can create JSON strings and then validate them. Let's create a simple JSON string. JSON contains key-value pairs, so we'll have section and title as the keys, each with a value. Let's paste it, and we'll remove the extra spaces. So our insert script is now ready. Let's execute it. It managed to insert one row. Let's insert another one; we'll give it a different course id and a different name. It's done.

We can now fetch data from that table using the Query Tool. Make sure you give the schema name: in any query you write in the Query Tool, use schema.table name. We can see that both of the courses are fetched. This is how we can install Postgres, which is a very popular relational database. Thank you.

12. psql command line interface for PostgreSQL: Sometimes pgAdmin fails to load. You can also use the Postgres command-line tool, psql, which was installed when we installed PostgreSQL. Open that.
You can use this command-line tool to interact with the PostgreSQL database. The server is localhost. The database is postgres. Hit Enter; the port is 5432. The username is postgres, and specify whatever password you set while installing Postgres; I had set it as admin. Now let's create a schema. We need to put a semicolon at the end. We need to create a schema with the name futurex schema; let's do that. Now the futurex schema has been created. After that, we can create the futurex course catalog table under that schema. The table has been created; let's check that it was created correctly. There it is. Then we'll insert two records, and then we'll select data from the course catalog table. You can also create a new database. To connect to this database, we can simply say \c futurex. Now you can create schemas and tables within this database. This is how you can interact with the PostgreSQL database from the command-line interface.

13. Fetching PostgreSQL data to a Spark DataFrame: In this lab we'll understand how to interact with the Postgres database from a Spark application. Within our main object, let's write the code to interact with the Postgres database. The first thing we need is connection properties. We need to specify the user ID and password for the Postgres database, and we do that through Java Properties. You need to import java.util.Properties. Let us now specify the user ID and password. I specified admin as the password while configuring Postgres, and that is the password I'm using; the default user ID is postgres. We will create a DataFrame from the course table that we created in the earlier lab, that is, the futurex course catalog table in the futurex schema. So we'll fetch data from that table and populate it into a DataFrame using the spark.read.jdbc method; using that, you can interact with a Postgres database table.
You have to specify the URL for the database and also the database name. By default Postgres runs on port 5432, so that's the port we use to connect to it. Specify the table name, and specify the connection properties. Let's now print it; we'll add some print statements. Let's run it now. It gave an error; let's check it out: "No suitable driver". We need to add the Postgres Maven dependency so that our program gets the required driver. Search for it in the Maven repository and select one of the latest versions, a 42.x release. Let's copy this and paste it into pom.xml. This should import all the required JAR files for the Spark-Postgres interaction.

Now run it. It gives a different error this time: it is not able to find the futurex course catalog table in the futurex schema. That's because we specified postgres as the database name, and we have to change it to futurex. So let's change the database name to futurex and run it again. The URL should contain the server, the port number, and then a slash followed by the database name. Now we can see that the data is getting fetched from the Postgres table. It's now populated into a DataFrame and shown on our console. We had inserted two records into the futurex course catalog table, and we're fetching those two records and populating them in a Spark DataFrame. Thank you.

14. Organizing code with Objects and Methods: In this lab we'll organize the code and give our project a structure. Let's create a new package. Right-click on scala, then New > Package. We'll call it common. Under that, let's create a new Scala class of type Object, and we'll call it SparkCommon. Within this SparkCommon object we'll create a method to create the Spark session. Unit is the default return type, which means void; that is, the method doesn't return anything. Let's go to the main class and copy the entire code that creates the SparkSession.
Remove it from the main method and paste it within the createSparkSession method of the new object. Comment out this dummy DataFrame creation code. Now this method should return the SparkSession. Let's change the return type. In Scala, the last line gets returned, so let's return spark here. Now this method is ready: it will create the Spark session and return it. Let's invoke it from the main method to get the Spark session. Within the main method, we'll declare a val called spark, and you can specify the type as SparkSession. Let's call the SparkCommon object. It is highlighted in red, so we can hit Enter and IntelliJ will automatically import the package. Now by calling the SparkCommon createSparkSession method, we can get the Spark session and populate it in this spark variable. This will work fine without declaring the SparkSession type explicitly, but it's always a best practice to declare the type. Let's run it and see if everything looks fine. It started running. It created the SparkSession and is now trying to interact with the Postgres database. And it worked fine. To know whether the method is getting invoked or not, we can put a print statement at the beginning and at the end. It's always a good practice to have a print statement at the start of the method and at the end of the method. The one at the end goes just before the line that returns the SparkSession, because the last line gets returned. Let's run it again. We can see that the new method is getting invoked. There is a typo here; we can fix it. Let's now create another object. We'll add a class of type object, and we'll call it PostgresCommon. Now we'll move the Postgres interaction related code to the new object. First, we'll create a method which will return the Postgres properties. We'll call it get Postgres common properties, and the return type will be Properties. Let's hover over it and then hit Alt+Enter.
And java.util.Properties gets imported. Now let's copy this code from the main method, paste it in the new method, and return the PG connection properties. Here also we'll add print statements, one at the start and one at the end. This method will return the Postgres common properties so that we don't have to write them in every method where we need to interact with the database. Let's see what else is there in the main method. We have the PG table name, and then we have code to interact with the Postgres database. Let's go to the PostgresCommon object and add a new method. We'll call it fetch DataFrame from PG table. It takes the SparkSession as an input variable, so we'll import the Spark SQL SparkSession class. We'll also pass the PG table name as a parameter. The return type will be DataFrame, that is, the Spark SQL DataFrame. Now let's copy this code from the main method. Using the Spark session, we'll make a JDBC connection and then create a DataFrame. This method requires the Postgres connection properties. So instead of specifying a variable here, we'll read them from the get Postgres common properties method. Let's change that. Now we're reading the Postgres common properties from the new method that we created earlier. We have hard-coded the URL here. We can also move it to a new method, so let's add a method which will return the database URL. We could also move these values to a properties file, which we'll cover later. We'll declare the PG URL and return that. Let's add some print statements here, and we'll return this DataFrame. So now we have one method to return the properties, one for the URL, and one method that takes the SparkSession and the PG table as parameters and returns a DataFrame. Let's go to the main method. Here, let's invoke PostgresCommon. Let's import the package. Now let's call fetch DataFrame from PG table.
We'll pass the Spark session which we retrieved from the createSparkSession method, and we'll also pass the PG table name. First we are calling SparkCommon to get the SparkSession; then we are calling the PostgresCommon object to get the DataFrame. Let's run this now. The createSparkSession method started, and the createSparkSession method ended. Then the fetch DataFrame from PG table method started, and it invoked the Postgres common properties method. Finally, we are displaying the DataFrame within the main method. So this is how we can organize Spark Scala code into different objects and different methods. It's a good coding practice.

15. Implementing Log4j SLf4j Logging: In this lab we'll understand how to add logging to our application. Till now we have been using System.out.println statements. We can use the SLF4J logging framework instead. Maven provides dependency jar files which you can use to include the Log4j and SLF4J logging frameworks. These are very popular frameworks for Java and Scala development. So add the Log4j SLF4J implementation dependency. You will also need the Apache Log4j API dependency. These are the two dependencies you need to work with SLF4J logging in your application. Select the same version for both. Both dependencies are now included. Next you need a log4j.properties file, which will hold some configuration parameters for your logging. Let's create a new file under the resources section. We'll name it log4j.properties. We'll add a root logger property. There are different logging levels; I'll demonstrate that shortly. stdout means the output will get printed to the standard console. These are some of the properties needed to specify the type; we'll see them in action shortly. Let's instantiate the LoggerFactory and create a logger val. We'll use that to print output to the console.
So wherever we have a println statement, we'll replace it with logger.info. We'll make the same change in the other two Scala objects. Let's add logger.info to all our methods. Let's do the same in the SparkCommon object also, so that all objects and methods have the new logger statements. Let's run it now. Our logging level is set to info right now, so wherever we have logger.info statements, they will get printed. The date format is the one we specified in the properties file. You can completely suppress it by using a different configuration option, or change the date format to whatever you want. Let's demonstrate how the different logging levels work. We'll add another statement, logger.warn, which is for warnings. Let's add a dummy warning, and we'll see how you can control the log level here. Let's run it. Both info and warning statements get printed because our logging level is set to info. You can see this warning is getting printed, and the other info statements are also getting printed. Let's now change the root logger log level to warn. Then only warning or higher, that is, warning or error, will get printed. In a production environment, this is how we can reduce the number of lines that get written to the log file. Now only the warning statements are getting printed. The first one, "unable to load", is a Spark built-in warning statement. "Created SparkSession" is a System.out statement that we forgot to comment out. And then you can see the actual output. The log looks a lot cleaner now with the warning level. You can also raise this to error, so that a line gets printed only if there is an error. So all we have done here is include the required jar files through the pom.xml, create a log4j.properties file, and add logger info statements to all our objects and methods. This is how you can implement SLF4J Log4j logging in your application. Thank you.

16.
Exception Handling with try, catch, Option, Some and None: Welcome back. In this lab we'll understand how to implement exception handling in a Spark Scala application. We can have simple try-catch blocks so that if any exception is raised in the try block, it can be caught in the catch block. We can handle that exception either by displaying something on the console or by exiting the application. So we put a try-catch block in the main method. We'll add a logger statement at the error level, and we can use printStackTrace to print the error. Any error in the main method will be caught and printed in the exception block. Let's run it. It ran fine; we have not introduced any error. Let's try with a table name which doesn't exist and see what happens. We can see that it entered the exception block in our main method and printed the stack trace also. Let's look at the error: it says it cannot find the table. Here we are invoking another method which is part of the PostgresCommon object, but exception handling is done in the main method, so the exception got caught in the main method. Let's go to the fetch DataFrame from PG table method and implement exception handling there. We put this entire code in a try-catch block and simply write the error to the console. It says Unit. The method is expected to return a DataFrame, but if an exception occurs, it doesn't know what to return. Let's see what happens when we run it. It gives an error, because in case of an exception it can't return a DataFrame. So how do we get around this? Scala provides an easy way to handle such scenarios. We can leverage the Option, Some and None concepts. Instead of just DataFrame, we say Option[DataFrame]. If the program executes successfully, it returns the DataFrame wrapped in Some; if there is an error, it returns None. So Some in case of successful execution and None in case of error. Let's run it.
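The Some and None pattern just described can be sketched without Spark. This is an assumption-laden illustration: a Map stands in for the database, and the object, method and table names are hypothetical. Only the shape of the try-catch with an Option return type mirrors the lesson.

```scala
import scala.util.control.NonFatal

// Illustrative stand-in: a Map plays the role of the database so the sketch runs without Spark.
object OptionHandlingSketch {

  private val tables: Map[String, Seq[String]] =
    Map("course_catalog" -> Seq("row 1", "row 2"))

  // Returns Some(rows) on success and None when the lookup throws.
  def fetchRows(table: String): Option[Seq[String]] =
    try {
      Some(tables(table)) // Map.apply throws NoSuchElementException for a missing key
    } catch {
      case NonFatal(e) =>
        println(s"Error while fetching $table: ${e.getMessage}")
        None
    }

  def main(args: Array[String]): Unit = {
    println(fetchRows("course_catalog")) // a Some wrapping the rows
    println(fetchRows("missing_table"))  // None, after the error message is printed
  }
}
```

A caller that uses .get on the result will throw if the value is None, so getOrElse or a pattern match on Some and None is a gentler way to consume the result when the caller can continue without the data.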
There is another change we need to make in the main method where we are invoking this method: we have to say .get, which returns the DataFrame. So the changes are Option in the method signature, then Some, then None, and then .get in the caller. Now we caught an exception and it entered the exception block of the fetch DataFrame method. It also entered the main method's exception block, because we are returning None and we're not able to handle that in the program. There are multiple ways we can handle it. One way is to simply exit the program in the exception block, so in case of an error the program simply exits. We can see that it entered the exception block of the fetch DataFrame method and did not come back to the main method. If we correct the table name, everything should run fine. So this is how we can handle exceptions in a Spark Scala program. You can add try-catch blocks in other methods also. If a method does not return anything, Option, Some and None are not required; you only have to handle that when returning something. Let's implement try-catch in the createSparkSession method. There we are returning a Spark session, so let's handle that using Option. Let's change this to Option[SparkSession]. The returned value will be wrapped in Some, and in the main method we'll have .get. Now let's run it. It ran as expected: it created the SparkSession and completed the program. Execution has been successful. You can also have a System.exit in the catch block of createSparkSession, so that if the Spark session is not created, the program simply exits. These are some of the ways you can implement exception handling in a Spark Scala application. Thank you very much.

17. Reading from Hive and Writing to Postgres: This is going to be the most exciting lab in this course. We are going to read data from a Hive table, apply some transformation, and store the transformed data in a Postgres table.
Let's first create a local Hive table. Let's go to SparkCommon and create a method to write some data into a Hive table. Using the winutils tool we have seen how to populate the data directory. We can create the table and point it to a particular directory from where it will read the data. In a big data Hadoop setup also, Hive doesn't store the data; it points to the data stored in HDFS. The new method is create futurex course Hive table. Using Spark SQL we'll create a new database. The database name will be futurex course DB, and we'll say "if not exists" so that when we run it multiple times it will not create it again. Let's create the table also. The table name will be fx_course_table. We'll add some fields to the table: course ID, course name, author name and number of reviews, and specify the data types, such as string. Hive works like a relational database. Let's insert some records. The first row is course ID 1, course name Java, author name FutureX, and number of reviews 45. Similarly we'll add a bunch of other records. We have intentionally kept some of the fields blank. We can force Hive to treat them as null by altering the table and setting the table property serialization.null.format to blank. Now Hive will read these fields as null, and when you read them in Spark, they will also be treated as null values. We have kept blank values in number of reviews and author name in some of the records. Let's now go back to the main method. We'll comment out the Postgres table interaction code; we don't need it now. Let's populate data into the Hive table using the SparkCommon create futurex course Hive table method, and we'll pass the Spark session as a parameter. Now let's run this. The program got executed, and we can see the futurex course DB database folder under the spark-warehouse directory, under the main project directory. And there are a bunch of part files which contain the actual data.
We can see some sample files. The table is reading data from these files. This is similar to how data is stored in HDFS in a Hadoop cluster environment. Now let's understand how to read data from Hive, apply a transformation, and then write the transformed data to a Postgres table. Let's add a new method under the SparkCommon object to read data from the table we just created. We'll call it read futurex course Hive table. It takes the Spark session as a parameter and returns a DataFrame. We'll import the DataFrame class. Now let's declare a DataFrame val; we'll call it courseDF. Using Spark SQL we'll read data from the Hive table: select * from the futurex course table. We'll add a logger statement also, and then we'll return the DataFrame. Let's put the entire code in a try-catch block. As shown earlier, we need to declare the DataFrame as an Option, return Some(courseDF), and return None in the catch block. Now we'll go to the main method. We'll comment out the create futurex course Hive table call, as we don't need to create the table again. Then we can invoke the read futurex course Hive table method by passing the Spark session, and then .get the DataFrame. We can store that in a local DataFrame; we'll call it courseDF in the main method, and we'll print it out. Let's run this now. We can see that data is getting fetched from the Hive table, and there are null values in certain records. We had put blank values in author name and in number of reviews in some of the records, so those are getting treated as null. Now we'll create a new Spark Scala class called Spark transformer, of type object. Here we'll write a method to do the transformation. We'll call it replace null values. It will read the DataFrame and replace null author names in the DataFrame with Unknown. The syntax is df.na.fill, which replaces null values.
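To make the null replacement concrete without needing Spark, here is a plain-Scala analogue rather than the lesson's DataFrame code. Option fields stand in for nullable columns, and the field names are assumptions based on the four columns named above. It fills missing author names with Unknown and, as the lesson goes on to do, missing review counts with 0.

```scala
// Plain-Scala stand-in for the DataFrame: Option fields model nullable columns.
// Field names are assumptions based on the columns named in the lesson.
object NullReplacementSketch {

  case class Course(courseId: Int, courseName: String,
                    authorName: Option[String], noOfReviews: Option[Int])

  case class CleanCourse(courseId: Int, courseName: String,
                         authorName: String, noOfReviews: Int)

  // Missing author names become "Unknown"; missing review counts become 0.
  def replaceNullValues(rows: Seq[Course]): Seq[CleanCourse] =
    rows.map(c => CleanCourse(c.courseId, c.courseName,
      c.authorName.getOrElse("Unknown"), c.noOfReviews.getOrElse(0)))

  // On a real Spark DataFrame the comparable calls would be along the lines of
  // (column names are placeholders):
  //   df.na.fill("Unknown", Seq("author_name")).na.fill(0, Seq("no_of_reviews"))

  def main(args: Array[String]): Unit = {
    val rows = Seq(Course(1, "Java", None, Some(45)), Course(2, "Scala", Some("FutureX"), None))
    replaceNullValues(rows).foreach(println)
  }
}
```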
Then we will also replace null values in number of reviews with 0. After doing this transformation we'll store the data in a new DataFrame called transformed DF and return that. We'll add some logger statements in the transformation class. In the main method we'll invoke replace null values with the DataFrame we created from the Hive table, and then we'll print the transformed DataFrame. Let's run it. We got an error because there is a .get, but the replace null values method has no try-catch block and does not return an Option. Let's remove the .get and run it again. It started running. This is the original DataFrame, and this is the transformed DataFrame. We can see that null values have been replaced with Unknown in the author column and with 0 in the number of reviews column. We'll now store the data in a Postgres table. Let's create a new method to write to Postgres. We'll have a try-catch block. This method will not return anything: it reads the DataFrame and stores it in a table. The DataFrame's write method is what we use to save data to a Postgres table. We'll set the mode to append so that every time we run it, the records get appended to the Postgres table. This is the complete syntax to write data to a Postgres table: you specify the table name, the database, and the user ID and password, and it writes the records to that table. We'll call the new method from the main method and pass the transformed DataFrame to it. We'll also pass the target table name. So we are saying: store the transformed DataFrame in the futurex course table. We'll verify the data in Postgres. Let's run it. It started running. Let's check the Postgres table. We'll fire a select query on the futurex course table. We can see that thirteen records have been populated in the futurex course table, and there are some Unknown values in the author name column also. Let's run it again. Since the mode is append, another set of records will get inserted into the futurex course table. We'll run the query again.
And we can see that there are 26 records in the futurex course table. So this is how we can read data from Hadoop Hive, apply a transformation, and store the transformed data in a Postgres table.

18. Reading Configuration from JSON using Typesafe: Welcome back. Now we'll understand how to read different properties from a properties file such as a JSON file. Instead of hard-coding the PG target table, we'll store it in a JSON file and read it from there. Let's go to JSON Editor Online. We can create and validate JSON files using the online tool. We'll have a simple JSON file with one property, pg_target_table, in the main body of the JSON. We'll also have a header section with some information about the author of the JSON file and the creation date. You can use the tool to validate the JSON file. If we store the table name in the JSON file, it will be easy for us to modify it without touching the code. Once you have validated the JSON, you can copy it and store it in a JSON file in your project folder. Let's create a new file called fx_config.json. The file will be stored under the resources folder, under the main folder. We'll copy the content from the JSON editor. So we have one property that we are currently interested in: the Postgres target table that we are using in our application. To read the property from the JSON file, we need to include a dependency. Let's go to the Maven repository and search for the Typesafe Config dependency. Let's select one of the latest versions; we select 1.3.4. Using the Typesafe Config library classes, we can easily read JSON files. Copy the dependency to the pom.xml, save it, and import the changes. IntelliJ will take a few seconds to download the required jar files. Let's create a new Scala object called FX JSON parser. We'll add the usual logger statements. Let's create a method, read JSON file.
It reads the JSON file and returns Config, which is a class in the Typesafe library, com.typesafe.config.Config. Using this class, we can read any JSON file and return a Config object. There is a ConfigFactory class that has a load method, which you can use to read a JSON file. The output will be of type Config, which gets returned. This is the only line in the method, so as the last line it is what gets returned. We'll add another method to read the Postgres table name. Let's call it fetch PG target table. Let's declare a val, PG target table. We'll first invoke the read JSON file method, which returns the Config object. There is a getString method which reads a key and returns the corresponding value. The key name in our case is pg_target_table, which is present under the body tag. So to read that key, we say body.pg_target_table. That gives us the target table name. This is how you read different properties from a JSON file. We'll comment out the hard-coded string in the main method, and we'll use the FX JSON parser object to fetch the Postgres target table. The rest of the code remains unchanged. So the only change we have made is that we moved the Postgres target table name to a properties file, and we are reading it from there. It will be easy for us to modify it later if the table name changes. Let's print the table name which we fetched from the JSON file. We need to return the PG target table from this method, so let's add that as the last statement in the method. Run it again. Now it ran fine, and we can see that it's fetching the table name, the futurex course table in the future schema, from the JSON file. As a result, we can move certain properties outside the code. We included the Typesafe Config dependency and then created a new parser which reads data from the JSON file. Let's now move the transformation logic to the config file. We were replacing null values in two columns.
Now we'll control that from the JSON file. Let's add a new section for null replacement. There we'll specify the column names and capture whether, for those columns, null values should be replaced or not. Let's declare all the column names: course ID, course name, author name and number of reviews. We'll say yes or no for whether null values are to be replaced in those columns. Let's set only author name to yes. Earlier we did null value replacement for both author name and number of reviews, but this time we'll do it only for one column, and we'll control that through the JSON file. Within the FX JSON parser we'll add another method which will read a key name and return the matching value. It will be called return config value. It takes the key as a parameter and returns the value, so that this method can be reused for multiple keys. Earlier we had just one method to read the Postgres table name, but this one is generic: it can take any key and return the corresponding value. Now within the Spark transformer's replace null values method we'll read the config values for the different columns. There are multiple ways you can do that. You could convert the JSON string to a map of key-value pairs and then extract values from there. We'll keep it simple: we'll read the value for each column, and based on that we'll have a simple conditional block to determine which columns need replacement. To read each column's value, we read the key made of body, then the replace null value section, then the column name, joined with dots. We'll declare a val, author name value field, and store the key's value in it. Similarly, we'll have another val for the number of reviews. We'll ignore the first two columns for the time being; you can enhance this logic to check all columns. We'll print the values that were retrieved from the JSON file: yes and no. Now let's add the conditional block. First we'll check whether both values are yes.
If so, we'll do null value replacement for both columns; otherwise we'll do it only for the one whose value is set to yes. It's a simple if-else block. If number of reviews is yes and author name is not, then we'll do the replacement for number of reviews, and similarly for the other case. If both are no, we'll not do any replacement and will return the original DataFrame. The intent here is to understand how to read data from a JSON file, not so much how to write the code in the most efficient way to handle a hundred different scenarios. Let's run it. Now it started running. We can see that the author name value is yes and the number of reviews value is no. This is what we specified in the JSON: only author name is yes. So null value replacement will happen only for the author name. You can see that null values are replaced with Unknown for the author name, but in the number of reviews column the null values are still intact. This is how we can move the transformation logic to a JSON file, which is a good practice in the real world. It's not just null value replacement; you can manage all kinds of transformations through JSON files. You can also store that configuration in a database if you want. The idea here is that for business logic changes, you should be able to change your transformation logic using the config file without modifying the code. Thank you very much.

19. Reading command-line arguments and debugging in IntelliJ: Let's understand how to pass command-line arguments to a Spark program. Depending on your use case, you might need to pass certain variables at runtime. For example, you might want to pass the name of the environment where the Spark job is being executed, or certain variables which are passed from some other program. So let's understand how to pass command-line arguments while triggering a Spark job. If we look at the main method, there is an args variable, which is of type Array[String]. Using this, we can read all the command-line arguments.
In IntelliJ, you go to Edit Configurations, then Program arguments. You can put whatever value you want to pass. Let me put ABC, then Apply and OK. Now we can read the value in the main program. Since I passed just one value, let me extract it, which is at index 0, and simply print it out. We can now run it and see the argument that we passed using the program arguments text box. We do not need to run the entire program to test it. We can also put a breakpoint and test this. Let's see how that works. In IntelliJ, if you want to put a breakpoint at a particular line, you just click on this panel at that line. For example, I want to put a breakpoint here so that when I run the program it will stop here, and then I can decide to go to the next step. Once you put a breakpoint, if you run the program normally, it will get executed without stopping at the breakpoint. So you need to run it using the debug option by clicking on this icon. You'll also find that in the Run menu. Let's debug it now. The program will get executed and it will stop here. Now it has stopped here, and we can go to the Console tab. Here we see different options: Step Over, which takes it to the next line, or Step Into, which goes into any method that might be getting invoked at that line. Let's step over for now. We can see that "argument 0 is ABC" is getting printed here, and that's what we passed from the program arguments. We can have multiple arguments. Let's add another one and print argument one as well. We'll put a breakpoint here and then run it again. Click on debug again; it will ask you to stop and rerun. It stopped here again. We can see that line 14 has been executed, which printed "argument 0 is ABC". Now it has stopped at line 15. Let's step over now, and XYZ is getting printed. Now let's look at an example of how we can use the command-line argument in our program.
While creating the SparkSession, we are setting the Hadoop home directory to C:\winutils. We need to do this in the local environment. However, when we're running in a production environment, we might need to set it to a different value, or it might be optional because Spark will automatically read the Hadoop home from the configuration. So we need to know in which environment we are running the program. We'll go to the command-line arguments and pass a parameter dev, which stands for development environment. We'll declare a new variable here and print "environment is" followed by whatever name we are reading from the command-line argument. We'll go to the SparkCommon method and change it to accept the environment name as a method parameter. Now if the env name equals dev, we'll set this value; otherwise we don't need to. We can add a logger statement here also. Now let's pass this variable from the main program: while calling the createSparkSession method we'll pass this variable. Let's execute this, and we'll put a breakpoint here for now. It has stopped here at line 17, and we can see that the environment is dev. Let's step over to the next line and then the next line. Here, if we step over, it will execute the entire line and go to the next line. But if we want to see what is happening within the createSparkSession method, we need to step into it. Click Step Into, and again Step Into, so it enters the createSparkSession method. Here the environment name is dev, so it sets the Hadoop home and then proceeds. We can also put a breakpoint here. Let's put a breakpoint here, go to the main method and remove this breakpoint; if you click again, it gets removed. Now let's run it again. It stopped here. Depending on how you want to debug, you can decide where to put the breakpoint. Under the Debugger tab, you can see all the variables, for example environment name equals dev. This will help you in debugging.
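The environment switch described above can be sketched as a small pure function plus a main method that reads the first program argument. This is only an illustration: the object and method names are hypothetical, and the winutils path is assumed from the local Windows setup used in the course.

```scala
// Sketch only: helper names and the winutils path are assumptions for illustration.
object EnvArgsSketch {

  // Only the local dev environment needs an explicit Hadoop home.
  def hadoopHomeFor(envName: String): Option[String] =
    if (envName == "dev") Some("C:\\winutils") else None

  def main(args: Array[String]): Unit = {
    if (args.isEmpty) {
      println("No argument passed")
    } else {
      val envName = args(0) // first program argument, e.g. "dev" or "production"
      println(s"Environment is $envName")
      // Set hadoop.home.dir only when the environment calls for it.
      hadoopHomeFor(envName).foreach(dir => System.setProperty("hadoop.home.dir", dir))
    }
  }
}
```

Keeping the decision in a function that returns an Option makes the branch easy to check in isolation, without stepping through the debugger.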
Under the Console you see the log, and under the Debugger you see all the variables. Let's put a breakpoint here and then change the program argument to a different value, let's say production. We'll run it again: click on debug, stop and rerun. It came here. Then we click again and step over. The environment is production, so it will not match, and it will come out without setting the Hadoop home variable. So this is how you can use command-line arguments to control your program flow. You can also check the length of the arguments at the beginning, so that if the argument is not passed, your program simply exits. Let's see how to do that. We'll remove this argument. If args.length is not equal to one, since we are expecting one parameter, we can say System.out.println "No argument passed, exiting". We can exit with exit status one, which stands for failure. Let's run this now. We have removed the command-line argument and put a breakpoint here. It has reached this point. Let's step over: it's saying no argument passed, so it's exiting. And if we pass an argument, then stop and start again, it gets executed successfully because we are passing an argument.

20. Writing data to a Hive Table: Let's now jump to how we can write data to Hive. We'll comment out the Postgres writing part. Depending on your requirement, you can write to multiple sources. We have already seen how to write to a Postgres table. Let's now write this DataFrame to a Hive table. We'll first create a temporary table from the transformed DataFrame, which is the DataFrame we have after replacing null values. Then, using that, we create another table with a Spark SQL command that reads from the temp table and writes to the Hive table. Let's move this code to the SparkCommon object instead of writing it in the main object. Whatever code we wrote, let's move it here so that all Spark-related code is in one object.
We'll add a try-catch block and move this code into the new method. We'll also change the method signature so that we can control the table name through a method argument. For the temp table also, whatever the table name is, we'll append a suffix and use that to create the temp table. So what we are doing here is creating a temp table and then a Hive table whose name we control, so that the code can work with any table in our program. Now the method is ready. Let's invoke it from the main object, passing the DataFrame, the Spark session, and also the table name. Let's run this code now. It will create a new table under the spark-warehouse folder. We are reading from the fx course DB and then applying a transformation, and the new Hive table will also get created under the spark-warehouse folder. Let's check it out. Yes, we can see that the course transformed table got created. We can open any of the files and see the data. So this is how you can read data from a Hive table and write it to another table after applying a transformation. We have also seen how to read from and write to a Postgres table. Depending on your use case, you might be required to build a data pipeline that reads data from one source and writes it into another source. Spark is a very popular tool for data ingestion, data transformation, and loading into data sources. Thank you.

21. Scala Case Class: Let's understand how we can use a Scala case class to manage the input parameters. Scala case classes are immutable and can be instantiated without the new keyword. You can use them to store constant values which do not need to be modified. You can store values in a case class and use it throughout the project, and by default all variables within the case class are of type val. So we'll understand how to store the input parameters in a case class and then use it in our application.
We have defined InputConfig with two parameters: the environment, and the target database where the transformed data will be written. Now let's go to the main method and populate this InputConfig case class. Earlier we read the environment from argument zero and stored it in an environment variable. Now, instead of doing that, we'll read all the parameters and populate the InputConfig case class. Let's declare the InputConfig case class in the main method and initialize the env and targetDB variables here. env would be the first argument, that is, which environment we're running the code in. targetDB would be the second argument, which we can pass as a command-line argument. Let's edit the configuration to pass the second argument. Earlier we had dev. Now we'll add pg: if the target is pg, then it would write to Postgres. You can also have hive as the parameter, and you add more parameters by using space as the delimiter. We don't need this now. So instead of the env name, we'll read the InputConfig values and then do all our operations. We have to modify the createSparkSession method to take the new InputConfig case class as an argument. It's in the same package, so there's no need to import it. Now we'll simply say inputConfig.env. With that we can fetch the environment name and use it in the program. And we're passing that to the method from main. OK. Now let's use the second parameter. Based on the target database, we'll either write to Postgres or write to Hive. If it is pg, then we'll write to Postgres. If it is hive, we'll write to Hive. This InputConfig can have any number of variables, and you can declare multiple case classes in your Scala application depending on your requirement. But you've got a sense of how to use a Scala case class in a Spark Scala application. We'll simply pass hive and write to the Hive warehouse. Let's run it.
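As a rough illustration of the idea above, here is a minimal sketch of a case class populated from command-line arguments. The field names env and targetDB, the values pg and hive, and the helper names fromArgs and describeTarget are assumptions made for this example.

```scala
// Hypothetical field names, following the description above.
final case class InputConfig(env: String, targetDB: String)

object InputConfigSketch {

  // Build the config from command-line arguments; None when too few are passed.
  def fromArgs(args: Array[String]): Option[InputConfig] =
    if (args.length >= 2) Some(InputConfig(env = args(0), targetDB = args(1)))
    else None

  // Decide where to write, based on the target database value.
  def describeTarget(config: InputConfig): String = config.targetDB match {
    case "pg"   => "write to Postgres"
    case "hive" => "write to Hive"
    case other  => s"unknown target: $other"
  }

  def main(args: Array[String]): Unit =
    fromArgs(args) match {
      case Some(config) =>
        println(s"env=${config.env}, ${describeTarget(config)}")
      case None =>
        println("Expected two arguments: <env> <targetDB>")
        sys.exit(1)
    }
}
```

Because the fields are vals, the configuration cannot be changed by accident once it has been built in main and passed to other methods.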
And you can see that it's writing to the Hive table because we passed hive as the parameter. If we change the input parameter to pg, it will write to Postgres. So this is how we can use a case class to manage configuration. It can store constant values which can be used in your application. Thank you. 22. Scala Unit Testing using JUnit & ScalaTest: Let's understand how to do unit testing of Scala applications using JUnit and ScalaTest. Go to File, New Project, and select the type as Maven. Give the project a name. We'll call it FutureX Scala Unit Testing. Click Finish. Open it in this window. Import changes. Go to src, main, java. We'll rename java to scala: go to Refactor and Rename, and change java to scala. We also need to add the Scala framework support. Select Scala, and you can select the library. If you don't see any options, create one: click Create and Download. You'll see all the Scala versions that are available. We'll select 2.11.8 and click OK. Now we can create Scala classes. Let's create a new Scala object called FutureXApp. You can give it any name. Type main and you'll see the main method with a println statement within it. Let's add another object and call it FutureXUtil. We'll write a simple method that we can try out unit testing on. Let's add a method to do simple division. It takes two Int variables as parameters and does the division, a by b, and the return type is Int. The last line gets returned in a Scala method, so a by b would get returned from it. Now you can use this method in your application, in your main method, and you can try to divide numbers using it. Let's run it and make sure everything looks good. 30 by 10 is 3, so we got the expected result. Now let's understand how to do unit testing of that division method. We'll go to the test folder for that. That's where we keep all the unit testing code.
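The division method described above can be sketched as follows. The object names FutureXUtil and FutureXApp and the method name divide are reconstructed from the narration and should be treated as assumptions.

```scala
object FutureXUtil {
  // Integer division; the last expression is the method's return value.
  def divide(a: Int, b: Int): Int = a / b
}

object FutureXApp {
  def main(args: Array[String]): Unit =
    println(FutureXUtil.divide(30, 10)) // prints 3
}
```

Keeping the logic in a separate object, rather than inside main, is what makes it easy to call from a test class later.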
We'll rename it to scala. And whatever package structure we have in the main src folder, we'll add the same package structure in the test package also. Let's add the Maven JUnit dependency. JUnit is a popular framework for Java unit testing, and it's also used with Scala, as Scala is a JVM-based language. We'll select one of the older versions, a slightly more stable version. That one will be fine. Within pom.xml, add a tag for dependencies, and within that add the JUnit Maven dependency. Now let's add a class for unit testing. The convention is to take the name of whatever class you want to test and append Test to it, for example FutureXUtilTest. So this is the class with which we'll unit test the FutureXUtil class, and within it we can have multiple tests. Let's add a method, test1, using which we'll test the division. We need to add the @Test annotation so that JUnit recognizes this as a test. You can see that the green arrow icon is now visible. Now, to write the test, we'll use the assert methods. Whatever the expected result is, you can try to match it with the actual result. There is an assertEquals method, and there are multiple other assert methods available for you to use. We'll just demonstrate assertEquals. So here the expected result should match the actual result: 30 by 3 is 10, so that should match 10. That's how we can unit test this method, so that we can be sure that it's returning the expected result. Click on the class and run the test, or click on this green arrow icon at the class level or at the test level. We can see that the test passed, so the expected result matched the actual result. Now let's change the expected value to something else, and this time it should fail. As you can see, it's expecting a different value, but the actual result is 10. So this is how we unit test using JUnit. We can add multiple tests here. Let's add another test method.
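A JUnit test along the lines described above might look like this. It is a minimal sketch, assuming JUnit 4 is on the test classpath. FutureXUtil is repeated here only so the example is self-contained, and its name is an assumption.

```scala
import org.junit.Assert.assertEquals
import org.junit.Test

// Stand-in for the utility object under test (hypothetical name).
object FutureXUtil {
  def divide(a: Int, b: Int): Int = a / b
}

class FutureXUtilTest {

  @Test
  def test1(): Unit = {
    // assertEquals(expected, actual): 30 / 3 should be 10.
    assertEquals(10, FutureXUtil.divide(30, 3))
  }
}
```

Changing the expected value to anything other than 10 makes assertEquals throw an AssertionError, which JUnit reports as a failed test.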
I will say 30 by 2, which is 15. Let's run it: one failed, one passed. So that is expected. We'll correct the first one. Now we have two tests, and in both we're expecting the correct result, so both tests pass. So this is how you can unit test a Scala application using JUnit. Let's now understand how to use ScalaTest. For that we'll go to the Maven repository and look for the correct dependency for ScalaTest. We'll find something for 2.11, as that's the Scala version we're using. Let's take 3.1.1. We only need to add the dependency to pom.xml. Now let's add a class for ScalaTest here. The convention is that you should append Spec to the class name. For our project, you would have something called FutureXUtilSpec. There are multiple styles available in ScalaTest. We'll use FlatSpec. You can read more about the different styles on the ScalaTest website. We'll demonstrate using a simple example. We'll write a test. It's similar to JUnit, and it has something like assert, but the syntax is slightly different. We're saying it should "match", where "match" is the test name. You can give it any name, and whatever is within the curly braces will be the test. So we're asserting that 10 matches 30 by 3, and we can see that the test passed. Let's change it to something else to make it fail. As you can see, it failed. So this is unit testing using ScalaTest. You can have multiple tests here within a class. Let's add another one, similar to what we did in JUnit: 30 by 2, where we're expecting 15. It gives an error. Let's check it out: duplicate test name. We should have a different name for the second test. I will call it "match 2". Now the first one should fail and the second one should pass. Yes, we can see that one failed and one passed. Now let's correct the expected result for the first test, and both should pass. Thank you very much. 23.
Spark Transformation unit testing using ScalaTest: Let's now add a couple of unit tests using ScalaTest to the project we have developed in this course. Make sure the java directory under test is renamed to scala, and then add the dependency for ScalaTest, as you have done earlier. So this is the dependency we need. We're not using JUnit in this demonstration. Let's add a new Scala class. We'll call it FirstSpec. Before writing unit tests for the project, we'll just see a sample unit test using the behavior-driven design pattern. So this FirstSpec would extend AnyFlatSpec. FlatSpec is the testing style that we're using in this course. Make sure the class is imported. Typically, in behavior-driven development, you define the behavior of the class and then do the development work. Our example will be: behavior of "Spark Scala coding framework". So that's a string. What behaviors can it have? It can start with Spark, and it can end with framework. So these are a couple of behaviors of the string. Let's just write a sample test and run it. As you have seen earlier, the syntax is: it should "start with Spark", then in, and then within curly braces you write the test that you want to perform. Here we're doing a simple assert, and it passed successfully. Similarly, we can add another test for the string to end with framework. So for this string we're expecting two behaviors, and we're testing them using ScalaTest. In this class we have extended AnyFlatSpec, and there may be multiple test classes, so it's better to use a common base class that extends AnyFlatSpec and any other class or trait that you might require in your unit testing package. We'll make it an abstract class. We don't need any implementation here. It would extend AnyFlatSpec or any other class that we might require in future, like Matchers or other testing traits. Now let's write a unit test for the createSparkSession method.
That is a method we have in our program, and you can test a method without running the actual project. Let's see how we can do that using ScalaTest. We've created a class called SparkCommonSpec. So we have appended Spec to the name of the class which we're going to test; that is the convention. This test class would extend FutureXBase, which extends AnyFlatSpec. Now we'll say behavior of "Spark Common". We're expecting one behavior, that is, it should create a Spark session. So that's how you write a test using ScalaTest. Let's now try to invoke that method. The method takes InputConfig as a parameter. We pass the environment and also the target database we'll be writing to, but primarily it uses the environment, and based on that environment it creates the Spark session. It could be for Windows or it could be for another environment. So we need to populate that InputConfig variable and then pass it on to the SparkCommon createSparkSession method, and we'll be testing that using this ScalaTest code. We don't need to run the FutureX Spark transformer project's main class to see the output. We can independently test the Spark session creation method using this technique. We don't need to do any assert or anything here: if the session gets created, then everything looks okay. The session got created. So that's it. This is how we can test whether the Spark session creation is working or not using ScalaTest. Next we'll be testing the replace null value transformation method. We replaced null values in the DataFrame read from the Hive warehouse before storing it in Postgres. Now, we don't need to connect to Hive or Postgres to test this. We can simply have another test class and test it using some dummy input DataFrame. Let's check it out. So the class name would be SparkTransformerSpec. It would again extend FutureXBase. Now we'll define the behavior of this class.
So we will say again, behavior of "Spark Transformer", and it should do the null value replacement. We can give it any meaningful name: it should "replace null value with Unknown". That's what we're trying to test, and we'll test it without running the main application. We've created a mock CSV file, which is present under the project folder. It has a couple of records, and we'll be replacing null values in one of the records using this method. We need a Spark session to create a DataFrame, so let's define a class-level variable and create the Spark session. And then we read from the CSV file and create a DataFrame. The CSV file contains the header, so make sure you read the header and set inferSchema to true, so that the DataFrame also picks up the column names from the CSV file. Now this is similar to the data that was stored in Hive, and we're using this sample data to test our transformation. We'll define a new variable called transformedDF, pass the original DataFrame, and it will return the transformed DataFrame, which we'll just print out. Now, to know whether this is returning the correct values, that is, null values getting replaced with Unknown, let's fetch the row which has null values, which is the row with course ID 2. We'll fetch that row, then from that we'll fetch the column which contains the author name, and from that we'll extract the value. So let's go through that again. The row with course ID 2 contains null in the author name field. So we're fetching the row with course ID 2 and then getting the author name from the author name column. It returns an array, so we take the first element using index zero and zero, and that gives the value contained in the author name field. We'll just print it out, and we'll also compare it with the expected value. Our expected value is Unknown. So we'll assert that author equals "Unknown". If it matches, then the transformation is successful.
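The test walked through above can be sketched as follows. This is a minimal sketch under several assumptions: Spark and ScalaTest 3.1 or later are on the test classpath, an in-memory DataFrame is used in place of the mock CSV file, and SparkTransformerSketch, replaceNullValues, and the column names are hypothetical stand-ins for the course code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.flatspec.AnyFlatSpec

// Hypothetical stand-in for the course's transformer object.
object SparkTransformerSketch {
  // Replace nulls in the author_name column with "Unknown".
  def replaceNullValues(df: DataFrame): DataFrame =
    df.na.fill("Unknown", Seq("author_name"))
}

class SparkTransformerSpec extends AnyFlatSpec {

  // Class-level session shared by the tests in this spec (local mode is an assumption).
  private val spark: SparkSession = SparkSession.builder()
    .appName("SparkTransformerSpec")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  behavior of "Spark Transformer"

  it should "replace null value with Unknown" in {
    // In-memory rows instead of the mock CSV file; course_id 2 has a null author.
    val df = Seq(
      (1, "Spark", Some("FutureX")),
      (2, "Scala", None: Option[String])
    ).toDF("course_id", "course_name", "author_name")

    val transformedDF = SparkTransformerSketch.replaceNullValues(df)

    // Fetch the row with course_id 2 and extract its author name.
    val author = transformedDF
      .filter($"course_id" === 2)
      .select("author_name")
      .collect()(0)
      .getString(0)

    assert(author == "Unknown")
  }
}
```

Because the test builds its own small DataFrame, it needs neither Hive nor Postgres, which is the point made in the lesson.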
Now let's run it. It started building the project. Then we'll see whether the test is successful or not. It created the Spark session, and we can see that the Unknown value got fetched and the test passed successfully. So this is the original DataFrame, and then we have the DataFrame with the null replaced. So this is how we can test the transformation method without running the main application. This is one example of ScalaTest. Similarly, you can test all your other methods, see what values each method is returning, and design your tests based on that. Thank you. 24. Intellij Maven tips and assertThrows: Let's now look at assertThrows. For that, I'll download the project from Udemy and try to import it. When you import the project, it says a non-managed Maven project was found. You'll have to go to pom.xml and add it as a Maven project. So the steps again: File, New, Project from Existing Sources, and point to the source directory where you have unzipped the project. Import it as a Maven project. It takes a few minutes to resolve all dependencies. Now the project is loaded. We'll set this directory as the source root, and we'll set the directory under test as the test root. You need to first set up the JDK. Pick the latest one on your machine. Sometimes it doesn't get picked up automatically. We'll also set up the Scala SDK: do it here, or you can right-click, choose Add Framework Support, and do it there. And we should see all the dependencies from pom.xml automatically imported. If you face any issues, there might be a main.xml file here which you could delete, and then all the dependencies and classes will get imported automatically. Sometimes you'll have to go and check the Maven settings: Use plugin registry. This is required if you're in a restricted environment. Make sure to select it. OK, now it looks good. Let's run one of the tests. It's running the test, and all of it passed successfully.
So we successfully imported the project and can now run the tests. Let's now see another way we can write this "throw NullPointerException" test. We'll copy this, and instead of writing a try-catch, we'll say assertThrows. So this is another way of catching exceptions. Let's run this. So it passed successfully. And if we change it to some other exception, let's say AnalysisException, it should fail, because the function would throw NullPointerException and we are expecting an AnalysisException. So this is another way of testing the different exceptions that your method might throw. It is the preferred way of testing whether your function is throwing the right exception or not. 25. Throwing Custom Error and Intercepting Error Message: We'll now look at how to throw custom exceptions and how to test those exceptions using ScalaTest. Let's go to the createSparkSession method. If the environment is not dev, we'll throw a custom exception. For that, let's add a new package. We'll call it exceptions. And under that we'll add a new Scala class. We'll call it InvalidEnvironmentException. It doesn't need to be a case class. Now let's modify the class signature to make it an exception class. So this is how you can declare a custom exception class: it needs to extend the Exception class. OK, let's first run the main program. In our SparkCommon createSparkSession method, we'll say throw new InvalidEnvironmentException, and we'll say "Please pass a valid environment". The class has been imported. So let's test it out. Instead of dev, we'll pass abc here as the first argument. We'll run it. Here, as you can see, it threw InvalidEnvironmentException: please pass a valid environment. So this is working correctly. Let's understand how to test this. We'll go to SparkCommonSpec and add a new ScalaTest test: it should throw an exception. So we'll change the environment to some other value.
And we'll say assertThrows InvalidEnvironmentException. So this is one way of testing it. Here the environment we send is an invalid environment. So let's run it. We wanted to catch the exception in the test, but because this method is also catching the exception, it did not reach there. It caught the exception here and it exited. So what we'll do is add another catch block here. Instead of exiting, the moment we find an InvalidEnvironmentException we'll throw a new InvalidEnvironmentException. This is one way of handling it; typically you would not have catch blocks in inner methods. Let's write something like this: instead of exiting, we'll say throw new InvalidEnvironmentException. Now let's run it. So the test was able to catch that exception, and the test passed successfully. There is another way we can test this. We can also test whether the error message is correct or not. Let's copy it, and we'll say it should "check error message for invalid environment". And then, instead of the assertThrows clause, we'll create val exception = intercept InvalidEnvironmentException, and within that you would write your code. So this will intercept the InvalidEnvironmentException. Then we can check: assert exception.isInstanceOf InvalidEnvironmentException. So this is one check. And then we can check: assert exception.getMessage.contains the expected text. Let's check that out. We'll search for this. So here we're checking whether the exception is an InvalidEnvironmentException, and we're also checking whether the message is as expected. We are intercepting the exception and checking it. We need to put the type in square brackets. OK, now it looks good. Let's run it. It ran correctly, as expected. This technique can be used for all kinds of exceptions, including custom exceptions. Let's say we change this. It will fail, because though it will throw InvalidEnvironmentException, it would not find this matching string, so this should fail.
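The two checks described above can be sketched with ScalaTest as follows. This is a minimal sketch, assuming ScalaTest 3.1 or later on the test classpath. EnvCheckSketch and its validate method are hypothetical, simplified stand-ins for the environment check inside createSparkSession.

```scala
import org.scalatest.flatspec.AnyFlatSpec

// Custom exception: a plain class extending Exception with a message.
class InvalidEnvironmentException(message: String) extends Exception(message)

// Hypothetical stand-in for the environment check in createSparkSession.
object EnvCheckSketch {
  def validate(env: String): String =
    if (env == "dev") env
    else throw new InvalidEnvironmentException("Please pass a valid environment")
}

class EnvCheckSpec extends AnyFlatSpec {

  behavior of "Environment check"

  it should "throw InvalidEnvironmentException for an invalid environment" in {
    // Passes only if the block throws the given exception type.
    assertThrows[InvalidEnvironmentException] {
      EnvCheckSketch.validate("abc")
    }
  }

  it should "check error message for invalid environment" in {
    // intercept returns the caught exception so its message can be inspected.
    val exception = intercept[InvalidEnvironmentException] {
      EnvCheckSketch.validate("abc")
    }
    assert(exception.isInstanceOf[InvalidEnvironmentException])
    assert(exception.getMessage.contains("valid environment"))
  }
}
```

assertThrows only confirms the exception type, while intercept hands the exception back, which is why the second test can also assert on the message text.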
So it failed because the error message did not contain the string we changed it to. So this is one technique you can use to test all kinds of exceptions that your application might throw, and you can also check the error message along with the type of exception. Thank you. 26. Testing with assertResult: Let's now look at assertResult and how we can test using it. Earlier we have seen assert; using that, you can match certain expressions. For example, we replaced null values with Unknown, and then, when we performed the transformation using test data, we compared the value of author with Unknown and checked whether it was matching or not using assert. Now we can also do the same thing using assertResult, but without adding any additional variable. So let's see that. Let's copy the old test and give it a different name. Now here we don't need assert. We'll say assertResult "Unknown", and then whatever expression we want to put here. You can have one expression or multiple expressions, like addition, subtraction, or substring. The final output of this should match whatever we are expecting in this assertResult. So let's run it. So it passed successfully, as expected. Thank you. 27. Testing with Matchers: Let's now see how we can test using matchers. So far we have seen assert, assertResult, and assertThrows. Using matchers, we can write English-like statements to test too. So let us see an example. To use matchers, we need to ensure our classes have the with Matchers declaration. We'll put that in the base class so that it's available to all the classes. There are two types of matchers, should matchers and must matchers. So let's look at the should matchers. Now we will go and copy this test and give it another name. And the way we write matchers is: Unknown should equal whatever the actual value is. So instead of doing assert or assertResult, you can write English-like statements.
You can use should equal, should contain, and so on to write your tests. Let's run it now. So it passed, as expected. This is how you can use matchers. Thank you. 28. Failing tests intentionally: Let's now look at how to fail a test unconditionally. Sometimes you just want a placeholder for a test, so that you know you need to write that test later. Say this is test one, and you just want to fail it so that you know it is incomplete. You want to let your team know that you have not completed this, and they can expect you to complete it later on. We simply call fail, so this should fail the test. And your team would know that there are tests which are not written, and those will be taken care of in future. Now when we run the tests for the entire module, it will tell you that some tests are passing and some tests are failing, so that your team knows that there are incomplete tests which need to be taken care of. You can also use this technique to let your team know that certain tests are incomplete. In an existing test also, you can just put fail, so that you can communicate to people that this test is not written correctly or is not finished. We modified this test to fail, and it failed as expected. Thank you. 29. Sharing fixtures: If we look at this SparkTransformerSpec class, we have created multiple tests, and in all of them we are creating the DataFrame which we pass to the transformer's replace null value method to do the replacement. Is there a better way? We can use something called ScalaTest fixtures to create the data, which can be shared across multiple tests. So let us see that. Instead of declaring this DataFrame in every test, we'll remove this and declare a new fixture. This is how you declare it, and within that you have a variable for the DataFrame. Now you declare a variable of type fixture and extract the DataFrame by saying f.df. You do that in each of your tests where you need the DataFrame.
This way, your focus is more on writing tests and not so much on creating this data, which can be created once and used across multiple tests. Now let's run all the tests. All the tests passed, as before. Thank you. 30. Exporting the project to an uber jar file using Maven Shade plugin: Before we dive into how Maven and IntelliJ work together, a few quick tips on IntelliJ and Maven. When you import a project from outside, let's say you downloaded this project from Udemy and you're trying to import it, make sure the scala folder under main is set as the source root, and the scala folder under the test directory is set as the test root. You can do that using the Mark Directory as option here. So this is the test root, and that is the source root. You also need to import this as a Maven project when you are prompted on the right-hand side at the bottom, and you can also import all external jar files by clicking the Maven reimport. It may not automatically import all the jar files. So those are some quick tips. You need to ensure all the jar files get imported into the project and are visible under the External Libraries folder. And when you click on top of any file, you might have to set up the Scala SDK manually. You can do it either by right-clicking on the project and adding framework support, or do it here where you see the Setup Scala SDK link. Select the Scala version and then import that. After that, you'll be able to run your project using IntelliJ. So far we've been running using the green icon, but you can also run the project using the Maven tool that you see on the right-hand side. Go to Maven, Lifecycle, and under that you'll see different options for clean, validate, compile, and install. You can use the Maven tool to also do build, deployment, and testing. If you're new to Maven, check out their website for more info about the build lifecycle, like what is validate, what is compile, and what is test. We need to ensure the project is compiled.
Once it's built, we can install it, which creates a jar file. So we're going to try out the install option to create a jar file, which you can export from this machine and use in another environment or a bigger environment. To do that, simply use the install option that you see under the Maven Lifecycle options. But we also have to ensure that the Maven plugin required to create the jar file is added to the pom.xml file. Maven provides the Shade plugin, using which you can create an uber jar file. An uber jar means it would include all the dependencies required for the application, like the Spark dependencies or the SLF4J and Log4j logger dependencies. Without this plugin, Maven would create a simple jar file containing only your Scala source code, which you would not be able to run in a different environment. So make sure you use this Shade plugin. I've copied and pasted the sample configuration taken from the Maven website. I have made just a few changes, in terms of specifying what the main class name is. There are various other configuration options available, but just to get started, the default configuration with the main class name is good enough. This file will be available in the resources section for you to download and try out. But this plugin is required to create an uber jar, which can be deployed in another Spark environment. Make sure the plugin is added within the build tag that you see here. After that, go to Maven and double-click on install. It would start running the earlier steps: whatever options you see before install, like validate and compile, will get executed before the jar file creation happens. So all the jar files that we have in the project's external libraries would get included, and we'll get a huge jar file that can be used for deployment in another environment. You can see this from the log.
All the jar files are getting included in the uber jar file: the jar files we're using for Postgres, for the logger, for Spark, everything is there. Finally, you will see that the jar file has been created, and it's available under the target folder in the project's directory. You can go to Explorer and check this out. So for this project it created a jar file of about 115 MB in size, and there is another jar file, which is about three kilobytes in size. That one contains only the Scala files that you have written. For deployment in another Spark environment, you need the FutureX jar file that got created with the 115 MB size. So just to summarize: you add the Shade plugin to pom.xml and then click on install, and you'll have a jar file. Thank you very much. 31. Signing up for GCP: Let's try to understand the Google Cloud (GCP) free tier limit. Search for Google Cloud or GCP free tier, then go to cloud.google.com/free, or whatever link you see. Keep checking this page for the current offer. Currently it's $300 of free credit for a year. Using this amount you can explore Dataproc or any other service that's available on GCP. There are services for big data, machine learning, storage, and computing, and those services can be explored using the free credit. You'll have to enter your credit card details to sign up and avail the free credit. Google wants to ensure that you are a genuine person who is trying to use the Google Cloud service and not a bot or automated process, and also that you are not someone trying to take advantage of the free tier limit by creating multiple accounts. You'll be charged a small amount, up to $1, on your credit card, which will be refunded within a day or two. If you do not have a credit card, you can use your bank account details, but that may take slightly longer to get validated. Now let's understand how to create a GCP account. Search for Google Cloud or GCP, go to cloud.google.com, and you'll find a link to go to the console. Go to the console.
Google wants to authenticate you before taking you to the console, so authenticate yourself by entering your Google or Gmail ID. Now you will land on the Google Cloud Console homepage, and there you will be prompted to start your free trial. Click on Try for free. On the next screen, enter your country details, read the Terms of Service, select the checkbox, and then hit Continue. You'll have to enter a few more details about yourself, your address, etc., and also your credit card details. Do that. You'll also enter your tax details after entering your credit card details: whether you are a registered entity or an unregistered individual. Select that option and then start your free trial. Depending on your card type, you might be prompted to enter additional details, like a one-time password. At this point, your card should get charged up to $1, which will get refunded within a couple of days. And you are now ready to start the free trial. You have been authenticated. You'll see a Billing link on the left-hand side, which tells you how much credit is remaining. Always go to this Billing link on the console homepage to find out. Initially, you can see that the available credit is shown in Indian rupees, which is equivalent to $300. And then, as and when you consume something, the amount will get deducted from your free credit. 32. Cloudera QuickStart VM Installation on GCP: Welcome back. In this lab we'll understand how to install Cloudera QuickStart VM on Google Cloud Platform. Let's log into the GCP Console. We'll search for virtual machine and select Compute Engine, the virtual machine service on GCP. Click Create Instance and give it a name. Now let's select a machine with eight CPUs, which is the max we can get with the GCP free tier, and 30 GB RAM. Let's also go for a Linux type and increase the hard disk size.
Let's allow HTTP and HTTPS traffic to this instance. We'll also ensure all the ports are open for this particular instance. Go to the network details, and you can add a new firewall rule or modify one of the existing firewall rules to open all ports for this particular instance. So all the ports would be allowed for this instance. The ports are open so that when we install CDH, we can easily access Hue and Cloudera Manager on this instance. Let's now log into the instance using the SSH link, and we'll install Docker on this machine. Even if you don't understand anything about Docker, just run the set of commands that you see here, and you'll have Docker installed on your VM. Then, using Docker, you can install Cloudera QuickStart VM easily. The key thing that we're trying to learn here is how to create a Hadoop instance on GCP which you can use under the free trial limit. Once Docker is installed, you can check the version. You can run a Hello World program to validate that everything is fine. So this looks good. Docker is installed now. "Hello from Docker": this confirms that Docker is installed. Upgrade yourself to the root user, and then using this command you can see whether Docker is running or not. Then just run this command to get the QuickStart VM. As simple as that. You can set up Cloudera QuickStart VM on your local machine also, but for that you would require a lot of RAM on your machine. Now you can see all the images in this Docker instance. You can see the image for Cloudera QuickStart VM. And this is the command to run the QuickStart VM. 8888 is the port for the Hadoop user interface, and 7180 is the port for Cloudera Manager. All these commands will be available in the resources section. Using the public IP address and port 8888, we can get to the Hadoop user interface, the Hue homepage. Then log in to Hue using cloudera, cloudera; that's the default user ID and password. So this is the Hadoop user interface homepage.
From here you can access the different tools that are available to interact with the Hadoop cluster. Let's go to the editor. Here we can execute queries, see the list of databases, create tables, and insert data into tables. You can also get into Hive from your command prompt on the GCP instance. Type hive and hit Enter; you'll be taken to the Hive prompt, and there you can also run the same Hive commands. Initially, there is only one database, so let's create another database called FutureX course DB. Make sure to select the entire line before you execute. We can see that it got created. Let's go back to the previous page and refresh the database list. You'll see the FutureX course DB database. So you can use either the Hive editor or the command prompt, depending on what kind of access you have. If you are just getting started with Hadoop and Hive, then it is better to use the query editor. We'll create one table and insert some records. Initially, there is nothing. We'll use the FutureX course DB so that any table we create will be created under the FutureX course DB database. We have created the FutureX course table, so let's insert some values. I'm switching between the two windows just to demonstrate that you can use either of them; you can do all your operations either in the editor or in the console. Let's look at the record we inserted. Here, we can see it. So as you can see, it's really easy to set up Cloudera QuickStart VM on GCP and get started with Hadoop programming. We'll also do Spark programming, which we'll see shortly. You can click on this manager link and see all the jobs that are running on your Hadoop platform. Hive queries get converted to MapReduce jobs, so you're seeing those jobs in the console. You can also interact with the HDFS file system. Initially, there would be nothing. hadoop fs is the command to interact with the HDFS file system. Let's create a directory now, using the make directory command. 
We have a single-node cluster. You can have a multi-node cluster on GCP also, but for your training and practice, a one-node cluster is good enough. You can also access the HDFS folders from the file browser in Hue. There, you do not have to worry about firing HDFS commands. You can simply create files, upload files, and move files around. As a beginner, it's really easy to work with the Hadoop User Interface. Cloudera is the default user, and there are other users; you can create more. Let's switch to the cloudera user and then create a directory under the cloudera user folder. We'll call it FutureX data. You can go to the Hue interface and view the directory. You can also do it from here. 33. Running Spark2 with Hive on Cloudera QuickStart VM: Let's now enter the Spark shell. Here we can do Spark programming using Scala. As you can see, the version is currently 1.6, so that is the default version. Let's see how to upgrade the Spark version. We'll first remove the default JDK, which is 1.7. All these commands are available for you to execute. Then we'll install OpenJDK 1.8. As you can see, the JDK has been upgraded to 1.8. Let's set the Java home, and we'll install wget, using which we'll copy a file over to our VM instance. Let's go to the Spark downloads page and get any of the stable versions of Spark. 2.4.3 is good enough. We'll select 2.4.3 with the Hadoop 2.6 version and copy that Spark installer to the VM instance. Let's install wget again; it looks like it did not get installed. We use wget to get files from outside to the VM instance. This time we are able to pull the file, so the archive has been copied to the VM instance. Let's extract it now. We can move this folder to /usr/local/spark, so that will be our new Spark home. Go to that directory and you can see all the Spark files. Let's make some changes to make this Spark the default Spark, using the Nano text editor. You can install that easily. 
Then go to the /usr/bin folder and search for all the default Spark executables. Modify each Spark file to point to the new Spark that we just installed. Instead of /usr/lib/spark, we point it to /usr/local/spark, where we have installed our latest 2.4.3 version of Spark. Make that change in all the Spark executable files, so that every file uses /usr/local/spark. Now let's go to the Spark shell. We can see the 2.4.3 version of Spark. We'll create a dummy DataFrame to ensure everything is fine. All looks good. We have upgraded to Spark 2.4.3 on the Cloudera QuickStart VM. We'll now run Spark with Hive. Let's first get rid of the Hive folder; sometimes it creates issues. After that, we can log into the Spark shell and try to fetch data from the table using Spark. We'll fetch data from the course table that we created earlier, store it in a Spark DataFrame, and then display it using .show. So Spark and Hive interaction works fine on the Cloudera QuickStart VM now. We'll now create a DataFrame, write it to Hive, and use that to create a Hive table. You can see that we can go to Hive and fetch the data from the new Hive table that we created using Spark. Thank you. 34. Uber Jar Spark Submit on Cloudera QuickStart VM: Let's understand how we can deploy this application on the Cloudera QuickStart VM. Before doing the deployment, let's make a few changes. We'll ensure there is no Hive or Postgres code; we'll just create a Spark session and then print something, to keep it simple, because the focus is Spark submit. In the pom.xml, we also need to make a few changes. In the Maven Shade plugin, we change the version to 3.2.2 and exclude certain artifacts, along with a few more changes. We will append the text "fat" to the jar file name. Next, let's do an install. This entire project will be available in the resource section. It takes a few minutes for the jar to get generated. The install is now complete. 
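The simplified application described here (just a Spark session and a print, with no Hive or Postgres code) might look like the following minimal sketch. The object name SparkSubmitDemo and the banner helper are hypothetical names chosen for illustration; the course's actual class is in its resource section and may differ.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only; not the course's actual class.
object SparkSubmitDemo {

  // Pure helper so the printed message can be checked without a cluster.
  def banner(sparkVersion: String): String =
    s"Spark session created, Spark version $sparkVersion"

  def main(args: Array[String]): Unit = {
    // No .master(...) is set here: spark-submit supplies it (for example, --master yarn).
    val spark = SparkSession.builder()
      .appName("SparkSubmitDemo")
      .getOrCreate()

    // No Hive or Postgres access; just print something to confirm the job ran.
    println(banner(spark.version))

    spark.stop()
  }
}
```

Leaving the master out of the code is a design choice that lets the same jar run on YARN or on a local machine; when spark-submit is given no --master option, it runs the application locally.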
The JAR file should have been generated in the target folder. So this is the JAR file, with the Scala project name and the text "fat" appended to it; it is a 150 MB file. This is the one we're going to deploy and run on the Cloudera QuickStart VM. We'll transfer this file to the Cloudera QuickStart VM through a Google Cloud (GCP) bucket. You can create a new bucket on GCP: search for bucket. You have to make sure the bucket name is unique. Create the bucket, and then you can upload the JAR file here. We'll upload this one file. It will be easy to pull this file to the QuickStart VM from the bucket. The file has been uploaded. Click on it and copy the URL of this file to use in your cloud environment. If you face any permission issue, you have to check the permission settings, but by default it should be public. Now, back in the Cloudera QuickStart VM Docker environment, pull the file from the GCP bucket using wget and the entire path. The file has been copied. You need to set the path for the Hadoop config directory. After that, you can do a Spark submit. You specify the class name, which is our Spark transformer class, and master yarn. We'll use YARN to submit the Spark job; the deploy mode would be cluster. Specify the complete JAR file name and then hit Enter. 35. Doing Spark Submit locally: Let's understand how we can run Spark submit locally. For that, search for Spark download and go to the Apache website. Then you can download the Spark version you want. We had been using 2.1.3, which is not available here, but you can go to the archive section and look for any older version. Download 2.2.3 with Hadoop 2.6. You can also try with 2.7; it should also work. Let me download the 2.6. I have downloaded and unzipped it and renamed the folder to spark. Open the Spark folder. The winutils file that we had under C:\winutils\bin, we'll move to the Spark bin folder. So that is now here. 
So that means we have to set Hadoop home to this directory. Let's do that: set your Hadoop home to point to the Spark root directory. We'll also set a Spark home, which will point to the same Spark root directory. And add the Spark bin directory to the Path variable. So, three changes: Spark home, Hadoop home, and the Path, which should contain the bin folder of this Spark directory. And your winutils file should be under the Spark bin folder. So these are the environment variable changes. Now you can go to any directory and type spark-shell. That will take you to the Spark shell, where you can do Spark programming and write any sample code. Let's understand how to do Spark submit on the local machine. This should work for Windows, Mac, or any other local machine. The only thing is, for Windows you need to have winutils under the Spark bin folder. We have the uber jar file stored under the project's target folder, so we'll run this jar file using the spark-submit command. Let's launch a command prompt. Here on the local machine we do not have any Hadoop setup. When we ran it on Cloudera, we specified whether it is local mode or cluster mode. But here you simply have to say which class, give the JAR file path, which is under the target directory, and hit Enter. Here you will directly see the output in the console because there is no Hadoop involved. On the Cloudera QuickStart VM, the log would appear within the Hadoop YARN logs. But here you can see the output directly in the console. So this is how you can run Spark submit locally and test whether your jar is working correctly or not. We can see that it's printing the desired output. Thank you. 36. Source code and resources on Github: You will find the source code and other resources for our project in the FutureXSkill GitHub repository. Go to FutureXSkill and then Spark Scala. You will see the source code, the Spark submit command, and various resource files. Thank you. 37. 
Thank you and preview of our PySpark course: Welcome to this Python Spark (PySpark) coding framework and best practices course. Step by step, you'll understand how to build a production-ready Python Spark application from scratch. You'll understand how to organize the code, how to do error handling, how to do logging, and how to read configuration from a properties file. You'll also understand how to do unit testing of your Python Spark code. You'll learn what matters in the real world, and you'll be learning Python Spark while building a data pipeline. You'll understand how to read data from different sources using Spark, do the processing, and store it to another data source. This course will take you from your academic background to a real-world developer role.