Transcripts
1. Course overview: Hi, and welcome to the coast pipeline spark for big data. Here I will give you a quick overview of the course and what you are going to learn as part of this course. First, we will be talking about big data and its applications. And how can you analyze big data is in Python and Spark. Spark, RDDs, spot DataFrames and SQL real-time data process using a Spark streaming distributed machine learning models using Spark's MLlib and graphic computation and Capstone project at the end. As per international data cooperation, the studies have estimated that there will be 41.6 billion connected devices, which will generate 79.4 zettabytes of data in 20-25. So this is very big data, huge data. So big data is a term that refers to the datasets, the size, which is beyond the ability of existing database tools. In terms of capturing, storing, managing, and analyzing the big data. We can characterize big data by five V's volume, variety, velocity, value, and veracity. Volume means huge amount of data as per the recent studies. Say one petabyte of data. Variety means different kinds of data, like structured, semi-structured, and unstructured. Velocity means data comes in and goes out with high-speed. Big data has huge value. That's why every industry is trying to adapt data analytics. Veracity, it has inconsistency and uncertainty. So we need to process the data. Organizations from different domains are investing in big data applications in order to extract the important information from it. For examining the large dataset, to uncover all the hidden patterns and known correlations, to understand the market trends, customer preferences, and other useful business information. Apache Hadoop is the basic solution for handling big data. Then later on, Apache Spark, which is faster cluster computing framework, is used for processing, query and analyzing the big data. It works based on the in-memory computation, which is a big advantage over the other big data platforms. Apache Spark can run up to a 100 times faster when it uses the in-memory computation, and even ten times faster when it uses the disk. Dan, other MapReduce tasks. We can also work with four languages with Spark, Java's Kayla, R, and Python. So these are the cost benefits. You can have a hands-on experience on the different components of this park, which allows for real-time exercises, real time case studies. And you can lead the global standards to ensure the compatibility between the next generation of Spark applications. You also have a set of add-ons At the end. Interview questions, frequently asked questions, quizzes, etc. So this is the course methodology. It will be set of presentations and lecture classes, practical sessions, demonstrations, small and large exercises, capstone project, and as value add-ons, Frequently Asked Questions, interview questions, etc. So thank you for enrolling and hope you enjoy this course.
2. Course Intro: Hi and welcome again to the course on Titan and Spark for big data. So who are the target audience for this course? This course is intended for anyone who has pious to get into the field of big data and lightning speed of processing big data, big data and Hadoop into CS. Software and architects and engineers, developers with the minimum programming background, data scientists and analytics professionals for all those who are working in this industry and want to enhance the skills. The main emphasis of this course is on practicality. So this course lays a special emphasis on hands-on learning with different types of real time examples, set of case studies and also a capstone project.
3. Intro to Big Data: Hi again and welcome to the course Python and Spark for big data. So here I'm going to talk about what is big data, big data characteristics, and big data challenges. So as per the International Data corporations, they Studies have predicted that 41.6 billion connected IoT devices will produce 79.4 zettabytes of data in 20-25. So this is a huge amount of data. We can say big data. So big data is a term that refers to the datasets, the size, which is beyond the ability of a typical database or tools to capture, store, manage, and analyze what are the big data characteristics. So we can characterize big data by five V's volume, variety, velocity value, veracity. Value means huge amount of data. So as per the studies minimum, one petabyte of data. Variety means different formats of data from different sources. Basically, we can classify the data as structured, semi-structured, and unstructured data. Structured data means well organized data, which has a fixed schema. Example for this, we can say RDBMS, relational database management system in which we create a database. So database is a set of tables. We can define a table by specifying a schema. Based on the schema, we only need to enter the values. Semi-structured data means partially organized data, which also does not have a fixed schema. Example for this is XML JSON types of data. Unstructured data means are now organized data with an unknown schema. So we can say audio, video, animations, images, type of data. Velocity means high-speed accumulation of data. So data is delivered in high-speed. We need to process and produce results with the same speed. Value means extracting the important information from the big data. So that's why every industry is now trying to adapt and data analytics to extract meaningful information about the business. Veracity means data as inconsistency and an uncertainty. So basically we can say dirty data, so we need to clean it by applying data pre-processing methods. Big Data challenges. So the first one is data capturing. Capturing huge data could be a tough task because the size and volumes are increasing. So there are millions of sources emanating data at high speed. So in order to handle this challenge, we have to create efficient devices which can capture the data at high speed, which also maintains efficiency. Example for this, we can say sensors, which not only sends data like temperature of the room, step counts, weather parameters in real time, but send this information directly over the cloud. Storage is also another important challenge. We need to have efficient storage devices. So in order to handle this challenge, we can go for increasing the disk size and compressing the data using multiple machines to store the data. But when we go for increasing the disc, we lead to purchase efficient storage devices, which will cost us. Compressing the data is also not an efficient approach because while compressing the data, the data quality might be lost. So one of the efficient approaches is going for multiple machines to store the big data. Querying and analyzing the data. This is one of the most important task. Query must be processed and should give us the output with high-speed. In order to handle the challenge, we can look for several other options like increasing the processing speed. It means we go for increasing the processes, which will cost us. So this approach is not good. Alternatively, you can build a network of machines. We can say cluster. Cluster is nothing but a set of machines. In this scenario, what happens is we break the task into set of subtasks and solve each subtask and finally, aggregate the results to have a final output. So this type of mechanism is called distributed computing. In summary, what I mean to say is we need to process the data with high speed in order to have meaningful information in real time. So that's it we discussed about what is big data? What are the big data characteristics, and what are the different sources of generating the big data. So big data applications, El Paso, in analyzing the big data and extracting the important information. So Apache Hadoop was the first solution to handle big data. Thank you and we will meet again with the big data applications video.
4. Big Data Applications: Hi and welcome again to the course on Python and Spark for big data. So here I'm going to talk about big data applications. The primary goal of any big data application is to help a company make clever decisions by analyzing large amount of data, and that means big data. So that could be web server logs, intended clickstream data, social media content and activity reports, text from the customer, emails, mobile phone call details, and machine data shared by multiple sensors. So the organizations are investing in big data applications to extract meaningful information or to find out hidden patterns, unknown correlations to understand the market trend, customer preferences, and some other business information. If you consider health care in healthcare domain, big data applications improve health care by providing personalized medicine and prescriptive analytics. So these are the two important improvements in the healthcare domain. The researchers are mining the data to see what treatments are more effective for a particular condition. So when we consider some particular medication that will not be suitable to all the bodies, all the different types of bodies. So identifying the patterns related to the side effects is very, very important, as well as gaining the important information which can help the patient also reduce the cost. When we say health data, these are the sources from where the data is being generated. Imaging and lab results, genomic data, insurance claim providers, all the mobile health applications, public health data, electronic medical records. So this is the data which must be analyzed so that proper services are given to the patients. In manufacturing. These are the benefits, product quality and defect tracking, supply planning, manufacturing process defect tracking, forecasting the output, increasing the energy efficiency. Testing and simulation of the new manufacturing process. Support for mass customization in manufacturing. In media and entertainment, it's predicting what actually the audience wants. So understanding the customers is very, very important in any business to understand what the audience are looking for. In this case, predictive analytics helps us scheduling the optimization, how to increase the customer acquisition and retention frayed and targeting advertisements to specific people based on their interests and content monetization and new product development. So these are the ways how the big data applications help. Internet of Things mean interconnection of smart devices. And all these smart devices produce huge data. The important issue with IoT is mapping of heterogeneous devices. If you analyze this data, you can gather some information, sensory data, and this sensory data is used in medical and manufacturing context. Even governments have also adapted big data analytics. So in these scenarios like cyber security and intelligence, criminal finding, crime prediction and prevention, pharmaceutical, drug evaluation. As I said, scientific research, weather forecasting, tax compliance, and traffic optimization. Most of the governments of different nations, I'm already started using big data analytics. And they are providing quality services to the common people. Thank you, and we will meet again with another concept in the next session.
5. What is Spark: Hi, and welcome again to the course, Python and spark with big data. So in this session, I will talk about the basic concepts of spot. What is park? What are the different components of Spark? Basic history of spark, its features and limitations. So what is Apache Spark? Apache Spark is a faster cluster computing framework which is used for processing, querying, and analyzing the Big Data. It is based on the in-memory computation. That means it works with the data which is stored in the memory. Instead of getting the data from the disk. It is an open source application and it is free. We can just download it from the internet. It is also known to be a faster cluster computing framework. It can run task up to a 100 times faster, then it uses the in-memory computation and ten times faster when it uses disk compared to the traditional MapReduce tasks. Also note that it is not a replacement of Hadoop. It is actually designed to run on top of Hadoop. Apachespark was originally created at University of California, Berkeley's AMP lab in 2009, and later on became open-source in 2010. It was donated to ApacheSpark Foundation in 2013 and has now become popular from 2014 onwards. Apachespark is mostly written in Scala language. Scalar is known as the native language of spark. It also has co-written in Java, Python, and R. It also provides for APIs for programmers, Java's Kayla, R and Python. You can use any of these programming languages for development. What are the spark fetus? Here we talk about features. First is in-memory computation. The biggest advantage of ApacheSpark is in in-memory computation. That means it makes the data available in memory and uses its data for that computation. So as you know, a computer memory as a hierarchy, if you have data in secondary disc two arrays, you need to bring that from secondary storage to the primary memory. Then caching memory, they'll keep the data in the registers. Once it finds data in registers, then you can use that data and process the task. So if you keep the data in the primary memory, there is no need to fetch it from the hard disk. That's why it is 100 times faster than Apache Hadoop framework, because Spark works with the memory and Hadoop works with the hard disk. The second feature is map-reduce operation. So Spock also supports map-reduce operation. When the data is required for processing, each node has to load that data from the disk and save the data into the desk for computation. This process ends up adding the cost in terms of speed and time, which is the drawback of Apache Hadoop. Later on, if the data is available, then you can execute MapReduce operation on disk. But here, data is not available in memory. You can simply execute the operation. Here. You need to fit the data. That means it requires time to convert the data in a particular format when writing the data in RAM to disk. So this conversion process is known as serialization, and the reverse of it is DC realization. So Spark can perform MapReduce operations on the data frames or on the RDDs, which are the data representations of span. Also, spark supports for languages Python, Scala, and Java. But keep in mind that ApacheSpark native language is scalar, but it also has a code written in languages like Python, R, and Java. So if you're a Python programmer, you can simply do a programming. You don't need to bother about the programming language because you have four options. Spark is known for real time computation, but it also supports batch processing. Hadoop is known for batch processing and ApacheSpark is known for both real-time and batch processing. Batch processing means that data is collected over a period of time and then the processing is done on the data. Whereas real-time computation means just capture the present data, work on that, and get the insight from them. So most of today's problems are real time. That means they are looking for the solutions in real time instead of using the past data and getting insights from that. Sparks supports lazy operations. That means operations are made lazy until you initiate the operation. It doesn't take the time of the processor. That means these operations save the time of the processor. Instead of making it busy. Apache Spark supports multiple transformations and actions on RDDs. Rdd is one of the data representation of spot. It stands for Resilient Distributed Datasets. So on RDD, you can perform multiple set of transformations and actions. What are the transformations and actions? In a later session, we'll talk about this. So in this session we talked about what is pi t Spark and its features. Apachespark is known for fastness, which can run task up to a 100 times faster when you use it in memory and ten times faster when you use it to disk. So these are the spark features in memory computation. Mapreduce operations, language support real-time and batch processing, support of lazy operations. And you can execute multiple transformations and actions over RDDs.
6. Spark Components: In this session, we are going to talk about Spark components. Spock has well-defined master-slave layered architecture where all the spark components and layers are loosely coupled. Here we can't see cluster is nothing but a set of machines where one machine access master and the remaining missions act as worker nodes. Every worker node as executor which actually does the job. There is also the Cluster Manager, which actually connects the master machine with the worker node. The program that we write is called Driver Program, in which we need to create an object to the Spark context. Then we can have a connection to the cluster manager. These are the key terms, Park context, which holds a connection with the Spark Cluster Manager. All Spark applications run as an independent set of processes which are coordinated by Spark context in a program. As I said, the program that we write is called Driver Program, which is in charge of the processes, running the main function of an application, and creating the Spark context. A worker means any node which can run a program in the cluster. If a process is launched for an application, then this application acquires executors to run that job. There is also a cluster manager which allocates resources to each application in the driver program. There are three types of cluster manager supported by ApacheSpark, standalone, Mesos, and yarn. You can install any cluster manager. Every custom manager has its own advantages depending on the goal. There are differences in terms of should duly security and monitoring. So once an object is created to the Spark context, which can connect to the cluster manager. Then the program acquires its executors on the cluster node. And all these executors run the job independently by interacting with each other. What are the different components of Spark? The first one is Spark Core, which actually provides in-memory computation and reference datasets in external storage. It is also known as an execution engine for Spark. Apache SQL, we can write SQL queries to fetch the data from different sources by creating Spark RDD. Once the RDD is created, you can perform set of transformations and actions over in the data supported by Spark is structured and semi-structured. The third component is Spark streaming, which is also known to have fast she dueling capability. To perform streaming analytics. It can ingest data into Spark from different sources and perform the operations over RDD. The fourth component is ML LIB, or machine learning library, which is a distributed machine learning library framework. Will you consider library likes scikit-learn, where it create a model and that model runs on a single computer. But Apache Spark's ML LIB provides distributed machine learning platform. You can create a distributed machine learning model. Spark MLlib, it's nine times faster compared with Hadoop disk-based version of Apache Mahout. The fifth important component is graphics, where you can do the graph computation. Once a graph is completed, then you can run set of tasks over it to find out the insights from the graph. So these are the five components of Spark. And summary. We discussed about these five components, Spark Core, Spark, SQL, Spark Streaming, ML, LIB, and graphics. In the future classes will be talking about these concepts along with their implementation.
7. Spark Limitations: What are the limitations of spot? As we know, ApacheSpark is the next-generation big data tool which is being used by industries. But it has a few limitations. No support for real-time processing. In Spark streaming, the real-time live data is divided into batches of predefined interval. And each batch of data is treated like Spark RDD. Once we have Spark RDD, we can perform a set of transformations and actions which we are going to talk about later. It's not real-time processing, but Spark is a near real-time processing offline data. That means the Spark supports microbe batch processing of data. There is a problem with small files. Spark can generate large number of small files. So this is a nice concept, but the data is stored in D zipped format in S3. This pattern is where we can have large and small trials, but the zip files have to be uncompressed. The anterior file, the lifespan of time will be spent in Berlin these files and unzipping the files in the sequence. So in the resulting RDD, each file will become a partition. Hence there will be large amount of smaller parties ends within an RDD. Now, if you want efficiency in your processing, then RDDs should be a repartition into some manageable format. So this actually requires shuffling over the network. The third drawback is no file management system. As we know, if you want to use the file system, then we need to make use of Hadoop. But there is no separate file management system provided by span. Expensive. As you know, spark is based on the in-memory computation. You need to have a larger memory in order to do the computation. So this is a costly approach. Less number of algorithms, Spark machine learning library support a very limited number of algorithms in comparison to Python scikit-learn library. Manual optimization, the Spark job requires to be manually optimize in order to have a official results. So if you want to partition and cash in Spark to be correct, then it should be controlled manually by the users. Iterative processing in Spark, the data iterates in batches, and each iteration is shit dude and executed separately. Spark as a higher latency as compared to the Apache Flink. And Spark is a time-based windows criterion. In tariff record based window criteria. The next drawback is back pressure handling. Back pressure is a buildup of data at every input and output. When the buffer is full. In order to receive the data, this buffer has to be empty. So ApacheSpark is not capable of handling the pressure implicitly. Rather, it is done manually. So in this module you have learnt what is Park is to your spark. What are the features of Spark? It's components and last, its limitations.
8. What is Streaming: Hi and welcome again to the course, Python and Spark for big data. In this session, we will be talking about what is parks dreamy. We will take an example. After that. What is D-Stream and what are the operations performed over it? So what is Spark Streaming? Spark Streaming is an extension of the Spark Core API, which enables scalable, high-throughput and fault tolerant stream processing of the live data. Data can be ingested from different sources, like Apache Kafka and Flume kinesis or TCP sockets or can be processed using the complex algorithms and can also be processed using the simple operations, MapReduce operations like map, reduce, joint, and window. Once the data is processed, that data can be pushed out to the file system or even to databases. Or you can visualize that with the help of real time dashboards. We can also apply some machine learning on the data. We can also apply some graph related task on the data. This figure uses a clear picture. You can get the data from different sources like Apache Kafka and Flume, HDFS, which means Hadoop, Distributed FileSystem, kinesis, Twitter. This live data can be fed to the Spark Streaming. After processing, the Spark Streaming gives us the process data. We can feed and visualize that data with the help of real-time dashboards. Or it can store that in databases. You can also store that data in Hadoop distributed file systems. So how does it work internally? So Spark extremely receives the live input data. After that, that data is divided into micro batches, which are then processed by the Spark core engine, which will generate processed data. And the ADD process data can be stored or visualize. So this is the meaning of Spark Streaming.
9. Streaming Example: Let's take an example for Spark Streaming. We would like to get the data from the network utility, which is the Netcat. Next, we would like to call on them. So whatever the streams, whatever the data received from the network utility or from a Cline, We would like to count and display them at the server side. You can consider this as a simple master-slave architecture where there is a master and there is a client or slave machine. Slave machines keep on generating the packets and then the Master Machine key point processing them. That means it will display the count. We will import this library. So from pyspark, import Spark context, and then from Pi, Spark dot streaming, we have to import streaming context. So there would be two contexts required, your Spark context and streaming context. Next, creating object for these two contexts, SC means SparkContext is equal to Spark context. This is a local machine, so we have a single node cluster. Local of two means there are two instances running. So one instance can access master and other one is slim. Our program, Lehman's network wordcount. Sse means Spark streaming context, which equals streaming context. We need to specify the parameter that is SC for context. And then one. That means the terminal will be getting the data every 1 second. Then we have to read the lines from the terminal or from the client. So that's why here I'm creating one variable. Lines is equal to SSE dot socket text stream. And this is localhost. It is listening on the port 9999. After reading the text from the client, after reading the line, we want to split that line into words. So words is equal two lines dot flatMap. We are using lambda function. The lambda line is equal to line dot split. Then this line is split based on the space. That means we would like to split each line into words. And then we would like to have the word pairs. What is the word and what is the value. So pairs is equal to words.py map Lambda function. And here we are pairing it like word one. Next, making it count by using the reduceByKey function. So we will just add up the values, that is x plus y. Next, we want to print the word counts dot print. So this is a sample example where we would like to read the lines from the terminal. And then we want to count on each word and we want to display. So this is a continuous activity. Let me open the terminal. Before that, let me run this. Now. Let me open a terminal and let me run the command, Network Utility command NC, Netcat with sensors that data from the client. And this is also working on the port 9999. Next, let's start the session on the server side. It's now starting whatever you type here. For example, Hi, how are you again type high, for example, you can type anything. So I type three lines here. You can see them. These lines are converted into word c. That means in Jupiter notebook. So you see here the Spark Streaming has been started. You can see here the output is 11 time, hi is one. How is one? And you can still see it's still running. Whatever the words you are typing there, the count is displayed here. So here on the server side, what are we doing is we are just counting them. So it keeps on running. Again. If you type something there that time. Again, tie for example, how are you today? So you can see the output again. You can see this. So this is a simple example which illustrates how to get the data, how to capture the data from the network. And we are processing them on the other side, the server side, making a count.
10. DStream: Discretized streams or DStreams in short, is the basic abstraction which is provided by Spark Streaming, which represents the continuous stream of data. The input data stream received from sows are processed data generated by the streaming input. Internally, it is represented as a continuous series of RDDs, Resilient Distributed databases, which is Spark's abstraction of an immutable distributed dataset. As we discussed in the earlier session. Each RDD in D string contains the data from a certain interval. We can visualize this with this figure. So the data is being generated. We can collect the data at different intervals, like the data from time 0 to one. You can have an RDD with respect to time one. Rdd at time do RDD at time three like this. The difference at every interval, you can have data generated and captured. And you can go for the processing. Any operation which is applied on the stream, translates the operation in onto the RDD. That means that the stream is converted into RDD, as we have seen in the previous example, where we converted the lines of words into RDDs. So we applied the RDD operations like map and flatMap. Then we converted the line into Word and then be counted divert. Here you can see the illustration is lined streaming. And then you get all the lines, get line from time 0 to one, and then apply the flat map operation on it. Then you will be getting words from time 0 to one, and then another bad lines from 12 to get word, line from two to three, gigahertz, line from three to four. So this is a line stream. So you'll be getting lines and the lines are converted into words. So this is the meaning of D-Stream discretized streaming, that is continuous data capturing. Once distributed generator or converted into RDD, it can perform these operations. These operations are quite similar to the RDD operations. So the first function is map. It means return our new D-Stream by passing each element of the source. Flatmap. It similar to the map, but each input can be mapped to 0 or more outputs. Filter means you can filter the data based on some condition. Repetition. You can partition the data by creating more or fewer partitions. Then performing the union population. Counting the element, then come by value. It can count based on the value. So it's a key value pair. Then reduce by key. It's similar to the RDD operations. You can reduce the keys, you can reduce the vertebrae keys, which will return the DStream of k. We pairs where the values are aggregated using the given Reduce function. Joining. This means you can join two streams which returned a new stream of K, v, and w. So this VW is a pair with all the pairs of the element from East Key. Then you can have a cogroup. And then you can transform the RDD to IDD are RDD to some other D-Stream. Then update date by key, which will return a new D-Stream where the state of each key is updated by applying the given function on the previous state of the key, which returns new values for each key.
11. What is DataFrame: Hi and welcome again to the cause Python and Spark for big data. In this session, we'll cover about DataFrame. What is a dataframe? What are the different ways to create a DataFrame? The basic operations on dataframe and we will have, and so on, on stock-market did analysis to start with what a Spark DataFrame. So our DataFrame is a distributed collection of rules under the named columns. It is equivalent to a table in a relational database. So similar to the excel sheet with column headers or a DataFrame, R or Python language bite with richer optimizations, rules can have variety of data formats. We can say heterogeneous, whereas columns can have data at the same time, we can say homogeneous. Data frames usually contain some metadata in addition to the data. For example, column and row names. What are the fetus of DataFrames? Dataframes are immutable. That means we can create a DataFrame, one similar to RDDs, but cannot change. And we can perform transformation on a DataFrame after applying transformation operations. Lazy evaluations. It means our task is not executed until an action is performed. Lastly, DataFrames are distributed, so these are the three features of data frames. What are the advantages of DataFrames? Dataframes are designed for processing a huge collection of structured and semi-structured data. Observations in the Spark DataFrames are organized under the legal columns, which held ApacheSpark to understand the schema of the DataFrame. This also helps ApacheSpark to optimize the execution plan on the queries Dataframe in Spark as the ability to handle petabytes of data. So this is one of the striking features of Spark. Dataframes support verity of data format. For example, you can collect the data from Hive. We can collect the data from Kafka. So different sources and different formats are supported. It also has a PA support for, for languages Python, Scala, and Java.
12. Creating DataFrames: How to create a DataFrame? There are three important ways to create it. You can create a DataFrame from different data formats, such as loading the data from disown or a CSV file, loading the data from an existing RDD and programatically defining the schema. So here you need to first import py spark, import py spark, and then you have to create a Spark context. So from pyspark import Spark context as I sc, sc is the alias. Then we need to create a spark session first. So from spice Park dot SQL, import spark session. And then creating objects of the spark session. If the session exists, then we can just get it. Otherwise, you can create it. This is the name of the application, DataFrame basics. So we are creating spark session. First. I'm creating a DataFrame from the JSON file. So the F1 equals sparc one dot dot decent. So this is the function which reads the data from JSON file. So res.json. And this is my file employee dot JSON. I'll create it in the data folder. The data is read and the DataFrame has been created. So once the DataFrame is created, we can take it out. What is the content? Taking time the DataFrame is created. Now, df one tells what are the columns in the DataFrame. So we have each column and the data type is a big integer. Named column is also there with a string type, and salary is big integer. So there are three columns in the DataFrame. And we would like to show the DataFrame content by using the show method. So df 1.So shows the DataFrame. So there are three columns and three observations or samples. We can use print scheme monitored to show the schema of the DataFrame. Like I said, there are three columns, Aedes, lean, and salary. The datatype can also be mentioned. If you just want to retrieve the columns, What are the columns in the DataFrame? We can use the property df.columns. It will list out the columns. As you can see, there are three columns. If you want to perform descriptive statistics of the data, we can use the describe function similar to Python, Df V1 dot. Describe. So in this case, it has just listed the names of the columns. But if we want to show the descriptive statistics, then we have to use the show method. So df f1, df.describe.show, then it will actually give us the descriptive statistics of the data. So this is the summary count. In aid, we have two observations. One is missing in name. There are three observations. Salary also has three observations. The count is 233, the average age is 26. The name we cannot find average. So the average salary is this value. The standard deviation you can just see in the table. Minimum value, maximum value. So this is the descriptive statistics of the data. If you want to have more summary, then we can use the summary method D F1 dot summary dot show, which will show us the summary of the DataFrame by introducing these meshes slide 25th percentile, 50th percentile, and the 75th percentile. If we compare sell 1011, What are the differences? We can see there are only the percentile added. The rest are there in the describe function. So this is one of the ways of how to create our DataFrame. We have loaded the data from the soil type. And the second way of creating DataFrame is you can define the schema. For example, from Pi Spark SQL data types. We are defining these. We are importing these StructField, integer type, String type, and StructType. And then defining our own schema struct field. I'll then name is subtype, the string type, integer type, and then Salary is integer type. Our data for the schema has been created. Sorry, I didn't run the cell. We define our own schema and then converting this truck into the structured tied with our user-defined names. And then reading the data from the file. You can make a notes park one dot dot d zone. This is the file. Then Schema is equal to data's truck. So that means user-defined schema is applied on the data. And then we are showing the fields of the data. So as you can see, there are three observations and three columns. Are there three rows, three columns. Then we can take out the schema now. For H, it is an integer for salary also integer. If he check it. The structure of this schema compared with the previous one, there is a difference. By default, you see your salary is long type. Now we define a salary as integer type. Now, he can't see the difference. Here. Salary is integer type. How to create DataFrame from RDD? So this is the process. First, you have to create a list of tuples, then create an RDD from list, and then create a DataFrame from RDD. Import the Spark context if it is not important. And dense part configuration parameters are specified. So fc is equal to SparkContext dot get or create. And the Spark configuration setting the parameter. That means we are running a local host. I'm creating a list first from Pi Spark, SQL, import rho. And this is the list. Here. I'm mentioning the names of four people remain, and 25 is the age of the person. Similarly, he mesh 22 and arrange 2206. So this is my list. Then creating RDD from the list. Rdd is equal to Spark context as SC dot parallelize of alphabet L. That means this data is distributed over the local cluster. And then mapping these values into two columns named Andrew. So one of the columns is named, which is x of 0, FirstColumn, and x of 1. Second column is eight. From this, we are creating dataframe. So schema people as equal to SQL context dot create DataFrame from the people. Then taking that type of schema people. So this is a dataframe. It is Pi Spark dot equal that dataframe. And schema dot Collect will collect the rules from the cluster. So there are four rows. You can see that there are two columns, age and name.
13. DataFrame Operations: Hi, and welcome again to the course Python and Spark for big data. So in the last session, we discussed about DataFrame. What are the different ways of creating DataFrame? So in this session, we will talk about the different operations performed on the DataFrame. We are going to take one case study, which is dark market data analysis. First, we need to create a spark session. We can do that in this way from Pi Spark SQL import sparks session as ss is the alias for the SparkSession name. Then we need to specify some configurations of the application. So Spark equal to SS dot builder dot app name stock operations. If the session exists, then we can retrieve it. Otherwise, we can create a new session, which can be done by mentored, good, or create. When the session is created, we can load the data and we can create the DataFrame. So df is equal to spark dot-dot-dot read.csv. And this is our file application underscores talk dot CSV, which is available in the data folder. The inferred schema is equal to true. It means it follows the syntax or structure of the data file. Header is equal to two. It means the file's already had the names of the columns, so we can retrieve them. Once the DataFrame is created, we can take the print schema. This is a schema. We have seven columns, date, open, price, high, low value, close volume, and adjusted close value. This is the schema. Then we can use a show method to show the dataframe. Here it can list out all the columns. We have seven columns. It is showing us Top 20 rows. If you want to see the column, we can use the property df.columns. If you want to check the number of observations, we can use Khan matter. So as per the results, here, we have 1762 rows. And then taking the descriptive summary of the data frame. So we have a method summary. So dF dot summary dot show will show that descriptive summary of the data. So here we can see that statistical measures like count, mean, standard deviation, minimum value, 25th percentile data, 50 percentile, 75 percentile, and then maximum. So that is giving us these values on all the columns. We can use the filter method tf dot filter to filter out the data. My condition is this close, less than 500. Show File. That means it will list out top five rolls on which this condition is applied. If you look at the values of the close column, all these values are less than 500. Here, the filter method is years. If you want to take out the null values in a particular column, you can write like this. D f dot filter of df.loc is null. Show low is the column is now show, will show if there are any null values. So as per the result, we don't have any null values. And filtering the data where the condition is close, less than 500. I'm showing only one column Open. Sure, five means it will list out top five rows in a particular column. With the condition applied. You can see all these values are less than 500. Close less than 500 is a condition. And then we are selecting two columns, open and close. Show five. It will show the top five rows from the two columns, open and close. We can select three columns here, date, opening price, closing price, and the condition applied is close less than 200. So there are three columns. Condition is applied close less than 200. He can also apply multiple conditions. For example, df.loc filter of df. And then the DataFrames close value is less than 201, more condition open is greater than 200. Now you can see open value is greater than 200 and close value is less than 200. So multiple conditions are specified. Then select two columns, open and high, showing the top five rows. Here are the top five rows. We can drop all the rows with null values. This is the function drop in a dot count. It will drop all the rows which are having the null values. So there are no null values. That's why we are getting the original size, which is one 76 to Then one more condition is opening greater than 300 count. How many observations are following this condition Open greater than 300. So there are 918 observations which are having the opening value greater than 300. Then grouping the data based on the date. And we are applying the lean operation on the open column. Now here, the data is grouped. So you can see on a particular date, what is the average open value, which is 321.08? Close less than 200. And we are applying the or operation close less than 200 or df of open greater than 200. And then showing the values. You can see our clothes values are less than 200 or either this condition or this open greater than 200. So these are the observations, but it is showing only the top five rows. We can also use the negation operation. This is close less than 200, but not opened less than 200. Show, will show the observations. So there are three observations which are following this condition. Close less than 200 and open less than 200 and show, it will show all the observations which are following these two conditions. Here, I would like to list out low. The condition applied is this one. It means how many observations are having the low value, 107.16. So there is only one observation. We can use collect method instead of show to collect the data from the cluster. So DFT dot filter of df of law is equal to 197.16 collide, which means that it will written the outcome as a Python object. You can see it is returning as a Python object. Date is equal to date time format. This is the date opening prices, this high value and low value is this much. Close value is this, and volume is this. And I just to close value is 25.62, like this. Next, I'm applying the same operation, but storing the result into the variable result. That's done, then checking the type of the result. So it is an excuse for rho. We can store the result into the row variables. We are converting this rho into Python dictionary. Dictionary means it has key value date and it has associated value like open. And this value like this. We have the column with their associated values. So we converted our results into the Python dictionary. If you just want to list out all the values of this result, that means you have the values of all these columns. And we can use for loop and list them out. So these are the values. In this session, we learned the basics of the different operations that can be performed on stock market data.
14. MLLIB Intro: Hi, and welcome again to the coast, Python and bug for big data. In this session, we are going to talk about what is ML LIB and overview the machine learning algorithms that it provides and also hands-on experience. But first, what is Park MLR? Spark? Mllib is a scalable machine learning library that is used to build distributed machine learning models. It provides to Modules. Smart dot HTML limb, which contains the original API built on top of sparks resilient distributed databases. It is currently under maintenance mode. The other package is Spark dot HTML, which is providing the higher level API built on top of dataframes for constructing machine-learning pipelines like distributed machine learning models. It is also the primary machine learning API for Spark at this moment. Spark ML LIB provides these facilities. First, machine-learning algorithms. These machine learning algorithms are used for classification, regression, clustering, and collaborative filtering. For recommendation systems. Featuring spark includes feature extraction, transformation, dimensionality reduction, and selection facilities. Pipelines. It means constructing or building a model, a series of stages. Spock is also providing the facility for that persistence. It means saving and loading the algorithms, models, and pipelines. Lastly, it also provides the set of features, we can say, utilities for linear algebra, statistics, and data handling. So we are going to discuss the machine learning algorithms and some other features provided by Spark ML LIB with case studies.
15. Feature Transformation: In this session, we will see how to transform your data into machine learning format. Since most of the machine learning algorithms do not work with categorical type of data. We need to transform the data into numeric. So we will take an example over here. But first, you need to create an object to the SparkSession and also spot. So importing the SparkSession first and then Spark is equal to SparkSession builder app name. I am giving the name of the application as feature encoding. If the session already exists, it will retrieve it. Otherwise, it will create a new session. Then I'm creating a Spark DataFrame over here. Df is equal to spark create DataFrame. Here, 0 Apple. This is one element. Another element is one banana, two, avocado, three, Apple for orange, and five Apple. The name of these two columns are IID and category. The DataFrame is created and then we can show the DataFrame. It is processing. Take some time for the outbreak. Now here, this is the dataframe. We have totally six rows and there are two columns. So basically our category is the column which is having the categorical type of value, string type of values. So if I use this dataframe and it created a machine learning model, then the model will not be created because the algorithms do not work with categorical type of data, we need to convert them. In this case, we can use one of the functions, one of that times farmers is string indexer. We can import it from Pi Spark ML feature. From Pi Spark ML feature input string indexer. And then indexer is equal to string indexer, an input column as category. The output column is category index. Next, we need to fit this data and then this data is to be converted into the numeric. Then we can see the output over here. So a new column will be added for category index. It is treating Apple as 0. By default. Banana is 1.00. Avocado is three. Again, Apple is repeated, so the value is 0. Orange is not repeated. It is a new one. So it has been encoded as 2.0. APA is repeated 0. So this is the conversion of the column category into the numeric type. We can also have one mode transformer, which is a vector indexer. This is a transformer which combines all the given list of columns into a single vector column. For example, here, we have these features. Id, our mobile user features and collect. Then what we can do is we can combine all these features and we can create a vector less. This is a vector where we combine all these values. This is probably used in building a model. Whenever we create a machine learning model, we need to define what is x. It means it's the feature vector or list. Then one more variable is y. So in X, you lead to list all the features. So for this purpose, you can use a vector assembler with a mine's all the features. So here I'm creating a dataset is equal to spark create DataFrame. And these are the values 01. And these are converted like this. These are these values. And then vector dense are these. And then the value is 1. In this case, this is converted into a data frame. So these are the values ID, our mobile and user features and clicked. You want to combine these. In this case, I can use a vector assembler. Assembler is equal to VectorAssembler. We're combining our mobile and user features and treating them as features. So this is the label for this combined one. Then we can transform this dataset into machine learning algorithms dataset. Then we can print these things, printing the assembler columns, like our mobile user features to the vector column features. And then we can display. Here we are selecting two things. The features where we combine all the fetus and the last variable clicked is separate. Now we can consider this as our x and this as our y. It means target variable. In similar situations, we can use vector assembler to combine the values. Another one is one-hot encoding, which is converting the categorical feature to the binary vector. It maps the categorical value to the binary value zeros and ones. If it is present, then it makes one otherwise 0. For example, here, I'm creating a DataFrame. Before that, I had to import the function one-hot encode our estimator from Pi Spark ML features. And then dF is equal to spark create DataFrame. These are the elements as we used in the previous. So DataFrame is created with these fruit names. Then we need to convert this into one hot encoder. So my input column is ID. So the one-hot encoder takes the input column id and then it gives the output column with the name category vector one. And then fitting the data to the encoder. After transformation, we can see the data. This is a new column, category vector one. We have the equivalent values of categorical names here. Happily, banana, avocado, orange. So totally army unique values or their Apple 123 and then 404 are there. If you consider Apple again, the total is five. So there is a total of five, ID's, 0 to six, total of six are there, but we have five values. So five comma. If element is present, then it makes 1.00. Otherwise it makes 0. So this is a conversion of one odd encoder, which converts the categorical value to the binary vector, where we can have zeros and ones. So these are the three important functions which are used for feature encoding and feature transformation.
16. Linear Regression: Regression analysis is a predictive modelling technique which investigates the relationship between two variables. The first variable is dependent variable and the other is independent variable. So in the regression analysis, what happens is we can find out the best fit line with minimum error. The data is fitted over the line. And there might also be some data which is not returned to the lines, and that is considered as the error of the model. The main objective of this linear regression is to check whether the independent variable explains the dependent variable. We can define Equation of y is equal to beta 0 plus beta one multiplied by x plus epsilon. Y is the dependent variable, or we can say target variable, beta 0 is the intercept and beta one is the slope of x and epsilon, which is small error term or residual. If you know the value of beta 0, beta one, X, and epsilon, then we are going to get the value of y. We will take a small case steady here, salary prediction based on the experience. Here we have data. We also have two columns. Here's an experience and salary. So salary depends on the experience. We need to build a model for this. So our target variable is salary, and the independent variable is years of experience. So years of experience is input. And from years of experience, we are getting the value for salary. So these are my input. Here. I have to create a DataFrame. Before that. Sparksession is also created. So these are the SQL context, Spark configuration. Sparkcontext. We are importing Matplotlib to plot a graph and also importing this polyfit to have a regression graph. So let me run this. If Spark is already running, I would like to stop it and then open the connection again. The configuration is equal to spark on headmaster. This is a local machine and name of the application is machine learning linear regression. Spark context is created by creating the object to this SparkContext. Sql context is also created. After creating these contexts. Here, I'm reading the data. My data file is available in the data folder. So data is equal to SQL context dot read CSV. Then this is the file, and this file already has the header. So header is equal to true. Infer schema is equal to true. It follows the structure of the file. From this, the schema is defined. Then for any machine learning model, first, we need to determine what is x and what is y. So x is equal to years of experience. We are converting data into Pandas, and then this will be the one with column years of experience. Then y is salary. Then we are plotting a graph of x and y values, and the color is red. The x-axis label is years of experience, y-axis label, Salary, and male title of the figure is linear regression. Then, with the help of polyfit function, we are plotting the values of x and y. Sorry, I did not compile the first cell. Now we can see the graph of the linear regression line. It takes time for processing. This is the graph of the training data. As you can see in the figure, there is a positive relation. The correlation between salary and years of experience is positive. That means if one increases, the other one also increases. So there is a positive regression. We can see that there are some points or objects which are not falling on the line. That is the error of the model. That means the data is not fitted to the model. We can set the schema of the data. We have two columns here. Now, let me create what is x and what is Y? That means what is the training and what is the testing data? These are the fetus. I'm going to use VectorAssembler, as we saw in the last class, which assembles the features and combines them as a list as a vector. Then really Regression is a main algorithm and then every model has to be evaluated. So for linear regression, this is the regression evaluator which is imported from Pi Spark ML evaluation module. So once you import these functions, then here I'm selecting only the data, which is years of experience and salary. Considering the salary as a labeled and this is a selected data, data to then spreading this data into two parts, training and testing. Train test is equal to data to random split. So this is my split size, 70% training and 30% testing. Then assembler is equal to vector assembler set input column. So this is my input column. Years of experience, and i just now considered this as a feature. My target variable is this salary. So if X and Y are defined, that means if features and labels. Target variables are defined, then we can transform the data. And then we are subsetting the data trends 0-2 with these features and labels. Then we can show what is our trained 0 to subset data. So these are the training data. For example, for 1.1 years of experience, the salary is 301343. So you can see the trend of the training data. The model is trained. With this training data. We can apply some testing data to make the prediction. So here I'm creating the model LR is equal to linear regression. Fitting the data, train 0-2 to the model. Transforming the distinct data model makes a prediction. And we can see the predicted values of the model. So these are the predictions as per the feature. The feature value is 2.2 from 2.2 years of experience and then training data salary is 39,891. For this, the model has predicted 47,627. If the person is having 2.9 years of experience, then the predicted values 54,143, like this. So these are the predictions made by the model. We can also take the coefficient and the intercept values, beta 0 and beta one. So beta 0 is this 9.36308. So we need to evaluate these models. For that, we use a regression evaluator. These are the measures, R-squared, mean-square error, reduced mean-square error, and mean absolute error. So we expect that the value of R-squared should fall between 01. If the value is near to one, then the data is properly fitted to the model. If the value of R-square is near to 0, then that is considered as a weak model. So data is not properly fitted. So R-Squared wheels, this is the measurement which determines how good our algorithm fits the given data. So as I said, R-squared best value is 1.00 unless value is nothing but the moral astronaut fitted. So as per this R-squared values, 0.94 to 0, that means 94.2% of data has been fitted. Sometimes the value increases or decreases. So here r-squared value is 94.2. So we can change here. So that is the R-squared value. And the value is almost better, almost near to one. So you can say this is one of the best models we can use for salary prediction. If you give the years of experience as input, then you can make the prediction of the salary.
17. Logistic Regression: In this session, I'm going to demonstrate logistic regression model. We are taking one case study, Titanic survival. First, we have to create an object for the SparkSession in Spark context. So importing the libraries of SparkSession from spice Spark ML. Our algorithm is this logistic regression. It is a classification type of algorithm. We can import it from Pi Spark ML classification. Once these libraries are imported. Next, I'm creating objects for the spark session. And then my application name is this titanic logistic regression. If the session already exists, then get it otherwise created. That is the use of ghetto Create Function. And then reading the data from my data folder. The filename is Titanic dot CSV and infer schema as equal to true and header is equal to true. That means there is a header and the structure of the file is followed. And we can see the first three rows of the data. So I think you might be well aware of this dataset, titanic survival. We need to predict if the person has survived or not. So there is a target variable survived. Based on the other features. These are the first three rows of the data. Passenger ID 1-2-3. Then survived is a target variable. It has the values 01. So binary classification problem. And these are some other variables. P class means ticket class, name of the person, gender, age, and siblings for spouses, number of parents or children, ticket and then fair. What is the price, cabin number embarked. There are three values for embarked as c, q. So this is the data, and then we can take the schema of the data. So these are the columns. It is also showing the data types of these columns, integers. You can see this name is string type, sex or gender is a string type. Then you get the string type and our target variable is integer. So this is the schema of the data and we can see what are the columns by using this property, df.columns, which lists out all the columns of the data. I'm going to select one, lead the numerical columns. So over here, my call is equal to DF, select of survival, pclass, sex, eat, SP, siblings or spouses, or parents or children, fare and embarked. These are the selected values from my column. Then I would like to draw up if there are any null values from the data. So my call, any dropped dot, the missing data is dropped. Then I have to convert these features, whatever the String type of columns we have, we need to convert them, as we discussed in the last session. So we need to use the feature transformers to convert the data into numeric. I'm going to use this transformer like VectorAssembler, which converts the data and gives us a vector of features. String indexer to convert the string type of columns, vector indexer, vector column, and string indexer. We're also using this one-hot encoder. So these are the transformers that we are going to use. First, I'm considering the feature sex agenda, that this input column, we are going to have an output column that is TSX index. Similarly, after converting this into a number, then we are also using one-hot encoder to encode the values as one or 0. For example, if the particular column or row is less, in that case, it can label it as one, otherwise 0. So that is the use of one hot encoders. So once this is transformed, that means we have transformed the gender by applying string indexer first, then one-hot encoding first. Then one more feature we have is embark, which is also string type. We need to convert that embark. An indexer is equal to string indexer. The input column is embodied and then we are going to have the output column embark index. Next, we are going to apply the same functionality to perform 100 encoder, which converts the values into ones and zeros. Then I'm going to use assembler to assemble these features. So assembler is equal to VectorAssembler of input columns. So these are my input columns, pclass, sex vector, age, siblings or spouses, Piot, fair, and I embarked vector. So these are my features. These are considered as the features. I assemble all these things and label them as fetus. They are referred with the name features. So that is the use of VectorAssembler. The fetus have been transformed and we have used VectorAssembler. Now we have a list of features. We can define a pipeline. So Pipeline means we can define a process. A process is nothing but a set of steps. So first, what is a step-by-step procedure to build a model? In that case, it can define that by using Pi's parks pipeline feature. So from pyspark dot HTML import pipeline. And this is our algorithm. We are applying the logistic regression algorithm over the data. And then our model name is large red logistic regression. Then our features are these feature column is this features and label column is survive. That means we are going to find the relationship between the fetus and survive. Then this is a pipeline. So Pipeline is equal two stages. So what are the stages here? First, gender indexer is called, an indexing is done. Then embark indexer, then encoding of gender, then encoding of embodied. Then we call the assembler, which gives us the list of features. Lastly, the logistic regression. That means the more or less built. So before building model, we need to perform these operations. Sorry, I didn't not run the previous cell. Let me run it first. Now, pipeline has been defined, then splitting of data into two parts, training and testing. So final data, random split. Random splitting is done with these values. 80% is training and 20% is testing. So the data is split into 820. Then we fit the data to the model. And then the results are retrieved by applying the testing data. Now, we have the results of testing data. The predictions have been done. And we are selecting two things from the results. The prediction under survived value. I'm showing only the top five features, TOP five observations from the data. You can see the prediction is this 11, but the actual data from the training data is 0. So the first column is the predictions of the model. This one, these are the values of the training data showing the top five rows. And then we are grouping. And then we would like to know how many people have survived and how many did not survive. What is the value for one and what is the value for 0? We do that by aggregating the data. These are the values. We have 20% of the data. For this 20% of the data, 93 observations fall into the category 047 observations fall into the category one. So one means survive. So 47 people have survived. 47 observations fall into a category, classes 1.00. So this is a class distribution. We can say it's a binary classification problem. Two classes are there, and these are the classifications of the samples. So we need to evaluate our model. In that case, as you know, one of the important metrics for logistic regression as ROC, which stands for receiver operating characteristic. It is one of the metrics for logistic regression model. This is used for binary classification. What it does is it finds out the true positive rate. That is the value of this is equal to true positive rate minus the negative positive rate. In other terms, we can see true positive rate is nothing but specificity. So ROC Curve plots sensitivity, which is one minus the specificity. Importantly, we need to know the value of AUC, which stands for area under a curve. You might have learned the basic machine learning Scikit Learn. So we have this measure, area under curve, which shows the area fitted to the model. So our evaluator is this binary classification evaluator or logistic regression. And our predictions are these. These are the raw predictions. This is the target variables survive. And the metric uses ROC area under curve. So what is the value of AUC over here? It is 83.40. So 83.40 data is fitted. So this is the performance of the model. So this demonstrates how to create a model by using the logistic regression algorithm. We took the Titanic survival case study, and finally we got the performance which is 83.40.
18. Tree Methods: In this session, I will discuss about Tree-based methods provided by Spock. First, you might have heard about decision tree. A decision tree is a flowchart like structure in which each internal node represents a test on a particular attribute. The brand represents the outcome of the test. And each leaf node or terminal node represents a class label, which we can say is outcome. The result from root to the leaf represents the classification rule. So we can form classification rules after having a decision tree. And these classification rules are used to have the prediction of the testing data. So decision tree is a type of supervised learning algorithm. It works for both categorical and continuous type of data. We have two types of decision trees. Wine is a classification tree, and the other one is a regression tree. If our target variable as a categorical type of data, then we can consider a classification tree. If the target variable as continuous data, then we can construct a regression tree. So Spark provides both the types of algorithms, classification and also regression. So here I'm going to take a small case study. We have data which is taken from UCI Machine Learning Repository, which is related to direct marketing campaigns for phone calls of particular banking institution. This is a classification problem. We need to predict whether the client will subscribe to a term deposit or not. So it's a binary classification problem. The outcome is either S or no. We can represent this as one 0's. As usual, we need to import the libraries. We are creating spark session here from pyspark dot SQL. Import SparkSession creating object for this SparkSession builder happening, for example, and bank tree. If this does not exist, we can get it. Otherwise, we can create a new session. Creating a Spark DataFrame. Df is equal to spark dot-dot-dot read.csv and higher bank dot CSV, which is stored in my data folder. And following the headers of this file. And schema is also followed. So INFFER schema is equal to true. What are the input variables in this data? Let's go for displaying the content of this DataFrame by using pandas. Import pandas as pd, pd dot DataFrame. So DataFrame is created from this df, dx of phi. That means displaying the first five rows, which is in the form of pandas DataFrame. So this is our data. So we have these input columns like aide, job, marital status, education, default, balance, housing, loan amount, contact, day, month duration, campaign, PI days, previous p outcome and deposit. So our target variable is deposit. That's a binary classification problem. As usual, we need to determine what is our training data and what is our testing data. So we need to go for splitting. Before that, we need to preprocess the data. We are displaying this as pandas DataFrame because Pandas is better than sparks DataFrame. So just display in the form of Spark DataFrame. These are the data values, data fetus, age, job, marital, education, and last rehab, the target variable, deposit as values yes or no. Then we can group the data in to understand the class distribution. That means how many samples are in class? Yes. And how many are in class? No. If you look at this deposit, no. Are these mini observations 5873 and ESR 5-2 89. So from this, we can say that this is a balanced class distribution. Let's take a descriptive statistic, descriptive summary of the data by using the pandas described function. This is a descriptive statistic. These are the numerical features, age, balanced, day duration, campaign, PDAs and previous. These are the statistical measures. Count mean, standard deviation, minimum value, 25 percentile data, 50 percentile, 75 percentile, and maximum. So the maximum ages 95 and minimum is 18. You can make a note of maximum balance Available. Maximum balance is negative as per the data. So this is the descriptive statistics of the data. From this, we can select only few columns. Here. Day and month columns are not necessary. So we just remove them. And we are selecting these features is job, marital, education, default balance, housing, loan contact, duration campaign PI days, previous p outcome deposit, except the day and month column. So we are selecting all these fetus and creating a subset data that is df. And these are actually the columns. These are applied to the data store in calls and then printing the schema. So this is the subset of the data. If you make a note of the data types, for example, this job is string type, Maiden String type, education string type, default, String type, housing, contact, be outcome. And also the target variable deposit isn't string format. So as we discussed in the last session, we need to transform these categorical type of data into numeric. So we have to support some encoding method, as we discussed in the last class, our last session. So we need to go to for preparing the data. We can use this encoding methods like one-hot encoded string indexer, vector assembler to group the data. And these other functions. These are the methods which are imported from Pi Spark ML feature. We have these categorical features, job, marital, education, default, housing, loan contact, and B outcome. We're just creating a list of these categorical columns. And then this is the empty list stages. Then for each categorical value in the categorical column, we are applying the string indexer. Then later on we are applying one-hot encoding. Then adding to the stages. All these are converter. We can have a numerical column and categorical type of columns, and we are grouping them as feature variables. We can define a pipeline for this problem. We can import a pipeline. Pipeline means we can carry out our operation in steps one by one. So first we do the pre-processing, then encoding, then building model, then evaluating a model like that. We can define a pipeline. So from Pi Spark ML import pipeline. And then pipeline is equal to pipeline of stages. It is equal to whatever stages we mentioned earlier. Then we fit the data to the pipeline model and then transform the data. Then we had these things are classed, label and the fetus. And we can go for printing the schema. We can have all the converted values. Now, these are the DataFrames and selected variables. Then taking the first three rows of this data and then taking a transpose to give them in proper order. Now, these are the fetus converted features and we have the label. Importantly, these two are important, label and features. But this we can go for building a model. Next, once the data is converted, we can go for splitting the data into training and testing. So that train data size is 0.7 or 70% and testing data size is 0.3. We are also setting the seed value to reproduce. And then we can determine what is the train and test calm. After applying random spread function with a given threshold values, we have training dataset, dataset count as 7764. That is 70% data of the original. And testing data is 3398, which is 30% of the original dataset. Now we have processed data's split data. Next, we can proceed to building the model. First, I'm applying the decision tree classifier. The prerequisite is you must have knowledge on basic machine learning package, scikit-learn, what is decision tree and how it is selected and everything. So from Pi, Spark, ML classification, import the classifiers. This is a classification type of problem because our target variable deposit as a categorical type of value, yes or no. That's why we are importing the decision tree classifier and then creating a model. The name of the model is DT, decision tree and then decision tree classifier. So we had the labeled with these features. Labeled column has the label, and then we are defining the depth of the tree. Maximum depth is three with this condition, that tree is constructed and data is not fitted to a model and then transform. Then from the predictions, we can select these features, a job label, and what is the raw prediction? What the model as predicted, and what is the probability? I'm showing the top ten rows of the data? It takes a bit of time to process. So you can see the age is 37 as management level is 0, original label. And the prediction is this. These are the features. These are the labels and features that the more or less predictor, the classes zero-point zero. And the predicted probability is 0.83. All this, there is a small variation, especially for this observation, 1.00 predicted classes 1.00. So the probability is 0.39. We can make some conclusions. If the probability is less than 0.5, then we can make it 1.00. If it is greater than 0.5, then we can conclude that 0 class means no default. Once the decision trees are built, we can apply evaluator. It means every model has to be validated by applying some evaluator. So from Pi Spark ML evaluation, import your binary classification evaluator and then apply these predictions. What are our predictions? Are? Predictions are actually stored in predictions. Then our evaluation metric is area under curve. Let us see what is the performance of the model. The performance of the model is 0.7, which is area under curve. 79% data is fitted. Even though it is poor performance, the accuracy of the decision tree is improved by going with further advanced methods like ensemble method, random forests, and gradient boosting. Next, I'm going to apply random forest classifier, which is an ensemble type of algorithm. We tried to create a set of trees. We can say forest and choose the best prediction from all these trees. So from Pi Spark ML classification, in putting the algorithm a random forest classifier and building a model. The model name is RF, Random Forest. And then we had these features and label column is labeled, fitting the data, transforming the testing data, and then selecting the same features as we have done before. And then showing the top ten predictions of the testing data. Then we need to evaluate the random forest model. I think it will improve the performance. Now you can see these are the predictions of the model. Zeros 00 here, only one, 1.00. Earlier we had to 1.00 for these two. Then let's evaluate this model, random forest model. The area under the ROC value is 0.88. Earlier we got the value that is 0.79. Now it's 0.88. So there is a big improvement in the model performance. Similarly, we can go for another ensemble algorithm, gradient boosting tree classifier, DBD classifier, which is available in this module from Pi Spark ML classification. Here. This is our algorithm DBT classifier. Then to this algorithm, we are specifying the maximum iterations and then fitting the data, testing the data, pig making the prediction, and also again showing the top ten rows. It takes quite some time because we're specifying the iterations and so many trees are generated. So here we have top ten rows. Performance prediction is done. Let's evaluate this model by using the same binary classification evaluator. And let's see the value of area and the ROC, which is 0.9. So there is an improvement. So totally we built three models. Decision tree, Random Forests, gradient boosted tree classifier. So what is our final conclusion? Gradient boosting tree as given the higher accuracy, which is 0.8989% is the accuracy of the model. So this is the best model compared to the decision tree and random forest.
19. Recommendation System: One of the striking features of ApacheSpark is to build a recommendation system. A recommendation system is a technology which is employed in an environment where all the items like different types of products, movies, emails, and articles are recommended to the users, customers, Visitors, application users, readers, etcetera. This recommendation system these days are playing an important role in every application in all the domains. Companies like Netflix, Facebook, Amazon, they're already started using recommendation systems to recommend movies. Friends recommend some products to the users based on their browsing history or summer activity they perform. So what are the different types of recommendation systems we have? Importantly, there are three recommendation systems. First one is content-based. In content based system, the algorithm tries to look for features of the items and then determines the similarity whether or not these features are present. Features can include characteristics of movie, the year of publication, whether or not a certain actor is present. So based on these features, the movie or the particular product is recommended. The major problem of content-based system is that the user can get stuck in a bubble where no serendipity, here's items are recommended. A decently expert humans are required to recommend the product. What I mean to say is this system's performance will be low. If the content is not properly analyzed, then recommendations can be of lower quality. We can go for collaborative filtering. These are the popular type of machine learning techniques which use the similarities between the users and the items. Domain knowledge is not required as it is required in content-based systems. The recommendation can go outside the regular content based because they recommendations can be made based on what the user likes. So here, uses data is important. Getting the details about the user is very important. For example, the length of time watching a movie can signal that the user liked a particular movie or video or he disliked it. So if you keep this information of the user, then it will help us build a very good recommendation system. The ApacheSpark is providing a very good algorithm for us, the Alternating Least Squares or OLS algorithm to build a collaborative filtering based recommendation system. The third type of system is knowledge-based. These are very intelligent. We can say that this is a combination or hybrid model of the previous two, content-based and collaborative based. In this example, we are going to take a MovieLens data. It's an open-source data. Let's build a movie recommendation system. First, we have to import some required libraries. So from Pi Spark SQL, SQL import spark session, the aliases SS. Also, we are importing this algorithm ALS. Then in order to evaluate our model, we are using the regression evaluator from Pi Spark ML evaluation. So this is important. Then our application name is recommender creating session for this. If this does not exist, we'll get it. Otherwise we just create a new session. There it is my data file, MovieLens rating dot CSV, which is stored in my data folder. And these are the two parameters infer schema is equal to true, and this follows. The schema of the file. And headers are present will follow the headers. We take some time. Let's go for printing the schema of the data. Importantly, we have three columns, movie ID, which is integer type, rating double UserID, integer. Let's show the top eight rows of this data. These are the values movie ID to rating for movie ID two is 3 and user ID's 0. So these are all the movies, top eight movies which are used by this is Arieti 0. And these other ratings given by him in a particular movie ID. So this is the data. If you want to know the columns, we can list out the columns. So these are the three columns. Then take the descriptive statistics, descriptive summary of the data. Df.describe.show. These are the statistical measures. Count mean, standard deviation, minimum value for movies 0, maximum movie IDs 99, maximum rating is 5, minimum rating 1.00. This is the count of the rows, 1501 rows. There exist 30 users, 30 user ID's 0229. This is the descriptive statistics of the data. Let's split this data into two parts, a training and test using the random split function. The training data size is 0.880%. Testing data size is 0.2 or 20%. So data is fitted with the statement. We will use the algorithm ALS. This as a collaborative filtering algorithm. Als is equal to Ls of maximum iterations looping five division parameters. Improvement is, we expect 0.011%. What is the user ID is a column is UserID and item ID, Item column. And then reading these three are important for this recommendation system. Who is the user, what is the item, and then what is the rating given to a particular movie or a particular product? So the model is built, the data is fitted. Let us transform. And then we can see what are the predictions made by the model. Well, 20% of the data. And now here we have the predictions movie ID 31. The prediction is minus2, minus 0.2. Recommended or not. We can see the probability values here. 31, movie ID 31. These are the ratings by four users. Movie 31 as reading from users 2672518. Here are the ratings and then these are the predictions by the model. Now, let's evaluate our model. So we can use a regression evaluator as this problem is regression type. And the metric is equal to root mean square error, MSC and labeled columns rating. So we will consider the rating column and then prediction column. The comparison between rating and prediction is done. And then we can evaluate our predictions and we can find out what is the RMSE value, root-mean-squared error value. So we expect this RMSE value to lower. The actual values can be if it is standardized data, then it can be between 011 is considered as the best model and zeros consonant as a very weak model. So this is quite greater than 1.61. It's acquired low-value RMSE is lower, 1.661108. Then we can find out what are the movies recommended to the user id one. We can select users from the testing data. From the testing data. These are the movie IDs related to user ID one. And we can consider this as the testing data, UserID and the movies. Then we can transform this data or just order the predictions of the model in descending order. So as a 10x is equal to false, it means our ring. The predictions in descending order. We can see top movies which are recommended to the user id one. So if we look at this output, what are the Moon is recommended. Look at cell 12. What are the top movies? Let's go for the descending order. So top movie, which is recommended to the user ID is movie 88, with a reading of 3.85. And the lowest movie recommender is movie ID 94, with the prediction rating of 0.09. So this is the least recommended movie ID to the user id one. In this session, I explained what is a movie recommendation system and what are the different types of movie recommendation systems? By taking the case study, movie recommendation.
20. Clustering: In this session, I'm going to demonstrate how clustering is done by using ApacheSpark API. Large technology firms need your help. They had been hacked by hackers and they have grabbed some value, some data about the hacking. So this is the data given session connection time. It is how long the session lasted in minutes, bytes transferred, which is an mb megabytes. Cali trees used indicates if the hacker was using Kali Linux or not, then servers corrupted. It means number of servers corrupted during the attack and pages corrupted and location from where the attack came from. And WPA am typing speed means the estimated typing speed based on the session logs. So this is the data from this, we have to make a decision. The technology firm has three potential hackers that penetrated, that are. So they came to know that there are two hackers, but they are not sure if there can be a third hacker also. So they have requested you for being a data scientist or machine learning engineered to find out and to help them. They've also mentioned that the hackers trade-off is equal. That means each have roughly the same amount of attacks. So that is our key finding. First, we create a spark session and the application name is cluster hacking. Finally, if it does not exist, we will get it. Otherwise we'll create it. Then this is the algorithm k-means, which is imported from Pi Spark ML clustering. And then I have data which is located in my data folder. So reading the data, or I can say learning the data, the header is equal to true, the data as header. And following the schema of the data file, then better, I would like to create this data into a pandas data frame and show. This is the header part of the data. I'm taking the first three rows. So session connection time, but bytes transferred, Cali trees use. So this is a binary feature which has a value of one nonzero number of service corrupted pages, corrected location, and then we have typing speed. I also would like to take a descriptive summary of the data by using pandas described function. This is the descriptive summary for so from this, what we can conclude is the minimum number of bytes transferred is let say ten MB and a maximum is 1330. The number of Cayley trees used. Maximum, one, minimum is 0. So how many servers were involved? Maximum of ten servers were enrolled. Minimum is one, so this is the average typing speed is 57.3 for 2395. This is a descriptive summary of the data. If you want to list out the columns, then we can write data dot columns. So these are the columns. In total, we have seven columns. Then I would like to assemble the data. So I should have to go for a vector assembler. From pyspark and linear algebra, importing vectors and also importing VectorAssembler from Pi Spark ML feta, then collecting all these features as column list. So this is a list and then assembling these features. So input column is equal to feature columns and then output column is features. Now VectorAssembler is applied. After that, I can transform my data. Also using the scaling. I'm applying standard scalar on the feature columns. So from Pi Spark ML feature import standard scalar. And scalar is equal to standard scalar. So these are the input features and these other scale features. This will be the output column maintaining the standard deviation and then meet his fault. Then the data is scale. We have scaled data. So once the data is scale, now the immediate task is to find out whether they were two attackers or three attackers. In this case, we will build two models here, k-means three and came into to initially came in three. So algorithms k-means and feature columns are these scaled features. Then taking k value equals to three and also building one more model with k equals to two. And then transforming the data to the model and then computing the within sum of squared errors. So what is the within sum of squared errors with k equals three and k equals two. So here I just want to know what are the values for WSSE. When k is equal to three within sum of squared errors values for 34. And when k is equal to two, the Witten said sum of squared error value is 601. I would expect that as k increases, this error rate might also decrease. So let's take a small experimental area. I'm varying the value of k in the range of four to nine. And then looking at the value of within sum of squared error value. So written our for loop range four to nine. Building the models when k is equal to four within set sum of squared error values as our expectation to 6.137. When k is equal to five, the within sum of squared errors value is 400. And when k is equal to six, it has decreased to 32.73. When k is equal to seven, to 21.58. Even when k is equal to eight, the value is reduced to 2.4309. So the error rate is decreased when k value is increasing. The organization also mentioned that the attacks should be shared equally. So in this case, what are the predictions? We will use a clustering model to help out. So I'm using the model K3. That means when k is equal to three, then what are the predictions? Are the attacks shared equally or not? So if there are three attackers, 012, this is the distribution of the attacks 8551. Let me take it out. When k is equal to two, what is the sharing? So this model is good because here the attacks are shared equally. 167, unwind 67. So there were two attackers, are clustering algorithm created two equally sized clusters.
21. Capstone Project: Hi and welcome again to the course Python and Spark for big data. In this session, I'm going to demonstrate the capstone project, which is credit card fraud detection. So for detection is an activity that every financial organization is looking for solutions today, there have been maintaining large data. If it is small data, then we can go from machine learning to solve the problem. But if the data size is larger than machine-learning cannot help us so much. In this case, spark type of environment is better. So the dataset has been provided from Kaggle, and this dataset contains the transactions made by credit card holders in Europe. We need to import the libraries. So here I'm importing the spark libraries and also configuring Spark. So input find spark and initializing find spark. And then importing PySpark and setting some configurations like application name. The application is running on the local machine and then specifying spark executor memory as it's quite a large data, so four gigabytes is required. Then sending the configuration parameter to Spark context and then importing the libraries such as SQL context and Spark context. And then importing the basic SQL function. And also importing window function from Pi Spark SQL window and creating object to the Spark context SC. If this session exists, it will lower it. Otherwise it will create a new session. Then my data is available in my data folder. The file name is credit card dot CSV, and we have to load it by using Spark's read function. Once the libraries are important, then we can go for reading the data. Then I would like to check the shape of the data. Then I can write data No.2 Pandas. It means we are converting this data to Pandas dataframe type and then taking the shape of the data. After taking the shape of the data, I would like to see that descriptive summary of the data. Here also, I'm using pandas describe function to summarize. Then I will come to know what is the minimum, maximum and then standard distribution value. And also some percentile values like 25 percentile, 50th percentile, and 70% dark. So describe function uses the complete summary of the data. It will take some time. As data is quite large, it has to fetch 2 million records. Here it is. So we have 31 features, including the topic variable. And then I would like to describe the data using the describe function. And then taking the class distribution, that is observations as per the classes. So this is a binary classification problem. I should know whether there's a fraud activity or not, yes or no type of problem. This would also take time. So we should have a distributed environment. Basically this is my local computer. If the data size is large, then we should have a basis to see the output. Here it is. 3031 features are there. The first features, this time b1, till we have the fetus up to V2. And then there's also the amount and the class. It has two unique values, true or false. Then we want to see the Account class distribution. Here is the calm. For 0 type. We have 2284315 observation. And the other one we just have 492. That means the fraudulent activities are only for nine to the non fraudulent add 28435. So this is the class distribution. Basically, as per these values, we come to know that this is an imbalanced data. We can convert the string type data type to the double. So for each column in data columns, we want to convert them into double, casting it to double, and then adding the index to keep track of the rose, even half-term shuffling the data. That why we are creating an object to the window and ordering by time. And this is the ID of the transaction idx. Then importing other PySpark libraries like pipeline and GBD classifier. And we would like to transform the data. So vector indexer, VectorAssembler, and mini to evaluate our model. So binary classification evaluator and DenseVector. So here I would like to create a model which is a gradient boosting tree. Dbt classifier is used to build a model. Once these libraries are important, then convert each column to the dense vectors. Converting each column to the dense vector, and then create the label and index to the data. Here we are separating the data into features and label and also doing the indexing. Finally, we'll be having these three columns in the training data. Once we've formatted the data in to proper format, then we can split our data into 80-20. There are two resulting variables here, train and test and train random split off 0.8, which means 800.2%, which means 20%. Also, we are setting the seeding value to reproduce and then labeling the data, grouping the data by label in both training and testing. Let's see the output over year. Also, it has to count the observations based on the class label. It will take some time. Now, this is account for the training data to 27418. These are the non fraudulent observations. Then in one we have three, 7-6, which means fraudulent. Similarly for testing data, we'll get the results like this. Class wise. Here is the count. So fraudulent off 106. Then here we can apply the algorithm. Our algorithm is GBT classifier, gradient boosting tree classifier. It is one of the ensembling type of algorithms. Try to improve the accuracy by boosting the features. So GBT is equal to GBT classifier. And our feature columns are labeled with features. And then maximum iterations we are performing is a 100 and the maximum depth of the tree is eight. After that, we fit the data to the model training data. And then we have to make the predictions. And then these predictions results are available in the predictions. Let's wait for the output. It will take some time. Once the data is filtered, we can group the predictions and see the calm. Then we can go by using the binary classification evaluator. As it is a binary classification problem, we are going to use this binary classification evaluator, creating object for that and then evaluating the predictions which are mean. So here we have the predictions. In the non fraudulent. We have 56,905 on fraudulent. We just have 98 observations. And then making the evaluation, then calculating the percentage of fraudulent records which are predicted. But here our logic is prediction label equals one and prediction label not equal to one. That means 01. So here the evaluation accuracy is 0.97. Then calculating the prediction, then grouping the predictions, and then making a note of the predictions. Fraudulent predictions. Now, these are the fraudulent predictions. 1940 type is 56,919. And then also grouping based on the values, based on the label labeled wise grouping. Here we got a fraudulent of 9494 fraudulent. The number of total records are 116 for fraudulent records in the testing data. We can also determine the accuracy over here. So I'm importing this column from peice Park dot SQL and then moving the data by class wise. Then calculating the total fraudulent and calculating accuracy. The accuracy is equal to accurate fraud divided by total fraud multiplied by a 100. So here we can make a note of the accuracy of the model. This would also take some time. So 81.03, this is the accuracy. We can also have a confusion matrix where we can show true positive, true negative, false positive, and false negative. True positive means for which the class label is one. Prediction label is one, and also prediction dot predicted is equal to one. And then making the count of this also true negative label is equal to 0, prediction is equal to 0 and false, false to meet labeled 0 and prediction is one. False negative means label is one and prediction is 0. And then we can have a print off all these values. True positive to negative, false positive and false negative. After this, after knowing the true positive to negative, false positive, and false negative, we can calculate recall. So recall is one of the measures for classification type of problem. Recall means true positive divided by true positive plus false negative. And precision means true positive divided by true positive plus false positive. For classification type of problems, it is very important to know the precision. Precision means true positive rates. So many observations or classified correctly. So these are the values. True positives are 94, true negatives are 5-6, 893. False-positive is for, false-negative is two. From these values, we can calculate the recall value, which is 81.03, and then precision will be 0.9591. So obviously we have very good value. We can say that this is one of the best models we have developed.
22. GraphX Intro: In this session, I will introduce you to what this Spark GraphX. These other concepts which are going to be demonstrated. What does the graph, what is graph X, and what are its features and use cases? We will also see a couple of examples. So first, what is graph? A graph as a mathematical structure that has a set of objects and we can have relationships among these objects. The relations can be represented using edges and vertices, which together form a graph. Vertices represent the object and edges show the relationship between the object. You can see a graph here. This is an undirected graph, which means there is no direction among the nodes or vertices. But totally we have five vertices, a, b, c, d, e. And you can have the edges from the vertices like AB, AC, AD, BC, CD, C, and D. Graph can also be directed graph where you can have directions from one vertex to the other vertex. Example c2, e. The sources C and destination is e. What is Spark GraphX? It's a Spark API for graphs and graph parallel computation. It unifies ETL stands for extract, transform and load process. It can have exploratory analysis and iterative graph computation. Graphics include a growing collection of graph algorithms and builders to simplify graph analytical tasks. The usage of graphs can be seen in real time applications like Facebook, friends, LinkedIn connections, internet routers, relationship between galaxies and stars in astrophysics and Google Maps. So Google maps are also implemented as graph and you can have source and destination. You can find out some intermediate routes as well. Next, what our graphics features. First is flexibility. You can work with both graphs and computation's. Graphics unifies ETL. And you can view data as both grabs and collections. Transform and joined grabs with RDDs. And you can also write custom iterative graphs using Pregel API. Graphics also support speed. That means if you form any computation by building a graph, you can maintain your ISP compared to other graph processing systems. This library is now a growing algorithm library which can support the different types of algorithms, like PageRank, connected component, label propagation, SVD plus, plus strongly connected components, triangle count. We hope in the future, this library grows with even more algorithms and problems allusions. What are the use cases? First, we can consider disaster detection system, which is used to detect disasters such as hurricanes, earthquakes, tsunami, forest fires, and volcanoes to provide a warning to alert the people. Pagerank algorithm is to use to find the influenzas in any network, such as paper citation network and social media network. Financial fraud detection, it can be used to monitor the financial transactions and detect people who are involved in financial fraud and Money-Laundering. Business analysis used along with machine-learning, which helps in understanding customers purchase trends. You can say Uber and McDonalds are examples. Yeah. I introduce what is graph? Graph as a mathematical structure which represents the relationship between sets of objects. And graphs have two important components, vertices and edges. And graphics is the Apache Spark's API for graphs and graph parallel computation. Graphic supports ETL. It's foetus are flexibility, speed, and growing algorithm library.
23. Graph Operations: In this session, we will have the basic hands-on on graphics operations. So we discussed about graphics, what are its features? What are the use cases in the last session? So here I'm going to demonstrate the basic operations of graphics. First, we need to import graph frames. Next, we have to create a DataFrame. So here I'm creating a couple of vertices in the graph. So a, b, c, d, e, f, g. So these are the vertices and these are the labels. That means we'll be having three nodes in the graph. That is id, name, and age. Here I have given the set of names and also the age values and this is the vertex. Then we need to also create edge and also create SQL DataFrame. These are the three edges, source, destination, and relationship. So these are defined from these vertices and edges. We can create a graph. So g is equal to graph frame of V0 and we are printing the graph. When we do that, all these vertices and edges are printer. You can have a look at this. Anyhow. It's not that much clear, but it shows what are the vertices and edges. So here, V, This is the list of vertices. And also we have e, which is nothing but the list of edges. Once the graph is created, we can now determine what are the degrees. Degree means the number of connections. We can have these types in degree. It means the number of edges and to a particular vertex. And out-degree means outgoing edges. The number of tail ends at descend to the particular vertex. Degree means simply the number of edges connected to a particular node in a particular direction. These are the degrees of each node for each index. So f has degree two 0s, 3d2, c3, b3, and A3. These are the number of edges connected to that particular vertex. We can also have in-degrees. It means coming edges in a particular norm. So fs one, indegree, S1, D1, c2, b2, and A1. We can also list out all the vertices of a graph. These are all the vertices. We have these vertices like a. Also each vertex has an id, name, and age. We are listing out all the vertices by using the show function. Then listing out all the edges. So g dot-dot-dot show. Then it will list out all the edges from source to the destination. Next, what is the relationship? So these are all the edges. Then grouping the vertices based on the minimum age. In the particular graph, we have minimum ages 29. So we can list out this minimum of eight is 29 based on the age and then show it. Then importing the SQL functions and also finding the minimum age. This is also similar to the previous one, but we are using the SQL function min of age. Then counting the number of follow relationships to the graph. So number of follow relationships num follows as equal to g edges filter based on this condition, relationship is equal to follow and then making the count of these relationships. So number of relationships, we have four follow relationships, total is four. Then motif finding allows us to write complex relationships. These cells find the pairs of vertices with the edges in both directions. And the result is the DataFrame in which we can have a column name given by the motif keys. So this is my instruction motif as equal to define. And this is a relation from a to E and then go into B and then from B to e2 and going to a. So these are the relationships, yay to E and then E to B, and then from B we are going to E2. So these type of relationships are represented by using motive. It means we can write complex operations. We can also filter the data by using the motif. So motifs dot filter. This is a condition b dot age greater than 30 and aidan, age greater than 30. So either this condition or this condition must be true. And then it will find out the edges and vertices which are following up this condition. So here we have two rows. Here. I'm taking one more example, creating a Spark DataFrame. And these are the values a, b, c. So we can consider this as a name. And then 44 is the aid. So we have couple of rosier, total of five rows with name and age. So this is the data frame. From this, we can perform operations like grouping the data based on the name and then performing the aggregation operation with the minimum and average age and then showing the records. So these are the names and these are the minimum ages. I'm aggregating the value which are 32.044.023 for these three names. Then grouping the data based on eight and also, and taking the average eight. So these are the values, h values, and then we have the average age values, grouping the name with minimum age. So these are the names with the minimum ages.
24. Graph Algorithms: Hi, in this session, I'm going to show you the standard graph algorithms available with Spark GraphX x. These are the algorithms supported by the library at this time. Breadth-first search, connected components, strongly connected components, labeled propagation algorithm, short for LPA, page rank algorithm, shortest paths, and triangle count. You might have basic knowledge about what our BFS connected components and some basic terms related to the graph. Let me first create a graph. First, we create a graph with vertex and E. I'm importing graph frames. Then creating set of vertices, each vertex as these values, ID, name, and age. These are the records and set of edges. So these are source, destination and then relationship. We also use this example in the last session where we discussed about graph operation. The same graph is being used over here. That these vertices and edges and creating a graph G is equal to graph frame of V comma E. And then it will list out the vertices and edges. Breadth-first search. It is one of the popular algorithms. It searches the particular element based on the breath. We can implement BFS like this. Part one is equal to G dot BFS, and this is the expression. From expression is we are specifying where ID is equal to j and two expression we are specifying id is equal to b. Falling this expression, it will give us the record. So yay, and the name of the person as Anita 34. It is connected with the edge of E. These are the IDs e and b. The relationship as friend. And this is from, and this is two. So two has what we mentioned, IDB. So what does B, these are the values b and age is 36. So the same one. We can also perform this operation. We are specifying the name, our name values. So these are the names with these two. What are the records we have? So we have to record by following this condition and performing the BFS operation. Also defining one more part. Name is smile and the age is less than 32. What are the parts? So this is the path. So from E to D, these are the intermediate edges that is connected components shows the relationship between the nodes. We can list out the connected components and we can specify our part with respect to the root directory. We can help connections among the vertices. This is output id, name, age, and this is the connection number. Every connection is identified are given with some ID. These are the connections. The same connection except the last one. This last one has a different connection name. It ends with 64. And these are the connected components. Then what do we mean by strongly connected component? Strongly connected components means each vertex returns a DataFrame with each vertex assigned to strongly connected components having a vertex. So the result is equal to d strongly connected component. It's a predefined function. We're also specifying the number of iterations. Iterations is ten. It will show us the ID and the result of connected components. So strongly connected component is this. B is connected with this one like this. So with every ID, it will give a strongly connected component with its index. The third algorithm is labeled propagation. Labeled propagation is used for detecting the communities in the network. Each node in the network is initially assigned to its own community. Also, at every superstep, node sends their community affiliation to all the neighbors and update the state to the most frequent community affiliation of incoming messages are edges. Lpa is considered as one of the standard community detection algorithms for graphs. It is very expensive computationally. Although convergence is not guaranteed. Also we can end up with trivial solutions. That means our notes are identified into a single community. Here, I'm entering the maximum iterations as five. And then this is the algorithm G dot labeled propagation. It will show us all the communications between the IDs. So this id d as the communication with this component. Similarly BS connection with these components. So this is the label propagation algorithm, PageRank. You might have already known about what is page rank algorithm. It is used for ranking. It was developed by Google. It is used for ranking the pages or vertices. Here I'm implementing the page rank algorithm and specifying the reset probability as 0.25 and total is 0.01. And then it will list out the vertices, the vertices IDs. And with respect to them, the PageRank, that means number of connections with this importance. So if you look at this table, what is the vertex which has the higher rank? The last one is C. C is having the higher rank 2.68. And the next highest level is this one, B, 2.65. So this is the ranking of algorithms. We can show the edges of the results of the previous operation. So this is the source vertex, this is the destination vertex. Meanwhile, this is the relationship friend. And the weight is given after the calculation of the PageRank algorithm, which is 0.5. as I mentioned, these are the vertices which have weight of 1.0.0 and this also has 1.00. B C also has a weight of 1.00. Then running the page rank algorithm with some fixed number of iterations, maximum iterations or ten. The graph is created, then the page rank applied. So here we are creating two graphs by mentioning the probability values 0.15 and maximum iterations. Insulted in we specified the source IID from a particular source where we are finding the PageRank. We also have the algorithm to find the shortest path. This is the one. The result is equal to d dot shortest paths. We are finding the shortest path from vertex a to vertex D. So these are the shortest paths from a to D. Here it is mentioning the distances as well, so no distance for this d. And then these are the other vertices. So DS1 distance e as two distance between. And so these are the vertices. And then we are also calculating the distances. The shortest path, DS, no distance. B also has no distance. C also acetone distance. If there is distance, then I hope you can see over here. One more algorithm is triangle counts, which computes the number of triangles passing through each vertex. So triangle means having a relationship in the form of triangle. Here, we can make a note of the count with a particular index. And these are the other values, ID, name, and age. These are the vertices. With that. What is a triangle count? 01. There are more also, we can see them. So these are the algorithms which are supported by graphics. So in this video, I demonstrated the uses of these algorithms with examples. These are the supported algorithms. There should be improvement in the computation of graphs and also the incorporation of more algorithms in the near future.
25. Big Secrets: Hi, I'm going to tell you a couple of secret using Apache Spark so that you can do better programming. The first secret is sparked tuning and cluster sizing. This means that we need to do proper tuning and cluster sizing. So in order to do that, first, we need to configure Spark job. So basically it's an art. We need to follow some basic secrets here. Choosing a configuration actually depends on the size, setup of the data storage solution. The size of the job also depends on how much data you are processing and the kind of jobs. For example, jobs that cache a lot of data and perform many iterative computations have different requirements than dues actually contain few very large shuffled. Tuning in application also depends on goals. For example, in some instances, if you are using shared resources and you might want to configure the job that uses the fewest resources and still succeeds. In some other times, you may want to maximize the resources available to give applications the best possible performance. The second secret is adjusting the spark settings. You need to configure proper settings. Whenever you specify a Spark context object, which actually establishes a connection to the spark application and cluster. This contains Spark context object, which defines how Spark applications should be configured in your system. So this contains all the configurations, defaults, and the environment information which govern the behavior of your Spark application. These settings are represented as a key-value page. For example, setting the property spark executor instances to five. That means submitting a job with five executors. You may want to create a spark con, configuration with some desired properties before beginning the Spark context. In some cases, you need to specify the name of the applications which have corresponding APA calls. Otherwise said, the properties of Spark conf object directly with the dot set method, which takes as its argument arbitrary as key value pairs. Now, to configure spark applications differently for each submit, you may create an empty spot conf object and supply that as configuration at runtime. The third secret is knowing your cluster. It is very important to know about your cluster. So the primary resources that the spark application manages, our CPU central processing unit, which is nothing but the number of cores and the memory. These two resources must be managed properly. Spark request cannot ask more resources than are available in the environment in which it will run. It's very important to understand the CPU and memory available in the environment where the job will be running. If you set up your own cluster, the answers to the questions may be obvious. Like we are going to have a system which was set up by someone else. So it's very important to know how to determine the resources available to you. Or you can ask questions to the system administrator. So there are four primary pieces of information. You should be knowing. That is, how large can one request be? How large is each node? How many nodes does your cluster have? What percent of resources are available on a system where the job will be submitted? These are the three important secrets. Besides that, we need to also follow tips and tricks during the programming. First is ensure that have sufficient partitioning. It means you must have sufficient partitioning task and reduce the amount of shuffling which occur frequently. Use Parkway file format to predict pushed down and faster axis. If you find that you're constantly using the same dataframe on multiple queries. It's very important to implement the caching or persistence when joining together to dataframes. For example, one is smaller and other one is bigger. Or in this case, you have to do the caching persistent. If the data on the right side is larger, then it actually leads realization and transfer of the data, which actually takes longer time. So when you are joining two datasets where one is smaller than the other one, put the larger data set on the left side, instead of putting it on the right side. Compute statistics of tables before processing. This helps the optimizer to come up with a better plan on how to process the tables. Also avoid user-defined functions. Stop using user-defined functions by using Spark SQL functions.