Want to be a Big Data Scientist ? | Kumaran Ponnambalam | Skillshare
12 Lessons (1h 25m)
    • 1. Introduction to the course

    • 2. Data Science Definition

    • 3. Elements of Data Science

    • 4. Data Science Life Cycle

    • 5. Use Cases and Examples

    • 6. Big Data and Data Science

    • 7. Teams and Responsibilities

    • 8. Roles and Skills

    • 9. Challenges for a Data Scientist

    • 10. Transitioning into Data Science

    • 11. Finding Opportunities

    • 12. Closing Remarks

About This Class

"Data Science is the sexiest job of the 21st century - It has exciting work and incredible pay". You have been hearing about this a lot. You try to get more information on this and start querying and browsing. You get so many definitions, requirements, predictions, opinions and recommendations. At the end, you are perplexed. And you ask - "What exactly is this field all about? Is it a good career option for me?"

**** Please note: This is a career advice course, not a technology course.

Data Science has been growing exponentially in the last 5 years. It is also a hybrid field that requires multiple skills and training. We have been training students in Data Science. A number of them committed to training without realizing what it really is. Some were happy, some adjusted their expectations and some regretted committing too soon. We felt that professionals thinking of getting into Data Science needed a primer in what this field is all about. Hence, we came up with this course.

Through this course, you will learn about

  • Data Science goals and concepts
  • Process Flow
  • Roles and Responsibilities
  • Where you will fit into a Data Science team
  • Building a transition plan

Getting into the Data Science field involves significant investment of time. Learn about the field in order to make an informed decision.

Meet Your Teacher

Kumaran Ponnambalam

Dedicated to Data Science Education


V2 Maestros is dedicated to teaching data science and Big Data at affordable costs to the world. Our instructors have real-world experience practicing data science and delivering business results. Data Science is a hot and happening field in the IT industry. Unfortunately, the resources available for learning this skill are hard to find and expensive. We hope to ease this problem by providing quality education at affordable rates, thereby building data science talent across the world.

1. Introduction to the course: Hello. I am Kumaran, and it is my pleasure to welcome you to my course, Want to be a Data Scientist? Before we start the course, let us review its coverage and expectations. The goal of this course is to answer the question: should I pursue a career in data science? What type of course is it? This is a career advice course, not a technology course. It is for professionals who want to decide whether they want a career in data science, not for those who have already decided. What are we going to cover in this course? We will look at the basic concepts and terminology in data science. We will explore the process of data science, from data acquisition to business recommendations. We will look at data science team compositions: the different roles, their responsibilities, and the skills needed. We will discover the unique challenges that a person working in data science will face. Finally, we will review a game plan to transition from your current position into data science, including education and finding opportunities. And what is it that we do not cover? There is absolutely no coding in this course. We will run through a quick code example, but that is not to teach you anything, just to give you a feel for things. There are no design discussions. There are no technologies taught either, and there are no machine learning algorithms covered in this course. Who are the expected students for this course? You are an IT professional with a little or a lot of experience, or you are a statistician looking to transition into data science, or you have a business or analytics background. You have heard about data science from your friends and colleagues, have read about it, and became interested in it. Now you want to decide whether you want to pursue a career in data science. If that is your position, then you have come to the right place for career advice. 
But if you're already familiar with the data science process, its roles and functions, this course is not suited for you. Similarly, if you're looking to learn a specific technology like R, Python, or machine learning, this is not the course. If so, I don't want to waste your time. The last thing you want is to go through the course and realize that your expectations are not met. And the last thing I want is a bad rating and review because expectations are not met. A little about myself: I have been working in the data and analytics field for the past 25 years. My competencies include database design, big data, machine learning, NoSQL, architecture, software as a service, cloud deployment, and data pipelines. I want to share with you the knowledge and experience I have gained in my work and help you in your career. Let us now move on with the course.

2. Data Science Definition: What exactly is data science? Let us try to understand it with an example. In data science, we take raw data and build value for a business. Let us explore what that value chain is. We start with raw data, also called input data or source data. Data by itself is raw. When I say raw, it means that it is in its primitive state, like a machine log, a stream of bytes, or a web click event. This data is dirty: it contains inconsistent, incomplete, or unneeded events. This raw data is then converted into information, information that can be used purposefully. Information is raw data that is cleaned, linked with other pieces of data, transformed into usable forms, and summarized based on time and segments. Information by itself is not useful. We have to use this information to generate insights about the behavior of a business. To understand information, it needs to be explored, analyzed, and modeled. This understanding results in knowledge, which can then be used to predict future outcomes of a business. Information and insights are of no use unless they are put into action. 
That is, action taken by a business to improve its future outcomes. Actions include prescriptions or strategies, cost-benefit analysis of these strategies, experiments to test the validity of these strategies, and finally deploying them in the field. Let us now walk through a simple, easy-to-understand example of this value addition chain. We have data about high school students and their characteristics and performance. We start off with data about students and their age. This table shows data for each student: their student ID and their age. Next, we have the students' SAT scores. Finally, we have the students' grade point average, or GPA, scores. The question we are going to answer with data science here is: can a student's SAT score reliably predict a student's GPA score? The data we see here in these three tables is called our raw data. Next, we convert the raw data we have into meaningful information. We have linked the three tables we saw earlier into a single table using the student ID. Now we have a data set that we can explore. Let us eyeball this table to see if there are any patterns we can notice. Is there a pattern towards GPA? Does age impact GPA? We have students of two ages, 15 and 16, but they both seem to have the same range of GPA scores, so there is no pattern. Next, does SAT impact GPA? A careful back-and-forth eyeballing will indicate that the GPA generally goes up when SAT goes up. We only have six rows, so we can quickly eyeball them. What if there are more than 100 rows? How do you analyze them? We have graphical charts to the rescue. We have a scatter plot of 400 students. SAT is on the X axis and GPA is on the Y axis. You can see in general that when SAT improves, GPA also improves; the thick blue in the chart keeps moving up. To confirm, we now do another plot called the box-and-whisker plot. 
For each GPA score, there is a box that shows the range of SAT values. The box contains the most frequent values, and the lines, or whiskers, show the outliers. The X in the middle shows the mean value. If you're still confused, I recommend reading up on this kind of plot. From the plot, you can see that the box keeps moving up as GPA increases. Each GPA score also has a range of SAT scores, indicated by the thick box. In comparison, let us now do a box-and-whisker plot for age versus GPA. Here you will notice that the age remains the same irrespective of the GPA score; the boxes are one and the same. So there is no pattern here. Now that we have seen a pattern between SAT and GPA, is it possible to predict GPA from SAT? Is it possible to build a model that can be used to reliably predict GPA? We will use a machine learning technique called linear regression to plot a line through the data points. You can see the red line here. This is the model we are building. In this case, this line will have a linear equation. If you have done college-level math courses, you should have learned about linear equations. For this data set, the linear equation is shown as follows. The equation computes GPA as a function of the SAT score. This is the model that captures the relationship between GPA and SAT. Now, we can create an equation for any data set, but we need to know how accurate this equation is going to be. There is another metric, called R-squared, that is used to measure the accuracy of a linear equation. For this equation, it comes to about 70%. I want you now to take a deep breath and analyze how you felt going through the last few slides. Did the tables, graphs, and numbers intimidate you? Or are you fairly comfortable with them? If you want to work in data science, you need to feel comfortable with data. I have seen a number of smart engineers who do not feel comfortable looking at spreadsheets and graphs. If so, data science might not be a good option for you. 
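The walkthrough above (link the three tables by student ID, fit a line, then check R-squared) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six student records are invented, the function names are hypothetical, and the fit is plain ordinary least squares written by hand. The equation and the roughly 70% figure from the slides are not reproduced here because the slide data is not part of this transcript.

```python
# Hypothetical data: six students, invented for illustration only.
ages = {1: 15, 2: 16, 3: 15, 4: 16, 5: 15, 6: 16}
sat = {1: 1100, 2: 1200, 3: 1300, 4: 1400, 5: 1500, 6: 1600}
gpa = {1: 2.6, 2: 3.0, 3: 2.9, 4: 3.4, 5: 3.5, 6: 3.9}


def link_tables(ages, sat, gpa):
    """Join the three tables on student ID into a single list of rows."""
    return [
        {"id": sid, "age": ages[sid], "sat": sat[sid], "gpa": gpa[sid]}
        for sid in sorted(ages)
        if sid in sat and sid in gpa
    ]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys that the fitted line explains."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


rows = link_tables(ages, sat, gpa)
xs = [row["sat"] for row in rows]
ys = [row["gpa"] for row in rows]
slope, intercept = fit_line(xs, ys)
print(f"GPA = {slope:.5f} * SAT + {intercept:.3f}")
print(f"R-squared = {r_squared(xs, ys, slope, intercept):.2f}")
```

With real data the same three steps apply; only the volume changes, which is why the lesson goes on to argue that code and tooling replace eyeballing.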
So what are the insights gained from this analysis? First, we found that as SAT goes up, GPA also goes up. Next, we found that age has no relation to GPA. We conclude that, given an SAT score, we can predict a GPA score with an accuracy of about 70%. The action you could take on this would be to use the model to predict the GPA scores of students if you know their SAT scores. This is an oversimplified example for the purpose of explaining data science to you. Real-world data science is much more complicated than this. The question I want to ask is: what if, instead of just 100 rows, you have a million rows of data and about 100 columns? You can't eyeball it. It won't load into a spreadsheet. You need to write code to analyze it. You need technology that can handle this kind of data. That is where data science becomes a programming- and automation-driven discipline. So let us now see the definition of data science. The first goal of data science is to extract insights from data. The next goal is to use these insights to predict future outcomes and unknowns. The final goal is to use these insights and predictions to improve business outcomes. In order to achieve these goals, data science acquires, transforms, and stores data; accurate prediction models need huge quantities of data. Data science uses a variety of techniques and technologies from IT, statistics, and math to achieve its goals. Data science also includes the need to build and deploy these data pipelines, processing systems, and prediction algorithms in a scalable and automated manner to achieve business results. Next, let us look at some of the terminology used in data science.

3. Elements of Data Science: Let us look at some of the terminology used in data science. We start off with an entity. An entity is a thing that exists in the real world. Examples of entities include a customer of a business, a patient of a hospital, and a car. 
A student at a school is another example. An entity has a business context. In the above examples, a single individual can be a customer, a patient, and a student. What matters is the role that is being played inside the business environment. Next is characteristics. Every entity has characteristics, or attributes, that describe that entity. Characteristics are considered within a business context. A customer's characteristics or properties include age, income, and gender. A patient's characteristics include age and medical history. A car's properties include make, model, year, and VIN number. A student's properties include age and the grade in which they are studying. Next, we look at environment. Every entity exists in an environment. A given environment is shared among entities, and the environment affects the entity's behavior during various events. For example, a customer's environment is a country, city, or workplace. For a patient, it is the city they are living in and the climate of that place. For cars, it is the type of use, whether it is mostly city or highway driving, and also the climate in which they are used. For a student, it is the school, class, or team they are a part of. We then move on to events. An event is a significant business activity in which an entity participates. These are the actions that are measured and for which outcomes are predicted. Events happen within an environment. Examples of events include a customer browsing a website, making a store visit, or taking a sales call; a patient doing a doctor visit or a blood test; a car going through a smog test or a comparison test; a student attending classes or taking a test. Events trigger behaviors from an entity. Behavior is what a specific entity does in a specific event. Different entities exhibit different behaviors for the same event in the same environment. Examples include a customer's clicks while visiting a website, the products viewed from the catalogue, or the reviews they read. 
For a patient, behaviors include nausea, lightheadedness, or cramps. For a car, they include things like skidding during braking or acceleration. For a student, behavior includes how they answer questions in a test or take notes. And finally, there is outcome. An outcome is the result of an event deemed significant by the business. Outcome values can be Boolean, like yes or no; they can be continuous numeric values; or they can be class values. Examples of outcomes include whether a customer bought a product after browsing a website, a patient's diabetic type based on the tests, a car's stopping distances measured during a comparison test, or a student's GPA or a grade level like A or B. Now that we have looked at the six elements, let us try to redefine data science in these terms. In data science, you first choose a set of entities to study. You collect data about all six elements. Remember that the six elements are interrelated, and their relationships should also be collected, like the relation between a student, their GPA score, and their age. You then transform and integrate the data to convert it into meaningful forms that can be used for further analysis. You analyze this transformed data to understand patterns and develop models. Finally, you use the insights learned and the models built to predict the future outcomes for new events. I hope you are now starting to grasp what kind of work is done in data science. In the next video, we will see how the life cycle of a data science project is structured.

4. Data Science Life Cycle: So, what are the typical stages in a data science project? Let us see. Here are the overall stages in a data science project. We first acquire data from data sources. Then we transport data from the data sources to the central data center. We process data to transform it into information. We store and manage data. We analyze the stored data to derive insights. 
We use insights to predict future events and behavior. We prescribe recommended actions and validate them for deployment. Let us now deep dive into each of these stages. The first stage in data science is data acquisition. Data for data science comes from a variety of sources. We have the traditional RDBMS systems that store data from internal business applications like CRM, inventory management, financial accounting, et cetera. We then have data that is available in flat files like CSV files. These are usually fetched from third-party businesses or products. We have real-time streaming data from internal and cloud sources. We have social media data from websites like Twitter and Facebook that can be used to monitor customer behavior and comments on a company's products. We have mobile data, including location data, that needs to be captured and linked. We have data from IoT devices that needs to be collected too. Acquiring data from these various sources requires building suitable connectors to these systems. These connectors work in either batch mode or real-time mode for data collection. The next stage in data science is data transport from the point of collection to the processing centers. Please note that this is big data. Volumes are heavy, and they need appropriate pipelines to transport them, possibly around the world. There are multiple forms in which transport happens. There are batch-mode files that can be transferred through FTP. We have real-time streams that connect and work on a continuous basis. The streaming technologies can be native or standard. A very common model in big data is the publish-subscribe model. Here, the providers of data publish them into a queue like Apache Kafka, and the subscribers pull the data out of the queue. An important consideration for transport is scalability. Imagine having to transport mobile data collected from thousands of devices to a central location in real time. 
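The publish-subscribe model described above can be sketched with Python's standard library. This is a toy stand-in, not Apache Kafka or its API: an in-process `queue.Queue` plays the role of the queue, threads play the role of devices and of a central subscriber, and the device names and readings are made up for illustration.

```python
import queue
import threading


def publisher(q, device_id, readings):
    """A data provider: publishes each reading into the shared queue."""
    for value in readings:
        q.put({"device": device_id, "value": value})


def subscriber(q, received):
    """A subscriber: pulls messages out of the queue until told to stop."""
    while True:
        message = q.get()
        if message is None:  # sentinel: no more data is coming
            break
        received.append(message)


def run_demo(device_count=4, readings_per_device=3):
    q = queue.Queue()
    received = []
    consumer = threading.Thread(target=subscriber, args=(q, received))
    consumer.start()

    # Hypothetical devices, each publishing a few made-up readings.
    producers = [
        threading.Thread(
            target=publisher,
            args=(q, f"device-{i}", [i * 10 + j for j in range(readings_per_device)]),
        )
        for i in range(device_count)
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()

    q.put(None)  # all publishers are done, so stop the subscriber
    consumer.join()
    return received


messages = run_demo()
print(len(messages), "messages received")
```

The point of the pattern is that publishers and subscribers never talk to each other directly; they only share the queue, which is what lets a real broker scale the two sides independently.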
There is also a loss prevention issue that needs to be considered during this transport. Throughput requirements can be stringent for real-time applications. Next, we move to processing of data. Processing of data involves various types of activities. We start off with cleaning data to remove bad and incomplete data. We then standardize data to match analytics requirements, like standardizing data formats and linking data from various sources. For example, a customer's order information will be linked to the customer's characteristics and the customer's social media activities. Data is then transformed, like converting strings to numeric representations. Data is aggregated to create summaries based on different segments and profiling variables. There is also the need to transcribe video and audio data into their equivalent text for analytics. The next phase of data science is data storage. This is big data, and it requires appropriate storage resources and strategies for optimal functionality. Storing big data requires resources like hardware, software, disk space, and network bandwidth. It also requires human resources for setup and administration activities. Data needs to be guided by a good data architecture to prevent it from becoming a data swamp. The architecture and the infrastructure should be scalable to accommodate terabytes and petabytes of data. The setup should be fault tolerant by design. It should also provide acceptable performance for both the data ingestion and the analytics activities that are required. So far, we saw the heavy-lifting, dirty work in data science. We now get to the more exciting part: analytics. Analytics starts off with segmenting and profiling data through manual, graphical, or automated tools to discover and understand patterns. Business analysts perform exploratory data analytics to discover hidden trends. Correlation analysis is performed to understand relationships between 
entities, characteristics, environment, events, behavior, and outcomes. Statistical techniques are used to build models and perform statistical tests. Explanatory analytics focuses on finding reasons for the behavior seen. Tools and automation are a key part of the analytics phase to make it both productive and cost effective. Next comes the rock-star stage within data science: predictions. In order to build models, data needs to be preprocessed to convert it into forms that can be understood by machine learning algorithms. Algorithms don't understand text, only numeric values. Data is split into training sets and testing sets for model creation and evaluation. A model is created using appropriate machine learning algorithms. Models are validated for accuracy and performance. The model creation and validation process is iterative. It is repeated with changes in algorithms, data sets, and techniques until the desired levels of accuracy and performance are achieved. The final stage in data science is validation, to make sure that the outcomes of the earlier stages can be applied to the business in a useful way. Validation starts with a set of recommendations from the analytics and machine learning work. These recommendations are run through a SWOT analysis to make sure they make financial sense. Simulation can then be used to see whether the entities and the environment continue to behave on expected lines and the desired outcomes are achievable. A/B testing can be done to see initial user reactions to the new strategies and recommendations. Controlled trials can be conducted for this purpose. Finally, an action plan is arrived at, with all stakeholders in agreement, to deploy the learnings in production and reap the benefits. A data science process does not stop here. It is an iterative and continuous process of learning, validating, deploying, and improving business conditions. 
These conditions keep changing, and the analytics process and models need to keep adapting to these changes to stay current and continue to deliver results.

5. Use Cases and Examples: Data science is today being deployed in a number of customer-centric and business-centric applications. We will take a look at a few examples which you should already be familiar with. First, we have predictive text. I would be surprised if you have not used a mobile smartphone to text other parties. When you do that, you see a predictive text feature suggesting the current or next word based on the first letters that you type in. Predictive text builds models based on popularly used sentences and uses them to predict the next word. It also has autocorrect features built in the same way to correct you if you make spelling or grammatical mistakes in your text. Next comes product recommendations. If you have used Amazon or any of the other shopping websites, you will have come across a set of recommendations made for you based on your browsing and shopping history. This, again, is data science at work. Websites build models based on customers' likes and dislikes. They associate similar customers into customer groups based on browsing history. They then use what similar customers like to build recommendation lists. If you use Google search, and who doesn't, you will see advertisements pop up at the beginning or on the side of your search results. Such ads are based on the keywords you have entered and your browsing history. Search engines use complex predictive models to identify ads that you will most likely be interested in and provide them as part of the search results. Junk email identification is a great tool where data science is at work every day. Our personal email accounts get hundreds of junk mails these days. The automatic spam folders in Gmail or Hotmail do a great job of identifying such emails and moving them to a separate folder so you don't feel the pain of managing these emails. 
They build models based on sender patterns and content patterns to classify emails as junk or non-junk. We will now look at a live code example of classifying emails as spam or ham using data science. This example is to demonstrate to you how a typical data science project is executed. Please note that this is an oversimplified example. Real-world data science is much more complex than this. Please ignore the technical details in this example. The idea here is to give you a feel for things, not to teach you any specific technology or method.

Hi. Now we will be seeing a use case for Naive Bayes in R, and the use case we will be looking at is spam filtering. Spam filtering is a very popular activity that happens on any kind of textual data. It can be email, SMS data, Twitter data, whatever it happens to be. And Naive Bayes is one of the popular algorithms used for spam filtering. In this particular example, we are going to have a data set which has a set of SMS messages. These SMS messages have been pre-classified as either ham or spam. Using this data, we are going to come up with a model that can help identify messages as either ham or spam. The idea behind using this kind of analysis is that ham messages and spam messages differ in what people typically write. They differ in terms of what kinds of words occur in a ham message versus a spam message. Spam messages typically have words like deals, money, offer, something that is more selling than ham messages, and that is what we are going to be seeing. The techniques used in this use case are Naive Bayes classification, training and testing, and the confusion matrix. The new thing we will be seeing is text preprocessing: how do you process text, prepare it, and convert it into a numerical representation for it to be consumed by machine learning algorithms? We start out by setting the working directory. 
Then we read the SMS spam short CSV file, which is available as a part of the resource bundle, and load it into this SMS data. The type column of the SMS data currently holds the values ham and spam. We just make sure it is converted into a factor. Why is it not already a factor? Because we loaded it with stringsAsFactors set to FALSE, so the strings were loaded as characters. I did this to show you how you do this conversion. So you take the type column of the SMS data and make it into a factor. Now look at the structure of the data. You see there are 500 observations, 500 rows, of two columns. There is a type column, which is a factor of ham and spam, and there is a text column, which holds the message text. If you look at the summary of the data, you see there are 437 ham messages versus 63 spam messages. The head of the data actually shows you the messages, and if you look at the messages, you see a lot of stuff going on there. There are a lot of numbers, currency symbols, a lot of punctuation, and so on. So let us see how we go and process all of these. For text cleansing in R, the most popular library available is the library called tm. We just load this library tm, and it also loads the other package called NLP. Once we load the library, we have to convert the text data we have into what is called a message corpus. The tm library works on a message corpus, and it has functions to do the conversion. You first call a method called VectorSource, and then you call the method Corpus. This is how you have to use the library, so you just follow the convention and convert this into a message corpus. Then, once you have converted to a message corpus, you can take a peek at what this corpus contains using the function called inspect. You see that I'm just looking at the top five messages only. 
For every message you see, 1, 2, 3, it gives you the content. It also puts a lot of metadata in there; that is what this object shows. But there is also the content. Once you have this, we are going to go and cleanse the data. We talked about how data is to be cleansed in the presentation, so we are going to be doing that now. For that, there is a function called tm_map, which takes a number of these cleaning functions. To tm_map you pass the actual message corpus, and then there is a parameter you pass called removePunctuation. This is going to remove all the punctuation symbols, and the output is another message corpus, which we assign to this variable called refined corpus. Then you repeatedly do other processing. The next thing you do is remove white space. Again, all you do is call the same tm_map, now with the new refined corpus as the input parameter, and then you use the function called stripWhitespace, which strips the white space in the data and assigns the result to the refined corpus. Then you do the lowercase conversion, where there is something called a content_transformer, again built into the tm library. Just call tm_map with tolower, and that gives you all lowercase. Then you remove the numbers in the text using removeNumbers. Then you remove stop words, where you say removeWords, and for what you want to remove you call the built-in function called stopwords; it removes all the stop words. If you want to remove some specific words, you again call removeWords and then give a list of the words that you want to remove as a c() list, and it is going to remove these words from the whole corpus. Now, once you are done with this, let us take another peek with inspect at the data that has been transformed. 
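The cleansing steps narrated above (remove punctuation, normalize white space, lowercase, remove numbers, remove stop words, remove specific words) are done in the lesson with R's tm library. The sketch below is a Python stand-in using only the standard library, not the instructor's code; the function name `clean_message`, the sample message, and the very short stop-word list are assumptions made for illustration.

```python
import re
import string

# A tiny illustrative stop-word list (an assumption). Real stop-word lists,
# such as the one the lesson pulls from tm, are much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "you", "your", "for", "of", "in", "on"}


def clean_message(text, extra_words=()):
    """Apply the cleansing steps from the lesson to a single message."""
    # Remove punctuation (ASCII punctuation only; a symbol like a pound sign stays).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Convert to lowercase.
    text = text.lower()
    # Remove numbers.
    text = re.sub(r"\d+", "", text)
    # Remove stop words plus any caller-supplied specific words.
    drop = STOP_WORDS | set(extra_words)
    words = [word for word in text.split() if word not in drop]
    # split() followed by join() also collapses runs of white space.
    return " ".join(words)


sample = "WINNER!! You have won a FREE prize, call 0800 now."
print(clean_message(sample))
print(clean_message(sample, extra_words=("now",)))
```

Each step returns a new string that feeds the next one, which mirrors how the lesson reassigns the refined corpus after every tm_map call.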
Now you see the data is a lot cleaner. There are no numbers and things like that, the extra spaces are gone, a lot of the dots are gone, so it is a lot cleaner. Once the data is ready in this fashion, the next thing you do is create the document-term matrix. A document-term matrix consists of the documents being converted into a matrix or data frame in which every document is a row and every word is a column. So you just call this method, and that converts the entire corpus into a document-term matrix. Now let us look at the dimensions of the document-term matrix. It shows you that there are 500 rows, each representing an input document; the document here is actually the SMS message. And then the columns: there are about 2,000 columns. Every word becomes a column, so there are about 2,000 columns in this particular matrix. This is another interesting thing, because you have so many columns in there, and any machine learning algorithm needs to process all these columns, so it may be pretty tiring. So what you are going to do now is, rather than going through all of this, focus only on words that occurred at least 10 times in all the data. You take the 500 documents, do a word count of how many times each word occurred in these 500 documents, and then keep only those words that occurred at least 10 times. This is what I'm going to be doing. I'm calling this function called findFreqTerms on the DTM, and I'm passing the parameter value 10, which means it is only going to give me the list of words that occurred at least 10 times in the entire corpus. Then, using that as an input, I build the matrix from the refined corpus again with this list. What this is going to do is go and filter the document-term matrix. 
This filters the document term matrix down to only those words, which means it is going to reduce the columns from 1,966 to only those words that occurred at least 10 times. After I build this filtered DTM, if you look at the dimensions of the filtered DTM, I see 500 and 59. So from 1,966 the columns have come down to just 59 columns, which is decent and manageable. So we removed a lot of data that is very sparse and may not be that useful in the learning process, because you need a word to occur many more times for it to have any impact on the machine learning algorithms. Now let us inspect this document term matrix and see exactly how it looks. You see the documents occurring as rows and the words occurring as columns, and then you see the counts: where a word, say "call", occurs in document 5 one time, it just puts the count as 1 there. This is called a sparse matrix because the data is very sparsely populated. Somewhere you see a 2; otherwise the counts you see are mostly ones. It printed out the whole matrix, sorry for that, so I have to scroll down a lot to get to the next piece of code. So once I have a document term matrix, let me start doing some exploratory data analysis. One of the things you want to do with words is what is called a word cloud. You would have seen a word cloud a lot of times, where people just plot the words that are occurring, and the size of each word depends upon the number of times the word occurs in that particular data set. So we are going to be doing the same thing with a word cloud. We use this library called wordcloud. We set the palette; the palette basically tells you the color scheme to be used. This brewer.pal has a number of color schemes. I'm just picking the color scheme called Dark2.
And then I'm first plotting a word cloud where I'm picking, from the refined corpus, only the data where the type is equal to ham. I'm using the refined corpus, not the document term matrix. And then I'm saying look only at words that occurred at least five times, and then do the plotting. So this is how the plot comes out. It shows which words typically occur in ham messages; words such as "will" and "get" occur a lot. Now I try to do the same thing, the same word cloud, for the spam messages. Let us see how it looks. There's a big word, "call"; "call" seems to be very frequently used, "free" seems to be very frequently used, and "claim". So you see that there are certain words that occur characteristically in spam messages, which differentiate them from how ham messages look. So this is how you can do a word cloud. There's hardly anything else; you can't do any correlations in here. There is not much more exploratory data analysis to do here because of the type of data you have; it is very sparsely populated. So you straight away get to the training and testing split. We again go back and use the library caret, and we're going to be doing the data partition at 70 to 30 percent. We are actually going to be splitting three types of data into training and testing. We first split the raw data into training and testing, we split the corpus into training and testing, and then we split the document term matrix into training and testing, all following the same methodology. So the raw data, the refined corpus and the filtered DTM are each being split into training and testing. The next thing we're going to be doing is converting numbers into factors.
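Before moving on, the 70/30 partition just described can be sketched as follows. The demo uses the caret library in R; this is only a plain Python sketch under the assumption that the split is done per class, so that ham and spam keep roughly the same proportions in both parts. The function name stratified_split, the seed and the made-up label list are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 80 ham and 20 spam messages.
labels = ["ham"] * 80 + ["spam"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

The same two index lists would then be used to split the raw data, the corpus and the document term matrix, which is what keeps the three splits aligned with each other.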
So the document term matrix that we built has, as its actual cell values, the count of the number of times the word occurs in every document. We're going to convert that now into a yes or no. Irrespective of how many times a word occurs in a document, we're going to say whether the word occurred, yes or no. And for that we're going to write a function called convert_counts, in which you take an input, and if that input's value is greater than zero, return one, else zero. So if the value is greater than zero, which means it doesn't matter if it is five or six or 10, as long as it is greater than zero, return one; else make it zero. And then I convert that into a factor using this command. In this case, I'm converting them into No and Yes. So I'm saying factor of this, and I tell it what the levels are: the levels are zero and one, and then zero and one are mapped to No and Yes. So that is done. Once I have this convert_counts function, I'm going to use the apply function, which applies a function to every row or every column in the data. So I'm going to say apply on the training DTM with MARGIN equal to 2, which means it is going to apply the convert_counts method to every column. I apply it first to the training document term matrix, then to the testing document term matrix, and I get the DTM train and test. Once I get the DTM train and test, they're actually like matrices. So I convert the matrices into data frames using as.data.frame, because a data frame is what the algorithms will take as input. And then finally what I'm going to do is add the type, the actual type we are going to predict, because the document term matrix is not going to have the type; it is just built off the text part. So I'm just going to add this column, type, to both the training and testing data frames. So this is all the processing I do.
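The processing up to this point (document term matrix, frequent-term filter, and converting counts to Yes/No) can be sketched on a toy example. The demo is narrated in R; this is a plain Python sketch of the same idea, and the helper names, the four toy documents and the threshold of 2 (the demo uses 10) are assumptions made for illustration.

```python
from collections import Counter


def document_term_matrix(docs):
    """Rows are documents, columns are words, cells are word counts."""
    counts = [Counter(d.split()) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[w] for w in vocab] for c in counts]
    return vocab, rows


def frequent_terms(vocab, rows, min_count):
    """Words whose total count over all documents is at least min_count."""
    totals = [sum(col) for col in zip(*rows)]
    return [w for w, t in zip(vocab, totals) if t >= min_count]


def filter_columns(vocab, rows, keep):
    """Keep only the columns for the words in keep."""
    idx = [vocab.index(w) for w in keep]
    return [[row[i] for i in idx] for row in rows]


def convert_counts(x):
    """Any count greater than zero becomes Yes, otherwise No."""
    return "Yes" if x > 0 else "No"


# Hypothetical, already cleansed documents.
docs = ["free prize call", "call me later", "free free entry call", "see you later"]
vocab, dtm = document_term_matrix(docs)
keep = frequent_terms(vocab, dtm, min_count=2)
filtered = filter_columns(vocab, dtm, keep)
factors = [[convert_counts(c) for c in row] for row in filtered]
print(keep)
print(factors)
```

Note how the third document contains "free" twice, yet ends up with a plain Yes in that column: after the conversion only presence or absence of a word matters.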
Once I do all this, let me take a peek at this data frame: the first ten rows and the first ten columns. So you see that these are the rows. This is the training data set, so you see some rows missing, because they went to the testing data set. And then you have the columns, the words, and whether they occurred. You see that the ones and zeros have been replaced with Yes or No, because of the processing we did with this data. Once this is done, it is a simple thing of building the model and predicting using the model, for which we are using the library called e1071. e1071 is a library that gives me a naiveBayes function. So I call this naiveBayes function to build the model, to which I pass all my predictor variables. In this case, the predictor variables are all the 59 columns I have in the data frame, except for the 60th column, which is the type column. And then I pass my target variable, which is my type. So this builds my model. Then I am going to look at how the model looks. When we looked at the presentation on Naive Bayes, we talked about all the probabilities and conditional probability values. So you're going to actually see in the model the actual values it figured out. First is the call, the model call, which is pretty straightforward. Then it shows the a priori probabilities, which is, overall, what the split is between ham and spam. The probability that something is ham is 0.87, 87 percent, and spam is 0.12, 12 percent, in the training data set. This is the overall probability. Then I move on to conditional probabilities, where for every column in the data it is going to give me probabilities given that the message is ham or spam. Every column in this case is a word, because we made all the words columns. So you start out with this word, "anything", and "anything" has the values No or Yes in the table.
So what is the probability that "anything" is No if the document is ham? That is coming out to be 0.97. In the same way, what is the probability that "anything" will be Yes if the document is ham? That is 0.02. The same goes for spam: what is the probability that "anything" is going to be No if the document is spam, and similarly for Yes. So you see all the probabilities coming up here, and you see that each row will always add up to one. So first there is the overall probability that something is ham and that something is spam. Then, once something is ham, what is the probability that "anything" will be No, and that "anything" will be Yes. So you've got all these levels of probabilities that it is building. You'll see this matrix for every word in the document term matrix. There are 59 words, so there will be 59 such tables in here. So this is the overall set of matrices it builds, and using these it applies the Bayes formula to find out the actual probabilities that we saw in the presentation. Now we go and call the predict function. Internally, the predict function will actually be computing these probabilities. We call the predict function using this model and the testing data, and it is going to come out with predictions. Then you use confusionMatrix to tabulate the predictions against the actuals and see how well my model has performed against the testing data. So for reference you have ham and spam, and for prediction ham and spam; this is the overall confusion matrix. There are only seven errors: there are seven spam messages which got identified as ham. What that means is that these seven spam messages would be sent on as ham to the actual person. The person will be looking at them and saying, "Why is my spam filter not working fine?" That is what this one means. Now we look at the accuracy: 0.95, that is, 95% accurate.
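To make the priors, the per-word conditional tables and the prediction step concrete, here is a small Python sketch of a Naive Bayes classifier on Yes/No features. It is not the e1071 implementation used in the demo. The function names and the toy data are made up, and the Laplace smoothing (adding 1 to each count so that no probability is exactly zero) is an assumption of this sketch, not something stated in the walkthrough.

```python
import math
from collections import Counter


def train_naive_bayes(rows, labels, laplace=1):
    """rows: lists of 'Yes'/'No' per word column; labels: class per row."""
    class_counts = Counter(labels)
    n = len(labels)
    # A priori probabilities: overall share of each class.
    priors = {c: class_counts[c] / n for c in class_counts}
    n_cols = len(rows[0])
    # yes_counts[c][j]: number of class-c documents that contain word j.
    yes_counts = {c: [0] * n_cols for c in class_counts}
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            if v == "Yes":
                yes_counts[c][j] += 1
    # Conditional probability P(word j = Yes | class c), smoothed (assumption).
    p_yes = {
        c: [(yes_counts[c][j] + laplace) / (class_counts[c] + 2 * laplace)
            for j in range(n_cols)]
        for c in class_counts
    }
    return priors, p_yes


def predict(model, row):
    """Pick the class with the highest log posterior score."""
    priors, p_yes = model
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for j, v in enumerate(row):
            p = p_yes[c][j]
            score += math.log(p if v == "Yes" else 1 - p)
        scores[c] = score
    return max(scores, key=scores.get)


def confusion_matrix(predicted, actual):
    """Count (predicted, actual) pairs."""
    return Counter(zip(predicted, actual))


# Hypothetical training data with three word columns: call, free, later.
train_rows = [["No", "No", "Yes"], ["Yes", "No", "Yes"], ["No", "No", "No"],
              ["Yes", "No", "No"], ["Yes", "Yes", "No"], ["Yes", "Yes", "No"]]
train_labels = ["ham", "ham", "ham", "ham", "spam", "spam"]
model = train_naive_bayes(train_rows, train_labels)

test_rows = [["Yes", "Yes", "No"], ["No", "No", "Yes"]]
test_labels = ["spam", "ham"]
predictions = [predict(model, r) for r in test_rows]
cm = confusion_matrix(predictions, test_labels)
accuracy = sum(p == a for p, a in zip(predictions, test_labels)) / len(test_labels)
print(predictions, dict(cm), accuracy)
```

For each word, the Yes and No probabilities within a class add up to one, which is the same property pointed out in the tables above. Working with log probabilities is a common design choice because multiplying many small numbers directly can underflow.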
That is very, very good accuracy for this algorithm, for what we're trying to predict here. So in real time, what is going to happen is this. You're going to have this model, and you're going to save that model to some kind of a file. Then, in real time, when a real message comes in, you're going to convert that message into a vector or data frame with the same structure as the testing DTM, and then pass that to the same predict function. It is going to come out with a prediction. In this case, instead of coming up with a list of predictions, it is going to come up with a vector of one, because you are only predicting one message. And that is what you will then use to identify if the message is ham or spam, and based on that you decide to send it to the actual inbox, or mark it as spam, or whatever you want to do. So this is how a real ham/spam filter would actually work using Naive Bayes. Thank you.

6. Big Data and Data Science: You should have been hearing the terms big data and data science being used together in a number of conversations. Why? Let us review. Data science needs massive amounts of data to arrive at meaningful trends and accurate models. The more the data, the better the models. The algorithms and techniques used for data science have existed for a number of years, in fact decades, but they were confined to research institutions and advanced users only. A number of doctoral theses were done in this area but never applied to practical business use. Why? It is because the technologies to process data and execute algorithms were not cheap. Storage was expensive. Individual CPUs did not scale to this level. You needed supercomputers in those days to do data science. Data science was not cost effective.
Hence, data science was limited to very sophisticated applications like space and defense, where such huge costs are justifiable. Hence, until the year 2010, data science was not popular. Since it was not mainstream, it was not cost beneficial for businesses. But since 2010, data science has been exploding. More and more commercial applications have started using data science to improve their business outcomes. Data science has become cost effective, and big data made it happen. Big data enabled data science: low cost software, commodity hardware, massive scalability. Big data provided the technologies that made data science possible in the real world. It provided technologies which are efficient, cost effective and suited for regular use in business applications. There has been a plethora of technologies that have swamped the big data space. Today, the most popular ones are listed here. Hadoop started the revolution of big data, and it still continues to have a great impact. NoSQL databases broke the traditional limits of database storage and processing. Apache Spark gave a massively scalable platform for data processing. Apache Kafka provided scalable pipelines to transport data. The concept of containers to create virtual computing machinery enabled manageable scaling. REST APIs enabled integration between big data applications. Cloud computing like AWS and GCP provided easy-to-access computing resources to scale for data science projects. Languages like R and Python provide an excellent set of libraries for data processing and machine learning. All these technologies spurred the growth and deployment of data science. So remember, you can do data science on your laptop, but it is neither going to be accurate, nor is it scalable for real-world examples. You need big data to do data science at production scale.

7. Teams and Responsibilities: How is a data science team structured? What are the various roles and responsibilities?
Let us see that in this video. If you do a Google search on the Internet for data science skills, you will end up seeing a number of these triple-circle diagrams. It might be intimidating to see that you need to master multiple areas or domains in order to become a data scientist. That is not entirely true. Data science is a field that requires a variety of skills in different domains: software development, math and business. But not all these skills need to be possessed by the same individual. There are multiple roles inside a data science team, with each role specializing in a specific skill set. Sometimes, in a small organization, there could be a one-person data science team, and that person needs to do all the work and possess all the skills. But in large organizations, there are large teams with individual responsibilities and skills. Let us now divide the data science life cycle stages between three teams. We saw in our earlier lectures that the data science life cycle consists of seven stages: acquire, transport, process, store, analyze, predict and validate. The first three stages, namely acquire, transport and process, and to some extent store, are collectively called data engineering. They are the responsibility of the data engineering team. The data storage aspect comes under the data management team. The last three, namely analyze, predict and validate, are collectively called business analytics. They are managed by the business analytics team. There can be folks who disagree with me on the classification, but I chose this wide classification to make roles distinct in the data science team. So what are the responsibilities of each of these teams? Data engineering teams work on the following aspects of data science. Building connectors to sources. Setting up transport pipelines from sources to the data center.
Processing and transforming big data. Using and deploying machine learning algorithms which are built by statisticians. Architecting data pipelines, and setting up and configuring the various components which make up the pipeline. The data management team works on the following aspects. Designing schemas for databases, including RDBMS and NoSQL. Setting up storage schemes. Administering data sources. Administering and architecting data. Implementing fault-tolerant schemes and designing scalability into the data architecture. Business analytics teams handle the following aspects of data science. Exploring data. Identifying patterns in data. Preparing and doing presentations to senior management on the results of analytics. Model building for predictive analytics. Designing and conducting simulated and field experiments to validate recommendations. So we have these sub-teams in data science, and these are the responsibilities of these teams. Let us now drill down a little bit to understand the key roles in each of these teams in the next video. Remember, once again, the same person can play multiple roles.

8. Roles and Skills: What are the various roles inside a data science team? What skills do they need? We start off with the data engineer. A data engineer is part of the data engineering team. A data engineer has a software engineering background. They typically have less than 10 years of experience. They are skilled in programming with data science languages like R, Java or Python. They are familiar with various machine learning libraries in these languages and how to use them. They also have experience with Hadoop and Apache Spark, which are used to build data pipelines. They have familiarity with SQL and NoSQL storage and programming. The next role within data engineering is the pipeline architect. They are also called data engineering architects or big data architects. A pipeline architect has a software architecture background and experience.
This is their key skill. They typically have more than 10 years of software experience. Like data engineers, they also need R, Java and Python programming experience, as well as Hadoop, Spark, SQL and NoSQL experience. Their focus is more on architecture and design for these technologies. Next we look at data management, and the first role within data management is the database administrator. Unlike a traditional DBA, this administrator has to deal with NoSQL as well as HDFS data stores. They have a database background and less than 10 years of experience. Their key skill set is RDBMS and NoSQL. They also have familiarity with Hadoop and HDFS. They are skilled in scripting. Management tools for data stores are another area where they need familiarity. We then move on to the data architect. The data architect has prior DBA experience; they have usually put in more than 10 years in the database field. They are masters of both RDBMS and NoSQL. They are familiar with Hadoop and Apache Spark, to a lesser extent than the pipeline architect, though. A key skill for them is database design, both for RDBMS and NoSQL, and they have familiarity with management tools also. We then move on to the business analytics team. The first role we see there is that of a business analyst. A business analyst may have either a software or a business background. Typically, they have more than two years of experience. They have domain knowledge in the specific domain the team works in, say e-commerce, finance, healthcare, etc. They are skilled in SQL and scripting to do analytics work. They are well versed in various tools and technologies used for data analytics, including prescriptive and experimental analytics. They possess management skills. One of the key unique responsibilities they have is to pull together individuals from different backgrounds and get them to work together as a team. And finally we have the statistician.
A statistician has a math or statistics background and typically has a doctorate in these fields. They have more than two years of experience. Their key skill is building mathematical and statistical models for the data in question. They are familiar with SQL and scripting also, and this helps them perform simple queries and aggregations. They have familiarity with machine learning libraries in R and Python. They build models using these libraries in the lab. The data engineer then deploys these models into production code. What is typically the team size and work composition between these three teams in data science? Data engineering work takes about 30 to 50 percent of the effort inside a data science team. This is the heavy lifting that needs to be done before data becomes usable. Data management work takes 20 to 30 percent. This includes some design and mostly administration work. Business analytics takes around 30 to 40 percent. So if you're building a team of around 10 people, you would typically have four in data engineering, two in data management and four in business analytics. If it's a single-person team, the time spent is apportioned in this fashion. If I am guessing right, you are thinking: where is the data scientist in all these roles? I haven't heard that role at all. Here are some thoughts about it. In early definitions of data science, the statistician role was usually called the data scientist, but that has grown and changed pretty rapidly. Sometimes it's a one-person data science team, and that person gets called the data scientist. Please think about it. It's nearly impossible for one person to master all the relevant skills for data science: big data, programming, analytics, management, machine learning, statistics. Each is a master's degree of its own. Completing a course in any of them does not indicate you have mastered them; you need significant experience in each of them. So it is team data science these days. I repeat: team data science, and not individual data
science. Data science has grown into a very broad field, like IT. We call someone an IT professional irrespective of whether they work in development, QA, administration or networking. Similarly, we can also call anyone working in data science a data scientist. So if you work in data science, you can be called a data scientist. Some of you may not agree with me, but I am going for a broader definition and also looking ahead at how this field is expanding.

9. Challenges for a Data Scientist: You should have already heard about the good things in data science, which is one of the reasons why you are evaluating a career in data science. But what are the challenges that data scientists face? Are you up to it? The first challenge in data science is technology. The technologies used in data science are rapidly evolving. There are new products, tools and techniques popping up every day. Which ones do you use? Which ones do you rely upon? You have the challenge of quickly learning and adapting to new technology. You learn one, start using it, and then a better one comes along. Stability in this field will take a few years, and until then you have to live with this evolution. Also, remember that many of these new technologies are good at dealing with scaling for big data, but they lack sophisticated and easy-to-use interfaces. Setting them up and troubleshooting them is a big challenge, given the lack of good interfaces, troubleshooting tools and documentation. The next challenge a data scientist would face is data itself. Are you comfortable looking at large quantities of data in a spreadsheet or a collection of charts and graphics and analyzing them, or do you want to run away? When you try to get dirty data from a third party and load it into a database, the experience can be very frustrating. Badly formatted data in huge files requires painful iterative effort to get cleaned.
When you can't open the data in a spreadsheet or a notepad and have to rely on stream programming to look into huge files, it can be very frustrating. Also remember that not all data science projects yield positive outcomes. A number of them get dropped due to lack of data or lack of signals in the data. Such projects can be very frustrating. Next comes project structure. Data science projects are loosely structured in terms of requirements and expectations. Goals and plans keep changing as more discoveries happen with data, or with the lack of data. There will be replanning and redoing as the projects go on. As I said before, projects fail frequently due to the lack of signals, and that can be very frustrating after putting in a lot of effort. Can you handle these frustrations? As a business analyst or statistician, your job is to look for signals in data, but data science projects do not always yield signals. And did I say that for the nth time? Half the time there are no signals in the data. We expect to see some, but the results turn out otherwise. You need to plan for iterations and redesign. You need creativity and discipline when going on a search for these signals. The most important thing is not to get lost in the search and wander off. And finally, there is teamwork. Data science projects require diverse teams from engineering, business, analytics and the field, and the teams need to work together. These teams have different and sometimes contradicting expectations. It is vital for the teams to understand and empathize with each other. It is important to stick together in both success and failure. Finger pointing can turn out to be a killer for team productivity. Are you really a team player? So who is really a good data scientist? Not the ones who have mastered the technology and domain, but the ones who can make the most out of available signals and situations. This needs openness, teamwork, flexibility and determination. Are you ready to be a good data scientist?

10.
Transitioning into Data Science: If you have made up your mind by now to pursue a career in data science, how do you transition into it? If you are like most others, you already have some or significant experience in your current field. How do you do an optimal transition while keeping your experience intact? The first recommendation from me would be for you to focus on your current skills. Choose a role in data science that closely matches your current skills. You would then have more skills and experience that are relevant to data science than if you chose an unrelated role. For example, if you currently have an MBA, choose a business analyst role rather than a data engineer role. If you are currently a software engineer, choose a data engineering role rather than the statistician role. This kind of choice positions you better for entry into data science, since you can utilize most of your current skills rather than entering as a fresh person. It gives you leverage in job interviews and compensation. If you have 10 years as a software engineer and try to become a statistician, you are positioned lower than a math graduate with two years of data science experience. With a matching role, you will be able to add value to the team quickly and be productive from day one. It also provides for an easier transition into the role than getting into a totally new domain. The second recommendation I have is related to the first. Build data science skills that are appropriate and relevant to your current skills. If you are a software engineer, learn data science products like Apache Spark. If you are a domain expert, learn business analytics skills. So here are some of the transition paths you can choose and the skills you need to learn. If you are in software development with less than 10 years of experience, you can choose to become a data engineer. You can reuse a lot of your current experience.
That is because data engineers also build software-based products. You need to learn new skills like R and Python for data science, Hadoop, Spark and NoSQL. On the other hand, if you are a software engineer with more than 10 years of experience, focus on becoming a pipeline architect. You would learn the same skills as a data engineer but also build big data architecture skills. If you are a database administrator with less than 10 years of experience, try to become a big data database administrator. This needs additional administration skills like Hadoop and NoSQL. If you are a data architect with more than 10 years of experience, you can choose to become a big data architect. You need to enhance your skills toward NoSQL, Hadoop and big data architectures. If you're currently a data or business analyst with two plus years of experience, choose to become a business analyst in a data science team. You have to build skills in big data query technologies like Hive and Impala, scripting, R and Python for data science, and some introduction to machine learning. If you have a math or statistics background, focus on becoming the statistician in the team. You have to build your skills in programming, specifically with R and Python machine learning libraries. You also need to build skills in big data query technologies like Hive and Impala, and also some scripting. If you are a domain expert, say in the areas of e-commerce, healthcare, finance, supply chain, public administration, etc., choose to become a business analyst in a data science team that works on problems specific to your domain. That way, your current skills are used in the team. You need to learn Hive, Impala, scripting and also some R and Python programming. Finally, if you are a business professional experienced in management, you can choose to become a project manager inside a data science team. This would make you useful from day one while you build upon the remaining skills.
You would need familiarity with data science concepts for the same. Now that you know the skills you need to build, how do you build them? The cheapest and easiest way is self study. You have the flexibility to continue in your current job while building these new skills. But self study is only for people who are disciplined and comfortable doing it. The next option you have is online courses. Nowadays, online course catalogs are booming with data science courses. Similar to self study, you can do them while focusing on your full-time job. The next option is to participate in full-time workshops and boot camps. They are more expensive and may require time off from your current job, but they keep you focused on your learning. You can also pursue an advanced degree. Nowadays, there are universities that offer master's degrees in data science and analytics. But that again is a heavy investment of time and money. I would recommend them for fresh graduates or folks with around two years of experience. Next comes cross training. While you start off with a specific role, it is important for you to start cross training on the technologies needed for other roles. This is after you get your first job in your primary skill. This is gradual learning on the job. For example, if you are a data engineer, start learning about data architectures and administration, or start focusing on model building. It takes a longer time to acquire these skills, but given that you are in a data science team, you will learn from other members. Getting familiar with the other skills required for data science will position you for leadership positions in the team, from being a team lead to becoming the chief data scientist or the director of data science. As discussed before, the field of data science is constantly evolving. It is important for you to keep up with the industry trends. You need to bring cutting-edge skills into your job.
I offer some recommendations on websites focused on data science: Data Science Central, Kaggle and KDnuggets. Register and subscribe to these sites. You will keep getting emails about the latest happenings and learnings in the field. Next, let us see how you can find opportunities for your first job in data science.

11. Finding Opportunities: Once you have acquired the required skills, how do you get the first job or assignment in data science? The first one is always the tough one to get. Let me ask you a question. Are there really data scientists with 10 plus years of experience? Rarely. But job postings ask for the same. So you see that you must leverage your current experience and use it as part of your data science experience. That's why you want to get into a role that matches closely with your current skills. So where can you find opportunities in data science? There is the obvious one: look for job postings and apply. But what advantage do you have over other applicants, especially when you are a newbie in the field? Here are some suggestions to improve your chances. First, look at your current position, your current team and your current organization. Are there any data science related projects that are coming up? This is your best bet. Most organizations allow employees to transition internally to new projects. They like internal folks because of mutual familiarity, and also because it's a no-cost hire. When you move to internal projects, there is value for your tenure in your company. That translates to better chances of being added to the team. As for your position within the team, you already have the relevant domain and product knowledge. That would be very useful for analytics. So look for such opportunities to move within your organization. Not all of us are lucky enough, though, to be in such organizations, those that allow you to move between teams and those that have new data science projects.
Next, try to do some mini projects on your own and publish your work. This is proof of your competence in the field. Use publicly available data sets. Choose a familiar problem. Do an end-to-end data science project with that data set. Finally, publish the code you have built to a public location like GitHub. Document the process you followed, the trends you discovered and also your learnings in a blog. Use them as references in your resume. When you go for an interview, make sure you emphasize your current skills and how they apply to data science. Use real examples to demonstrate something you learned in your current job and how it applies to the data science field. For example, if you are a DBA, demonstrate how you have built scripts for automating a number of administration jobs, and emphasize that the same would apply to big data products like SQL, NoSQL and Hadoop. Show your value to your interviewer so they don't see you as a novice, but as someone who has relevant experience. Build a strategy on how you would do that. Another thing you want to do in the interview is to demonstrate a vision for your data science career. Enumerate how you want to pursue your career. This shows the interviewer your focus and interest in the field. Goals and a career path demonstrate your seriousness about a data science career and that you are not merely trying it out. Having newly acquired skills to match these goals is also a good selling point. Finally, data scientists need a good work ethic, to work in teams and to work tirelessly on complex problems. Make sure you have a strategy to demonstrate that. I hope these suggestions will be helpful for you to build a career in data science. 12. Closing Remarks: We have come now to the end of the course. I wanted to help you answer the question: should I pursue a career in data science?
I tried my best, and I hope you will be able to get some meaningful insights from this course. We looked at the basics of data science. We analyzed the roles in data science and the education required for the same. We reviewed career path options and a transition plan. As promised, or rather as not promised, we did not look into any code or design or technology or algorithms. So what are the next steps? Data science is a rewarding career for the right candidate, both from work satisfaction and monetary points of view. So make a decision. If you want to pursue a career in data science, create a learning path or strategy and work on it. So build your skills, create a transition plan based on your current position, skills and situation, execute the plan and finally get started. Now I want to thank you for taking up this course. I wish you the very best in your career and life. If you like this course, please don't forget to leave a review. It will be helpful for other folks who are looking for similar advice. Thank you once again.