Reinforcement Learning #1 : Introduction to Reinforcement Learning | Artificial Intelligence | Abhishek Kumar | Skillshare

Abhishek Kumar, Computer Scientist at Adobe

Lessons in This Class

12 Lessons (55m)
    • 1. Introduction

    • 2. Overview

    • 3. Agent and Environment

    • 4. History and State

    • 5. Markov Decision Process

    • 6. Components of RL Agent

    • 7. Categorising RL Agents

    • 8. Learning and Planning

    • 9. Exploration and Exploitation

    • 10. Action Selection for Exploration vs Exploitation

    • 11. Prediction and Control

    • 12. What's next

About This Class

This class introduces you to the fundamentals of Reinforcement Learning. So no prior knowledge is expected for taking this course. After completing this class, students will gain familiarity with the basic Reinforcement Learning terminologies and will be ready to dive into intermediate and advanced level courses on Reinforcement Learning.

The class contents are:

  • Overview
  • Agent and Environment
  • History and State
  • Markov Decision Process (MDP)
  • Components of RL Agent
  • Categorising RL Agents
  • Learning and Planning
  • Exploration and Exploitation
  • Prediction and Control

Meet Your Teacher

Abhishek Kumar

Computer Scientist at Adobe


1. Introduction: Welcome to the first course on reinforcement learning. This class is broken down into 10 lessons, which is roughly one hour of video content. We don't expect any knowledge of reinforcement learning for starting this course, but it will be helpful if you have some basic understanding of how neural networks work. This is the first of 10 courses on reinforcement learning. It has some introductory material, so you will gain some introductory knowledge of the reinforcement learning concepts. You may be wondering who the instructor is. My name is Abhishek Kumar and I work as a computer scientist at Adobe. I have seven years of experience in programming and four-plus years of experience in machine learning. What will you know after completing this course? You will gain some basic understanding of reinforcement learning, so you'll be able to understand the key terminology used in reinforcement learning. After completing this course, you will be ready to dive into advanced courses on reinforcement learning. Let's briefly look at the course contents. It has an overview, where I will give some overview of machine learning as a whole and reinforcement learning in particular. Then we will look at agent and environment, which are the components of reinforcement learning, then history and state, then Markov decision processes, components of a reinforcement learning agent, categorizing RL agents, learning and planning, exploration and exploitation, and prediction and control. Welcome to the course, and I hope to see you in the next lesson.

2. Overview: Welcome to the course on reinforcement learning. Reinforcement learning is a branch of machine learning, so let's first see an overview of machine learning and where reinforcement learning fits in. Machine learning is categorized into three main categories: supervised learning, unsupervised learning, and reinforcement learning.
Nowadays the term semi-supervised learning is also popular, but for the sake of simplicity we will stick to these three main branches. In supervised learning, as the name suggests, there is some supervision or guidance present. We provide a set of labeled data along with the corresponding output, and the job of the network is to learn from that well-labeled training data. The main classes of supervised learning are classification and regression. Classification, as the name suggests, deals with categorizing the data into classes, whereas in regression we get some real-valued output. One example of classification: we are given some sample images of cars, some images of bikes or bicycles, and some other vehicles, maybe buses, and we have labeled those objects in our images, so we provide the positions where they are located in the image. We provide tons of such labeled data to our network, which for image processing is a CNN. Ultimately the network figures out the difference between the representation of a car, a bicycle, and a bus. When we feed a new image to the trained neural network, it will be able to predict correctly whether the image belongs to the class car, bike, or bus. We can call car class one, bike class two, and bus class three, so it gives out discrete classes here. In regression, on the other hand, you may give some continuous data. For example, housing prices are given along with some input factors like the number of bedrooms, the area of the house, the locality, and so on. We provide a bunch of such data along with the corresponding price of the house, which is a real number. After training on this sort of data, our neural network will ultimately be able to predict the price of a new house.
So when we feed in the number of bedrooms, the house area, and the locality of the house, the network will be able to predict what the price of the house should be. This is an example of regression, where it predicts some real-valued output. In unsupervised learning, there is no supervisor or guidance. Here the network just tries to group the given data based on similarity, or tries to understand the structure in the data: it tries to find which data are similar and which data are different, and ultimately tries to group similar data together. The main classes of unsupervised learning are clustering and association. Clustering mainly groups different data points, whereas association tries to find some relation between different parameters. In the earlier supervised learning example, we had three parameters, x1, x2, and x3, and we were predicting y, and we had a first record, a second record, and a third record. Clustering tries to group these data points together, for instance by finding that two of the records are similar. Association tries to find links between parameters, such as that x3 is highly linked to x1: whenever x1 is present, it is highly likely that x3 is also present, so these are related. This is the difference between clustering and association, and both fall under the category of unsupervised learning. Finally, reinforcement learning, which is the main subject of this course. In reinforcement learning there is no supervisor, but there are reward signals. The main components of reinforcement learning are the agent and the environment. The agent takes an action based on some policy. By agent I mean some algorithm, and it has some policy, based on which it takes an action. The environment emits certain observations depending on the action taken by the agent, and also some reward depending on the agent's action.
Here there is no predefined set of data. Suppose we have a robot that is trying to learn to walk. If the robot moves in this direction, it may get different types of observations: some objects may be in its way, and when it falls down it will receive a negative reward from the environment and try to correct itself. If the robot had moved in another direction, it would have had a different kind of experience and received different observations. So the data on which the agent is trained depends on the actions of the agent, whereas in supervised learning we would have provided a fixed dataset on which to train our neural network. That is how it differs from supervised learning. One example of reinforcement learning is learning to play chess, where the agent takes a step and gets some positive or negative reward. If it gets a negative reward, it tries to correct itself, and finally, with lots of experience, the agent learns to play chess. Similarly, flying a helicopter could be an example of reinforcement learning: on crashing the helicopter we may give a negative reward, and for following the trajectory we want we would give a positive reward. Similarly, robot walking also comes under reinforcement learning.

3. Agent and Environment: Agent and environment are the two major components of reinforcement learning, so let's see how they interact. At each time step t, the agent executes an action and in turn receives a reward corresponding to the previous action, and some observation. The environment receives the action and emits an observation for time t+1 along with a reward. So whatever action the agent took in the previous time step, the environment sends a reward for it in the next time step.
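The interaction just described can be sketched as a simple loop. This is only a minimal illustration, not the instructor's code: `ToyEnvironment`, its five-cell corridor, the reward values, and `run_episode` are hypothetical names and numbers chosen for the example.

```python
import random


class ToyEnvironment:
    """Hypothetical 1-D corridor with positions 0..4; the goal is position 4."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to [0, 4].
        self.position = max(0, min(4, self.position + action))
        done = self.position == 4
        # Assumed rewards: +1.0 on reaching the goal, -0.1 for every other step.
        reward = 1.0 if done else -0.1
        observation = self.position
        return observation, reward, done


def random_policy(observation):
    """A policy maps what the agent sees to an action; this one ignores it."""
    return random.choice([-1, 1])


def run_episode(env, policy, max_steps=1000):
    """At each time step the agent acts, then receives an observation and a reward."""
    observation = env.position
    total_reward = 0.0
    history = []  # accumulated (action, observation, reward) triples
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)
        history.append((action, observation, reward))
        total_reward += reward
        if done:
            break
    return total_reward, history
```

Calling `run_episode(ToyEnvironment(), random_policy)` plays one episode with random actions and returns the cumulative reward together with everything the agent experienced along the way.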
The environment also sends the corresponding observation, and the time step is incremented at the environment step. The reward is a scalar feedback signal, and it indicates how well the agent is doing at time t. Being a scalar is useful for comparing rewards, for example whether reward R1 is better than R2: we are able to put different rewards on one scale so that we can compare them, which helps the agent in optimizing its policy. A policy which yields more cumulative reward is a better policy. The main goal of the agent is to maximize the cumulative reward over time, so it is not necessary that the immediate reward is maximal. That is why it is different from a greedy algorithm. The goal is to maximize this cumulative reward over time, and we call this cumulative reward the return. Reinforcement learning is based on the reward hypothesis. By the reward hypothesis we mean that any goal can be formalized as the outcome of maximizing a cumulative reward. Let's see some examples of rewards. In chess, we can define a positive reward for winning the game and a negative reward for losing the game. Note that we are not giving any reward for individual moves: the reward is delayed, and we get it only at the end of the game. So it is not necessary that there is a corresponding reward after every action. A second example is making a robot learn to walk, where we give a positive reward for forward movement and a negative reward for falling. In helicopter maneuvers, we can give a positive reward for following the trajectory: if the helicopter follows the desired trajectory, it receives a positive reward, and there is a negative reward for crashing the helicopter. We saw that different problems can be formulated under reinforcement learning. So are these problems very different from each other?
Or can we find something common among them? We use sequential decision making to unify them under a common goal. The common goal of all those tasks is to select actions that maximize total future reward. We may have to plan ahead, because sometimes the reward is not obvious immediately. For example, in chess we receive the reward only after winning or losing the game. So the rewards may be delayed, and we may need to sacrifice immediate rewards for better long-term rewards. For instance, some moves in chess may not look obviously good, but they may be useful in the long run in winning the game. Similarly, in financial investment we give up some money now, so we get a kind of negative reward, hoping that we will get a bigger positive reward in the future. Likewise, you spend on education hoping that the return will be much more than the current expenditure.

4. History and State: In this video, we will study history and state. The history is the sequence of observations, actions, and rewards that the agent has seen so far. Remember that we talked about how the agent and environment interact: the agent takes an action at time t and in turn receives some reward and observation. The history is just the accumulation of such observable variables up to time t. It is very important, because what happens next depends on the history. The agent, which is an algorithm, selects its action based on past experience, and whatever it has seen so far is part of the history. So the agent takes its action depending on the history, and the environment also selects observations and rewards based on the history. One problem with the history is that it keeps growing with time. After some time it becomes very difficult to process the entire history. So we have something called state, which is the information used to determine what happens next.
The state is just a function of the history. One example is to take only the last three observations, since near-term observations are more important than observations that occurred in the remote past. This is just one example; the state can also be some complex function of the history. Then we have something called the environment state. The environment state is the state used by the environment to determine how to generate the next observation and reward. It is usually not accessible to the agent, and even if it is visible, it may not be very useful for the agent in determining its next action. For example, suppose there is a room in which a robot is walking, and currently the robot is here. It has a camera attached, so it has a very narrow view of the environment and can see just this part of it. It has no idea what is in the other parts of the environment. So the agent has its own state, which is the agent's internal representation: the information used by the agent to pick its next action. It can be any function of the history; for example, it could just take the last three states. Another example comes from algorithmic trading, where traders look at moving-average curves. Say this is the 5-day moving average and the other is the 20-day moving average, and here there is a trigger point: it's time to sell. The 5-day moving average takes the last five days' prices into consideration, that is, the last five days' closing prices of a stock, and computes their average. Similarly, the 20-day moving average takes the past 20 days into consideration. So it does not consider the entire history of the stock's prices, just some recent prices, because they are more relevant in determining the next prices. In reinforcement learning, we have something called the Markov assumption.
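The moving-average example can be sketched as a state that is a function of the price history. This is a hedged illustration only: `moving_average` and `agent_state` are hypothetical helper names, and the 5-day and 20-day windows simply mirror the numbers mentioned in the lesson.

```python
def moving_average(prices, window):
    """Average of the most recent `window` prices, or None if too few prices."""
    if len(prices) < window:
        return None
    recent = prices[-window:]
    return sum(recent) / window


def agent_state(price_history, short_window=5, long_window=20):
    """Compress the whole price history into two numbers.

    Only the most recent prices influence the result, so older history
    can be discarded without changing the state.
    """
    return (
        moving_average(price_history, short_window),
        moving_average(price_history, long_window),
    )
```

For instance, `agent_state(list(range(1, 21)))` looks only at the last 5 and last 20 prices, and prepending older prices to the list leaves the result unchanged.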
Under the Markov assumption, we say that the state used by the agent is a sufficient statistic of the history. In order to predict the future, you only need the current state of the environment. A state S_t is Markov if it satisfies this property: the next state given the current state and action is the same as the next state given the entire history and action. The earlier history steps contribute nothing, so if we remove them we get the same thing. If we are in state S_t and take action A_t, then we reach the next state S_{t+1}. We also had other history steps, such as S_1 up to S_{t-1}, but relationships from those earlier states do not exist. Just the current action and the current state are sufficient to determine the next state. This is what we call the Markov assumption, and a state is called a Markov state if it follows this Markov property: the future is independent of the past given the present. These are the past, and this, S_t, is the present. One example is again the trading algorithm with the moving average. The algorithm considers the past 20 days, so one state would be the prices over the last 20 days; this is S_t. S_{t-1} is the prices from 40 days ago until 20 days ago, and so on. The high-frequency trading algorithm does not consider these earlier states; it takes into account only the past 20 days' prices. This is one example of the Markov property.

5. Markov Decision Process: In this video, you'll study an important concept in reinforcement learning called the Markov decision process, or MDP for short. In order to understand MDPs, one needs to understand the different types of environment. An environment can be of two types: fully observable or partially observable. What does this mean? In a fully observable environment, the agent directly observes the environment state, so nothing is hidden from the agent.
The agent knows the rules of the game, and whatever the state is, the agent knows it. Here the observation is the same as the agent state, which is the same as the environment state. When this condition holds, we say that the agent is in a Markov decision process. The other case is a partially observable environment, where the agent only partially observes the environment. One example is a high-frequency trader who is concerned with just a limited part of the price chart. The trader is not concerned with the full price history of the stock, only with a small window of the chart, and the trading algorithms do not use the rest. So the trader does not have access to that data, only partially observes the environment, and makes decisions based on this partial observation. Another example is a robot that is learning to walk. The robot is here and has camera vision, which takes in a very small view of the environment. What it observes is not the complete environment, and it makes decisions based on this partial observation. Similarly, a poker-playing agent observes only the public cards shown to it. In this case the agent state is not the same as the environment state, and when this condition holds the agent is said to be in a partially observable Markov decision process, or POMDP for short. Since the environment is not fully observable to the agent, the agent has to construct its own representation of state. One way of constructing the agent state is to just take the current observation, but this may be too little and may not be enough. On the other hand, the agent can take the complete history as its state, and this is also a valid representation of state. But this may be too much data, because the history keeps growing and contains a lot of redundant data.
In between these extremes, the agent can construct some incremental representation of its state. This is sometimes called a state transition function. It takes into account the past state and the current observation. This is similar to what we call an RNN, or recurrent neural network. Alternatively, the agent can build a probabilistic view of the environment state. The agent may construct a view that with probability P1 the environment is in state S1, with probability P2 the environment is in state S2, and so on up to probability Pn that the environment is in state Sn. This is a Bayesian, or probabilistic, approach, and of course the sum of all these probabilities has to be one. This is another common approach for building an agent state. Another example of a partially observable Markov decision process is a game like, say, Temple Run. Here the agent is running, and it only observes what is shown to it within a small distance. For example, it may encounter fire, in which case it has to jump over it; then it may encounter water, and in the water it may face other obstacles, like a rock, in which case it has to slide around the rock; or it may encounter a log of wood, in which case it has to slide under the wood. So the agent does not have a complete picture of the environment. It only partially observes a small region of the environment, and based on that it creates its own state.

6. Components of RL Agent: In this video, we look inside an RL agent, that is, we look at the components of an agent. There are three components of an agent, and all of these may or may not be present in an RL agent. The first component is a policy, the second is a value function, and the third is a model. The policy defines the agent's behaviour. Every agent has some policy which determines what action the agent will take in a given state.
Remember, the goal of any RL agent is to maximize the expected future return, so the policy should be such that the agent's actions move in that direction. The policy is a map from state to action: it decides, if the agent is in state s, what the action should be. The policy can be either deterministic or stochastic. A deterministic policy tells exactly which action the agent should take, whereas a stochastic policy just gives a probability distribution, that is, with what probability the agent picks each action. The next component is the value function. It is basically a measure of how good or bad a state is, because it is a prediction of future reward. We define the value function under a given policy and for some state as the sum of expected total returns in the future. We use a discounting factor here to give lower weight to rewards far ahead in the future and more weight to immediate rewards. This discounting factor is less than one. The value function is used to evaluate the goodness or badness of a state. If the value function of state S1 is more than the value function of state S2, then we say that S1 is a better state, and the agent will try to move to the state for which the value function is higher. This helps in selecting between actions: if action one takes the agent to state one and action two takes it to state two, and we see that the value function of state one is higher, we will prefer action one. The third and last component is the model. A model is just a view of the environment that the agent builds; the model predicts what the environment will do next. It is not exactly the actual environment, just the agent's view of the environment. The model has two components: a transition model and a reward model.
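The discounting idea used in the value function can be made concrete with a small sketch. It is an illustration under assumptions, not code from the class: the function name `discounted_return`, the symbol `gamma` for the discounting factor, and the sample reward sequences are all chosen here for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k.

    `rewards` lists the rewards in the order they are received, and
    `gamma` is the discounting factor (less than one in the lesson).
    """
    total = 0.0
    # Work backwards so each step applies one more factor of gamma
    # to everything that comes after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With `gamma = 0.5`, three rewards of 1 give 1 + 0.5 + 0.25 = 1.75, so later rewards count for less than immediate ones. A value function estimates the expected value of this quantity from a given state under a given policy.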
The transition model helps in predicting the next state, whereas the reward model helps in predicting the next reward, given some state and the action the agent takes. This is the transition model: it tells the probability of moving from state s to state s' given some action a. The reward model tells what the immediate reward will be if the agent is in some state s and takes an action a. Now let's look at some examples of these three components to get a clearer understanding. This is the actual environment. This is the starting point, and the goal of the agent is to reach here. The agent takes some path, comes along this way, and reaches the goal. Along the way the agent builds a view of the environment: it learns that it came from here and then reached the goal. So this is the view of the environment as per the agent. It has no idea what is in this other part of the environment, so this is not the complete environment, but the agent's view of it. Now the policy. This is a policy map of the agent. It says that if the agent is in this state it should go up; if it is in this state it should go right; and similarly, if it is here, go right. Ultimately, this policy leads towards the goal. So this policy, as we studied, determines the agent's actions. Now let's look at the value function. This is the same grid example that we saw. The value function is the sum of expected cumulative reward in the future from a given state. These are the states very close to the goal. If the agent is in this state, the value function is minus one, because next it will go to the goal. If the agent is here, the expected return is minus two. If the agent is here, the values continue: minus three, minus four, and so on.
Continuing, the values are minus five and minus six here, and similarly for this cell. As we go from one state to the next, we add the corresponding reward. If the agent is here, it will have minus seven, because the agent adds the reward for one more step to this value; similarly, it is minus eight here. So the value function is defined over the different states, and these grid positions are the different states. This is state (0,0), this is (0,1), and this is (1,0). The value function itself is good enough to determine what action the agent will take. Say the agent is at this starting position. It can go either here or here, but it will see that the value function of (1,0), which is minus six, is more than the value function of (0,1), which is minus eight. So it knows that state (1,0) is better than state (0,1), and the agent will go here. Now here it has two options, this and this; again it will see that this is the better state, so it will go here, then here, then here, which ultimately leads to the goal. So the value function is very important, and it helps in evaluating the goodness or badness of a given state.

7. Categorising RL Agents: In this video, we will see the various categorisations of RL agents. One categorisation is based on the presence or absence of a value function and policy, and the second categorisation is based on the presence or absence of a model. As per the first categorisation, an RL agent can be value based, policy based, or actor-critic. A value-based agent uses a value function, and here a policy is not required; the policy is implicit. Say you are given an environment. This grid represents an environment, the various cells represent the states the agent can be in, and these are the value functions of each state, denoting the expected future return from that state. Say the agent is in this starting state. It will see that this cell is (0,1).
This one is (1,0), and the starting state is (0,0). The value function of (0,1), which is minus eight, is less than the value function of (1,0), which is minus six. It means that (1,0) is a better state than (0,1), so the agent goes to (1,0). There it again has two ways, towards minus seven or minus five, so it again moves to the better one. With this value function the agent can take its decisions, and a policy is not required. Agents of this kind are called value-based agents. The second type is policy based. Here the agent stores the policy and no value function. If this policy is defined, then when the agent is in this starting state it goes here; in this state the policy says go up; and here go right, and right again, until the agent reaches here. Here the actions are decided by the policy and not by the value function. Agents of this kind are called policy based. The third type is actor-critic. Here the agent stores both the value function and the policy, so it draws on both of these kinds of agents. Now, as per the second categorisation, an RL agent can be either model free or model based. In a model-free RL agent, a policy can be there, or a value function, or both, but there is no model. The agent tries to build a policy or value function from experience in order to maximize the future reward. It does not try to build a model of the environment or to understand the dynamics of the environment, that is, how the environment works. In a model-based agent, a policy and/or value function can be present, and a model is also present. The first task of such an agent is to learn how the environment works: the agent tries to build a model of the environment and then figures out the optimal policy or value function.

8. Learning and Planning: Learning and planning are two important concepts in reinforcement learning.
In sequential decision making, there are two fundamental types of problem: one is reinforcement learning and the second is planning. Let's see the difference between the two. In a reinforcement learning problem, the model of the environment is unknown to the RL agent, so the agent has no idea how the environment functions. The agent interacts with the environment and tries to understand how it functions. It is a kind of trial and error, and based on that the agent tries to improve its policy so that its future rewards are maximized. In a planning problem, the model of the environment is known to the agent, so no interaction is required for exploring the environment. The agent plans by performing computations based on its knowledge of the model of the environment. It is thinking and planning ahead, as compared to trial and error in the reinforcement learning problem, and based on that the agent tries to improve its policy to get more reward in the future. Let's take an example. An example of planning is that you are told the rules of the game: say you are playing a chess game, and you know which steps are valid and which are invalid because you were told beforehand. Your task is then to plan: what if I move here, or what will happen two steps after this move? It is a kind of thinking ahead, or planning ahead. On the other hand, an agent may not be told how the chess game functions, and it will just try to explore the environment. It will try to go here and get the feedback that it is an invalid move; it will try several other moves and learn from the feedback which moves are valid or invalid. After some time it will figure out the rules of the environment, that is, what the model of the environment is, and then it will try to maximize its return.
So these are the two fundamental problems in reinforcement learning.

9. Exploration and Exploitation: Exploration and exploitation are two fundamental problems in reinforcement learning. Exploration means finding out more about the environment, and this may involve giving up some immediate reward in order to maximize future rewards. To understand this, say an RL agent is in some state s, and from its past experience it has learned to perform some action a1 in this state as per its policy pi. The action moves it to some different state, let's say s', and in turn gives some reward R1, and this reward is positive. One option would be to continue doing this action in this state, that is, keep the policy fixed and keep on getting the reward R1. But there may be other actions available from this state, maybe a2, a3, or many more, some of which are more profitable than a1. Let's say a2 yields R2 and a3 yields R3. It is possible that R2 is less than R1, so it is even worse than the current policy, but it may be that R3 is more than R1. If the agent discovers that there is an action a3 it can take from this state, then it will get a better return. This is exploration, that is, exploring more about the environment. On the other hand, exploitation means just following the profitable information already known: the agent is in state s and has found some action a1 which gives it some reward, so it keeps doing this without exploring for better options. This is known as exploitation. So there is an exploration-exploitation trade-off: when you are exploring, you are losing out on the known reward, since you knew that the action was giving some positive reward. While exploring you may lose those rewards, but on the other hand you may also find a better option that gives you more return in the longer run.
So there is a balance required between exploration and exploitation. So let's see some practical examples of exploration and exploitation. One example is in advertising, where exploitation would mean showing a known profitable ad repeatedly, whereas exploration would mean showing some new ads which may be more profitable in the future. Similarly, say you have some favorite restaurant in your locality, which you figured out maybe by trying several restaurants. Exploitation would mean that you always keep going to your favorite restaurant, whereas exploration would mean trying out some new restaurant in your neighborhood, and it may turn out that the food there is better than at your favorite restaurant, but you may also end up eating some bad food in the process. So here also a balance is required between exploration and exploitation.
10. Action Selection for Exploration vs Exploitation: In this video you will see a few action selection algorithms, which will help us in deciding when to explore and when to exploit. We have already seen that we cannot do exploration and exploitation simultaneously, and we call this the exploration-exploitation trade-off. So we will see two of the popular action selection algorithms. One is a very basic one, which is called epsilon-greedy action selection, and it's a kind of randomized action selection algorithm. And then we will see another algorithm called optimistic initial values. So first let's see epsilon-greedy action selection. Here we choose to exploit most of the time, with a small chance of exploring, based on some randomness. And here epsilon refers to the probability that we choose to explore, so it has to be a value between zero and one. For example, we can choose some action based on the roll of a dice. So consider a situation.
So these are all the six possibilities, and we roll the dice, so we can get a number from 1 to 6. So we can say that if it comes up 1, 2, 3, 4 or 5, then we will exploit, that is, we will pick the known greedy action in the next time step. So in computer science, we have a greedy algorithm paradigm, which says that you take your action based on the immediate reward, and it's used in shortest path finding and minimum spanning tree finding. So you have a couple of action options and then you pick the one which gives you the best immediate reward. So here we will take such a greedy action based on the roll of the dice. So if it comes up 1 to 5, then exploit, that is, take a greedy action. But if it comes up 6, then we will explore some new action for which we don't know what the reward is. So this can be one way of solving the exploration and exploitation problem. And we have seen that epsilon was the probability that we explore. So in this case, we explore one out of six times, so we can say that epsilon in this case is 1/6.
So let's formalize this algorithm. So here A_t refers to the action selected at time step t. Then this can be either the greedy action, and this we will take with a probability of one minus epsilon, or a random action, and this we will take with a probability of epsilon. And there can be different variations of the same algorithm.
Now let's see the second one, which we call optimistic initial values, and we can formalize it in this way. So here Q refers to some initial guess of the value. So basically at time equal to zero, we don't know which action will give more reward. So we optimistically assign some value to those actions. So we're very optimistic, and we assign some positive values to each of those actions. And in the next time step, we will update those values based on what reward we actually got. So earlier it was only a guess.
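The epsilon-greedy rule formalized in this lesson can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not code from the class: the function name `epsilon_greedy`, the dictionary of value estimates `q_values` and the example numbers are all made up for the sketch, and "explore" is taken to mean picking uniformly among all actions.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action: explore with probability epsilon, otherwise exploit.

    q_values is a hypothetical dict mapping each action to its current
    estimated value; it is an assumption of this sketch.
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.choice(actions)
    # Exploit: pick the action with the highest current estimate.
    return max(actions, key=q_values.get)

# Illustrative estimates (made-up numbers) and the dice example, epsilon = 1/6.
q = {"A": 1.0, "B": 2.5, "C": 0.5}
choice = epsilon_greedy(q, epsilon=1 / 6)
```

With `epsilon=1/6`, roughly one call in six explores and the rest return the greedy action, which mirrors the "explore only when the dice shows 6" rule.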
Then when we actually take that action, we will get to know how much reward we get. So we will update those values. So here Q_{n+1} is the new estimate for that action, and Q_n is the previous estimate, and alpha is some factor that can be between zero and one. So let's assume it's 0.5 for our example. And this is the reward R_n we got at the nth time step, minus the previous value. So the update is Q_{n+1} = Q_n + alpha (R_n - Q_n).
So let's see an example, and it will be more clear. Let's say we have three possibilities here, three actions that we can pick. So A, B and C are the possible actions. So initially we don't know which is better. So at time equal to zero, these are the Q values. We are very optimistic here, because it's the optimistic initial values algorithm, and we assign a good value, 5, to each of these actions. So now these are all equal, and we will randomly pick one of these. Let's say we pick A, and we actually got a value of 2. So we were very optimistic, so we assigned a value of 5, but we got 2. So at time equal to one, we picked A, and we did not pick B and C, and we got a reward of 2. So we will update its value at time one. So Q_{n+1} will be Q_n plus half times the difference. Or you can do this in your own way, let's say by taking the average of the previous value and the current value. But let's stick to this formula. We will use alpha equal to 0.5, so Q_{n+1} will be Q_n plus 0.5 R_n minus 0.5 Q_n. So this will become 0.5 Q_n, because Q_n minus 0.5 Q_n is 0.5 Q_n, plus 0.5 R_n, or (Q_n + R_n) divided by two. So this will be Q_{n+1}. So in this case we are estimating Q_1. So we will do Q_0 plus the reward, divided by two, so five plus two divided by two, that is seven by two, or 3.5. So at time equal to one, we have these values.
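The update step from this example can be checked with a short Python sketch. It is a minimal illustration, assuming the update rule Q_{n+1} = Q_n + alpha (R_n - Q_n) with alpha = 0.5 as described in the lesson; the helper name `update_estimate` and the dictionary `q` are hypothetical names introduced only for this sketch.

```python
def update_estimate(q_old, reward, alpha=0.5):
    """Move the old estimate toward the observed reward by a fraction alpha."""
    return q_old + alpha * (reward - q_old)

# Optimistic initial guess of 5 for every action, as in the example.
q = {"A": 5.0, "B": 5.0, "C": 5.0}

# The agent picks A and observes a reward of 2.
q["A"] = update_estimate(q["A"], reward=2.0)
print(q)  # {'A': 3.5, 'B': 5.0, 'C': 5.0}
```

Because alpha is 0.5, each update is simply the average of the old estimate and the new reward, so only the estimate of the action that was actually taken changes.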
So B and C remain the same, because we did not pick these. Now we're estimating the value Q. So we will try to pick the one action which seems most profitable. So in this case, clearly A is 3.5, which is less than B and C. So we will pick one of these. Let's say we pick B, and it fetches a value of 1. So in the next time step, this will remain the same, this will remain the same, and this will become five plus one divided by two, that is, six by two, or 3. So now these are the values at time equal to two, where we were estimating Q of B. Now at time three, we will see that these two are less, and C seems to be the most rewarding, so we will pick this, and let's say we got a value of 3. So the updated values are: these two will remain unchanged, 3.5 and 3, and this becomes five plus three divided by two, or 4. In the next time step, at time four, we will pick again and we will see what reward we get. So at each time step we try to pick the one with the highest value, and then based on the actual reward that we get, we update these values. So this is the optimistic initial values algorithm.
So there are a few limitations to this optimistic initial values algorithm. One is that it drives exploration only in the early phase. After some time, it may stick to one of the actions, which may seem to be optimal at that time. So it's not well suited for non-stationary problems. By this, I mean that there may be cases where some actions were bad earlier, and this optimistic initial values algorithm has correctly discovered that, based on initial exploration. But there may be a possibility that some actions which were not good earlier have now become better options than the remaining ones. So this we call a non-stationary problem, because the rewards of those actions are not stationary; they also change with time. So that action, which was not so good earlier, is now a better option.
But this algorithm will not discover that, because it will try to take the action that looks optimal at each time step. And there is another problem: the initial guess. So we were optimistically giving some value to each action, and these may not be good guesses. These may be very bad guesses. So this is another limitation. But despite these limitations, this algorithm has proven to be an effective action selection method, because maybe these kinds of scenarios are not very common. So it's a very simple and effective algorithm. So I hope you got some intuition of how to select your actions. And the main aim of this lecture was to give you some intuition about that, and hopefully you can draw some inspiration from this and maybe devise your own action selection algorithm, which performs even better than these.
11. Prediction and Control: Prediction and control is another fundamental problem in reinforcement learning. Prediction means computation or estimation of the consequences of an action. The policy here is given, and the goal is to measure how well that policy performs. The policy function is fixed. So if the agent is in state S, then using that policy function it will get exactly what action it has to take given this state. And the goal would be to find out or compute the expected return from this state using the given policy. So the goal is to predict the future. Whereas in control, the policy is not fixed. The agent is in some state S, and it does not know what action to take. So the goal is to find the optimal policy, the policy that will maximize the expected return. So here this policy function is not known, and we have to find it. So here it's all about optimizing the future, whereas in the case of prediction, it's about predicting the future, because the policy is fixed. So let's take one example.
So if this is our maze example, and the agent is in this starting state, and the policy says go to the right, then go up, then right, then up, and then reach the goal. So if this policy is given, in this case the return would be minus one, minus two, minus two, minus three, that is, minus eight. But on the other hand, if the agent is in this starting state and the policy is not fixed, then it needs to find the best policy. Then it will figure out that it can take this path and get a return of minus one, minus two, minus one, minus two, so minus six. So it sees that this is better than this. So it will figure out that this is the optimal policy: from this starting state, go up, up, up and then right. So it will figure out the optimal policy. So this is the main difference between prediction and control.
12. What's next: Congratulations on completing your first course on the introduction to reinforcement learning. You have taken the first step towards mastering reinforcement learning. Now you are familiar with the basic terminology used in reinforcement learning and you're ready to move ahead. So you can now take the second course on reinforcement learning, where we will dive deep into Markov decision processes. So thanks for making it to the end of the course, and hope to see you in the next course.