Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

Umar Jamil · Beginner ·📄 Research Papers Explained ·2:15:13 ·2y ago

Skills: RL Foundations 90% · Reading ML Papers 70%

Key Takeaways

Reinforcement Learning from Human Feedback (RLHF) is explained with math derivations and PyTorch code, providing insights into model alignment.

Full Transcript

Hello guys, welcome back to my channel. Today we are going to talk about reinforcement learning from human feedback (RLHF) and PPO. Reinforcement learning from human feedback is a technique used to align the behavior of a language model with what we want the language model to output. For example, we don't want the language model to use curse words, or to behave in an impolite way toward the user, so we need to do some kind of alignment. RLHF is one of the most famous techniques for this, even if there are now newer techniques like DPO, which I will talk about in another video. Reinforcement learning from human feedback is also how they created ChatGPT, that is, how they aligned ChatGPT to the behavior they wanted.

The topics of today are as follows. First I will briefly introduce language models: how they are used and how they work. Then we will talk about AI alignment and why it is important. Later we will do a deep dive into reinforcement learning from human feedback. In particular, I will first introduce what reinforcement learning is, then describe the whole reinforcement learning setup: the reward model, what trajectories are, and in particular policy gradient optimization. We will derive the algorithm and also see its problems: how to reduce the variance, advantage estimation, importance sampling, off-policy learning, and so on.

The goal of today's video is to derive the loss of PPO. I don't want to just throw the formula at you; I want to derive the PPO algorithm step by step and also show you the history that led to it, that is, which problems PPO was trying to solve from a mathematical point of view. In the final part of the video we will go through the code of an actual implementation of reinforcement learning from human feedback with PPO. I will not write the code
line by line; instead, I will explain the code line by line. In particular, I will show the implementation done by the Hugging Face team. I will not show you how to use the Hugging Face library to do reinforcement learning from human feedback; we will go inside the code of the Hugging Face library and see how it was implemented. This way we can combine the theory we have learned with practice. The code written by the Hugging Face team is somewhat obscure and complex to understand, so I deleted some parts and added my own comments to other parts that were not easy to understand. I hope this makes the code easier for everyone to follow.

There are some prerequisites before watching this video. First, I hope you have some notions of probability and statistics; not much, but at least you should know what an expectation is. We of course need some knowledge of deep learning, for example gradient descent, what a loss function is, and the fact that in gradient descent we calculate a gradient. We also need some basic knowledge of reinforcement learning, even if I will review most of it, so at least you should know what the agent, the state, the environment, and the reward are. One important aspect of this video is that we will be using the Transformer model a lot, so I recommend you watch my previous video on the Transformer if you are not familiar with the concept of self-attention or the causal mask, which will be key to understanding this video.

The goal of this video is to combine theory with practice, so I will always try to give an intuition for formulas that are complex. Don't worry if you don't understand everything at the beginning. I will be giving a lot of theory at the beginning because I cannot show the code without the theoretical knowledge, so don't be
scared if you don't understand everything, because when we look at the code I will go back to the theory line by line, so that we can combine the practical and the theoretical aspects of this knowledge. So let's start our journey.

What is a language model? A language model is a probabilistic model that assigns probabilities to sequences of words. In particular, a language model allows us to compute the probability of the next token given the input sequence. For example, if we have a prompt that says "Shanghai is a city in", what is the probability that the next word is "China"? Or that the next word is "Beijing", or "cat", or "pizza"? This is the kind of probability that the language model is modeling. In my treatment of language models I always make a simplification, which is that each word is a token and each token is a word. This is not always the case, because it depends on the tokenizer we are using, and in most cases it is not like this, but for simplicity, for the rest of the video we will consider that each word is a token and each token is a word.

You may be wondering how we can use language models to generate text. We do it iteratively. If we have a prompt, for example a question like "Where is Shanghai?", we ask the language model what the next token is and, for example, greedily select the token with the highest probability. Say we select the word "Shanghai". We take this word, put it back into the input, and ask the language model again what the next token is. The language model tells us the probabilities of the next token, and we select the most probable one; suppose it is the word "is". We take it, put it back in the input, and again ask the language model what the next token is. Suppose the next token is "in". We take it, put it back in the input, and ask the language model again
what the next token is, and so on, until we reach a maximum number of generated tokens or we believe that the answer is complete. In this case we can stop because we can see that the answer is complete: "Shanghai is in China" is the answer generated by the language model. This is the iterative process of generating text with a language model, and all language models work like this.

Now, what is the topic of AI alignment? A language model is usually pre-trained on a vast amount of data, which means billions of web pages, or the entirety of Wikipedia, or thousands of books. This gives the language model a lot of knowledge that it can retrieve, and it learns to complete a prompt in a reasonable way. However, this does not teach the language model to behave in a particular way. For example, just by pre-training we do not teach the language model not to use offensive language, racist expressions, or curse words. To do this, and to create, for example, a chat assistant that is friendly to the user, we need to do some kind of alignment. So the topic of AI alignment is to align the model's behavior with some desired behavior.

Let's talk about reinforcement learning. Reinforcement learning is an area of artificial intelligence concerned with training an intelligent agent to take actions in an environment in order to maximize some reward that it receives from the environment. Let me give you a concrete example. Imagine we have a cat that lives in a very simple world: a room made up of a grid of cells, where the cat can move from one cell to another. In this case our agent is the cat, and this agent has a state, which describes, for example, the position of the agent. The state of the cat can be described by two variables: the x-coordinate and the y-coordinate of the cat's position. Based on the state,
the cat can choose to take some actions, which could be, for example, to move down, left, right, or up. Every time the cat takes an action, it moves to a new position and at the same time receives some reward from the environment, according to this reward model. If the cat moves to an empty cell, it receives a reward of zero. If it moves to the broom, it receives a reward of minus one, because my cat is scared of the broom. If, after a series of states and actions, the cat arrives at the bathtub, it receives a reward of minus 10, because my cat is super scared of water. However, if the cat somehow manages to arrive at the meat, it receives a big reward of plus 100.

How should the cat move? There is a policy that tells us the probability of the next action given the current state. So the policy describes, for each position, that is, for each state of the cat, with what probability the cat should move up, down, left, or right. The agent can then either sample an action at random according to these probabilities or select the action with the highest probability, which is the greedy strategy, and so on. The goal of reinforcement learning is to learn a probability distribution, that is, to optimize a policy, such that we maximize the expected return when the agent acts according to this policy. This means we should have a policy that, with very high probability, takes us to the meat, because that is one way to maximize the expected return in this case.

Now you may be wondering: I can see the cat as a reinforcement learning agent, and the reinforcement learning setup makes sense for the cat, the meat, and all these rewards, but what is the connection between reinforcement learning and language models? Let's try to clarify this. You can think of the language model as a policy itself. As we
saw before, the policy is something that, given the state, tells you the probability of each action you could take in that state. In the case of the language model, we know that it tells you, given a prompt, the probability of the next token. So we can think of the prompt as the state and the next token as the action that the language model can choose to perform. This leads to a new state, because every time we sample a next token we put it back into the prompt, and then we can ask the language model again what the next token is, and so on. As you can see, we can think of the language model as the reinforcement learning agent itself and also as the policy itself, in which the state is the prompt and the action is the next token that the language model chooses according to some strategy, which could be greedy, top-p, top-k, and so on. The only thing we are missing here is the reward model. How can we reward the language model for good responses and penalize it for bad responses? This is done through a reward model that we have to build. Let's see how.

Imagine we want to create a reward model for our language model, which will become our reinforcement learning agent. To reward the model for generating a particular answer to a question, we could create a dataset like this, made of questions and answers generated by the model. For example, imagine we ask the model "Where is Shanghai?" and the language model says "Shanghai is a city in China". We should give some reward to this answer that says how good it is. In my case I would give it a high reward, because I believe the answer is short and to the point, but some other people may think that this answer is too short and prefer an answer that is a little longer. Or take this case: "What is 2 plus 2?", and suppose that our language model only says the word "four". In my case,
I believe this answer is too short and could be a little more elaborate, but some other people may think that this answer is good enough. So what kind of reward should we give to this answer, or to this one? As you can see, it is not easy to come up with a number that everyone accepts. We humans are not very good at finding common ground on an absolute score, but fortunately we are very good at comparing, so we will exploit this fact to create the dataset for training our reward model.

What if, instead of generating one answer, we generate multiple answers using the same language model? This can be done, for example, by using a high temperature. Then we can ask a group of people, expert labelers in this field, to choose which answer they prefer. Having this dataset of preferences, we can then create a model that generates a numeric reward for each question and answer. So first we create a dataset of questions, then we ask the language model to generate multiple answers for the same question, for example by using a high temperature, and then we ask people to choose which answer they prefer. Our goal is to create a neural network that acts as a reward model: a model that, given a question and an answer, generates a numeric value in such a way that the answer that was chosen gets a high reward and the answer that was not chosen, which is something we don't like, gets a low reward. Let's see how it is done.

What we do in practice is take a pre-trained language model, for example the pre-trained Llama, and feed it the question and the answer. The input tokens you can see here are the question and the answer concatenated together. We give them to the language model as input. The language model is a Transformer model, so it generates some output embeddings, which are called hidden states. As you know, the input is the tokens,
which are converted into embeddings; then we add the positional encoding and feed them to the Transformer layers. The Transformer layers output some embeddings, which are called hidden states. Usually, for text generation, we take the last hidden state and send it to a linear layer that projects it into the vocabulary, then we apply the softmax and select the next token. But here we do not want to generate text; we just want to generate a numeric reward. So we replace the linear layer that projects the last hidden state into the vocabulary with another linear layer that has only one output feature. It takes an embedding as input and generates a single value as output, which is the reward assigned to the answer for the given question.

This is only the architecture of the model; we also need to train it. We need to tell this model that it has to generate a high reward for answers that are chosen and a low reward for answers that are not chosen. Let's see the loss function we will use to train this model. It is this one: minus the log of the sigmoid of the difference between the reward assigned to the good answer and the reward assigned to the bad answer. Let's analyze it. There are two possibilities: this difference is either negative or positive. First of all, how do we train the model? Our dataset is made up of questions and possible answers; suppose there are only two possible answers per question, a good one and a bad one. We feed the question to the model along with the answer concatenated to it, and the model generates a reward. We do it
for the good answer and also for the bad answer, and the model generates two rewards. Suppose this is the reward for the good one and this is the reward associated with the bad one. Suppose the model assigned a high reward to the good answer and a low reward to the bad one, so this difference is positive, which is what we want. Let's see how the sigmoid function behaves when its input is positive: it gives an output between 0.5 and 1. When the log receives an input between 0.5 and 1 (you can think of the sigmoid as being inside parentheses here), it generates a negative number between about minus 0.69 and 0. With the minus sign in front, it becomes a positive number between 0 and about 0.69, which is minus the log of 0.5. So the loss in this case is small.

Now let's see what happens if the model gave a high score to the bad response and a low score to the good response. Let's start again: here is the bad response and here is the good response. If this value is smaller than this value, the difference is negative. When the sigmoid receives a negative input, it returns an output between 0 and 0.5. When the log sees an input between 0 and 0.5, it returns a negative number between minus infinity and about minus 0.69. Because there is a minus sign in front, it becomes a positive number that can be very big.
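The two cases analyzed here can be checked numerically. The following is a minimal sketch in plain Python, not the Hugging Face implementation: the function name `pairwise_reward_loss` and the example reward values are illustrative assumptions. In PyTorch the same quantity would typically be computed with `torch.nn.functional.logsigmoid`.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss = -log(sigmoid(r_chosen - r_rejected)) for one preference pair."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))


# Illustrative reward values (assumptions, not from the video).
good_ordering = pairwise_reward_loss(2.0, -1.0)  # chosen scored higher: small loss
bad_ordering = pairwise_reward_loss(-1.0, 2.0)   # chosen scored lower: large loss
tie = pairwise_reward_loss(0.5, 0.5)             # equal rewards: -log(0.5) = ln 2

print(f"good ordering: {good_ordering:.4f}")
print(f"tie:           {tie:.4f}")
print(f"bad ordering:  {bad_ordering:.4f}")
```

The loss at a tie is ln 2 (about 0.69), which is the boundary value between the "small loss" and "big loss" regimes.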
The loss in this case will be big, whereas in the previous case it was small. As you can see, when the reward model gives a high reward to the good answer and a low reward to the bad answer, the loss is small. However, when the reward model gives a high reward to the bad answer and a low reward to the good answer, the loss is very big. What does this mean for the model? It forces the model to always give high rewards to the winning response and low rewards to the losing response, because that is the only way for the model to minimize the loss, and the goal of the model during training is always to minimize the loss. So the model is forced to give a high reward to the chosen answer and a low reward to the answer that was not chosen, the bad answer.

In Hugging Face, the training of this reward model is implemented in the RewardTrainer class. If you want to train your own reward model, you can use this RewardTrainer class. It takes as input an AutoModelForSequenceClassification, which is exactly this architecture: a Transformer model that, instead of the linear layer that projects into the vocabulary, has a linear layer with only one output feature that gives the reward. If you look at how this is implemented in the Hugging Face library, you will see that they first generate the reward for the chosen answer, the good answer, then they generate the reward for the bad answer, which here is called the rejected response, and then they calculate the loss using exactly the formula we saw: the log-sigmoid of the reward given to the chosen one minus the reward given to the rejected one.

Let's talk about trajectories now. As I said previously, in reinforcement learning the goal is to select a policy, or to optimize a policy, that maximizes the expected return of the agent when the agent acts according to this policy. More formally, we can write it as follows:
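In symbols, this objective can be sketched as follows. The names J for the objective, tau for a trajectory, and R(tau) for its return are assumed here as the usual policy-gradient notation; the slides in the video may use slightly different symbols.

```latex
\pi^{*} = \arg\max_{\pi} J(\pi), \qquad
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right]
       = \int_{\tau} P(\tau \mid \pi)\, R(\tau)\, d\tau
```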
we want to select a policy pi that gives us the maximum expected return when the agent acts according to it. What is the expected return? The expected return of a policy is the expected return over all possible trajectories that the agent can follow when using this policy. As you know, an expectation can also be written as an integral: here it is the integral, over all trajectories, of the probability of the particular trajectory under this policy multiplied by the return of that trajectory.

What is a trajectory? Let's see that first, and later we will see what the probability of a trajectory is. A trajectory is a series of states and actions. In the case of the cat, you can think of a trajectory as a path that the cat can take. Suppose each trajectory has a maximum length, so we don't want the agent to perform more than 10 steps to arrive at its goal. The cat can go to the meat using this path, or this path, or this one; or, for example, it can go forward, then go backward, and then stop because it has already used its 10 steps; or it can go like this, and so on. So there are many, many paths. We want to find a policy that maximizes the expected return, that is, the return we get along these paths, averaged according to how likely each path is.

We will also model the next state of the cat as being stochastic. First, let's introduce states and actions with an example. Suppose our cat starts from some state s0, the initial state. The policy tells us the next action to take given the state, so the cat asks the policy what action it should take. Because the policy is stochastic, it tells us the probability of each next action, just like in the case of the
language model, where given a prompt we get the probability of the next token. Imagine the policy tells us that the cat should move down with very high probability, move right with a lower probability, move left with an even lower probability, and move up with an even lower probability. Suppose we select to move down. This results in a new state that may not be exactly the cell below. Why? Because we model the cat as being drunk, which means the cat wants to move down but may not always move down; we will see later why this is helpful. Another case: imagine we have a robot that wants to move down, but its wheels are broken, so the robot does not actually move down and remains in the same state. So we always model the next state not as deterministic but as stochastic, given the current state and the action we choose to perform. Imagine we choose the action "down": the cat arrives at a new state s1 according to some probability distribution. Then we ask the policy again what the next action should be. The policy may say: move right with very high probability, move down with a lower probability, move left with an even lower probability, and so on. As you can see, we are creating a trajectory, which is a series of states and actions that define how our cat moves.

Now let's see what the probability of a trajectory is. As you can see here, the fact that we chose a particular action depended only on the state we were in; the fact that we arrived at this state depended on the state we were in and the action we chose; and the fact that we chose this next action depended only on the state we were in, because the policy only takes
as input the state and gives us the probability of the action we should take. Because each of these events depends only on the state and action that came just before it, we can multiply their probabilities together to get the probability of the trajectory. So the probability of a trajectory is the probability of starting from a particular starting point, state s0, and then, for each step of the trajectory, the probability of choosing the action given the state, times the probability of arriving at a new state given that we were in state s at time step t and chose action a at time step t. We multiply all these probabilities together.

Another thing to consider is how we calculate the reward of a trajectory. A very simple way is to just sum all the rewards we get along the trajectory. For example, imagine the cat follows this trajectory to arrive at the meat: the reward is 0, 0, 0, and so on, and then suddenly it becomes plus 100 when we reach the meat. If the cat follows this other path, it receives minus one, because the cat is scared of the broom, then 0, 0, and so on, and finally 100. Actually, this is not how we will calculate the reward of a trajectory. We will calculate the reward as discounted, which means that we prefer immediate rewards to future rewards.

To give you an intuition of why, let me first talk about money. If I give you $10,000 today, you prefer receiving it today instead of receiving it in one year. Why? Because you could put the $10,000 in the bank and it would generate some interest, so at the end of the year you would have more than $10,000. In the case of reinforcement learning, discounting is helpful for another reason too. For example, imagine the cat can only take 10 steps to arrive at the meat, or 20 steps.
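The difference between the plain sum of rewards and the discounted sum can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reward lists, the path lengths, and gamma = 0.9 are made up for the example, and rewards are indexed so that the entry at position t is multiplied by gamma to the power of t.

```python
def total_return(rewards):
    """Undiscounted return: plain sum of the rewards along a trajectory."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Discounted return: the reward at position t is weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


gamma = 0.9  # assumed discount factor, between 0 and 1

# Hypothetical trajectories; position 0 is the start, where no reward is received.
direct_path = [0, 0, 0, 0, 0, 0, 0, 0, 100]   # meat reached at step 8
broom_path = [0, -1, 0, 0, 0, 0, 0, 0, 100]   # broom at step 1, meat at step 8
detour_path = [0] * 12 + [100]                # meat reached at step 12

for name, path in [("direct", direct_path), ("broom", broom_path), ("detour", detour_path)]:
    print(name, total_return(path), round(discounted_return(path, gamma), 3))
```

Without discounting, the direct path and the detour both score 100; with discounting, the path that reaches the meat sooner gets the higher return.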
One way for the cat to arrive at the meat is to go directly to it, like this; that is one trajectory. Another way is to go like this: go here, then here, then here, then here, and then here. We prefer the cat to go directly to the meat instead of taking this longer route. Why? Because we modeled the next state as being stochastic, and the longer the route, the higher the probability of ending up in one of these obstacles. So we prefer shorter routes in this case. Discounted rewards are also convenient from a mathematical point of view. We will not work with infinite series here, but in some cases this series is infinite, and it can converge if its terms become smaller and smaller.

Let me give you a practical example of how to calculate the reward in the discounted case. Imagine the cat starts from here and follows this path. To calculate the reward of this trajectory, we take the reward at the first step, which is arriving at the broom, multiplied by gamma to the power of 1, so gamma multiplied by minus one. All the following rewards are zero, so they contribute nothing to the sum. Finally we arrive here, where the reward is plus 100, at time step 1, 2, 3, 4, 5, 6, 7, 8, so it is gamma to the power of 8 multiplied by 100. Gamma is always chosen between zero and one, so it is a number smaller than one. This means we are decaying this reward by gamma to the power of 8, so the longer we take to reach it, the smaller it becomes. This is the intuition behind discounted rewards.

Now you may be wondering: trajectories make sense in the case of the cat. I can see that the cat follows some path to arrive at the meat, and that it can take many paths to arrive at the meat, so we
know what a trajectory is in the case of the cat. But what are the trajectories in the case of a language model? As we saw before, we have a policy, which is the language model itself, because the policy tells us, given the state, what the next action is. We want to optimize this policy so that it selects the next token in a way that maximizes a cumulative reward according to the reward model we built before using the dataset of preferences. In the case of the language model, too, a trajectory is a series of states and actions. The states are the prompts and the actions are the next tokens. Imagine we have a question for the language model like "Where is Shanghai?". This is the initial prompt, so the initial state of the language model. We ask the language model what the next token is, and the token we choose is our action. Then we feed it back to the language model, so it becomes part of the new state, and we ask the language model again what the next token is; it will be, for example, the word "is". This again becomes part of the input of the language model, so the next state, and we ask again what the next token is; for example, we choose the token "in". The concatenation of all these tokens becomes the new state of the language model, and we ask again what the next token is, and so on, until we generate an answer. As you can see, in the case of the language model we also have trajectories, which are the series of prompts and the tokens we have chosen.

Now, our goal is to optimize our language model, which is a policy, such that we maximize a
cumulative reward according to the reward model we built earlier. More formally, our goal is to maximize this function, which is the expected return over all possible trajectories that our language model can generate; as we saw before, a trajectory is a series of prompts and next tokens.

When we optimize a neural network, we use stochastic gradient descent: we have some loss function, we calculate the gradient of the loss function with respect to the parameters of the model, and we change the parameters so that we move against the direction of this gradient. We take little steps against the direction of the gradient to minimize the loss function. In our case we do not want to minimize a loss function; we want to maximize a function, which can be thought of as an objective function. So instead of gradient descent we use gradient ascent. The only difference between the two is that instead of a minus sign in the update we have a plus sign. This algorithm is called policy gradient optimization. The point is that we need to calculate the gradient of our objective function, that is, the gradient of the expected return over all possible trajectories with respect to the parameters of our language model. We need to find an expression for this gradient so that we can calculate it and use it to optimize the parameters of the model using gradient ascent with a learning rate alpha, as you can see here.

Now let's see how to derive the expression of the gradient of this objective function. The gradient of the objective function is the gradient of this expectation, that is, the gradient of the
expectation, over all possible trajectories, of the return of each particular trajectory, with the gradient in front of it. As we know, the expectation is also an integral, so it can be written as the gradient of the integral of the probability of following a particular trajectory multiplied by the return over this trajectory: $\nabla_\theta J(\theta) = \nabla_\theta \int P(\tau \mid \theta)\, R(\tau)\, d\tau$. As you know from high school, the gradient of a sum is equal to the sum of the gradients, or, if you prefer to call it the derivative, the derivative of a sum of functions is equal to the sum of the derivatives. So we can bring this gradient sign inside the integral, and it can be written as $\int \nabla_\theta P(\tau \mid \theta)\, R(\tau)\, d\tau$.

Now we will use a trick called the log-derivative trick to expand the term $\nabla_\theta P(\tau \mid \theta)$. Let me show you how it works; let's use the pen. You may recall from calculus that the gradient with respect to theta of the log of a function, in this case of $P(\tau \mid \theta)$, is 1 over the function multiplied by the gradient with respect to theta of the function that is inside the log: $\nabla_\theta \log P(\tau \mid \theta) = \frac{1}{P(\tau \mid \theta)} \nabla_\theta P(\tau \mid \theta)$. We can take the $P(\tau \mid \theta)$ term to the left side and multiply, and we get $\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$. This is exactly what you see here, so we can replace the expression in the equation above with the expression in the equation below: $\int P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\, d\tau$.

Now we can write this integral back as an expectation over all possible trajectories, because the probability is exactly the first term here: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[\nabla_\theta \log P(\tau \mid \theta)\, R(\tau)]$. Next we need to expand this term here: what is the gradient of the log of the probability of a particular trajectory given the parameters of the model? Let's expand it. We saw before that the probability of
a trajectory is just the product of the probabilities of all the state-action steps in this trajectory: the probability of starting from a particular state, multiplied, for every time step, by the probability of taking a particular action given the state we are in and by the probability of ending up in a new state given that we started from the state at time step t and took the action at time step t. That is, $P(\tau \mid \theta) = p_0(s_0) \prod_t P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$.

If we apply a log to this expression, the product becomes a sum. Let's actually do it. We model our policy pi as parameterized by theta (on the slide I forgot the theta, but it doesn't matter), and the log of the trajectory probability is the log of a series of products, so it can be written as $\log P(\tau \mid \theta) = \log p_0(s_0) + \sum_t \left[ \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t) \right]$: the log of the initial state probability, plus, for each step, the log of the probability of ending up in $s_{t+1}$ given that we were in $s_t$ and took action $a_t$, plus the log of the probability of the action $a_t$ that we took according to our policy given that we were in $s_t$.

Now we are also taking the gradient of this expression with respect to theta. As you can see, the first term, $\log p_0(s_0)$, does not depend on theta, so it can be deleted. The state-transition term does not contain anything that depends on theta either, so it can also be deleted, because the derivative of a term that does not contain the variable we are differentiating with respect to is zero. So the only terms that survive in this summation are the policy terms, because they are the only ones that contain theta. The final expression is this one: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]$. Now let me delete the ink.

So we have derived an expression that allows us to calculate the gradient of the objective function. Why do we need the gradient of the objective function? Because we want to run gradient ascent. Now, one thing that we can see
here is that we still have this expectation over all possible trajectories. In the case of the cat, calculating over all possible trajectories means that we need to calculate this gradient over all the possible paths that the cat can take of a given length, for example 10 steps. So if we want to model trajectories of length 10 only, we need to consider all the possible paths of length 10 that the cat can take, and this could be a huge number. In the case of a language model it's even bigger: imagine we want to generate trajectories of size 100; it means considering all the possible texts of 100 tokens that we can generate using our language model, and for each of them we need to calculate the reward and the log action probabilities, which I will show later how to calculate. As you can see, the problem is that this expectation is over a huge number of terms, so it is computationally intractable to calculate this expression, because we would need to generate a lot, a lot, a lot of text with the language model.

So one way to calculate this expectation is to approximate it with the sample mean. We can always approximate an expectation with the sample mean, so instead of calculating it over all the possible trajectories we calculate it over some trajectories. In the case of the cat, it means that we take the cat and ask it to move using the policy for some number of steps, and this generates one trajectory; we do it many times and it generates several trajectories. In the case of the language model, we have some prompt and we ask the language model to generate some text; then we do it many times, using different temperatures and different sampling strategies, for example sampling randomly with top-p instead of using the greedy strategy, so it generates many texts. Each text represents a trajectory. We do not have to do it over all the possible texts that the language model can
generate, but only some, so we generate only some trajectories. We can then calculate this expression only on the trajectories that our language model generated, $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, R(\tau_i)$, and this gives us an approximation of the gradient. Once we have this gradient, evaluated over the trajectories that we have sampled, we can run gradient ascent with it.

Practically it works like this. In the case of the cat we have some kind of neural network that defines the policy: it takes the state of the cat, which is the position of the cat, and tells us the probability of each next action that the cat could take. We can use this policy, which is not optimized yet, to generate some trajectories. For example, we start from here and ask the policy "where should I go?" We could use the greedy strategy and move down, or, also in this case, we can use top-p to sample the action randomly given the probabilities generated by the network. So imagine the cat goes down, and then we ask the policy again "where should I go?" The policy may say move right, move down, move right, move down, etc., etc. So we generate a trajectory. We do it many times, always sampling randomly according to the probabilities generated by the policy for each state, and in this way we generate many trajectories.

Then we can evaluate them. We know the rewards that we accumulate along each trajectory, so we calculate the reward. We also know the log probability of each action, because for each state we know the probability of the action that we chose. And we also need to calculate the gradient of these log probabilities; this is done automatically by PyTorch when you run loss.backward(), so PyTorch will actually calculate the gradient for you. We do it for all the other sampled trajectories, and this gives us the approximated gradient over the trajectories that we
have collected. We run gradient ascent and we optimize the parameters of the model by taking a step in the direction of the gradient. Then we need to do it again: we collect more trajectories, we evaluate them, we evaluate the gradient of the log probabilities, and we run gradient ascent, so we take one little step in the direction of the gradient. And then we do it again: we collect some trajectories, we evaluate this expression to calculate the gradient of the objective with respect to the parameters of the policy, and we run gradient ascent again, so a little step in the direction of the gradient. This is known as the REINFORCE algorithm in the literature, and we can also use it to optimize our language model.

In the case of the language model we also have to generate some trajectories. One way to generate trajectories would be to use, for example, the database of questions and answers that we built before for the reward model. We have some questions, so we ask the language model to generate some answers for each question using, for example, the top-p strategy, so that, depending on the temperature, it generates many different answers for the same question. This will be a series of trajectories, because the generation process of the language model is an iterative process made up of states, which are the prompts, and actions, which are the next tokens. The result is a list of trajectories for which we have the log probabilities, because the language model generates a list of probabilities over the next token, and we can also calculate the gradient of these log probabilities using PyTorch, because when we run loss.
backward() it will calculate the gradient. But how do we do it in practice? Let's see. We want to calculate this term here, the log probabilities of the action given the state, which for language models means the probability of the next token given a particular prompt. Imagine that our language model has generated the following response: we asked the language model "Where is Shanghai?" and it said "Shanghai is in China". Our language model is a Transformer model, so given an input sequence of embeddings it generates an output sequence of embeddings, which are called hidden states, one for each input token.

As you know, when we use the language model for text generation it has a linear layer that allows us to calculate the logits for each position. Usually we calculate the logits only for the last token, because we want to know what the next token is, but actually we can calculate the logits for each position. For example, we can also calculate the logits for this position, and they indicate the most likely next token given this input, so "Where is Shanghai? Shanghai is". This is because of the causal mask that we apply in the self-attention mechanism: each hidden state encapsulates information about the current token, in this case the token "is", and also about all the previous tokens.

This is a property of the Transformer model that is used during training. During training, as you know, we do not calculate the output of the language model step by step; we just give it the input sentence and the target sentence, which is the shifted version of the input sentence, we do the forward pass, and we calculate the loss using only one forward pass. We can use the same mechanism to calculate the log probabilities for each state and action in this trajectory, which, as I showed you, is a series
of prompts and next tokens. We can calculate the logits for this position, for this position, for this position, and for this position. Usually we then apply the softmax to get the probability of the next token, but in this case we want the log probabilities, so we apply the log-softmax at each position. This gives us the log probability of the next token given only the current token and the previous ones. So for this position it gives us the log probability of the next token given that the input is only "Where is Shanghai? Shanghai".

Of course we do not want all the log probabilities; we only want the log probability of the token that has actually been chosen in this trajectory. What is the actual token that was chosen for this particular position? We know it: it's the word "is", so we select only the log probability corresponding to the word "is". This gives us the log probabilities for the entire trajectory, because now we have the log probability of selecting the word "Shanghai" given the state "Where is Shanghai?", the log probability of selecting the word "is" given the input "Where is Shanghai? Shanghai", the log probability of selecting the word "in" given the input "Where is Shanghai? Shanghai is", etc., etc. So now we have the log probability of each state-action pair in this trajectory. Once we have them, we can ask PyTorch to run the backward step to calculate the gradients, with each gradient multiplied by the reward that we receive from our reward model. We can then calculate this expression and run gradient ascent to optimize our policy based on this approximated gradient.

Now let's see how to calculate the reward for the trajectory. Calculating the reward is a similar process. As you saw before, we have a reward model that is a Transformer model with a linear layer on top that has only one output
feature. So imagine our sentence is the same: "Where is Shanghai? Shanghai is in China." This is the trajectory that has been generated by our language model. We give it to the reward model; the reward model generates some hidden states, because it's a Transformer model, and we apply the linear layer to all the positions that correspond to the actions in this trajectory. The first action is the selection of this word, the second action is this one, then the third and the fourth, so we can generate a reward for each time step. We can just sum these rewards to get the total reward of the trajectory, or we can sum the discounted rewards, which means calculating something like this; let's write it: the reward at time step 0, plus gamma multiplied by the reward at time step 1, plus gamma to the power of 2 multiplied by the reward at time step 2, plus gamma to the power of 3 multiplied by the reward at time step 3, etc., etc., that is, $R(\tau) = r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 r_3 + \dots$. So now we also know how to calculate the reward for each trajectory, which means we know how to evaluate this expression you can see here, and we know how to run gradient ascent to optimize our language model.

The algorithm that I have described is called policy gradient optimization, and it works fine for very small problems, but it is not perfect for bigger problems, for example language modeling. The problem is very simple. Let's write something here. As you saw before, our objective function $J(\theta)$ is an expectation, over all possible trajectories sampled according to our policy, of the reward along each trajectory. We are approximating this expectation with a sample mean: we do not calculate this expression over all possible trajectories, we calculate it only on some trajectories. Now, this is
fair, I mean, the result that we get is an approximation that on average converges to the true expectation, so in the long term it converges to the true expectation, but it exhibits high variance. To give you an intuition of what this means, let's talk about something simpler. Imagine I ask you to calculate the average age of the American population. The American population is made up of about 330 million people. To calculate the average age you would need to go to every person, ask for their birthday, calculate their age, sum all the ages that you collect, and divide by the number of people; this gives you the true average age of the American population. But of course, as you can see, this is not easy to compute, because you would need to interview 330 million people.

Another idea would be to say: okay, I don't go to every American person, I only go to some Americans and I calculate their average age, which could give me a good indication of the average age of the American population. But the result of this approximation depends on how many people you interview: if you interview only one person it may not be representative of the whole population, and even if you interview 10 people it may not be representative. So the more people you interview the better, and this is actually a result that is backed statistically by the law of large numbers and the central limit theorem.

So let's talk about the variance of this estimator. We want to calculate the average age of the American population; suppose that it is 40 years, or 45 years, or whatever. If we approximate it using a sample mean, which means that we do not ask every American but only some Americans for their age, we need to sample some people randomly and ask them their age. Suppose that we interview only one person because we do not have time, and suppose that we are unlucky and this person happens
to be a kindergarten student, so this person will probably say that their age is six. We will get a result that is very far from the true mean of the population. On the other hand, we may again ask some random people, and these people may happen to be, for example, all from retirement homes, so we will get a number that is very high.
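
To make the interview example concrete, here is a minimal sketch in plain Python. It assumes a synthetic population whose ages are drawn uniformly between 0 and 90 (made-up numbers, not census data), and the helper names `make_population` and `sample_mean_estimates` are hypothetical. It repeats the "interview n random people" experiment many times and measures how much the resulting estimates spread around the true mean.

```python
import random
import statistics

def make_population(size=100_000, seed=0):
    """Synthetic ages, uniform between 0 and 90 (an assumption, not real data)."""
    rng = random.Random(seed)
    return [rng.randint(0, 90) for _ in range(size)]

def sample_mean_estimates(population, sample_size, trials=2_000, seed=1):
    """Repeat the experiment 'interview sample_size random people' many times.

    rng.choices samples with replacement, which is fine for a large population.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(trials)
    ]

population = make_population()
true_mean = statistics.fmean(population)

spread = {}
for n in (1, 10, 100):
    estimates = sample_mean_estimates(population, n)
    # Standard deviation of the estimates: how far a single experiment
    # with n interviews typically lands from the true mean.
    spread[n] = statistics.pstdev(estimates)
    print(f"n={n:>3}  average estimate={statistics.fmean(estimates):.2f}  "
          f"spread={spread[n]:.2f}  (true mean={true_mean:.2f})")
```

The average of the estimates stays close to the true mean for every sample size, while the spread shrinks as more people are interviewed. The same reasoning applies to the gradient estimated from a handful of sampled trajectories.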

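Going back to the log probabilities and rewards described earlier, the steps for a single trajectory can be sketched in plain Python. This is only an illustration under stated assumptions: the 4-token vocabulary, the logits and the per-token rewards are made up, the function names are hypothetical, and the logits themselves are treated as the parameters, so the gradient of the log probability of the chosen token is simply one-hot minus softmax. In a real language model the parameters are the network weights, and PyTorch would typically compute the gradient through `loss.backward()` after a log-softmax over the vocabulary.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over the logits of one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def chosen_log_probs(logits_per_position, chosen_ids):
    """Keep, at each position, only the log probability of the token actually chosen."""
    return [log_softmax(row)[tok] for row, tok in zip(logits_per_position, chosen_ids)]

def discounted_return(rewards, gamma):
    """r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def reinforce_step(logits_per_position, chosen_ids, ret, lr):
    """One gradient-ascent step on sum_t log pi(a_t | s_t) * R(tau).

    Simplification: the logits are the parameters, so the gradient of
    log pi(a_t | s_t) with respect to them is one_hot(a_t) - softmax(logits).
    """
    updated = []
    for row, tok in zip(logits_per_position, chosen_ids):
        probs = [math.exp(lp) for lp in log_softmax(row)]
        grad = [(1.0 if j == tok else 0.0) - p for j, p in enumerate(probs)]
        updated.append([x + lr * ret * g for x, g in zip(row, grad)])
    return updated

# Hypothetical vocabulary and made-up logits, one row per state.
vocab = ["Shanghai", "is", "in", "China"]
logits_per_position = [
    [2.0, 0.1, 0.1, 0.3],  # state "Where is Shanghai?"
    [0.2, 1.5, 0.3, 0.1],  # state "Where is Shanghai? Shanghai"
    [0.1, 0.2, 1.8, 0.4],  # state "Where is Shanghai? Shanghai is"
    [0.1, 0.1, 0.2, 2.2],  # state "Where is Shanghai? Shanghai is in"
]
chosen_ids = [0, 1, 2, 3]       # the tokens "Shanghai", "is", "in", "China"
rewards = [0.0, 0.0, 0.0, 1.0]  # made-up per-token rewards

log_probs = chosen_log_probs(logits_per_position, chosen_ids)
ret = discounted_return(rewards, gamma=0.9)
new_logits = reinforce_step(logits_per_position, chosen_ids, ret, lr=0.5)
new_log_probs = chosen_log_probs(new_logits, chosen_ids)

print("before:", [round(lp, 3) for lp in log_probs])
print("after: ", [round(lp, 3) for lp in new_log_probs])
```

Because the return of this trajectory is positive, the step makes every chosen token more likely than before, which is what the gradient-ascent update is meant to do for a rewarded trajectory.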