Reinforcement learning
- yasobel
- Dec 7, 2020
- 5 min read
Last updated: Dec 18, 2020
The aim of this lab is to optimize a CartPole game using RL.
What is reinforcement learning?
The system learns from experience rather than from labels. The problem is modelled as a Markov Decision Process, defined by four crucial elements: state, transition, action and reward.
In every state, the system chooses an action to try to maximize its reward. Through an iterative learning process, the system gradually performs better.
In the case of the CartPole game, we can describe the system as follows:
State: position and velocity of the cart and of the pole.
Transition: movement of the cart and pole.
Action: accelerate the cart to the left or to the right.
Reward: is the pole still upright?

The CartPole game has many versions; in this lab we use the one with continuous observations and discrete actions (0 for pushing the cart to the left and 1 for pushing it to the right). A reward of 1 is given for every step taken, and the game ends when the pole angle exceeds 12 degrees or the cart moves too far to the left or to the right.
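As a rough sketch, here is what the basic interaction loop looks like, assuming the classic OpenAI Gym CartPole-v0 environment (the lab's own environment wrapper may differ slightly):

```python
import gym

# Classic Gym API (as of 2020): reset() returns the observation,
# step() returns (observation, reward, done, info).
env = gym.make("CartPole-v0")
state = env.reset()            # continuous observation: cart position/velocity, pole angle/velocity
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()            # discrete action: 0 = push left, 1 = push right
    state, reward, done, info = env.step(action)  # reward of 1 for every step survived
    total_reward += reward
print("episode return:", total_reward)            # episode ends once the pole tilts past ~12 degrees
```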
In this lab, we use four important classes: QNetwork, Doer, Experience Replay and Learner. I'll briefly explain each of them.
-> QNetwork:
Implements the Q-value function. This is where we define the layers of the network (input, hidden, output). All layers are linear (fully connected), and the output layer has one unit per action.
The class returns a tensor containing one value per action.
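As a rough sketch, a PyTorch version of such a QNetwork might look like this (the layer sizes, names and Sigmoid activation are assumptions based on the rest of the post, not the lab's exact code):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=4, width=16, n_actions=2):
        super().__init__()
        self.hidden = nn.Linear(state_dim, width)   # input -> hidden, linear layer
        self.activation = nn.Sigmoid()              # activation used in the early experiments (see step 11)
        self.out = nn.Linear(width, n_actions)      # fully connected output: one value per action

    def forward(self, state):
        # returns a tensor with one Q-value per action
        return self.out(self.activation(self.hidden(state)))
```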
-> Doer:
This class determines which action is taken, 0 or 1. It chooses the action with the maximum value in the tensor: if the maximum corresponds to action 0 the cart is pushed to the left; otherwise it corresponds to action 1 and the cart is pushed to the right.
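A minimal sketch of this epsilon-greedy selection (class and method names are assumptions):

```python
import random
import torch

class Doer:
    def __init__(self, q_network, epsilon=0.3):
        self.q_network = q_network
        self.epsilon = epsilon

    def act(self, state):
        if random.random() < self.epsilon:
            return random.choice([0, 1])    # explore: push left or right at random
        with torch.no_grad():
            q_values = self.q_network(torch.as_tensor(state, dtype=torch.float32))
        return int(torch.argmax(q_values))  # exploit: 0 -> left, 1 -> right
```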
-> Experience Replay:
This class provides all the relevant data for each transition: the list of states, whether they are terminal or not, the actions, the reward of the transition and the relevance of each transition.
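A simplified sketch of what such a replay buffer could look like (field names and the uniform sampling are assumptions):

```python
import random
from collections import deque

class ExperienceReplay:
    def __init__(self, buffer_size=10000):
        self.buffer = deque(maxlen=buffer_size)   # oldest transitions are dropped first

    def store(self, state, action, reward, next_state, terminal, relevance=1.0):
        # one transition: states, action, reward, terminal flag and a relevance score
        self.buffer.append((state, action, reward, next_state, terminal, relevance))

    def sample(self, batch_size):
        # return a random batch of transitions for the Learner
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```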
-> Learner:
Finally, this class trains the QNetwork using batches sampled from the Experience Replay.
In this class, we can use either the SARSA algorithm or the Q-learning (QLEARNING) algorithm.
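A rough sketch of a SARSA-style update on a batch of tensors from the Experience Replay (assuming a PyTorch QNetwork; the real Learner's interface may differ):

```python
import torch
import torch.nn.functional as F

class Learner:
    def __init__(self, q_network, gamma=0.95, lr=1e-3):
        self.q_network = q_network
        self.gamma = gamma
        self.optimizer = torch.optim.Adam(q_network.parameters(), lr=lr)

    def train_step(self, states, actions, rewards, next_states, terminals, next_actions):
        # Q(s, a) for the actions that were actually taken
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # SARSA target: bootstrap from the action the Doer actually took next
            next_q = self.q_network(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
            targets = rewards + self.gamma * next_q * (1 - terminals)
        loss = F.mse_loss(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```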
At first, I didn't modify the initialization values (see the sketch after this list). I had:
-> a width of 16 for the QNetwork
-> a Doer with an epsilon of 0.3
-> a Learner with a gamma of 0.95 and the SARSA algorithm
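How these initial settings might be wired together, reusing the hypothetical classes sketched above:

```python
q_network = QNetwork(state_dim=4, width=16, n_actions=2)
doer = Doer(q_network, epsilon=0.3)
replay = ExperienceReplay(buffer_size=10000)   # buffer size is an assumption
learner = Learner(q_network, gamma=0.95)       # SARSA update, as sketched above
```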
I obtained the following performance, which is not very good: the scores are low and they clearly don't increase over time.

1/Modifying the width:
Next, I modified the width of the QNetwork from 16 to 128 to see how it affects the scores. In this case, training is slower; the score starts at around 9 and increases to about 13. But I noticed that it always reaches a peak and then decreases, so it is not simply increasing over time.

2/Modifying the algorithm:
This time, I changed the weight-update algorithm from SARSA to Q-learning (QLEARNING):
It still runs slowly, but unlike before, we get higher scores that start at around 11.
Overall, the agent performs better in this case.
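The conceptual change is the target: instead of bootstrapping from the next action actually taken (SARSA), Q-learning bootstraps from the best next action. A sketch of that target, reusing the hypothetical tensors from the Learner sketch above:

```python
import torch

def q_learning_target(q_network, rewards, next_states, terminals, gamma):
    # Q-learning target: max over all next actions, regardless of what the Doer did
    with torch.no_grad():
        best_next_q = q_network(next_states).max(dim=1).values
        return rewards + gamma * best_next_q * (1 - terminals)
```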

3/Modifying the Doer:
This time, I wanted to see what happens when the Doer is random. In this case, the agent doesn't improve over time: the score plot looks random, and so do the transitions within an episode.
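A random Doer is effectively epsilon = 1: the QNetwork is never consulted, so there is nothing to improve. A minimal sketch:

```python
import random

class RandomDoer:
    def act(self, state):
        return random.choice([0, 1])   # push left or right, ignoring the state entirely
```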


4/Modifying the Experience Replay (4 of them):
If sortTransition is set to True in the Experience Replay, the least relevant transitions are removed first, and the data becomes off-policy (which lets the learner make better use of it).
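As a loose sketch of what sortTransition=True might mean (this is an assumption about the behaviour, not the lab's implementation): when the buffer is full, evict the least relevant transition instead of the oldest one.

```python
class SortedExperienceReplay:
    def __init__(self, buffer_size=10000):
        self.buffer_size = buffer_size
        self.buffer = []   # list of (transition, relevance) pairs

    def store(self, transition, relevance):
        self.buffer.append((transition, relevance))
        if len(self.buffer) > self.buffer_size:
            # keep the buffer sorted by relevance and drop the least relevant entry
            self.buffer.sort(key=lambda pair: pair[1])
            self.buffer.pop(0)
```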
This time, I had 4 Experience Replays, as follows:

I kept a width of 512 and a gamma of 0.9, and I got the following results:


The score increases sharply but hits a wall at around 65, then stabilises at around 60.
As for the reward, there are no sharp drops, but it's not very stable.
We also notice in the tensors that the transitions are not very smooth: we go from around 9 to 5, then to 3, then to 1, then to 0.
The batch size is clearly a limiting factor; increasing it helped improve the system's performance.
Having a bigger batch size and buffer size makes the system reach very high scores, meaning that the pole is being kept balanced, but there is a risk that the cart quickly goes out of bounds.
5/Modifying the Experience Replay (2 of them):
This time, I tried keeping only 2 Experience Replays with sortTransition set to True, and I got the following results:

The score hits a wall at around 40, then drops a little, and the reward curve isn't stable.
The episode was very long this time, but it wasn't smooth at all: there was a drop from around 6 to 2, which is not what we're looking for.
6/Modifying gamma to 0.85:
Maybe keeping the same configuration as before and decreasing gamma will give us better transitions during an episode.
If gamma were equal to 0.99, the first values would be around 100 (with a reward of 1 per step, the discounted return of a long episode approaches 1/(1 - gamma)) and the system would have more trouble transitioning smoothly.
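A quick check of that geometric sum for the gammas used in this lab:

```python
# With a reward of 1 per step, the return of a long episode approaches 1 / (1 - gamma).
for gamma in (0.85, 0.90, 0.95, 0.99):
    print(gamma, "->", round(1 / (1 - gamma), 1))
# 0.85 -> 6.7, 0.9 -> 10.0, 0.95 -> 20.0, 0.99 -> 100.0
```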
In this case, I got an episode that is much longer than before, but there was still no smooth transition and we don't even reach 0 at the end (as we can see in the second image, which shows the last transitions of the episode).


7/Train every 10 transitions instead of every transition:
Overall, I noticed that having 4 Experience Replays works better than 2, so this time I kept all of them and trained the network every 10 transitions instead of at each transition.
Why do this? Because the amount of data is a limiting factor and the training is what slows the process down.
So maybe, if we train the network less often but with more data, we'll get better results (see the sketch below).
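A sketch of this modified loop, reusing the hypothetical objects from the earlier sketches (prepare_batch, which would turn sampled transitions into tensors, is a hypothetical helper):

```python
TRAIN_EVERY = 10
batch_size = 128     # assumption
step_count = 0

state = env.reset()
done = False
while not done:
    action = doer.act(state)
    next_state, reward, done, _ = env.step(action)
    replay.store(state, action, reward, next_state, done)   # store every transition
    state = next_state
    step_count += 1
    if step_count % TRAIN_EVERY == 0:
        # train less often, but on more accumulated data
        learner.train_step(*prepare_batch(replay.sample(batch_size)))
```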
These modifications did indeed make the training phase faster, but they didn't entirely solve the problem.
As we can see, the performance increases and stays at around 15, but that's not high enough.

As for the tensor, the transitions are smoother, but it's not perfect yet.

8/Modifying the reward to 0.1:
In this case, setting the reward to 0.1 didn't help, since the highest score was around 13.

9/Modifying epsilon to 0.1:
The higher the epsilon, the more random the trajectories are, so the system can explore more.
The score increases but hits a wall at around 21 and then decreases a little bit.

As for the episode, the values are mostly around 7 and 6, and at the end they drop to 1.
So this modification didn't give us better results.
10/Modifying the QNetwork:
I modified the network as follows and set epsilon back to 0.3.


But the results were still not satisfactory: the score is not that good (it hits a wall at around 20) and doesn't increase significantly.

As for the episode, the transitions are actually not that bad and they're kind of smooth.

11/From Sigmoid to LeakyReLU in the QNetwork:
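Assuming the PyTorch QNetwork sketched earlier, the change boils down to swapping the activation module:

```python
import torch.nn as nn

# activation = nn.Sigmoid()     # before
activation = nn.LeakyReLU()     # after: avoids the saturation of the Sigmoid
```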


Final results:
So, to try to optimize the system, I chose the following parameters based on the experiments I had done earlier:
A reward of 0.1.
For the layers, I used the same configuration as in step 10.


As for the initialization of the training, I had the following parameters: the Experience Replay setup from step 4, a width of 512, a gamma of 0.9 and the Q-learning (QLEARNING) algorithm from step 2.

And finally, I chose to train every 10 transitions instead of at each transition.
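Putting it together, a sketch of this final configuration with the hypothetical classes from above (the exact layer configuration of step 10 isn't reproduced here):

```python
q_network = QNetwork(state_dim=4, width=512, n_actions=2)   # LeakyReLU activation as in step 11
doer = Doer(q_network, epsilon=0.3)
learner = Learner(q_network, gamma=0.9)                     # using the Q-learning target from step 2
STEP_REWARD = 0.1                                           # reward scaled to 0.1 per step
TRAIN_EVERY = 10                                            # train every 10 transitions
```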

After training for some time, I got the following results:

The score was actually pretty high: it reached 70, then decreased a little and stabilised after that.
As for the episode, it was neither long nor short, the transitions were a little better, and we do reach a value close to zero at the end.
If you want to check the code, here's a link to the Colab notebook: https://colab.research.google.com/drive/1D3kSlph9PofB_se0AI3yKAMtKe7tIaeG#scrollTo=nxRMOc32sYCB