Spinning Up Exercises¶
Documenting solutions as I work through OpenAI’s Spinning Up exercises. I won’t be writing down every detail of every solution, but rather what I see as the key insights of each exercise. For some of the exercises I had to check the official solutions.
P-Set 1: Basics of Implementation¶
Exercise 1.1: Gaussian Log-Likelihood
- The key is to realise that a diagonal multivariate Gaussian factorises into an independent univariate Gaussian along each component
- Since the components are independent, the density of an n-dim observation is the product of the per-component densities, so the log-likelihood is the sum of the per-component log-likelihoods
- Calculate component-wise before summing, which vectorises well and can make use of parallelism
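The bullets above can be sketched in plain NumPy (the exercise itself uses TensorFlow; the function name and argument layout here are my own, not the Spinning Up reference solution):

```python
import numpy as np

def gaussian_log_likelihood(x, mu, log_std):
    """Log-likelihood of x under a diagonal Gaussian N(mu, exp(log_std)^2).

    All arguments have shape [batch_size, dim]; returns shape [batch_size].
    """
    # Per-component univariate Gaussian log-density (computed in parallel
    # across components).
    pre_sum = -0.5 * (((x - mu) / np.exp(log_std)) ** 2
                      + 2 * log_std
                      + np.log(2 * np.pi))
    # Independence of components: log-probs sum across the last axis.
    return pre_sum.sum(axis=-1)
```

Sanity check: at `x == mu` with unit variance, each component contributes `-0.5 * log(2π)`, so a 2-dim observation gives `-log(2π)`.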
Exercise 1.2: Policy for PPO
- Just need to get familiar with TF API
- MLP solves regression problem of input state to mean action of Gaussian
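A framework-free sketch of that structure (the exercise is in TensorFlow; the shapes and the state-independent `log_std` parameter follow the Spinning Up PPO policy, but the network sizes and names here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights, biases):
    """Tiny tanh MLP; the final layer is linear (regression to the mean action)."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(x @ W + b)
    return x @ weights[-1] + biases[-1]

obs_dim, act_dim, hidden = 4, 2, 32  # illustrative sizes
weights = [rng.normal(size=(obs_dim, hidden)) * 0.1,
           rng.normal(size=(hidden, act_dim)) * 0.1]
biases = [np.zeros(hidden), np.zeros(act_dim)]

# log_std is a trainable parameter independent of the state.
log_std = -0.5 * np.ones(act_dim)

obs = rng.normal(size=(1, obs_dim))
mu = mlp(obs, weights, biases)                              # mean action
action = mu + np.exp(log_std) * rng.normal(size=mu.shape)   # sampled action
```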
P-Set 2: Algorithm Failure Modes¶
Exercise 2.1: Value Function Fitting in TRPO
- TRPO is an advantage-based algorithm (the advantage of action a in state s is how much better off you are taking a than acting according to the policy’s average over all possible actions a’)
- Therefore it makes sense that fitting your value function well would help TRPO do well, since the advantage estimates are computed from it
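To make the dependence concrete, here is a minimal sketch of one-step advantage estimates built from a value function (the function name and one-step form are my own illustration; Spinning Up's implementation uses GAE, a generalisation of this):

```python
import numpy as np

def td_advantages(rewards, values, gamma=0.99):
    """One-step advantage estimates: A(s_t, a_t) ~ r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has one more entry than `rewards` (it includes the final state).
    A poorly fit V makes these estimates noisy and biased, which is the
    failure mode the exercise demonstrates.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]
```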
Exercise 2.2: Silent Bug in DDPG
- It is a tensor shape error
- Should squeeze scalar-output MLPs so that they have shape [batch_size] instead of [batch_size, 1], since we’ll later be performing ops with tensors of shape [batch_size]
- Weird things happen when you try adding/multiplying a tensor of shape [a] with one of shape [a, 1]: broadcasting silently produces shape [a, a], which is not what we want here
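The same broadcasting rule applies in NumPy, so the bug is easy to demonstrate outside TensorFlow (the variable names here are illustrative, not from the exercise code):

```python
import numpy as np

a = np.arange(3.0)             # shape [3]    -- e.g. per-sample targets
b = np.arange(3.0)[:, None]    # shape [3, 1] -- e.g. an unsqueezed MLP output

# Broadcasting silently produces a [3, 3] matrix, not elementwise [3] results,
# so the loss is computed over all pairs instead of matched samples.
wrong = a + b                  # shape (3, 3)

# Squeezing the trailing axis restores the intended elementwise op.
right = a + b.squeeze(-1)      # shape (3,)
```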