Spinning Up Exercises

Documenting solutions as I work through OpenAI's Spinning Up exercises. I won't write down every detail of every solution, just what I see as the key insights to each exercise. Some of the exercises I couldn't do without checking the reference solutions.

P-Set 1: Basics of Implementation

Exercise 1.1: Gaussian Log-Likelihood

  • Key is to realise that a multivariate Gaussian with diagonal covariance is just an independent univariate Gaussian along each component
  • Since the components are independent, the probability of an n-dim observation is the product of the per-component probabilities, so the log-likelihood is the sum of the per-component log-likelihoods
  • Calculate component-wise before summing, to potentially make use of parallelism (see the sketch below)
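
A minimal NumPy sketch of the computation (the actual exercise asks for a TensorFlow implementation; the function and argument names here are my own):

    import numpy as np

    def gaussian_log_likelihood(x, mu, log_std):
        # Per-component univariate Gaussian log-density:
        # -0.5 * (((x - mu) / std)^2 + 2 * log_std + log(2 * pi))
        pre_sum = -0.5 * (((x - mu) / np.exp(log_std)) ** 2
                          + 2 * log_std
                          + np.log(2 * np.pi))
        # Components are independent, so the joint log-likelihood is the
        # sum over the last axis: [batch_size, dim] -> [batch_size].
        return pre_sum.sum(axis=-1)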

Exercise 1.2: Policy for PPO

  • Mostly a matter of getting familiar with the TF API
  • The MLP solves a regression problem from input state to the mean action of a Gaussian; the log standard deviations are state-independent trainable parameters (sketched below)
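
A rough sketch of the structure, reusing gaussian_log_likelihood from above. Any callable stands in for the MLP here; in the exercise the network is built with the TF API, and all names below are my own:

    import numpy as np

    def mlp_gaussian_policy(obs, mu_net, log_std):
        # mu_net: the MLP, regressing observations to mean actions.
        mu = mu_net(obs)                            # [batch_size, act_dim]
        std = np.exp(log_std)                       # state-independent std
        # Sample by shifting/scaling standard normal noise.
        pi = mu + std * np.random.randn(*mu.shape)
        # Log-likelihood of the sampled action under the policy.
        logp_pi = gaussian_log_likelihood(pi, mu, log_std)
        return pi, logp_pi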

P-Set 2: Algorithm Failure Modes

Exercise 2.1: Value Function Fitting in TRPO

  • TRPO is an advantage-based algorithm (the advantage A(s, a) is how much better off you are taking action a in state s compared to the average over all possible actions a')
  • Advantages are estimated from the learned value function V(s), so it makes sense that training your value function well would help TRPO do well (see the sketch after this list)
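
To make the dependence concrete: the advantages TRPO steps along are built directly from V. A minimal sketch of GAE-style advantage estimation over one trajectory (the estimator Spinning Up uses; names here are my own):

    import numpy as np

    def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
        # values holds V(s_0), ..., V(s_T) -- one extra entry to
        # bootstrap the final transition.
        rewards = np.asarray(rewards, dtype=np.float64)
        values = np.asarray(values, dtype=np.float64)
        # One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        deltas = rewards + gamma * values[1:] - values[:-1]
        # Advantages are discounted sums of TD errors, computed
        # backwards in time.
        adv = np.zeros_like(deltas)
        running = 0.0
        for t in reversed(range(len(deltas))):
            running = deltas[t] + gamma * lam * running
            adv[t] = running
        return adv

A badly fit V makes every delta_t noisy or biased, and that noise propagates straight into the advantage estimates and hence into the policy update.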

Exercise 2.2: Silent Bug in DDPG

  • It is a tensor shape error
  • Scalar-output MLPs should be squeezed to shape [batch_size] instead of [batch_size, 1], since we later perform ops with tensors of shape [batch_size]
  • Weird things happen when you try adding/multiplying a tensor of shape [a] with one of shape [a, 1]: broadcasting silently produces an [a, a] result rather than the elementwise op we want (demonstrated below)
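
A small NumPy demonstration of the failure mode (shapes chosen arbitrarily):

    import numpy as np

    q = np.random.randn(4)           # per-sample values, shape [4]
    target = np.random.randn(4, 1)   # un-squeezed MLP output, shape [4, 1]

    # Broadcasting silently turns the intended elementwise op into an
    # outer combination of shape [4, 4] -- no error is raised.
    print((target - q).shape)                    # (4, 4)

    # The fix: squeeze the scalar-output MLP down to [batch_size].
    print((target.squeeze(axis=-1) - q).shape)   # (4,)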