Hi ShangtongZhang,
I'm confused about the adaptive KL divergence you use in your code to update the actor model (in the case of separate actor and critic models). Your code uses both the clipped objective and the adaptive approx-kl, and the actor model is only updated if $\text{approx-kl} \le 1.5 \times \text{target-kl}$. After reading the PPO paper, I saw that an adaptive KL constraint should belong to TRPO instead, because TRPO imposes a KL constraint in Equation 4. With both constraints applied at once, the clip and the adaptive KL, the actor finds it very hard to get updated.
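For reference, the two objects in question, as written in the PPO paper (Schulman et al., 2017), with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t)$: the TRPO constraint (Equation 4)

$$
\hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta,
$$

and the clipped surrogate objective (Equation 7)

$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right].
$$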
In my view, you are using $L^{CLIP}$ and the TRPO-style $L^{KLPEN}$ at the same time, and I think $L^{KLPEN}$ should be constructed as:
surr = ratio * advantage  # surrogate objective: r_t(theta) * A_t
# adapt the KL penalty coefficient (beta in the paper)
if kl_after < target_kl / 1.5:
    kl_coef /= 2
elif kl_after > target_kl * 1.5:
    kl_coef *= 2
else:
    print("KL is close enough")
# Equation 8 is maximized, so minimize its negation as the loss
actor_loss = -(surr - kl_coef * kl_after).mean()
# Backwarding the actor loss ...

After updating the KL coefficient $\beta$, it is used to compute the loss and the gradient in Equation 8.
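For completeness, Equation 8 and the adaptive $\beta$ rule from the paper are

$$
L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\hat{A}_t - \beta\,\mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right] \right],
\qquad
\beta \leftarrow
\begin{cases}
\beta / 2 & \text{if } d < d_{\text{targ}} / 1.5, \\
\beta \times 2 & \text{if } d > d_{\text{targ}} \times 1.5,
\end{cases}
$$

where $d = \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$ is computed after each policy update.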

And only one of $L^{KLPEN}$ or $L^{CLIP}$ should be used in training, not both.
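To make the contrast concrete, here is a minimal PyTorch sketch of the two alternatives, assuming `log_probs` and `old_log_probs` are per-sample action log-probabilities and `advantages` are the advantage estimates (these names are my own, not from your repo):

```python
import torch

def ppo_actor_loss(log_probs, old_log_probs, advantages,
                   clip_eps=0.2, kl_coef=None):
    """Return either the CLIP loss or the KLPEN loss, never both."""
    ratio = (log_probs - old_log_probs).exp()  # r_t(theta)
    if kl_coef is None:
        # L^CLIP (Equation 7): clipped surrogate, no KL term
        surr = ratio * advantages
        clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
        return -torch.min(surr, clipped).mean()
    # L^KLPEN (Equation 8): unclipped surrogate plus adaptive KL penalty
    approx_kl = (old_log_probs - log_probs).mean()  # estimate of KL[pi_old, pi]
    return -(ratio * advantages).mean() + kl_coef * approx_kl
```

Here `kl_coef` plays the role of $\beta$ and would be adapted between epochs with the rule above; passing `kl_coef=None` selects the clipped objective instead, so only one of the two losses ever drives the gradient.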
