In the previous post, we solved CartPole using the Cross-Entropy Method: sample 200 candidate policies, keep the best 40, refit a Gaussian, repeat. It worked beautifully, reaching a perfect score of 500 in 50 iterations. But 200 candidates per iteration over 50 iterations adds up to 10,000 episode evaluations.
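
As a refresher, here is a minimal sketch of that loop. It is not the code from the previous post: it assumes a diagonal Gaussian over four parameters, and `toy_score` is a hypothetical stand-in for running one CartPole episode so the example runs without an environment. The names `cross_entropy_method`, `evaluate`, and `TARGET` are likewise made up for illustration.

```python
import random
import statistics


def cross_entropy_method(evaluate, dim=4, pop_size=200, n_elite=40,
                         iterations=50, seed=0):
    """Diagonal-Gaussian CEM sketch; returns (final mean, evaluation count)."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    n_evals = 0
    for _ in range(iterations):
        # Sample a population of candidate parameter vectors.
        population = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(pop_size)]
        # Score each candidate once (one "episode" per candidate).
        scores = [evaluate(theta) for theta in population]
        n_evals += pop_size
        # Keep the n_elite highest-scoring candidates.
        ranked = sorted(zip(scores, population),
                        key=lambda pair: pair[0], reverse=True)
        elite = [theta for _, theta in ranked[:n_elite]]
        # Refit the Gaussian to the elites, one dimension at a time.
        columns = list(zip(*elite))
        mean = [statistics.fmean(col) for col in columns]
        # Small floor (an assumption) so the search never collapses to zero width.
        std = [statistics.pstdev(col) + 1e-3 for col in columns]
    return mean, n_evals


# Hypothetical stand-in for an episode return: higher is better,
# with the maximum at TARGET.
TARGET = [1.0, -2.0, 0.5, 3.0]


def toy_score(theta):
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))


best, n_evals = cross_entropy_method(toy_score)
print(n_evals)  # 200 candidates * 50 iterations = 10000
```

The evaluation count is the point here: with one episode per candidate, the cost is simply population size times iterations, regardless of how few parameters are being tuned.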

That got me wondering: do we really need a population of 200 to find four good numbers? The original code