Convergence of a L2 regularized Policy Gradient Algorithm for the Multi Armed Bandit
arXiv:2402.06388v4 Announce Type: replace Abstract: Although the Multi Armed Bandit (MAB) and the policy gradient approach are among the most widely used frameworks in Reinforcement Learning, the theoretical properties of the policy gradient algorithm applied to the MAB have received little attention. In this work we investigate the convergence of this procedure when an $L2$ regularization term is present together with the 'softmax' parametrization. We prove convergence under […]
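
To make the setting concrete, here is a minimal sketch of a softmax policy gradient update with an $L2$ penalty on a Bernoulli bandit. It is an illustration under stated assumptions, not the algorithm as specified in the paper: the objective $\sum_a \pi_\theta(a)\,\mu_a - \frac{\lambda}{2}\|\theta\|^2$, the REINFORCE-style estimator, the constant step size, and all names (`run_regularized_pg`, `lr`, `reg`) are hypothetical choices made here for demonstration.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over the arm preferences theta."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()


def run_regularized_pg(means, steps=5000, lr=0.1, reg=0.01, seed=0):
    """Stochastic gradient ascent on expected reward minus (reg/2)*||theta||^2.

    Assumptions (not taken from the paper): Bernoulli rewards with the
    given means, a constant step size lr, and a penalty coefficient reg.
    """
    means = np.asarray(means, dtype=float)
    rng = np.random.default_rng(seed)
    k = len(means)
    theta = np.zeros(k)
    for _ in range(steps):
        pi = softmax(theta)
        arm = rng.choice(k, p=pi)                   # sample an arm from the policy
        reward = float(rng.random() < means[arm])   # Bernoulli reward
        # Gradient of log pi(arm) w.r.t. theta under softmax: e_arm - pi
        grad_log = -pi
        grad_log[arm] += 1.0
        # Score-function estimate of the reward gradient, minus the L2 term
        theta += lr * (reward * grad_log - reg * theta)
    return theta, softmax(theta)


if __name__ == "__main__":
    theta, pi = run_regularized_pg([0.2, 0.8])
    print("theta:", theta)
    print("policy:", pi)
```

In this sketch the penalty keeps the parameters bounded, so the policy places most but not all of its mass on the better arm; a smaller `reg` moves the limit point closer to the greedy policy.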