Softmax gradient policy for variance minimization and risk-averse multi-armed bandits

📰 ArXiv cs.AI

arXiv:2604.00241v1 Announce Type: cross Abstract: Algorithms for the Multi-Armed Bandit (MAB) problem play a central role in sequential decision-making and have been extensively explored both theoretically and numerically. While most classical approaches aim to identify the arm with the highest expected reward, we focus on a risk-aware setting where the goal is to select the arm with the lowest variance, favoring stability over potentially high but uncertain returns. To model the decision process …
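The abstract is truncated before the method is described, but the stated setting can be illustrated with a small sketch: a softmax policy over arms whose preferences are updated by a REINFORCE-style gradient step, using the negative of each arm's estimated variance as the risk-averse objective. This is an assumption-laden toy, not the paper's actual algorithm; the arm distributions, step size, and baseline choice below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bandit: three arms with equal means but different noise
# levels; the risk-averse goal is the LOWEST-variance arm (arm 0).
true_sds = np.array([0.1, 1.0, 2.0])
n_arms = len(true_sds)

H = np.zeros(n_arms)      # softmax preferences
counts = np.zeros(n_arms)
means = np.zeros(n_arms)
m2 = np.zeros(n_arms)     # Welford sums of squared deviations
alpha = 0.1               # preference step size

def softmax(h):
    z = np.exp(h - h.max())
    return z / z.sum()

for t in range(5000):
    pi = softmax(H)
    a = rng.choice(n_arms, p=pi)
    r = 5.0 + true_sds[a] * rng.standard_normal()

    # Welford's online update of the pulled arm's mean and variance
    counts[a] += 1
    delta = r - means[a]
    means[a] += delta / counts[a]
    m2[a] += delta * (r - means[a])

    var_est = np.where(counts > 1, m2 / np.maximum(counts - 1, 1), 0.0)
    # Risk-averse "reward" is the negative estimated variance; the
    # policy-averaged value serves as a baseline (REINFORCE-style).
    g = -var_est[a] - (-(pi @ var_est))
    H += alpha * g * ((np.arange(n_arms) == a).astype(float) - pi)

print(softmax(H).argmax())  # the policy should concentrate on arm 0
```

Because arms with above-average estimated variance receive negative gradient signal, the softmax mass drifts toward the most stable arm, which is the behavior the abstract describes.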

Published 2 Apr 2026