Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
📰 ArXiv cs.AI
arXiv:2604.13175v1 Announce Type: cross Abstract: Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g., optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear rewa…
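The abstract is cut off before it defines the scalarization, so as a rough orientation only: the standard (hard) Tchebycheff scalarization aggregates objectives as max_i w_i (f_i(x) - z_i*), and a common "smooth" variant replaces the max with a log-sum-exp. The sketch below is a generic illustration of that smoothing under these assumptions; the function name and parameters are hypothetical and the paper's exact formulation may differ.

```python
import numpy as np

def smooth_tchebycheff(losses, weights, ideal, mu=0.1):
    """Log-sum-exp smoothing of the Tchebycheff scalarization.

    Hard form: max_i weights[i] * (losses[i] - ideal[i]).
    As mu -> 0 the smooth value approaches the hard max;
    unlike the max, it is differentiable everywhere.
    (Generic sketch, not the paper's exact objective.)
    """
    terms = np.asarray(weights) * (np.asarray(losses) - np.asarray(ideal)) / mu
    # Numerically stable log-sum-exp: subtract the max before exponentiating.
    m = np.max(terms)
    return mu * (m + np.log(np.sum(np.exp(terms - m))))

# Example: two objectives with equal weights and ideal point at the origin.
val = smooth_tchebycheff([1.0, 2.0], [1.0, 1.0], [0.0, 0.0], mu=0.01)
# With small mu this is close to the hard max, max(1*1, 1*2) = 2.
```

Unlike a linear (weighted-sum) scalarization, the Tchebycheff form can reach Pareto-optimal points on non-convex fronts, which is presumably why the paper adopts it over the linear rewards the truncated last sentence refers to.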