Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

📰 arXiv cs.AI

arXiv:2604.00536v1 (Announce Type: cross)

Abstract: Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over doma […]

Published 2 Apr 2026