FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

📰 ArXiv cs.AI

FURINA is a fully customizable role-playing benchmark for large language models, constructed via a scalable multi-agent collaboration pipeline

Published 7 Apr 2026
Action Steps
  1. Identify the limitations of existing role-playing benchmarks
  2. Design a scalable multi-agent collaboration pipeline to construct customizable benchmarks
  3. Implement FURINA-Builder to automatically generate benchmarks at any scale
  4. Evaluate large language models using the generated benchmarks
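The steps above describe the pipeline only at a high level. As a rough illustration, a multi-agent builder like FURINA-Builder might chain specialist agents to emit benchmark items at any requested scale; the agent names, structure, and logic below are hypothetical sketches, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One role-playing evaluation item (hypothetical schema)."""
    character: str
    scene: str
    question: str

class CharacterAgent:
    """Hypothetical agent that proposes role-play characters."""
    def propose(self, i: int) -> str:
        return f"character_{i}"

class SceneAgent:
    """Hypothetical agent that drafts a scene for a character."""
    def draft(self, character: str) -> str:
        return f"A scene featuring {character}."

class QuestionAgent:
    """Hypothetical agent that writes an evaluation question."""
    def write(self, character: str, scene: str) -> str:
        return f"How would {character} respond in: {scene}"

def build_benchmark(n_items: int) -> list[BenchmarkItem]:
    """Chain the agents so benchmark size scales with n_items."""
    char_agent = CharacterAgent()
    scene_agent = SceneAgent()
    q_agent = QuestionAgent()
    items = []
    for i in range(n_items):
        character = char_agent.propose(i)
        scene = scene_agent.draft(character)
        question = q_agent.write(character, scene)
        items.append(BenchmarkItem(character, scene, question))
    return items

bench = build_benchmark(3)
print(len(bench))
```

The point of the sketch is the "benchmark at any scale" property: because each item is produced by the same agent chain, `n_items` can be set arbitrarily large without manual curation.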
Who Needs to Know This

AI researchers and engineers benefit from FURINA's flexible benchmark for evaluating role-playing tasks, while product managers can use it to assess language models for a variety of applications

Key Insight

💡 FURINA enables the creation of fully customizable role-playing benchmarks, addressing the limitations of existing benchmarks
