I needed to know if the cheaper model was good enough. So I built an LLM-as-a-Judge pipeline
📰 Dev.to · archminor
Benchmarks are useful, but they don't really tell me whether a prompt change or cheaper model is good...