From Machine Learning

[D] Tested model routing on financial AI datasets — good savings and curious what benchmarks others use.

I ran a benchmark to evaluate whether routing prompts by complexity delivers meaningful cost savings, using public HuggingFace datasets. Here's what I found.

Setup

Baseline: Claude Opus for everything. Tested two strategies:

  • Intra-provider — routes within same provider by complexity. Simple → Haiku, Medium → Sonnet, Complex → Opus
  • Flexible — medium prompts go to self-hosted Qwen 3.5 27B / Gemma 3 27B. Complex always stays on Opus
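To make the two strategies concrete, here is a minimal sketch of them as lookup tables. The model labels are placeholders rather than real API model IDs, and the simple tier under Flexible is an assumption (the setup above only says where medium and complex prompts go).

```python
from enum import Enum


class Tier(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


# Placeholder labels, not real API model identifiers.
INTRA_PROVIDER = {
    Tier.SIMPLE: "haiku",
    Tier.MEDIUM: "sonnet",
    Tier.COMPLEX: "opus",
}

FLEXIBLE = {
    Tier.SIMPLE: "haiku",            # ASSUMPTION: not stated in the post
    Tier.MEDIUM: "self-hosted-oss",  # e.g. a self-hosted open-weights model
    Tier.COMPLEX: "opus",            # complex always stays on Opus
}


def route(tier: Tier, strategy: dict) -> str:
    """Return the model label a prompt of the given tier is sent to."""
    return strategy[tier]


if __name__ == "__main__":
    for tier in Tier:
        print(tier.value, route(tier, INTRA_PROVIDER), route(tier, FLEXIBLE))
```

The only difference between the strategies is the medium tier, which is where the Flexible numbers diverge from the intra-provider ones.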

Datasets used

All from AdaptLLM/finance-tasks on HuggingFace:

  • FiQA-SA — financial tweet sentiment
  • Financial Headlines — yes/no classification
  • FPB — formal financial news sentiment
  • ConvFinQA — multi-turn Q&A on real 10-K filings

Results

| Task | Intra-provider | Flexible (OSS) |
|---|---|---|
| FiQA Sentiment | -78% | -89% |
| Headlines | -57% | -71% |
| FPB Sentiment | -37% | -45% |
| ConvFinQA | -58% | -40% |

Blended average: ~60% savings.
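For anyone checking the arithmetic, this is one way to land near 60% from the per-task numbers. It assumes an unweighted mean across tasks and strategies; the post does not say how the blend was actually weighted (for example by token volume), so treat it as a sanity check only.

```python
# Reported savings per task, in percent: (intra-provider, flexible).
savings = {
    "FiQA Sentiment": (78, 89),
    "Headlines": (57, 71),
    "FPB Sentiment": (37, 45),
    "ConvFinQA": (58, 40),
}


def mean(values):
    return sum(values) / len(values)


intra = [pair[0] for pair in savings.values()]
flexible = [pair[1] for pair in savings.values()]

intra_mean = mean(intra)             # 57.5
flexible_mean = mean(flexible)       # 61.25
overall_mean = mean(intra + flexible)  # 59.375, i.e. roughly 60%

if __name__ == "__main__":
    print(intra_mean, flexible_mean, overall_mean)
```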

Most interesting finding

ConvFinQA showed 58% intra-provider savings despite being a complex multi-turn QA dataset. The scorer correctly identified that many questions inside long 10-K documents are simple lookups even when the surrounding document is complex.

"What was operating cash flow in 2014?" → answer is in the table → Haiku

"What is the implied effective tax rate adjustment across three years?" → multi-step reasoning → Opus
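A toy illustration of the idea of scoring the question rather than the surrounding document. This is not the scorer used in the benchmark; the cue list and thresholds are made up for the example and would need tuning on real data.

```python
import re

# HYPOTHETICAL cue words hinting at multi-step reasoning.
MULTI_STEP_CUES = ("implied", "adjustment", "across", "change in", "compare")


def score_question(question: str) -> str:
    """Classify a question as simple, medium or complex.

    Only the question text is scored; the 10-K context is ignored on purpose.
    """
    q = question.lower()
    hits = sum(
        1 for cue in MULTI_STEP_CUES
        if re.search(r"\b" + re.escape(cue) + r"\b", q)
    )
    if hits >= 2:
        return "complex"
    if hits == 1:
        return "medium"
    return "simple"


if __name__ == "__main__":
    print(score_question("What was operating cash flow in 2014?"))
    print(score_question(
        "What is the implied effective tax rate adjustment across three years?"
    ))
```

With these cues the first question scores simple and the second scores complex, matching the Haiku and Opus routing in the two examples above.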

Caveats

  • Financial vertical only
  • ECTSum transcripts at ~5K tokens scored complex every time, so none were routed to a cheaper model. Still tuning for long-form tasks
  • Quality was verified on representative samples, not with a full automated eval
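A minimal sketch of what sample-based verification can look like for the classification tasks: draw a random sample and measure how often the routed model's label matches the Opus baseline label. The function name and the exact-match criterion are assumptions for illustration, not the procedure used here.

```python
import random


def sample_agreement(baseline, routed, sample_size, seed=0):
    """Fraction of sampled items where routed output matches the baseline.

    Both lists must be aligned: item i in each is the answer to prompt i.
    Matching is exact after trimming whitespace and lowercasing.
    """
    if len(baseline) != len(routed):
        raise ValueError("baseline and routed must have the same length")
    if not baseline:
        raise ValueError("nothing to sample")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = min(sample_size, len(baseline))
    indices = rng.sample(range(len(baseline)), k)
    matches = sum(
        baseline[i].strip().lower() == routed[i].strip().lower()
        for i in indices
    )
    return matches / k


if __name__ == "__main__":
    opus = ["positive", "negative", "neutral", "positive"]
    cheap = ["positive", "negative", "positive", "positive"]
    print(sample_agreement(opus, cheap, sample_size=4))
```

Exact match only makes sense for label-style outputs; free-form answers such as ConvFinQA would need a numeric or judged comparison instead.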

What datasets do you use for evaluating task-specific LLM routing decisions? I'm specifically looking for benchmarks that span simple classification through complex multi-step reasoning.

submitted by /u/Dramatic_Strain7370