THU LLM Benchmark

500 Questions × 6 Models | 15 Categories | Reasoning, Debug, Taiwan-localized | 3 Difficulty Levels

Best Accuracy: Mistral-Small-4 (91.5%)
Highest Throughput: DiffusionGemma-26B (807 tok/s)
Best Chinese: Gemma-4-31B (73.6%) / Nemotron-3-Ultra (71.4%)