Recent AI systems have achieved strong results on various benchmarks, but evaluation problems have kept them from economically meaningful deployment across many professional domains. Researchers have proposed several solutions, including Agents' Last Exam (ALE), RLinf-VLA, SkillOpt, Harness-1, SCAIL-2, WhisperKit, Mirage, SearchSwarm-30B-A3B, Docling, Agent Lightning, MinerU2.5, GLM-4.5, and Cosmos 3, ...