🛠 Scalable task synthesis via graph-structured reasoning with topological complexity control
🚀 Cost-efficient training via mid-training of core search-agent subskills
🏆 SOTA performance across both text-only and multimodal benchmarks
Figure 2. **QA Generation.** We construct directed acyclic graphs (DAGs) from Knowledge-Graph entities and Web-Walk hyperlinks, enabling explicit difficulty control via topological complexity. Each node is enriched with multi-source evidence, then sampled to form reasoning paths, with Query Fuzzing (entity/attribute anonymization) applied to increase search difficulty. **Verifier Pipeline.** A cascaded filter progressively validates quality through six stages: LLM difficulty check, QA-graph alignment, Google retrieval verification, hallucination detection, agent rollout confirmation, and answer-uniqueness validation, producing 20K+ verified trajectories at >85% accuracy.
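The cascaded verifier can be pictured as a short-circuiting filter chain: each candidate QA pair must pass every stage in order, and rejection at any stage stops the cascade. The sketch below is purely illustrative; the stage names follow the caption, but every check body is a toy stand-in (the real pipeline calls an LLM judge, a retrieval API, and agent rollouts at these steps), and all identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class QAItem:
    """A candidate question-answer pair awaiting verification (illustrative)."""
    question: str
    answer: str

# A stage is (name, predicate). Real predicates would be LLM/search/agent calls.
Stage = Tuple[str, Callable[[QAItem], bool]]

def run_cascade(item: QAItem, stages: List[Stage]) -> Optional[str]:
    """Return the name of the first failing stage, or None if the item is fully verified."""
    for name, check in stages:
        if not check(item):
            return name  # short-circuit: later (more expensive) stages never run
    return None

# Toy checks that only illustrate the ordering and short-circuit structure.
stages: List[Stage] = [
    ("difficulty",    lambda it: len(it.question) > 20),  # LLM difficulty check
    ("graph_align",   lambda it: "?" in it.question),     # QA-graph alignment
    ("retrieval",     lambda it: bool(it.answer)),        # Google retrieval verification
    ("hallucination", lambda it: it.answer != "unknown"), # hallucination detection
    ("rollout",       lambda it: True),                   # agent rollout confirmation
    ("uniqueness",    lambda it: True),                   # answer-uniqueness validation
]

good = QAItem("Which 19th-century chemist first isolated this element?", "Humphry Davy")
bad = QAItem("Who?", "unknown")
print(run_cascade(good, stages))  # None -> fully verified
print(run_cascade(bad, stages))   # 'difficulty' -> rejected at the first filter
```

Ordering cheap checks first means most low-quality candidates are discarded before the expensive retrieval and agent-rollout stages ever run, which is the point of a cascade.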
Comparison between REDSearcher and proprietary / open-source agentic models
| Model | BrowseComp | BrowseComp-zh | GAIA | HLE | Overall |
|---|---|---|---|---|---|
| Proprietary Deep Research Agents | |||||
| Seed-1.8 | 67.6 | 81.3 | 87.4 | 40.9 | 69.3 |
| Gemini-2.5-Pro-DR | 7.6 | 27.3 | - | - | - |
| Gemini-3-Pro | 37.8 | 51.6 | 74.8 | 45.8 | 52.5 |
| Claude-4.5-Sonnet | 24.1 | 42.4 | 66.0 | 32.0 | 41.1 |
| OpenAI-o3 | 49.7 | 58.1 | 70.5 | 20.2 | 49.6 |
| GPT-5-Thinking-high | 54.9 | 63.0 | 76.7 | 41.7 | 59.1 |
| GPT-5.2-Thinking-xhigh | 65.8 | 76.1 | - | - | - |
| Open-source Deep Research Agents | |||||
| Kimi-K2.5-Agent | 60.6 / 74.9* | - | - | 50.2 | - |
| GLM-4.7 | 52.0 / 66.6* | - / 67.5* | - | 42.8 | - |
| DeepSeek-V3.2 | 51.4 / 67.6* | - / 65.0* | - | 40.8 | - |
| LongCat-Flash-Thinking | 56.6 / 73.1* | 69.0 / 77.7* | - | - | - |
| Open-source 30B-A3B Agents | |||||
| WebResearcher-30B | 37.3 | 45.2 | - | 28.8 | - |
| WebSailorV2-30B | 35.3 | 44.1 | 74.1 | 30.6 | 46.0 |
| Tongyi DeepResearch-30B | 43.4 | 46.7 | 70.9 | 32.9 | 48.5 |
| GLM-4.7-Flash | 42.8 | - | - | - | - |
| REDSearcher | 42.1 / 57.4* | 49.8 / 58.2* | 80.1 | 33.3 | 51.3 |
* Results with Context Management (CM). Best results in bold.
Main results on multimodal search benchmarks
| Model | MM-BC | BC-VL | MMS+ | MMS | LiveVQA | HLE-T | HLE-VL | BC | BC-ZH |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary Deep Research Agents | |||||||||
| Gemini-2.5-Flash | 5.6 | 44.6 | 19.9 | 64.0 | 73.0 | - | - | - | - |
| Gemini-2.5-Pro | 7.1 | 49.9 | 22.2 | 69.0 | 76.0 | - | - | 7.6 | 27.3 |
| Seed-1.8 | 46.3 | - | - | - | - | 40.9 | 31.5 | 67.6 | 81.3 |
| Seed-1.8† | 21.4 | 54.1 | 11.0 | 69.7 | 62.4 | - | - | - | - |
| GPT-5 | - | 46.1 | 17.2 | 63.7 | 73.3 | 41.7 | - | 54.9 | 63.0 |
| Gemini-3-Pro† | 28.5 | 56.4 | 38.1 | 73.0 | 79.9 | 45.8* | 36.0* | 37.8* | 51.6* |
| Multimodal Agent Flow | |||||||||
| Qwen2.5-VL | 1.8 | 10.2 | - | 29.2 | 35.7 | - | 4.9 | - | - |
| Qwen3-VL (30B) | 10.7 | 37.1 | 11.0 | 59.7 | 64.8 | 8.8 | 8.7 | 0.2 | 7.2 |
| Qwen3-VL (235B) | 12.1 | 43.1 | 17.4 | 63.3 | 70.2 | 14.5 | 14.1 | 0.3 | 18.6 |
| Multimodal Deep Research Agents | |||||||||
| MMSearch-R1 | - | - | - | 53.8 | 48.4 | - | - | - | - |
| WebWatcher | - | 27.0 | - | 55.3 | 58.7 | - | 13.6 | - | - |
| DeepEyesV2 | - | - | - | 63.7 | - | - | - | - | - |
| Vision-DeepResearch | - | 53.7 | 28.5 | 69.6 | 77.6 | - | - | - | - |
| REDSearcher-MM-SFT | 25.3 | 55.3 | 20.2 | 70.3 | 78.5 | 24.4 | 24.2 | 30.1 | 43.1 |
| REDSearcher-MM-RL | 23.5 | 57.2 | 26.6 | 72.9 | 79.3 | 25.3 | 25.6 | 31.2 | 44.5 |
† denotes results evaluated with the same evaluation tools as ours; * denotes results taken from the original papers. Benchmark abbreviations: MM-BC = MM-BrowseComp, BC-VL = BrowseComp-VL, MMS+ = MMSearch-Plus, MMS = MMSearch, HLE-T = HLE (text), HLE-VL = HLE (vision), BC = BrowseComp, BC-ZH = BrowseComp-ZH.
@article{redsearcher2026,
  title={REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents},
  author={Zheng Chu and Xiao Wang and Jack Hong and Huiming Fan and Yuqi Huang and Yue Yang and Guohai Xu and Shengchao Hu and Dongdong Kuang and Chenxiao Zhao and Cheng Xiang and Ming Liu and Bing Qin and Xing Yu},
  journal={arXiv preprint arXiv:2602.14234},
  url={https://arxiv.org/pdf/2602.14234},
  year={2026}
}