DeepRubric: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents

Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He

DeepRubric builds query-rubric supervision from retrieved evidence before reinforcement learning. Instead of starting from a user query and asking a model to invent a rubric, DeepRubric first constructs evidence trees from local Wikipedia and OpenScholar corpora, then synthesizes aligned research queries and grounded factual/logical rubric criteria from selected evidence leaves.

The release exposes the full reproduction loop: deploy retrievers, construct evidence trees, verify KEEP/REVISE/DROP samples, convert verified data to verl-tool parquet format, and start GRPO training for a tool-using deep research agent.

Query-first versus evidence-first rubric construction.
Query-first pipelines infer rubrics from a given query; DeepRubric builds an evidence tree first and derives both the query and rubric from the same structure.
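To make the evidence-first ordering concrete, here is a minimal sketch of an evidence tree whose selected leaves feed both the query and the rubric. The class names (`EvidenceNode`), the helper `build_example`, and the string templates are hypothetical and are not taken from the release. In DeepRubric the query and criteria are synthesized by a model, which this sketch replaces with simple string templates.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EvidenceNode:
    """One node of a hypothetical evidence tree (names are illustrative)."""
    text: str                      # retrieved passage or a short summary of it
    source: str                    # e.g. "wikipedia" or "openscholar"
    children: List["EvidenceNode"] = field(default_factory=list)

    def leaves(self) -> List["EvidenceNode"]:
        """Return all leaf nodes under this node, left to right."""
        if not self.children:
            return [self]
        found: List["EvidenceNode"] = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def build_example(selected_leaves: List[EvidenceNode]) -> Dict:
    """Derive a query and a rubric from the SAME selected leaves.

    The templates below are placeholders for model-based synthesis;
    they only show that both outputs share one evidence set.
    """
    evidence = [leaf.text for leaf in selected_leaves]
    query = "Explain and connect the following findings: " + "; ".join(evidence)
    rubric = [
        {"kind": "factual", "statement": f"The answer states: {text}", "evidence": text}
        for text in evidence
    ]
    return {"query": query, "rubric": rubric, "evidence": evidence}
```

Because the query and each criterion point back to the same leaves, every criterion in this sketch is traceable to a specific piece of retrieved evidence, which is the property the evidence-first design is meant to provide.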

Method Overview

DeepRubric method overview.
DeepRubric first constructs evidence trees, then co-generates evidence-grounded query-rubric pairs, and finally trains a tool-using research agent with rubric rewards.

The same retriever endpoints are used in data construction and training. This keeps the reward signal, tool observations, and generated tasks aligned with the evidence available to the final agent.
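The rubric reward used in training can be pictured with a short sketch. It assumes that a judge has already marked each rubric criterion as satisfied or not, that the reward is the weighted fraction of satisfied criteria, and that advantages are normalized within a group of rollouts for the same query, as in standard GRPO. The function names, the equal default weights, and the use of the population standard deviation are assumptions for illustration and may differ from the release.

```python
from statistics import mean, pstdev
from typing import List, Optional


def rubric_reward(satisfied: List[bool], weights: Optional[List[float]] = None) -> float:
    """Weighted fraction of rubric criteria that a response satisfies (assumed form)."""
    if not satisfied:
        return 0.0
    if weights is None:
        weights = [1.0] * len(satisfied)   # assumption: equal weight per criterion
    total = sum(weights)
    return sum(w for ok, w in zip(satisfied, weights) if ok) / total


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)                # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a response that satisfies three of four equally weighted criteria receives a reward of 0.75, and a group in which every rollout earns the same reward yields zero advantage for all of them.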

Data Construction

9,838 generated evidence trees
9,064 retained query-rubric pairs
7.0 rubric criteria per example
5.0 selected leaves per retained example
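As a quick worked calculation from these figures, the snippet below derives the retention rate and approximate dataset totals. It assumes each retained query-rubric pair originates from one generated evidence tree, which the page does not state explicitly, and it treats the per-example averages as exact, so the totals are approximations.

```python
generated_trees = 9838
retained_pairs = 9064

# Assumption: one candidate query-rubric pair per generated tree.
retention_rate = retained_pairs / generated_trees
not_retained = generated_trees - retained_pairs

# Averages reported on the page, used here as if exact.
approx_total_criteria = retained_pairs * 7.0
approx_total_leaves = retained_pairs * 5.0

print(f"retention rate: {retention_rate:.1%}")               # 92.1%
print(f"not retained:   {not_retained}")                     # 774
print(f"criteria (approx.): {approx_total_criteria:,.0f}")   # 63,448
print(f"leaves (approx.):   {approx_total_leaves:,.0f}")     # 45,320
```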

DeepRubric verifies synthesized examples with a separate audit step. Samples are either kept, revised with evidence-grounded criteria, or dropped before conversion into training data.

Training Query Distribution

Training query distribution projected with BGE-large embeddings and t-SNE.
Semantic coverage of generated training queries, visualized from BGE-large embeddings with t-SNE and density contours.

Results

Model                      SQAv2   ResearchQA   DRB    Avg.   RL compute
Qwen3-8B + Search          57.2    46.3         18.2   40.6   0h
DR Tulu-8B (1900-step)     86.8    74.3         43.4   68.2   ~9700h
DeepRubric-8B (140-step)   86.0    75.2         43.6   68.3   ~750h
Training step efficiency comparison between DeepRubric and baselines.
DeepRubric-8B matches DR Tulu-8B on average score (68.3 vs. 68.2) with 140 GRPO update steps instead of 1,900, and with roughly 750h of RL compute instead of roughly 9,700h.
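The efficiency gap can be checked with simple arithmetic on the reported numbers. The compute figures are marked as approximate in the results, so the compute ratio below is approximate as well.

```python
# Reported values from the results table.
dr_tulu_steps, deeprubric_steps = 1900, 140
dr_tulu_hours, deeprubric_hours = 9700, 750   # both reported as approximate

step_ratio = dr_tulu_steps / deeprubric_steps
compute_ratio = dr_tulu_hours / deeprubric_hours

# Averages recomputed from the three benchmark columns (SQAv2, ResearchQA, DRB).
dr_tulu_avg = (86.8 + 74.3 + 43.4) / 3
deeprubric_avg = (86.0 + 75.2 + 43.6) / 3

print(f"step ratio:    {step_ratio:.1f}x")      # 13.6x
print(f"compute ratio: {compute_ratio:.1f}x")   # 12.9x
print(f"averages: {dr_tulu_avg:.1f} vs {deeprubric_avg:.1f}")   # 68.2 vs 68.3
```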

Reproduction

The code release includes wrappers for retriever deployment, data construction, data conversion, and verl-tool training. Large corpora, indexes, generated datasets, model checkpoints, logs, and private service configs are intentionally excluded.

START_RETRIEVERS=1 bash scripts/run_pipeline.sh

BibTeX

@misc{deeprubric2026,
  title  = {DeepRubric: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents},
  author = {Zhu, Minghang and Wei, Chuyang and Xu, Junhao and Cheng, Yilin and Chen, Zhumin and He, Jiyan},
  year   = {2026}
}