Agentic orchestration with recursive skill evolution

SkillFlow

Flow-Driven Recursive Skill Evolution for Agentic Orchestration

A trainable Supervisor learns reward-proportional trajectories, reads hindsight credit from flow, and turns those signals into a self-evolving skill library.


Figure: SkillFlow teaser showing near-uniform flow before training and reward-matching flow after TTB training.

Rollout DAG: Before training, flow is nearly uniform across possible skill paths.

TTB residual: \(\Delta(\tau)\) pushes probability mass toward reward-proportional flow.

Hindsight credit: \(I(t)=P_F/P_B\) marks decisions whose value becomes clear after execution.

Skill evolution: \(\hat F(s)\), \(G(s)\), and \(\log I(t)\) drive retain / refine / prune / create.

1. Roll out paths
2. Fit TTB
3. Read \(I(t)\)
4. Evolve skills

\( \pi^{*}(\tau\mid q)\propto\widetilde R(\tau)^\beta,\quad I(t)=P_F/P_B \)

Reward-matched sampling

\( \pi^{*}(\tau\mid q) \propto \widetilde{R}(\tau)^{\beta} \)

TTB keeps multiple high-reward orchestration paths alive instead of collapsing to one route.
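To make the target concrete, here is a minimal sketch that normalizes \(\widetilde R(\tau)^\beta\) over a small, finite set of candidate trajectories. The function name `reward_matched_target` and the three reward values are illustrative assumptions, not taken from the paper; real trajectory spaces are far too large to enumerate this way.

```python
import math

def reward_matched_target(rewards, beta=1.0):
    """Return probabilities proportional to reward**beta (rewards must be > 0)."""
    # Work in log space and subtract the max for numerical stability.
    logits = [beta * math.log(r) for r in rewards]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical smoothed rewards for three orchestration paths.
probs = reward_matched_target([0.9, 0.8, 0.1], beta=2.0)
for reward, p in zip([0.9, 0.8, 0.1], probs):
    print(f"reward={reward:.1f}  target probability={p:.3f}")
```

The two strong paths both keep substantial mass, with their probability ratio equal to \((0.9/0.8)^\beta\), while the weak path is suppressed. Raising \(\beta\) sharpens the target toward the best path; lowering it flattens the target.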

Tempered trajectory balance

\( \mathcal{L}_{\mathrm{TTB}}(\tau)=\left(\Delta(\tau)/T\right)^2 \)

The residual aligns forward path probability, hindsight backward probability, and terminal reward.
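The page does not spell out \(\Delta(\tau)\), so the sketch below assumes the standard trajectory-balance form: \(\Delta(\tau)=\log Z_\theta+\sum_t \log P_F-\sum_t \log P_B-\beta\log\widetilde R(\tau)\). That form, the function name `ttb_loss`, and the toy numbers are assumptions for illustration; the paper's exact residual may differ.

```python
import math

def ttb_loss(log_z, log_pf, log_pb, reward, beta=1.0, temperature=1.0):
    """Squared, temperature-scaled residual for a single trajectory."""
    # ASSUMPTION: standard trajectory-balance residual, not confirmed by the source.
    delta = log_z + sum(log_pf) - sum(log_pb) - beta * math.log(reward)
    return (delta / temperature) ** 2

# Two-step trajectory: forward probability 0.5 per step, deterministic backward policy.
log_pf = [math.log(0.5), math.log(0.5)]
log_pb = [0.0, 0.0]

# Forward path probability 0.25 matches reward 0.25 with Z = 1, so the residual vanishes.
balanced = ttb_loss(0.0, log_pf, log_pb, reward=0.25)

# A reward of 0.5 leaves a residual of log(0.5); T = 2 halves it before squaring.
off_balance = ttb_loss(0.0, log_pf, log_pb, reward=0.5, temperature=2.0)
print(balanced, off_balance)
```

Under this assumed form, the temperature \(T\) only rescales the residual, which damps the gradient from trajectories that are far off balance.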

Flow credit

\( I(t)=\frac{P_F(H_t\mid H_{t-1})}{P_B(H_{t-1}\mid H_t)} \)

Step importance and skill marginal flow tell the library what to retain, refine, prune, or create.
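As a small illustration of the ratio above, the sketch below computes \(\log I(t)\) per step from forward and backward log-probabilities and ranks steps by how far the ratio departs from 1. The helper name, the probabilities, and the ranking by absolute value are assumptions made for the example; the page does not say how the sign of \(\log I(t)\) is interpreted.

```python
import math

def log_step_importance(log_pf, log_pb):
    """Per-step log I(t) = log P_F(H_t | H_{t-1}) - log P_B(H_{t-1} | H_t)."""
    if len(log_pf) != len(log_pb):
        raise ValueError("forward and backward sequences must align step by step")
    return [lf - lb for lf, lb in zip(log_pf, log_pb)]

# Hypothetical three-step trajectory.
log_pf = [math.log(0.5), math.log(0.9), math.log(0.2)]
log_pb = [math.log(0.5), math.log(0.3), math.log(0.8)]

log_i = log_step_importance(log_pf, log_pb)

# Steps ordered by how strongly forward and hindsight views disagree.
ranked = sorted(range(len(log_i)), key=lambda t: abs(log_i[t]), reverse=True)
print(log_i, ranked)
```

Because both quantities are already produced when fitting the flow objective, reading this ratio needs no extra rollouts.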

Evidence path

The interface follows the paper’s evidence path.

The visible blocks map, in order, to the paper’s action-DAG model, TTB objective, zero-cost credit signal, and phase-boundary skill curation.

Action-level DAG

Each node is an interaction history \(H_t\); each edge is one orchestration action.

\(a_t\in\{\texttt{skill},\texttt{act},\texttt{accept}\}\)
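One way to picture this structure is as immutable history nodes extended by typed actions. The class names `Action` and `History`, the `payload` field, and the example skill name are hypothetical; this is a sketch of the data shape, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Literal, Tuple

ActionKind = Literal["skill", "act", "accept"]

@dataclass(frozen=True)
class Action:
    kind: ActionKind
    payload: str = ""  # hypothetical: a skill name or an executor instruction

@dataclass(frozen=True)
class History:
    """A node H_t: the query plus every action taken so far."""
    query: str
    actions: Tuple[Action, ...] = ()

    def is_terminal(self) -> bool:
        # A trajectory ends once the Supervisor accepts an answer.
        return bool(self.actions) and self.actions[-1].kind == "accept"

    def step(self, action: Action) -> "History":
        """Follow one edge: return the child node H_{t+1}."""
        if self.is_terminal():
            raise ValueError("cannot extend a history after accept")
        return History(self.query, self.actions + (action,))

h0 = History("Who wrote the novel?")
h1 = h0.step(Action("skill", "decompose-question"))
h2 = h1.step(Action("act", "search for the author"))
h3 = h2.step(Action("accept"))
print(len(h3.actions), h3.is_terminal())
```

Frozen dataclasses are hashable, so identical histories compare equal and can serve as dictionary keys when accumulating per-node statistics.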

Reward-proportional flow

TTB preserves multiple successful trajectories instead of reinforcing only one mode.

\(\Delta(\tau)/T\rightarrow 0\)

Hindsight credit

The backward policy reads execution feedback and turns the same loss into per-step credit.

\(I(t)=P_F/P_B\)

Recursive library update

Flow diagnostics decide when to evolve and which atomic tips to retain or rewrite.

\(\mathcal S^{(k+1)}=\Phi(\mathcal S^{(k)};\cdots)\)

Overview

A cleaner way to train agent orchestration.

SkillFlow shifts orchestration from static workflow selection to flow-trained decision making, so diverse high-reward trajectories remain visible and actionable.

Before

Terminal rewards hide the route.

Direct reward maximization can collapse several valid strategies into one mode, while leaving step-level credit and library updates under-specified.

SkillFlow

Flow turns trajectories into signals.

Tempered Trajectory Balance connects terminal reward, backward credit, and skill marginal flow inside one training-and-evolution loop.

Method

The paper’s core idea, arranged as the system actually runs.

Chapter 1

Why SkillFlow changes orchestration.

The framework comparison moves from brittle routing, to reward-only learning, to flow-driven orchestration that keeps multiple successful strategies alive.

Heuristic

A fixed route cannot hear feedback.

Handcrafted orchestration searches and executes a pre-defined path, so terminal quality does not reshape future skill selection.

RL-based

Reward-only learning collapses modes.

Result-driven updates can over-concentrate on one path, losing alternative high-reward trajectories that should remain available.

SkillFlow

Flow keeps diversity actionable.

TTB residual flow, skill marginal flow, and step importance turn trajectory evidence into retain, refine, prune, and generate decisions.

Chapter 2

How the training loop produces reusable signals.

The system diagram shows one pass through Supervisor rollouts, TTB training, hindsight credit, and phase-boundary skill-library updates.

Figure: SkillFlow architecture, showing Supervisor and Executor rollouts, TTB training, the hindsight backward policy, and phase-boundary recursive skill curation.

Rollout

Generate reward-proportional trajectories.

The Supervisor samples skill, act, and accept actions while keeping high-reward paths likely rather than forcing a single route.

\( \pi^{*}(\tau\mid q)\propto\widetilde R(\tau)^\beta \)

TTB + credit

Fit forward and hindsight backward flow.

The TTB residual aligns forward probability, backward hindsight probability, and terminal reward; their ratio yields step credit.

\( I(t)=P_F(H_t\mid H_{t-1})/P_B(H_{t-1}\mid H_t) \)

Evolution

Update the skill library at phase boundaries.

Skill marginal flow and high-importance gaps decide which skills to retain, refine, prune, or generate for the next phase.

\( \mathcal S^{(k+1)}=\Phi(\mathcal S^{(k)};\{G,\widetilde\Lambda\},\{\log I\}) \)
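The update \(\Phi\) can be pictured as a rule that maps per-skill diagnostics to retain, refine, prune, or generate decisions. The sketch below is a toy version under loud assumptions: the function `curate`, its threshold parameters, the precomputed `flow` and `jensen_gap` dictionaries, and the skill names are all hypothetical, and the paper's actual \(\Phi\) is not specified on this page.

```python
def curate(library, flow, jensen_gap, gap_steps,
           min_flow=0.05, max_gap=0.5, min_log_i=1.0):
    """Toy stand-in for Phi; every threshold here is a hypothetical value."""
    decisions = {}
    for skill in library:
        if flow.get(skill, 0.0) < min_flow:
            decisions[skill] = "prune"    # attracts almost no flow
        elif jensen_gap.get(skill, 0.0) > max_gap:
            decisions[skill] = "refine"   # useful but unstable across contexts
        else:
            decisions[skill] = "retain"
    # High-importance steps that no skill covered become candidates for new skills.
    created = [context for context, log_i in gap_steps if log_i > min_log_i]
    next_library = [s for s in library if decisions[s] != "prune"]
    return decisions, created, next_library

library = ["decompose", "verify", "cite"]
flow = {"decompose": 0.60, "verify": 0.30, "cite": 0.01}
jensen_gap = {"decompose": 0.1, "verify": 0.9}
gap_steps = [("retry after empty search", 1.8), ("format answer", 0.2)]

decisions, created, next_library = curate(library, flow, jensen_gap, gap_steps)
print(decisions, created, next_library)
```

In this toy run the low-flow skill is pruned, the unstable one is flagged for refinement, and one uncovered high-importance step is proposed as a new skill.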

Flow diagnostics

Skill curation is derived from training signals.

At phase boundaries, SkillFlow reads the TTB residual floor, step importance, skill marginal flow, and CGF stability diagnostics already computed by the loss.

When: Residual plateau

\(\overline{\Delta^2}\) stops falling once it reaches the current library’s floor.

Where: High-importance gaps

\(\log I(t)\) localizes decisions whose value appears only after execution.

What: Skill marginal flow

\(\hat F(s)\) ranks skills by how much reward-matching probability mass they attract.

Stability: CGF / Jensen gap

\(\Lambda_1^{(s)}-G(s)\) separates reliable tips from context-unstable ones.
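The page does not define \(\hat F(s)\), \(\Lambda_1^{(s)}\), or \(G(s)\), so the following sketch picks one plausible reading and should be treated as an assumption throughout: CGF is read as cumulant generating function, \(\Lambda_1^{(s)}\) as the log of the mean reward over rollouts that used skill \(s\), \(G(s)\) as the mean log-reward over the same rollouts, and \(\hat F(s)\) as the reward-weighted share of rollouts that used \(s\). Under that reading the gap is non-negative by Jensen's inequality and is zero when a skill's reward is constant across contexts.

```python
import math
from collections import defaultdict

def skill_diagnostics(rollouts, beta=1.0):
    """rollouts: list of (skills_used, reward) pairs with reward > 0."""
    total = sum(reward ** beta for _, reward in rollouts)
    by_skill = defaultdict(list)
    for skills, reward in rollouts:
        for skill in set(skills):  # count each skill once per rollout
            by_skill[skill].append(reward)

    out = {}
    for skill, rewards in by_skill.items():
        n = len(rewards)
        flow = sum(r ** beta for r in rewards) / total   # assumed F-hat(s)
        g = sum(math.log(r) for r in rewards) / n        # assumed G(s): mean log-reward
        lam1 = math.log(sum(rewards) / n)                # assumed Lambda_1: log mean reward
        out[skill] = {"flow": flow, "jensen_gap": lam1 - g}
    return out

# Hypothetical rollouts: skill "a" is steady, skill "b" swings between contexts.
rollouts = [(["a"], 0.8), (["a"], 0.8), (["b"], 0.9), (["b"], 0.1)]
d = skill_diagnostics(rollouts)
print(d)
```

Here skill "a" shows a gap of zero while skill "b" shows a gap above 0.5, which is the kind of separation between reliable and context-unstable tips that the diagnostic is meant to provide.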

Bootstrap: Empty library

\(Z_\theta\) adjusts while the base Supervisor explores.

Emergence: First plateau

\(\Phi\) creates candidate atomic tips from success/failure pairs.

Maturity: Boom and prune

\(\hat F(s)\) removes low-flow tips and refines unstable ones.

Steady state: Compact library

Flow entropy remains high while the active skill set stabilizes.
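"Flow entropy" is not defined on this page; the sketch below assumes it means the Shannon entropy of the probability the Supervisor assigns to a finite set of paths. The three distributions are made-up examples chosen to contrast a collapsed policy with one that keeps two strong routes alive.

```python
import math

def flow_entropy(probs):
    """Shannon entropy (in nats) of a distribution over paths."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]      # before training: no preference
two_modes = [0.45, 0.45, 0.05, 0.05]    # two high-reward routes kept alive
collapsed = [0.97, 0.01, 0.01, 0.01]    # a single dominant route

for name, dist in [("uniform", uniform), ("two_modes", two_modes), ("collapsed", collapsed)]:
    print(f"{name}: {flow_entropy(dist):.3f} nats")
```

A policy that concentrates on two good routes keeps far more entropy than one that collapses to a single route, even though both have moved away from the uniform starting point.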

Training loop

A continuous loop from rollout to skill evolution.

Roll out: Supervisor selects skill, act, or accept actions.
Fit flow: TTB aligns path probabilities with smoothed terminal reward.
Read credit: Forward/backward ratios identify decisive steps.
Evolve: The library is updated when the flow objective plateaus.
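The plateau trigger in the last step can be sketched as a comparison of two adjacent windows of the mean squared residual. The function name `has_plateaued`, the window size, and the relative tolerance are hypothetical choices for illustration; the paper's actual criterion is not given on this page.

```python
def has_plateaued(residuals, window=5, rel_tol=0.01):
    """True when the windowed mean residual improved by less than rel_tol."""
    if len(residuals) < 2 * window:
        return False  # not enough history to compare two windows
    previous = sum(residuals[-2 * window:-window]) / window
    current = sum(residuals[-window:]) / window
    return previous - current < rel_tol * max(previous, 1e-12)

# Hypothetical traces of the mean squared TTB residual per training step.
still_falling = [0.8 ** i for i in range(10)]
flat = [0.2] * 10

print(has_plateaued(still_falling))  # keep training on the current library
print(has_plateaued(flat))           # phase boundary: evolve the skill library
```

Checking only at plateaus keeps library updates at phase boundaries rather than after every rollout.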

Results

Benchmark evidence kept compact after the method.

The tabbed metrics mirror the paper’s IID, OOD, and mechanism analysis while keeping the full tables in the PDF.

Headline metrics (values are in the paper’s tables): HotpotQA EM, TriviaQA EM, AIME 2026 Acc., WebShop SR, ALFWorld SR, SWE-bench Resolved.

Figure story

The figure layout follows the paper’s evidence path.

Performance comparison across task families and model backbones: across QA, math, code, and interactive tasks, SkillFlow raises both average and category-level scores over the base supervisor.

Pass at K curve comparing strategy diversity: reward-proportional sampling preserves useful diversity.

Reward curve during transfer analysis: transfer reward rises as orchestration skills are reused and refined.

Token and time cost comparison: lower token and time cost keeps the learned flow practical.

Skill evolution events aligned with TTB plateaus: skill evolution events appear at phase boundaries instead of every rollout.

Reward curves comparing model backbones: reward dynamics stay stable across different backbone supervisors.

Loss curve comparing transfer models: TTB loss drops quickly, then settles into a low-variance regime.

Evaluation scope

Four task families, one orchestration recipe.

Question answering

HotpotQA, TriviaQA, MuSiQue, and NQ-Open test multi-hop retrieval and answer formation.

Reasoning

AIME 2026, MedQA, MATH-Hard, and GPQA Diamond test mathematical, medical, and scientific reasoning.

Code generation

SWE-bench and HumanEval measure executable code repair and synthesis behavior.

Interactive decision making

WebShop, ALFWorld, ScienceWorld, and Mind2Web evaluate long-horizon tool and environment interaction.

Citation

BibTeX

@article{zhang2026skillflow,
  title={SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration},
  author={Zhang, Mingda and Luo, Haoran and Liu, Wenjin and Shen, Tiesunlong and Xiao, Zikai and Cambria, Erik and Tang, Xiaoying},
  journal={arXiv preprint arXiv:2605.14089},
  year={2026}
}