ICLR 2026

StepORLM

A self-evolving framework with generative process supervision for operations research language models.

Chenyu Zhou · Tianyi Xu · Jianghao Lin† · Dongdong Ge
SJTU IIC · Institute of Intelligent Computing, Shanghai Jiao Tong University
† Corresponding author: Jianghao Lin
Co-evolutionary loop of StepORLM

Co-evolutionary loop: policy + GenPRM improve each other.

Abstract

From myopic supervision to holistic, solver-grounded process evaluation.

Large Language Models (LLMs) have shown promise for Operations Research (OR), yet outcome-only rewards struggle with credit assignment, and discriminative step-wise supervision is often myopic. We introduce StepORLM, a self-evolving framework with generative process supervision. A policy model and a Generative Process Reward Model (GenPRM) co-evolve in a dual-feedback loop that combines solver-based outcome verification with holistic trajectory-level critique.

This combined signal aligns the policy via Weighted DPO and simultaneously refines the GenPRM. The resulting 8B StepORLM achieves state-of-the-art performance across six OR benchmarks, and the co-evolved GenPRM generalizes as a universal verifier, lifting inference-time scaling for both StepORLM and other OR LMs.
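The Weighted DPO alignment step can be viewed as an ordinary DPO loss whose preference pairs are scaled by a per-pair weight, e.g. derived from GenPRM's critique. A minimal pure-Python sketch under that assumption; the function name and interface are illustrative, not the paper's implementation:

```python
import math

def weighted_dpo_loss(policy_chosen, policy_rejected,
                      ref_chosen, ref_rejected,
                      weights, beta=0.1):
    """DPO loss with a per-pair weight (hypothetical interface).

    Each list holds one summed log-probability per trajectory; `weights`
    scales each preference pair's contribution (e.g. a GenPRM score).
    """
    total = 0.0
    for pc, pr, rc, rr, w in zip(policy_chosen, policy_rejected,
                                 ref_chosen, ref_rejected, weights):
        margin = beta * ((pc - pr) - (rc - rr))
        # -log(sigmoid(margin)) = log(1 + exp(-margin))
        total += w * math.log1p(math.exp(-margin))
    return total / len(weights)
```

With all weights set to 1 this reduces to plain DPO; down-weighting a pair shrinks its gradient contribution, which is how a process-level critique can soften noisy outcome labels.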

Why It Pops

Four signature ideas that make StepORLM stand out.

Generative Process Supervision

GenPRM evaluates entire trajectories, capturing long-range dependencies that step-wise PRMs miss.

Dual-Feedback Loop

Solver-verified outcomes plus holistic process critique drive stable, grounded improvements.

Self-Evolution

Policy and GenPRM co-evolve across iterations, steadily lifting performance on hard OR tasks.

Universal Verifier

GenPRM boosts ORLM by +10.0 avg Pass@1, proving strong cross-model generalization.

Framework at a Glance

Two stages. One co-evolutionary loop.

1

Warm-Up SFT

50K solver-verified OR problems with step-level reasoning train the initial policy. Data is synthesized and validated via external solvers.

2

Iterative Co-Evolution

Credit assignment vs myopia in OR supervision

Why outcome-only and step-wise supervision fail on long-horizon OR reasoning.

Self-evolving process overview

The self-evolving loop strengthens policy reasoning and GenPRM critiques.
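In schematic form, one iteration of the loop samples candidate trajectories per problem, ranks them by solver outcome and then by GenPRM critique, and updates both models from the resulting preference pairs. A minimal skeleton in which plain callables stand in for the real policy, solver, and GenPRM (a hypothetical interface, not the paper's API):

```python
def co_evolve(sample_fn, solver_verify, critique, update_policy, update_prm,
              problems, n_iters=3, k=4):
    # Skeleton of the dual-feedback loop; all callables are stand-ins.
    pairs = []
    for _ in range(n_iters):
        pairs = []
        for prob in problems:
            trajs = [sample_fn(prob) for _ in range(k)]
            # Rank by solver outcome first, holistic critique second.
            ranked = sorted(trajs, key=lambda t: (solver_verify(prob, t),
                                                  critique(prob, t)))
            # Best trajectory is "chosen", worst is "rejected".
            pairs.append((ranked[-1], ranked[0]))
        update_policy(pairs)   # e.g. a Weighted DPO step
        update_prm(pairs)      # refine the process reward model
    return pairs
```

The point of the structure is the dual feedback: the solver grounds the ranking in verifiable outcomes, while the critique breaks ties among outcome-equivalent trajectories.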

Case Study: TSP Alignment Failure

A concrete ComplexLP example where a constraint error does not change the optimal value.

ComplexLP Problem 74 tour guide TSP visualization

ComplexLP Problem 74: bus tour across five cities (simplified visualization).

This example highlights the alignment gap between outcome-only supervision and process-correct modeling. Even when the solver returns the correct optimal objective, a subtle constraint mistake can survive because it does not change the final value for this instance.

StepORLM’s generative process supervision explicitly critiques the full reasoning trajectory, allowing the model to identify structural modeling errors that outcome rewards alone would miss. This is the motivation for holistic, trajectory-level verification in OR tasks.
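The failure mode is easy to reproduce with a toy integer program: a wrong variable bound is invisible on one instance because a different constraint binds, yet it changes the optimum once that constraint is relaxed. A brute-force sketch on an illustrative instance (not Problem 74 itself):

```python
from itertools import product

def solve(x_max, y_max, sum_max):
    # Brute-force a tiny integer program: maximize x + y over the box
    # 0 <= x <= x_max, 0 <= y <= y_max, subject to x + y <= sum_max.
    best = 0
    for x, y in product(range(x_max + 1), range(y_max + 1)):
        if x + y <= sum_max:
            best = max(best, x + y)
    return best

# Instance A: the wrong bound x <= 5 is masked because x + y <= 4 binds.
correct_a = solve(x_max=3, y_max=3, sum_max=4)   # 4
wrong_a   = solve(x_max=5, y_max=3, sum_max=4)   # 4: same objective, model still wrong

# Instance B: relaxing the coupling constraint exposes the error.
correct_b = solve(x_max=3, y_max=3, sum_max=8)   # 6
wrong_b   = solve(x_max=5, y_max=3, sum_max=8)   # 8: objectives now diverge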

Main Results

Pass@1 accuracy (%) across six OR benchmarks. All agentic baselines use GPT-4o as backbone.

Model Params NL4OPT MAMO-EasyLP MAMO-ComplexLP NLP4LP CompOR IndOR ReSocratic Avg.
Zero-shot LLMs
OpenAI o3 Closed 78.4 93.9 63.1 93.8 72.2 76.2 84.4 80.3
Gemini-2.5-Pro Closed 82.6 87.9 52.3 94.9 66.7 78.6 89.6 78.9
GPT-4o Closed 61.2 70.3 57.7 73.6 42.9 38.1 48.4 56.0
Kimi-K2 1T 77.9 93.4 55.9 84.3 61.1 59.5 81.9 73.4
DeepSeek-R1 671B 77.5 90.3 59.5 87.6 66.7 59.5 83.9 75.0
DeepSeek-V3 671B 79.8 95.2 53.2 92.1 55.6 66.7 85.1 75.4
Qwen2.5-72B-Inst 72B 78.9 95.8 44.1 88.2 50.0 57.1 81.1 70.7
Qwen3-32B 32B 77.5 92.3 46.9 93.8 50.0 61.9 85.1 72.5
Qwen3-8B 8B 63.8 73.6 45.0 58.4 27.8 52.4 61.0 54.6
Fine-tuned LLMs
ORLM 8B 73.8 90.4 59.5 76.4 50.0 42.9 61.8 65.0
LLMOPT 14B 75.1 83.5 67.6 86.0 22.2 52.4 73.2 65.7
OptMATH (origin) 32B 95.9* 89.9* 54.1* - - - - -
StepORLM 8B 97.7 97.2 79.3 97.8 55.6 59.5 82.6 81.4
Agentic Methods
OptiMUS-v0.3 Closed 76.2 78.0 46.8 88.8 46.8 45.2 87.6 67.1
CoT Closed 62.2 49.5 42.3 74.7 39.2 40.5 43.6 50.3
CoE Closed 66.7 94.4 50.6 87.4 57.1 31.2 71.2 65.5
CAFA Closed 68.1 71.2 44.5 50.0 46.4 41.1 40.1 51.6
StepORLM + GenPRM 8B+8B 97.2 97.8 87.4 98.9 61.1 61.9 94.6 85.6
Notes: Scores from original publications are marked with (*). Abbreviations: CompOR = ComplexOR, IndOR = IndustryOR.

Inference Scaling with GenPRM

Best-of-4 sampling with GenPRM verification yields the strongest average accuracy.

Model NL4OPT MAMO-EasyLP MAMO-ComplexLP NLP4LP CompOR IndOR ReSocratic Avg.
StepORLM as Policy Model
StepORLM 97.7 97.2 79.3 97.8 55.6 59.5 82.6 81.4
+ Majority Vote 97.2 97.6 81.1 96.6 61.1 61.9 89.3 83.5
+ Solver Exec 97.7 98.4 81.1 96.1 61.1 66.7 90.3 84.5
+ Discriminative PRM 97.2 97.2 81.1 97.2 55.6 59.5 87.8 82.2
+ GenPRM (initial) 97.8 97.6 82.8 97.2 55.6 58.5 93.1 83.2
+ GenPRM (final) 97.2 97.8 87.4 98.9 61.1 61.9 94.6 85.6
ORLM as Policy Model
ORLM 73.8 90.4 59.5 76.4 50.0 42.9 61.8 65.0
+ Majority Vote 78.7 88.4 50.5 78.7 44.4 47.6 73.0 65.9
+ Solver Exec 82.2 88.6 63.1 79.8 44.4 52.4 78.9 69.9
+ Discriminative PRM 75.1 91.7 63.1 82.0 50.0 54.8 74.7 70.2
+ GenPRM (initial) 87.3 90.6 55.0 90.4 44.4 47.6 65.5 68.7
+ GenPRM (final) 91.5 91.0 64.9 91.0 50.0 57.1 79.4 75.0
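The scaling strategies in the table reduce to two selection rules over n sampled solutions: majority vote over final answers, and best-of-n under a verifier score (solver execution, a discriminative PRM, or GenPRM critique). A minimal sketch, with `score_fn` standing in for whichever verifier is used:

```python
from collections import Counter

def majority_vote(answers):
    # Pick the most frequent final answer among the n samples.
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score_fn):
    # Pick the single candidate the verifier rates highest; score_fn is a
    # stand-in returning a scalar score (e.g. a GenPRM critique score).
    return max(candidates, key=score_fn)
```

Majority vote needs only final answers and so cannot reward a correct derivation with a rare answer; best-of-n with a trajectory-level verifier can, which matches the table's pattern of GenPRM (final) giving the strongest averages.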

Self-Evolving Progress

Warm-up SFT lifts the base model, and iterative evolution keeps adding gains.

Iterative performance improvements

Performance gains across self-evolving iterations.

Data synthesis pipeline

Data synthesis pipeline that powers the warm-up stage.
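The pipeline's key filter is solver verification: a synthesized (problem, solution) pair enters the 50K SFT pool only if executing its model reproduces the reference objective. A minimal sketch; the dictionary keys and tolerance are hypothetical, not the paper's schema:

```python
def solver_verified(candidate_objective, reference_objective, tol=1e-6):
    # Keep a synthesized pair only if the candidate's solver objective
    # matches the reference within a numerical tolerance.
    if candidate_objective is None or reference_objective is None:
        return False  # solver failed or no reference available
    return abs(candidate_objective - reference_objective) <= tol

def filter_sft_pool(samples, tol=1e-6):
    # samples: list of dicts with hypothetical keys "cand_obj" and "ref_obj".
    return [s for s in samples if solver_verified(s["cand_obj"],
                                                  s["ref_obj"], tol)]
```

Note this is exactly the outcome-level check the case study shows to be insufficient on its own, which is why the co-evolution stage layers trajectory-level critique on top of it.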

Key observations: warm-up SFT yields the largest single boost; co-evolution adds steady gains; and the hardest benchmarks show non-monotonic trends, driven by small test sets and a shift in error type from structural modeling mistakes to code-level fixes.

BibTeX

@misc{zhou2025steporlmselfevolvingframeworkgenerative,
      title={StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models}, 
      author={Chenyu Zhou and Tianyi Xu and Jianghao Lin and Dongdong Ge},
      year={2025},
      eprint={2509.22558},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.22558}, 
}