ICLR 2026

StepORLM

A self-evolving framework with generative process supervision for operations research language models.

Chenyu Zhou · Tianyi Xu · Jianghao Lin† · Dongdong Ge
SJTU IIC · Institute of Intelligent Computing, Shanghai Jiao Tong University
† Corresponding author: Jianghao Lin
Co-evolutionary loop of StepORLM

Co-evolutionary loop: policy + GenPRM improve each other.

Abstract

From myopic supervision to holistic, solver-grounded process evaluation.

Large Language Models (LLMs) have shown promise for Operations Research (OR), yet outcome-only rewards struggle with credit assignment, and discriminative step-wise supervision is often myopic. We introduce StepORLM, a self-evolving framework with generative process supervision. A policy model and a Generative Process Reward Model (GenPRM) co-evolve in a dual-feedback loop that combines solver-based outcome verification with holistic trajectory-level critique.

This combined signal aligns the policy via Weighted DPO and simultaneously refines the GenPRM. The resulting 8B StepORLM achieves state-of-the-art performance across six OR benchmarks, and the co-evolved GenPRM generalizes as a universal verifier, lifting inference-time scaling for both StepORLM and other OR LMs.
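The Weighted DPO alignment step can be viewed as an ordinary DPO loss whose preference pairs are scaled by a per-pair weight, e.g. derived from GenPRM's critique. A minimal pure-Python sketch under that assumption; the function name and interface are illustrative, not the paper's implementation:

```python
import math

def weighted_dpo_loss(policy_chosen, policy_rejected,
                      ref_chosen, ref_rejected,
                      weights, beta=0.1):
    """DPO loss with a per-pair weight (hypothetical interface).

    Each list holds one summed log-probability per trajectory; `weights`
    scales each preference pair's contribution (e.g. a GenPRM score).
    """
    total = 0.0
    for pc, pr, rc, rr, w in zip(policy_chosen, policy_rejected,
                                 ref_chosen, ref_rejected, weights):
        margin = beta * ((pc - pr) - (rc - rr))
        # -log(sigmoid(margin)) = log(1 + exp(-margin))
        total += w * math.log1p(math.exp(-margin))
    return total / len(weights)
```

With all weights set to 1 this reduces to plain DPO; down-weighting a pair shrinks its gradient contribution, which is how a process-level critique can soften noisy outcome labels.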

Why It Pops

Four signature ideas that make StepORLM stand out.

Generative Process Supervision

GenPRM evaluates entire trajectories, capturing long-range dependencies that step-wise PRMs miss.

Dual-Feedback Loop

Solver-verified outcomes plus holistic process critique drive stable, grounded improvements.

Self-Evolution

Policy and GenPRM co-evolve across iterations, steadily lifting performance on hard OR tasks.

Universal Verifier

GenPRM boosts ORLM by +10.0 avg Pass@1, proving strong cross-model generalization.

Framework at a Glance

Two stages. One co-evolutionary loop.

1

Warm-Up SFT

50K solver-verified OR problems with step-level reasoning train the initial policy. Data is synthesized and validated via external solvers.

2

Iterative Co-Evolution

Credit assignment vs myopia in OR supervision

Why outcome-only and step-wise supervision fail on long-horizon OR reasoning.

Self-evolving process overview

The self-evolving loop strengthens policy reasoning and GenPRM critiques.
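In schematic form, one iteration of the loop samples candidate trajectories per problem, ranks them by solver outcome and then by GenPRM critique, and updates both models from the resulting preference pairs. A minimal skeleton in which plain callables stand in for the real policy, solver, and GenPRM (a hypothetical interface, not the paper's API):

```python
def co_evolve(sample_fn, solver_verify, critique, update_policy, update_prm,
              problems, n_iters=3, k=4):
    # Skeleton of the dual-feedback loop; all callables are stand-ins.
    pairs = []
    for _ in range(n_iters):
        pairs = []
        for prob in problems:
            trajs = [sample_fn(prob) for _ in range(k)]
            # Rank by solver outcome first, holistic critique second.
            ranked = sorted(trajs, key=lambda t: (solver_verify(prob, t),
                                                  critique(prob, t)))
            # Best trajectory is "chosen", worst is "rejected".
            pairs.append((ranked[-1], ranked[0]))
        update_policy(pairs)   # e.g. a Weighted DPO step
        update_prm(pairs)      # refine the process reward model
    return pairs
```

The point of the structure is the dual feedback: the solver grounds the ranking in verifiable outcomes, while the critique breaks ties among outcome-equivalent trajectories.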

Case Study: TSP Alignment Failure

A concrete ComplexLP example where a constraint error does not change the optimal value.

ComplexLP Problem 74 tour guide TSP visualization

ComplexLP Problem 74: bus tour across five cities (simplified visualization).

This example highlights the alignment gap between outcome-only supervision and process-correct modeling. Even when the solver returns the correct optimal objective, a subtle constraint mistake can survive because it does not change the final value for this instance.

StepORLM’s generative process supervision explicitly critiques the full reasoning trajectory, allowing the model to identify structural modeling errors that outcome rewards alone would miss. This is the motivation for holistic, trajectory-level verification in OR tasks.
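The failure mode is easy to reproduce with a toy integer program: a wrong variable bound is invisible on one instance because a different constraint binds, yet it changes the optimum once that constraint is relaxed. A brute-force sketch on an illustrative instance (not Problem 74 itself):

```python
from itertools import product

def solve(x_max, y_max, sum_max):
    # Brute-force a tiny integer program: maximize x + y over the box
    # 0 <= x <= x_max, 0 <= y <= y_max, subject to x + y <= sum_max.
    best = 0
    for x, y in product(range(x_max + 1), range(y_max + 1)):
        if x + y <= sum_max:
            best = max(best, x + y)
    return best

# Instance A: the wrong bound x <= 5 is masked because x + y <= 4 binds.
correct_a = solve(x_max=3, y_max=3, sum_max=4)   # 4
wrong_a   = solve(x_max=5, y_max=3, sum_max=4)   # 4: same objective, model still wrong

# Instance B: relaxing the coupling constraint exposes the error.
correct_b = solve(x_max=3, y_max=3, sum_max=8)   # 6
wrong_b   = solve(x_max=5, y_max=3, sum_max=8)   # 8: objectives now diverge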

Main Results

Pass@1 accuracy (%) across six OR benchmarks. All agentic baselines use GPT-4o as backbone.

Model Params NL4OPT MAMO-EasyLP MAMO-ComplexLP NLP4LP CompOR IndOR ReSocratic Avg.
Zero-shot LLMs
OpenAI o3 Closed 78.4 93.9 63.1 93.8 72.2 76.2 84.4 80.3
Gemini-2.5-Pro Closed 82.6 87.9 52.3 94.9 66.7 78.6 89.6 78.9
GPT-4o Closed 61.2 70.3 57.7 73.6 42.9 38.1 48.4 56.0
Kimi-K2 1T 77.9 93.4 55.9 84.3 61.1 59.5 81.9 73.4
DeepSeek-R1 671B 77.5 90.3 59.5 87.6 66.7 59.5 83.9 75.0
DeepSeek-V3 671B 79.8 95.2 53.2 92.1 55.6 66.7 85.1 75.4
Qwen2.5-72B-Inst 72B 78.9 95.8 44.1 88.2 50.0 57.1 81.1 70.7
Qwen3-32B 32B 77.5 92.3 46.9 93.8 50.0 61.9 85.1 72.5
Qwen3-8B 8B 63.8 73.6 45.0 58.4 27.8 52.4 61.0 54.6
Fine-tuned LLMs
ORLM 8B 73.8 90.4 59.5 76.4 50.0 42.9 61.8 65.0
LLMOPT 14B 75.1 83.5 67.6 86.0 22.2 52.4 73.2 65.7
OptMATH (origin) 32B 95.9* 89.9* 54.1* - - - - -
StepORLM 8B 97.7 97.2 79.3 97.8 55.6 59.5 82.6 81.4
Agentic Methods
OptiMUS-v0.3 Closed 76.2 78.0 46.8 88.8 46.8 45.2 87.6 67.1
CoT Closed 62.2 49.5 42.3 74.7 39.2 40.5 43.6 50.3
CoE Closed 66.7 94.4 50.6 87.4 57.1 31.2 71.2 65.5
CAFA Closed 68.1 71.2 44.5 50.0 46.4 41.1 40.1 51.6
StepORLM + GenPRM 8B+8B 97.2 97.8 87.4 98.9 61.1 61.9 94.6 85.6
Notes: Scores from original publications are marked with (*). Abbreviations: CompOR = ComplexOR, IndOR = IndustryOR.

Inference Scaling with GenPRM

Best-of-4 sampling with GenPRM verification yields the strongest average accuracy.

Model NL4OPT MAMO-EasyLP MAMO-ComplexLP NLP4LP CompOR IndOR ReSocratic Avg.
StepORLM as Policy Model
StepORLM 97.7 97.2 79.3 97.8 55.6 59.5 82.6 81.4
+ Majority Vote 97.2 97.6 81.1 96.6 61.1 61.9 89.3 83.5
+ Solver Exec 97.7 98.4 81.1 96.1 61.1 66.7 90.3 84.5
+ Discriminative PRM 97.2 97.2 81.1 97.2 55.6 59.5 87.8 82.2
+ GenPRM (initial) 97.8 97.6 82.8 97.2 55.6 58.5 93.1 83.2
+ GenPRM (final) 97.2 97.8 87.4 98.9 61.1 61.9 94.6 85.6
ORLM as Policy Model
ORLM 73.8 90.4 59.5 76.4 50.0 42.9 61.8 65.0
+ Majority Vote 78.7 88.4 50.5 78.7 44.4 47.6 73.0 65.9
+ Solver Exec 82.2 88.6 63.1 79.8 44.4 52.4 78.9 69.9
+ Discriminative PRM 75.1 91.7 63.1 82.0 50.0 54.8 74.7 70.2
+ GenPRM (initial) 87.3 90.6 55.0 90.4 44.4 47.6 65.5 68.7
+ GenPRM (final) 91.5 91.0 64.9 91.0 50.0 57.1 79.4 75.0
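The scaling strategies in the table reduce to two selection rules over n sampled solutions: majority vote over final answers, and best-of-n under a verifier score (solver execution, a discriminative PRM, or GenPRM critique). A minimal sketch, with `score_fn` standing in for whichever verifier is used:

```python
from collections import Counter

def majority_vote(answers):
    # Pick the most frequent final answer among the n samples.
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score_fn):
    # Pick the single candidate the verifier rates highest; score_fn is a
    # stand-in returning a scalar score (e.g. a GenPRM critique score).
    return max(candidates, key=score_fn)
```

Majority vote needs only final answers and so cannot reward a correct derivation with a rare answer; best-of-n with a trajectory-level verifier can, which matches the table's pattern of GenPRM (final) giving the strongest averages.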

Self-Evolving Progress

Warm-up SFT lifts the base model, and iterative evolution keeps adding gains.

Iterative performance improvements

Performance gains across self-evolving iterations.

Data synthesis pipeline

Data synthesis pipeline that powers the warm-up stage.
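The pipeline's key filter is solver verification: a synthesized (problem, solution) pair enters the 50K SFT pool only if executing its model reproduces the reference objective. A minimal sketch; the dictionary keys and tolerance are hypothetical, not the paper's schema:

```python
def solver_verified(candidate_objective, reference_objective, tol=1e-6):
    # Keep a synthesized pair only if the candidate's solver objective
    # matches the reference within a numerical tolerance.
    if candidate_objective is None or reference_objective is None:
        return False  # solver failed or no reference available
    return abs(candidate_objective - reference_objective) <= tol

def filter_sft_pool(samples, tol=1e-6):
    # samples: list of dicts with hypothetical keys "cand_obj" and "ref_obj".
    return [s for s in samples if solver_verified(s["cand_obj"],
                                                  s["ref_obj"], tol)]
```

Note this is exactly the outcome-level check the case study shows to be insufficient on its own, which is why the co-evolution stage layers trajectory-level critique on top of it.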

Key observations: warm-up SFT yields the largest single boost; co-evolution adds steady gains; and the hardest benchmarks show non-monotonic trends, driven by small test sets and a shift in error type from structural modeling mistakes to code-level fixes.

BibTeX

@misc{zhou2025steporlmselfevolvingframeworkgenerative,
      title={StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models}, 
      author={Chenyu Zhou and Tianyi Xu and Jianghao Lin and Dongdong Ge},
      year={2025},
      eprint={2509.22558},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.22558}, 
}