StepORLM
A self-evolving framework with generative process supervision for operations research language models.
Co-evolutionary loop: policy + GenPRM improve each other.
Abstract
From myopic supervision to holistic, solver-grounded process evaluation.
Large Language Models have shown promise for Operations Research (OR), yet outcome rewards struggle with credit assignment, and discriminative step-wise supervision is often myopic. We introduce StepORLM, a self-evolving framework with generative process supervision. A policy model and a Generative Process Reward Model (GenPRM) co-evolve in a dual-feedback loop combining solver-based outcome verification with holistic trajectory-level critique.
This combined signal aligns the policy via Weighted DPO and simultaneously refines the GenPRM. The resulting 8B StepORLM achieves state-of-the-art performance across six OR benchmarks, and the co-evolved GenPRM generalizes as a universal verifier, lifting inference-time scaling for both StepORLM and other OR language models.
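The Weighted DPO alignment mentioned above can be sketched as a standard DPO objective with a per-pair weight derived from the combined solver-outcome and GenPRM process signal. This is a minimal illustrative sketch in plain Python; the field names, the weighting scheme, and the averaging are assumptions for illustration, not the paper's exact formulation.

```python
import math

def wdpo_loss(pairs, beta=0.1):
    """Weighted DPO loss over preference pairs.

    Each pair holds log-probabilities of the chosen/rejected trajectory
    under the policy ("pi_*") and the frozen reference model ("ref_*"),
    plus a weight "w" derived from solver outcome + GenPRM critique.
    (These keys are illustrative names, not from the paper.)
    """
    total = 0.0
    for p in pairs:
        # Implicit-reward margin between chosen and rejected trajectories.
        margin = (p["pi_chosen"] - p["ref_chosen"]) - (p["pi_rejected"] - p["ref_rejected"])
        # Weighted negative log-sigmoid, as in DPO but scaled per pair.
        total += -p["w"] * math.log(1.0 / (1.0 + math.exp(-beta * margin)))
    return total / len(pairs)
```

With a zero margin the per-pair loss reduces to `w * log 2`, which is a quick sanity check on the implementation.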
Why It Pops
Four signature ideas that make StepORLM stand out.
Generative Process Supervision
GenPRM evaluates entire trajectories, capturing long-range dependencies that step-wise PRMs miss.
Dual-Feedback Loop
Solver-verified outcomes plus holistic process critique drive stable, grounded improvements.
Self-Evolution
Policy and GenPRM co-evolve across iterations, steadily lifting performance on hard OR tasks.
Universal Verifier
GenPRM boosts ORLM by +10.0 avg Pass@1, proving strong cross-model generalization.
Framework at a Glance
Two stages. One co-evolutionary loop.
Warm-Up SFT
50K solver-verified OR problems with step-level reasoning train the initial policy. Data is synthesized and validated via external solvers.
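The solver-based validation step above can be sketched as a simple filter: a synthesized sample survives only if executing its generated modeling code reproduces the solver-certified reference objective. The dictionary keys and tolerance below are illustrative assumptions, not the paper's pipeline.

```python
def solver_verify(candidate_objective, reference_objective, tol=1e-6):
    """Accept a sample only if the objective obtained by executing its
    generated modeling code matches the solver-verified ground truth."""
    return abs(candidate_objective - reference_objective) <= tol

def filter_corpus(samples, tol=1e-6):
    """Keep only solver-verified samples.

    samples: list of dicts with 'pred_obj' (objective from executing the
    generated code) and 'ref_obj' (solver-certified reference); the key
    names are hypothetical, for illustration only.
    """
    return [s for s in samples if solver_verify(s["pred_obj"], s["ref_obj"], tol)]
```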
Iterative Co-Evolution
The self-evolving loop strengthens policy reasoning and sharpens GenPRM critiques.
Why outcome-only and step-wise supervision fail on long-horizon OR reasoning.
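One piece of the dual-feedback loop is turning the two signals into preference pairs for alignment. The sketch below combines a binary solver outcome with a GenPRM process score into a single ranking and pairs the best trajectory against the worst; the 0.5/0.5 mix and the field names are assumptions for illustration, not the paper's scheme.

```python
def build_preference_pairs(trajs):
    """Rank trajectories for one problem by a blend of solver outcome
    (0/1) and GenPRM process score (0..1), then pair best vs. worst.

    The returned weight 'w' (score gap) could feed a Weighted DPO loss.
    The equal blend is an illustrative assumption.
    """
    def blended(t):
        return 0.5 * t["outcome"] + 0.5 * t["process"]

    ranked = sorted(trajs, key=blended, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    return {"chosen": chosen, "rejected": rejected, "w": blended(chosen) - blended(rejected)}
```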
Case Study: TSP Alignment Failure
A concrete ComplexLP example where a constraint error does not change the optimal value.
ComplexLP Problem 74: bus tour across five cities (simplified visualization).
This example highlights the alignment gap between outcome-only supervision and process-correct modeling. Even when the solver returns the correct optimal objective, a subtle constraint mistake can survive because it does not change the final value for this instance.
StepORLM’s generative process supervision explicitly critiques the full reasoning trajectory, allowing the model to identify structural modeling errors that outcome rewards alone would miss. This is the motivation for holistic, trajectory-level verification in OR tasks.
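The alignment gap can be reproduced with a toy LP (not the ComplexLP instance from the paper): a model that drops a constraint can still return the correct optimal objective, so an outcome-only reward cannot tell the two formulations apart. The brute-force grid search below stands in for an exact solver and is adequate here because the optimum sits on grid points.

```python
def solve_lp(constraints, step=0.5):
    """Brute-force a tiny LP (maximize 3x + 2y) over a grid.

    A toy stand-in for an exact solver, used only to compare the
    optimal objectives of two formulations.
    """
    best = None
    x = 0.0
    while x <= 6.0:
        y = 0.0
        while y <= 6.0:
            if all(c(x, y) for c in constraints):
                v = 3 * x + 2 * y
                if best is None or v > best:
                    best = v
            y += step
        x += step
    return best

# Correct formulation vs. a buggy one that silently drops y <= 3.
correct = [lambda x, y: x + y <= 4, lambda x, y: x <= 2, lambda x, y: y <= 3]
buggy = [lambda x, y: x + y <= 4, lambda x, y: x <= 2]
```

Both formulations attain the same optimum at (x, y) = (2, 2), so solver-based outcome verification passes the buggy model; only a process-level critique of the constraints can flag the error.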
Main Results
Pass@1 accuracy (%) across six OR benchmarks. All agentic baselines use GPT-4o as the backbone.
| Model | Params | NL4OPT | MAMO EasyLP | MAMO ComplexLP | NLP4LP | CompOR | IndOR | ReSocratic | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot LLMs | |||||||||
| OpenAI o3 | Closed | 78.4 | 93.9 | 63.1 | 93.8 | 72.2 | 76.2 | 84.4 | 80.3 |
| Gemini-2.5-Pro | Closed | 82.6 | 87.9 | 52.3 | 94.9 | 66.7 | 78.6 | 89.6 | 78.9 |
| GPT-4o | Closed | 61.2 | 70.3 | 57.7 | 73.6 | 42.9 | 38.1 | 48.4 | 56.0 |
| Kimi-K2 | 1T | 77.9 | 93.4 | 55.9 | 84.3 | 61.1 | 59.5 | 81.9 | 73.4 |
| DeepSeek-R1 | 671B | 77.5 | 90.3 | 59.5 | 87.6 | 66.7 | 59.5 | 83.9 | 75.0 |
| DeepSeek-V3 | 671B | 79.8 | 95.2 | 53.2 | 92.1 | 55.6 | 66.7 | 85.1 | 75.4 |
| Qwen2.5-72B-Inst | 72B | 78.9 | 95.8 | 44.1 | 88.2 | 50.0 | 57.1 | 81.1 | 70.7 |
| Qwen3-32B | 32B | 77.5 | 92.3 | 46.9 | 93.8 | 50.0 | 61.9 | 85.1 | 72.5 |
| Qwen3-8B | 8B | 63.8 | 73.6 | 45.0 | 58.4 | 27.8 | 52.4 | 61.0 | 54.6 |
| Fine-tuned LLMs | |||||||||
| ORLM | 8B | 73.8 | 90.4 | 59.5 | 76.4 | 50.0 | 42.9 | 61.8 | 65.0 |
| LLMOPT | 14B | 75.1 | 83.5 | 67.6 | 86.0 | 22.2 | 52.4 | 73.2 | 65.7 |
| OptMATH (origin) | 32B | 95.9* | 89.9* | 54.1* | - | - | - | - | - |
| StepORLM | 8B | 97.7 | 97.2 | 79.3 | 97.8 | 55.6 | 59.5 | 82.6 | 81.4 |
| Agentic Methods | |||||||||
| OptiMUS-v0.3 | Closed | 76.2 | 78.0 | 46.8 | 88.8 | 46.8 | 45.2 | 87.6 | 67.1 |
| CoT | Closed | 62.2 | 49.5 | 42.3 | 74.7 | 39.2 | 40.5 | 43.6 | 50.3 |
| CoE | Closed | 66.7 | 94.4 | 50.6 | 87.4 | 57.1 | 31.2 | 71.2 | 65.5 |
| CAFA | Closed | 68.1 | 71.2 | 44.5 | 50.0 | 46.4 | 41.1 | 40.1 | 51.6 |
| StepORLM + GenPRM | 8B+8B | 97.2 | 97.8 | 87.4 | 98.9 | 61.1 | 61.9 | 94.6 | 85.6 |
Inference Scaling with GenPRM
Best-of-4 sampling with GenPRM verification yields the strongest average accuracy.
| Model | NL4OPT | MAMO EasyLP | MAMO ComplexLP | NLP4LP | CompOR | IndOR | ReSocratic | Avg. |
|---|---|---|---|---|---|---|---|---|
| StepORLM as Policy Model | ||||||||
| StepORLM | 97.7 | 97.2 | 79.3 | 97.8 | 55.6 | 59.5 | 82.6 | 81.4 |
| + Majority Vote | 97.2 | 97.6 | 81.1 | 96.6 | 61.1 | 61.9 | 89.3 | 83.5 |
| + Solver Exec | 97.7 | 98.4 | 81.1 | 96.1 | 61.1 | 66.7 | 90.3 | 84.5 |
| + Discriminative PRM | 97.2 | 97.2 | 81.1 | 97.2 | 55.6 | 59.5 | 87.8 | 82.2 |
| + GenPRM (initial) | 97.8 | 97.6 | 82.8 | 97.2 | 55.6 | 58.5 | 93.1 | 83.2 |
| + GenPRM (final) | 97.2 | 97.8 | 87.4 | 98.9 | 61.1 | 61.9 | 94.6 | 85.6 |
| ORLM as Policy Model | ||||||||
| ORLM | 73.8 | 90.4 | 59.5 | 76.4 | 50.0 | 42.9 | 61.8 | 65.0 |
| + Majority Vote | 78.7 | 88.4 | 50.5 | 78.7 | 44.4 | 47.6 | 73.0 | 65.9 |
| + Solver Exec | 82.2 | 88.6 | 63.1 | 79.8 | 44.4 | 52.4 | 78.9 | 69.9 |
| + Discriminative PRM | 75.1 | 91.7 | 63.1 | 82.0 | 50.0 | 54.8 | 74.7 | 70.2 |
| + GenPRM (initial) | 87.3 | 90.6 | 55.0 | 90.4 | 44.4 | 47.6 | 65.5 | 68.7 |
| + GenPRM (final) | 91.5 | 91.0 | 64.9 | 91.0 | 50.0 | 57.1 | 79.4 | 75.0 |
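The best-of-4 scheme behind these numbers is straightforward to sketch: sample N candidate solutions from the policy, score each full trajectory with the verifier, and keep the highest-scoring one. `verifier_score` below is a stand-in for a GenPRM call; the interface is assumed for illustration.

```python
def best_of_n(candidates, verifier_score):
    """Best-of-N selection with a trajectory-level verifier.

    candidates: N sampled solutions for one problem.
    verifier_score: callable mapping a candidate to a scalar score
    (a stand-in for a GenPRM critique of the full trajectory).
    """
    return max(candidates, key=verifier_score)
```

Unlike majority voting, this scheme can pick a solution that only one sample produced, provided the verifier rates its reasoning trajectory highest, which is consistent with GenPRM outperforming majority voting in the table above.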
Self-Evolving Progress
Warm-up SFT lifts the base model, and iterative evolution keeps adding gains.
Performance gains across self-evolving iterations.
Data synthesis pipeline that powers the warm-up stage.
Key observations: warm-up SFT yields the largest initial boost; co-evolution provides steady improvements; and the hardest benchmarks show non-monotonic trends due to small test sets and error type shifts from structural modeling to code-level fixes.
BibTeX
@misc{zhou2025steporlmselfevolvingframeworkgenerative,
title={StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models},
author={Chenyu Zhou and Tianyi Xu and Jianghao Lin and Dongdong Ge},
year={2025},
eprint={2509.22558},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2509.22558},
}