Benchmarking and Learning Real-World Customer Service Dialogue

Abstract

Existing benchmarks and training pipelines for industrial intelligent customer service (ICS) remain misaligned with real-world dialogue requirements, overemphasizing verifiable task success while under-measuring subjective service quality and realistic failure modes, leaving a gap between offline gains and deployable dialogue behavior. We close this gap with a benchmark-to-optimization loop: we first introduce OlaBench, an ICS benchmark spanning retrieval-augmented generation, workflow-based systems, and agentic settings, which evaluates service capability, safety, and latency sensitivity; moreover, motivated by OlaBench results showing state-of-the-art LLMs still fall short, we propose OlaMind, which distills reusable reasoning patterns and service strategies from expert dialogues and applies rubric-aware staged exploration-exploitation reinforcement learning to improve model capability. OlaMind surpasses GPT-5.2 and Gemini 3 Pro on OlaBench (83.64 vs. 70.58/70.84) and, in online A/B tests, delivers an average +23.67% issue resolution and -6.6% human transfer rate versus the baseline, bridging offline gains to deployment. Together, OlaBench and OlaMind advance ICS systems toward more anthropomorphic, professional, and reliable deployment.

Overview of challenges in industrial intelligent customer service and the OlaBench/OlaMind framework — Representative challenges faced by industrial intelligent customer service (ICS) systems, and our corresponding contributions: OLABENCH targets the benchmark gap, while OLAMIND addresses the learning gap.

Introduction

Industrial intelligent customer service requires dialogue systems that are effective, human-like, professionally competent, policy-compliant, and safe, all under strict latency and service constraints. Existing service-agent benchmarks predominantly emphasize task completion and tool correctness. Several critical dimensions remain underexplored, including hallucinations hidden behind fluent language, latency overhead from reasoning-intensive generation, and long-horizon adherence to service strategies and policies in multi-turn dialogues. As a result, strong offline performance remains a weak indicator of reliable real-world deployment.

To bridge this gap, the paper introduces OlaBench, a real-world benchmark for deployable dialogue behavior across service capability, safety, and latency-awareness in RAG, workflow, and agent settings. Motivated by OlaBench results showing that state-of-the-art LLMs still fall short, it further proposes OlaMind, which distills reusable reasoning patterns and service strategies from expert dialogues and refines them through rubric-aware exploration-exploitation reinforcement learning. The main contributions are as follows:

Contribution 1

Construct OlaBench, a real-world benchmark for industrial customer service that evaluates deployable dialogue behavior across multi-dimensional service quality, critical risk, hallucination, and latency.

Contribution 2

Propose OlaMind, a training paradigm that bootstraps models with expert reasoning patterns and service strategies, and then progressively elicits effective behaviors via exploration-exploitation reinforcement learning.

Contribution 3

Demonstrate that OlaMind achieves state-of-the-art performance on OlaBench and delivers consistent, measurable improvements in real-world deployment with live users.

Performance comparison of OlaMind against general LLMs on OlaBench — Performance comparison of OLAMIND against general LLMs on OLABENCH.

OlaBench: A Multi-Dimensional ICS Benchmark

Grounded in real challenges encountered in industrial practice, OlaBench is derived from real-world industrial customer-service data to evaluate models across multi-dimensional service capability, safety, and latency-awareness. It consists of three subsets spanning three application scenarios and evaluates six sub-capabilities.

Overall statistics of the OlaBench dataset.

We incorporate a dedicated hallucination-judge model, OlaMind-Hall-Judge, trained through a structured human-LLM interaction pipeline. The system first screens whether a response is hallucinatory, confirms non-hallucination cases with strong general LLMs, and routes suspected hallucinations to human verification. Verified annotations, refined rationales, and hard non-hallucination cases are then used to iteratively improve the detector.

Training pipeline for OlaMind-Hall-Judge — Training pipeline for our hallucination detection model OlaMind-Hall-Judge.

Human Verification

Human experts score 200 randomly sampled instances with the same standards as the LLM judge, and 5,000 human-annotated instances are used for safety evaluation. The agreement is strong across subjective service dimensions, supporting the use of the benchmark as a reliable offline evaluation tool.

Consistency between our LLM-as-a-judge and human experts.

OlaMind: Staged Rubric-Aware RL

Motivated by evaluations on OlaBench, which reveal that current state-of-the-art LLMs still fall short under industrial customer service constraints, we propose OlaMind, a training paradigm that progressively aligns models with industrial objectives. Instead of direct imitation, it distills reusable reasoning patterns and service strategies from expert dialogues and then adopts an exploration-exploitation refinement scheme.

Cold-Start

Thought Restructuring

This stage addresses noisy human data and the instability of directly imitating human responses. Strong general LLMs extract CoT-style trajectories that expose expert reasoning patterns and service strategies, and the resulting data is used to train Zero-Think and OlaMind-Cold-Start.

Stage-1 RL

Extensive Exploration

Starting from OlaMind-Cold-Start, GRPO with pre-CoT training encourages diverse reasoning paths. Rewards combine Dialogue Quality, Policy Compliance, Tool Calling, and a rubric-aware reward with instance-specific weighted criteria.

Stage-2 RL

Deep Exploitation

A re-cold-start step builds higher-quality data, then hybrid-CoT GRPO adds format, length, rule-match, risk, and hallucination rewards. This stage improves safety while preserving reasoning capability for latency-sensitive deployment.

Raw reasoning and service strategy records used by OlaMind — Overview of the staged OlaMind pipeline, from thought restructuring to exploration-exploitation reinforcement learning.

Reasoning comparison between Zero-Think and general LLMs — Comparison of the reasoning patterns and service strategies between *Zero-Think* and general LLMs.

Evaluation Results

On OlaBench, current state-of-the-art general LLMs still fall short, while OlaMind reaches 83.64 overall and surpasses GPT-5.2 at 70.58 and Gemini 3 Pro at 70.84.

Progressive Improvement. Stage-1 RL prioritizes exploration and achieves peak pure service capability, while Stage-2 adds the full safety reward spectrum and lowers business risk to 8.7%.
Efficiency. The post-CoT setup improves latency-performance trade-offs and supports immediate responses with optional truncatable reasoning.
Human Evaluation. Compared with the cold-start model, Stage-2 improves GSB by 35.6% and raises anthropomorphism Turing-test pass rate from 47.5% to 60.5%.

Offline performance of representative general LLMs and OlaMind variants on OlaBench.

The Pareto plot further highlights the latency-quality trade-offs of post-CoT OlaMind variants.

Pareto efficiency of OlaMind-Stage-2 models.

Ablation Study

Ablation results show that the final OlaMind performance depends jointly on rubric-aware reward terms, CoT strategy choices, and the staged reinforcement learning design.

Ablation results on reward components and staged reinforcement learning design — Ablation study on reward terms, CoT strategy, and training pipeline.

Online Gains and Reliability Analysis

Large-scale online A/B tests confirm that the offline gains transfer to deployment. Across community support and livestream interaction, OlaMind-Stage-2 achieves larger issue-resolution gains and lower human-transfer rates than OlaMind-Cold-Start.

Further analysis shows that Stage-2 corrects the over-exploration behavior of Stage-1, especially by reducing overcommitting risk and factual hallucination. The qualitative case study further illustrates more restrained, policy-consistent responses in ambiguous customer-service scenarios.

Risk and hallucination type distributions with qualitative response comparison.

BibTeX

@article{gao2025benchmarking,
  title={Benchmarking and Learning Real-World Customer Service Dialogue},
  author={Gao, Tianhong and Shen, Jundong and Wang, Jiapeng and Shi, Bei and Ju, Ying and Yao, Junfeng and Yu, Huiyu},
  journal={arXiv preprint arXiv:2510.22143},
  year={2025}
}