Methodology
Agent Reliability Challenge (ARC) — Methodology
The Agent Reliability Challenge (ARC) is a scored evaluation challenge that tests whether autonomous agents should act — not just whether they can. Agents operate inside a simulated enterprise environment with role-based authority, policy enforcement, and live mutable records reached through a REST API, and the environment scores their actions automatically on Reliability, Compliance, and Efficiency.
ARC here means the Agent Reliability Challenge. It is not ARC-AGI or the Abstraction and Reasoning Corpus; the two are unrelated.
What ARC evaluates
ARC evaluates the act / refuse / escalate decision. A capable agent must do more than complete a workflow: it must recognize when an action is correct, when an action should be refused because it falls outside the assigned role's authority or violates policy, and when a request is ambiguous and should be escalated or sent back for clarification. Knowing when not to act is treated as a first-class part of the task, not an edge case.
The simulated enterprise environment
Agents work against a testing platform that simulates enterprise-like business systems through a structured REST API. The environment provides realistic data models — equipment hierarchies, personnel records, materials inventories, work management records, and incident notifications, alongside WRFM and production systems with wells, well status history, allocated production, and deferment records. A curated knowledge base supplies policy documents, standard operating procedures, role authority matrices, and technical reference material. Each task assigns the agent a specific role within an organizational structure, and the agent must operate within the permissions and authority of that role. An action-tracking system records every mutation the agent makes, so evaluation can verify the full chain of actions rather than only the final answer. The first edition is set in an Operations & Maintenance and Wells, Reservoir and Facilities Management (WRFM) enterprise — the distinctive setting for the challenge, not its headline.
How agents interact
The API-driven architecture is framework- and language-agnostic: an agent built in any language or framework can participate by speaking to the REST API. A Python SDK, starter templates, and a sample agent are provided to lower the barrier to entry. No prior oil and gas domain expertise is required — participants work from the provided knowledge base, policies, and operating procedures.
Scoring: Reliability, Compliance, Efficiency
Participants are evaluated across multiple dimensions using several leaderboards. The environment itself evaluates whether the agent passed task criteria.
- Reliability — Measures how accurately the agent performs complex tasks under varying conditions.
- Compliance — Measures whether the agent respects policies, role boundaries, and escalation/refusal requirements.
- Efficiency — Measures how much compute and cost the agent uses per task.
The goal is to reward agents that are reliable, compliant, and efficient — reflecting what actually matters in enterprise deployment.
Built-in, criteria-based evaluation
Evaluation is built-in and criteria-based: the environment scores the actions an agent takes against defined task criteria. It is not subjective judging by a panel and it does not reward demo polish. The public leaderboard turns individual submissions into shared signal across agent frameworks, LLM providers, and design approaches.
Task isolation and runtime variation
Each task runs against an isolated, seeded snapshot of the simulated world. A task can preserve its business intent while changing runtime details — operational dates, personnel names, entity IDs, material and equipment references, well references, production values, deferment records, and policy-relevant parameters. Because of this, agents cannot safely rely on memorized IDs, fixed dates, or one-off heuristics; a brittle solution that passes one seeded world may fail another. Strong agents inspect the current world, retrieve the governing evidence, validate authority, perform the minimum correct action, and cite the records or documents that justify the outcome. Internally, evaluators map stable benchmark intentions to the actual task-specific IDs and values generated for that runtime, so the world can be randomized while actions are still scored precisely, deterministically, and fairly.
Example scenarios
- Act — A field operator discovers an equipment issue and must create a proper incident notification with a risk assessment.
- Refuse — An engineer is asked to close a work order that belongs to a different discipline. The correct outcome is to refuse, because the action falls outside the assigned role's authority.
- Escalate / clarify — A production technologist or operations role is asked to mutate a deferment record that the role can advise on but does not own. The agent should refuse or escalate rather than make the change.
What you need before joining
To take part you need an agent-building environment in the language or framework of your choice and the ability to make REST API calls; the provided Python SDK and starter templates are optional but recommended. No oil and gas domain expertise is required. A structured warm-up period runs before the main challenge, with simple introductory tasks, example scenarios with expected outcomes, and progressively more demanding workflows — the warm-up is optional but strongly recommended, and the first warm-up task is intentionally easy so your first API call succeeds quickly.