Regulator-Grade AI Agent Certification
Not a vendor self-attestation. An independent third-party evaluation with a signed, tamper-evident certificate and public registry listing.
Enterprise procurement teams, institutional investors, and regulated-industry buyers require third-party evaluation because vendors cannot audit themselves, for the same reason you need a Big Four auditor for your financials.
$4,990 one-time evaluation · $1,990 quarterly retest · 5-suite, 250-test assessment · 90-day cert validity
Why buyers cannot accept a vendor's own eval
Vendor self-attestation
- ✗ Conflict of interest — vendor controls methodology and results
- ✗ No replay verification — scores cannot be independently confirmed
- ✗ No tamper detection — report can be modified post-generation
- ✗ Fails enterprise procurement requirements for third-party audit
Independent third-party evaluation
- ✓ Independent — no commercial relationship with model providers
- ✓ Deterministic replay — any party can verify scores match the cert
- ✓ Tamper-evident — HMAC-SHA256 hash checked on every public view
- ✓ Public registry — permanent, citable record for procurement RFPs
Frameworks this evaluation addresses
The evaluation produces documented technical evidence relevant to the following frameworks.
Mention of regulatory frameworks does not constitute legal advice. Consult qualified legal counsel for your specific requirements.
Anthropic Responsible Scaling Policy
Third-party capability and safety threshold documentation required for frontier model deployment decisions.
OpenAI Preparedness Framework
Independent frontier risk assessment documentation supporting safe deployment decisions.
EU AI Act Article 50
Transparency obligations for general-purpose AI systems require documented technical evidence.
ISO/IEC 42001 AI Management System
Third-party evaluation evidence for AI management system audit and certification.
5 evaluation suites · 250 test cases total
Each suite runs 50 deterministic test cases against open benchmark datasets.
1. Capability Benchmarking
MMLU-Pro, MATH, HumanEval, and agentic task datasets. Measures raw task capability across domains.
Datasets: MMLU-Pro, MATH-500, HumanEval, AgentBench
2. Safety Alignment
HarmBench harmful request categories. Measures alignment quality and refusal on clearly harmful prompts.
Datasets: HarmBench, AdvBench, SafetyBench
3. HarmBench Jailbreak Resistance
Adversarial jailbreak attempts from HarmBench. Measures robustness against prompt injection and manipulation.
Datasets: HarmBench Jailbreaks, JailbreakBench, WildJailbreak
4. Tool-Use Correctness
Structured tool-calling scenarios. Measures API call format accuracy, parameter correctness, and error handling.
Datasets: ToolBench, APIBench, ToolEval
5. Refusal Calibration
Borderline prompts testing calibration. Measures appropriate refusal without over-refusing benign requests.
Datasets: RefusalBench, XSTest, TruthfulQA
Pricing
Simple, transparent pricing. No hidden fees.
One-time evaluation · $4,990
- 5-suite, 250-test evaluation
- Signed, tamper-evident certificate
- Public registry listing
- Replay verification endpoints
- 90-day certificate validity
- Badge embed code
- Full methodology report (PDF)
Quarterly retest · $1,990
- Re-runs all 5 suites
- Certificate renewed for 90 days
- Registry listing updated
- Score delta comparison report
- Covers model version and prompt changes
- Required for continuous registry status
Trust architecture
Deterministic replay
Every run signed with HMAC-SHA256. Any party can verify scores match.
Tamper detection
Hash recomputed on every public view. Mismatch triggers automatic REVOKED status.
Public registry
Permanent public record at regulatorysignals.com/agent-eval-registry.
Independence
We have no commercial relationship with model providers. Evaluation is conflict-free.
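For illustration, here is a minimal sketch of how the per-view tamper check could work, assuming the certificate payload is flat canonical JSON and the registry holds the signing key. All names here are illustrative, not the production implementation.

// Tamper-check sketch (TypeScript, Node.js). Assumes a flat payload with
// a stable serialization and a registry-held signing key; names illustrative.
import { createHmac, timingSafeEqual } from "node:crypto";

interface StoredCert {
  payload: Record<string, unknown>; // evaluation payload as issued (flat)
  certHash: string;                 // hex HMAC-SHA256 recorded at issuance
}

// Recompute the HMAC over a stable serialization of the payload.
// Sorting top-level keys keeps the hash reproducible across services.
function recomputeHash(payload: Record<string, unknown>, key: Buffer): string {
  const canonical = JSON.stringify(payload, Object.keys(payload).sort());
  return createHmac("sha256", key).update(canonical).digest("hex");
}

// Run on every public view; any mismatch flips the listing to REVOKED.
function checkOnView(cert: StoredCert, key: Buffer): "VALID" | "REVOKED" {
  const expected = Buffer.from(recomputeHash(cert.payload, key), "hex");
  const recorded = Buffer.from(cert.certHash, "hex");
  const ok = expected.length === recorded.length && timingSafeEqual(expected, recorded);
  return ok ? "VALID" : "REVOKED";
}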
Add the badge to your README or website
After certification, embed a verifiable trust badge that links directly to your public registry listing.
<!-- Agent Eval Pass Badge -->
<a href="https://www.regulatorysignals.com/agent-eval-registry/{your-agent-slug}">
<img
src="https://www.regulatorysignals.com/badges/agent-eval-pass.svg"
alt="Agent Eval PASS — Regulatory Signals"
width="200"
height="28"
/>
</a>
Frequently asked questions
What is evaluated in the AI Agent Evaluator-of-Record assessment?
The evaluation covers five suites: (1) Capability Benchmarking — 50 tasks drawn from MMLU-Pro, MATH, HumanEval, and agentic task datasets measuring raw capability. (2) Safety Alignment — 50 prompts from the HarmBench dataset measuring safety behavior on harmful request categories. (3) HarmBench Jailbreak Resistance — 50 adversarial jailbreak attempts measuring robustness to prompt attacks. (4) Tool-Use Correctness — 50 structured tool-calling scenarios measuring API call accuracy and error handling. (5) Refusal Calibration — 50 borderline prompts measuring whether the agent refuses appropriately without over-refusing benign requests.
Why can't we just use our own internal eval?
Enterprise procurement teams, institutional investors, and regulated-industry customers increasingly require third-party evaluation because vendor self-assessments create a conflict of interest. It is the same reason you need a Big Four auditor for your financials: you cannot audit yourself. Anthropic's RSP explicitly distinguishes between developer-run evals and independent third-party assessments. The EU AI Act Article 50 transparency obligations presuppose documented technical evidence that a third party can verify.
How long does the evaluation take?
Automated suite execution completes in 24–72 hours depending on agent response latency. You receive a draft score report for review, then a signed certificate and public registry listing within 5 business days of submission. Expedited 48-hour processing is available on request.
What does the signed certificate include?
The certificate includes: agent name, version, endpoint domain (truncated for privacy), evaluation date, 90-day expiry, overall score (0–100), per-suite scores, pass/conditional/fail tier, a tamper-evident HMAC-SHA256 hash of the evaluation payload, and a public registry URL. The cert hash is recomputed on every public view — any tampering triggers an automatic REVOKED status.
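As a rough illustration, the fields listed above could map to a shape like the following. The actual payload schema is not published, so every field name here is an assumption.

// Illustrative certificate shape based on the fields listed above.
// Field names in the real signed payload are assumptions.
interface AgentEvalCertificate {
  agentName: string;
  agentVersion: string;
  endpointDomain: string;  // truncated for privacy
  evaluationDate: string;  // ISO 8601
  expiresAt: string;       // evaluationDate + 90 days
  overallScore: number;    // 0-100
  suiteScores: {
    capability: number;
    safetyAlignment: number;
    jailbreakResistance: number;
    toolUseCorrectness: number;
    refusalCalibration: number;
  };
  tier: "PASS" | "CONDITIONAL" | "FAIL";
  certHash: string;        // tamper-evident HMAC-SHA256 of the payload
  registryUrl: string;     // public registry listing
}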
How does deterministic replay verification work?
Each evaluation run is seeded with a cryptographic nonce and the full prompt payload is HMAC-SHA256 signed before execution. The signed payload is stored immutably. Any third party can call /api/agent-eval/replay/{runId} to re-execute the identical test suite against the original signed prompt set and verify that scores match within a 0.5% tolerance. This proves the certificate reflects a real, unmodified evaluation run.
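A third-party verifier could exercise that endpoint with something like the sketch below. The endpoint path comes from the answer above; the HTTP method and response shape are assumptions, and the 0.5% tolerance is interpreted here as 0.5 points on the 0–100 scale.

// Replay-verification sketch (TypeScript). Endpoint path is documented above;
// the HTTP method and response body shape are assumptions.
async function verifyReplay(runId: string, certifiedScore: number): Promise<boolean> {
  const res = await fetch(
    `https://www.regulatorysignals.com/api/agent-eval/replay/${runId}`,
    { method: "POST" },
  );
  if (!res.ok) throw new Error(`Replay request failed: HTTP ${res.status}`);
  // Assumed response body: { overallScore: number } for the re-executed run.
  const { overallScore } = (await res.json()) as { overallScore: number };
  // Accept if the replayed score matches the certificate within tolerance
  // (0.5 points on the 0-100 scale).
  return Math.abs(overallScore - certifiedScore) <= 0.5;
}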
Which regulatory frameworks does this address?
The evaluation is designed to produce documented evidence relevant to: Anthropic's Responsible Scaling Policy (RSP) — third-party capability and safety threshold documentation; OpenAI Preparedness Framework — frontier risk assessment documentation; EU AI Act Article 50 — transparency obligations for general-purpose AI systems; ISO/IEC 42001 — AI management system audit evidence. Mention of these frameworks does not constitute legal advice. Buyers should consult qualified legal counsel for their specific regulatory requirements.
What is the quarterly retest, and why is it required?
AI agents change rapidly — new model versions, updated system prompts, and fine-tuning can alter safety and capability profiles significantly. Certificates expire after 90 days. The $1,990 quarterly retest re-runs all five suites against the current production version and renews the certificate if scores meet thresholds. This ensures your registry listing always reflects the current production agent, not a point-in-time snapshot.
Enterprise buyers are demanding independent evals now
Procurement teams at banks, healthcare orgs, and government agencies are adding "third-party AI agent evaluation" to their vendor RFPs. Get certified before your deal hits that requirement.
Request Evaluation — $4,990
Questions? [email protected]