A system design walkthrough for senior AI agent developers building regulated, auditable LLM systems.
Audience: Senior AI / backend engineers
Stack: Spring Boot · LangChain · LiteLLM · PostgreSQL · Chart.js · PagedJS · Puppeteer
Pattern: Evidence-first · LLM-as-narrator · Deterministic pipeline with acceptance gate
TL;DR
Most AI reporting systems fail not at language generation, but at trust. This article documents the architecture of a CyberArk PAM Governance Reporting Agent in which the LLM is structurally forbidden from determining any fact. Every number, name, and finding in the final report must trace back to a deterministically computed evidence record. This single constraint shapes every architectural choice — from where the rule engine sits in the pipeline, to why LangChain was chosen over LangGraph, to the design of a four-stage acceptance gate that can block PDF generation when the LLM output cannot be verified.
The central insight: in high-stakes reporting, the value of an AI system is not what it generates — it is what it refuses to generate without evidence.
Contents
- 1. The core design principle
- 2. Architecture at a glance
- 3. The rule engine and the evidence JSON contract
- 4. The LLM contract
- 5. The acceptance gate
- 6. PDF rendering
- 7. LangChain, not LangGraph
- 8. Trade-offs and open questions
- 9. Lessons for senior AI agent designers
  - 1. Separate computation from narration architecturally, not just in prompts.
  - 2. Design the acceptance gate before you design the prompt.
  - 3. The evidence schema is your LLM’s entire world.
  - 4. Auditability is the unsolved problem, not generation.
  - 5. Treat REVIEW as a first-class outcome.
  - 6. Determinism is a feature, not a constraint.
- Closing
1. The core design principle
Enterprise reporting in regulated environments — security, compliance, internal audit — requires a property that large language models do not naturally provide: traceability. Every number on a CISO-facing report must be derivable from a raw data source. Every risk finding must reference an event record. Every recommendation must follow from a verified finding.
This is not a limitation of the model. It is a deliberate architectural boundary:
The LLM is a narrator, not an oracle.
PostgreSQL and the rule engine generate facts. The LLM generates prose. These responsibilities never cross.
LLMs are excellent at producing clear, contextual, executive-appropriate language from structured inputs. They are unreliable at arithmetic, record counting, and threshold comparisons — exactly the operations that matter in a governance report. The architecture must reflect this division of labour explicitly.
Division of responsibility
| The LLM is responsible for | The LLM must never do |
|---|---|
| Executive summary narrative | Count sessions, accounts, or users |
| Contextual recommendations | Calculate coverage percentages |
| Risk interpretation prose | Determine risk severity |
| Trend commentary | Reference users not in the evidence |
| Tone calibration for the audience | Invent findings or extrapolate |
A prompt that simply asks the model not to calculate is a weak guarantee. An architecture that physically prevents the model from receiving the raw data is a strong guarantee. The rest of this article describes how to build the strong guarantee.
2. Architecture at a glance
The pipeline is intentionally linear, and its control flow is fixed. Each stage has a single responsibility and passes a typed artifact to the next stage. The LLM enters only at the narration stage, after all facts are locked.
┌──────────────┐ ┌────────────────┐ ┌─────────────┐ ┌──────────────────┐
│ CyberArk │ → │ Control-M │ → │ Postgres │ → │ Spring Boot │
│ (source) │ │ (scheduler) │ │ (warehouse) │ │ reporting svc │
└──────────────┘ └────────────────┘ └─────────────┘ └────────┬─────────┘
▼
┌──────────────────────────────────────────────┐
│ Rule Engine (Java) │
│ Deterministic policy evaluation │
└────────────────────┬─────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ Evidence JSON (the contract) │
└────────────────────┬─────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ LangChain → LiteLLM → LLM provider │
│ Structured output: ReportSummary JSON │
└────────────────────┬─────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ Acceptance Gate (PASS / REVIEW / FAIL) │
│ Schema · Fact · Coverage · Quality │
└────────────────────┬─────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ HTML template → PagedJS → Puppeteer │
│ ↓ │
│ PDF │
└──────────────────────────────────────────────┘
The single most important boundary in this diagram is between Evidence JSON and LangChain. Everything above that boundary is deterministic and replayable. Everything below it is probabilistic — and therefore subject to the acceptance gate before it is allowed to produce output.
Stage responsibilities
| Stage | Component | Responsibility |
|---|---|---|
| Ingestion | Control-M + PostgreSQL | Scheduled extraction from CyberArk. Raw events, session metadata, and account states stored relationally. |
| Computation | Rule engine | Deterministic evaluation of named, versioned policy rules. Produces typed findings. |
| Contract | Evidence JSON | The minimal, complete, structured world the LLM is allowed to see. |
| Narration | LangChain + LiteLLM + LLM | Converts structured findings into executive prose under JSON-schema enforcement. |
| Validation | Acceptance gate | Schema, fact, coverage, and quality checks before publication. |
| Rendering | PagedJS + Puppeteer | Print-quality PDF with cover, headers, footers, controlled pagination. |
3. The rule engine and the evidence JSON contract
The rule engine is a Spring Boot component that evaluates a library of named, versioned policy rules against the PostgreSQL dataset. Each rule produces zero or more findings. A finding is not a log line — it is a structured governance assertion with a severity, a rule code, and one or more evidence references that point back to raw CyberArk records.
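A minimal sketch of what a named, versioned rule could look like. The `PolicyRule` interface, the `Session` row, and the reduced `RuleFinding` record are hypothetical names introduced for illustration, and the hard-coded "high" severity is an assumption; none of this is taken from the actual service.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified row read from the PostgreSQL warehouse.
record Session(String sessionId, String humanUser, String privilegedAcct,
               String targetSystem, String ticketId) {}

// Reduced finding: only the fields needed for this illustration.
record RuleFinding(String ruleCode, String ruleVersion, String severity,
                   String humanUser, List<String> evidenceRefs) {}

// A rule carries a code and a version so a finding can be reproduced later.
interface PolicyRule {
    String code();
    String version();
    List<RuleFinding> evaluate(List<Session> sessions);
}

// Flags privileged sessions that carry no change ticket.
final class MissingTicketRule implements PolicyRule {
    public String code() { return "MISSING_TICKET"; }
    public String version() { return "2.1"; }

    public List<RuleFinding> evaluate(List<Session> sessions) {
        List<RuleFinding> findings = new ArrayList<>();
        for (Session s : sessions) {
            if (s.ticketId() == null || s.ticketId().isBlank()) {
                // Severity is fixed by the rule (assumed "high" here), never by the LLM.
                findings.add(new RuleFinding(code(), version(), "high",
                        s.humanUser(), List.of("cyberark_session:" + s.sessionId())));
            }
        }
        return findings;
    }
}
```

Because the rule is a pure function of its input rows, re-running the same rule version against the same data should yield the same findings, which is what makes the output replayable.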
Anatomy of a finding
{
  "findingId": "F001",
  "severity": "high",
  "ruleCode": "MISSING_TICKET",
  "ruleVersion": "2.1",
  "humanUser": "john.chan",
  "privilegedAcct": "adm_prod_db",
  "targetSystem": "PROD-DB-01",
  "sessionStart": "2025-11-14T23:41:00Z",
  "durationSeconds": 1820,
  "evidenceRefs": [
    "cyberark_session:S123",
    "db_audit_log:AL8812"
  ]
}
Every field exists for a reason:
- findingId: referenced by the LLM in the summary and by the appendix for traceability.
- severity: assigned by the rule engine, not the LLM. The LLM cannot override it.
- ruleCode / ruleVersion: let auditors reproduce the finding from raw data, even years later.
- Entity fields (humanUser, privilegedAcct, targetSystem): the only entities the LLM is allowed to name in its output.
- evidenceRefs: the audit trail. The acceptance gate uses these to verify every entity reference made by the LLM.
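The field list above maps naturally onto a typed record. The sketch below is one possible shape, not the service's actual class: the `Severity` value set beyond "high" and "critical" is an assumption, and the constructor checks simply encode the statement that a finding carries one or more evidence references.

```java
import java.time.Instant;
import java.util.List;
import java.util.Objects;

// Assumed value set; the article only names "high" and "critical".
enum Severity { LOW, MEDIUM, HIGH, CRITICAL }

record Finding(String findingId, Severity severity, String ruleCode, String ruleVersion,
               String humanUser, String privilegedAcct, String targetSystem,
               Instant sessionStart, long durationSeconds, List<String> evidenceRefs) {

    // Compact constructor: reject findings that could not be audited later.
    Finding {
        Objects.requireNonNull(findingId, "findingId");
        Objects.requireNonNull(severity, "severity");
        Objects.requireNonNull(ruleCode, "ruleCode");
        Objects.requireNonNull(ruleVersion, "ruleVersion");
        if (durationSeconds < 0) {
            throw new IllegalArgumentException("durationSeconds must be >= 0");
        }
        if (evidenceRefs == null || evidenceRefs.isEmpty()) {
            throw new IllegalArgumentException("evidenceRefs must be non-empty");
        }
        evidenceRefs = List.copyOf(evidenceRefs); // immutable defensive copy
    }
}
```

Making the record immutable means a finding cannot be altered between the rule engine and the acceptance gate.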
The evidence JSON is the LLM’s entire world
This is the most important design decision in the system. The LLM never sees raw session logs, never sees the PostgreSQL schema, and never sees export tables. It receives the complete, minimal, pre-computed array of findings plus a small amount of period metadata. Nothing more.
If a user’s name does not appear in the evidence JSON, the LLM has no way to introduce it. If a finding is not in the array, the LLM cannot generate prose about it. The model’s universe is bounded by what the rule engine chose to put in front of it.
Common antipattern: passing raw exports or CSV dumps to the LLM and asking it to summarise findings. This delegates fact determination to the model. The rule engine must run first; only its output reaches the LLM.
4. The LLM contract
LangChain is used here not for its agent capabilities, but for two specific features:
- Structured output via JSON-schema enforcement on the model response.
- Prompt template management with explicit, versionable constraint language.
LiteLLM sits below LangChain as a provider abstraction. This decouples the reporting service from any specific model vendor — OpenAI, Azure OpenAI, Bedrock, or an on-prem model behind a private endpoint — without changing the LangChain layer above. In a regulated environment where data residency is a hard requirement, the LiteLLM swap point is architecturally significant.
Prompt contract — what the model is allowed to do
The prompt template embeds a constraint preamble that is treated as part of the contract, not as guidance. Each constraint maps to a check in the acceptance gate.
| Status | Rule |
|---|---|
| Forbidden | Introduce any metric, count, or percentage not present in the evidence JSON |
| Forbidden | Name any user, account, or system not listed in finding records |
| Forbidden | Assign a severity different from the rule-engine-provided severity |
| Required | Reference the findingId for every risk statement in the executive summary |
| Required | Return output as valid JSON conforming to the ReportSummary schema |
| Guidance | Recommendations must follow from at least one finding with severity high or critical |
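One way to keep the prompt preamble and the gate in step is to hold both in a single versioned structure. The sketch below is illustrative only: the enum names and the assignment of each constraint to a gate stage are assumptions, and the real service presumably manages the template through LangChain rather than Java.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

enum Status { FORBIDDEN, REQUIRED, GUIDANCE }
enum GateStage { SCHEMA, FACT, COVERAGE, QUALITY }

// Each constraint carries its prompt wording and the gate stage assumed to enforce it.
enum PromptConstraint {
    NO_NEW_METRICS(Status.FORBIDDEN, GateStage.FACT,
            "Introduce any metric, count, or percentage not present in the evidence JSON"),
    NO_NEW_ENTITIES(Status.FORBIDDEN, GateStage.FACT,
            "Name any user, account, or system not listed in finding records"),
    NO_SEVERITY_OVERRIDE(Status.FORBIDDEN, GateStage.FACT,
            "Assign a severity different from the rule-engine-provided severity"),
    CITE_FINDING_ID(Status.REQUIRED, GateStage.FACT,
            "Reference the findingId for every risk statement in the executive summary"),
    VALID_JSON(Status.REQUIRED, GateStage.SCHEMA,
            "Return output as valid JSON conforming to the ReportSummary schema"),
    GROUNDED_RECOMMENDATIONS(Status.GUIDANCE, GateStage.QUALITY,
            "Recommendations must follow from at least one finding with severity high or critical");

    final Status status;
    final GateStage enforcedBy;
    final String text;

    PromptConstraint(Status status, GateStage enforcedBy, String text) {
        this.status = status;
        this.enforcedBy = enforcedBy;
        this.text = text;
    }

    /** Renders the preamble, one constraint per line, in declaration order. */
    static String preamble() {
        return Arrays.stream(values())
                .map(c -> c.status + ": " + c.text)
                .collect(Collectors.joining("\n"));
    }
}
```

With this shape, adding a constraint without naming the stage that enforces it does not compile, which keeps the prompt from drifting ahead of the gate.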
Output schema (sketch)
{
  "executiveSummary": "string", // 200-400 words
  "coverageNarrative": "string",
  "usageNarrative": "string",
  "riskNarrative": [
    {
      "findingId": "F001", // must match an evidence record
      "narrative": "string"
    }
  ],
  "adoptionNarrative": "string",
  "operationalHealthNarrative": "string",
  "recommendations": [
    {
      "recommendationId": "R01",
      "narrative": "string",
      "supportingFindingIds": ["F001", "F004"] // must be non-empty
    }
  ]
}
Note what the schema does not contain: there are no numeric fields, no severity assignments, no entity lists. The numbers and structural facts already exist in the evidence JSON and the rule engine’s output — the LLM produces narrative over them, nothing else.
5. The acceptance gate
The acceptance gate is the most technically novel component of this architecture. It answers a question most AI reporting pipelines never ask:
How do you know the LLM output is correct before publishing it?
The gate runs four sequential validation stages. A report must clear all mandatory stages before Puppeteer is invoked. Failure at any mandatory stage hard-blocks PDF generation.
Stage 1 — Schema validation (mandatory)
Validates the LLM JSON output against the ReportSummary schema. It checks required fields, enum values, array shapes, and recommendation IDs. Malformed output fails immediately. This stage is cheap and catches the largest class of LLM errors.
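A sketch of the structural checks, assuming the LLM output has already been parsed into records that mirror the ReportSummary sketch. The record and class names are hypothetical, and a production service would more likely run a JSON Schema validator on the raw response; the word-count and non-empty rules come from the comments in the schema sketch.

```java
import java.util.ArrayList;
import java.util.List;

record RiskItem(String findingId, String narrative) {}
record Recommendation(String recommendationId, String narrative, List<String> supportingFindingIds) {}
record ReportSummary(String executiveSummary, String coverageNarrative, String usageNarrative,
                     List<RiskItem> riskNarrative, String adoptionNarrative,
                     String operationalHealthNarrative, List<Recommendation> recommendations) {}

final class SchemaValidator {
    /** Returns one message per violation; an empty list means the stage passed. */
    static List<String> check(ReportSummary r) {
        List<String> errors = new ArrayList<>();
        requireText(errors, "executiveSummary", r.executiveSummary());
        requireText(errors, "coverageNarrative", r.coverageNarrative());
        requireText(errors, "usageNarrative", r.usageNarrative());
        requireText(errors, "adoptionNarrative", r.adoptionNarrative());
        requireText(errors, "operationalHealthNarrative", r.operationalHealthNarrative());

        if (r.executiveSummary() != null && !r.executiveSummary().isBlank()) {
            int words = r.executiveSummary().trim().split("\\s+").length;
            if (words < 200 || words > 400) {
                errors.add("executiveSummary: expected 200-400 words, got " + words);
            }
        }
        if (r.riskNarrative() == null) {
            errors.add("riskNarrative: missing");
        } else {
            for (RiskItem item : r.riskNarrative()) {
                requireText(errors, "riskNarrative.findingId", item.findingId());
                requireText(errors, "riskNarrative.narrative", item.narrative());
            }
        }
        if (r.recommendations() == null) {
            errors.add("recommendations: missing");
        } else {
            for (Recommendation rec : r.recommendations()) {
                requireText(errors, "recommendationId", rec.recommendationId());
                if (rec.supportingFindingIds() == null || rec.supportingFindingIds().isEmpty()) {
                    errors.add(rec.recommendationId() + ": supportingFindingIds must be non-empty");
                }
            }
        }
        return errors;
    }

    private static void requireText(List<String> errors, String field, String value) {
        if (value == null || value.isBlank()) errors.add(field + ": missing or blank");
    }
}
```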
Stage 2 — Fact validation (mandatory)
For every named entity and every referenced findingId in the LLM output, cross-reference against the evidence JSON and, where appropriate, the underlying PostgreSQL records. Specifically:
- Every findingId referenced in riskNarrative exists in the evidence array.
- Every humanUser mentioned in prose appears in at least one referenced finding.
- Every targetSystem mentioned in prose appears in at least one referenced finding.
- Every recommendation’s supportingFindingIds resolve to real findings.
A single unsupported claim fails the report. This stage is the architectural payoff of the evidence-first design: it is mechanically possible because the LLM’s world was constrained upstream.
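The identifier half of this stage reduces to set membership. The sketch below covers only the findingId and supportingFindingIds checks; entity mentions in free prose need a matching strategy of their own. `FindingIdCheck` and its parameter shapes are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FindingIdCheck {
    /**
     * Returns one message per reference that does not resolve to an evidence
     * record. An empty list means this part of fact validation passed.
     */
    static List<String> unresolved(Set<String> evidenceIds,
                                   List<String> riskNarrativeIds,
                                   Map<String, List<String>> supportingIdsByRecommendation) {
        List<String> problems = new ArrayList<>();
        for (String id : riskNarrativeIds) {
            if (!evidenceIds.contains(id)) {
                problems.add("riskNarrative references unknown finding " + id);
            }
        }
        supportingIdsByRecommendation.forEach((recommendationId, ids) -> {
            for (String id : ids) {
                if (!evidenceIds.contains(id)) {
                    problems.add(recommendationId + " cites unknown finding " + id);
                }
            }
        });
        return problems;
    }
}
```

Any non-empty result would map to a FAIL at the FACT stage, since a single unsupported reference is enough to block the report.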
Stage 3 — Coverage validation (mandatory)
Verifies that every critical and high severity finding from the evidence JSON is referenced somewhere in the LLM summary. Silence about a severity: critical event is treated as a failure — a report that omits the most important finding is worse than no report at all.
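Coverage validation can be sketched as a set difference. The example assumes that "referenced" means the findingId appears in riskNarrative or in some recommendation's supportingFindingIds, and that severities arrive as the lowercase strings used in the evidence JSON; `CoverageCheck` is a hypothetical name.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class CoverageCheck {
    /**
     * severityByFindingId comes from the evidence JSON; referencedIds is every
     * findingId cited anywhere in the LLM output. Returns the high and critical
     * findings the output is silent about, in sorted order.
     */
    static Set<String> uncovered(Map<String, String> severityByFindingId,
                                 Set<String> referencedIds) {
        Set<String> missing = new TreeSet<>();
        severityByFindingId.forEach((id, severity) -> {
            boolean mustCover = "high".equals(severity) || "critical".equals(severity);
            if (mustCover && !referencedIds.contains(id)) {
                missing.add(id);
            }
        });
        return missing;
    }
}
```

A non-empty result fails the report, and the returned IDs give the root-cause log something concrete to record.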
Stage 4 — Quality validation (advisory)
Evaluates clarity, readability, executive tone, and actionability of recommendations. Failure here produces a REVIEW outcome — human approval required — but does not block generation. Quality is the only stage where the gate uses an LLM to evaluate an LLM, and it is deliberately advisory because language quality is genuinely subjective.
Gate outcomes
| Outcome | Meaning | Action |
|---|---|---|
| PASS | All mandatory stages clear | PDF generated automatically |
| REVIEW | Mandatory stages clear, quality flagged | Human reviewer approves before release |
| FAIL | Wrong data, unsupported claims, or missing critical findings | PDF hard-blocked; root cause logged |
REVIEW is a deliberate first-class outcome. Most pipelines have binary success/failure semantics. In a regulated context, the state where output is technically valid but flagged for human attention is the most important state — it is where human judgment enters the loop without blocking automation entirely.
Gate logic (sketch)
public AcceptanceResult evaluate(EvidenceBundle evidence, ReportSummary llmOutput) {
    // Stage 1 (mandatory): structural validity of the LLM output.
    var schemaResult = schemaValidator.check(llmOutput);
    if (schemaResult.failed()) return AcceptanceResult.fail("SCHEMA", schemaResult);

    // Stage 2 (mandatory): every referenced finding and entity exists in the evidence.
    var factResult = factValidator.check(evidence, llmOutput);
    if (factResult.failed()) return AcceptanceResult.fail("FACT", factResult);

    // Stage 3 (mandatory): every high and critical finding is mentioned.
    var coverageResult = coverageValidator.check(evidence, llmOutput);
    if (coverageResult.failed()) return AcceptanceResult.fail("COVERAGE", coverageResult);

    // Stage 4 (advisory): a low quality score routes to human review, not failure.
    var qualityResult = qualityEvaluator.score(llmOutput);
    if (qualityResult.belowThreshold()) {
        return AcceptanceResult.review("QUALITY", qualityResult);
    }
    return AcceptanceResult.pass();
}
The order matters. Schema is cheapest; quality is most expensive. Failing fast on cheap checks keeps the pipeline responsive and reduces wasted LLM evaluation cost.
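The gate sketch relies on an `AcceptanceResult` type that the article does not show. One plausible shape, offered as an assumption rather than the actual class, is a small record with three factory methods and two helpers that encode the outcome table.

```java
enum Outcome { PASS, REVIEW, FAIL }

// stage and detail are null for PASS; detail holds the failing validator's result.
record AcceptanceResult(Outcome outcome, String stage, Object detail) {

    static AcceptanceResult pass() {
        return new AcceptanceResult(Outcome.PASS, null, null);
    }

    static AcceptanceResult review(String stage, Object detail) {
        return new AcceptanceResult(Outcome.REVIEW, stage, detail);
    }

    static AcceptanceResult fail(String stage, Object detail) {
        return new AcceptanceResult(Outcome.FAIL, stage, detail);
    }

    /** Only FAIL hard-blocks PDF generation. */
    boolean blocksPdf() {
        return outcome == Outcome.FAIL;
    }

    /** REVIEW output may be rendered but needs a human sign-off before release. */
    boolean needsHumanApproval() {
        return outcome == Outcome.REVIEW;
    }
}
```

Keeping REVIEW as its own enum value, rather than a flag on PASS, makes it impossible for downstream code to treat a flagged report as automatically releasable by accident.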
6. PDF rendering
PagedJS is a W3C Paged Media polyfill that enables print-quality pagination within standard HTML/CSS. It handles page numbering, running headers, controlled section breaks, table-of-contents generation, and cover page layout — concerns that are difficult to manage in a low-level programmatic PDF library.
HTML template + Chart.js images + LLM prose
│
▼
PagedJS layout engine (paginates, places headers/footers)
│
▼
Puppeteer headless Chrome (prints to PDF)
│
▼
PDF artifact + signed report metadata
A few decisions worth flagging:
- Charts are pre-rendered as PNGs using chartjs-node-canvas, then embedded as base64 data URIs. Because chartjs-node-canvas is a Node.js library, it runs in a Node process invoked by the Spring Boot service rather than inside the JVM. Pre-rendering removes Chart.js execution from the Puppeteer render context and keeps chart output reproducible across re-renders.
- The appendix is generated separately and contains only evidence JSON records. There is no LLM-generated content in the appendix. An auditor reading the report can look up any findingId referenced in the body and find the corresponding evidence record verbatim.
- The report is checksummed and the metadata is signed before release, so the binary artifact is itself audit evidence.
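The checksum-and-sign step in the last point can be sketched with the JDK alone. The article does not say which algorithms are used, so SHA-256 for the checksum and HMAC-SHA256 for the metadata signature are assumptions, as is the `ReportSeal` name; an asymmetric signature would be an equally plausible choice.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class ReportSeal {
    /** SHA-256 of the PDF bytes as lowercase hex. */
    static String checksum(byte[] pdfBytes) throws GeneralSecurityException {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(pdfBytes);
        return HexFormat.of().formatHex(digest);
    }

    /** HMAC-SHA256 over the metadata string (assumed signing scheme). */
    static String sign(String metadata, byte[] key) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
        byte[] signature = mac.doFinal(metadata.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(signature);
    }
}
```

In practice the metadata would include the checksum itself, so that the signature binds the report's identifying fields to the exact PDF bytes; the key would come from a secret store rather than from code.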
7. LangChain, not LangGraph
LangGraph is designed for workflows whose control flow is decided at runtime: the next step depends on the previous step’s output, multiple agents coordinate, or human-in-the-loop gates are dynamic rather than fixed.
This system’s control flow is entirely fixed, even though the narration stage itself is probabilistic. The sequence (extract, evaluate, narrate, validate, render) does not branch, retry, or loop based on LLM output; the gate only decides whether the finished artifact is released. A LangChain structured output chain is exactly the right primitive. Adding LangGraph would introduce coordination overhead without enabling new capability.
Rule of thumb: use LangGraph when the path through your pipeline is decided at runtime. Use a structured LangChain chain when the path is fixed and only the content varies.
When LangGraph would be appropriate (V3+)
A later version of this system would benefit from LangGraph when it needs to correlate findings across multiple enterprise sources — CyberArk + ServiceNow + SailPoint + Active Directory + SIEM — and route investigation steps based on what it finds:
Finding (severity:critical)
│
▼
Lookup ServiceNow change ticket
│
├── ticket found → check approval chain in SailPoint
│ │
│ ├── approver still active in AD? → close out
│ └── approver offboarded? → escalate, page on-call
│
└── no ticket → SIEM correlation → human approval → open incident
That is genuinely a graph, not a chain. Each node’s outgoing edge depends on the node’s result, and human approval is dynamic rather than fixed at one stage. LangGraph earns its weight here. In the current V2 governance reporting scope, it does not.
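To make "the outgoing edge depends on the node's result" concrete, the routing in the diagram can be written as plain functions. This is ordinary Java for illustration, not LangGraph code, and the `Step` and `InvestigationRouter` names are hypothetical.

```java
// Nodes that can follow a lookup in the diagram above.
enum Step { CHECK_APPROVAL_CHAIN, SIEM_CORRELATION, CLOSE_OUT, ESCALATE_AND_PAGE }

final class InvestigationRouter {
    /** After the ServiceNow lookup, the edge taken depends on whether a ticket exists. */
    static Step afterTicketLookup(boolean ticketFound) {
        return ticketFound ? Step.CHECK_APPROVAL_CHAIN : Step.SIEM_CORRELATION;
    }

    /** After the SailPoint approval-chain check, the edge depends on the approver's AD status. */
    static Step afterApprovalCheck(boolean approverActiveInAd) {
        return approverActiveInAd ? Step.CLOSE_OUT : Step.ESCALATE_AND_PAGE;
    }
}
```

Each function inspects a result that only exists at runtime, so the path cannot be laid out in advance the way the reporting pipeline's can.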
8. Trade-offs and open questions
No architecture is free. The interesting trade-offs in this design:
1. Rule engine maintenance cost. Moving fact determination out of the LLM and into a Java rule engine means rule changes require code changes, tests, and deployment. The benefit is reproducibility; the cost is velocity. Mitigation: invest in a rule DSL and a rule-test framework early.
2. Evidence JSON schema as a coupling point. The schema is the contract between the rule engine, the LLM prompt, the acceptance gate, and the PDF template. A change to the schema touches all four. This is by design — it forces deliberate evolution — but it is real coupling. Mitigation: version the schema, allow side-by-side versions during transitions.
3. Fact validation requires exact-match semantics. If the LLM writes “John Chan” but the evidence JSON has "john.chan", naive matching fails. Either canonicalise the prompt input (“the user john.chan”), constrain the schema to use only IDs, or build a tolerant matcher with a strict false-positive budget. We chose canonicalisation; it is the simplest and most auditable.
4. Quality validation is the soft spot. Stage 4 uses an LLM to evaluate the primary LLM’s output. That is the one place in the system where probabilistic reasoning judges probabilistic reasoning. Keeping it advisory rather than mandatory is the architectural concession to this fact.
5. Latency. A four-stage gate plus PagedJS rendering is not real-time. This is fine for a scheduled governance report; it would be unacceptable for an interactive query system. Different use case, different design.
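For trade-off 3, canonicalisation makes an exact-match check on user mentions feasible. The sketch below assumes canonical user IDs have a lowercase "first.last" shape, which is an illustrative guess based on the john.chan example, and `CanonicalUserCheck` is a hypothetical name.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CanonicalUserCheck {
    // Assumption: canonical user IDs look like "first.last" in lowercase.
    private static final Pattern USER_ID = Pattern.compile("\\b[a-z]+\\.[a-z]+\\b");

    /** Returns user-ID-shaped tokens in the prose that are not in the allowed set. */
    static Set<String> unknownUsers(String prose, Set<String> allowedUsers) {
        Set<String> unknown = new TreeSet<>();
        Matcher m = USER_ID.matcher(prose);
        while (m.find()) {
            if (!allowedUsers.contains(m.group())) {
                unknown.add(m.group());
            }
        }
        return unknown;
    }
}
```

The pattern is deliberately crude: it will also flag tokens such as "e.g" or a bare domain name, which fails closed rather than open. It cannot detect a display name like "John Chan", which is why the prompt has to be constrained to canonical IDs in the first place.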
9. Lessons for senior AI agent designers
1. Separate computation from narration architecturally, not just in prompts.
A prompt that says “do not calculate anything” is a weak guarantee. An architecture that physically separates the computation layer (rule engine, PostgreSQL) from the narration layer (LangChain) is a strong guarantee. Invest in the rule engine. The prompt is the easy part.
2. Design the acceptance gate before you design the prompt.
Most AI pipeline designers write the prompt first and add validation later, if at all. Invert that. Define what correct output looks like first — schema, supported claims, coverage requirements — then design the prompt to reliably produce output that clears the gate. Better prompts and better validation fall out of this ordering.
3. The evidence schema is your LLM’s entire world.
Whatever the LLM receives is, from its perspective, all that exists. If a user’s name appears in the evidence JSON, the LLM can reference it. If it does not, it cannot. Design the schema to be the complete, minimal context the model needs — nothing more, nothing less. Every extra field is a hallucination surface; every missing field is a failure mode.
4. Auditability is the unsolved problem, not generation.
Generating plausible executive prose is a solved problem. Designing a system where every sentence in that prose can be traced to a source record, validated programmatically, and explained to an auditor — that is the unsolved, high-value problem. It is also what distinguishes a production governance system from a demo.
5. Treat REVIEW as a first-class outcome.
Most pipelines have two outcomes: success and failure. The REVIEW state — where the system produced technically valid output but flagged it for human attention — is the most important outcome in a regulated context. It is the point where human judgment is injected without blocking the workflow entirely. Design your quality validation gate to produce calibrated, actionable REVIEW flags, not binary pass/fail.
6. Determinism is a feature, not a constraint.
The temptation in agentic AI work is to reach for the most powerful primitive available. Resist it when the problem does not require it. A deterministic chain with a strong validation gate will out-ship a non-deterministic graph for any workflow whose path is fixed. Save the graph for problems whose path is genuinely dynamic.
Closing
The architecture described here is built around a single, opinionated claim: in a regulated reporting system, the LLM’s job is to write, not to know. Everything that knows — the rule engine, the evidence JSON, the acceptance gate — is deterministic, versionable, and auditable. The LLM is the last mile, and the gate is the safety net.
This is not the most exciting architecture. Agentic systems with autonomous decision-making are more glamorous. But for governance reporting, audit, and compliance, glamour is a liability. What you want is a system whose worst output is no output, never wrong output. Evidence-first design, with a strong acceptance gate, gets you there.
Built for CISO, Head of Cyber Security, IAM/PAM Manager, Technology Risk, Internal Audit, and Compliance audiences. The same pattern generalises to any LLM application where every claim must be defensible.