Designing an Evidence-Grounded AI Reporting Agent for Enterprise PAM Governance

A system design walkthrough for senior AI agent developers building regulated, auditable LLM systems.

Audience: Senior AI / backend engineers
Stack: Spring Boot · LangChain · LiteLLM · PostgreSQL · Chart.js · PagedJS · Puppeteer
Pattern: Evidence-first · LLM-as-narrator · Deterministic pipeline with acceptance gate

TL;DR

Most AI reporting systems fail not at language generation, but at trust. This article documents the architecture of a CyberArk PAM Governance Reporting Agent in which the LLM is structurally forbidden from determining any fact. Every number, name, and finding in the final report must trace back to a deterministically computed evidence record. This single constraint shapes every architectural choice — from where the rule engine sits in the pipeline, to why LangChain was chosen over LangGraph, to the design of a four-stage acceptance gate that can block PDF generation when the LLM output cannot be verified.

The central insight: in high-stakes reporting, the value of an AI system is not what it generates — it is what it refuses to generate without evidence.

1. The core design principle
- Division of responsibility
2. Architecture at a glance
- Stage responsibilities
3. The rule engine and the evidence JSON contract
- Anatomy of a finding
- The evidence JSON is the LLM’s entire world
4. The LLM contract
- Prompt contract — what the model is allowed to do
- Output schema (sketch)
5. The acceptance gate
6. PDF rendering
7. LangChain, not LangGraph
- When LangGraph would be appropriate (V3+)
8. Trade-offs and open questions
9. Lessons for senior AI agent designers
Closing

1. The core design principle

Enterprise reporting in regulated environments — security, compliance, internal audit — requires a property that large language models do not naturally provide: traceability. Every number on a CISO-facing report must be derivable from a raw data source. Every risk finding must reference an event record. Every recommendation must follow from a verified finding.

This is not a limitation of the model. It is a deliberate architectural boundary:

The LLM is a narrator, not an oracle.

PostgreSQL and the rule engine generate facts. The LLM generates prose. These responsibilities never cross.

LLMs are excellent at producing clear, contextual, executive-appropriate language from structured inputs. They are unreliable at arithmetic, record counting, and threshold comparisons — exactly the operations that matter in a governance report. The architecture must reflect this division of labour explicitly.

Division of responsibility

The LLM is responsible for	The LLM must never do
Executive summary narrative	Count sessions, accounts, or users
Contextual recommendations	Calculate coverage percentages
Risk interpretation prose	Determine risk severity
Trend commentary	Reference users not in the evidence
Tone calibration for the audience	Invent findings or extrapolate

A prompt that simply asks the model not to calculate is a weak guarantee. An architecture that physically prevents the model from receiving the raw data is a strong guarantee. The rest of this article describes how to build the strong guarantee.

2. Architecture at a glance

The pipeline is intentionally linear and deterministic. Each stage has a single responsibility and passes a typed artifact to the next stage. The LLM enters only at the narration stage, after all facts are locked.

┌──────────────┐   ┌────────────────┐   ┌─────────────┐   ┌──────────────────┐
│   CyberArk   │ → │   Control-M    │ → │  Postgres   │ → │  Spring Boot     │
│   (source)   │   │  (scheduler)   │   │ (warehouse) │   │  reporting svc   │
└──────────────┘   └────────────────┘   └─────────────┘   └────────┬─────────┘
                                                                   ▼
                              ┌──────────────────────────────────────────────┐
                              │              Rule Engine (Java)              │
                              │  Deterministic policy evaluation             │
                              └────────────────────┬─────────────────────────┘
                                                   ▼
                              ┌──────────────────────────────────────────────┐
                              │           Evidence JSON  (the contract)      │
                              └────────────────────┬─────────────────────────┘
                                                   ▼
                              ┌──────────────────────────────────────────────┐
                              │     LangChain → LiteLLM → LLM provider       │
                              │     Structured output: ReportSummary JSON    │
                              └────────────────────┬─────────────────────────┘
                                                   ▼
                              ┌──────────────────────────────────────────────┐
                              │     Acceptance Gate  (PASS / REVIEW / FAIL)  │
                              │     Schema · Fact · Coverage · Quality       │
                              └────────────────────┬─────────────────────────┘
                                                   ▼
                              ┌──────────────────────────────────────────────┐
                              │     HTML template → PagedJS → Puppeteer      │
                              │                ↓                             │
                              │              PDF                             │
                              └──────────────────────────────────────────────┘

The single most important boundary in this diagram is between Evidence JSON and LangChain. Everything above that boundary is deterministic and replayable. The narration step directly below it is probabilistic, so its output must clear the acceptance gate before anything is rendered.

Stage responsibilities

Stage	Component	Responsibility
Ingestion	Control-M + PostgreSQL	Scheduled extraction from CyberArk. Raw events, session metadata, and account states stored relationally.
Computation	Rule engine	Deterministic evaluation of named, versioned policy rules. Produces typed findings.
Contract	Evidence JSON	The minimal, complete, structured world the LLM is allowed to see.
Narration	LangChain + LiteLLM + LLM	Converts structured findings into executive prose under JSON-schema enforcement.
Validation	Acceptance gate	Schema, fact, coverage, and quality checks before publication.
Rendering	PagedJS + Puppeteer	Print-quality PDF with cover, headers, footers, controlled pagination.
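One way to make "each stage passes a typed artifact to the next" concrete is to give every stage boundary its own type, so the compiler rejects any wiring that hands raw rows to the narrator. The following is a minimal sketch assuming Java 17+; all type and method names are hypothetical and not taken from the system described here.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical artifact types, one per stage boundary.
record RawEvents(List<String> rows) {}
record Findings(List<String> findingIds) {}
record EvidenceJson(String json) {}
record ReportDraft(String json) {}

final class ReportPipeline {
    private ReportPipeline() {}

    // Composes the stages in a fixed order; nothing here branches on LLM output.
    // The narrator's input type is EvidenceJson, so it cannot be given RawEvents.
    static Function<RawEvents, ReportDraft> build(
            Function<RawEvents, Findings> ruleEngine,
            Function<Findings, EvidenceJson> contract,
            Function<EvidenceJson, ReportDraft> narrator) {
        return ruleEngine.andThen(contract).andThen(narrator);
    }
}
```

The acceptance gate and renderer would follow the same pattern, each consuming the previous stage's type.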

3. The rule engine and the evidence JSON contract

The rule engine is a Spring Boot component that evaluates a library of named, versioned policy rules against the PostgreSQL dataset. Each rule produces zero or more findings. A finding is not a log line — it is a structured governance assertion with a severity, a rule code, and one or more evidence references that point back to raw CyberArk records.

Anatomy of a finding

{
  "findingId":       "F001",
  "severity":        "high",
  "ruleCode":        "MISSING_TICKET",
  "ruleVersion":     "2.1",
  "humanUser":       "john.chan",
  "privilegedAcct":  "adm_prod_db",
  "targetSystem":    "PROD-DB-01",
  "sessionStart":    "2025-11-14T23:41:00Z",
  "durationSeconds": 1820,
  "evidenceRefs": [
    "cyberark_session:S123",
    "db_audit_log:AL8812"
  ]
}

Every field exists for a reason:

findingId — referenced by the LLM in the summary and by the appendix for traceability.
severity — assigned by the rule engine, not the LLM. The LLM cannot override it.
ruleCode / ruleVersion — let auditors reproduce the finding from raw data, even years later.
Entity fields (humanUser, privilegedAcct, targetSystem) — the only entities the LLM is allowed to name in its output.
evidenceRefs — the audit trail. The acceptance gate uses these to verify every entity reference made by the LLM.
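The finding above maps naturally onto an immutable Java record, and a rule becomes a pure function from a warehouse row to zero or one finding. The sketch below assumes Java 17+. The `SessionRow` shape and the exact condition behind MISSING_TICKET (a session with no change ticket attached) are assumptions for illustration, since the article does not show the warehouse schema or the rule body.

```java
import java.time.Instant;
import java.util.List;
import java.util.Optional;

// Mirrors the finding JSON shown above.
record Finding(String findingId, String severity, String ruleCode, String ruleVersion,
               String humanUser, String privilegedAcct, String targetSystem,
               Instant sessionStart, long durationSeconds, List<String> evidenceRefs) {}

// Hypothetical shape of one warehouse row.
record SessionRow(String sessionId, String humanUser, String privilegedAcct,
                  String targetSystem, Instant start, long durationSeconds,
                  Optional<String> ticketId) {}

// A named, versioned rule: the same code and version always yield the same finding.
final class MissingTicketRule {
    private MissingTicketRule() {}

    static final String CODE = "MISSING_TICKET";
    static final String VERSION = "2.1";

    // Emits a finding only when the session has no ticket (assumed rule semantics).
    static Optional<Finding> evaluate(SessionRow row, String findingId) {
        if (row.ticketId().isPresent()) {
            return Optional.empty();
        }
        return Optional.of(new Finding(findingId, "high", CODE, VERSION,
                row.humanUser(), row.privilegedAcct(), row.targetSystem(),
                row.start(), row.durationSeconds(),
                List.of("cyberark_session:" + row.sessionId())));
    }
}
```

Severity is set inside the rule, which is what keeps it out of the LLM's hands.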

The evidence JSON is the LLM’s entire world

This is the most important design decision in the system. The LLM never sees raw session logs, never sees the PostgreSQL schema, and never sees export tables. It receives the complete, minimal, pre-computed array of findings plus a small amount of period metadata. Nothing more.

If a user’s name does not appear in the evidence JSON, the LLM has no legitimate source for it, and any name it invents anyway fails fact validation. If a finding is not in the array, the LLM has nothing to write about. The model’s universe is bounded by what the rule engine chose to put in front of it.

Common antipattern: passing raw exports or CSV dumps to the LLM and asking it to summarise findings. This delegates fact determination to the model. The rule engine must run first; only its output reaches the LLM.
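The boundary described above can be enforced in code by building the LLM payload through an explicit allow-list projection. This is a sketch assuming Java 17+; the `periodStart` and `periodEnd` keys and the reduced `Finding` shape are hypothetical stand-ins for the real period metadata and finding record.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical finding shape, reduced to a few fields from the example above.
record Finding(String findingId, String severity, String ruleCode,
               String humanUser, String targetSystem) {}

final class EvidenceProjection {
    private EvidenceProjection() {}

    // Builds the only payload the narrator receives: period metadata plus findings.
    // Raw session rows are not a parameter here, so they cannot reach the prompt.
    static Map<String, Object> toLlmPayload(String periodStart, String periodEnd,
                                            List<Finding> findings) {
        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("periodStart", periodStart);
        payload.put("periodEnd", periodEnd);
        payload.put("findings", findings.stream().map(EvidenceProjection::project).toList());
        return payload;
    }

    // Explicit allow-list: a field reaches the LLM only if it is copied here.
    private static Map<String, Object> project(Finding f) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("findingId", f.findingId());
        m.put("severity", f.severity());
        m.put("ruleCode", f.ruleCode());
        m.put("humanUser", f.humanUser());
        m.put("targetSystem", f.targetSystem());
        return m;
    }
}
```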

4. The LLM contract

LangChain is used here not for its agent capabilities, but for two specific features:

Structured output via JSON-schema enforcement on the model response.
Prompt template management with explicit, versionable constraint language.

LiteLLM sits below LangChain as a provider abstraction. This decouples the reporting service from any specific model vendor — OpenAI, Azure OpenAI, Bedrock, or an on-prem model behind a private endpoint — without changing the LangChain layer above. In a regulated environment where data residency is a hard requirement, the LiteLLM swap point is architecturally significant.

Prompt contract — what the model is allowed to do

The prompt template embeds a constraint preamble that is treated as part of the contract, not as guidance. Each constraint maps to a check in the acceptance gate.

Status	Rule
Forbidden	Introduce any metric, count, or percentage not present in the evidence JSON
Forbidden	Name any user, account, or system not listed in finding records
Forbidden	Assign a severity different from the rule-engine-provided severity
Required	Reference the `findingId` for every risk statement in the executive summary
Required	Return output as valid JSON conforming to the `ReportSummary` schema
Guidance	Recommendations must follow from at least one finding with severity `high` or `critical`
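Because each constraint is supposed to map to a gate check, it can help to keep the contract as data rather than as free text inside a template. The sketch below (Java 17+) stores every rule from the table with the gate stage assumed to enforce it, then renders the preamble from that list. The `VERSION` label and the stage assignments are illustrative assumptions, not the system's actual mapping.

```java
import java.util.List;
import java.util.stream.Collectors;

enum Status { FORBIDDEN, REQUIRED, GUIDANCE }

// One contract line plus the gate stage assumed to enforce it.
record Constraint(Status status, String rule, String gateStage) {}

final class PromptContract {
    private PromptContract() {}

    static final String VERSION = "1.0"; // hypothetical version label

    static final List<Constraint> CONSTRAINTS = List.of(
        new Constraint(Status.FORBIDDEN,
            "Introduce any metric, count, or percentage not present in the evidence JSON", "FACT"),
        new Constraint(Status.FORBIDDEN,
            "Name any user, account, or system not listed in finding records", "FACT"),
        new Constraint(Status.FORBIDDEN,
            "Assign a severity different from the rule-engine-provided severity", "FACT"),
        new Constraint(Status.REQUIRED,
            "Reference the findingId for every risk statement in the executive summary", "SCHEMA"),
        new Constraint(Status.REQUIRED,
            "Return output as valid JSON conforming to the ReportSummary schema", "SCHEMA"),
        new Constraint(Status.GUIDANCE,
            "Recommendations must follow from at least one high or critical finding", "QUALITY"));

    // Renders the preamble that is prepended to every prompt.
    static String preamble() {
        return CONSTRAINTS.stream()
            .map(c -> "[" + c.status() + "] " + c.rule())
            .collect(Collectors.joining("\n", "Contract v" + VERSION + "\n", "\n"));
    }
}
```

A unit test can then assert that no constraint exists without a gate stage, which keeps the prompt and the gate from drifting apart.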

Output schema (sketch)

{
  "executiveSummary": "string",                 // 200-400 words
  "coverageNarrative": "string",
  "usageNarrative": "string",
  "riskNarrative": [
    {
      "findingId": "F001",                      // must match an evidence record
      "narrative": "string"
    }
  ],
  "adoptionNarrative": "string",
  "operationalHealthNarrative": "string",
  "recommendations": [
    {
      "recommendationId": "R01",
      "narrative": "string",
      "supportingFindingIds": ["F001", "F004"]  // must be non-empty
    }
  ]
}

Note what the schema does not contain: there are no numeric fields, no severity assignments, no entity lists. The numbers and structural facts already exist in the evidence JSON and the rule engine’s output — the LLM produces narrative over them, nothing else.

5. The acceptance gate

The acceptance gate is the most technically novel component of this architecture. It answers a question most AI reporting pipelines never ask:

How do you know the LLM output is correct before publishing it?

The gate runs four sequential validation stages. A report must clear all mandatory stages before Puppeteer is invoked. Failure at any mandatory stage hard-blocks PDF generation.

Stage 1 — Schema validation (mandatory)

Validates the LLM JSON output against the ReportSummary schema: required fields, enum values, array shapes, and recommendation IDs. Malformed output fails immediately. This stage is cheap and catches the largest class of LLM errors.
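A production system would most likely delegate this stage to a JSON Schema validator. To show the kind of checks involved without pulling in a library, here is a hand-rolled stand-in (Java 17+) that inspects an already-parsed `Map`. The class name and error strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SchemaValidator {
    private SchemaValidator() {}

    private static final List<String> REQUIRED_STRINGS = List.of(
        "executiveSummary", "coverageNarrative", "usageNarrative",
        "adoptionNarrative", "operationalHealthNarrative");

    // Returns a list of violations; an empty list means the structure is acceptable.
    static List<String> check(Map<String, Object> summary) {
        List<String> errors = new ArrayList<>();
        for (String key : REQUIRED_STRINGS) {
            if (!(summary.get(key) instanceof String s) || s.isBlank()) {
                errors.add("missing or empty string: " + key);
            }
        }
        if (summary.get("riskNarrative") instanceof List<?> risks) {
            for (Object r : risks) {
                boolean ok = r instanceof Map<?, ?> m && m.get("findingId") instanceof String;
                if (!ok) errors.add("riskNarrative item without findingId");
            }
        } else {
            errors.add("riskNarrative must be an array");
        }
        if (summary.get("recommendations") instanceof List<?> recs) {
            for (Object r : recs) {
                // supportingFindingIds must be present and non-empty.
                boolean ok = r instanceof Map<?, ?> m
                        && m.get("supportingFindingIds") instanceof List<?> ids
                        && !ids.isEmpty();
                if (!ok) errors.add("recommendation without supportingFindingIds");
            }
        } else {
            errors.add("recommendations must be an array");
        }
        return errors;
    }
}
```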

Stage 2 — Fact validation (mandatory)

For every named entity and every referenced findingId in the LLM output, cross-reference against the evidence JSON and, where appropriate, the underlying PostgreSQL records. Specifically:

Every findingId referenced in riskNarrative exists in the evidence array.
Every humanUser mentioned in prose appears in at least one referenced finding.
Every targetSystem mentioned in prose appears in at least one referenced finding.
Every recommendation’s supportingFindingIds resolve to real findings.

A single unsupported claim fails the report. This stage is the architectural payoff of the evidence-first design: it is mechanically possible because the LLM’s world was constrained upstream.
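The first two checks in that list could be sketched as follows (Java 17+). Detecting entities in free prose is the hard part. This version assumes canonical user IDs of the form `john.chan` and finds them with a regular expression, which is an illustrative simplification. Target systems and accounts would need their own patterns or a lookup against known identifiers.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

record Finding(String findingId, String humanUser) {}
record RiskItem(String findingId, String narrative) {}

final class FactValidator {
    private FactValidator() {}

    // Assumption: canonical user IDs look like "john.chan".
    private static final Pattern USER_ID = Pattern.compile("\\b[a-z]+\\.[a-z]+\\b");

    // Returns every unsupported claim; a non-empty list fails the report.
    static List<String> check(List<Finding> evidence, List<RiskItem> items) {
        Map<String, Finding> byId = evidence.stream()
                .collect(Collectors.toMap(Finding::findingId, f -> f));
        List<String> errors = new ArrayList<>();
        Set<String> allowedUsers = new HashSet<>();

        // Pass 1: every referenced findingId must exist in the evidence array.
        for (RiskItem item : items) {
            Finding cited = byId.get(item.findingId());
            if (cited == null) {
                errors.add("unknown findingId: " + item.findingId());
            } else {
                allowedUsers.add(cited.humanUser());
            }
        }
        // Pass 2: every user named in prose must belong to a referenced finding.
        for (RiskItem item : items) {
            Matcher m = USER_ID.matcher(item.narrative());
            while (m.find()) {
                if (!allowedUsers.contains(m.group())) {
                    errors.add("unsupported user in " + item.findingId() + ": " + m.group());
                }
            }
        }
        return errors;
    }
}
```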

Stage 3 — Coverage validation (mandatory)

Verifies that every critical and high severity finding from the evidence JSON is referenced somewhere in the LLM summary. Silence about a critical-severity finding is treated as a failure: a report that omits the most important finding is worse than no report at all.
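Coverage validation reduces to a set difference. A minimal sketch (Java 17+), assuming the referenced finding IDs have already been collected from `riskNarrative` and `supportingFindingIds`; the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

record Finding(String findingId, String severity) {}

final class CoverageValidator {
    private CoverageValidator() {}

    private static final Set<String> MUST_COVER = Set.of("critical", "high");

    // Returns the IDs of critical/high findings the summary never references.
    // A non-empty result fails the report.
    static Set<String> uncovered(List<Finding> evidence, Set<String> referencedIds) {
        Set<String> missing = new LinkedHashSet<>();
        for (Finding f : evidence) {
            if (MUST_COVER.contains(f.severity()) && !referencedIds.contains(f.findingId())) {
                missing.add(f.findingId());
            }
        }
        return missing;
    }
}
```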

Stage 4 — Quality validation (advisory)

Evaluates clarity, readability, executive tone, and actionability of recommendations. Failure here produces a REVIEW outcome — human approval required — but does not block generation. Quality is the only stage where the gate uses an LLM to evaluate an LLM, and it is deliberately advisory because language quality is genuinely subjective.

Gate outcomes

Outcome	Meaning	Action
PASS	All mandatory stages clear	PDF generated automatically
REVIEW	Mandatory stages clear, quality flagged	Human reviewer approves before release
FAIL	Schema violations, unsupported claims, or missing critical or high findings	PDF hard-blocked; root cause logged

REVIEW is a deliberate first-class outcome. Most pipelines have binary success/failure semantics. In a regulated context, the state where output is technically valid but flagged for human attention is the most important state — it is where human judgment enters the loop without blocking automation entirely.

Gate logic (sketch)

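public AcceptanceResult evaluate(EvidenceBundle evidence, ReportSummary llmOutput) {
    // Stage 1 (mandatory): structural check of the LLM output; cheapest, so it runs first.
    var schemaResult   = schemaValidator.check(llmOutput);
    if (schemaResult.failed()) return AcceptanceResult.fail("SCHEMA", schemaResult);

    // Stage 2 (mandatory): every entity and findingId must resolve to evidence.
    var factResult     = factValidator.check(evidence, llmOutput);
    if (factResult.failed()) return AcceptanceResult.fail("FACT", factResult);

    // Stage 3 (mandatory): every critical and high finding must be referenced.
    var coverageResult = coverageValidator.check(evidence, llmOutput);
    if (coverageResult.failed()) return AcceptanceResult.fail("COVERAGE", coverageResult);

    // Stage 4 (advisory): a low quality score routes to human review instead of failing.
    var qualityResult  = qualityEvaluator.score(llmOutput);
    if (qualityResult.belowThreshold()) {
        return AcceptanceResult.review("QUALITY", qualityResult);
    }

    // All mandatory stages cleared and quality is above the threshold.
    return AcceptanceResult.pass();
}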

The order matters. Schema is cheapest; quality is most expensive. Failing fast on cheap checks keeps the pipeline responsive and reduces wasted LLM evaluation cost.
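The `AcceptanceResult` type used in the sketch above is not defined in this article. One plausible shape (Java 17+) is a small record with three factory methods matching the three gate outcomes. The `Outcome` enum and the helper methods `mayRenderAutomatically` and `needsHumanApproval` are assumptions for illustration.

```java
enum Outcome { PASS, REVIEW, FAIL }

// Carries the outcome, the stage that produced it, and that stage's detail object.
record AcceptanceResult(Outcome outcome, String stage, Object detail) {

    static AcceptanceResult pass() {
        return new AcceptanceResult(Outcome.PASS, null, null);
    }

    static AcceptanceResult review(String stage, Object detail) {
        return new AcceptanceResult(Outcome.REVIEW, stage, detail);
    }

    static AcceptanceResult fail(String stage, Object detail) {
        return new AcceptanceResult(Outcome.FAIL, stage, detail);
    }

    // Only PASS lets the renderer run without a human in the loop.
    boolean mayRenderAutomatically() {
        return outcome == Outcome.PASS;
    }

    // REVIEW is the state where a human must approve before release.
    boolean needsHumanApproval() {
        return outcome == Outcome.REVIEW;
    }
}
```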

6. PDF rendering

PagedJS is a polyfill for the W3C CSS Paged Media and Generated Content for Paged Media specifications that enables print-quality pagination within standard HTML/CSS. It handles page numbering, running headers, controlled section breaks, table-of-contents generation, and cover page layout, all of which are difficult to manage in a low-level programmatic PDF library.

HTML template + Chart.js images + LLM prose
    │
    ▼
PagedJS layout engine  (paginates, places headers/footers)
    │
    ▼
Puppeteer headless Chrome  (prints to PDF)
    │
    ▼
PDF artifact + signed report metadata

A few decisions worth flagging:

Charts are pre-rendered as PNGs using chartjs-node-canvas, then embedded as base64 data URIs. Because chartjs-node-canvas is a Node.js library, this step runs in a Node process invoked by the Spring Boot service rather than inside the JVM. Pre-rendering removes chart-rendering JavaScript from the Puppeteer render context and keeps chart output stable across re-renders.
The appendix is generated separately and contains only evidence JSON records. There is no LLM-generated content in the appendix. An auditor reading the report can look up any findingId referenced in the body and find the corresponding evidence record verbatim.
The report is checksummed and the metadata is signed before release, so the binary artifact is itself audit evidence.
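The checksum-and-sign step can be sketched with the JDK alone (Java 17+ for `HexFormat`). The article does not say which signing scheme is used, so HMAC-SHA256 appears here purely as a stand-in; an audit setting may well prefer asymmetric signatures. The class name and metadata format are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class ReportSeal {
    private ReportSeal() {}

    // SHA-256 of the PDF bytes, hex-encoded.
    static String checksum(byte[] pdf) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(pdf));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    // HMAC-SHA256 over the metadata string (assumed scheme, see lead-in).
    static String sign(String metadata, byte[] key) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return HexFormat.of().formatHex(mac.doFinal(metadata.getBytes(StandardCharsets.UTF_8)));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }
}
```

Including the PDF checksum inside the signed metadata ties the signature to one specific binary artifact.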

7. LangChain, not LangGraph

LangGraph is designed for workflows whose control flow is decided at runtime: the next step depends on the previous step’s output, multiple agents coordinate, or human-in-the-loop gates are dynamic rather than fixed.

This system’s workflow is entirely deterministic. The sequence — extract, evaluate, narrate, validate, render — does not branch, retry, or loop based on LLM output. A LangChain structured output chain is exactly the right primitive. Adding LangGraph would introduce coordination overhead without enabling new capability.

Rule of thumb: use LangGraph when the path through your pipeline is decided at runtime. Use a structured LangChain chain when the path is fixed and only the content varies.

When LangGraph would be appropriate (V3+)

A later version of this system would benefit from LangGraph when it needs to correlate findings across multiple enterprise sources — CyberArk + ServiceNow + SailPoint + Active Directory + SIEM — and route investigation steps based on what it finds:

Finding (severity:critical)
   │
   ▼
Lookup ServiceNow change ticket
   │
   ├── ticket found  → check approval chain in SailPoint
   │                    │
   │                    ├── approver still active in AD?  → close out
   │                    └── approver offboarded?         → escalate, page on-call
   │
   └── no ticket      → SIEM correlation → human approval → open incident

That is genuinely a graph, not a chain. Each node’s outgoing edge depends on the node’s result, and human approval is dynamic rather than fixed at one stage. LangGraph earns its weight here. In the current V2 governance reporting scope, it does not.
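To see why this is a graph rather than a chain, the two decision points in the diagram can be written as routing functions whose return value selects the next node. This is plain Java for illustration, not LangGraph code, and the step names are hypothetical labels for the diagram's nodes.

```java
enum Step { CHECK_APPROVAL_CHAIN, SIEM_CORRELATION, CLOSE_OUT, ESCALATE }

final class InvestigationRouter {
    private InvestigationRouter() {}

    // After the ServiceNow lookup: the edge taken depends on whether a ticket exists.
    static Step afterTicketLookup(boolean ticketFound) {
        return ticketFound ? Step.CHECK_APPROVAL_CHAIN : Step.SIEM_CORRELATION;
    }

    // After the SailPoint approval check: the edge depends on the approver's AD status.
    static Step afterApprovalCheck(boolean approverStillActive) {
        return approverStillActive ? Step.CLOSE_OUT : Step.ESCALATE;
    }
}
```

In the reporting pipeline no such function exists, because every run visits the same stages in the same order.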

8. Trade-offs and open questions

No architecture is free. The interesting trade-offs in this design:

1. Rule engine maintenance cost. Moving fact determination out of the LLM and into a Java rule engine means rule changes require code changes, tests, and deployment. The benefit is reproducibility; the cost is velocity. Mitigation: invest in a rule DSL and a rule-test framework early.

2. Evidence JSON schema as a coupling point. The schema is the contract between the rule engine, the LLM prompt, the acceptance gate, and the PDF template. A change to the schema touches all four. This is by design — it forces deliberate evolution — but it is real coupling. Mitigation: version the schema, allow side-by-side versions during transitions.

3. Fact validation requires exact-match semantics. If the LLM writes “John Chan” but the evidence JSON has "john.chan", naive matching fails. Either canonicalise the prompt input (“the user john.chan”), constrain the schema to use only IDs, or build a tolerant matcher with a strict false-positive budget. We chose canonicalisation; it is the simplest and most auditable.

4. Quality validation is the soft spot. Stage 4 uses an LLM to evaluate the primary LLM’s output. That is the one place in the system where probabilistic reasoning judges probabilistic reasoning. Keeping it advisory rather than mandatory is the architectural concession to this fact.

5. Latency. A four-stage gate plus PagedJS rendering is not real-time. This is fine for a scheduled governance report; it would be unacceptable for an interactive query system. Different use case, different design.

9. Lessons for senior AI agent designers

1. Separate computation from narration architecturally, not just in prompts.

A prompt that says “do not calculate anything” is a weak guarantee. An architecture that physically separates the computation layer (rule engine, PostgreSQL) from the narration layer (LangChain) is a strong guarantee. Invest in the rule engine. The prompt is the easy part.

2. Design the acceptance gate before you design the prompt.

Most AI pipeline designers write the prompt first and add validation later, if at all. Invert that. Define what correct output looks like first — schema, supported claims, coverage requirements — then design the prompt to reliably produce output that clears the gate. Better prompts and better validation fall out of this ordering.

3. The evidence schema is your LLM’s entire world.

Whatever the LLM receives is, from its perspective, all that exists. If a user’s name appears in the evidence JSON, the LLM can reference it. If it does not, any mention is unsupported and the gate rejects it. Design the schema to be the complete, minimal context the model needs: nothing more, nothing less. Every extra field is a hallucination surface; every missing field is a failure mode.

4. Auditability is the unsolved problem, not generation.

Generating plausible executive prose is a solved problem. Designing a system where every sentence in that prose can be traced to a source record, validated programmatically, and explained to an auditor — that is the unsolved, high-value problem. It is also what distinguishes a production governance system from a demo.

5. Treat REVIEW as a first-class outcome.

Most pipelines have two outcomes: success and failure. The REVIEW state — where the system produced technically valid output but flagged it for human attention — is the most important outcome in a regulated context. It is the point where human judgment is injected without blocking the workflow entirely. Design your quality validation gate to produce calibrated, actionable REVIEW flags, not binary pass/fail.

6. Determinism is a feature, not a constraint.

The temptation in agentic AI work is to reach for the most powerful primitive available. Resist it when the problem does not require it. A deterministic chain with a strong validation gate will out-ship a non-deterministic graph for any workflow whose path is fixed. Save the graph for problems whose path is genuinely dynamic.

Closing

The architecture described here is built around a single, opinionated claim: in a regulated reporting system, the LLM’s job is to write, not to know. Everything that knows — the rule engine, the evidence JSON, the acceptance gate — is deterministic, versionable, and auditable. The LLM is the last mile, and the gate is the safety net.

This is not the most exciting architecture. Agentic systems with autonomous decision-making are more glamorous. But for governance reporting, audit, and compliance, glamour is a liability. What you want is a system whose worst output is no output, never wrong output. Evidence-first design, with a strong acceptance gate, gets you there.

Built for CISO, Head of Cyber Security, IAM/PAM Manager, Technology Risk, Internal Audit, and Compliance audiences. The same pattern generalises to any LLM application where every claim must be defensible.