# TECHNICAL WHITE PAPER

## Mermaid Diagram Driven Development: Verifiable Semantic Networks for Agentic Software Engineering

**Abbreviation:** MDD  
**Date:** May 2026  
**Classification:** Open Technical Standard Proposal  
**Target Audience:** AI Systems Architects, Agent Runtime Teams, Formal Methods Engineers, Software Engineering Researchers

---

## Abstract

Current LLM-based software engineering workflows rely heavily on ambiguous Markdown documents, loosely structured prompts, and post-hoc tests. These artifacts are readable but weakly verifiable. They allow semantic drift between requirements, diagrams, agent plans, code, tests, and explanations.

**Mermaid Diagram Driven Development (MDD)** is a graph-first methodology for transforming source documents into auditable, verifiable development networks. MDD treats Mermaid diagrams as the first structured abstraction above source material. Mermaid nodes and edges preserve provenance back to original text so humans and agents can audit architectural, behavioral, and explanatory claims.

Mermaid is not the final formal system. Mermaid is compiled into a typed intermediate representation, **MDD-IR**, which becomes the canonical substrate for deterministic verification:

- **LLMs** extract semantics, propose finite domains, identify ambiguity, explain counterexamples, and suggest graph repairs.
- **TLA+** verifies state integrity, temporal behavior, safety/liveness, deadlock freedom, and legal transitions.
- **Z3** checks bounded paths, reachability, forbidden-route UNSAT results, and coverage witnesses.
- **Lean4** checks semantic extrapolations, refinement claims, and explanation rules.
- **`mdd-engine`** performs heavyweight compilation and verification.
- **`mdd`** is the lightweight deterministic traversal CLI for agents using prebuilt verified bundles.

MDD reframes autonomous software development as a pipeline of **source-backed graph abstraction, state-space verification, path solving, semantic proof, proof-carrying compilation, and deterministic runtime traversal**.

---

## 1. Core Thesis

> **Mermaid diagrams are the first auditable abstraction from source documents. LLMs provide semantic interpretation. Deterministic tools verify the consequences of that interpretation.**

The core split is:

```text
LLM        = semantics
Mermaid    = auditable graph abstraction
MDD-IR     = typed canonical graph
TLA+       = state integrity
Z3         = verifiable paths
Lean4      = semantic extrapolation/proof
mdd-engine = heavyweight compiler/verifier
mdd        = lightweight runtime traversal CLI
```

Compactly:

```text
LLMs give states meaning.
Mermaid makes meaning visible.
MDD-IR makes meaning typed.
TLA+ keeps states honest.
Z3 proves paths exist or do not.
Lean4 checks extrapolated meaning.
mdd keeps agents inside the verified design space.
```

MDD should not claim to prove a whole implementation correct. It verifies a **design-level abstraction** and emits artifacts that guide tests, assertions, explanations, and agent traversal.

---

## 2. System Architecture

```text
Source docs / code / issues
        ↓
LLM semantic extraction with provenance
        ↓
Mermaid audit graph
        ↓
Typed MDD-IR
        ↓
TLA+ state integrity checks
        ↓
Z3 path witnesses / forbidden-route checks
        ↓
Lean4 semantic proof obligations
        ↓
Verified .mddbundle
        ↓
mdd runtime traversal for AI agents
```

The pipeline has two CLIs:

```text
mdd-engine = heavy semantic/formal compiler
mdd        = lightweight verified graph traversal VM
```

The split is essential. Full verification may require LLMs, Java/TLA+, Apalache, Z3, Lean4, Lake, theorem-proving libraries, and expensive model-checking runs. Many AI agent environments cannot install or run all of that. They can often run a small CLI and sometimes a Python Z3 package. Therefore `mdd-engine` performs verification once and emits a portable `.mddbundle`; `mdd` lets many downstream agents safely reuse it.

---

## 3. Mermaid as Auditable Abstraction

Mermaid is valuable because it is compact, readable, diffable, and LLM-friendly. It gives humans a reviewable surface between prose and formal models.

However:

```text
Valid Mermaid syntax is not semantic correctness.
```

Every graph element should preserve provenance:

```yaml
node:
  id: Authenticated
  label: Authenticated Session
  provenance:
    file: docs/auth.md
    lines: [42, 47]
    status: source_explicit

edge:
  id: refresh_invalid_logout
  from: TokenExpired
  to: Unauthenticated
  event: Refresh token invalid, revoked, replayed, or malformed
  guard: refresh_case != valid
  provenance:
    file: docs/auth.md
    lines: [52, 55]
    status: source_inferred
```

Required provenance labels:

```text
source_explicit  = directly supported by source
source_inferred  = inferred from source, not directly stated
llm_hypothesis   = proposed by LLM; needs review/checking
human_confirmed  = confirmed by reviewer/user
```

---

## 4. MDD-IR

MDD-IR is the canonical typed representation. It separates Mermaid syntax from formal semantics.

MDD-IR should contain:

- graph kind: state machine, workflow, dependency graph, protocol, explanation graph
- stable node IDs and edge IDs
- labels and human descriptions
- source spans and provenance status
- actors/events
- guards, preconditions, postconditions
- finite semantic domains
- environment actions
- safety invariants
- liveness candidates and fairness assumptions
- forbidden transitions
- coverage obligations
- backend projection hints for TLA+, Z3, Lean4, tests, and runtime

Example finite domain:

```yaml
refresh_case:
  values: [valid, expired, revoked, replayed, malformed]
properties:
  - revoked token never reaches Authenticated
  - replayed token reaches Unauthenticated
  - valid token may re-enter Authenticating
```

This finite-domain step is crucial. Vague prose such as “handle auth edge cases” is not checkable. MDD turns it into bounded state/path obligations.

---

## 5. LLM Role

LLMs are semantic assistants, not proof engines.

Good LLM responsibilities:

- extract candidate states, events, actors, and edges
- map source text to graph elements
- identify ambiguity and missing cases
- propose finite edge-case domains
- propose safety/liveness properties
- explain model-checker counterexamples
- suggest graph repairs
- generate human-facing documentation from checked artifacts

Forbidden LLM responsibilities:

- inventing unsupported requirements without provenance labels
- treating Mermaid syntax as proof
- declaring a path valid without deterministic checking
- treating a semantic extrapolation as verified without Lean4 or another checker
- bypassing `mdd` when a transition is rejected

Prompt rule:

```text
Your output is a semantic proposal, not proof.
Preserve provenance and produce typed obligations for deterministic checking.
```

---

## 6. TLA+ for State Integrity

TLA+ is the state-space integrity layer. MDD projects MDD-IR into TLA+ when the graph describes behavior over time.

Mapping:

```text
Mermaid nodes       → finite States
Mermaid edges       → actions in Next
current node        → state variable
previous node       → prev variable when needed
guards/domains      → constants and variables
failure events      → environment actions
safety claims       → invariants
liveness claims     → temporal properties/fairness
```

TLA+ checks questions like:

- Can the system deadlock?
- Can a forbidden state be reached?
- Can a terminal state have outgoing transitions?
- Can a retry loop continue forever under stated assumptions?
- Can an agent deploy without test success?
- Does every allowed path preserve the declared invariant?

TLA+ should not model everything. Following the AWS lesson, the model should capture the smallest dangerous state space that can expose design bugs.

---

## 7. Z3 for Verifiable Paths

Z3 handles bounded path satisfiability and route constraints.

Z3 answers:

- Is there a path from A to B under these guards?
- Is this forbidden route UNSAT?
- What witness path covers this semantic case?
- What minimal path set covers all domain values?
- Can an agent reach the requested target without violating constraints?

Example result:

```json
{
  "query": "cover refresh_case = replayed",
  "result": "sat",
  "witness_path": ["Authenticated", "TokenExpired", "Unauthenticated"],
  "edges": ["ttl_exceeded", "refresh_invalid_logout"]
}
```

Z3 is also useful in `mdd` when lightweight runtime path checks are needed and the certified bundle allows bounded queries.

---

## 8. Lean4 for Semantic Extrapolation

LLMs often generalize from examples:

```text
All non-valid refresh token cases log out.
```

MDD treats such statements as **proof obligations**, not conclusions. Lean4 checks whether the generalized claim follows from formalized definitions.

Example obligation:

```lean
inductive RefreshCase where
  | valid | expired | revoked | replayed | malformed
  deriving DecidableEq, Repr

inductive AuthState where
  | unauthenticated | authenticating | authenticated | tokenExpired
  deriving DecidableEq, Repr

def refreshTransition (r : RefreshCase) : AuthState :=
  match r with
  | RefreshCase.valid => AuthState.authenticating
  | _ => AuthState.unauthenticated

theorem non_valid_refresh_logs_out
  (r : RefreshCase)
  (h : r ≠ RefreshCase.valid) :
  refreshTransition r = AuthState.unauthenticated := by
  cases r <;> simp [refreshTransition] at *
```

Lean4 is not a semantic oracle. It checks only formalized definitions and theorems. The LLM may propose those definitions, but deterministic tooling must check them.

---

## 9. AWS-Derived Principles

The AWS paper **“Use of Formal Methods at Amazon Web Services”** (Newcombe et al., CACM 2015, https://cacm.acm.org/research/how-amazon-web-services-uses-formal-methods/) teaches that formal methods work in industry when framed as **debugging designs**.

AWS found that prose, diagrams, reviews, tests, and fault injection were necessary but not sufficient. Subtle bugs often lived in the design itself and appeared only under rare interleavings of failures, retries, crashes, recoveries, message reorderings, and operator actions.

MDD adopts this lesson:

```text
MDD is not diagram drawing.
MDD is source-backed, exhaustively testable design debugging for agentic systems.
```

### 9.1 Start with “what must go right?”

Before drawing the happy path, MDD prompts should ask for:

1. safety invariants
2. liveness candidates
3. forbidden transitions
4. recovery obligations
5. environment assumptions
6. source provenance for each claim

Prompt block:

```text
Before producing Mermaid, identify what must go right.
List safety properties, liveness candidates, forbidden transitions, and assumptions.
Mark every item as source_explicit, source_inferred, llm_hypothesis, or human_confirmed.
```

### 9.2 Model the environment with the system

A model containing only the application happy path is misleading. Environment actions must be first-class.

For agentic software, the environment includes:

```text
network_timeout
message_loss
message_duplicate
message_reorder
process_crash
process_restart
external_service_error
operator_action
user_correction
permission_denial
tool_failure
malformed_tool_output
stale_memory
conflicting_instruction
llm_invalid_intent
retry/backoff
repair/recovery
```

Prompt block:

```text
Expand the graph with environment actions.
Do not model only the happy path.
For each environment action, specify actor, trigger, affected state, guard, postcondition, and provenance.
```

### 9.3 Prefer small bounded dangerous models

AWS found serious bugs in relatively small models. MDD should not reproduce every implementation detail. It should capture the essential state space where correctness can fail.

Prompt block:

```text
Build the smallest finite model that can expose the dangerous behavior.
Omit implementation details unless they affect a stated property.
State every abstraction decision and omission.
```

### 9.4 Counterexamples are product artifacts

MDD should translate verifier results into useful artifacts:

```text
TLA+/Z3 counterexample
  → Mermaid sequence/state trace
  → violated-property explanation
  → graph repair proposal
  → regression test
  → runtime assertion
  → updated bundle after re-verification
```

Counterexamples should be classified as:

```text
product_design_bug
graph_bug
formalization_bug
abstraction_gap
```

### 9.5 Generate tests and assertions

Formal models do not prove the implementation matches the model. They do help generate better assertions and edge-case tests.

MDD should emit:

- tests from Z3 witnesses
- tests from TLA+ counterexamples
- tests from coverage gaps
- assertions from verified invariants
- runtime checks for transition legality

### 9.6 Know the scope

MDD is strongest for:

```text
state validity
path validity
forbidden transitions
workflow correctness
recovery logic
agent traversal constraints
concurrency/interleaving risks
semantic edge-case coverage
source-backed design documentation
```

MDD is weaker unless separately modeled:

```text
emergent performance collapse
queueing/load dynamics
hard real-time latency
probabilistic availability
economic/adversarial incentives
implementation correctness without tests/assertions/instrumentation
```

---

## 10. Reusable LLM Prompt Preamble

```text
You are assisting with Mermaid Diagram Driven Development (MDD).
Your job is to convert source material into auditable graph semantics that can be checked by deterministic tools.
Do not treat your output as proof. Your output is a semantic proposal that must preserve provenance and be compiled into typed MDD-IR, then checked by TLA+, Z3, Lean4, or deterministic CLI logic.

Prefer small bounded models that capture dangerous state/path behavior over large implementation-level models.
Always model the system together with its environment: failures, retries, crashes, restarts, external services, operator actions, user corrections, tool errors, and LLM decision points.
Start from “what must go right?” before enumerating “what might go wrong?”
```

Extraction prompt:

```text
Given the source material, extract an MDD graph proposal.
Return:
1. Mermaid diagram with stable node/edge IDs.
2. Node table with type, label, terminal flag, and source provenance.
3. Edge table with event, actor, guard, postcondition, and source provenance.
4. Environment actions.
5. Safety invariants.
6. Liveness candidates and fairness assumptions.
7. Forbidden transitions.
8. Finite semantic domains.
9. Ambiguities and assumptions.
10. Suggested TLA+/Z3/Lean obligations.

Rules:
- Do not invent unsupported requirements.
- Mark unsupported proposals as llm_hypothesis.
- Prefer finite domains and explicit guards over vague prose.
- Model the environment together with the system.
- Mermaid is audit syntax, not proof.
```

Runtime prompt for agents using `mdd`:

```text
You are an LLM agent operating under MDD runtime constraints.
Use the verified .mddbundle as the authority for legal traversal.
You may propose an intent, but mdd determines whether the transition is allowed.
If mdd rejects a transition, do not bypass it. Explain the rejection, inspect allowed alternatives, and choose a legal next step or ask for human review.

Before each action:
1. State current graph node/state.
2. State intended transition and why.
3. Ask mdd whether the transition is allowed.
4. If allowed, proceed and record trace.
5. If rejected, use the rejection reason and choose a legal path.
```

---

## 11. `mdd-engine`

`mdd-engine` is the heavyweight compiler/verifier.

Responsibilities:

- ingest source docs, code, tickets, and diagrams
- call LLMs for semantic extraction
- generate Mermaid audit graphs
- enforce provenance requirements
- parse Mermaid deterministically
- compile typed MDD-IR
- generate TLA+, Z3, and Lean4 artifacts
- run model checking and solver queries
- collect counterexamples and witnesses
- render counterexamples back into Mermaid
- emit tests and assertions
- sign and package `.mddbundle`

Example:

```bash
mdd-engine ingest docs/auth.md --out graphs/auth.mmd
mdd-engine compile graphs/auth.mmd --out build/auth.ir.json
mdd-engine verify build/auth.ir.json --tla --z3 --lean
mdd-engine bundle build/auth.ir.json --out dist/auth.mddbundle
mdd-engine sign dist/auth.mddbundle --cert dist/auth.mddcert
```

The output is a proof-carrying bundle containing compiled graph data, verification metadata, transition tables, path constraints, provenance, tests, assertions, and optional audit artifacts.

---

## 12. `mdd`

`mdd` is the lightweight runtime traversal CLI. It is not a full verifier. It consumes already-verified bundles and enforces deterministic traversal.

Responsibilities:

- load `.mddbundle`
- validate bundle hash/signature
- expose current states, nodes, edges, and allowed actions
- check proposed transitions
- reject invalid paths
- explain rejection reasons
- record runtime traces
- optionally run bounded Z3 queries if available and certified
- return machine-readable JSON to agents

Example commands:

```bash
mdd state dist/auth.mddbundle

mdd next dist/auth.mddbundle \
  --state TokenExpired

mdd check-transition dist/auth.mddbundle \
  --from TokenExpired \
  --to Authenticated \
  --event refresh_token_valid

mdd explain dist/auth.mddbundle \
  --edge refresh_invalid_logout

mdd trace dist/auth.mddbundle \
  --input runtime-trace.jsonl
```

Runtime rule:

```text
The LLM proposes intent.
mdd computes legal transitions.
The LLM chooses among legal transitions.
mdd records the trace.
```

If `mdd` rejects a transition, an agent must not bypass it. It should inspect allowed alternatives, explain the rejection, repair the graph through `mdd-engine`, or ask for human review.

---

## 13. `.mddbundle`

A `.mddbundle` is the portable verified artifact emitted by `mdd-engine` and consumed by `mdd`.

Suggested contents:

```text
bundle.json              metadata, hashes, versions
source-map.json          source spans and provenance
mdd-ir.json              canonical typed graph
transition-table.json    deterministic allowed transitions
guards.json              guard expressions or compiled predicates
path-constraints.json    Z3-ready constraints or precomputed witnesses
invariants.json          checked safety properties
liveness.json            declared liveness/fairness assumptions
coverage.json            semantic coverage obligations/results
counterexamples/         traces and explanations
tests/                   generated regression tests
assertions/              generated runtime assertions
certs/                   verification certificates / signatures
audit/                   optional Mermaid, TLA+, Lean4, solver logs
```

Runtime-critical files should be small and deterministic. Heavy formal artifacts may be included for audit but are not required for `mdd` traversal.

---

## 14. Proof-Carrying Semantic Coverage

MDD introduces **proof-carrying semantic coverage**: each meaningful edge case is tracked from source text through graph representation, formal checks, generated tests, and runtime traces.

Example:

```json
{
  "semantic_case": "refresh_case = replayed",
  "source": "docs/auth.md:52-55",
  "mermaid_edge": "refresh_invalid_logout",
  "tla_status": "safe",
  "z3_witness_path": ["Authenticated", "TokenExpired", "Unauthenticated"],
  "lean_theorem": "non_valid_refresh_logs_out",
  "generated_test": "refresh_replayed_logs_out.test.ts",
  "runtime_assertion": "no_authenticated_after_replayed_refresh",
  "status": "covered"
}
```

This is stronger than ordinary code coverage. It connects **meaning** to source, verification, tests, and runtime behavior.

---

## 15. Related Work and Differentiation

MDD sits at the intersection of several existing areas.

### Spec-driven AI development

Projects such as GitHub Spec Kit, OpenSpec, Kiro Specs, and spec-workflow-mcp validate the demand for structured AI-native development. They organize requirements, designs, and tasks, but generally remain Markdown/workflow systems rather than graph-compiled formal verification systems.

MDD difference:

```text
Spec tools organize intent.
MDD verifies and executes intent as state/path constraints.
```

### Agent graph runtimes

LangGraph, XState, and Stately Agent show the value of graph/state-machine control for agents and applications.

MDD difference:

```text
Graph runtimes execute workflows.
MDD verifies source-provenanced semantic graphs before agents traverse them.
```

### Formal methods

TLA+, Apalache, P, FizzBee, Alloy, Lean4, and related tools provide the verification backbone. AWS’s industrial use of TLA+ demonstrates that formal design models find important bugs missed by conventional methods.

MDD difference:

```text
Formal tools require formal authoring.
MDD provides a source-doc → Mermaid → MDD-IR → formal backend → runtime bundle pipeline.
```

### LLMs for formalization

Research on LLM-assisted formal specification, TLA+ proof generation, Lean theorem proving, and proof-carrying code completions shows that LLMs can help produce formal artifacts.

MDD difference:

```text
LLM formalization work often targets specs/proofs directly.
MDD inserts an auditable graph layer and produces runtime traversal constraints for agents.
```

### GraphRAG and knowledge graphs

GraphRAG extracts document graphs for retrieval and synthesis.

MDD difference:

```text
GraphRAG answers questions from document graphs.
MDD verifies and executes state/path semantics from document graphs.
```

---

## 16. Contribution Claims

MDD contributes:

1. **Source-provenanced Mermaid abstraction**: a human-auditable layer between natural language and formal models.
2. **Typed MDD-IR**: a canonical graph representation with guards, domains, invariants, coverage obligations, and provenance.
3. **Multi-backend formal verification**: TLA+ for state integrity, Z3 for path witnesses, Lean4 for semantic extrapolation.
4. **Proof-carrying semantic coverage**: edge cases linked from source text to verification, tests, assertions, and runtime traces.
5. **Engine/runtime split**: `mdd-engine` performs expensive verification; `mdd` enforces lightweight deterministic traversal.
6. **Agent traversal protocol**: LLMs may propose intent, but deterministic runtime logic approves or rejects transitions.
7. **Environment-aware modeling discipline**: failures, retries, tool errors, user corrections, operator actions, and LLM invalid intents are modeled with the system.

Novelty statement:

> Unlike spec-driven development tools, MDD does not stop at structured Markdown artifacts. Unlike graph runtimes, MDD does not merely execute workflows. Unlike formal methods tools, MDD does not require engineers to directly author raw formal specifications. MDD introduces a source-provenanced Mermaid layer and typed graph IR that compile into formal backends and lightweight runtime bundles for AI agents.

---

## 17. Limitations

MDD has important limits:

- A verified model is not the real system.
- Source documents may be incomplete or wrong.
- LLM extraction may mislabel semantics.
- Formal checks only cover encoded assumptions and bounded domains.
- Lean4 checks formalized definitions, not informal intent.
- `mdd` verifies transitions under a bundle, not global reality.
- Implementation correctness still needs code review, tests, assertions, monitoring, and integration discipline.
- Performance collapse, queueing dynamics, probabilistic availability, and economic/adversarial behavior must be explicitly modeled or marked out of scope.

MDD’s mature claim is:

```text
MDD verifies design-level state/path semantics and constrains agent traversal under declared assumptions.
```

Not:

```text
MDD proves the entire software system correct.
```

---

## 18. Implementation Roadmap

Phase 1: Mermaid ingestion and deterministic parsing  
Phase 2: typed MDD-IR schema  
Phase 3: provenance and audit UI  
Phase 4: TLA+ backend for state integrity  
Phase 5: Z3 backend for path witnesses  
Phase 6: Lean4 proof obligations for semantic extrapolation  
Phase 7: `.mddbundle` packaging and signatures  
Phase 8: lightweight `mdd` runtime traversal  
Phase 9: counterexample-to-Mermaid rendering  
Phase 10: generated tests, assertions, and CI integration

CI gate example:

```bash
mdd-engine ci
```

It should fail if:

- Mermaid cannot parse
- graph elements lack provenance
- MDD-IR schema validation fails
- TLA+ invariants fail
- forbidden paths are SAT
- required coverage obligations lack witnesses
- Lean4 proof obligations fail
- generated bundle is stale relative to source hashes

---

## 19. Conclusion

MDD transforms ambiguous prose into auditable graphs, graphs into typed formal models, models into verified tests and obligations, verified artifacts into runtime bundles, and runtime traversal into logic-guided agent behavior.

The essential idea is:

```text
mdd-engine compiles proof-carrying Mermaid diagrams.
mdd lets LLM agents traverse those diagrams without violating verified state/path constraints.
```

MDD is best understood as:

> **Mermaid-based, source-provenanced, exhaustively testable design debugging for AI-assisted software engineering.**

The goal is not prettier diagrams. The goal is to find design bugs that prose, reviews, tests, and ordinary diagrams miss — and then give agents a verified design space they can traverse without inventing invalid paths.