When working closely with ERP systems like Business Central, I quickly learned a hard truth:
My applications will never be the source of truth.
The source of truth will always be human-entered data.
And humans make mistakes.
Not because they are careless, but because real business processes are complex, time-critical, and often span multiple departments. Sales, purchasing, production planning, manufacturing, and QA all touch the same data, sometimes weeks apart.
By the time a physical product reaches final inspection, a long chain of decisions has already been made.
Some correct.
Some imperfect.
Some based on assumptions that no longer hold.
In a classic setup, software either accepts this silently or fails loudly.
Neither option helps the people responsible for quality.
The gap between software and QA
Quality assurance doesn’t fail because data is wrong.
It fails because context is missing.
Most systems can tell you:
- what data exists
- which rules were technically satisfied
- which validations passed
What they cannot tell you is:
- whether this configuration is unusual
- whether it deviates from how products are usually built
- whether several small deviations together should raise attention
That judgment usually lives in people’s heads and comes from experience.
Why logging is not enough
Traditional logging is excellent at answering technical questions:
- What happened?
- When did it happen?
- Which values were involved?
But logging has a fundamental limitation:
you can only log what you already anticipated.
You log known errors.
You log known invariants.
You log known failure paths.
What you cannot log are the things that feel wrong but still technically work.
And in manufacturing, that gray area is exactly where real-world problems begin.
A different approach: let the system explain itself
Instead of trying to encode every domain rule as hard validation, I experimented with a different idea:
What if a system could explain its own understanding of the current situation — in domain language — to the people responsible for quality?
Not as logs.
Not as metrics.
Not as pass/fail rules.
But as a structured explanation of:
- what it sees
- what it expected
- and where those two don’t fully align
Turning domain knowledge into diagnostics
This approach starts with a simple but demanding requirement:
The developer must understand the domain deeply.
Not from specifications alone, but from:
- how products are physically built
- how they are assembled
- how they are tested
- how QA actually evaluates them
That knowledge is then written down explicitly.
Not as executable rules.
Not as enforcement logic.
But as domain expectations.
One bounded context, one diagnostic voice
In my case, each microservice represents a bounded context with its own responsibilities and assumptions.
For example, the manufacturing context knows:
- how a product structure is expected to look
- which component hierarchies are common
- which combinations are unusual but acceptable
- which structures should never occur
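Expectations like these can be written down as plain, reviewable text rather than code. The sketch below is illustrative only; the component names are taken from the example report later in this article, and the expectations themselves are hypothetical, not the production rules:

```csharp
// Illustrative only: these expectations are hypothetical examples,
// not the actual production rules.
public static class ManufacturingExpectations
{
    // Written in plain domain language so that domain experts
    // can review and correct them without reading code.
    public const string Text =
        """
        - A FlowmeterStation normally contains at least one Flowmeter.
        - Sensors without an accompanying Flowmeter are uncommon but acceptable.
        - A component must never reference itself as its own parent.
        """;
}
```

Keeping the expectations as text, not logic, is what makes them reviewable by the same people whose experience they encode.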
From this context, the service creates a diagnostic snapshot of its current state.
This snapshot is not a domain model.
It is a read-only representation meant purely for explanation.
Where AI fits — and where it doesn’t
AI is not used to decide anything.
It does not change data.
It does not block workflows.
It does not fix errors.
Its role is strictly limited:
Given a structured snapshot and explicit domain expectations, it produces a human-readable diagnostic report describing what looks normal, unusual, or risky.
The intelligence does not come from the model.
It comes from the domain knowledge of users, encoded into the prompt.
What the system actually produces
The result is not a verdict.
It is a report.
Each finding contains:
- a severity level
- a confidence estimate
- a short explanation in domain language
This allows QA to see, at a glance, where closer inspection is warranted — without being told what decision to make.
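A minimal sketch of what the result type behind such a report might look like. The field names mirror the JSON schema used in the prompt later in this article; the enum shapes are assumptions for illustration (deserializing string values into them would additionally require a converter such as `JsonStringEnumConverter`):

```csharp
// Sketch of the result type the diagnostics service deserializes into.
// Field names mirror the JSON response schema; enums are illustrative.
public enum Severity { Info, Warning, Critical }
public enum Confidence { Low, Medium, High }

public sealed record DiagnosticFinding(
    Severity Severity,
    string Category,      // Structure | Allocation | Pattern | Completeness | Consistency
    Confidence Confidence,
    string Message,       // short, domain-language summary
    string Explanation);  // why this looks normal, unusual, or risky

public sealed record ProductionOrderDiagnosticResult(
    bool IsConsistent,
    IReadOnlyList<string> RiskTags,
    IReadOnlyList<DiagnosticFinding> Findings);
```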
Advantages in distributed systems
This becomes especially powerful in a distributed system.
Each bounded context produces its own diagnostic report, based on its own understanding of the domain.
When these reports are combined for a single production order, QA gains a system-wide perspective:
Multiple services independently highlighting unusual aspects of the same product is often an early indicator of real production risk.
No single service has the full picture — but together, they surface patterns humans would otherwise discover too late.
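One way to surface those patterns is a simple aggregation step. The sketch below is hypothetical (the aggregator and its names are not part of the production system): it flags any risk tag raised independently by two or more bounded contexts for the same production order:

```csharp
// Hypothetical aggregator: collects per-context reports for one
// production order and surfaces cross-context patterns to QA.
public sealed record ContextReport(
    string BoundedContext,                  // e.g. "Manufacturing", "Allocation"
    ProductionOrderDiagnosticResult Result);

public static class DiagnosticAggregation
{
    // Heuristic: the same risk tag raised independently by two or
    // more contexts is an early indicator worth highlighting.
    public static IReadOnlyList<string> SharedRiskTags(
        IEnumerable<ContextReport> reports) =>
        reports
            .SelectMany(r => r.Result.RiskTags.Select(tag => (r.BoundedContext, tag)))
            .GroupBy(x => x.tag)
            .Where(g => g.Select(x => x.BoundedContext).Distinct().Count() >= 2)
            .Select(g => g.Key)
            .ToList();
}
```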
So what is it?
This is not a new logging system.
It does not replace validation.
It does not replace QA.
It does not replace domain experts.
What it does is simpler and more honest:
It allows software to explain what it did, why it did it that way, and what resulted from it — in terms domain experts understand.
How does it work?
At its core, the system does three things:
- Capture the current state of a bounded context as a diagnostic snapshot
- Apply explicit domain expectations to that snapshot
- Produce a structured, human-readable diagnostic report
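These three steps can be sketched as a single flow. The pipeline class below is hypothetical; only `IProductionOrderDiagnostics`, the snapshot, and `AnalyseAsync` come from the code shown in this article:

```csharp
// Hypothetical orchestration: snapshot -> expectations -> report.
public sealed class DiagnosticPipeline
{
    private readonly IProductionOrderDiagnostics _diagnostics;

    public DiagnosticPipeline(IProductionOrderDiagnostics diagnostics)
        => _diagnostics = diagnostics;

    public async Task<ProductionOrderDiagnosticResult> RunAsync(
        ProductionOrderDiagnosticSnapshot snapshot,  // step 1: captured state
        CancellationToken ct)
    {
        // Steps 2 and 3: the domain expectations live in the prompt
        // inside the diagnostics service, which returns the report.
        return await _diagnostics.AnalyseAsync(snapshot, ct);
    }
}
```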
Diagnostic snapshot
public sealed record ProductionOrderDiagnosticSnapshot(
    Guid ProductionOrderId,
    Guid SystemId,

    // authoritative state
    Status AggregateStatus,
    int AggregateComponentCount,
    IReadOnlyList<ComponentDiagnostic> Components,

    // derived / projected view
    ProjectionSummary Projection
);
This snapshot is:
- immutable
- read-only
- intentionally flattened
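`ComponentDiagnostic` and `ProjectionSummary` are not shown here. A hedged sketch of what such flattened, read-only types could contain (shapes are illustrative, not the production definitions):

```csharp
// Illustrative shapes only: the real types are not shown in this article.
public sealed record ComponentDiagnostic(
    Guid ComponentId,
    Guid? ParentComponentId,   // null for root components
    string ComponentType,      // e.g. "FlowmeterStation", "Sensor"
    bool IsAllocated);

public sealed record ProjectionSummary(
    Status ProjectedStatus,    // derived view; may lag the authoritative state
    int ProjectedComponentCount);
```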
Diagnostics service
Note: For confidentiality reasons, the prompt in this article is a reduced version of what runs in production. Internal domain rules, thresholds, and historical comparisons have been removed. What matters here is the idea: turning domain knowledge into explicit, reviewable diagnostics.
using System.Text.Json;
using Azure.AI.OpenAI;
using OpenAI.Chat;

public sealed class ProductionOrderDiagnostics : IProductionOrderDiagnostics
{
    private readonly AzureOpenAIClient _client;

    public ProductionOrderDiagnostics(AzureOpenAIClient client)
    {
        _client = client;
    }

    public async Task<ProductionOrderDiagnosticResult> AnalyseAsync(
        ProductionOrderDiagnosticSnapshot snapshot,
        CancellationToken ct)
    {
        var systemPrompt = """
            You are a production-order diagnostic system used to support human quality assurance.

            Your task is to analyze a production order snapshot and identify
            potential risks, inconsistencies, or unusual patterns that may
            require human attention.

            You are NOT a decision-making system.
            You must NOT suggest fixes or changes.
            You must NOT invent missing data.

            ----------------------------------------------------------------------
            Inputs
            ----------------------------------------------------------------------
            You receive:
            1. A current production order snapshot (authoritative state)
            2. A summarized projection view (derived, may lag)
            3. Optional historical reference data describing typical past orders

            The current order is always the primary subject.
            Historical data is used only for comparison and confidence estimation.

            ----------------------------------------------------------------------
            Analysis dimensions
            ----------------------------------------------------------------------
            1. Structural consistency between authoritative state and projection
            2. Integrity of the component hierarchy
            3. Missing or orphaned components
            4. Allocation completeness and consistency
            5. Deviations from historically typical patterns
            6. Indicators of reduced production or inspection readiness

            ----------------------------------------------------------------------
            Severity classification
            ----------------------------------------------------------------------
            - Critical:
              Strong indicators of structural impossibility or data corruption.
            - Warning:
              Unusual patterns that differ from historical norms and may
              require closer inspection.
            - Info:
              Noteworthy deviations that are likely acceptable.

            ----------------------------------------------------------------------
            Response format
            ----------------------------------------------------------------------
            Respond strictly as JSON matching the following schema.
            No markdown. No explanation outside JSON.

            {
              "isConsistent": boolean,
              "riskTags": [string],
              "findings": [
                {
                  "severity": "Critical | Warning | Info",
                  "category": "Structure | Allocation | Pattern | Completeness | Consistency",
                  "confidence": "High | Medium | Low",
                  "message": string,
                  "explanation": string
                }
              ]
            }
            """;

        var userPrompt = JsonSerializer.Serialize(snapshot);
        var chatClient = _client.GetChatClient("gpt-4.1");

        var response = await chatClient.CompleteChatAsync(
            [
                new SystemChatMessage(systemPrompt),
                new UserChatMessage(userPrompt)
            ],
            cancellationToken: ct);

        var content = response.Value.Content[0].Text;

        return JsonSerializer.Deserialize<ProductionOrderDiagnosticResult>(
            content,
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true }
        ) ?? throw new InvalidOperationException("Invalid diagnostic response.");
    }
}
Example diagnostic report
This is an example of what the system produces for a single production order.
It is not a verdict. It is a structured explanation intended for QA and domain experts.
{
  "isConsistent": false,
  "riskTags": ["Structure", "Completeness"],
  "findings": [
    {
      "severity": "Critical",
      "category": "Structure",
      "confidence": "High",
      "message": "Incomplete FlowmeterStation detected",
      "explanation": "The production order contains a FlowmeterStation without any associated Flowmeter components. According to known assembly patterns, FlowmeterStations are expected to include at least one Flowmeter. This structure may indicate an incomplete or incorrectly specified assembly."
    },
    {
      "severity": "Info",
      "category": "Pattern",
      "confidence": "Medium",
      "message": "Sensor-only configuration is uncommon",
      "explanation": "The FlowmeterStation includes Sensors but no Flowmeters. While technically representable, this configuration is uncommon compared to historical production orders and may warrant additional verification during QA."
    }
  ]
}
Supporting human judgment
Quality decisions will always involve uncertainty.
The goal of this approach is not to eliminate that uncertainty, but to reduce it — by making the system’s internal understanding visible.
AI doesn’t replace human judgment here.
It supports it!
