Why SDD Breaks Down in Microservices—Part 3: Distributed Systems Need Distributed Context

:::tip
The third and final article in a three-part series.

:::

1. A recap: where spec-driven development broke, and why a contract was needed

In Part 1, I showed that spec-driven development with an LLM starts to slip once a feature runs through several microservices. Taken one at a time, each service looks clean; put together, the system does not behave the way it should. The reason is that the model loses cross-service context. The rules that live on the boundaries between services are not written down in one place, so the LLM skips over them. In Part 2, I built archspec: for every service it generates a machine-readable contract, SERVICE_MAP.yaml, that makes those rules explicit. In this part, I go back to the same feature and run it through /archspec:investigate on top of the contracts, to see whether the plan catches the cross-service bugs before any code is written.

A quick reminder of the setup. It is a Go project of 12 microservices for finding freelancers: gRPC for synchronous calls, a NATS broker for asynchronous events, and the same Clean Architecture inside every service. task-service and matching-service each have a Transactional Outbox, so a state change and its event are written together. The map is the same one I have used throughout the series:

The feature for the experiment is Smart Task Reassignment. If the selected freelancer declines an offer, the platform finds the next candidate on its own, sends them a fresh offer, and notifies the customer, instead of pushing the task into a manual queue. The reassignment rules are:

a freelancer’s decline triggers a new round of matching;
candidates are ranked by rating, and ties are broken by distance to the task’s city;
the customer gets a notification about the reassignment;
after three failed reassignments, the task moves to failed.

The first time around (task_1), I ran this feature without archspec: Claude read the local per-service CLAUDE.md files, planned on Sonnet 4.6, produced a plan of roughly 180 lines, and implemented it. Then I ran two independent reviews: Claude in a separate session, and Codex checking against a reference solution and a checklist. Both found the same group of cross-service bugs, and the full decline-and-reassign scenario did not hold together. The result was 6/10, about 64% on the checklist.

The task_1 bug classes, in short: direct calls to closed services that bypass worker-facade; invented methods that do not exist in the proto; the city passed as a name instead of city_id; a call loop review → worker-facade → review; N+1 instead of batch methods; an event published in a way that bypasses the Outbox. There was also a critical bug: a single match_id reused across all reassignments, which made notification-service drop the new offers as duplicates, so the feature never completed the end-to-end scenario. All of these bugs have one thing in common: the rules they break live between services and were never written down as a single constraint, so the LLM could not see them.

In Part 2, those rules became explicit in the contract: each one landed in the SERVICE_MAP.yaml of the service it belongs to. Now let’s see whether that changes the outcome when the LLM is given contracts and /archspec:investigate instead of local Markdown files.

2. Planning again: investigate on top of the contracts

In the second run (task_3) I change exactly two things relative to task_1. The feature prompt is the same: the same Smart Task Reassignment description as in the first part. The model is the same, Claude Sonnet 4.6 on medium reasoning. The difference is the environment. Now each of the 12 services has a SERVICE_MAP.yaml contract from Part 2, and instead of a free-form brainstorm I run /archspec:investigate. This is a read-only stage: it touches neither the code nor the contracts. The only file it writes is the plan.

2.1. The clarify gate: investigate asks first

investigate starts by reading a slice of the contracts of the affected services. Rather than jumping straight to planning, it first walks through the dimensions of ambiguity and asks clarifying questions: where the trigger comes from and through which public entry point, who owns the state, where the worker identifier comes from, what exactly the limit counts, which key entities are joined on, and what happens in the terminal branches. This clarification step stays read-only too and never touches the code.

Clarifying questions: investigate walks through the dimensions of ambiguity

What it asked on this pass:

whether a reference / golden spec was needed to check naming against; I answered skip;
where the decline trigger comes from; we decided it is a new HTTP endpoint in api-gateway;
what “at most 3 reassignments” means; 3 after the first offer;
how to resolve the geo tie-break; by the worker’s city_id and the task’s city_id, failing loudly if a city_id is missing.

These are exactly the spots where, in task_1, the model decided without asking and got it wrong: it took worker_id from the request body, passed the city as a name instead of an id, and landed an off-by-one on the limit. Here each of those spots is pulled out into an explicit question before planning begins.

2.2. The result: a plan artifact

investigate saves the plan to a separate file rather than leaving it in the chat. It is a working artifact that implement reads later.

I break down how the plan is built in Section 4; here I will just list what is inside:

which rules the feature relies on – the plan quotes specific lines from the contracts of the affected services, so every claim it makes can be re-checked against the contract;
open questions – places where the requirements allow more than one reading. For example, where to get the worker_id of the freelancer who declined. The plan collects these into a separate list and resolves them before any code;
which API changes are needed before the code – for example, adding a DeclineOffer method and a city_id field to the proto. Service interfaces change before the implementation does;
a diagram of the new scenario – how a reassignment goes through the services (shown below);
a diff of the contract edits – exactly what changes in the SERVICE_MAP.yaml of each affected service;
a check across all 12 services – for each new event, the plan finds who publishes it and who listens, so that no consumer is missed;
who owns what – for each piece of data, it records which service is responsible for it and who is allowed to change it;
a list of edge cases – each risk is written as its own item and tied directly to the test that will check it;
notes on the plan’s self-review and independent review.

Here is the rendered sequence diagram from the plan:

The reassignment sequence diagram from the investigate plan (in three parts)

This is the whole reassignment path that the plan drew on its own, from the client’s action to the end state. It is easier to read as three branches.

The normal decline. The freelancer taps “decline” – the request reaches api-gateway, which pulls worker_id out of the token and calls task-service. task-service itself, through its own outbox, publishes the offer.declined event. matching-service listens for it: it takes the next candidate from the already-computed list and sends them a fresh offer with a match.found event. In parallel, a notification goes to the customer that the task has been reassigned.

The limit is reached. If the task has already been reassigned three times, task-service does not start a new round of matching; it publishes task.failed, and the customer is notified that no one could be found.

Candidates run out before the limit. If no candidates are left in the list, matching-service sends match.exhausted, task-service moves the task to failed, and the customer is notified as well.

What to look for in this diagram. First, synchronous calls (gRPC) and asynchronous events (over NATS) are drawn differently, so you can see at a glance where one service calls another directly and where it goes through the broker. Second, every branch has an ending: no path stops halfway, and in every outcome the task changes status and the customer gets a notification. Those are precisely the spots where task_1 stumbled.

2.3. What the plan caught ahead of time

The plan stage closes the gaps that kept task_1 from completing the end-to-end scenario. The main failure in task_1 was the match_id collision: every reassignment got the same match_id, and notification-service discarded them as duplicates. Here:

a new match_id for each attempt, plus the TaskID dedup fallback removed from notification-service;
city_id is made required and threaded into the task.created payload;
offer.declined and task.failed are published from task-service’s outbox – the topology is fully event-driven, with no synchronous matching-service → task-service call;
a customer notification is added on reassignment and on task.failed;
the off-by-one on the limit is removed: reassignment_count < 3 is checked before the increment, which yields 4 offers and 3 reassignments.

Where the plan was unsure, it did not paper over it but surfaced the point as an open question. One of them was the trusted source of worker_id: the plan did not guess at it in code but raised it as a separate question, OQ-1.

2.4. improve plan

On the first pass, investigate did not produce a finished plan. Some decisions it would not make for me, so it collected them into a list of open questions; you can see them right in the plan file.

Open questions in the plan, and my request to refine it with improve plan

I worked through these questions and asked it to refine the plan with the improve plan command. investigate went through each one and wrote the decision straight into the plan. Here are the questions and what became of them:

| Question | Where the ambiguity is | How the plan resolved it |
|---|---|---|
| OQ-1: where to get the worker_id of the freelancer who declined | if it is read from the request body, any client can decline on behalf of someone else’s freelancer | take it from the token; full authorization is out of scope for this feature, and the decision is recorded as a deliberate risk in ADR-001 (how this played out in the code is in Section 3) |
| OQ-3: is city_id required or not | without city_id there is nothing to compute distance from, and the proximity tie-break does not work | make it required: CreateTask without city_id is rejected with INVALID_ARGUMENT |
| OQ-4: what to do if there are no candidates on the very first attempt | the task could hang forever in a non-final status | the same path as when the list is exhausted: match.exhausted → task.failed → notify the customer |

In total the plan went through 4 self-review passes and was accepted by an independent plan-review in 2 rounds.

2.5. Compared with task_1

A separate review scored the plan 9/10 against 6/10 for task_1. On the checklist, that is a jump from about 64% to about 98%. The critical match_id bug is closed at the plan stage. And the main point: in task_1 the plan never changed at all, while here it was genuinely refined; it checked itself and passed an independent review.

But the plan did miss one gap. It threaded the city_id field through all the services correctly. What it did not notice was that the values of that id are recorded differently across services (task-service stores city-msk, while geo and worker-profile store moscow). Later that is exactly what broke the geo calculation in the code (more on this in Section 3).

This does not mean the plan is bad. It is more a reason to improve the tool: it lacks a check that the code actually matches the plan. A green build does not catch a gap like this; it passes even when the ids across services do not line up.

3. Implementation: `implement` from the plan

The plan is approved; now it has to become code. I run the second command on the same model, Sonnet 4.6, medium reasoning:

Running /archspec:implement against the saved plan

The archplan is mandatory: without it, implement does not start and sends you back to investigate. Plan first, code second.

How `implement` turns the plan into code

I break down the mechanics in Section 4; here is just the outline, so the result makes sense. implement first applies a YAML patch to the SERVICE_MAP.yaml of the affected services and syncs the documentation – contracts change before the code. Then it builds an implementation plan in which every requirement is tied to a specific task and test. After that comes TDD implementation driven by the edge_cases entries, then several checks that the code matches the plan, and a run of /archspec:validate and /archspec:check-architecture. At the end, a separate agent that did not write the changes reviews them, and only then is there a commit.

The key thing at this step: implement does not just write code and stop. It has its own review loop that re-checks the implementation until there are no more findings. On this task, in two rounds it found and fixed several real bugs on its own: the reassignment limit fired on the wrong attempt, the customer got no notification, and one of the edge cases did not work. One problem implement chose not to fix, but did not hide either: worker_id still comes from the client rather than the token. That was recorded as a separate decision (ADR-001), a deliberate risk for a prototype. In the end, all 15 edge cases are covered by tests, and the build and tests are green.

A green build is necessary but not sufficient. So I ran two independent reviews over the diff: one with Claude, one with Codex. Both looked at the same code against the reference.

What came out of it

Both reviews agree that, architecturally, the task_3 solution is noticeably stronger than task_1. Here is what stands out against the first run:

Atomic outbox for offer.declined and task.failed. In task_1 the decline event was published directly by api-gateway; that was the main architectural violation. Here api-gateway only proxies gRPC, while task-service writes the state and the event atomically through DeclineAndPublish / UpdateWithEvent. This is the main fix relative to task_1.
A unique match_id for each attempt. In task_1 every reassignment got one match_id, and notification-service dropped them as duplicates, so the feature broke on the end-to-end scenario. In task_3, CreateAttemptPendingIfAbsent issues a new match-N for each (task_id, attempt) pair, and dedup in notification-service goes by match_id with no fallback to TaskID. The critical task_1 bug is gone, and the scenario runs all the way through.
Correct idempotency keys: (task_id, attempt) for offer.declined, match_id for match.found, task_id for task.failed.
city_id added to the domain and the proto and passed through into matching and geo – structurally something task_1 did not have at all.
A fully event-driven topology: task-service subscribes to match.found and match.exhausted, with no synchronous gRPC calls back.
A candidate snapshot: on reassignment the matching pipeline is not restarted; the next one is taken from the already-computed list.
The customer notification gets through, and the “candidates exhausted” path is handled.
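To make the match_id fix concrete, here is a minimal sketch of why a fresh id per attempt matters for a consumer that deduplicates by match_id. The id format, the `notifier` type, and its method are assumptions for illustration; the real code uses CreateAttemptPendingIfAbsent, whose implementation is not shown in this article.

```go
package main

import "fmt"

// matchID derives a distinct id for each (task_id, attempt) pair.
// The format is an assumption made for this sketch.
func matchID(taskID string, attempt int) string {
	return fmt.Sprintf("%s/attempt-%d", taskID, attempt)
}

// notifier is a hypothetical consumer that deduplicates strictly by match_id,
// with no fallback to the task id.
type notifier struct {
	seen map[string]bool
	sent []string
}

func newNotifier() *notifier { return &notifier{seen: make(map[string]bool)} }

// handleMatchFound sends an offer once per match_id and reports whether it was sent.
func (n *notifier) handleMatchFound(id string) bool {
	if n.seen[id] {
		return false // duplicate delivery, dropped
	}
	n.seen[id] = true
	n.sent = append(n.sent, id)
	return true
}

func main() {
	n := newNotifier()
	for attempt := 1; attempt <= 3; attempt++ {
		id := matchID("task-1", attempt)
		fmt.Println(id, n.handleMatchFound(id)) // every new attempt is delivered
	}
	// A redelivery of attempt 2 is dropped, which is what dedup is for.
	fmt.Println(n.handleMatchFound(matchID("task-1", 2)))
}
```

With one id shared by all attempts, as in task_1, the second and third offers would hit the `seen` branch and never reach the freelancer.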

On the reference checklist, that is a jump from about 64% (task_1) to about 93% (19.5/21 from Claude). Scores: Claude 8/10, Codex 6/10 as an eval solution (7/10 as a prototype).

Where it drifted from the plan

Both reviewers found the main discrepancy in the same place, at the core of the feature: the ranking logic.

An inverted tie-breaker. When geo is available, the code sorts candidates by distance only and loses the rating:

```go
// matching-service/usecase/matching.go - SortCandidatesByGeo
sort.SliceStable(entries, func(i, j int) bool {
    if entries[i].distance != entries[j].distance {
        return entries[i].distance < entries[j].distance // primary key: distance
    }
    return entries[i].w.ID < entries[j].w.ID             // tie-break: worker_id, not rating
})
out := make([]domain.MatchCandidate, len(entries))
for i, e := range entries {
    out[i] = domain.MatchCandidate{WorkerID: e.w.ID, Name: e.w.Name} // rating (Score) never makes it here
}
```

Because of this, “the next best freelancer” turns into “the closest one”: a candidate with a 3.0 rating but nearer to the task beats a candidate with 5.0. And that list is then reused on every decline, which means the reassignments themselves go by distance rather than by rating.

The plan itself was right, though: in it, distance is only a way to choose between candidates with the same rating. The bug is not in the plan but in the implementation, which made distance the primary criterion instead of rating.

A city_id mismatch. The services in the project keep data in memory and are seeded with demo records on startup (tasks, workers, cities). In that demo data, task-service records the task’s city as city-msk, while geo-service and worker-profile record it as moscow. When matching-service asks geo-service to compute distances, geo-service does not find city-msk among its cities and returns an error for the whole request. matching-service swallows that error and simply carries on without distances.

The plan has its own gap here: it threaded the city_id field through but did not pin down a single set of id values across services (see the end of Section 2). As a result, the data error hides the tie-breaker bug: the distance calculation never even runs. So the tie-break does not work either way – with the current data, geo-service answers with an error immediately, and if you align the ids, the already-broken sort from the previous point kicks in. You can see it right in the code: task-service emits one id while geo-service knows a different one:

```go
// task-service: the task publishes city_id = "city-msk"
{"task-1", ..., "Moscow", "city-msk"}

// geo-service: knows only "moscow", "spb", …; it has no "city-msk" entry
"moscow": {ID: "moscow", Name: "Moscow", RegionID: "moscow_region"}
```

There is another miss tied to city_id, and a more serious one. The field was made required in task-service, but api-gateway neither accepts nor forwards it, so creating a task through the public API now fails consistently, and there is no test for that path. A classic case: a new field added to one service but never threaded through to the public entry point.
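The city-msk versus moscow mismatch described above is the kind of gap a small seed-consistency check could catch. The sketch below is not part of archspec or the repository; the function name and the data shapes are assumptions. It only shows the idea: every city_id referenced by tasks or workers must exist in the geo catalogue.

```go
package main

import (
	"fmt"
	"sort"
)

// unknownCityIDs returns the city ids referenced by tasks or workers that the
// geo catalogue does not know, sorted for stable output.
func unknownCityIDs(geoCities map[string]bool, referenced ...[]string) []string {
	missing := map[string]bool{}
	for _, ids := range referenced {
		for _, id := range ids {
			if !geoCities[id] {
				missing[id] = true
			}
		}
	}
	out := make([]string, 0, len(missing))
	for id := range missing {
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Values mirror the demo data quoted in the article.
	geo := map[string]bool{"moscow": true, "spb": true}
	taskCities := []string{"city-msk"}
	workerCities := []string{"moscow", "spb"}

	fmt.Println(unknownCityIDs(geo, taskCities, workerCities)) // [city-msk]
}
```

Run as a test over the seed data of all services, a check like this would fail loudly instead of letting matching-service quietly continue without distances.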

The rest:

in the no-geo mode (when the service could not be reached) there is a hidden trap: the geo != nil check in Go passes even when the client is actually empty, so the next geo call still crashes matching-service;
a late or repeated match.found can bring an already-failed task back to “assigned” status: the HandleMatchFound handler does not check whether the task is already in a final status.
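The first trap in this list is a well-known Go behavior: an interface value that holds a nil pointer is itself not nil. The sketch below reproduces it in isolation. `GeoClient`, `grpcGeoClient`, and the wiring functions are hypothetical names, not the ones used in matching-service.

```go
package main

import "fmt"

// GeoClient is a hypothetical interface for the geo dependency.
type GeoClient interface {
	Distance(fromCity, toCity string) (float64, error)
}

// grpcGeoClient is a hypothetical concrete client.
type grpcGeoClient struct{ addr string }

// Distance dereferences c, so calling it on a nil *grpcGeoClient panics.
func (c *grpcGeoClient) Distance(fromCity, toCity string) (float64, error) {
	return 0, fmt.Errorf("geo at %s: not implemented in this sketch", c.addr)
}

// dialGeo simulates a connection attempt; on failure it returns a nil pointer.
func dialGeo(ok bool) *grpcGeoClient {
	if !ok {
		return nil
	}
	return &grpcGeoClient{addr: "geo:50051"}
}

// wireBroken returns the pointer directly. A nil *grpcGeoClient wrapped in the
// GeoClient interface still carries its type, so it does NOT compare equal to nil.
func wireBroken(ok bool) GeoClient {
	return dialGeo(ok)
}

// wireSafe converts a nil pointer into a true nil interface explicitly.
func wireSafe(ok bool) GeoClient {
	c := dialGeo(ok)
	if c == nil {
		return nil
	}
	return c
}

func main() {
	fmt.Println(wireBroken(false) != nil) // true: the guard passes, the next call panics
	fmt.Println(wireSafe(false) != nil)   // false: the no-geo branch is taken
}
```

The design choice is to normalize the value at the wiring point, so that every `geo != nil` guard further down means what it appears to mean.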

Green tests, uncovered logic

15 edge-case tests are green and the build is green, but the central ranking logic, “rating first, distance as the tie-break,” is covered by none of them. One test checks the degradation of a worker with no city_id; another, TestSortCandidatesByGeo_TieBreakByWorkerID, locks in the already-wrong sort behavior in isolation, with rating playing no part at all. There is no test of the form “worker A: rating 5.0, far away; worker B: rating 4.0, nearby – A should still come first.”
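For comparison, here is a minimal sketch of the ordering the plan asked for, together with the missing scenario. The `candidate` type and `rankCandidates` are hypothetical and simplified; they are not the repository's SortCandidatesByGeo.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical, simplified view of a matching candidate.
type candidate struct {
	WorkerID string
	Rating   float64
	Distance float64 // distance to the task's city, in km
}

// rankCandidates orders by rating (higher first); distance only breaks ties
// (nearer first); WorkerID keeps the result deterministic.
func rankCandidates(cs []candidate) []candidate {
	out := append([]candidate(nil), cs...) // do not mutate the caller's slice
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Rating != out[j].Rating {
			return out[i].Rating > out[j].Rating
		}
		if out[i].Distance != out[j].Distance {
			return out[i].Distance < out[j].Distance
		}
		return out[i].WorkerID < out[j].WorkerID
	})
	return out
}

func main() {
	ranked := rankCandidates([]candidate{
		{"B", 4.0, 2},   // nearby, lower rating
		{"A", 5.0, 900}, // far away, best rating
		{"C", 4.0, 1},   // same rating as B, nearer
	})
	for _, c := range ranked {
		fmt.Println(c.WorkerID) // A, then C, then B
	}
}
```

A test built on these three candidates fails against a distance-first sort, which is exactly the signal the green suite was missing.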

This is the same pattern as in task_1: the code passes CI while the defect sits in exactly the logic the tests step around. The difference is scale. In task_1 the whole feature broke; here it is one of the feature’s five requirements.

Takeaway

The feature works on the end-to-end scenario, unlike task_1. Decline, reassignment, new offer, customer notification, the limit of 3, task.failed – the flow is closed and tested. But one of the five requirements, the distance tie-breaker when ratings are equal, is functionally not met. The code diverged from the plan.

A good plan clearly raised quality and closed the critical task_1 bug before any code. But it did not guarantee that the code conformed: a green build and green unit tests both passed with the geo path broken. So the plugin needs more iterations and a stricter check that the code really matches the plan, not just a green build and tests. The idea works, all the same.

The solution code is in the task_3 branch: https://github.com/krus210/freelance-marketplace/tree/task_3

4. How it works: the principles behind investigate

The previous sections showed the result: the task_3 plan scored 9/10 against 6/10 for task_1, but one drift bug still made it into the code. To understand where that difference came from and why it fell short of ten, you have to look at how investigate works inside. I will break it down stage by stage.

It rests on a single idea: investigate tries to catch a bug as early as possible, at the plan stage rather than in finished code. In task_1 the feature was written out in full first, and only then did end-to-end tests and review catch it – by which point the match_id collision was already coded into several services. investigate moves the check to the moment when the LLM is still gathering requirements and drawing the plan: it is cheaper to catch a bug in a single line of the plan than in code spread across several services. The mechanism is not “one more free-form Markdown” but a machine-readable SERVICE_MAP.yaml contract on the input, plus the discipline of several stages, each closing its own specific class of bug from Part 1. And investigate stays read-only: the only file it writes is the plan artifact; it does not touch the code or the contracts.

What follows are the stages in order, and for each I note which bug it catches.

The contract as input, and removing ambiguity

investigate starts not from the code and not from scattered Markdown, but from the relevant parts of the SERVICE_MAP.yaml contract of the affected services: which methods the service has (api.endpoints), which events it publishes and which it consumes (events.published / events.consumed), and how it writes its state (consistency.write_path – for example, through an outbox). This closes the first class of bugs from Part 1: an incomplete picture of the cross-service rules. An LLM is bad at reconstructing architecture from code: some rules are not visible in the implementation, and some live only in the team’s heads. The contract gives them as an explicit list.

Next is the clarify gate. It is a ban on planning until nine dimensions of ambiguity are closed. Each dimension is about a specific class of bug:

An important rule: each question within a dimension has to be answered separately – answering one does not close the next. This is exactly where task_1 acted blindly: it asked no questions and took guesses for decisions.

The change diagram and the contract edits

investigate draws any flow that crosses a service boundary as a sequence diagram. A diagram like this shows everything a reviewer cares about at once: who calls whom, in what order, directly or through an event, and how each branch ends. For an offer decline, it looks like this:

The reassignment sequence diagram from the investigate plan

The value is that a diagram like this makes two typical cross-service bugs from Part 1 stand out at once. The first: a direct call between services where, by meaning, there should be an event through the broker. The second: a branch that breaks off halfway – the task does not move to a final status and the customer gets no notification. On an ordinary flowchart, that is harder to spot.

investigate only proposes the contract edits; it does not apply them – the stage stays read-only. And if an edit touches something shared – it changes which service owns the data, adds a publish that bypasses the outbox, or weakens an existing rule – investigate does not write it in itself but calls it out separately and asks for confirmation. Otherwise a generated contract line would quietly legitimize a design that no one approved.

Every participant in an event, and the data owners

The input contract slice is deliberately narrow. But for each new or changed event, investigate makes an exception and looks through all the contracts in the repository, every producer and consumer. Along the way it checks that the new dedup key is applied at every consumer, not just one, and that a single event does not carry two different semantic roles. This catches the “dedup was fixed in one consumer but forgotten in the next” class of bug, which is invisible if you look at a single service. The critical match_id bug from task_1 belongs to it: notification-service dropped the reused identifier as a duplicate, and the breakage could not be seen by looking at matching-service alone.
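The cross-contract scan can be pictured with a small sketch. The `contract` and `consumption` structs below are a simplified, hypothetical shape, not the real SERVICE_MAP.yaml schema; the point is only that the check walks every contract rather than one service.

```go
package main

import (
	"fmt"
	"sort"
)

// consumption describes one consumed event in a hypothetical, simplified contract.
type consumption struct {
	Event    string
	DedupKey string // empty means the consumer declares no dedup key
}

// contract is a hypothetical in-memory view of one service's contract.
type contract struct {
	Service   string
	Published []string
	Consumed  []consumption
}

// consumersMissingDedup scans ALL contracts and returns the services that
// consume the event without declaring the expected dedup key.
func consumersMissingDedup(all []contract, event, wantKey string) []string {
	var missing []string
	for _, c := range all {
		for _, cons := range c.Consumed {
			if cons.Event == event && cons.DedupKey != wantKey {
				missing = append(missing, c.Service)
			}
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	all := []contract{
		{Service: "matching-service", Published: []string{"match.found"}},
		{Service: "task-service", Consumed: []consumption{{"match.found", "match_id"}}},
		{Service: "notification-service", Consumed: []consumption{{"match.found", "task_id"}}},
	}
	fmt.Println(consumersMissingDedup(all, "match.found", "match_id")) // [notification-service]
}
```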

In parallel, investigate builds an ownership map: for each piece of state, which service owns it and has the right to change it. This catches the case where one service changes another’s data directly, bypassing its owner. For example, a service increments a task’s reassignment counter itself, even though the counter is owned by task-service and may only be changed through that service’s outbox. Any such bypass is flagged separately and requires confirmation.

Every risk becomes a test

This is the most important stage. Every risk, gap, ambiguity, and dead end that surfaced in the earlier steps becomes an edge_cases[] entry with a path to a test. Here is what an entry looks like in the archspec documentation (an illustrative example from the skill’s docs, not from the plan being discussed):

```yaml
edge_cases:
  - id: EC-014
    description: "worker city joins to geo by city_id, not free-text city_name; an unresolved city_id must fail loudly, never silently collapse the distance tie-breaker to a default"
    test: "services/matching-service/usecase/matching_geo_test.go::TestEC014"
```

Why this matters. The next agent, the one that writes the code, does not re-read the chat – it reads the contract. A note left only in the conversation gets lost. An edge_cases[] entry stays in the contract, and two checks hold it in place: the commit will not pass until the test file exists (DET-003), and the entry cannot be deleted without a separate decision (DET-007). That is how a risk that was found reaches the code as a test. task_1 was missing exactly this: the risks were talked through but written down nowhere, and they were lost.
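The idea behind the first of those two checks can be sketched in a few lines. How DET-003 is actually implemented is not shown in this article; the `edgeCase` struct, `missingTests`, and the injected `fileExists` function below are assumptions used only to illustrate "no test file, no commit".

```go
package main

import (
	"fmt"
	"strings"
)

// edgeCase mirrors the two fields of an edge_cases[] entry that matter here.
type edgeCase struct {
	ID   string
	Test string // "path/to/file_test.go::TestName"
}

// missingTests returns the ids of edge cases whose test file is absent.
// fileExists is injected so the sketch runs without touching the disk.
func missingTests(cases []edgeCase, fileExists func(path string) bool) []string {
	var missing []string
	for _, ec := range cases {
		path, _, _ := strings.Cut(ec.Test, "::")
		if path == "" || !fileExists(path) {
			missing = append(missing, ec.ID)
		}
	}
	return missing
}

func main() {
	cases := []edgeCase{
		{ID: "EC-014", Test: "services/matching-service/usecase/matching_geo_test.go::TestEC014"},
		{ID: "EC-015", Test: "services/task-service/usecase/limit_test.go::TestEC015"}, // hypothetical entry
	}
	// A fake file system for the sketch; a real hook could wrap os.Stat instead.
	onDisk := map[string]bool{
		"services/matching-service/usecase/matching_geo_test.go": true,
	}
	exists := func(p string) bool { return onDisk[p] }

	fmt.Println(missingTests(cases, exists)) // [EC-015]
}
```

Note what this check does not do: it proves the file exists, not that the test asserts the right thing, which is the gap Section 3 ran into.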

The self-review loop

Where the clarify gate checked the requirements, here investigate checks the plan it has drawn: it runs its draft against a list of 18 common mistakes and repeats the pass until it stops finding anything new (the first pass usually does find something). The list includes, among others:

on every reassignment, the whole matching pipeline (skill-analyzer, worker-facade, ratings, geo) is run again, even though it is enough to compute candidates once on the first attempt and save a snapshot for the rest;
a consumer takes the attempt number (attempt) from its in-memory state rather than from the event payload: after a service restart or a redelivery (replay) that state is empty, and an old event re-triggers the whole flow;
a handler sends two events that should land in one outbox commit (either both or neither) but adds them separately – if a failure happens between the writes, only the first goes out;
a service queries another one request per item (the classic N+1), even though the other has a batch method (GetWorkersBatch, GetDistancesBatch) that returns everything in a single call;
a service changes data owned by another service directly via a synchronous RPC, instead of letting the owner change it itself, on an event through its own outbox;
a command is marked idempotent in the contract (a repeat is safe), but the code does not back that up: on a redelivery of the same event, the second call is no different from the first – there is no dedup key and no CAS check, so the effect fires twice (for example, the reassignment counter grows by two);
the terminal branch (a dead end) was handled only for the last attempt, even though there may be no candidates as early as the first, and then the task hangs without moving to a final status.
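The "two events in one outbox commit" item from this list is worth a sketch. The in-memory `store` below stands in for a database transaction, and all type and function names are hypothetical; that offer.declined and task.failed are staged together is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated stands in for a crash between two writes.
var errSimulated = errors.New("simulated failure before the second event")

// store is an in-memory stand-in for the service's database plus outbox table.
type store struct {
	status map[string]string // task_id -> status
	outbox []string          // events waiting to be relayed to the broker
}

// tx stages state changes and events; nothing is visible until commit.
type tx struct {
	status map[string]string
	events []string
}

// update runs fn against a staging tx and applies everything at once.
// If fn returns an error, neither the state nor any event is applied.
func (s *store) update(fn func(t *tx) error) error {
	t := &tx{status: map[string]string{}}
	if err := fn(t); err != nil {
		return err
	}
	for id, st := range t.status {
		s.status[id] = st
	}
	s.outbox = append(s.outbox, t.events...)
	return nil
}

func main() {
	s := &store{status: map[string]string{"task-1": "offered"}}

	// A failure between the two writes: nothing goes out, the status is unchanged.
	err := s.update(func(t *tx) error {
		t.status["task-1"] = "failed"
		t.events = append(t.events, "offer.declined")
		return errSimulated
	})
	fmt.Println(err != nil, s.status["task-1"], len(s.outbox)) // true offered 0

	// The happy path: the state and both events land together.
	_ = s.update(func(t *tx) error {
		t.status["task-1"] = "failed"
		t.events = append(t.events, "offer.declined", "task.failed")
		return nil
	})
	fmt.Println(s.status["task-1"], s.outbox) // failed [offer.declined task.failed]
}
```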

The result is recorded as a separate line of the form Self-review: <N> pass(es), so you can see how many passes it took. In task_3 there were 4.

The plan is saved to a file and passes an independent review

The plan is written to a separate, dated file with the .archplan.md extension. The main lesson from task_1: a plan that lived only in the chat lost the topology and the invariants when implementation began. A file outlives the chat – implement reads the file, not the conversation history.

Then the plan goes through an independent review, and this is a real check, not a formality. Self-review is weak because it re-reads its own assumptions, so the plan is handed to a separate subagent with fresh context and no chat history. It is given only the plan file, all the SERVICE_MAP.yaml files, and the proto, and asked to look hard for reasons to reject the plan, by the same rubric investigate used on itself: are there any invented methods; is the expensive matching snapshotted rather than recomputed; is a batch method used where one exists; do events go through the owner’s outbox, without a synchronous write into someone else’s aggregate; is a new field threaded from the public entry through to every consumer and the demo data; are dedup keys in place at every consumer; is every dead end (including the first attempt) closed with a transition and a notification; does the diagram match the contract edits.

If the reviewer returns REVISE, the plan is fixed and handed to a new reviewer with fresh context, for no more than 3 rounds. In task_3 this produced Plan-review: APPROVED after 2 rounds. And if separate subagents cannot be launched, the review is honestly marked Plan-review: SELF-ONLY (the plan was only re-read by its own author) rather than passed off as APPROVED, so that it is visible there was no independent check.

investigate finishes with a Definition of Done checklist – an explicit list where a green go build / go test closes none of the items. An item is closed only when each edge_cases[] entry has a test that actually works, and when /archspec:validate (and, for cross-service changes, /archspec:check-architecture) is green. This is a direct answer to the task_1 trap, where a green build was passed off as done.

implement, briefly

implement is built on the same principle: check before the fact, not after. The archplan is mandatory for it – without an .archplan.md it does not start. Contracts change first: the YAML patch is applied and run through /archspec:sync before any code is written. Then it builds an implementation plan where every requirement is tied to a task and a test, and it separately checks (with grep) that every method being called really exists in the proto or the contract – there should be no invented methods. After that comes TDD implementation, task by task. And the key part: across the whole diff it runs 5 checks that the code matches the plan, each against its own class of bug, the very ones we saw in Section 3:

Wiring – at each service’s assembly point (main.go), no nil is passed in place of a dependency, and every client talks to the right port. Catches a panic on the very first event due to a dependency that was never wired in.
Emission – every event from the contract is published on every path where it is needed, including reassignment, not only on the first match. Catches “match.found is sent only on the first match.”
Threading – a new field is threaded from the public API through to every consumer and the demo data, and the declared route matches the router character for character. This is exactly the class that broke task creation in task_3: city_id was added to the internal proto but never threaded through api-gateway.
Dedup – the dedup mark is set atomically with the side effect (or after it, but never before), and a redelivery of the second attempt’s event is traced separately through each consumer’s dedup. Catches the match_id collision from task_1.
Evidence – a table of “requirement → a specific spot in the code (file:line) → test.” Catches a requirement that simply was not written, like “rating first, distance as the tie-break” in task_3.
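To make one of these classes concrete, here is a small Go sketch of the ordering the Dedup pass looks for: the mark is written only after the side effect succeeds, so a failed delivery can be retried. Everything in it is illustrative. The in-memory map and mutex stand in for a database transaction, and keying by task and attempt number (`DedupKey`) is my assumption about how a second attempt can be kept distinct from the first, not the key the demo project uses.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTransient = errors.New("transient delivery failure")

// DedupKey is a hypothetical key scheme: one key per reassignment attempt,
// so the second attempt's event is not mistaken for a duplicate of the first.
func DedupKey(taskID string, attempt int) string {
	return fmt.Sprintf("%s:%d", taskID, attempt)
}

// Consumer remembers which events it has already handled.
type Consumer struct {
	mu   sync.Mutex
	seen map[string]bool
	send func(key string) error // the side effect, e.g. a notification
}

func NewConsumer(send func(key string) error) *Consumer {
	return &Consumer{seen: make(map[string]bool), send: send}
}

// Handle performs the side effect at most once per key. The mark is set
// after a successful side effect and never before it; the lock makes the
// check, the side effect, and the mark one critical section.
func (c *Consumer) Handle(key string) (performed bool, err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[key] {
		return false, nil // duplicate delivery: skip
	}
	if err := c.send(key); err != nil {
		return false, err // not marked, so a redelivery will retry
	}
	c.seen[key] = true
	return true, nil
}

func main() {
	failFirst, sent := true, 0
	c := NewConsumer(func(key string) error {
		if failFirst {
			failFirst = false
			return errTransient
		}
		sent++
		return nil
	})

	first := DedupKey("task-42", 1)
	_, err := c.Handle(first)
	fmt.Println("first delivery error:", err)
	done, _ := c.Handle(first)
	fmt.Println("redelivery performed side effect:", done)
	done, _ = c.Handle(first)
	fmt.Println("duplicate performed side effect:", done)
	done, _ = c.Handle(DedupKey("task-42", 2))
	fmt.Println("second attempt performed side effect:", done)
	fmt.Println("side effects performed:", sent)
}
```

If the mark were set before the call to `send`, the failed first delivery would be remembered as handled and the redelivery would be silently dropped.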

Each pass exists because the corresponding bug really did ship to production with green unit tests. After the passes come /archspec:validate, /archspec:check-architecture, and an independent review of the diff by a fresh subagent. If a separate reviewer cannot be launched, the result – as at the plan stage – is honestly marked SELF-ONLY: the code was reviewed by the same agent that wrote it, not by an independent one.

Three layers of checks

The checks are arranged from fast to heavy: an instant automated layer cuts off trivial errors before it comes to the slow agent reviews with their LLM calls:

The whole pipeline looks like this:

In short: investigate is not a generator of pretty Markdown. It is an attempt to pin down, as a contract, the cross-service rules that used to surface only in review or in production, and to carry every finding through to the code as a test rather than leaving it as prose in the chat. Each stage closes a specific class of bug from task_1: the clarify gate handles guesses without questions; the check across all contracts handles a forgotten dedup in a neighboring service; the edge_cases bridge handles a finding lost in the chat; plan-review handles the weakness of self-checking. But a set of stages is only as good as its checks are strict. And as Section 3 showed, the weak link right now is the check that the code matches the plan: the plan threaded the city_id field through, but no stage forced it to pin down a single set of city_id values across services, and in the code that came out as city-msk versus moscow with green tests. More on that in the conclusions.
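One way such a check could look, as a sketch only: collect the city_id values each service actually uses and report any value that another service does not know. The `VocabularyDrift` helper is hypothetical, and which service held city-msk and which held moscow is assumed here purely for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// VocabularyDrift takes the set of values each service uses for one field
// and returns, per service, the values used elsewhere that it does not know.
// An empty result means every service shares a single vocabulary.
func VocabularyDrift(valuesByService map[string][]string) map[string][]string {
	union := map[string]bool{}
	for _, values := range valuesByService {
		for _, v := range values {
			union[v] = true
		}
	}
	drift := map[string][]string{}
	for service, values := range valuesByService {
		own := map[string]bool{}
		for _, v := range values {
			own[v] = true
		}
		var missing []string
		for v := range union {
			if !own[v] {
				missing = append(missing, v)
			}
		}
		if len(missing) > 0 {
			sort.Strings(missing)
			drift[service] = missing
		}
	}
	return drift
}

func main() {
	// Assumed assignment of values to services, for illustration only.
	drift := VocabularyDrift(map[string][]string{
		"task-service":     {"city-msk"},
		"matching-service": {"moscow"},
	})
	for service, missing := range drift {
		fmt.Printf("%s does not know city_id values %v\n", service, missing)
	}
}
```

Unit tests inside one service cannot see this mismatch, because each service is consistent with itself; the comparison only means something across service boundaries.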

5. Conclusion

The series tested one hypothesis: that an LLM’s result on the same feature changes if, before implementation, it is given machine-readable architecture contracts instead of local CLAUDE.md files. Smart Task Reassignment went through two modes of work: task_1 without archspec, and task_3 on top of the contracts from Part 2. The feature prompt and the planning model were the same; only the input differed.

The main result is the contrast between the two runs. In task_1 the plan was never refined at all, and the critical bug went into the implementation unnoticed – about 64% on the reference checklist. In task_3 the prompt is the same, but on top of the contracts the plan went through genuine refinement (self-review and an independent review) and closed that bug before any code. On the checklist the plan rose to about 98%, the code to about 93%. A good plan clearly raises quality.

But task_3 also showed the limit of the method: even a strong plan does not guarantee that the code matches it. In two places the implementation diverged from the plan – in one the code did the wrong thing, in another the plan left something unwritten. And, more importantly, the build was green and all 15 tests passed, even though one of the requirements worked incorrectly. A green build and green tests do not yet mean the feature works.

From there, the conclusion about the tool. The value of investigate is that it moves the check earlier, before any code: the input is a contract rather than free text; then come clarifying questions, the search for every consumer of each event, turning every risk into a test, and an independent review of the plan. The problem is looked for before implementation, not after. But task_3 also points to the next step: add a check that the code really matches the plan. The approach already works and already pays off, and closing this gap is the obvious next step.

What you can take away even without the plugin:

a machine-readable service contract as the input to a task, instead of free-form Markdown that drifts from the code and goes stale;
resolving ambiguities before the plan: the entry point, the state owner (system of record), the source of identity, numeric limits, dedup keys;
checking architectural invariants already at the plan stage: events go through the owner’s outbox, there is no synchronous write into someone else’s aggregate, every terminal branch is closed with a transition and a notification;
tracing every new event across all the contracts: every producer, every consumer, and a dedup key at each consumer;
every risk (edge case) pinned down by a test right in the contract, rather than left in the chat;
an independent review of the plan: a separate reviewer with fresh context deliberately looks for violations of these rules before implementation begins.
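Even without the plugin, the event-tracing item is easy to automate. The sketch below assumes a trimmed-down, hypothetical `Contract` shape (it is not the SERVICE_MAP.yaml schema) and uses notification-service as an invented consumer name.

```go
package main

import (
	"fmt"
	"sort"
)

// Contract is a hypothetical, trimmed-down view of one service contract.
type Contract struct {
	Service  string
	Produces []string
	Consumes map[string]string // event name -> dedup key ("" if none declared)
}

// TraceEvent walks every contract and reports what is missing for one event:
// a producer, at least one consumer, and a dedup key at each consumer.
func TraceEvent(event string, contracts []Contract) []string {
	var problems []string
	producers, consumers := 0, 0
	for _, c := range contracts {
		for _, e := range c.Produces {
			if e == event {
				producers++
			}
		}
		if key, ok := c.Consumes[event]; ok {
			consumers++
			if key == "" {
				problems = append(problems, c.Service+" consumes "+event+" without a dedup key")
			}
		}
	}
	if producers == 0 {
		problems = append(problems, "no producer declares "+event)
	}
	if consumers == 0 {
		problems = append(problems, "no consumer declares "+event)
	}
	sort.Strings(problems)
	return problems
}

func main() {
	contracts := []Contract{
		{Service: "matching-service", Produces: []string{"match.found"}},
		{Service: "task-service", Consumes: map[string]string{"match.found": "match_id"}},
		// Invented consumer that forgot to declare a dedup key.
		{Service: "notification-service", Consumes: map[string]string{"match.found": ""}},
	}
	for _, p := range TraceEvent("match.found", contracts) {
		fmt.Println("PROBLEM:", p)
	}
}
```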

The value here is not in generating code but in the discipline of the stages. These principles work on their own, and archspec turns them into automated checks, so that they rest on a tool rather than on the team’s memory.

archspec is open source; you can try it as a plugin for Claude Code. If you find a bug, an awkward workflow, or a missing rule, open an issue.

All the links:

archspec: https://github.com/krus210/archspec
freelance-marketplace (demo project): https://github.com/krus210/freelance-marketplace
the task_3 solution branch: https://github.com/krus210/freelance-marketplace/tree/task_3

That brings the series to a close – thanks for reading.

You can also discuss this with me on LinkedIn or X.