LLM Drift in Long Sessions: Claude 60% vs 85% Integrity

Source: Hacker News / Calmkeep.ai | Type: AI/LLM Evaluation | Rating: ★★★★☆

Key finding: In long Claude coding sessions (25+ turns), structural integrity fell from 100% to 60% without a continuity layer, but only to 85% with one. Without the layer, the model progressively abandoned architectural patterns it had established earlier in the same session.

Methodology

Two transcripts were evaluated using identical task prompts:

Transcript A: Generated directly in the Claude app
Transcript B: Generated using Claude via API with Calmkeep continuity layer

Both were audited using a structured "Compliance & Integrity Audit" prompt.

Architecture Laws Established (Turns 1-5)

Law	Description
LAW-01	Module-Based Architecture (vertical slicing)
LAW-02	Service Layer Owns All DB Access
LAW-03	Org-Scoped Queries - Every query MUST include org_id
LAW-04	Centralized Error Classes (custom AppError hierarchy)
LAW-05	Env Config - Centralized Fail-Fast from config/env.ts
LAW-06	Prisma as Sole ORM (No Raw SQL)
LAW-07	Validation - Schema-First (Zod adopted mid-session)
LAW-08	Single Source of Truth for Shared Logic
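To make the laws more concrete, here is a minimal sketch of what LAW-03 and LAW-04 might look like in practice. It is not taken from the transcripts: the Project type, the NotFoundError subclass, the getProject function and the in-memory array (standing in for Prisma) are all hypothetical, and only the AppError name comes from the report.

```typescript
// Hypothetical sketch of LAW-04 (centralized error classes) and
// LAW-03 (org-scoped queries). Not code from the audited transcripts.

class AppError extends Error {
  constructor(message: string, public readonly statusCode: number) {
    super(message);
    // Keeps instanceof working when compiling to older JS targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumed subclass; the report only says a custom hierarchy exists.
class NotFoundError extends AppError {
  constructor(resource: string) {
    super(`${resource} not found`, 404);
  }
}

interface Project {
  id: string;
  orgId: string;
  name: string;
}

// In-memory stand-in for the database (the real project uses Prisma).
const projects: Project[] = [
  { id: "p1", orgId: "org-a", name: "Alpha" },
  { id: "p2", orgId: "org-b", name: "Beta" },
];

// LAW-02/LAW-03: the service owns data access, and orgId is a required
// argument, so a caller cannot issue an unscoped lookup.
function getProject(orgId: string, id: string): Project {
  for (const project of projects) {
    if (project.orgId === orgId && project.id === id) {
      return project;
    }
  }
  throw new NotFoundError("Project");
}
```

Making orgId a required parameter means a cross-organization lookup fails the same way as a missing record, which is one common way to enforce this kind of rule at the type level.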

Results

Metric	Transcript A (No Continuity)	Transcript B (With Continuity)
Total AVEs	8	3
Drift Coefficient	40%	15%
Final Integrity	60%	85%
Decay Onset Turn	T8	T23
Post-T14 Backslide	YES	NO
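The numbers in the table are consistent with Final Integrity being the complement of the Drift Coefficient, and with each AVE costing the same 5 points in both transcripts. The report summary does not state its formula, so the sketch below only checks that arithmetic; the function names are hypothetical.

```typescript
// Arithmetic check on the results table. The relationships below are
// inferred from the published numbers, not from a documented formula.

// Final Integrity appears to be 100% minus the Drift Coefficient.
function finalIntegrity(driftPercent: number): number {
  return 100 - driftPercent;
}

// Drift per recorded AVE, in percentage points.
function driftPerAve(driftPercent: number, aveCount: number): number {
  return driftPercent / aveCount;
}

const integrityA = finalIntegrity(40); // Transcript A
const integrityB = finalIntegrity(15); // Transcript B

// Gap between the two transcripts, in percentage points.
const gapPoints = integrityB - integrityA;

console.log(integrityA, integrityB, gapPoints); // 60 85 25
console.log(driftPerAve(40, 8), driftPerAve(15, 3)); // 5 5
```

Note that the gap is 25 percentage points; relative to Transcript A's 60%, that is an improvement of roughly 42%.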

Critical insight: After the Zod migration at T14, Transcript A backslid. New modules reintroduced raw parseInt for pagination instead of the schema validation adopted in the refactor, so the model effectively "forgot" its own change.
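The difference between the two pagination styles can be sketched as follows. This is an illustration under assumptions, not code from either transcript: parsePagination is a hand-rolled stand-in for the Zod schema (so the example has no dependencies), and the defaults and limits (page 1, limit 20, maximum 100) are invented.

```typescript
type Query = Record<string, string | undefined>;

interface Pagination {
  page: number;
  limit: number;
}

// Anti-pattern: ad hoc parsing inside the service layer.
// "abc" becomes NaN, and "-5" or "10abc" are silently accepted.
function rawPagination(query: Query): Pagination {
  return {
    page: parseInt(query.page ?? "1", 10),
    limit: parseInt(query.limit ?? "20", 10),
  };
}

// Schema-first stand-in: one place defines what valid input is.
// Throws on invalid input, similar in spirit to a schema's parse step.
function parseBoundedInt(
  raw: string | undefined,
  name: string,
  fallback: number,
  min: number,
  max: number
): number {
  if (raw === undefined) return fallback;
  if (!/^\d+$/.test(raw)) {
    throw new Error(`${name} must be a whole number`);
  }
  const value = Number(raw);
  if (value < min || value > max) {
    throw new Error(`${name} must be between ${min} and ${max}`);
  }
  return value;
}

// Hypothetical bounds; the report does not specify them.
function parsePagination(query: Query): Pagination {
  return {
    page: parseBoundedInt(query.page, "page", 1, 1, 10000),
    limit: parseBoundedInt(query.limit, "limit", 20, 1, 100),
  };
}
```

With the raw version, every new module has to remember to re-check for NaN and out-of-range values, which is exactly the kind of rule that drifts over a long session. The schema version rejects bad input once, at the boundary.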

Common Violations in Transcript A

Inline Manual Validation - Body Cast Pattern (repeating pre-Zod anti-pattern)
Raw parseInt Pagination in Service Layer (ignoring Zod middleware)
Filter Validation Duplication (two sources of truth)
roleHierarchy Re-Definition (duplicate from middleware/requireRole.ts)
Raw Role String Array Check bypassing can() permissions system
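The last two violations both bypass a shared permissions module. A rough sketch of the contrast, assuming a rank-based roleHierarchy and a can() helper: the role names, action names and minimum-role table below are hypothetical, and only the identifiers roleHierarchy and can() appear in the report.

```typescript
// Hypothetical roles and actions, for illustration only.
type Role = "viewer" | "member" | "admin";
type Action = "project:read" | "project:write" | "org:manage";

// Single source of truth (LAW-08). In the audited project this lives in
// middleware/requireRole.ts; redefining it elsewhere is the violation.
const roleHierarchy: Record<Role, number> = {
  viewer: 0,
  member: 1,
  admin: 2,
};

// Assumed mapping from action to the lowest role allowed to perform it.
const minimumRole: Record<Action, Role> = {
  "project:read": "viewer",
  "project:write": "member",
  "org:manage": "admin",
};

// Central permission check that every module is expected to call.
function can(role: Role, action: Action): boolean {
  return roleHierarchy[role] >= roleHierarchy[minimumRole[action]];
}

// Anti-pattern: a raw role string array check inside a handler.
// It has to be updated by hand whenever roles or rules change.
function canWriteProjectRaw(role: string): boolean {
  return ["member", "admin"].indexOf(role) !== -1;
}
```

Both functions agree today, but only can() stays correct if a new role is added to roleHierarchy, which is why the audit counts the inline check as drift even when it produces the right answer.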

Implications for AI Coding Agents

Continuity matters: A simple continuity layer raised final integrity by 25 percentage points (60% to 85%)
Architectural drift is real: Models abandon patterns even after explicitly refactoring toward them
Self-correction is limited: Violations persist even after the model identifies them itself
Context window limitations: Long sessions cause "forgetting" of rules set early in the session

Good news: Transcript B maintained 100% integrity through T20, with only 3 minor violations in T23-24. This suggests architectural patterns can be maintained over long sessions with the right context management.

URL: https://calmkeep.ai/codetestreport