AutoBE vs. Claude Code: 3rd-gen coding agent developer's review of the leaked source code

TL;DR

  1. Claude Code—source code leaked via an npm incident
    • while(true) + autonomous selection of 40 tools + 4-tier context compression
    • A masterclass in prompt engineering and agent workflow design
    • 2nd generation: humans lead, AI assists
  2. AutoBE: the opposite design
    • 4 ASTs x 4-stage compiler x self-correction loops
    • Function Calling Harness: even small models produce backends on par with top-tier models
    • 3rd generation: AI generates, compilers verify
  3. After reading—shared insights, a coexisting future
    • Independently reaching the same conclusions: reduce the choices; give workers self-contained context
    • 0.95^400 ~ 0%—the shift to 3rd generation is an architecture problem, not a model performance problem
    • AutoBE handles the initial build, Claude Code handles maintenance—coexistence, not replacement

Recommended reading: Function Calling Harness—a deep dive into the technique that turned 6.75% into 100%

1. The Incident

April 2026. A screenshot started circulating through developer communities. An Anthropic engineer had run npm publish without a .npmignore, and Claude Code's entire source code had been uploaded to the npm registry.

512,000 lines. 1,900 files. The complete internal architecture of the world's most widely used AI coding agent, exposed by a single missing configuration file.

Anthropic took the package down within hours, but by then countless developers had already downloaded the source. Reddit, Hacker News, X—timelines were flooded with Claude Code source analysis. Some shared the system prompts. Others dissected the security architecture. Others mapped out the structure of the while(true) loop.

We cleared our schedules—we had no choice.

AutoBE was at an inflection point. We were about to layer serious orchestration on top of a pipeline we had intentionally kept simple (more on this in Section 3). We needed to study how other AI agents designed their orchestration.

Then Anthropic's packaging mistake handed us the reference architecture. It couldn't have come at a better time; it felt like receiving a gift.

Claude Code was deeper than we expected—not just a large project, but an entire worldview. Seven recovery paths inside a while(true) loop. Four-tier context compression. Twenty-three security check categories. Over 400KB of security code for BashTool alone.

The deeper we dug, the clearer it became why we built things differently.

This post is those reading notes.

2. What is AutoBE

AutoBE is an open-source AI agent that automatically generates backends. Say "build me a shopping mall backend," and it produces everything from requirements analysis to database design, API specification, E2E tests, and NestJS implementation code, all at once.

Because Function Calling Harness and AI-native compilers uniformly guarantee the quality of generated output, even small models like qwen3.5-35b-a3b can produce backends on par with top-tier models—at a fraction of the cost.

AutoBE currently supports the TypeScript / NestJS / Prisma stack.

Expansion to other languages and frameworks begins in July 2026.

2.1. The LLM Doesn't Write Code

Most AI coding agents tell the LLM "write this code" and save the returned text to a file. AutoBE is different.

AutoBE uses Function Calling. Instead of free-form text, the LLM fills in a predefined JSON Schema—an AST (Abstract Syntax Tree). It's not writing on a blank page; it's filling in a form. Once the form is filled, a compiler validates it and transforms it into actual code. The LLM fills in the structure; the compiler writes the code.

This principle applies across the entire 5-stage pipeline:

Stage          | Structure the LLM fills                            | Compiler validation
---------------|----------------------------------------------------|---------------------
Requirements   | AutoBeAnalyze (structured SRS)                     | Structure validation
DB Design      | AutoBeDatabase (DB schema AST)                     | Database Compiler
API Design     | AutoBeOpenApi (OpenAPI v3.2 spec)                  | OpenAPI Compiler
Testing        | AutoBeTest (30+ expression types)                  | Test Compiler
Implementation | Modularized code (Collector/Transformer/Operation) | Hybrid Compiler

Each AST strictly constrains what the LLM can generate. For example, AutoBeDatabase allows only 7 field types: "boolean" | "int" | "double" | "string" | "uri" | "uuid" | "datetime". You can't use "varchar"—it simply isn't an option. The schema is the prompt—unambiguous, model-independent, and mechanically verifiable.
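To make the "fill in a form, then compile" idea concrete, here is a minimal sketch. Everything in it is hypothetical and far simpler than AutoBE's real AutoBeDatabase AST and Database Compiler: the names RawModel, Model, validateModel, emitPrisma, and compile are ours for illustration, and the mapping from the 7 field types to Prisma scalar types is an assumption.

```typescript
// Hypothetical, simplified stand-in for a database AST; not AutoBE's real types.
type FieldType = "boolean" | "int" | "double" | "string" | "uri" | "uuid" | "datetime";

interface Model {
  name: string;
  fields: { name: string; type: FieldType; nullable: boolean }[];
}

// What arrives from the LLM before validation: the type is still an unchecked string.
interface RawModel {
  name: string;
  fields: { name: string; type: string; nullable: boolean }[];
}

// Assumed mapping from the closed set of AST types to Prisma scalar types.
const PRISMA_TYPE: Record<FieldType, string> = {
  boolean: "Boolean",
  int: "Int",
  double: "Float",
  string: "String",
  uri: "String",
  uuid: "String",
  datetime: "DateTime",
};

// Step 1: validate the filled-in form and report every violation at once.
function validateModel(raw: RawModel): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  for (const field of raw.fields) {
    if (!/^[a-z][a-z0-9_]*$/.test(field.name))
      errors.push(`${raw.name}.${field.name}: name must be snake_case`);
    if (!(field.type in PRISMA_TYPE))
      errors.push(`${raw.name}.${field.name}: unknown type "${field.type}"`);
    if (seen.has(field.name))
      errors.push(`${raw.name}.${field.name}: duplicated field`);
    seen.add(field.name);
  }
  return errors;
}

// Step 2: only a validated form becomes source text; the LLM never writes this text.
function emitPrisma(model: Model): string {
  const lines = model.fields.map(
    (f) => `  ${f.name} ${PRISMA_TYPE[f.type]}${f.nullable ? "?" : ""}`,
  );
  return [`model ${model.name} {`, ...lines, "}"].join("\n");
}

function compile(raw: RawModel): { ok: true; code: string } | { ok: false; errors: string[] } {
  const errors = validateModel(raw);
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, code: emitPrisma(raw as Model) }; // cast is safe after validation
}
```

In this sketch a "varchar" field never reaches emitPrisma: it is rejected in step 1 with a message the model can act on.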

2.2. Why Function Calling

"Can't you just have the LLM write text code directly?"

For frontend, maybe. If a button is slightly misplaced or an animation feels off, the app still works. On mobile, you can patch after launch. But backends are different.

Backend development isn't a domain of creativity—it's a domain of logic and precision. If a single API returns the wrong type, every client breaks. If one foreign key is missing, data integrity is gone. If two APIs define the same entity differently, the system is internally contradictory. A frontend bug is an inconvenience; a backend bug is an outage—the backend is the single source of truth that every client depends on. Consistency and 100% correctness are non-negotiable prerequisites, not nice-to-haves.

Free-form text generation cannot structurally meet this requirement.

2.2.1. Uncontrollable

Can you enforce consistency through prompts? "Don't use varchar," "don't use any types," "don't create utility functions"—this is the pink elephant problem. Tell someone "don't think of a pink elephant," and the first thing they do is picture one. Tell an LLM "don't do X," and X lands at the center of attention, actually increasing the probability of generating it. Natural language can only express constraints through prohibition, and prohibition is structurally incomplete.

export namespace AutoBeDatabase {
  export interface IForeignField {
    name: string & SnakeCasePattern; // enforce snake_case naming
    type: "uuid";
    relation: IRelation;
    unique: boolean;
    nullable: boolean;
  }
  export interface IPlainField {
    name: string & SnakeCasePattern;
    type: // restrict type by spec, not by prohibition rule
      | "boolean"
      | "int"
      | "double"
      | "string"
      | "uri"
      | "uuid"
      | "datetime";
    description: string;
    nullable: boolean;
  }
}

Function Calling solves this at the root. The LLM isn't writing on a blank page; it's filling in a predefined form. There are only 7 field types; API specs follow the OpenAPI v3.2 schema; test logic can only be expressed within the 30+ variants of IExpression. It's not "don't use varchar": varchar simply doesn't exist as an option. Not prohibition, but absence. Communicate through types and there's no misunderstanding; constrain through schemas and there's no pink elephant.

2.2.2. The Compound Effect

The math of backends is unforgiving. Consider a service with 50 tables and 400 APIs. All 400 APIs must succeed for the server to run. Total success rate = (per-unit success rate)^n:

At 95%, even 50 APIs make it virtually impossible. At 99%, 400 APIs still yield only 1.8%. Only 100% survives.

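Per-unit success rate | 10 APIs | 50 APIs | 100 APIs | 400 APIs
----------------------|---------|---------|----------|---------
95%                   | 59.9%   | 7.7%    | 0.6%     | ~ 0%
99%                   | 90.4%   | 60.5%   | 36.6%    | 1.8%
99.9%                 | 99.0%   | 95.1%   | 90.5%    | 67.0%
100%                  | 100%    | 100%    | 100%     | 100%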
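The table can be reproduced with a few lines of arithmetic. This is a sketch that assumes each API succeeds or fails independently, which is the simplification the formula (per-unit success rate)^n already makes; the function names are ours.

```typescript
// Probability that all n independent units succeed, given per-unit success rate p.
function totalSuccessRate(p: number, n: number): number {
  return Math.pow(p, n);
}

// Format a rate in [0, 1] as a percentage with one decimal place.
function formatPercent(rate: number): string {
  return `${(rate * 100).toFixed(1)}%`;
}

const perUnitRates = [0.95, 0.99, 0.999, 1.0];
const apiCounts = [10, 50, 100, 400];

// Print one row per per-unit rate, one column per API count.
for (const p of perUnitRates) {
  const row = apiCounts.map((n) => formatPercent(totalSuccessRate(p, n)));
  console.log(`${formatPercent(p)} per unit -> ${row.join(", ")}`);
}
```

The only row that does not decay as n grows is the one where p is exactly 1, which is the whole argument of this section.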

This is the structural limitation of free-form text generation. Hand a coding assistant a backend with 50 tables and 400 APIs, and you'll get output. 0 to 80 is fast. The scaffolding is great, individual functions are well-written. But getting 400 APIs to be mutually consistent, with every FK properly connected and shared types uniform across all endpoints—that's 80 to 100, a region that free-form text generation structurally cannot reach. As long as each API's success rate is 95%, total success converges to 0 as the API count grows. A human could review all 400 one by one, but then what's the point of AI?

Function Calling fundamentally solves this compound problem. The form is fixed, so variance is zero; a compiler validates the form, so per-unit success rate converges to 100%. 1.0^400 = 1.0. On top of that, a 4-stage compiler guarantees system-level consistency: cross-validation between DB schema and API spec, uniformity of shared types across APIs, detection of circular dependencies between modules. If validation fails, a self-correction loop repeats until it passes.
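The shape of such a self-correction loop can be sketched in a few lines. This is not AutoBE's implementation: generate stands in for an LLM call, validate stands in for a compiler, and both signatures are hypothetical. The text says the loop repeats until validation passes; the maxAttempts cap here is our own assumption so the sketch always terminates.

```typescript
// Result of validating one filled-in form: accepted data, or a list of violations.
type Validation<T> =
  | { success: true; data: T }
  | { success: false; errors: string[] };

// Hypothetical signatures: an LLM call that sees prior feedback, and a compiler check.
type Generate = (feedback: string[]) => Promise<unknown>;
type Validate<T> = (candidate: unknown) => Validation<T>;

async function selfCorrect<T>(
  generate: Generate,
  validate: Validate<T>,
  maxAttempts: number,
): Promise<{ data: T; attempts: number }> {
  let feedback: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = await generate(feedback);
    const result = validate(candidate);
    if (result.success) return { data: result.data, attempts: attempt };
    // Feed the violations back so the next attempt repairs only what failed.
    feedback = result.errors;
  }
  throw new Error(`validation still failing after ${maxAttempts} attempts`);
}
```

The important property is that the exit condition is mechanical: the loop ends because a validator accepted the output, not because the model declared itself done.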

2.2.3. Variance

LLM output is a sample drawn from a probability distribution. Run the same model with the same prompt and you get different code every time—different variable names, different patterns, different error handling approaches. Swap the model and the differences grow larger. Claude leans functional, GPT leans class-based, Qwen has its own idioms. This variance is richness in creative writing, but a defect in backends.

When the form is fixed, variance vanishes. The AST schema uniformly governs the model's "style," and the compiler verifies the result, so the model's personality has minimal impact on the final output. The benchmarks prove this:

The backends generated by qwen3.5-35b-a3b (3B active) and claude-sonnet-4.6 have nearly identical architecture, module structure, and naming conventions. Strong models converge in 1-2 iterations; weaker models converge in 3-4—but the destination is the same. Different models, same result. Run it again, same result. This is the consistency that backends demand, and Function Calling is the only approach that can structurally guarantee it.

2.3. Industry Consensus: "That Won't Work"

But the forms the LLM must fill are far from simple. AutoBeOpenApi.IJsonSchema, which defines DTO types, is a recursive union type with 10 variants:

export type IJsonSchema =
  | IJsonSchema.IBoolean
  | IJsonSchema.IInteger
  | IJsonSchema.INumber
  | IJsonSchema.IString
  | IJsonSchema.IArray      // items: IJsonSchema <- recursive
  | IJsonSchema.IObject     // properties: Record<string, IJsonSchema> <- recursive
  | IJsonSchema.IReference
  | IJsonSchema.IOneOf      // oneOf: IJsonSchema[] <- recursive
  | IJsonSchema.INull
  | IJsonSchema.IConstant;

Ten variants nested 3 levels deep yield 1,000 possible paths.

The test stage is even more complex. AutoBeTest.IExpression, which represents E2E test logic, has over 30 recursive variants—programming-language-level complexity packed into a single Function Call:

export type IExpression =
  | IBooleanLiteral   | INumericLiteral    | IStringLiteral     // literals
  | IArrayLiteralExpression  | IObjectLiteralExpression          // compound literals
  | INullLiteral      | IUndefinedKeyword                       // null/undefined
  | IIdentifier       | IPropertyAccessExpression               // accessors
  | IElementAccessExpression | ITypeOfExpression                 // access/operations
  | IPrefixUnaryExpression   | IPostfixUnaryExpression           // unary operations
  | IBinaryExpression                                            // binary operations
  | IArrowFunction    | ICallExpression    | INewExpression      // functions
  | IArrayFilterExpression   | IArrayForEachExpression           // array operations
  | IArrayMapExpression      | IArrayRepeatExpression            // array operations
  | IPickRandom       | ISampleRandom      | IBooleanRandom     // random generation
  | IIntegerRandom    | INumberRandom      | IStringRandom      // random generation
  | IPatternRandom    | IFormatRandom      | IKeywordRandom     // random generation
  | IEqualPredicate   | INotEqualPredicate                      // assertions
  | IConditionalPredicate    | IErrorPredicate;                  // assertions

This is the actual complexity of the form the LLM must accurately fill in a single Function Call.
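To show what "compiling" such an expression tree means, here is a toy version. The five-variant Expr type and the render function are hypothetical and deliberately tiny; the real AutoBeTest.IExpression has over 30 variants and its property names may differ from the ones assumed here.

```typescript
// Hypothetical five-variant subset of an expression AST, recursive like the real one.
type Expr =
  | { type: "numericLiteral"; value: number }
  | { type: "stringLiteral"; value: string }
  | { type: "identifier"; text: string }
  | { type: "binaryExpression"; left: Expr; operator: "+" | "-" | "===" | "!=="; right: Expr }
  | { type: "callExpression"; callee: Expr; args: Expr[] };

// Turn a validated tree into TypeScript source text, one variant at a time.
function render(expr: Expr): string {
  switch (expr.type) {
    case "numericLiteral":
      return String(expr.value);
    case "stringLiteral":
      return JSON.stringify(expr.value); // handles quoting and escaping
    case "identifier":
      return expr.text;
    case "binaryExpression":
      return `(${render(expr.left)} ${expr.operator} ${render(expr.right)})`;
    case "callExpression":
      return `${render(expr.callee)}(${expr.args.map(render).join(", ")})`;
  }
}

// The LLM would fill in a tree like this; the source string is derived from it.
const example: Expr = {
  type: "callExpression",
  callee: { type: "identifier", text: "assertEqual" },
  args: [
    {
      type: "binaryExpression",
      left: { type: "identifier", text: "count" },
      operator: "+",
      right: { type: "numericLiteral", value: 1 },
    },
    { type: "stringLiteral", value: "ok" },
  ],
};
console.log(render(example));
```

Because every branch of the switch is a known variant, syntax errors in the emitted code are ruled out by construction; what remains to check is whether the tree itself is well-typed.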

qwen3-coder-next's first-attempt success rate on IJsonSchema: 6.75%. The industry consensus is clear—NESTFUL (EMNLP 2025) measured GPT-4o's nested tool calling accuracy at 28%, and JSONSchemaBench (ICLR 2025) reported success rates of 3-41% on the hardest tier across 10,000 real-world schemas. BoundaryML went further, arguing that structured output actually degrades a model's reasoning ability. The consensus: don't do Function Calling with complex schemas.

We couldn't give up. Without structured output, mechanical verification is impossible; without verification, feedback loops are impossible; without feedback loops, guarantees are impossible.

So we built the Function Calling Harness. Typia's 3-tier infrastructure is at its core:

All three tiers are auto-generated by Typia's compiler from TypeScript type definitions. Developers only need to define TypeScript types—the Function Calling schema, parse() recovery logic, validate() checker, and LlmJson.stringify() feedback generator all derive from the same type. A single type governs schema, parsing, validation, and feedback simultaneously.

2.3.1. parse() — Recovering Broken JSON

LLMs aren't JSON generators. They wrap output in markdown code blocks, prepend "I'd be happy to help!", leave brackets unclosed, omit quotes on keys, and write tru instead of true. The Qwen 3.5 series is worse: it double-serializes every union type field with 100% probability. Here is a real production response that contained 7 simultaneous issues:

import { dedent } from "@typia/utils";
import typia, { ILlmApplication, ILlmFunction, tags } from "typia";

const app: ILlmApplication = typia.llm.application<OrderService>();
const func: ILlmFunction = app.functions[0];

// LLM sometimes returns malformed JSON with wrong types
const llmOutput = dedent`
  > LLM sometimes returns some prefix text with markdown JSON code block.

  I'd be happy to help you with your order! 😊

  \`\`\`json
  {
    "order": {
      "payment": "{\\"type\\":\\"card\\",\\"cardNumber\\":\\"1234-5678", // unclosed string & bracket
      "product": {
        name: "Laptop", // unquoted key
        price: "1299.99", // wrong type (string instead of number)
        quantity: 2, // trailing comma
      },
      "customer": {
        // incomplete keyword + unclosed brackets
        "name": "John Doe",
        "email": "[email protected]",
        vip: tru
  \`\`\` `;

const result = func.parse(llmOutput);
if (result.success) console.log(result);

interface IOrder {
  payment: IPayment;
  product: {
    name: string;
    price: number & tags.Minimum<0>;
    quantity: number & tags.Type<"uint32">;
  };
  customer: {
    name: string;
    email: string & tags.Format<"email">;
    vip: boolean;
  };
}

type IPayment =
  | { type: "card"; cardNumber: string }
  | { type: "bank"; accountNumber: string };

declare class OrderService {
  /**
   * Create a new order.
   *
   * @param props Order properties
   */
  createOrder(props: { order: IOrder }): { id: string };
}

A single call to func.parse() recovers all 7 issues:

  • Markdown block & prefix chatter -> stripped
  • Unclosed string & bracket ("1234-5678) -> auto-completed
  • Unquoted key (name:) -> accepted
  • Trailing comma (quantity: 2,) -> ignored
  • Incomplete keyword (tru) -> completed to true
  • Wrong type ("1299.99") -> coerced to 1299.99 according to the schema
  • Double serialization ("{\"type\":\"card\"...) -> recursively restored to object
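To give a feel for how schema-guided recovery can work, here is a minimal sketch covering three of the seven cases: prefix and markdown stripping, type coercion, and double serialization. It is not Typia's implementation. The Schema type and the functions extractJsonText, coerce, and lenientParse are hypothetical, and the sketch relies on the strict JSON.parse, so it does not repair unclosed brackets, unquoted keys, trailing commas, or incomplete keywords.

```typescript
// Hypothetical mini-schema; a real one would be derived from TypeScript types.
type Schema =
  | { type: "string" }
  | { type: "number" }
  | { type: "boolean" }
  | { type: "object"; properties: Record<string, Schema> };

// Drop prefix chatter and the markdown fence by keeping the outermost {...} span.
function extractJsonText(raw: string): string {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end < start) throw new Error("no JSON object found");
  return raw.slice(start, end + 1);
}

// Coerce a parsed value toward what the schema expects, recursing into objects.
function coerce(value: unknown, schema: Schema): unknown {
  if (schema.type === "number" && typeof value === "string") {
    const n = Number(value);
    return value.trim() !== "" && Number.isFinite(n) ? n : value;
  }
  if (schema.type === "boolean" && typeof value === "string") {
    if (value === "true") return true;
    if (value === "false") return false;
    return value;
  }
  if (schema.type === "object") {
    // Double serialization: the schema wants an object but a JSON string arrived.
    let current: unknown = value;
    if (typeof current === "string") {
      try {
        current = JSON.parse(current);
      } catch {
        return value; // leave it untouched for the validator to report
      }
    }
    if (typeof current !== "object" || current === null || Array.isArray(current))
      return current;
    const source = current as Record<string, unknown>;
    const output: Record<string, unknown> = { ...source };
    for (const [key, child] of Object.entries(schema.properties)) {
      if (key in source) output[key] = coerce(source[key], child);
    }
    return output;
  }
  return value;
}

function lenientParse(raw: string, schema: Schema): unknown {
  return coerce(JSON.parse(extractJsonText(raw)), schema);
}
```

The design point is that the schema decides what gets repaired: a string is only re-parsed or converted when the expected type at that path says it should not be a string.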

2.3.2. validate() + LlmJson.stringify() — Precision Feedback

Even after parsing, the values themselves can be wrong. Negative prices, non-email strings, decimals where integers are expected. When validate() detects a schema violation, LlmJson.stringify() generates inline // ❌ error markers on top of the LLM's original JSON:

{
  "order": {
    "payment": {
      "type": "card",
      "cardNumber": 12345678 // ❌ [{"path":"$input.order.payment.cardNumber","expected":"string"}]
    },
    "product": {
      "name": "Laptop",
      "price": -100, // ❌ [{"path":"$input.order.product.price","expected":"number & Minimum<0>"}]
      "quantity": 2.5 // ❌ [{"path":"$input.order.product.quantity","expected":"number & Type<\"uint32\">"}]
    },
    "customer": {
      "name": "John Doe",
      "email": "invalid-email", // ❌ [{"path":"$input.order.customer.email","expected":"string & Format<\"email\">"}]
      "vip": "yes" // ❌ [{"path":"$input.order.customer.vip","expected":"boolean"}]
    }
  }
}

The LLM only needs to fix the errors marked on its own output—no need to rewrite everything, just fix the 5 flagged fields. Precise, structured, and immediately actionable feedback.
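For intuition, inline feedback of this kind can be produced by walking the value and attaching each violation to the line that holds the offending field. The sketch below is our own simplified reconstruction, not how LlmJson.stringify() is implemented; the Violation shape and the annotate function are hypothetical, and arrays are kept on a single line to keep it short.

```typescript
interface Violation {
  path: string; // e.g. "$input.order.product.price"
  expected: string; // e.g. "number & Minimum<0>"
}

// Pretty-print a value as JSON, appending a comment to every line whose field
// has at least one violation. The comment goes after the comma, as in the
// annotated output shown earlier.
function annotate(
  value: unknown,
  violations: Violation[],
  path = "$input",
  indent = "",
): string {
  if (typeof value !== "object" || value === null || Array.isArray(value))
    return JSON.stringify(value);
  const entries = Object.entries(value as Record<string, unknown>);
  if (entries.length === 0) return "{}";
  const inner = indent + "  ";
  const lines = entries.map(([key, child], index) => {
    const childPath = `${path}.${key}`;
    const comma = index < entries.length - 1 ? "," : "";
    const own = violations.filter((v) => v.path === childPath);
    const marker = own.length > 0 ? ` // ❌ ${JSON.stringify(own)}` : "";
    const rendered = annotate(child, violations, childPath, inner);
    return `${inner}${JSON.stringify(key)}: ${rendered}${comma}${marker}`;
  });
  return `{\n${lines.join("\n")}\n${indent}}`;
}
```

Because the model sees its own output with markers attached, the repair prompt stays the same size as the original answer plus one comment per mistake.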

This loop is what turns 6.75% into 100%. On top of that, AutoBE's 4-stage compiler (Database -> OpenAPI -> Test -> TypeScript) adds system-level self-correction loops. Dual validation at the Function Calling level and the compiler level is what drives 100% compilation success.

3. Why This Moment

3.1. Intentionally Kept Simple

AutoBE had never paid close attention to agent orchestration. Intentionally.

We kept the workflow in its simplest possible form: one-directional waterfall, one round of AI self-review, one shot at code generation. We also intentionally banned large models, running repeated experiments with small ones (qwen3-30b-a3b, 3B active). Three reasons.

3.1.1. Stability

We needed to measure each pipeline stage's success rate in isolation. Complex orchestration makes it difficult to identify which stage failed. In a simple pipeline, "FK references broke in the Database stage" is clear. In complex orchestration, it becomes "something went wrong somewhere."
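A one-directional pipeline makes that attribution almost trivial to implement. The sketch below is illustrative only: Stage, StageResult, and runWaterfall are hypothetical names, and real stages would be asynchronous LLM and compiler calls rather than plain functions.

```typescript
type StageResult =
  | { ok: true; output: unknown }
  | { ok: false; reason: string };

interface Stage {
  name: string;
  run: (input: unknown) => StageResult;
}

interface WaterfallReport {
  completed: string[]; // stages that passed, in order
  failedAt?: string; // the first stage that failed, if any
  reason?: string;
}

// Run stages strictly in order and stop at the first failure, so a failure is
// always attributed to exactly one stage.
function runWaterfall(stages: Stage[], input: unknown): WaterfallReport {
  const completed: string[] = [];
  let current = input;
  for (const stage of stages) {
    const result = stage.run(current);
    if (!result.ok) return { completed, failedAt: stage.name, reason: result.reason };
    completed.push(stage.name);
    current = result.output;
  }
  return { completed };
}
```

With no later stage allowed to patch an earlier one, the report either names a single culprit or names none.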

3.1.2. Debugging

The more stages in which AI intervenes autonomously, the harder it becomes to trace failure causes, and the difficulty grows exponentially. When Agent A corrects something, Agent B touches it again, and Agent C modifies that result, the root cause gets buried.

3.1.3. Preventing Weakness Concealment

Smart AI and sophisticated workflows mask the system's vulnerabilities. If the Database stage generates a flawed schema but the subsequent Interface stage's AI silently compensates, you never discover the Database stage's weakness. Vulnerabilities exposed by small models also exist in large models—they just surface less often. "Less often" becomes "occasionally" in production, and "occasionally" becomes an outage.

So we deliberately tightened only the validation at each stage, using small models, a simple pipeline, and minimal AI intervention.

3.2. Breaking 100% and Rebuilding

We had previously achieved 100% compilation + runtime success rate. Then we deliberately broke it to rebuild at a higher level of quality.

3.2.1. Divide and Conquer

AutoBE's first goal was simple: generate each API function independently. No code reuse, no inter-function dependencies, each function self-contained. If 10 functions query the same table, all 10 contain the same duplicated query.

You can't run before you walk. We first needed to prove, in the simplest possible form, that the Function Calling Harness worked, that the compiler feedback loop achieved self-correction, and that 100% was reachable even with small models.

And we proved it. 100% compilation, 100% runtime. Even with small models. The foundation works.

3.2.2. The Output Wasn't Software

After hitting 100% compilation and runtime, we looked at the output. It compiled and ran—but it wasn't maintainable software. Adding a column to a table meant regenerating all 10 related functions. Changing requirements meant rebuilding from scratch. Without code reuse, the output could be generated but couldn't evolve.

The next mission was clear: move to a structure that enables code reuse—where functions call other functions, shared logic converges in one place, and requirement changes only require modifying what changed.

3.2.3. Breaking It

So we broke 100%.

Introducing inter-module dependencies caused the success rate to plummet to 40%. Problems that didn't exist with independent functions erupted all at once—the moment functions call each other, one function's mistake breaks another. Return types don't match, imports get tangled, dependency ordering falls apart. A microcosm of the compound effect from Section 2.2—when 100 modules depend on each other, each module's 95% success rate converges to 0% at the system level.
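One of the failure modes named here, dependency ordering, is something a compiler can check mechanically. The sketch below shows a standard depth-first topological sort with cycle detection; it is a generic technique, not AutoBE's code, and the DependencyGraph shape and orderModules name are hypothetical.

```typescript
// Hypothetical module graph: each key lists the modules it depends on.
type DependencyGraph = Record<string, string[]>;

// Depth-first topological sort. Returns an order in which every module comes
// after its dependencies, or the first cycle that makes such an order impossible.
function orderModules(graph: DependencyGraph): { order: string[] } | { cycle: string[] } {
  const state = new Map<string, "visiting" | "done">();
  const order: string[] = [];
  const stack: string[] = [];

  const visit = (name: string): string[] | null => {
    if (state.get(name) === "done") return null;
    if (state.get(name) === "visiting")
      return [...stack.slice(stack.indexOf(name)), name]; // found a back edge
    state.set(name, "visiting");
    stack.push(name);
    for (const dependency of graph[name] ?? []) {
      const cycle = visit(dependency);
      if (cycle) return cycle;
    }
    stack.pop();
    state.set(name, "done");
    order.push(name); // dependencies are pushed before their dependents
    return null;
  };

  for (const name of Object.keys(graph)) {
    const cycle = visit(name);
    if (cycle) return { cycle };
  }
  return { order };
}
```

A reported cycle such as a -> b -> a is exactly the kind of precise, structured error that can be fed back to the model instead of a vague "imports are tangled".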

From 100% to 40%. Climbing back took months. We strengthened the compiler, refined the correction loops, and improved the Harness.

We reached 100% compilation again. Runtime 100% is still being restored.

3.3. Time to Get Sophisticated

At this point, we had fully achieved 100% compilation. Runtime 100% was still in progress.

This is when we declared:

"With 100% compilation secured as our foundation, it's time to start getting sophisticated."

Introduce agent self-review loops. Refine the prompts. Add sophistication to the orchestration. No matter how sophisticated you make a workflow without a verification foundation, it's nothing more than an elaborate dice roll. Lay the verification foundation first, then build the workflow on top—we were convinced this was the right order.

To do that, we needed to seriously study how other AI agents designed their orchestration.

That's exactly when the Claude Code source code leaked.

4.
