
How to Detect AI-Generated vs. Human-Written Code at the Commit Level


Legacy SEI tools guess whether code is AI-generated from IDE telemetry. First-party attribution — native tool telemetry and commit metadata — gives you deterministic, per-file ground truth without monitoring developers.


Your SEI dashboard says 40% of your team’s code is “AI-assisted.” What does that actually mean? In most cases, it means someone had Copilot open in their editor while they typed. That’s not measurement. That’s inference from proximity.

Tools like Waydev and Jellyfish track IDE plugin activation, session duration, and acceptance rates. They know a developer had an AI tool running. They don’t know which lines it wrote. A developer who rejected every Copilot suggestion in a two-hour session still registers as “AI-assisted” in these systems. A developer who pasted Claude’s output from a browser tab registers as fully manual.

This is the proxy metrics problem. You’re making staffing, tooling, and process decisions based on data that confuses “tool was present” with “tool wrote the code.” The gap between those two things is where bad decisions live.

There’s a more precise approach: instrument the AI’s actual tool calls. When Claude Code edits a file, creates a file, or writes to a notebook, that action is a discrete, recorded event — which tool, which file, how many lines. Not a guess based on whether the plugin was active. Not a commit trailer you could grep for. Native telemetry from the AI itself, captured per-file and per-tool-call, then piped through an OpenTelemetry pipeline into a dashboard that aggregates AI ratio alongside cycle time, friction signals, and cost per task. That’s the difference between proximity metrics and ground-truth attribution.

The Taxonomy of AI Attribution Methods

Not all “AI tracking” works the same way. There are three fundamentally different approaches, and most teams don’t realize their tool uses the weakest one.

Inference-based attribution relies on IDE telemetry — plugin activation logs, session duration, suggestion acceptance rates. Waydev, Jellyfish, and most SEI platforms operate here. The data tells you a developer had an AI tool open. It doesn’t tell you what the tool actually produced. A developer who rejects every suggestion for an hour gets counted. A developer who uses Claude in a terminal window or browser tab gets missed entirely. Inference also requires installing plugins or browser extensions on developer machines, which is where the “this feels like spyware” conversations start — and once that trust erodes, good luck getting accurate self-reported data either.

Heuristic-based attribution tries to identify AI-written code after the fact using style analysis or a secondary AI model that grades whether code “looks” machine-generated. This is clever in theory and unreliable in practice. Developers edit AI output before committing. AI models trained on human code produce human-looking code. The false positive rate is high enough that the data isn’t actionable — you end up arguing about whether the detector is right instead of making decisions about your team’s AI adoption.

First-party attribution records what actually happened. When Claude Code’s Edit tool modifies a file, that event — tool name, file path, line count — is captured as native telemetry at the source. No inference from plugin activity. No after-the-fact style guessing. The AI’s own tool calls are the record. For environments where native telemetry isn’t available, Co-Authored-By commit trailers serve as a fallback signal at the commit level. First-party attribution is the only deterministic method because it doesn’t interpret behavior or analyze output. It logs the action as it happens.

How Commit-Level Attribution Actually Works

Deterministic attribution isn’t one mechanism. It’s two, layered by granularity, with the more precise tier taking priority when available.

Tier 1: Native Tool Telemetry (Primary)

When Claude Code modifies your codebase, it does so through explicit tool calls — Edit, Write, NotebookEdit. Each call is a discrete, structured event: which tool fired, which file it touched, how many lines it produced. This is per-file, per-action attribution. The AI isn’t inferring what it wrote after the fact. It logged the action as it happened.

These events flow through a standard OpenTelemetry pipeline into an analytics store — the same OTEL infrastructure your team likely already uses for application observability. From there, the data is queryable per session, per developer, per task, or per time window. You get file-level AI ratio breakdowns without installing separate monitoring plugins or browser extensions on developer machines and without analyzing code style.
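As a sketch of what this looks like downstream, here is how per-file AI ratio might be aggregated from tool-call events once they land in the analytics store. The event fields and data are illustrative, not Claude Code's actual telemetry schema:

```python
from collections import defaultdict

# Hypothetical tool-call events as they might land in the analytics store.
# Field names are illustrative, not the actual telemetry schema.
events = [
    {"tool": "Edit",  "file": "api/webhooks.py",       "lines_added": 42},
    {"tool": "Write", "file": "tests/test_webhooks.py", "lines_added": 88},
    {"tool": "Edit",  "file": "api/webhooks.py",       "lines_added": 7},
]

# Lines the developer added manually per file (e.g. derived from git diffs).
human_lines = {"api/webhooks.py": 51, "tests/test_webhooks.py": 0}

def ai_ratio_by_file(events, human_lines):
    ai = defaultdict(int)
    for e in events:
        ai[e["file"]] += e["lines_added"]
    # Ratio of AI-written lines to all added lines, per file.
    return {f: ai[f] / (ai[f] + human_lines.get(f, 0)) for f in ai}

ratios = ai_ratio_by_file(events, human_lines)
print(ratios)  # api/webhooks.py -> 0.49, tests/test_webhooks.py -> 1.0
```

Because each event already names the tool, file, and line count, the aggregation is a straight sum — no classification step, no confidence threshold.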

Tier 2: Commit Metadata (Fallback)

When native telemetry isn’t available — different tooling, offline work, or environments where OTEL collection isn’t configured — the system falls back to reading commit trailers in git history:

```
feat: add retry logic to payment webhook handler

Implement exponential backoff for failed Stripe webhook deliveries.
Max 3 retries with jitter.

Co-Authored-By: Claude <noreply@anthropic.com>
```

This isn’t a custom convention. Claude Code already writes this trailer on every commit it produces. The data is already in your git log — most teams just aren’t reading it programmatically.
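Reading it programmatically is a few lines of work. A minimal sketch, assuming commit messages have already been pulled out of `git log` (the sample messages below are invented for illustration):

```python
import re

# Sample commit messages, e.g. extracted from:
#   git log --format='%H%x1e%B'  (split records on the \x1e separator)
commits = [
    "feat: add retry logic\n\nCo-Authored-By: Claude <noreply@anthropic.com>",
    "fix: correct jitter bounds",
    "docs: update runbook\n\nCo-authored-by: Claude <noreply@anthropic.com>",
]

# Git trailer keys are case-insensitive; match at the start of a line.
TRAILER = re.compile(r"^co-authored-by:.*claude", re.IGNORECASE | re.MULTILINE)

def is_ai_commit(message: str) -> bool:
    return bool(TRAILER.search(message))

ai_count = sum(is_ai_commit(m) for m in commits)
print(f"{ai_count}/{len(commits)} commits carry the AI trailer")  # 2/3
```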

Attribution at this tier is per-commit, not per-file. If N out of M commits in a session carry the trailer, then N/M of total added lines are attributed to AI — proportional by commit count, not redistributed across individual commits. Less granular than Tier 1, but still deterministic. The AI ratio itself is calculated server-side from the collected data — it’s not embedded in the commit.
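The proportional math is simple enough to show directly. A sketch with invented session data:

```python
# Each tuple: (lines_added, carries_ai_trailer) -- invented session data.
commits = [(120, True), (30, False), (50, True), (200, False)]

total_added = sum(lines for lines, _ in commits)          # 400
ai_commits = sum(1 for _, trailer in commits if trailer)  # 2 of 4

# Tier 2 attribution is proportional by commit count: 2/4 of the
# session's added lines are credited to AI, regardless of how large
# each individual trailer-bearing commit actually was.
ai_lines = total_added * ai_commits / len(commits)
print(ai_lines)  # 200.0
```

Note the deliberate imprecision: the two AI commits here added 170 lines, but the method credits 200, because it only knows the trailer count, not which lines came from which commit.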

What the two tiers share: both are first-party. The AI tool itself is the source of truth in both cases — not a secondary observer watching IDE activity. Both are deterministic — no probabilistic guessing, no style analysis. And neither requires monitoring developer behavior. No browser extensions, no keystroke logging, no screen capture.

The difference is resolution. Native telemetry knows which tool touched which file and how many lines per call. Commit metadata knows which commits involved AI, and approximates the rest. When both are available, Tier 1 wins.

The First-Party Data Playbook

Once you have deterministic attribution data, the question shifts from “how much AI code do we have” to “is the AI writing code in the right places.” Here’s the workflow.

  • Evaluate AI ratio by domain. A 75% AI ratio on API scaffolding, CRUD endpoints, and test boilerplate is a team operating well — that’s exactly where AI saves the most time with the least risk. A 75% AI ratio on your payment processing logic, distributed consensus layer, or auth middleware is a different conversation. That code needs closer review, not because AI can’t write it, but because the cost of a subtle bug there is orders of magnitude higher. Segment your AI ratio by directory or module, not just by developer. The aggregate number is almost useless without domain context.

  • Correlate with DORA metrics. Segment deployment frequency and lead time by AI-heavy vs. manual tasks. You want to answer a specific question: does AI-assisted work ship faster without creating more rework? Track completed tasks per day and average task duration, split by AI ratio quartile. If your highest-AI-ratio tasks also have the shortest lead times and aren’t generating more bug tickets, that’s real signal. If they ship fast but come back as defects, the AI is generating velocity without quality — and you’ve caught it with data instead of gut feeling.

  • Identify context gaps. If AI ratio drops to zero on a specific microservice or module, the AI lacks context for that area and the developer is doing everything manually. Tandemu detects this by cross-referencing frequently changed files against memory coverage per folder and surfaces the gaps in the CLI during task selection and on the memory dashboard. Low memory coverage means the AI is flying blind. The fix isn’t “use AI more” — it’s adding architectural context so the AI can actually contribute. The modules where AI ratio is lowest are often the ones that would benefit from it most.
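The first step above — segmenting AI ratio by directory rather than by developer — is a small grouping exercise once per-file attribution data exists. A minimal sketch with invented numbers, grouping by top-level directory:

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Per-file (ai_lines, total_added_lines) pairs -- illustrative data, as
# might be exported from the attribution pipeline.
files = {
    "api/routes/users.py":       (300, 400),
    "api/routes/orders.py":      (250, 300),
    "payments/stripe_client.py": (40, 120),
    "payments/ledger.py":        (20, 100),
    "tests/test_users.py":       (500, 520),
}

def ai_ratio_by_dir(files):
    agg = defaultdict(lambda: [0, 0])
    for path, (ai, total) in files.items():
        top = PurePosixPath(path).parts[0]  # group by top-level directory
        agg[top][0] += ai
        agg[top][1] += total
    return {d: ai / total for d, (ai, total) in agg.items()}

for d, r in sorted(ai_ratio_by_dir(files).items(), key=lambda kv: -kv[1]):
    print(f"{d:10s} {r:.0%}")
```

In this invented dataset, tests and API scaffolding run high while `payments/` runs low — exactly the domain breakdown the aggregate number hides.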