Alcazar · Technical Blog

Technical notes, architecture writeups, and release stories.

Published Mar 16, 2026

Wide logging: Stripe's canonical log line pattern

Most logging is too narrow.

One line has the route. Another has the user. Another has the timeout. Another has the feature flag. Another has the deploy SHA.

Then an incident happens and you end up doing joins by hand.

Stripe’s answer is canonical log lines. The modern name is usually wide events. The pattern is simple: emit one structured record per unit of work with all the important fields already attached.

For a web service, that usually means one log event at the end of every request.

The Pattern

A canonical log line is the summary row for a request.

It should include the fields you always wish you had in one place:

  • route
  • method
  • status
  • duration
  • user or account ID
  • request ID and trace ID
  • build or deploy ID
  • feature flags
  • downstream timings
  • error code

In raw form it might look like this:

ts=2026-03-16T12:03:41Z service=api env=prod route=/v1/charges method=POST status=500 request_id=req_123
account_id=acct_456 build_id=9f2c1d7 feature_flag.payments_v2=true duration_ms=843 db_ms=792
db_queries=18 cache_hit=false error_slug=charge_db_timeout

This is useful because the log line is already pre-joined. You are not reconstructing a request from fragments. You are querying complete rows.

That sounds minor, but it changes what production debugging feels like.
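As a minimal sketch, the raw line above could be assembled from a flat dict and serialized to logfmt. The `to_logfmt` helper and the field subset here are illustrative, not a real library:

```python
def to_logfmt(event: dict) -> str:
    """Serialize a flat dict of fields as a logfmt-style line."""
    return " ".join(f"{k}={v}" for k, v in event.items())

# A pre-joined row: request identity, outcome, and cost in one place.
event = {
    "ts": "2026-03-16T12:03:41Z",
    "service": "api",
    "route": "/v1/charges",
    "status": 500,
    "duration_ms": 843,
    "error_slug": "charge_db_timeout",
}
print(to_logfmt(event))
```

JSON works just as well as logfmt; the important property is that all fields land in one record.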

Why It Works

Stripe did two important things.

First, they treated the canonical line as critical infrastructure. It is emitted after the request finishes, and their implementation is hardened so the line still appears when exceptions happen.

Second, they did not stop at debugging. Stripe pushed these records into warehousing systems and used them for longer-term analysis and product surfaces like the Developer Dashboard.

That is the part many teams miss.

A canonical log line is more than a nicer log. It is a request-shaped data model.

If the schema is stable, the same event can support:

  • incident response
  • release analysis
  • customer support investigations
  • product analytics

Amazon describes a similar idea in the Amazon Builders' Library: emit one structured request log entry per unit of work, then derive metrics later. Log first. Aggregate later.

What To Log

Most teams stop too early.

They log route, status, and latency. That is enough for a dashboard, but not enough for diagnosis.

The highest-value fields tend to be:

  • Route template: /teams/{team_id}/members/{user_id} is better than raw paths with IDs embedded in them.
  • Identity: user_id, account_id, API key ID, auth method.
  • Release metadata: Git SHA, build ID, deploy ring, region.
  • Execution cost: duration, DB time, query count, cache hit or miss, retry count.
  • Decision inputs: feature flags, experiment variant, plan tier, client version.
  • Outcome: status code, throttled yes or no, fallback path used, error slug.

Two fields are especially underrated.

The first is build_id. Metrics tell you that latency went up. build_id tells you which deploy owns the regression.

The second is an error_slug. Not just an exception class. A stable identifier for the exact failure site or failure reason.

That is the difference between “timeouts increased” and “the timeout came from the new write path behind feature_flag.double_write.”
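One way to get stable slugs, sketched here with hypothetical names: let failure-site exceptions carry their own slug, and fall back to the exception class only when no slug exists:

```python
class ChargeDBTimeout(TimeoutError):
    """Timeout in the charge write path; carries a stable grouping key."""
    error_slug = "charge_db_timeout"

def record_error(event: dict, exc: Exception) -> None:
    # Prefer a slug tied to the exact failure site; fall back to the
    # class name, which is broader but still queryable.
    event["error_slug"] = getattr(exc, "error_slug", type(exc).__name__)

event = {}
record_error(event, ChargeDBTimeout("db write timed out"))
print(event["error_slug"])  # charge_db_timeout
```

The slug is a code-owned constant, so it stays stable across refactors that rename classes or reword messages.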

The Real Benefit

The real power of wide logging is not observability. It is correlation.

Once every request carries business context and execution context in the same row, you can ask much better questions:

  • Did the new build hurt only enterprise accounts?
  • Did the regression appear only on iOS 7.4.1?
  • Did variant B increase errors only in eu-west-1?
  • Did the slow requests all miss cache and hit the same downstream service?
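Once events are complete rows, questions like these reduce to a filter plus a group-by. A toy sketch over in-memory dicts (field names and values are invented; a real system would run the same shape of query in a columnar store):

```python
from collections import Counter

# Hypothetical wide events, one dict per request.
events = [
    {"build_id": "9f2c1d7", "plan": "enterprise", "status": 500},
    {"build_id": "9f2c1d7", "plan": "free", "status": 200},
    {"build_id": "8a1b0c6", "plan": "enterprise", "status": 200},
]

# "Did the new build hurt only enterprise accounts?"
errors_by_segment = Counter(
    (e["build_id"], e["plan"]) for e in events if e["status"] >= 500
)
print(errors_by_segment)
```

The query works only because build, plan, and status already live on the same row; with scattered logs this is a manual join.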

Metrics are bad at this because they throw context away early.

Traditional logs are bad at this because the context is scattered.

Canonical log lines keep the context intact long enough to query it.

That is why the pattern keeps coming back under different names.

High Cardinality

This is where people get nervous.

user_id, request_id, build_id, and feature flags are high-cardinality fields. In many systems that is a warning sign.

The important distinction is where the cardinality lives.

High-cardinality values are often fine inside a wide event. They become expensive when you force them into the wrong indexing model.

That is why this pattern works best with systems designed for filtering and grouping over many dimensions. Stripe used Splunk and Redshift. Modern teams might use ClickHouse-backed tools, Honeycomb, BigQuery, or their own warehouse.

The storage choice is less important than the query shape. You want to slice rich rows, not pre-aggregate away the useful parts.

Common Mistakes

Only logging the happy path

The canonical event should be emitted in finally, ensure, or equivalent teardown logic. If it disappears on exceptions, it fails when you need it most.
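A minimal sketch of that teardown path in Python, with `emit` standing in for your structured logger and the failure simulated inline:

```python
import json

emitted: list[str] = []

def emit(event: dict) -> None:
    # Stand-in for your structured logger.
    emitted.append(json.dumps(event, sort_keys=True))

def handle(request: dict, event: dict) -> None:
    try:
        event["route"] = request["route"]
        raise TimeoutError("db timed out")  # simulated failure
    except Exception as exc:
        event["status"] = 500
        event["error_slug"] = type(exc).__name__
        raise
    finally:
        emit(event)  # runs even though the handler re-raised

try:
    handle({"route": "/v1/charges"}, {})
except TimeoutError:
    pass

print(emitted[0])
```

The exception still propagates to the framework; the canonical line is emitted on the way out, with the error fields attached.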

Logging raw paths instead of route templates

/users/123/orders/456 is terrible for grouping. /users/{user_id}/orders/{order_id} is what you want.

Logging exception classes but not error reasons

TimeoutError is often too broad. An error slug gives you a stable grouping key tied to a real code path.

Dumping raw input into the event

Amazon recommends sanitizing and truncating request details before logging. That is important here. A rich event becomes dangerous fast if you start packing it with tokens, secrets, or arbitrary payloads.

Letting the schema drift

Field names become muscle memory. If one service logs user_id, another logs uid, and a third logs account_user, cross-service queries get messy fast.

Implementation

The usual implementation is middleware.

Create a request-scoped object at the start of the request. Let middleware and business logic add fields as work happens. Emit one structured line at the end.

If you use OpenTelemetry, the root span can play this role. If not, JSON or logfmt is fine.
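A framework-agnostic sketch of that request-scoped object, with invented names (`CanonicalEvent`, the field set) and JSON output; real middleware would wire `emit` into teardown as described above:

```python
import json
import time

class CanonicalEvent:
    """Request-scoped accumulator; emits one wide line per request."""

    def __init__(self, **base):
        self.fields = dict(base)
        self._start = time.monotonic()

    def set(self, **kwargs):
        # Business logic and middleware add fields as work happens.
        self.fields.update(kwargs)

    def emit(self) -> str:
        self.fields["duration_ms"] = round(
            (time.monotonic() - self._start) * 1000, 1
        )
        return json.dumps(self.fields, sort_keys=True)

# Create at request start, enrich during the request, emit once at the end.
event = CanonicalEvent(service="api", env="prod", route="/v1/charges")
event.set(account_id="acct_456", status=200, cache_hit=False)
line = event.emit()
print(line)
```

The key design choice is that the object is created unconditionally at request start, so even a request that fails early has something to emit.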

A good starting schema is:

  • service.name
  • env
  • request_id
  • trace_id
  • route
  • method
  • status
  • duration_ms
  • user_id or account_id
  • build_id
  • error_slug
  • sample_rate

Then add fields whenever a real production question is hard to answer.

Summary

Canonical logging is a simple idea that pays for itself quickly.

Emit one rich, trustworthy event per request. Make it stable. Make it complete. Make sure it still appears on failures.

Once you do that, logs stop being breadcrumbs and start being records.
