Sources and checks

Writing a check

A check is the thing that turns "pg_restore exited zero" into "the restore worked." It runs against the restored scratch database and answers one question: does this look like your data?

Why you write your own

Without a check, a restore test proves only that the dump tool could reverse its own output. That is necessary but thin. Dumps can restore into a functionally broken state: missing extensions, silent collation drift, empty tables because of a permission mismatch at backup time, a row count that is off by an order of magnitude because the dump caught a transaction at a bad moment.

A check is one query that, if it passes, rules out the category of failure you actually care about. One check per source is enough to start. Add more as you learn what can go wrong.

Anatomy

A check has three fields. The id names it, the summary query returns one row, the expect expression says what the row should look like.

checks:
  - id: recent-orders
    summary_sql: |
      SELECT COUNT(*) AS n
      FROM orders
      WHERE created_at > NOW() - INTERVAL '24 hours'
    expect: "n > 100"

Each source in config.yaml references the checks it uses by id:

sources:
  - slug: prod-db
    kind: postgres
    connection: { uri: env:RESTORABLE_DB_URI }
    checks:
      - recent-orders

The check runs after the restore completes, against the scratch Postgres container the agent spun up.

The expect DSL

The summary query returns one row. The column names in that row become the identifiers you can reference in expect. The DSL accepts exactly one comparison:

<column> <op> <literal>
  • Operators: ==, !=, <, <=, >, >=
  • Literals: integers, floats, double-quoted or single-quoted strings, true, false, null
  • Ordering operators (<, <=, >, >=) require a numeric right-hand side. Equality (==, !=) accepts numbers, strings, or bool/null.

Examples:

expect: "n > 100"
expect: "users >= 1000"
expect: "status == \"ok\""
expect: "failures == 0"

The DSL is deliberately small. It is not a programming language; it is a sentence that says "this row should look like this." For anything more (compound conditions, ranges, derived predicates), push the logic into the summary query and have expect check one boolean column the query produces. The Postgres pattern is (condition)::int AS ok with expect: "ok == 1"; Mongo uses $project: { ok: { $cond: [<predicate>, 1, 0] } }.

Postgres examples

Row freshness

Assert that your orders table received rows recently. Useful when you want to detect a restore that succeeded against a backup that was silently empty for your most-active table.

- id: orders-fresh
  summary_sql: |
    SELECT (
      COUNT(*) FILTER (WHERE created_at > NOW() - INTERVAL '24 hours') > 0
      AND COUNT(*) > 1000
    )::int AS ok
    FROM orders
  expect: "ok == 1"

NOW() inside the scratch container is the scratch's wall clock, which is the clock at restore-test time, not at backup time. That is usually what you want for freshness checks. For time-invariant checks, avoid NOW() and use recorded timestamps instead.

Reference data intact

Assert that a small table of reference values is present. Low noise, catches restore failures where a schema restored but data did not.

- id: currencies-present
  summary_sql: |
    SELECT COUNT(*) AS n FROM currencies
  expect: "n >= 150"

Schema sanity

Assert that a table count matches what you expect. Catches restores that produce a valid but drastically-reduced database.

- id: tables-sane
  summary_sql: |
    SELECT (COUNT(*) BETWEEN 40 AND 60)::int AS ok
    FROM information_schema.tables
    WHERE table_schema = 'public'
  expect: "ok == 1"

Invariant across rows

Push the test into SQL and have expect check one boolean. Useful when you want to assert something that holds for every row.

- id: no-orphan-orders
  summary_sql: |
    SELECT COUNT(*) AS orphans
    FROM orders o
    LEFT JOIN users u ON o.user_id = u.id
    WHERE u.id IS NULL
  expect: "orphans == 0"

MongoDB examples

Mongo checks use summary_find instead of summary_sql. Three shapes: a filter with an implicit count, a full aggregate pipeline, or a list-collections probe over the database.

Filter-and-count

- id: open-orders
  summary_find:
    database: app
    collection: orders
    filter: { "status": "open" }
  expect: "n > 0"

A filter block runs db.collection.countDocuments(filter) in the scratch database and produces a summary row { n: <count> }. The database key is required; the scratch's default database is admin, which is not where your data lives.

Aggregate

- id: orders-total
  summary_find:
    database: app
    collection: orders
    aggregate:
      - { $match: { status: "open" } }
      - { $group: { _id: null, n: { $sum: 1 }, total: { $sum: "$amount" } } }
      - { $project: { ok: { $cond: [{ $and: [{ $gt: ["$total", 10000] }, { $gt: ["$n", 50] }] }, 1, 0] } } }
  expect: "ok == 1"

The first document of the pipeline result becomes the summary row. Keys in that document are what expect can reference.

List-collections

- id: app-collections
  summary_find:
    database: app
    list_collections: true
  expect: "n >= 4"

With list_collections: true the agent runs db.getCollectionNames() against database, drops names that start with system., and returns the count as { n: <count> }. This is the schema- count probe the wizard auto-attaches as <slug>-collections: it catches the failure mode where a restore exits clean but most of the database didn't materialize. collection, filter, and aggregate must be unset in this mode.

Execution semantics

Three rules to know:

Checks run sequentially against a shared scratch

All checks configured on one source run in one restore session against the same ephemeral database. They run in the order declared in the source's checks: list.

Treat checks as read-only by convention. If an earlier check mutates the scratch DB, later checks see the mutation. The agent does not enforce read-only; your SQL does.

Per-check timeouts

Each check has a 60-second timeout. A check that runs over budget is killed and recorded as result: "error" with reason timeout. The session continues with the next check.

The roll-up

The receipt's top-level result rolls up the per-check results. Any fail wins, otherwise any error wins, otherwise pass. You see per-check detail in the dashboard and in the receipt itself.

What not to put in a check

  • Writes. Neither the agent nor the scratch DB blocks them. A DELETE that happens to work breaks later checks in the same session and makes receipts confusing.
  • Long-running queries. A query that takes five minutes makes every restore test take five minutes plus. If you need to validate large aggregates, precompute them at backup time in a materialised view and have the check read that.
  • Queries that depend on backup time. The scratch DB has no knowledge of when the backup was taken beyond the rows you backed up. If freshness matters, the backup's created_at metadata in the receipt is the right source of truth, not the SQL.
  • Credentials or data in the check text. The check text is hashed, and the hash appears in the receipt. The text itself is not serialised, so embedded secrets would not leak out of your infrastructure, but the check is a lousy place for them regardless.

How the check binds to the receipt

The agent computes a SHA-256 of the check's canonical bytes (SQL text for Postgres, canonical JSON for Mongo) and embeds that hash in the receipt. Your auditor can rebuild the hash from the check's definition and verify that the receipt references the exact check you say it does.

The check text itself stays in your infrastructure. It never uploads to the orchestrator, and it never appears in the receipt. The hash is the contract.

Changing a check

Edit the check in config.yaml, reload the agent (systemctl reload restorable). The next restore test uses the new check and produces a receipt with the new hash. Historical receipts continue to reference the old hash; that is correct, because they attest to what the old check said.

Renaming a check's id while keeping its content the same is a separate event: the hash stays, the id changes, so receipts before and after are legibly distinct to an auditor doing a longitudinal review.