Writing a check
A check is the thing that turns "pg_restore exited zero" into "the restore worked." It runs against the restored scratch database and answers one question: does this look like your data?
Why you write your own
Without a check, a restore test proves only that the dump tool could reverse its own output. That is necessary but thin. Dumps can restore into a functionally broken state: missing extensions, silent collation drift, empty tables because of a permission mismatch at backup time, a row count that is off by an order of magnitude because the dump caught a transaction at a bad moment.
A check is one query that, if it passes, rules out the category of failure you actually care about. One check per source is enough to start. Add more as you learn what can go wrong.
Anatomy
A check has three fields. The id names it, the
summary query returns one row, the expect expression
says what the row should look like.
checks:
- id: recent-orders
summary_sql: |
SELECT COUNT(*) AS n
FROM orders
WHERE created_at > NOW() - INTERVAL '24 hours'
expect: "n > 100"
Each source in config.yaml references the checks it
uses by id:
sources:
- slug: prod-db
kind: postgres
connection: { uri: env:RESTORABLE_DB_URI }
checks:
- recent-orders The check runs after the restore completes, against the scratch Postgres container the agent spun up.
The expect DSL
The summary query returns one row. The column names in that row
become the identifiers you can reference in expect.
The DSL accepts exactly one comparison:
<column> <op> <literal> - Operators:
==,!=,<,<=,>,>= - Literals: integers, floats, double-quoted or single-quoted strings,
true,false,null - Ordering operators (
<,<=,>,>=) require a numeric right-hand side. Equality (==,!=) accepts numbers, strings, or bool/null.
Examples:
expect: "n > 100"
expect: "users >= 1000"
expect: "status == \"ok\""
expect: "failures == 0"
The DSL is deliberately small. It is not a programming language;
it is a sentence that says "this row should look like this." For
anything more (compound conditions, ranges, derived predicates),
push the logic into the summary query and have expect
check one boolean column the query produces. The Postgres pattern
is (condition)::int AS ok with
expect: "ok == 1"; Mongo uses
$project: { ok: { $cond: [<predicate>, 1, 0] } }.
Postgres examples
Row freshness
Assert that your orders table received rows recently. Useful when you want to detect a restore that succeeded against a backup that was silently empty for your most-active table.
- id: orders-fresh
summary_sql: |
SELECT (
COUNT(*) FILTER (WHERE created_at > NOW() - INTERVAL '24 hours') > 0
AND COUNT(*) > 1000
)::int AS ok
FROM orders
expect: "ok == 1" NOW() inside the scratch container is the scratch's
wall clock, which is the clock at restore-test time, not at
backup time. That is usually what you want for freshness checks.
For time-invariant checks, avoid NOW() and use
recorded timestamps instead.
Reference data intact
Assert that a small table of reference values is present. Low noise, catches restore failures where a schema restored but data did not.
- id: currencies-present
summary_sql: |
SELECT COUNT(*) AS n FROM currencies
expect: "n >= 150" Schema sanity
Assert that a table count matches what you expect. Catches restores that produce a valid but drastically-reduced database.
- id: tables-sane
summary_sql: |
SELECT (COUNT(*) BETWEEN 40 AND 60)::int AS ok
FROM information_schema.tables
WHERE table_schema = 'public'
expect: "ok == 1" Invariant across rows
Push the test into SQL and have expect check one
boolean. Useful when you want to assert something that holds for
every row.
- id: no-orphan-orders
summary_sql: |
SELECT COUNT(*) AS orphans
FROM orders o
LEFT JOIN users u ON o.user_id = u.id
WHERE u.id IS NULL
expect: "orphans == 0" MongoDB examples
Mongo checks use summary_find instead of
summary_sql. Three shapes: a filter with an implicit
count, a full aggregate pipeline, or a list-collections probe
over the database.
Filter-and-count
- id: open-orders
summary_find:
database: app
collection: orders
filter: { "status": "open" }
expect: "n > 0"
A filter block runs
db.collection.countDocuments(filter) in the scratch
database and produces a summary row
{ n: <count> }. The
database key is required; the scratch's default
database is admin, which is not where your data
lives.
Aggregate
- id: orders-total
summary_find:
database: app
collection: orders
aggregate:
- { $match: { status: "open" } }
- { $group: { _id: null, n: { $sum: 1 }, total: { $sum: "$amount" } } }
- { $project: { ok: { $cond: [{ $and: [{ $gt: ["$total", 10000] }, { $gt: ["$n", 50] }] }, 1, 0] } } }
expect: "ok == 1"
The first document of the pipeline result becomes the summary
row. Keys in that document are what expect can
reference.
List-collections
- id: app-collections
summary_find:
database: app
list_collections: true
expect: "n >= 4"
With list_collections: true the agent runs
db.getCollectionNames() against
database, drops names that start with
system., and returns the count as
{ n: <count> }. This is the schema-
count probe the wizard auto-attaches as
<slug>-collections: it catches the failure
mode where a restore exits clean but most of the database didn't
materialize. collection, filter, and
aggregate must be unset in this mode.
Execution semantics
Three rules to know:
Checks run sequentially against a shared scratch
All checks configured on one source run in one restore session
against the same ephemeral database. They run in the order
declared in the source's checks: list.
Treat checks as read-only by convention. If an earlier check mutates the scratch DB, later checks see the mutation. The agent does not enforce read-only; your SQL does.
Per-check timeouts
Each check has a 60-second timeout. A check that runs over
budget is killed and recorded as result: "error"
with reason timeout. The session continues with the
next check.
The roll-up
The receipt's top-level result rolls up the per-check
results. Any fail wins, otherwise any
error wins, otherwise pass. You see
per-check detail in the dashboard and in the receipt itself.
What not to put in a check
- Writes. Neither the agent nor the scratch DB blocks them. A DELETE that happens to work breaks later checks in the same session and makes receipts confusing.
- Long-running queries. A query that takes five minutes makes every restore test take five minutes plus. If you need to validate large aggregates, precompute them at backup time in a materialised view and have the check read that.
- Queries that depend on backup time. The
scratch DB has no knowledge of when the backup was taken
beyond the rows you backed up. If freshness matters, the
backup's
created_atmetadata in the receipt is the right source of truth, not the SQL. - Credentials or data in the check text. The check text is hashed, and the hash appears in the receipt. The text itself is not serialised, so embedded secrets would not leak out of your infrastructure, but the check is a lousy place for them regardless.
How the check binds to the receipt
The agent computes a SHA-256 of the check's canonical bytes (SQL text for Postgres, canonical JSON for Mongo) and embeds that hash in the receipt. Your auditor can rebuild the hash from the check's definition and verify that the receipt references the exact check you say it does.
The check text itself stays in your infrastructure. It never uploads to the orchestrator, and it never appears in the receipt. The hash is the contract.
Changing a check
Edit the check in config.yaml, reload the agent
(systemctl reload restorable). The next
restore test uses the new check and produces a receipt with the
new hash. Historical receipts continue to reference the old
hash; that is correct, because they attest to what the old
check said.
Renaming a check's id while keeping its content
the same is a separate event: the hash stays, the id changes,
so receipts before and after are legibly distinct to an auditor
doing a longitudinal review.