Architecture

stratum is a Cargo workspace of seven crates. The flow is config → desired resources → diff against prior state → plan → apply via providers → new state → post-apply drift check.

Workspace layout

crates/
  core/                   stratum-core
  config/                 stratum-config
  cli/                    stratum-cli (the `stratum` binary)
  providers/
    ssh/                  stratum-provider-ssh
    docker/               stratum-provider-docker
    system/               stratum-provider-system
    git/                  stratum-provider-git

core has no provider dependencies. config depends on core only for its DesiredResource / ResourceAddr types. Providers depend on core for the trait. The CLI wires everything together.

Core types

All types live in crates/core/src/lib.rs.

ResourceAddr

struct ResourceAddr { kind: String, name: String }

Renders as <kind>.<name>. Used as the key in the state map and as the user-facing identifier in CLI output.
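A minimal sketch of the rendering, assuming key() (referenced by the State map below) produces the same string as the Display output. The derives and the method body are illustrative, not copied from crates/core.

```rust
use std::fmt;

// Field layout taken from the definition above; derives are an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ResourceAddr {
    pub kind: String,
    pub name: String,
}

impl ResourceAddr {
    /// State-map key. Assumed to match the Display rendering.
    pub fn key(&self) -> String {
        format!("{}.{}", self.kind, self.name)
    }
}

impl fmt::Display for ResourceAddr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // <kind>.<name>
        write!(f, "{}.{}", self.kind, self.name)
    }
}
```

With this shape, an addr of kind docker_container and name traefik renders as docker_container.traefik, matching the keys shown under State file shape.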

ResourceState

struct ResourceState {
    addr: ResourceAddr,
    provider: String,
    attrs: serde_json::Value,
}

One per tracked resource. attrs is whatever the provider returned from its last create / update.

State

struct State {
    version: u32,                              // file format version, default 1
    resources: BTreeMap<String, ResourceState>, // keyed by addr.key()
}

On-disk JSON, loaded with State::load(path) and saved with State::save(path). Default path is .stratum/state.json. A missing file is treated as an empty state. The parent directory is created on save.

Action and FieldChange

enum Action {
    NoOp,
    Create,
    Update { changes: Vec<FieldChange> },
    Delete,
}

struct FieldChange { field: String, from: Value, to: Value }

Action::symbol() returns the two-character prefix used in plan output ( , +, ~, -).

Observed and Drift

enum Observed {
    Present(Value),       // resource exists; attrs normalized to state shape
    Absent,               // confirmed gone on the host
    Unknown(String),      // provider can't tell (carries a reason)
}

struct Drift {
    changes: Vec<FieldChange>,
    missing: bool,               // state says exists but observed == Absent
    unreadable: Option<String>,  // observed == Unknown OR read returned Err
}

Drift is per-resource, populated by refresh_plan. Drift::is_clean() is true when all three fields are empty/false/none.
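The is_clean() rule can be sketched as a single boolean expression. FieldChange here is a stand-in that holds plain strings rather than JSON values, so the example has no dependencies; the method body is an assumption based on the sentence above.

```rust
// Stand-in for the real FieldChange (which holds serde_json values).
#[derive(Debug, Clone, PartialEq)]
pub struct FieldChange {
    pub field: String,
    pub from: String,
    pub to: String,
}

#[derive(Debug, Default)]
pub struct Drift {
    pub changes: Vec<FieldChange>,
    pub missing: bool,
    pub unreadable: Option<String>,
}

impl Drift {
    /// Clean means: no field changes, not missing, and readable.
    pub fn is_clean(&self) -> bool {
        self.changes.is_empty() && !self.missing && self.unreadable.is_none()
    }
}
```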

PlannedResource and Plan

struct PlannedResource {
    addr: ResourceAddr,
    provider: String,
    desired: Value,
    prior: Value,
    action: Action,
    drift: Option<Drift>,     // None unless refresh_plan was called
}

struct Plan { steps: Vec<PlannedResource> }

Plan::summary() returns a PlanSummary { create, update, delete, noop, drifted, missing, unreadable }. Plan::was_refreshed() is true iff any step has a Some(drift).

DesiredResource

struct DesiredResource {
    addr: ResourceAddr,
    provider: String,
    attrs: Value,
}

The output of the config evaluator and the input to build_plan.

Plan / apply flow

  1. Resolve sources. If -n NAME is set, the CLI loads the manifest, extracts the named namespace, and resolves the config list to [manifest, ...namespace.configs] and the state path to .stratum/<name>.json (or the namespace's explicit state =). Otherwise the configs and state path come straight from -c / -s. See Manifest discovery.
  2. Parse config. stratum_config::load_files(paths) (or load_file for a single path) runs lex → parse on each file, tags each block with its source path, concatenates into one Document, and runs a multi-pass extract: hosts → secrets → namespaces → providers + resources. The result is an Extracted { hosts, providers, resources, secrets, namespaces, redaction_map }. Any system_file with content_file is inlined during this step — see content_file. Duplicate hosts/providers/secrets/resources/namespaces across files are hard errors that name both paths.
  3. Cross-namespace check. In namespace mode only, re-load every sibling namespace's configs and check the current namespace's docker_container resources for port and container-name collisions. See Cross-namespace validator.
  4. Load state. In bundle mode, State::load(state_path). In namespace mode, State::load_merged(state_path, _shared.json) — see Split state. Missing file → default empty state.
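  5. Build plan. build_plan(extracted.resources, &state) -> Result<Plan> runs the planner-side validators, then classifies each resource as NoOp, Create, Update, or Delete against the loaded state. See Planner-side validators.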
  6. Optional refresh. With plan --refresh, run refresh_plan(&mut plan, &registry) to annotate every non-create step with observed drift.
  7. Print plan. Symbol per resource, fields-changed lines for updates, drift annotations if refreshed, summary at the end.
  8. Confirmation gate. Without -y, exit here. apply without -y is identical to plan plus the "Apply? Re-run with -y to execute" line.
  9. Build registry. Instantiate all providers. (No shipped provider reads its provider { ... } block today.)
  10. Execute. For each plan step (in topo order), look up the provider by kind prefix and call create / update / delete. After every docker_container create or update, run the post-apply readiness wait before moving to the next step. Update state with the returned attrs (or remove the entry on delete).
  11. Save state. In bundle mode, state.save(state_path). In namespace mode, state.save_split(state_path, _shared.json): _stratum_* addresses route to the shared file, everything else to the namespace's file.
  12. Post-apply self-check. Reload the config, rebuild the plan against the new state, run refresh_plan again, and print one summary line: post-apply drift: clean or post-apply drift: N differ, M missing, K unreadable — run 'stratum plan --refresh' to see details.

The Provider trait

#[async_trait]
trait Provider: Send + Sync {
    fn name(&self) -> &str;
    fn kinds(&self) -> &[&'static str];
    fn configure(&mut self, _attrs: &Value) -> Result<()> { Ok(()) }
    async fn create(&self, kind: &str, name: &str, attrs: &Value) -> Result<Value>;
    async fn update(&self, kind: &str, name: &str, prior: &Value, attrs: &Value) -> Result<Value>;
    async fn delete(&self, kind: &str, name: &str, prior: &Value) -> Result<()>;
    async fn read(&self, _kind: &str, _name: &str, _prior: &Value) -> Result<Observed> {
        Ok(Observed::Unknown("provider does not implement read".into()))
    }
}

  • name() is the lookup key in the registry.
  • kinds() lists every kind the provider owns. The registry's for_kind scans providers and returns the first match.
  • configure is called once at apply time, with the provider "<name>" { ... } body. Default impl ignores it. No shipped provider implements it today.
  • create, update, delete return the new attrs to record in state (or () for delete). The returned value is what the next plan will diff against.
  • read must be non-destructive — it's a query, not a side effect. Default impl returns Unknown. Implementations should normalize the returned Value to the same shape as state attrs.
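The first-match lookup described for for_kind might look like the sketch below. It uses a synchronous stand-in trait with only the two lookup methods; StaticProvider and the Vec-backed Registry layout are hypothetical and exist only for illustration.

```rust
// Synchronous stand-in for the Provider trait: lookup methods only.
pub trait Provider {
    fn name(&self) -> &str;
    fn kinds(&self) -> &[&'static str];
}

// Hypothetical provider with a fixed name and kind list.
pub struct StaticProvider {
    pub name: &'static str,
    pub kinds: &'static [&'static str],
}

impl Provider for StaticProvider {
    fn name(&self) -> &str {
        self.name
    }
    fn kinds(&self) -> &[&'static str] {
        self.kinds
    }
}

// Assumed registry shape: providers kept in registration order.
pub struct Registry {
    providers: Vec<Box<dyn Provider>>,
}

impl Registry {
    pub fn new(providers: Vec<Box<dyn Provider>>) -> Self {
        Self { providers }
    }

    /// Scan in registration order and return the first provider owning `kind`.
    pub fn for_kind(&self, kind: &str) -> Option<&dyn Provider> {
        self.providers
            .iter()
            .find(|p| p.kinds().iter().any(|k| *k == kind))
            .map(|p| &**p) // &Box<dyn Provider> -> &dyn Provider
    }
}
```

Because the scan returns the first match, two providers claiming the same kind would resolve to whichever was registered first in this sketch.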

The diff algorithm

There are two diff functions in core. They serve different purposes.

diff (symmetric, used by Action::Update legacy path)

fn diff(prior: &Value, desired: &Value) -> Vec<FieldChange>

A recursive walk over JSON values:

  • If prior == desired exactly, return no changes.
  • If both are JSON objects, walk their union of keys (sorted, deduplicated). For each key, recurse with the dotted path <prefix>.<key>.
  • Otherwise, emit a single FieldChange { field: <path>, from: prior, to: desired }. The field is "<root>" when the diff lives at the document root.
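The three rules above can be sketched as a short recursive function. Value is a reduced stand-in for serde_json::Value, obj is a hypothetical helper for building objects, and treating a key that is missing on one side as Null is an assumption of this sketch.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Reduced stand-in for serde_json::Value.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Str(String),
    Num(i64),
    Object(BTreeMap<String, Value>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct FieldChange {
    pub field: String,
    pub from: Value,
    pub to: Value,
}

// Hypothetical helper for building objects in examples.
pub fn obj(pairs: Vec<(&str, Value)>) -> Value {
    Value::Object(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}

pub fn diff(prior: &Value, desired: &Value) -> Vec<FieldChange> {
    let mut out = Vec::new();
    walk("", prior, desired, &mut out);
    out
}

fn walk(path: &str, prior: &Value, desired: &Value, out: &mut Vec<FieldChange>) {
    if prior == desired {
        return; // exact match: no changes
    }
    if let (Value::Object(p), Value::Object(d)) = (prior, desired) {
        // Union of keys; a BTreeSet iterates sorted and deduplicated.
        let keys: BTreeSet<&String> = p.keys().chain(d.keys()).collect();
        for key in keys {
            let child = if path.is_empty() { key.clone() } else { format!("{path}.{key}") };
            // Assumption: a key missing on one side compares as Null.
            walk(
                &child,
                p.get(key).unwrap_or(&Value::Null),
                d.get(key).unwrap_or(&Value::Null),
                out,
            );
        }
        return;
    }
    let field = if path.is_empty() { "<root>".to_string() } else { path.to_string() };
    out.push(FieldChange { field, from: prior.clone(), to: desired.clone() });
}
```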

diff_observed (one-sided, used by build_plan and refresh_plan)

fn diff_observed(prior: &Value, observed: &Value) -> Vec<FieldChange>

Used both by build_plan (comparing state-stored prior against desired config) and by refresh_plan (comparing state against live observation). Rules differ from diff:

  • State-only fields are ignored. Only keys present in observed are walked. A field that's in prior but not in observed does not generate drift. This is what lets providers store extra fields (container_id, sha256, etc.) without polluting plans.
  • Missing key vs empty container = no drift. If prior has no key k but observed has k: {} or k: [], that's not drift. Same for prior: null vs observed: {} / [].
  • String arrays are compared as sets. ["a", "b"] and ["b", "a"] are equal. Non-string arrays are compared by order.
  • Added keys in observed → flagged. A key in observed but not in prior shows up as from: null, to: <value>.

The provider's read implementation is responsible for trimming the observed value to a shape that mirrors state, so noise doesn't leak through. For example, docker_container strips com.docker.* labels and intersects with the state's label key set.
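A sketch of the one-sided rules, under the same stand-in Value type (here with an Array variant). The helpers obj, strs, is_empty_container, and as_string_set are hypothetical. Marker-aware secret comparison is left out, and string arrays are compared as deduplicated sets, which is one reading of "compared as sets".

```rust
use std::collections::{BTreeMap, BTreeSet};

// Reduced stand-in for serde_json::Value.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Null,
    Str(String),
    Num(i64),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct FieldChange {
    pub field: String,
    pub from: Value,
    pub to: Value,
}

// Hypothetical builders for examples.
pub fn obj(pairs: Vec<(&str, Value)>) -> Value {
    Value::Object(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
}
pub fn strs(items: &[&str]) -> Value {
    Value::Array(items.iter().map(|s| Value::Str(s.to_string())).collect())
}

fn is_empty_container(v: &Value) -> bool {
    match v {
        Value::Array(a) => a.is_empty(),
        Value::Object(o) => o.is_empty(),
        _ => false,
    }
}

// Some(set) only when every element is a string.
fn as_string_set(items: &[Value]) -> Option<BTreeSet<&str>> {
    items
        .iter()
        .map(|v| match v {
            Value::Str(s) => Some(s.as_str()),
            _ => None,
        })
        .collect()
}

fn equivalent(prior: &Value, observed: &Value) -> bool {
    match (prior, observed) {
        // Missing key (seen as Null) or null vs empty container: no drift.
        (Value::Null, o) if is_empty_container(o) => true,
        (Value::Array(p), Value::Array(o)) => match (as_string_set(p), as_string_set(o)) {
            (Some(ps), Some(os)) => ps == os, // string arrays: order ignored
            _ => p == o,                      // other arrays: order matters
        },
        _ => prior == observed,
    }
}

pub fn diff_observed(prior: &Value, observed: &Value) -> Vec<FieldChange> {
    let mut out = Vec::new();
    walk("", prior, observed, &mut out);
    out
}

fn walk(path: &str, prior: &Value, observed: &Value, out: &mut Vec<FieldChange>) {
    if equivalent(prior, observed) {
        return;
    }
    if let (Value::Object(p), Value::Object(o)) = (prior, observed) {
        // Only observed keys are walked, so state-only fields never surface.
        for (key, obs_child) in o {
            let child = if path.is_empty() { key.clone() } else { format!("{path}.{key}") };
            walk(&child, p.get(key).unwrap_or(&Value::Null), obs_child, out);
        }
        return;
    }
    let field = if path.is_empty() { "<root>".to_string() } else { path.to_string() };
    out.push(FieldChange { field, from: prior.clone(), to: observed.clone() });
}
```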

Drift detection

refresh_plan(&mut plan, &registry) annotates each plan step with observed drift from live reality.

async fn refresh_plan(plan: &mut Plan, registry: &Registry);

Rules:

  • Action::Create is skipped — there's no prior state to read.
  • Sequential per resource. SSH round-trips are I/O-bound, but at ~10 resources parallelism isn't justified yet.
  • Per-resource errors are caught, not propagated. They become Drift::unreadable = Some("read failed: ..."). refresh_plan itself never returns Err.
  • The provider's read is called with (kind, name, &step.prior). The returned Observed is mapped:
    • Present(observed) → drift.changes = diff_observed(&step.prior, &observed)
    • Absent → drift.missing = true
    • Unknown(reason) → drift.unreadable = Some(reason)
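The mapping from a read result to a Drift can be sketched as one match. Attrs is a flat string map standing in for JSON attrs, the inline comparison is a simplified substitute for diff_observed, and drift_from_read is a hypothetical name. The "read failed: ..." prefix is taken from the rule above.

```rust
use std::collections::BTreeMap;

// Flat stand-in for JSON attrs.
pub type Attrs = BTreeMap<String, String>;

pub enum Observed {
    Present(Attrs),
    Absent,
    Unknown(String),
}

#[derive(Debug, Default, PartialEq)]
pub struct Drift {
    // (field, from, to); `from` is None when the key is new in observed.
    pub changes: Vec<(String, Option<String>, String)>,
    pub missing: bool,
    pub unreadable: Option<String>,
}

/// Hypothetical helper: turn one provider read into a Drift, never an Err.
pub fn drift_from_read(prior: &Attrs, read: Result<Observed, String>) -> Drift {
    let mut drift = Drift::default();
    match read {
        Ok(Observed::Present(observed)) => {
            // Simplified one-sided diff: only observed keys are compared.
            for (key, value) in &observed {
                if prior.get(key) != Some(value) {
                    drift.changes.push((key.clone(), prior.get(key).cloned(), value.clone()));
                }
            }
        }
        Ok(Observed::Absent) => drift.missing = true,
        Ok(Observed::Unknown(reason)) => drift.unreadable = Some(reason),
        // Per-resource errors are captured, not propagated.
        Err(e) => drift.unreadable = Some(format!("read failed: {e}")),
    }
    drift
}
```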

PlanSummary counts:

  • drifted — count of steps where drift.changes is non-empty.
  • missing — count of steps where drift.missing == true and the action is not Delete. (A Delete step whose resource is already gone is annotated (already gone on host; delete will noop) instead — that's not drift.)
  • unreadable — count of steps where drift.unreadable.is_some().
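The counting rules, including the Delete exception for missing, can be sketched over a reduced step type. Step, DriftCounts, and count_drift are hypothetical names; the real PlanSummary also carries the create / update / delete / noop counts.

```rust
// Update's payload is omitted in this reduced Action.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    NoOp,
    Create,
    Update,
    Delete,
}

// Reduced view of one refreshed step; field names are hypothetical.
pub struct Step {
    pub action: Action,
    pub changed_fields: usize,
    pub missing: bool,
    pub unreadable: bool,
}

#[derive(Debug, Default, PartialEq)]
pub struct DriftCounts {
    pub drifted: usize,
    pub missing: usize,
    pub unreadable: usize,
}

pub fn count_drift(steps: &[Step]) -> DriftCounts {
    let mut counts = DriftCounts::default();
    for step in steps {
        if step.changed_fields > 0 {
            counts.drifted += 1;
        }
        // A Delete whose target is already gone is not counted as missing.
        if step.missing && step.action != Action::Delete {
            counts.missing += 1;
        }
        if step.unreadable {
            counts.unreadable += 1;
        }
    }
    counts
}
```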

Planner-side validators

Port-conflict validator

Before classifying steps, build_plan walks every docker_container.ports value across the desired set and checks for (host, ip, host_port) collisions. Two resources binding the same port on the same host is a hard error at plan time, naming both. A 0.0.0.0:N bind symmetrically collides with 127.0.0.1:N — the wildcard bind subsumes the loopback one.

Random ports ("5432" — docker picks the host port) are skipped silently. Port ranges ("8000-8010:8000-8010") get a warning but are not validated. Unrecognized port shapes are skipped to keep the validator forward-compatible.
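A sketch of the port parsing and collision rule, assuming an entry without an explicit IP binds on 0.0.0.0 and treating the wildcard as colliding with every address. Bind, parse_bind, collides, and first_conflict are hypothetical names. Claims are flattened to one (addr, host, port entry) tuple each, and range entries are skipped silently here rather than producing the warning described above.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Bind {
    pub ip: String,
    pub host_port: u16,
}

/// Some(bind) for "H:C" and "IP:H:C"; None for every skipped shape.
pub fn parse_bind(entry: &str) -> Option<Bind> {
    let parts: Vec<&str> = entry.split(':').collect();
    let (ip, host) = match parts.as_slice() {
        [host, _container] => ("0.0.0.0", *host), // assumption: no IP means wildcard
        [ip, host, _container] => (*ip, *host),
        _ => return None, // bare port (docker picks) or unrecognized shape
    };
    // A range such as "8000-8010" does not parse as u16 and is skipped.
    let host_port = host.parse::<u16>().ok()?;
    Some(Bind { ip: ip.to_string(), host_port })
}

/// Same port, and either the same IP or one side is the wildcard.
pub fn collides(a: &Bind, b: &Bind) -> bool {
    a.host_port == b.host_port && (a.ip == b.ip || a.ip == "0.0.0.0" || b.ip == "0.0.0.0")
}

/// Each claim is (resource addr, host, port entry). Returns the first colliding pair.
pub fn first_conflict<'a>(claims: &[(&'a str, &'a str, &'a str)]) -> Option<(&'a str, &'a str)> {
    for (i, (addr_a, host_a, entry_a)) in claims.iter().enumerate() {
        let Some(bind_a) = parse_bind(entry_a) else { continue };
        for (addr_b, host_b, entry_b) in &claims[i + 1..] {
            if host_a != host_b {
                continue; // different hosts never collide
            }
            if let Some(bind_b) = parse_bind(entry_b) {
                if collides(&bind_a, &bind_b) {
                    return Some((*addr_a, *addr_b));
                }
            }
        }
    }
    None
}
```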

depends_on topo sort

The planner runs a stable Kahn's-algorithm topo sort over the docker_container.depends_on edges (see depends_on). Properties:

  • Stable. Resources without edges keep their input (file) order. Where ties exist, a BTreeSet ready-set picks them in lexicographic addr order.
  • Implicit _stratum_* resources stay at the front. They carry no edges and have in_degree = 0, so they land first.
  • Cycles are a hard error citing the cycle path.
  • Unknown references are hard errors citing both the source and the missing target.

The topo order applies to Create and Update steps; Delete order is computed separately.
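A sketch of Kahn's algorithm with a BTreeSet ready-set, which yields the lexicographic tie-break mentioned above. The function name and signature are hypothetical, and the cycle error is simplified: it reports that a cycle exists but does not cite the cycle path as the real planner does.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// `deps` maps a resource to the resources it depends on.
/// Returns an order in which every dependency precedes its dependents.
pub fn topo_sort<'a>(
    nodes: &[&'a str],
    deps: &BTreeMap<&'a str, Vec<&'a str>>,
) -> Result<Vec<String>, String> {
    let known: BTreeSet<&str> = nodes.iter().copied().collect();
    let mut in_degree: BTreeMap<&str, usize> = nodes.iter().map(|n| (*n, 0)).collect();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();

    for (node, targets) in deps {
        for target in targets {
            if !known.contains(target) {
                return Err(format!("{node} depends on unknown resource {target}"));
            }
            *in_degree
                .get_mut(node)
                .ok_or_else(|| format!("unknown resource {node}"))? += 1;
            dependents.entry(*target).or_default().push(*node);
        }
    }

    // Ready-set of zero in-degree nodes; BTreeSet pops in lexicographic order.
    let mut ready: BTreeSet<&str> = in_degree
        .iter()
        .filter(|(_, degree)| **degree == 0)
        .map(|(node, _)| *node)
        .collect();

    let mut order = Vec::new();
    while let Some(next) = ready.pop_first() {
        order.push(next.to_string());
        for dependent in dependents.get(next).into_iter().flatten() {
            let degree = in_degree.get_mut(dependent).expect("validated above");
            *degree -= 1;
            if *degree == 0 {
                ready.insert(*dependent);
            }
        }
    }

    if order.len() != nodes.len() {
        // Simplification: the cycle path is not reconstructed here.
        return Err("dependency cycle detected".to_string());
    }
    Ok(order)
}
```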

Secret-content normalization on plan

For kinds where a content field carries a secret value, state stores only sha256 (the plaintext is unrecoverable from state) but desired carries the full plaintext at plan time. A naive diff_observed(prior, desired) would emit content: null -> "<plaintext>" on every plan, leaking the value into CLI output.

build_plan normalizes desired before diffing. The kinds that opt into this live in a const SECRET_CONTENT_TO_SHA: &[(&str, &str)]:

kind                 content field
system_secret_file   content

For each entry, normalize_for_plan(kind, attrs) clones attrs, removes the named field, and inserts sha256: <hex> derived from its UTF-8 bytes. Diff then compares sha against sha — exactly the same shape state holds. Plaintext never reaches the diff.

This is the inverse half of the kind's own apply-time unchanged check (which compares the same sha against prior state to decide whether to re-upload). The two together guarantee that a plaintext secret never appears in plan output, in state, or in apply logs.

Plan-level secret redaction

After build_plan returns and after refresh_plan runs, the CLI calls Extracted::redact_plan(&mut plan) once before printing. This walk does two things:

  1. Apply substring redaction to every step's desired, prior, and per-FieldChange from / to. A leaf string containing a known secret plaintext (introduced via ${...} interpolation) gets each occurrence replaced with the inline <secret:NAME:sha256:HEX> marker. Exact-match leaves are replaced with the object marker, same as everywhere else.
  2. Drop redaction-cancelled changes. When state holds the inline substring marker and observed returns plaintext, both sides collapse to the same marker after the walk. Any FieldChange where from == to post-redaction is dropped. If an Action::Update's changes list becomes empty, the step is downgraded to Action::NoOp — drift that was only a substring-marker-vs-plaintext difference disappears entirely.

This is what stops plan --refresh from emitting spurious updates on every secret-bearing interpolated field. See Secrets: substring redaction.
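The two steps of the walk can be sketched on string-valued changes. The types are reduced stand-ins, redact and redact_action are hypothetical names, and the hex digest is passed in by the caller (shortened in the usage below) rather than computed. Exact-match leaves, which the real walk replaces with the object marker, are not modeled.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct FieldChange {
    pub field: String,
    pub from: String,
    pub to: String,
}

// Reduced Action: only the variants this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub enum Action {
    NoOp,
    Update { changes: Vec<FieldChange> },
}

/// Each secret is (name, plaintext, sha256 hex). Every occurrence of a
/// known plaintext is replaced with the inline marker.
pub fn redact(text: &str, secrets: &[(&str, &str, &str)]) -> String {
    let mut out = text.to_string();
    for (name, plaintext, hex) in secrets {
        if !plaintext.is_empty() {
            out = out.replace(*plaintext, &format!("<secret:{name}:sha256:{hex}>"));
        }
    }
    out
}

/// Redact both sides of every change, drop changes that cancel out,
/// and downgrade an emptied Update to NoOp.
pub fn redact_action(action: Action, secrets: &[(&str, &str, &str)]) -> Action {
    match action {
        Action::Update { changes } => {
            let kept: Vec<FieldChange> = changes
                .into_iter()
                .map(|c| FieldChange {
                    field: c.field,
                    from: redact(&c.from, secrets),
                    to: redact(&c.to, secrets),
                })
                .filter(|c| c.from != c.to)
                .collect();
            if kept.is_empty() {
                Action::NoOp
            } else {
                Action::Update { changes: kept }
            }
        }
        other => other,
    }
}
```

In this sketch, a change whose from side already holds the marker and whose to side holds the plaintext collapses to identical strings, so the change is dropped and the step becomes NoOp.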

Post-apply readiness wait

After every successful docker_container create or update, the apply loop pauses before moving on to the next step (which may be a dependent declared via depends_on). The wait lives in the CLI, in post_apply_wait:

  • If desired.healthcheck is present, poll docker inspect --format '{{.State.Health.Status}}' <name> once a second, up to 60 polls. Terminal statuses are healthy (proceed), unhealthy (fail the apply), or empty / none on the first poll (no health check at the docker level — proceed). starting and other interim values keep polling.
  • Otherwise, sleep 500ms. This is cosmetic — docker often needs a beat to wire networks and volumes before something else pokes the container.

Non-docker_container steps return immediately. The provider's own create / update is synchronous: git_repo clones return when done, system_secret_file returns when the SSH upload completes.

The poll loop itself is in core (poll_container_health), separated from SSH plumbing so it's unit-testable with a mocked inspector.
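A sketch of how such a loop can be tested with a mocked inspector: both the inspect call and the sleep are injected as closures. The signature, the Health enum, and the TimedOut outcome are assumptions; the text above does not say what happens when the poll budget runs out.

```rust
#[derive(Debug, PartialEq)]
pub enum Health {
    Healthy,
    NoHealthcheck,
    Unhealthy,
    TimedOut, // assumption: the source does not specify the timeout outcome
}

/// `inspect` returns the raw health status string; `sleep` is called
/// between polls (one second in the real CLI, a no-op in tests).
pub fn poll_container_health<I, S>(mut inspect: I, mut sleep: S, max_polls: u32) -> Health
where
    I: FnMut() -> String,
    S: FnMut(),
{
    for poll in 0..max_polls {
        let status = inspect();
        match status.trim() {
            "healthy" => return Health::Healthy,
            "unhealthy" => return Health::Unhealthy,
            // Empty or "none" on the first poll: no docker-level health check.
            "" | "none" if poll == 0 => return Health::NoHealthcheck,
            // "starting" and any other interim value: keep polling.
            _ => sleep(),
        }
    }
    Health::TimedOut
}
```

Injecting the sleep keeps a unit test instant while still letting it count how many waits occurred.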

Delete ordering

build_plan emits delete steps in forward topo order over state-resident depends_on edges. For two resources X and Y where X depends_on Y at runtime, X is torn down before Y — the dependent goes first so the dependency is still serving while it shuts down.

Resources without recorded depends_on edges fall through with in_degree = 0 and end up before any edged resources, in reverse-iteration order of the state BTreeMap (which preserves the prior file-order-independent behavior). This keeps the heuristic close to "leaves before roots" for hand-written configs even when no depends_on is declared.

depends_on is recorded in state at create / update time and survives across apply runs, so a delete computed against state still knows the edges the resource was declared with — even when the resource is no longer in config.

Implicit per-host resources

For every host block in the merged document, extract injects three implicit resources before any user-declared ones, addressed under the _stratum_ prefix:

addr                                   kind          purpose
ssh_exec._stratum_swap_<host>          ssh_exec      Creates a 4 GB /swapfile, enables it, persists in fstab.
system_file._stratum_sshd_oom_<host>   system_file   Drops /etc/systemd/system/ssh.service.d/oom.conf with OOMScoreAdjust=-1000.
ssh_exec._stratum_sshd_reload_<host>   ssh_exec      systemctl daemon-reload && systemctl restart ssh.

The first two exist so that under memory pressure the kernel does not kill sshd — which would lock the operator out of recovery. The third applies the drop-in. They are stable across versions and live at the front of the desired list (in_degree 0), so they apply before any user resource on the host.

In namespace mode they are routed to _shared.json so multiple namespaces sharing a host don't each try to recreate them. In bundle mode they share the single state file with everything else.

The _stratum_ prefix is reserved. User-declared resources should not use it.

Manifest discovery (namespace mode)

When -n NAME is set, the CLI resolves the config + state paths as follows. See Namespaces for the syntax.

  1. Locate the manifest. If --manifest PATH was passed, that path is used. Otherwise the CLI requires ./stratum.strat to exist; if it doesn't, the command errors.
  2. Load the manifest. Runs stratum_config::load_file(manifest), producing an Extracted with one or more namespace declarations.
  3. Look up the namespace. If the named namespace isn't in the manifest, error with the list of known namespaces.
  4. Resolve configs. The merged list is [manifest, ...ns.configs]. The manifest is always first so its top-level host / secret / provider blocks are visible to every per-namespace file. Each configs entry is absolutized at parse time against the manifest's directory.
  5. Resolve state. Priority order: explicit -s on the command line, then the namespace's body-level state =, then .stratum/<name>.json.
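The state-path priority in step 5 reduces to an Option chain. The function name and parameters below are hypothetical; only the ordering and the .stratum/<name>.json default come from the list above.

```rust
use std::path::PathBuf;

/// Priority: explicit -s, then the namespace's `state =`, then the default.
pub fn resolve_state_path(
    cli_state: Option<PathBuf>,
    ns_state: Option<PathBuf>,
    ns_name: &str,
) -> PathBuf {
    cli_state
        .or(ns_state)
        .unwrap_or_else(|| PathBuf::from(".stratum").join(format!("{ns_name}.json")))
}
```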

Passing -c together with -n is a hard error — the namespace's configs = [...] is the config list, and a -c override would silently shadow it. Bundle mode (no -n) is unchanged by namespace support.

Cross-namespace validator

Namespace mode's plan and apply run a sibling-collision check before classification. The check exists because build_plan operates inside one namespace's view of the world — it has no visibility into what other namespaces declare — so two namespaces could each plan a docker_container binding the same (host, host_port) and only discover the conflict at apply time, when one fails over a port already taken by the other.

The check, in validate_cross_namespace:

  1. Re-loads the manifest (cheap; it has no resources).
  2. For each sibling namespace (every one except the current), loads its configs with LoadOptions::allow_unresolved_secrets = true so a missing env var in some unrelated namespace doesn't block planning the current one.
  3. Walks every docker_container in every sibling, collecting:
    • Port claims. Each ports entry is parsed for the host-port half of H:C or IP:H:C. Ranges (8000-8010:...) and bare-port shapes (where docker picks the host port) are skipped.
    • Name claims. The container's name attribute, falling back to the resource's label.
  4. Checks every docker_container in the current namespace against the collected claims, erroring on the first (host, port) or (host, name) collision and naming both the offending current-namespace address and the sibling that owns the claim.

The validator is skipped entirely in bundle mode. Within a single namespace, the existing planner-side port-conflict validator catches collisions within the same desired set. The cross-namespace validator is strictly the inter-namespace layer above it.

The sibling loader uses allow_unresolved_secrets = true defensively — it's only collecting addresses, ports, and names, none of which depend on secret plaintext. If a sibling load fails for any other reason, the error is logged and that sibling is skipped (the plan still proceeds), so a broken sibling doesn't gate apply of an unrelated namespace.

Split state (namespace mode)

In namespace mode the state on disk is two files instead of one:

.stratum/
  <name>.json        # user-declared resources for namespace `<name>`
  _shared.json       # implicit per-host _stratum_* resources

State::save_split(ns_path, shared_path) walks self.resources and routes each entry by addr name: anything starting with _stratum_ goes to _shared.json, everything else to <name>.json. Both files are written every save (with parent dirs created), even when one side is empty — that keeps the next load predictable.

State::load_merged(ns_path, shared_path) is the inverse. It loads both files and unions their resources maps, with the namespace's entry winning any addr.key() collision (the more recently touched of the two, since the active scope just ran). Missing files become empty state (matches load).
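The routing and merge rules can be sketched as pure functions over the resources map, leaving file I/O aside. Resources maps the <kind>.<name> key to a string stand-in for ResourceState; split, merge, and is_shared are hypothetical names, and the name is taken to be everything after the first dot, assuming kinds contain no dot.

```rust
use std::collections::BTreeMap;

// Key "<kind>.<name>" mapped to a stand-in for ResourceState.
pub type Resources = BTreeMap<String, String>;

fn is_shared(key: &str) -> bool {
    // Assumption: kinds contain no '.', so the name follows the first dot.
    key.split_once('.')
        .map_or(false, |(_, name)| name.starts_with("_stratum_"))
}

/// Returns (namespace resources, shared resources).
pub fn split(all: &Resources) -> (Resources, Resources) {
    let (shared, ns): (Resources, Resources) = all
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .partition(|(k, _)| is_shared(k));
    (ns, shared)
}

/// Union of both maps; the namespace entry wins on a key collision.
pub fn merge(ns: &Resources, shared: &Resources) -> Resources {
    let mut merged = shared.clone();
    merged.extend(ns.iter().map(|(k, v)| (k.clone(), v.clone())));
    merged
}
```

Splitting and then merging returns the original map, which is the round trip that save_split and load_merged are described as providing.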

Bundle mode keeps using the single-file State::load(path) and State::save(path). The CLI picks the right pair via the -n flag — load_state / save_state in crates/cli/src/main.rs switch on whether a shared path is set.

The split is what lets two namespaces targeting the same host co-exist without each trying to own the per-host tuning resources. First namespace applies: _stratum_swap_*, _stratum_sshd_oom_*, _stratum_sshd_reload_* land in _shared.json. Second namespace plans: load_merged pulls them back from the shared file into its working state, so the new plan sees them as no-op. Without the split, the second apply would see them missing from its state file and recreate them, churning the swap file and restarting sshd on every cross-namespace apply.

State file shape

{
  "version": 1,
  "resources": {
    "docker_container.traefik": {
      "addr": { "kind": "docker_container", "name": "traefik" },
      "provider": "docker",
      "attrs": {
        "host": "root@192.0.2.10",
        "image": "traefik:v2.11",
        "container_id": "abc123...",
        "...": "..."
      }
    },
    "system_package.docker": { "...": "..." }
  }
}

Resources are keyed by <kind>.<name> in a BTreeMap, so the on-disk order is deterministic (lexicographic). The file is overwritten in full on every successful apply.

Secret markers in state

When a resource attr resolves from a secret ref, the provider receives plaintext but state stores a redaction marker:

{
  "env": {
    "POSTGRES_PASSWORD": {
      "__secret": "pg_password",
      "__secret_sha256": "sha256:f7c3bc1d808e04..."
    }
  }
}

The marker is written by Extracted::redact_into, called between every provider return and state.upsert. diff and diff_observed are marker-aware (see core::secret_compare): a marker compares equal to plaintext when the plaintext's hash matches the marker's __secret_sha256, and a marker-vs-marker compare uses only the hashes. This is what keeps --refresh from showing perpetual drift on secret-bearing fields. The CLI's render function prints markers as <secret:NAME sha:abc123> — six hex chars, enough to spot a rotation, not enough to attack offline.
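The rendered form can be sketched as a small formatter that keeps the first six hex characters of the digest. render_marker is a hypothetical name, and stripping the sha256: prefix is an assumption based on the marker shape shown above.

```rust
/// Render a state marker for CLI output, keeping six hex characters.
pub fn render_marker(name: &str, secret_sha256: &str) -> String {
    // Assumption: the stored value carries a "sha256:" prefix, as in the example.
    let hex = secret_sha256.strip_prefix("sha256:").unwrap_or(secret_sha256);
    let short: String = hex.chars().take(6).collect();
    format!("<secret:{name} sha:{short}>")
}
```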