Batch / Recipe Controller for Sequencing, Metering & Audit Logs
Core idea: A Batch/Recipe Controller turns a recipe into a deterministic, auditable execution: it sequences I/O and metering against a reliable timebase, then stores evidence-grade batch records that survive power and network faults, integrating with the plant over Ethernet and fieldbus.
Outcome: It makes every run repeatable and explainable: the same inputs yield the same steps, and any deviation can be traced to specific timestamps, mappings, measurements, and approved recipe versions.
H2-1. What This Controller Does and Where It Fits
A Batch / Recipe Controller is the execution core that turns a versioned recipe into a repeatable run: it orchestrates step timing, I/O sequencing, metering windows, time-stamped events, and evidence-grade batch records, then integrates the results to SCADA/HMI/historian/MES via Ethernet or fieldbus—without replacing PLC safety logic.
To avoid scope confusion, treat industrial automation as four layers with distinct responsibilities:
- Field layer (sensors, valves, actuators, meters): produces raw signals and executes physical actions.
- Control layer (PLC, remote I/O, safety interlocks): guarantees basic deterministic I/O control and failsafe defaults.
- Orchestration layer (Batch/Recipe Controller): defines and proves the execution semantics of a batch run—steps, phases, timeouts, retries, checkpoints, resumes, and operator interventions.
- Supervisory/enterprise layer (HMI/SCADA/historian/MES): visualizes, aggregates, and connects business context; it should not be the source of truth for step-by-step execution evidence.
The practical boundary is evidence and repeatability. If you need to answer “what exactly happened” at each step (with timestamps, recipe version, operator actions, and measured totals), you need a controller that owns: recipe → batch instance → step actions → event log.
Use an independent Batch/Recipe Controller when:
- Recipes change and must be traceable (versioning, approval, rollback, effective range per batch).
- Steps are complex (branching, parallel phases, resource locks, retries, conditional waits).
- Metering is contractual (totalizers, tolerances, reconciliation across devices/data sinks).
- Batch records are audited (operator actions, parameter changes, exception handling, power-loss continuity).
- Equipment spans buses (Ethernet + fieldbus devices with unified sequencing and health monitoring).
A PLC alone is typically sufficient when:
- Logic is stable, steps are few, and there is no need for recipe version governance.
- Metering is for display only (no reconciliation loop, no audit requirement).
- Power-loss recovery can restart the process without step-level resume requirements.
Design anchor: separate “hard safety” from “batch orchestration.” Keep safety interlocks and fail-safe outputs in the PLC/safety chain; let the Batch/Recipe Controller focus on deterministic sequencing and evidence-grade records.
- batch_id
- recipe_version
- event_ts (monotonic + UTC)
- meter_window_id
- step_state (start/end/abort/resume)
H2-2. Scope Guard and Acceptance Criteria
This section converts “batch control” from a concept into engineering criteria that can be signed off. Each line below is written as: metric (what to measure) + evidence (what to capture) + first fix (what to change first).
Acceptance checklist (10 lines):
- Step timing accuracy — specify max jitter and timestamp resolution; capture step_start/step_end deltas; first fix: prioritize sequencer + async logging.
- Event continuity under power loss — define max loss window (e.g., ≤N seconds); capture power-fail test log + recovery proof; first fix: journal + checkpoint cadence.
- Recipe version governance — define versioning, approval, rollback; capture who/when/why + effective batch range; first fix: bind recipe_version immutably per batch_id.
- Metering reconciliation — define allowable batch total error; capture totalizer vs meter vs historian comparison; first fix: lock sampling window + calibration version per batch.
- Comms resilience — define reconnect time + stale-data marking; capture drop/reconnect replay; first fix: watchdog + offline queue + explicit stale state.
- Interlocks and safe stop — define permissives, timeouts, safe outputs; capture interlock test cases and safe-state table; first fix: keep hard safety in PLC/safety chain.
- Audit trail completeness — require who/what/when/why/impact; capture export sample linked to batch_id; first fix: route all changes through a single audited API.
- Storage lifetime — estimate years at events/sec and write amplification; capture wear model + leveling strategy; first fix: event tiering + batch commits + verbosity caps.
- Time-base boundary — define RTC/NTP/PTP roles + sync health; capture offset/holdover/sync_state in logs; first fix: dual timestamps (monotonic + UTC) per event.
- Commissioning readiness — require simulator/replay/self-test; capture dry-run report + deterministic replay; first fix: build test-first with recorded trace playback.
Scope guard: this controller must not become “everything.” It owns sequencing + evidence, while the PLC owns failsafe I/O and SCADA/MES own visualization and business workflow. That separation is itself an acceptance criterion because it prevents untestable coupling.
H2-3. Data Model: Recipe → Batch → Step → Action
A robust batch system is built on an executable, auditable data model. The core rule is to separate the static contract (Recipe) from the runtime instance (Batch), then break execution into provable units (Step/Phase) and replayable atoms (Action).
Recipe (static template) defines the approved structure and parameters. It must be versioned and immutable per approval cycle. A recipe is not “a spreadsheet of fields”; it is a graph of steps with branching/parallelism rules, entry/exit conditions, timeout policies, and parameter constraints.
Batch (runtime instance) binds to one recipe version and records what happened in the real world. This binding is what makes audit and troubleshooting possible: a batch_id must always map to a single recipe_version.
Step / Phase (execution unit) is the smallest unit that should have a clear start/end boundary, a measurable outcome, and a deterministic transition rule. Steps can be sequential, parallel, looped, or conditional—yet still must be explainable via logged state transitions.
Action (replayable atom) is the minimum work item the sequencer executes: setting I/O states, applying setpoints, opening a metering window, verifying stability, waiting for a condition, or emitting a log marker. Actions must carry enough parameters to be replayed in simulation or trace playback.
Design binding rules (evidence-grade)
(1) Batch → Recipe version binding: batch_id always points to exactly one recipe_version. (2) Event → Action binding: every event references (batch_id, step_id, action_id). (3) Metering → Window binding: measured totals bind to meter_window_id with calibration version.
- recipe_id
- recipe_version
- batch_id
- step_id
- action_id
- meter_window_id
- event_seq
- event_ts_mono
- event_ts_utc
- calib_version
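The three binding rules can be expressed as a minimal data model. This is an illustrative Python sketch, not a prescribed schema; the names `RecipeRef`, `Batch`, `Event`, and `event_key` are assumptions:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class RecipeRef:
    recipe_id: str
    recipe_version: str          # immutable per approval cycle

@dataclass
class Batch:
    batch_id: str
    recipe: RecipeRef            # binding rule (1): exactly one recipe_version

@dataclass(frozen=True)
class Event:
    event_seq: int
    batch_id: str
    step_id: str
    action_id: str               # binding rule (2): every event names its atom
    ts_mono_ns: int
    ts_utc: str

def event_key(ev: Event) -> Tuple[str, str, str]:
    """Join key used to localize cause during audits."""
    return (ev.batch_id, ev.step_id, ev.action_id)
```

Freezing `RecipeRef` and `Event` mirrors the immutability requirement: once written, a batch's recipe binding and its events are never edited in place.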
H2-4. Sequencing Engine: Determinism, Concurrency, and Interlocks
Determinism is not a claim—it is a design property that can be measured. A sequencing engine is deterministic when the same inputs and time base produce the same step transitions, and when every transition can be explained by logged evidence (events, interlocks, timeouts, and resource arbitration).
Scan loop vs event loop: treat them as a boundary decision, not a debate. Periodic sampling is useful for filtering and stable condition checks, while event-driven dispatch is essential for bus updates, metering window edges, timeouts, alarms, and operator interventions. A practical architecture uses a single prioritized event queue where both periodic ticks and asynchronous signals land.
Concurrency is controlled by resources, not hopes. Parallel steps must declare what they own: exclusive actuators (valves/pumps), shared sensors, and rate-limited services (storage/network). The sequencer must provide resource locking and arbitration so conflicts produce predictable outcomes.
Interlocks/permissives must be evidence-grade. A raw input should not flip a batch state instantly. Use a condition tree with explicit debounce windows, optional voting windows, and a defined trip latch policy (latched until operator reset vs auto-clear).
Timeouts define safety and recoverability. Separate: soft abort (stop progression and preserve diagnostics), hard abort (force safe outputs, typically via PLC/safety chain), and the safe state definition (what every output becomes). Recovery is then a choice: resume from a step checkpoint or restart the batch, based on checkpoint granularity and metering state.
- event_queue_depth
- dispatch_latency_ms
- resource_id
- owner_step_id
- wait_reason
- interlock_id
- debounce_ms
- vote_window_ms
- timeout_ms
- retry_count
- checkpoint_id
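One common way to realize the single prioritized event queue described above is a heap keyed by (priority, sequence number), so equal-priority events dispatch in arrival order and the dispatch order is fully deterministic. A minimal Python sketch with illustrative priority constants:

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, List

# Priority classes: lower value dispatches first (assumed ordering).
PRIO_SAFETY, PRIO_TIMEOUT, PRIO_BUS, PRIO_TICK = 0, 1, 2, 3

@dataclass(order=True)
class QueuedEvent:
    priority: int
    seq: int                      # tie-breaker keeps dispatch deterministic
    payload: Any = field(compare=False)

class EventQueue:
    """Single prioritized queue where periodic ticks and async signals land."""
    def __init__(self) -> None:
        self._heap: List[QueuedEvent] = []
        self._seq = 0

    def push(self, priority: int, payload: Any) -> None:
        self._seq += 1            # monotonically increasing arrival stamp
        heapq.heappush(self._heap, QueuedEvent(priority, self._seq, payload))

    def pop(self) -> Any:
        return heapq.heappop(self._heap).payload
```

Because the tie-breaker is an arrival sequence rather than a wall-clock time, replaying the same input stream reproduces the same dispatch order, which is the measurable property determinism demands.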
H2-5. I/O Sequencing: Commissioning-Friendly Mapping & Safety Defaults
Commissioning becomes predictable when I/O is not hard-coded. A maintainable architecture separates logical intent (process meaning) from physical realization (wiring / addresses), with a versioned mapping layer that is auditable, testable, and safe by default.
I/O mapping layers: define process-facing tags such as Valve_01_Open or Pump_A_Start, then bind them to physical channels (local I/O, remote I/O, or fieldbus registers) via a single mapping contract. This contract should include polarity, scaling, and protocol binding, plus lifecycle controls (mapping_version, approvals, and effective dates).
Safety defaults (fail-safe outputs) must be explicit per output and per fault category. “Safe” is not one value—it is a policy: drop vs hold, latched vs auto-clear, and the recovery rule. Hard safety trips should be enforced by PLC/safety chains; the batch controller should orchestrate orderly stops and log the evidence of every transition.
Input semantics determine deterministic behavior: edge vs level, debounce windows, and filter principles must be part of the design. Raw inputs should not drive step transitions directly; they should pass through debounce/filter stages and emit events that can be traced in logs.
Manual override is a controlled mode, not ad-hoc output toggling. Enter/exit conditions, role-based permission, and time-bounded overrides are required. Every manual action must create an audit event that links back to batch_id/step_id where applicable.
Dry-run and forced I/O enable safe commissioning and reproducible troubleshooting. Dry-run advances the sequencer without driving physical outputs. Forced I/O must be time-limited, clearly flagged, and recorded with who/when/why, so batch evidence remains trustworthy.
Evidence fields to log
mapping_version • effective_from • approved_by • raw_state/debounced_state • debounce_ms • filter_type • forced_flag • override_role • safe_state_reason
- logical_tag
- protocol_binding
- physical_channel
- polarity
- scaling
- mapping_version
- safe_value
- recover_policy
- forced_flag
- override_role
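One row of the mapping contract for a digital point can be sketched as a record; all field names here (`active_low`, `safe_value`, `mapping_version`) are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DigitalBinding:
    """One versioned mapping-contract row for a digital point."""
    logical_tag: str          # process meaning, e.g. Valve_01_Open
    physical_channel: str     # wiring/address realization
    active_low: bool          # polarity of the wire level
    safe_value: bool          # explicit fail-safe output state
    mapping_version: str

def read_logical(binding: DigitalBinding, raw: bool) -> bool:
    """Translate the wire level into process meaning (polarity applied)."""
    return (not raw) if binding.active_low else raw

def fault_output(binding: DigitalBinding) -> bool:
    """On any fault class, drive the declared safe value, never 'last state'."""
    return binding.safe_value
```

Keeping polarity and the safe value in the contract row, rather than in sequencer logic, is what lets a wiring change ship as a new `mapping_version` instead of a code change.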
H2-6. Metering & Timing: Sampling Windows, Totalizers, and Reconciliation
Metering is the proof layer of a batch. The goal is not “a number on a screen” but a result that can be recomputed, reconciled, and audited. That requires explicit sampling windows, stable time bases, versioned calibration, and reconciliation outputs.
Instantaneous vs totalized: instantaneous measurements are noisy and time-dependent; totalized values depend on the integration method and the accuracy of timestamps. Drift typically comes from sampling cadence mismatch, filter delay, time-base jumps, or calibration mismatch—so all of those must be visible in evidence logs.
Sampling windows define where metering starts and stops. Windows should be triggered by step events and captured with monotonic timestamps for ordering, plus UTC timestamps for correlation across systems. Filtering should be minimal and traceable (type + parameter), so totals can be recomputed in offline replay.
Metering events should be explicit: start/stop, threshold reached, and stability conditions. For example, a stable condition can be defined as “variance below a tolerance for a continuous duration,” which prevents transient spikes from closing a window early.
Reconciliation compares three totals: (1) batch total computed from the window, (2) meter-device totalizer, and (3) historian total. The output should include PASS/WARN/FAIL plus a delta value and a hint category (window mismatch, calibration mismatch, missing samples, time jump).
Calibration must be versioned. Every metering window must record the calibration version (and validity), otherwise the batch record cannot prove what coefficients were used when the result was produced.
Evidence fields to log
sample_period_ms • filter_type/filter_param • window_start_ts_mono • window_end_ts_mono • totalized_value • unit • calib_version • reconcile_status • delta
- meter_window_id
- sample_period_ms
- filter_type
- window_start_event
- window_stop_event
- totalized_value
- calib_version
- meter_total
- historian_total
- reconcile_status
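Window totalization and the three-way comparison can be sketched as follows; the trapezoidal integration method and the 0.5% / 1.0% WARN/FAIL thresholds are assumptions, not mandated values:

```python
def totalize(samples):
    """Trapezoidal integration over (t_seconds, flow_per_second) samples
    taken inside one metering window (monotonic timestamps assumed)."""
    total = 0.0
    for (t0, f0), (t1, f1) in zip(samples, samples[1:]):
        total += 0.5 * (f0 + f1) * (t1 - t0)
    return total

def reconcile(batch_total, meter_total, historian_total,
              warn_pct=0.5, fail_pct=1.0):
    """Compare the three totals; returns (status, worst delta in percent)."""
    ref = meter_total if meter_total else 1.0
    worst = max(abs(batch_total - meter_total),
                abs(batch_total - historian_total)) / abs(ref) * 100.0
    if worst <= warn_pct:
        return "PASS", worst
    if worst <= fail_pct:
        return "WARN", worst
    return "FAIL", worst
```

Because the filter and integration method are explicit, the same `totalize` can be rerun in offline replay against the raw samples, which is what makes the batch total recomputable.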
H2-7. Time Base: RTC, NTP, PTP/1588 and Timestamp Strategy
A reliable batch record needs two different “time truths”: ordering time and correlation time. Event ordering must be based on a monotonic clock to survive NTP/PTP steps, while cross-system correlation uses UTC wall time plus a recorded sync state.
RTC provides a retained time baseline across power loss. It is not a precision sync tool, but it prevents wall time from resetting to an invalid value after reboot. NTP is appropriate for general synchronization where millisecond-level alignment is acceptable, but it can introduce time steps (sudden jumps). PTP/1588 targets sub-millisecond alignment and is preferred when multiple devices must agree on window edges, sequence timing, or fine-grained measurements.
Timestamp rule: every evidence-grade event should record a dual timestamp: ts_mono_ns (monotonic for strict ordering) and ts_utc (UTC for correlation), plus sync_state and offset so time quality is visible during audits. When the system enters holdover or becomes unsynchronized, the record must still be continuous and clearly marked.
Clock health should be observable and actionable: capture sync_state, offset, last_sync_ts, holdover_age, and time_step_count. If sync quality degrades, the controller can apply a deterministic downgrade policy (e.g., more conservative metering decisions, reconciliation marked WARN, or operator confirmation for critical changes).
Multi-source redundancy avoids single points of failure. A practical priority policy is: PTP_LOCKED > NTP_SYNC > HOLDOVER > RTC_ONLY. Any source switch should generate a dedicated event so later investigations can explain time-quality changes.
Evidence fields to log
ts_mono_ns • ts_utc • sync_state • offset_ms • holdover_age_s • last_sync_ts • time_step_count • time_source
- ts_mono_ns
- ts_utc
- sync_state
- offset_ms
- holdover_age_s
- last_sync_ts
- time_step_count
- time_source
- primary/backup
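A minimal sketch of the dual-timestamp rule and the source-priority policy; in a real deployment `sync_state` would come from the platform's clock-health monitor rather than being passed in:

```python
import time

# Priority policy from the text: best available source wins.
PRIORITY = ["PTP_LOCKED", "NTP_SYNC", "HOLDOVER", "RTC_ONLY"]

def best_source(available):
    """Deterministic source selection per the stated priority policy."""
    for state in PRIORITY:
        if state in available:
            return state
    return "RTC_ONLY"

def capture_event_times(sync_state):
    """Dual timestamp per event: monotonic for ordering, UTC for correlation.
    The sync_state is recorded so time quality is visible during audits."""
    return {
        "ts_mono_ns": time.monotonic_ns(),
        "ts_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "sync_state": sync_state,
    }
```

Sorting and interval math use `ts_mono_ns`; `ts_utc` is display and cross-system correlation only, so an NTP step never reorders the record.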
H2-8. Event Log & Audit Trail: Evidence-Grade Records (EBR-ready)
Evidence-grade logging is not “debug text.” It is a structured chain that can answer: who did what, when, with which version, and which batch was affected. Logs should be linkable (join keys), exportable, and tamper-evident.
Event classes should be audit-oriented: process events (step transitions, metering window edges, interlock trips), configuration changes (recipe releases, mapping_version changes, thresholds), and security-relevant actions (manual override, forced I/O, role changes, auth failures). Each class requires different mandatory fields, but all must include batch_id and trustworthy timestamps.
Tamper-evidence can be achieved with a lightweight hash chain: each event stores hash_prev and a computed hash_curr over key fields. Optional signatures can be applied to high-value milestones (e.g., batch completion or recipe publish) without signing every log line.
Capacity strategy should be tiered and deterministic: a RAM ring buffer for high-frequency debug traces, an append-only flash log for durable evidence events, and a server/historian tier for long-term retention and search. If a loss window occurs (power failure before flush), the system should record the loss interval explicitly.
Join keys are the backbone of traceability: batch_id, recipe_version, operator_id, device_id, plus event_seq and step_id/action_id to localize cause. Mapping_version and calib_version should also be captured, so I/O behavior and metering results remain provable in audits.
Export should be simple and consistent: CSV/JSON for evidence packages, and structured feeds to SCADA/historians (tags or topics) for monitoring. Export should never strip the fields that prove integrity and linkage.
Minimum evidence fields
event_seq • event_type • ts_mono_ns • ts_utc • sync_state • batch_id • recipe_version • operator_id • device_id • step_id/action_id • result • hash_prev/hash_curr
- event_seq
- event_type
- hash_prev
- hash_curr
- batch_id
- recipe_version
- operator_id
- device_id
- mapping_version
- calib_version
- export_ref
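A lightweight hash chain of the kind described can be sketched as follows; SHA-256 over canonicalized key fields is an assumption, and any stable digest would serve:

```python
import hashlib
import json

def chain_event(prev_hash: str, event: dict) -> dict:
    """Append-only chain: hash_curr covers the key fields plus hash_prev."""
    body = dict(event, hash_prev=prev_hash)
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    body["hash_curr"] = digest
    return body

def verify_chain(events) -> bool:
    """Recompute each link; editing any field breaks every later hash."""
    prev = "GENESIS"
    for ev in events:
        stripped = {k: v for k, v in ev.items()
                    if k not in ("hash_prev", "hash_curr")}
        if chain_event(prev, stripped)["hash_curr"] != ev["hash_curr"]:
            return False
        prev = ev["hash_curr"]
    return True
```

This is tamper-evidence, not tamper-proofing: the chain proves that stored events were not altered after the fact, while signatures on milestones (as noted above) bind the chain to an identity.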
H2-9. Recipe & Event Storage: Media Choice, Wear, Power-Fail, and Integrity
Storage in a batch controller is not “saving files.” It is an engineering mechanism that must support high-frequency writes, long service life, power-fail consistency, and verifiable integrity. The design starts by separating workload types: recipes are versioned and read-mostly; events are append-heavy and must remain replayable after a sudden reboot.
Workload split: a recipe is a compact, versioned artifact (approve, publish, rollback), so it should be stored in a slot-based or image-based format that supports atomic activation (A/B slot or dual-image). An event log is a continuous stream; it should be written as an append-only journal with explicit commit markers so incomplete tails can be detected and truncated after power loss.
Power-fail consistency requires a deterministic commit path. For event journal entries, a safe pattern is: write payload → write length/CRC (or hash) → write commit marker → advance pointer atomically. For recipes, write the new version into an inactive slot, validate it, then flip an active pointer as the final step. Recovery should replay from the last checkpoint plus verified journal entries.
Wear and lifetime are controlled by minimizing write amplification and avoiding hot spots. Depending on the write rate and retention needs, media choices may include FRAM/MRAM for high-endurance logging, or NOR/SD for larger capacity when paired with journal-structured writes and explicit health monitoring. When media is block-based, bad-block handling and a clear allocation strategy are required.
Integrity and rollback must be evidence-ready. Every stored recipe should include version metadata, and every event should carry a checksum and sequence number. If a rollback occurs, it should create a dedicated audit event with from/to versions and a reason, so downstream reports can explain changes in batch outcomes.
Tiering improves resilience: keep hot, queryable state locally (recent events + current batch snapshot), store durable evidence in append-only flash media, and export cold archives to a server/historian. Offline buffering should be treated as a normal mode, with explicit “catch-up” export after reconnection.
Evidence fields to log
recipe_version • approval_id • active_slot • event_seq • crc/hash • commit_marker • checkpoint_last_seq • recovery_path • bytes_written_total • storage_health_state
- recipe_version
- active_slot
- event_seq
- commit_marker
- crc/hash
- checkpoint
- recovery_event
- write_amplification
- wear_index
- bad_block_count
- storage_health_state
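The commit path and tail-truncating recovery can be sketched as an append-only byte journal; the record layout here (length, payload, CRC-32, one-byte commit marker) is illustrative:

```python
import struct
import zlib

MAGIC = 0xC0  # commit marker byte (illustrative value)

def append_entry(journal: bytearray, payload: bytes) -> None:
    """Commit path: payload -> length+CRC -> commit marker, in that order."""
    journal += struct.pack("<I", len(payload))
    journal += payload
    journal += struct.pack("<I", zlib.crc32(payload))
    journal += bytes([MAGIC])

def recover(journal: bytes):
    """Replay verified entries; drop the unfinished tail after power loss."""
    entries, pos = [], 0
    while pos + 4 <= len(journal):
        (length,) = struct.unpack_from("<I", journal, pos)
        end = pos + 4 + length + 4 + 1
        if end > len(journal):
            break                      # torn tail: truncate here
        payload = journal[pos + 4 : pos + 4 + length]
        (crc,) = struct.unpack_from("<I", journal, pos + 4 + length)
        if crc != zlib.crc32(payload) or journal[end - 1] != MAGIC:
            break                      # corrupt or uncommitted: stop here
        entries.append(payload)
        pos = end
    return entries
```

The key property is that a cut at any byte position yields either a fully verified entry or a detectable torn tail, never a silently corrupted record.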
H2-10. Ethernet & Fieldbus Integration: Mapping, Latency, and Failover
Connectivity is not “link up.” Integration must define a tag contract, a connection state machine, a latency budget, and a failover behavior. Control decisions must remain deterministic even when data becomes stale or links flap.
Protocol stacks may include Modbus TCP, EtherNet/IP, PROFINET, or OPC UA, but the engineering value is above the protocol: a versioned tag map with stable naming, units, scaling, and compatibility rules. The controller should expose tag_map_version and device profiles so commissioning changes can be audited and rolled out safely.
Connection management must be explicit: reconnect loops, watchdog timers, and stale-data marking. “Stale” is not just missing updates; it means the value must not be used for step gating or safety decisions. Each tag should have an age_ms (or last_update_ts_mono) and a quality flag (GOOD/STALE/BAD) so behavior is explainable in logs.
Latency requires plane separation: keep control-plane updates isolated from telemetry-plane reporting. Control-plane traffic is bounded and deterministic (step gating, key setpoints), while telemetry can be buffered and bursty (trends, historian feeds). Telemetry congestion must not affect control decisions.
Failover should preserve safety and evidence: link switches, dual networks, or redundancy principles (e.g., parallel paths) must trigger a link_failover_event, and quality flags should show degraded states during switching. After reconnection, the system should emit re-sync completion events and reconcile any buffered telemetry exports.
Evidence fields to log
tag_map_version • device_profile_id • quality(GOOD/STALE/BAD) • age_ms • watchdog_state • reconnect_count • link_failover_event • control_plane_latency_ms • telemetry_backlog
- tag_map_version
- device_profile_id
- quality
- age_ms
- watchdog_state
- reconnect_count
- stale_flag
- control-plane
- telemetry-plane
- link_failover_event
- re-sync
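Stale marking and deterministic gating can be sketched as a pure function of sample age; the 500 ms and 2000 ms budgets below are placeholders for whatever the tag contract declares:

```python
def tag_quality(age_ms: float, stale_ms: float = 500.0,
                bad_ms: float = 2000.0) -> str:
    """Derive the quality flag from sample age against declared budgets."""
    if age_ms <= stale_ms:
        return "GOOD"
    if age_ms <= bad_ms:
        return "STALE"
    return "BAD"

def allow_step_gating(quality: str) -> bool:
    """Deterministic rule: only GOOD values may gate step transitions."""
    return quality == "GOOD"
```

Because quality is derived, not asserted, a flapping link cannot leave a tag marked GOOD: the age keeps growing until the flag degrades, and every gating refusal is explainable from the logged `age_ms`.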
H2-11. Validation & Commissioning Playbook (Field-Ready Tests)
This playbook turns system claims into repeatable evidence. Each test is written as a 3-line SOP: Goal (what must be proven), Method (how to reproduce), and Evidence (which fields/waveforms/exports to capture). Pass criteria are explicit so commissioning results are auditable.
Evidence rule
Use ts_mono_ns for interval/jitter calculations, keep ts_utc + sync_state for correlation, and ensure every test produces an exportable artifact (CSV/JSON) linked by batch_id and event_seq.
T1 — Step Timing (prove jitter upper bound)
Goal: For the same recipe and inputs, key step edges show bounded jitter (e.g., P99 ≤ X ms).
Method: Run the same batch N times (e.g., 30). Compare step enter/exit, I/O set, and metering window edges.
Evidence: Export a timing report keyed by batch_id/step_id with ts_mono_ns, delta_ms, P95/P99, and time_step_count.
Pass: P99 jitter stays within the declared limit; no missing event_seq segments.
- ts_mono_ns
- event_seq
- delta_ms
- P99
- time_step_count
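The P99 figure in the timing report can be computed with the nearest-rank method over per-run `ts_mono_ns` deltas; a sketch, where the limit is whatever the declaration specifies:

```python
import math

def p99(deltas_ms):
    """Nearest-rank P99 over per-run step-edge deltas (ts_mono_ns based)."""
    ranked = sorted(deltas_ms)
    rank = math.ceil(0.99 * len(ranked))   # nearest-rank method
    return ranked[rank - 1]

def timing_verdict(deltas_ms, limit_ms):
    """Pass/fail against the declared jitter limit, with the measured value."""
    jitter = p99(deltas_ms)
    return ("PASS" if jitter <= limit_ms else "FAIL", jitter)
```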
T2 — Interlocks (debounce + voting window)
Goal: Bouncy inputs do not cause false trips; sustained violations trip within T.
Method: Inject controlled chatter (10–50 ms) and then sustained faults (> window). Repeat across multiple sensors/inputs.
Evidence: Capture interlock decision events with debounce_ms, vote_window_ms, gate state, and the resulting safe_state transition.
Pass: No false stops during chatter; sustained faults always produce a deterministic trip record.
- debounce_ms
- vote_window_ms
- interlock_gate
- safe_state
- result_code
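The debounce-plus-voting decision can be sketched as a fraction-of-window rule; the 0.8 vote threshold is an assumed policy value:

```python
def interlock_trip(samples, debounce_ms, vote_threshold=0.8):
    """samples: list of (t_ms, violated: bool) observations.
    Trip only if the violation persists: the violated fraction inside the
    trailing debounce window must meet the vote threshold."""
    if not samples:
        return False
    t_end = samples[-1][0]
    window = [v for (t, v) in samples if t >= t_end - debounce_ms]
    return sum(window) / len(window) >= vote_threshold
```

Chatter alternating at 10–50 ms never reaches the threshold, while a sustained violation saturates the window and trips deterministically, which is exactly the T2 pass criterion.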
T3 — Metering (integration error + reconciliation)
Goal: Known pulse/flow profiles produce totals within the declared error bound (e.g., ≤ X%).
Method: Feed step, ramp, and pulsed profiles. Validate both instantaneous and totalized values under noise/low-flow cases.
Evidence: window_start/stop events, totalizer snapshots, calib_version, and a reconciliation export (batch_total vs meter_total vs historian_total).
Pass: Error stays within bounds across profiles; calibration version is traceable per batch.
- window_start
- window_stop
- totalizer
- calib_version
- reconcile
T4 — Power-Fail (50 random cuts, evidence continuity)
Goal: After random power loss, the controller recovers to a consistent state and preserves evidence (or records loss explicitly).
Method: Cut power at varied phases (writing events, switching steps, exporting). Repeat at least 50 times with different cut points.
Evidence: recovery_event with recovery_path, checkpoint_last_seq, journal tail verification (truncate point), and loss_interval_s if any.
Pass: System resumes without silent gaps; any gap is declared and bounded.
- recovery_event
- checkpoint_last_seq
- truncate_tail
- loss_interval_s
- hash_chain
T5 — Storage Wear (write budget → lifetime boundary)
Goal: Under the expected write rate, projected life meets the target (years) with observable health telemetry.
Method: Estimate events/s and bytes/event; compute daily writes; optionally run accelerated logging for a fixed duration to validate counters.
Evidence: bytes_written_total, wear_index (or erase_count_max), bad_block_count, storage_health_state, plus the assumptions used in the write budget.
Pass: Budgeted lifetime meets target; health indicators remain stable and monotonic.
- bytes_written_total
- wear_index
- erase_count_max
- bad_block_count
- health_state
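The write-budget arithmetic behind the lifetime projection fits in a few lines; every parameter here, including the 2x write amplification, is an assumption to be replaced by measured values:

```python
def lifetime_years(events_per_s: float, bytes_per_event: float,
                   capacity_bytes: float, endurance_cycles: float,
                   write_amplification: float = 2.0) -> float:
    """Projected media life: total tolerable bytes / daily written bytes.
    capacity * endurance is the total byte budget the media can absorb."""
    daily_writes = (events_per_s * bytes_per_event
                    * 86_400 * write_amplification)
    total_budget = capacity_bytes * endurance_cycles
    return total_budget / daily_writes / 365.0
```

For example, 100 events/s at 64 bytes/event against an 8 MiB part rated for 100k cycles projects roughly two years, which is the kind of number the pass criterion compares against the target.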
T6 — Network Drop (stale marking + deterministic gating)
Goal: When links flap, stale data is never used for step gating; reconnection is evidence-grade and explainable.
Method: Unplug/replug, inject loss/jitter, and force repeated reconnects. Validate both control-plane and telemetry-plane behavior.
Evidence: quality (GOOD/STALE/BAD), age_ms, watchdog_state, reconnect_count, link_failover_event, and resync_complete_event.
Pass: Control-plane actions are blocked or degraded under STALE; transitions are fully logged.
- quality
- age_ms
- watchdog_state
- reconnect_count
- resync
T7 — Recipe Change Control (approval + rollback + impact)
Goal: Recipe changes are approved, versioned, and reversible; the impact on batches is provable.
Method: Execute a full cycle: draft → approve → publish → run → rollback → run again. Include a mapping change case.
Evidence: recipe_version, approval_id, effective_from, rollback_reason, and batch_id ↔ recipe_version linkage in the audit export.
Pass: No “mystery versions”; every batch can be tied to an approved recipe version.
- recipe_version
- approval_id
- effective_from
- rollback_reason
- impact_scope
T8 — Audit Export (hash-chain verify + reconciliation)
Goal: Export packages are consistent, verifiable, and reconcile totals across controller/meter/historian.
Method: Export CSV/JSON packages and randomly sample N events to verify hash links; run reconciliation for selected batches.
Evidence: export_ref, hash_prev/hash_curr, hash_chain_verify_result, plus reconciliation artifacts keyed by batch_id.
Pass: Chain verification passes and reconciliation differences are within declared tolerances.
- export_ref
- hash_prev
- hash_curr
- verify_result
- reconcile
Example MPNs for commissioning setups (reference BOM snippets): The parts below are commonly used building blocks to implement or validate the mechanisms above (timebase, storage, power-fail, network I/O, tamper-evident logs). Select equivalents per platform and availability.
Time base & timestamping
- NXP PCF2129
- Microchip MCP79410
- Analog Devices MAX31341
Use RTCs for power-loss retention; pair with NTP/PTP on the Ethernet side as needed.
Power-fail robustness (supervisor / watchdog)
- Texas Instruments TPS3839
- Texas Instruments TPS3430
- Analog Devices ADM809
Useful for T4: repeatable resets and brownout behavior while validating recovery events.
Nonvolatile storage (recipe + journal)
- Infineon/Cypress FM25V10
- Fujitsu MB85RS64V
- Everspin MR25H40
- Winbond W25Q64JV
- Macronix MX25L128
FRAM/MRAM fit write-heavy journals; SPI NOR is common for recipe images + journal/checkpoint structures.
Ethernet / field I/O building blocks
- WIZnet W5500
- Texas Instruments DP83826E
- Microchip KSZ8081RNA
- Texas Instruments SN65HVD72
- Analog Devices ADM2587E
- Analog Devices ADuM1250
Ethernet controller/PHY + RS-485/isolated RS-485 are common for Modbus/fieldbus bridges and T6 tests.
Evidence integrity helpers
- Microchip ATECC608B
- NXP SE050
Secure elements can support key storage for signing milestone records (optional, aligns with T8).
H2-12. FAQs (Troubleshooting, Evidence-First)
Each answer follows a fixed field-ready structure: one-sentence conclusion + two evidence checks + one first fix. Every question points back to the chapters that define the proof fields.
Batch stops unexpectedly — interlock glitch or timeout policy? (→ H2-4 / H2-5)
Conclusion: Most “unexpected stops” are either a noisy interlock gate or an overly aggressive timeout policy; the faster the stop and the less context in the log, the more likely an interlock decision path is missing.
Evidence: Check interlock events for debounce_ms/vote_window_ms and interlock_gate_state; then verify timeout_reason, step_id, and retry_count for the same window.
First fix: Enable decision-summary logging for interlocks and downgrade non-critical timeouts to soft-abort before hard-abort.
- debounce_ms
- vote_window_ms
- timeout_reason
- retry_count
Same recipe, different yield — metering reconciliation or timing drift? (→ H2-6 / H2-7)
Conclusion: Yield variation under the same recipe is usually a reconciliation gap (window/calibration) or a timebase issue that shifts windows and gates; if totals disagree across sources, metering is first suspect.
Evidence: Compare batch_total vs meter_total vs historian_total and confirm calib_version; then validate window edges with ts_mono_ns and check sync_state/time_step_count.
First fix: Lock window start/stop conditions and bind calib_version into every batch record before tuning time sync.
- calib_version
- window_start
- ts_mono_ns
- sync_state
Events look out of order — wall clock step or missing monotonic stamps? (→ H2-7 / H2-8)
Conclusion: “Out-of-order” events are typically caused by wall-clock steps (NTP/PTP corrections) or by logging without a stable monotonic timeline; relying on UTC alone makes ordering fragile during sync changes.
Evidence: Confirm every event includes both ts_mono_ns and ts_utc plus sync_state; then check event_seq continuity and look for time_step_count spikes around the suspected interval.
First fix: Sort/compute by monotonic time, display by UTC, and enforce dual-timestamp + sequence as mandatory log fields.
- ts_mono_ns
- ts_utc
- event_seq
- time_step_count
Recipe update broke only some lines — tag mapping versioning or compatibility? (→ H2-10 / H2-3)
Conclusion: Partial breakage after a recipe update usually means tag-map versions are inconsistent across lines, or a compatibility rule (units/scaling/types) was violated by a new parameter schema.
Evidence: Compare tag_map_version and device_profile_id across the affected lines; then verify recipe parameter constraints (units, ranges, types) and whether the controller reports schema/version mismatch events.
First fix: Freeze execution to a single known-good tag map version and roll forward using an explicit compatibility layer for renamed or retyped tags.
- tag_map_version
- device_profile_id
- schema_version
- unit/scaling
After power loss, batch resumes wrong step — checkpoint design or commit ordering? (→ H2-9 / H2-4)
Conclusion: Wrong-step resume is almost always a checkpoint definition problem (missing state variables) or an unsafe commit order that lets pointers advance before payload is fully durable.
Evidence: Inspect checkpoint_last_seq, saved step_id/phase_state, and the recovery_path event; then validate journal tail integrity (CRC/commit marker) and whether truncation occurred before replay.
First fix: Expand checkpoint to include minimal state-machine invariants and make pointer advancement the final atomic action in the commit path.
- checkpoint_last_seq
- phase_state
- commit_marker
- recovery_path
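The commit-ordering rule is easiest to see as code: payload first, integrity field second, commit marker third, pointer advance last, and recovery trusts a slot only when marker and CRC agree. This is a minimal in-memory sketch standing in for flash writes; `toy_crc` and the slot layout are illustrative, not a real journal format.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal sketch of the commit rule: the checkpoint pointer advances
 * only after the payload AND its commit marker are durable. The
 * "journal" here is an in-memory array standing in for flash slots. */
#define SLOTS 8
typedef struct { uint32_t step_id; uint32_t crc; uint8_t committed; } slot_t;

typedef struct {
    slot_t   journal[SLOTS];
    uint32_t checkpoint_last_seq;   /* pointer, advanced LAST */
} store_t;

static uint32_t toy_crc(uint32_t v) { return v ^ 0xA5A5A5A5u; }

static void commit_step(store_t *s, uint32_t seq, uint32_t step_id) {
    slot_t *sl = &s->journal[seq % SLOTS];
    sl->step_id   = step_id;           /* 1. payload */
    sl->crc       = toy_crc(step_id);  /* 2. integrity field */
    sl->committed = 1;                 /* 3. commit marker (atomic word) */
    s->checkpoint_last_seq = seq;      /* 4. pointer advance, final action */
}

/* Recovery: trust the slot only if marker and CRC agree; otherwise the
 * write was torn and replay must fall back to an earlier checkpoint. */
static int recover_step(const store_t *s, uint32_t *step_id_out) {
    const slot_t *sl = &s->journal[s->checkpoint_last_seq % SLOTS];
    if (!sl->committed || sl->crc != toy_crc(sl->step_id))
        return -1;
    *step_id_out = sl->step_id;
    return 0;
}
```

On real media each numbered step would be followed by a flush/barrier; the ordering, not the storage type, is what prevents the wrong-step resume.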
Totalizer doesn’t match flowmeter — sampling window or calibration version mismatch? (→ H2-6)
Conclusion: A totalizer mismatch is usually caused by window boundary rules (start/stop/stability) or a calibration-version mismatch where the batch used a different scaling than the meter/historian expects.
Evidence: Verify window_start/window_stop triggers and stability criteria, then confirm calib_version and effective_from were attached to the batch record and applied consistently end-to-end.
First fix: Make calib_version mandatory per batch and add a reconciliation report that flags boundary-condition anomalies.
- window_start
- window_stop
- calib_version
- effective_from
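The window-boundary and calibration rules combine naturally into one structure: the totalizer only accumulates between explicit start/stop triggers, and the calibration version plus scaling factor are captured at window start so the batch record carries exactly what was applied. The names (`totalizer_t`, `calib_k`) are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a metering window: samples accumulate only between an
 * explicit start and stop, scaled by the calibration version attached
 * to THIS batch rather than whatever is globally current. */
typedef struct {
    int      open;
    double   total;          /* engineering units, e.g. liters */
    uint32_t calib_version;  /* recorded with the batch record */
    double   calib_k;        /* pulses-to-volume factor for that version */
} totalizer_t;

static void window_start(totalizer_t *t, uint32_t calib_version, double calib_k) {
    t->open = 1;
    t->total = 0.0;
    t->calib_version = calib_version;   /* frozen for this window */
    t->calib_k = calib_k;
}

static void add_pulses(totalizer_t *t, uint32_t pulses) {
    if (!t->open) return;               /* samples outside window ignored */
    t->total += pulses * t->calib_k;
}

static double window_stop(totalizer_t *t) { t->open = 0; return t->total; }
```

Reconciliation against the flowmeter then only has to compare two well-defined quantities: the window total with its recorded `calib_version`, and the meter's own totalizer over the same boundary events.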
Network drop causes unsafe output — stale-data handling or fail-safe defaults? (→ H2-10 / H2-5)
Conclusion: Unsafe output during network loss points to stale-data being used for gating, or to fail-safe defaults that do not drive outputs to a safe state on link/IO faults.
Evidence: Check whether quality=STALE still allows control writes and whether age_ms exceeded the declared budget; then confirm the output safe_state_table for “link down / input invalid / device absent” cases.
First fix: Hard-block control writes under STALE and enforce fail-safe outputs for every fault class before re-enabling automatic sequencing.
- quality
- age_ms
- safe_state_table
- watchdog_state
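The hard-block rule plus the safe-state table can be sketched in a few lines: a control write is permitted only when the sample is GOOD and within its freshness budget, and every fault class has an explicit entry in the table so no case falls through to "hold last value". The enum and table contents are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: a control write is permitted only when the input value is
 * GOOD and fresh; every fault class maps to an explicit safe state. */
typedef enum { Q_GOOD, Q_STALE, Q_BAD } quality_t;
typedef enum { F_LINK_DOWN, F_INPUT_INVALID, F_DEVICE_ABSENT, F_COUNT } fault_t;

/* One explicit entry per fault class; here all de-energize (0). */
static const uint8_t safe_state_table[F_COUNT] = {
    /* F_LINK_DOWN     */ 0,
    /* F_INPUT_INVALID */ 0,
    /* F_DEVICE_ABSENT */ 0,
};

typedef struct { quality_t quality; uint32_t age_ms; } sample_t;

/* Returns 1 if automatic sequencing may use this sample for a control
 * write; STALE/BAD quality or over-age data is hard-blocked. */
static int write_allowed(const sample_t *s, uint32_t age_budget_ms) {
    return s->quality == Q_GOOD && s->age_ms <= age_budget_ms;
}

static uint8_t output_on_fault(fault_t f) { return safe_state_table[f]; }
```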
Audit trail fails an inspection — missing change reason or weak linkage to batch_id? (→ H2-8)
Conclusion: Audit failures usually come from missing change justification/approval fields, or from broken linkage where configuration events cannot be tied to the specific batch_id and recipe_version they influenced.
Evidence: Confirm change_reason, approval_id, and actor identity are present for config changes; then sample exports and ensure batch_id and recipe_version are attached to every relevant event in the chain.
First fix: Make change-reason and batch linkage mandatory fields and reject writes that cannot be attributed and scoped.
- change_reason
- approval_id
- batch_id
- recipe_version
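"Reject writes that cannot be attributed and scoped" is a validation rule at the log-write boundary. A minimal sketch, assuming a flat event struct with the fields listed above; a real audit record would carry more, but the accept/reject gate looks the same.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: the log layer rejects any configuration-change event that
 * cannot be attributed (actor, approval) and scoped (batch, recipe). */
typedef struct {
    const char *change_reason;
    const char *approval_id;
    const char *actor;
    const char *batch_id;
    const char *recipe_version;
} audit_event_t;

static int nonempty(const char *s) { return s != NULL && s[0] != '\0'; }

/* Returns 0 on accept, -1 on reject; a rejected write never reaches
 * the evidence-grade journal, so gaps are impossible by construction. */
static int audit_write(const audit_event_t *e) {
    if (!nonempty(e->change_reason) || !nonempty(e->approval_id) ||
        !nonempty(e->actor) || !nonempty(e->batch_id) ||
        !nonempty(e->recipe_version))
        return -1;
    /* ...append to the journal here... */
    return 0;
}
```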
Storage wears out early — write amplification or event verbosity too high? (→ H2-9 / H2-8)
Conclusion: Early wear is typically caused by write amplification (inefficient journal/checkpoint layout) or by logging too many high-frequency events that could be aggregated without losing evidence value.
Evidence: Review bytes_written_total, wear_index/erase_count_max, and the event rate by category; then inspect checkpoint cadence and whether small updates trigger full-block rewrites.
First fix: Reduce verbosity/merge repetitive events and increase checkpoint efficiency before changing media.
- bytes_written_total
- wear_index
- events_per_second
- checkpoint_period
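"Merge repetitive events" can be done before records ever reach flash: adjacent records with identical category and value collapse into one record with a repeat count, so evidence value is preserved while bytes written drop. This is a hypothetical sketch of that coalescing step, not a real journal API.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: repetitive high-frequency events (same category, same value)
 * merge into one record with a repeat count, cutting write volume
 * without losing evidence value. */
typedef struct { uint16_t category; int32_t value; uint32_t repeat; } rec_t;

/* Coalesce adjacent duplicates in place; returns the new record count. */
static size_t coalesce(rec_t *r, size_t n) {
    if (n == 0) return 0;
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].category == r[out].category && r[i].value == r[out].value)
            r[out].repeat += r[i].repeat;      /* merge duplicate */
        else
            r[++out] = r[i];
    }
    return out + 1;
}
```

Pairing this with a longer checkpoint period usually cuts write amplification far more cheaply than swapping media.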
Time sync keeps flapping — NTP source quality or PTP grandmaster holdover? (→ H2-7)
Conclusion: Sync flapping is usually upstream time-source instability (NTP quality) or poor holdover/switchover behavior when PTP grandmaster changes; frequent time steps corrupt correlation even if monotonic ordering is intact.
Evidence: Check sync_state, offset_ms, holdover_state, and time_step_count; then correlate step events with upstream source changes and verify monotonic timestamps remain continuous.
First fix: Tighten source priority/thresholds and avoid wall-clock steps that exceed the declared event-ordering tolerance.
- sync_state
- offset_ms
- holdover_state
- time_step_count
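The "declared event-ordering tolerance" can be enforced at the point where sync corrections are applied: small offsets are absorbed, while any correction beyond the tolerance increments `time_step_count` so flapping becomes visible in the record. A minimal sketch under that assumption; the struct and threshold handling are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch: the controller tolerates small wall-clock corrections, but
 * counts any step beyond the declared ordering tolerance so source
 * flapping shows up as time_step_count spikes in the evidence. */
typedef struct {
    int64_t  utc_ns;            /* current wall clock */
    uint32_t time_step_count;   /* steps beyond tolerance */
    int64_t  step_tolerance_ns; /* declared event-ordering tolerance */
} wallclock_t;

static void apply_sync_offset(wallclock_t *c, int64_t offset_ns) {
    if (llabs(offset_ns) > c->step_tolerance_ns)
        c->time_step_count++;    /* evidence: correlation may be broken */
    c->utc_ns += offset_ns;      /* the monotonic timeline is unaffected */
}
```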
Manual override creates confusion — permission model or inadequate logging? (→ H2-5 / H2-8)
Conclusion: Override confusion typically comes from weak role/permission boundaries or from incomplete before/after logging that makes it impossible to reconstruct who changed what, when, and for how long.
Evidence: Verify operator_role, override_grant, and expiry are enforced; then confirm override events capture before/after values, duration, and affected outputs, linked to batch_id and operator identity.
First fix: Require time-bounded override grants and log the full override envelope (grant → active → revoke) as evidence-grade events.
- operator_role
- override_grant
- override_duration
- batch_id
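A time-bounded grant is a small state machine: the grant is role-checked on entry, carries an expiry, and revokes itself when the expiry passes. This sketch assumes a supervisor-only permission model purely for illustration; the roles and field names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the override envelope: a grant is role-checked, expires
 * automatically, and each transition (grant/active/revoke) would be
 * logged as an evidence-grade event. */
typedef enum { ROLE_OPERATOR, ROLE_SUPERVISOR } role_t;
typedef struct {
    int      active;
    role_t   granted_role;
    uint64_t expires_at_ms;
} override_t;

static int override_grant(override_t *o, role_t role, uint64_t now_ms,
                          uint64_t duration_ms) {
    if (role != ROLE_SUPERVISOR) return -1;   /* permission boundary */
    o->active = 1;
    o->granted_role = role;
    o->expires_at_ms = now_ms + duration_ms;  /* time-bounded, always */
    return 0;
}

/* Checked on every scan: an expired grant revokes itself, so an
 * override can never silently persist past its window. */
static int override_active(override_t *o, uint64_t now_ms) {
    if (o->active && now_ms >= o->expires_at_ms)
        o->active = 0;                        /* auto-revoke, log event */
    return o->active;
}
```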
Latency spikes during heavy logging — priority inversion or storage blocking? (→ H2-4 / H2-9)
Conclusion: Logging-induced latency spikes usually indicate priority inversion (the high-priority sequencer waiting on a resource held by lower-priority logging work) or synchronous storage commits that stall the main loop under bursty event rates.
Evidence: Compare loop_latency_ms/scan_time_ms with storage_commit_time_ms and queue depth; then confirm whether log writes are synchronous or buffered and whether backpressure throttles event generation safely.
First fix: Move logging to an async queue with bounded flush policy, and protect the sequencer with priority and budget enforcement.
- loop_latency_ms
- storage_commit_time_ms
- queue_depth
- backpressure
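The "async queue with bounded flush policy" fix amounts to a bounded ring between the sequencer and the storage task: the sequencer never blocks, and when the ring is full, backpressure drops and counts the record instead of stalling the control loop. A single-producer/single-consumer sketch with illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: the sequencer enqueues log records into a bounded ring and
 * never blocks on storage; when the ring is full, backpressure drops
 * the record and counts it rather than stalling the control loop. */
#define RING 4
typedef struct {
    uint32_t buf[RING];
    uint32_t head, tail;      /* head: next write, tail: next flush */
    uint32_t dropped;         /* backpressure evidence counter */
} logq_t;

/* Called from the high-priority sequencer; O(1), never waits. */
static int logq_push(logq_t *q, uint32_t rec) {
    if (q->head - q->tail >= RING) { q->dropped++; return -1; }
    q->buf[q->head % RING] = rec;
    q->head++;
    return 0;
}

/* Called from the low-priority flush task, never from the sequencer;
 * this is where the synchronous storage commit actually happens. */
static int logq_pop(logq_t *q, uint32_t *rec) {
    if (q->head == q->tail) return -1;
    *rec = q->buf[q->tail % RING];
    q->tail++;
    return 0;
}
```

The `dropped` counter is itself evidence: a nonzero value tells you the flush budget or ring depth needs tuning before you trust the log as complete.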
MPN quick list (common building blocks)
- RTC: NXP PCF2129 · Microchip MCP79410 · Analog Devices MAX31341
- Supervisor / watchdog: TI TPS3839 · TI TPS3430 · Analog Devices ADM809
- FRAM / MRAM / NOR: Infineon/Cypress FM25V10 · Fujitsu MB85RS64V · Everspin MR25H40 · Winbond W25Q64JV
- Ethernet / PHY: WIZnet W5500 · TI DP83826E · Microchip KSZ8081RNA
- RS-485 / isolation: TI SN65HVD72 · Analog Devices ADM2587E · Analog Devices ADuM1250
- Secure element (optional): Microchip ATECC608B · NXP SE050
These are reference examples to support commissioning evidence (timebase, power-fail behavior, logging media, fieldbus bridging, optional signing).