Batch / Recipe Controller for Sequencing, Metering & Audit Logs
Core idea: A Batch/Recipe Controller turns a recipe into a deterministic, auditable execution: it sequences I/O and metering against a reliable timebase, then stores evidence-grade batch records that survive power and network faults, integrating with the plant over Ethernet and fieldbus.
Outcome: It makes every run repeatable and explainable: the same inputs yield the same steps, and any deviation can be traced to specific timestamps, mappings, measurements, and approved recipe versions.
H2-1. What This Controller Does and Where It Fits
A Batch / Recipe Controller is the execution core that turns a versioned recipe into a repeatable run: it orchestrates step timing, I/O sequencing, metering windows, time-stamped events, and evidence-grade batch records, then integrates the results to SCADA/HMI/historian/MES via Ethernet or fieldbus—without replacing PLC safety logic.
To avoid scope confusion, treat industrial automation as four layers with distinct responsibilities:
- Field layer (sensors, valves, actuators, meters): produces raw signals and executes physical actions.
- Control layer (PLC, remote I/O, safety interlocks): guarantees basic deterministic I/O control and failsafe defaults.
- Orchestration layer (Batch/Recipe Controller): defines and proves the execution semantics of a batch run—steps, phases, timeouts, retries, checkpoints, resumes, and operator interventions.
- Supervisory/enterprise layer (HMI/SCADA/historian/MES): visualizes, aggregates, and connects business context; it should not be the source of truth for step-by-step execution evidence.
The practical boundary is evidence and repeatability. If you need to answer “what exactly happened” at each step (with timestamps, recipe version, operator actions, and measured totals), you need a controller that owns: recipe → batch instance → step actions → event log.
Use an independent Batch/Recipe Controller when:
- Recipes change and must be traceable (versioning, approval, rollback, effective range per batch).
- Steps are complex (branching, parallel phases, resource locks, retries, conditional waits).
- Metering is contractual (totalizers, tolerances, reconciliation across devices/data sinks).
- Batch records are audited (operator actions, parameter changes, exception handling, power-loss continuity).
- Equipment spans buses (Ethernet + fieldbus devices with unified sequencing and health monitoring).
A PLC alone is typically sufficient when:
- Logic is stable, steps are few, and there is no need for recipe version governance.
- Metering is for display only (no reconciliation loop, no audit requirement).
- Power-loss recovery can restart the process without step-level resume requirements.
Design anchor: separate “hard safety” from “batch orchestration.” Keep safety interlocks and fail-safe outputs in the PLC/safety chain; let the Batch/Recipe Controller focus on deterministic sequencing and evidence-grade records.
- batch_id
- recipe_version
- event_ts (monotonic + UTC)
- meter_window_id
- step_state (start/end/abort/resume)
H2-2. Scope Guard and Acceptance Criteria
This section converts “batch control” from a concept into engineering criteria that can be signed off. Each line below is written as: metric (what to measure) + evidence (what to capture) + first fix (what to change first).
Acceptance checklist (10 lines):
- Step timing accuracy — specify max jitter and timestamp resolution; capture step_start/step_end deltas; first fix: prioritize sequencer + async logging.
- Event continuity under power loss — define max loss window (e.g., ≤N seconds); capture power-fail test log + recovery proof; first fix: journal + checkpoint cadence.
- Recipe version governance — define versioning, approval, rollback; capture who/when/why + effective batch range; first fix: bind recipe_version immutably per batch_id.
- Metering reconciliation — define allowable batch total error; capture totalizer vs meter vs historian comparison; first fix: lock sampling window + calibration version per batch.
- Comms resilience — define reconnect time + stale-data marking; capture drop/reconnect replay; first fix: watchdog + offline queue + explicit stale state.
- Interlocks and safe stop — define permissives, timeouts, safe outputs; capture interlock test cases and safe-state table; first fix: keep hard safety in PLC/safety chain.
- Audit trail completeness — require who/what/when/why/impact; capture export sample linked to batch_id; first fix: route all changes through a single audited API.
- Storage lifetime — estimate years at events/sec and write amplification; capture wear model + leveling strategy; first fix: event tiering + batch commits + verbosity caps.
- Time-base boundary — define RTC/NTP/PTP roles + sync health; capture offset/holdover/sync_state in logs; first fix: dual timestamps (monotonic + UTC) per event.
- Commissioning readiness — require simulator/replay/self-test; capture dry-run report + deterministic replay; first fix: build test-first with recorded trace playback.
Scope guard: this controller must not become “everything.” It owns sequencing + evidence, while the PLC owns failsafe I/O and SCADA/MES own visualization and business workflow. That separation is itself an acceptance criterion because it prevents untestable coupling.
H2-3. Data Model: Recipe → Batch → Step → Action
A robust batch system is built on an executable, auditable data model. The core rule is to separate the static contract (Recipe) from the runtime instance (Batch), then break execution into provable units (Step/Phase) and replayable atoms (Action).
Recipe (static template) defines the approved structure and parameters. It must be versioned and immutable per approval cycle. A recipe is not “a spreadsheet of fields”; it is a graph of steps with branching/parallelism rules, entry/exit conditions, timeout policies, and parameter constraints.
Batch (runtime instance) binds to one recipe version and records what happened in the real world. This binding is what makes audit and troubleshooting possible: a batch_id must always map to a single recipe_version.
Step / Phase (execution unit) is the smallest unit that should have a clear start/end boundary, a measurable outcome, and a deterministic transition rule. Steps can be sequential, parallel, looped, or conditional—yet still must be explainable via logged state transitions.
Action (replayable atom) is the minimum work item the sequencer executes: setting I/O states, applying setpoints, opening a metering window, verifying stability, waiting for a condition, or emitting a log marker. Actions must carry enough parameters to be replayed in simulation or trace playback.
Design binding rules (evidence-grade)
(1) Batch → Recipe version binding: batch_id always points to exactly one recipe_version. (2) Event → Action binding: every event references (batch_id, step_id, action_id). (3) Metering → Window binding: measured totals bind to meter_window_id with calibration version.
- recipe_id
- recipe_version
- batch_id
- step_id
- action_id
- meter_window_id
- event_seq
- event_ts_mono
- event_ts_utc
- calib_version
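The three binding rules can be expressed as a minimal data model. This is an illustrative Python sketch, not a prescribed schema; the names `RecipeRef`, `Batch`, `Event`, and `event_key` are assumptions:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class RecipeRef:
    recipe_id: str
    recipe_version: str          # immutable per approval cycle

@dataclass
class Batch:
    batch_id: str
    recipe: RecipeRef            # binding rule (1): exactly one recipe_version

@dataclass(frozen=True)
class Event:
    event_seq: int
    batch_id: str
    step_id: str
    action_id: str               # binding rule (2): every event names its atom
    ts_mono_ns: int
    ts_utc: str

def event_key(ev: Event) -> Tuple[str, str, str]:
    """Join key used to localize cause during audits."""
    return (ev.batch_id, ev.step_id, ev.action_id)
```

Freezing `RecipeRef` and `Event` mirrors the immutability requirement: once written, a batch's recipe binding and its events are never edited in place.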
H2-4. Sequencing Engine: Determinism, Concurrency, and Interlocks
Determinism is not a claim—it is a design property that can be measured. A sequencing engine is deterministic when the same inputs and time base produce the same step transitions, and when every transition can be explained by logged evidence (events, interlocks, timeouts, and resource arbitration).
Scan loop vs event loop: treat them as a boundary decision, not a debate. Periodic sampling is useful for filtering and stable condition checks, while event-driven dispatch is essential for bus updates, metering window edges, timeouts, alarms, and operator interventions. A practical architecture uses a single prioritized event queue where both periodic ticks and asynchronous signals land.
Concurrency is controlled by resources, not hopes. Parallel steps must declare what they own: exclusive actuators (valves/pumps), shared sensors, and rate-limited services (storage/network). The sequencer must provide resource locking and arbitration so conflicts produce predictable outcomes.
Interlocks/permissives must be evidence-grade. A raw input should not flip a batch state instantly. Use a condition tree with explicit debounce windows, optional voting windows, and a defined trip latch policy (latched until operator reset vs auto-clear).
Timeouts define safety and recoverability. Separate: soft abort (stop progression and preserve diagnostics), hard abort (force safe outputs, typically via PLC/safety chain), and the safe state definition (what every output becomes). Recovery is then a choice: resume from a step checkpoint or restart the batch, based on checkpoint granularity and metering state.
- event_queue_depth
- dispatch_latency_ms
- resource_id
- owner_step_id
- wait_reason
- interlock_id
- debounce_ms
- vote_window_ms
- timeout_ms
- retry_count
- checkpoint_id
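One common way to realize the single prioritized event queue described above is a heap keyed by (priority, sequence number), so equal-priority events dispatch in arrival order and the dispatch order is fully deterministic. A minimal Python sketch with illustrative priority constants:

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, List

# Priority classes: lower value dispatches first (assumed ordering).
PRIO_SAFETY, PRIO_TIMEOUT, PRIO_BUS, PRIO_TICK = 0, 1, 2, 3

@dataclass(order=True)
class QueuedEvent:
    priority: int
    seq: int                      # tie-breaker keeps dispatch deterministic
    payload: Any = field(compare=False)

class EventQueue:
    """Single prioritized queue where periodic ticks and async signals land."""
    def __init__(self) -> None:
        self._heap: List[QueuedEvent] = []
        self._seq = 0

    def push(self, priority: int, payload: Any) -> None:
        self._seq += 1            # monotonically increasing arrival stamp
        heapq.heappush(self._heap, QueuedEvent(priority, self._seq, payload))

    def pop(self) -> Any:
        return heapq.heappop(self._heap).payload
```

Because the tie-breaker is an arrival sequence rather than a wall-clock time, replaying the same input stream reproduces the same dispatch order, which is the measurable property determinism demands.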
H2-5. I/O Sequencing: Commissioning-Friendly Mapping & Safety Defaults
Commissioning becomes predictable when I/O is not hard-coded. A maintainable architecture separates logical intent (process meaning) from physical realization (wiring / addresses), with a versioned mapping layer that is auditable, testable, and safe by default.
I/O mapping layers: define process-facing tags such as Valve_01_Open or Pump_A_Start, then bind them to physical channels (local I/O, remote I/O, or fieldbus registers) via a single mapping contract. This contract should include polarity, scaling, and protocol binding, plus lifecycle controls (mapping_version, approvals, and effective dates).
Safety defaults (fail-safe outputs) must be explicit per output and per fault category. “Safe” is not one value—it is a policy: drop vs hold, latched vs auto-clear, and the recovery rule. Hard safety trips should be enforced by PLC/safety chains; the batch controller should orchestrate orderly stops and log the evidence of every transition.
Input semantics determine deterministic behavior: edge vs level, debounce windows, and filter principles must be part of the design. Raw inputs should not drive step transitions directly; they should pass through debounce/filter stages and emit events that can be traced in logs.
Manual override is a controlled mode, not ad-hoc output toggling. Enter/exit conditions, role-based permission, and time-bounded overrides are required. Every manual action must create an audit event that links back to batch_id/step_id where applicable.
Dry-run and forced I/O enable safe commissioning and reproducible troubleshooting. Dry-run advances the sequencer without driving physical outputs. Forced I/O must be time-limited, clearly flagged, and recorded with who/when/why, so batch evidence remains trustworthy.
Evidence fields to log
mapping_version • effective_from • approved_by • raw_state/debounced_state • debounce_ms • filter_type • forced_flag • override_role • safe_state_reason
- logical_tag
- protocol_binding
- physical_channel
- polarity
- scaling
- mapping_version
- safe_value
- recover_policy
- forced_flag
- override_role
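One row of the mapping contract for a digital point can be sketched as a record; all field names here (`active_low`, `safe_value`, `mapping_version`) are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DigitalBinding:
    """One versioned mapping-contract row for a digital point."""
    logical_tag: str          # process meaning, e.g. Valve_01_Open
    physical_channel: str     # wiring/address realization
    active_low: bool          # polarity of the wire level
    safe_value: bool          # explicit fail-safe output state
    mapping_version: str

def read_logical(binding: DigitalBinding, raw: bool) -> bool:
    """Translate the wire level into process meaning (polarity applied)."""
    return (not raw) if binding.active_low else raw

def fault_output(binding: DigitalBinding) -> bool:
    """On any fault class, drive the declared safe value, never 'last state'."""
    return binding.safe_value
```

Keeping polarity and the safe value in the contract row, rather than in sequencer logic, is what lets a wiring change ship as a new `mapping_version` instead of a code change.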
H2-6. Metering & Timing: Sampling Windows, Totalizers, and Reconciliation
Metering is the proof layer of a batch. The goal is not “a number on a screen” but a result that can be recomputed, reconciled, and audited. That requires explicit sampling windows, stable time bases, versioned calibration, and reconciliation outputs.
Instantaneous vs totalized: instantaneous measurements are noisy and time-dependent; totalized values depend on the integration method and the accuracy of timestamps. Drift typically comes from sampling cadence mismatch, filter delay, time-base jumps, or calibration mismatch—so all of those must be visible in evidence logs.
Sampling windows define where metering starts and stops. Windows should be triggered by step events and captured with monotonic timestamps for ordering, plus UTC timestamps for correlation across systems. Filtering should be minimal and traceable (type + parameter), so totals can be recomputed in offline replay.
Metering events should be explicit: start/stop, threshold reached, and stability conditions. For example, a stable condition can be defined as “variance below a tolerance for a continuous duration,” which prevents transient spikes from closing a window early.
Reconciliation compares three totals: (1) batch total computed from the window, (2) meter-device totalizer, and (3) historian total. The output should include PASS/WARN/FAIL plus a delta value and a hint category (window mismatch, calibration mismatch, missing samples, time jump).
Calibration must be versioned. Every metering window must record the calibration version (and validity), otherwise the batch record cannot prove what coefficients were used when the result was produced.
Evidence fields to log
sample_period_ms • filter_type/filter_param • window_start_ts_mono • window_end_ts_mono • totalized_value • unit • calib_version • reconcile_status • delta
- meter_window_id
- sample_period_ms
- filter_type
- window_start_event
- window_stop_event
- totalized_value
- calib_version
- meter_total
- historian_total
- reconcile_status
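Window totalization and the three-way comparison can be sketched as follows; the trapezoidal integration method and the 0.5% / 1.0% WARN/FAIL thresholds are assumptions, not mandated values:

```python
def totalize(samples):
    """Trapezoidal integration over (t_seconds, flow_per_second) samples
    taken inside one metering window (monotonic timestamps assumed)."""
    total = 0.0
    for (t0, f0), (t1, f1) in zip(samples, samples[1:]):
        total += 0.5 * (f0 + f1) * (t1 - t0)
    return total

def reconcile(batch_total, meter_total, historian_total,
              warn_pct=0.5, fail_pct=1.0):
    """Compare the three totals; returns (status, worst delta in percent)."""
    ref = meter_total if meter_total else 1.0
    worst = max(abs(batch_total - meter_total),
                abs(batch_total - historian_total)) / abs(ref) * 100.0
    if worst <= warn_pct:
        return "PASS", worst
    if worst <= fail_pct:
        return "WARN", worst
    return "FAIL", worst
```

Because the filter and integration method are explicit, the same `totalize` can be rerun in offline replay against the raw samples, which is what makes the batch total recomputable.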
H2-7. Time Base: RTC, NTP, PTP/1588 and Timestamp Strategy
A reliable batch record needs two different “time truths”: ordering time and correlation time. Event ordering must be based on a monotonic clock to survive NTP/PTP steps, while cross-system correlation uses UTC wall time plus a recorded sync state.
RTC provides a retained time baseline across power loss. It is not a precision sync tool, but it prevents wall time from resetting to an invalid value after reboot. NTP is appropriate for general synchronization where millisecond-level alignment is acceptable, but it can introduce time steps (sudden jumps). PTP/1588 targets sub-millisecond alignment and is preferred when multiple devices must agree on window edges, sequence timing, or fine-grained measurements.
Timestamp rule: every evidence-grade event should record a dual timestamp: ts_mono_ns (monotonic for strict ordering) and ts_utc (UTC for correlation), plus sync_state and offset so time quality is visible during audits. When the system enters holdover or becomes unsynchronized, the record must still be continuous and clearly marked.
Clock health should be observable and actionable: capture sync_state, offset, last_sync_ts, holdover_age, and time_step_count. If sync quality degrades, the controller can apply a deterministic downgrade policy (e.g., more conservative metering decisions, reconciliation marked WARN, or operator confirmation for critical changes).
Multi-source redundancy avoids single points of failure. A practical priority policy is: PTP_LOCKED > NTP_SYNC > HOLDOVER > RTC_ONLY. Any source switch should generate a dedicated event so later investigations can explain time-quality changes.
Evidence fields to log
ts_mono_ns • ts_utc • sync_state • offset_ms • holdover_age_s • last_sync_ts • time_step_count • time_source
- ts_mono_ns
- ts_utc
- sync_state
- offset_ms
- holdover_age_s
- last_sync_ts
- time_step_count
- time_source
- primary/backup
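A minimal sketch of the dual-timestamp rule and the source-priority policy; in a real deployment `sync_state` would come from the platform's clock-health monitor rather than being passed in:

```python
import time

# Priority policy from the text: best available source wins.
PRIORITY = ["PTP_LOCKED", "NTP_SYNC", "HOLDOVER", "RTC_ONLY"]

def best_source(available):
    """Deterministic source selection per the stated priority policy."""
    for state in PRIORITY:
        if state in available:
            return state
    return "RTC_ONLY"

def capture_event_times(sync_state):
    """Dual timestamp per event: monotonic for ordering, UTC for correlation.
    The sync_state is recorded so time quality is visible during audits."""
    return {
        "ts_mono_ns": time.monotonic_ns(),
        "ts_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "sync_state": sync_state,
    }
```

Sorting and interval math use `ts_mono_ns`; `ts_utc` is display and cross-system correlation only, so an NTP step never reorders the record.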
H2-8. Event Log & Audit Trail: Evidence-Grade Records (EBR-ready)
Evidence-grade logging is not “debug text.” It is a structured chain that can answer: who did what, when, with which version, and which batch was affected. Logs should be linkable (join keys), exportable, and tamper-evident.
Event classes should be audit-oriented: process events (step transitions, metering window edges, interlock trips), configuration changes (recipe releases, mapping_version changes, thresholds), and security-relevant actions (manual override, forced I/O, role changes, auth failures). Each class requires different mandatory fields, but all must include batch_id and trustworthy timestamps.
Tamper-evidence can be achieved with a lightweight hash chain: each event stores hash_prev and a computed hash_curr over key fields. Optional signatures can be applied to high-value milestones (e.g., batch completion or recipe publish) without signing every log line.
Capacity strategy should be tiered and deterministic: a RAM ring buffer for high-frequency debug traces, an append-only flash log for durable evidence events, and a server/historian tier for long-term retention and search. If a loss window occurs (power failure before flush), the system should record the loss interval explicitly.
Join keys are the backbone of traceability: batch_id, recipe_version, operator_id, device_id, plus event_seq and step_id/action_id to localize cause. Mapping_version and calib_version should also be captured, so I/O behavior and metering results remain provable in audits.
Export should be simple and consistent: CSV/JSON for evidence packages, and structured feeds to SCADA/historians (tags or topics) for monitoring. Export should never strip the fields that prove integrity and linkage.
Minimum evidence fields
event_seq • event_type • ts_mono_ns • ts_utc • sync_state • batch_id • recipe_version • operator_id • device_id • step_id/action_id • result • hash_prev/hash_curr
- event_seq
- event_type
- hash_prev
- hash_curr
- batch_id
- recipe_version
- operator_id
- device_id
- mapping_version
- calib_version
- export_ref
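A lightweight hash chain of the kind described can be sketched as follows; SHA-256 over canonicalized key fields is an assumption, and any stable digest would serve:

```python
import hashlib
import json

def chain_event(prev_hash: str, event: dict) -> dict:
    """Append-only chain: hash_curr covers the key fields plus hash_prev."""
    body = dict(event, hash_prev=prev_hash)
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    body["hash_curr"] = digest
    return body

def verify_chain(events) -> bool:
    """Recompute each link; editing any field breaks every later hash."""
    prev = "GENESIS"
    for ev in events:
        stripped = {k: v for k, v in ev.items()
                    if k not in ("hash_prev", "hash_curr")}
        if chain_event(prev, stripped)["hash_curr"] != ev["hash_curr"]:
            return False
        prev = ev["hash_curr"]
    return True
```

This is tamper-evidence, not tamper-proofing: the chain proves that stored events were not altered after the fact, while signatures on milestones (as noted above) bind the chain to an identity.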
H2-9. Recipe & Event Storage: Media Choice, Wear, Power-Fail, and Integrity
Storage in a batch controller is not “saving files.” It is an engineering mechanism that must support high-frequency writes, long service life, power-fail consistency, and verifiable integrity. The design starts by separating workload types: recipes are versioned and read-mostly; events are append-heavy and must remain replayable after a sudden reboot.
Workload split: a recipe is a compact, versioned artifact (approve, publish, rollback), so it should be stored in a slot-based or image-based format that supports atomic activation (A/B slot or dual-image). An event log is a continuous stream; it should be written as an append-only journal with explicit commit markers so incomplete tails can be detected and truncated after power loss.
Power-fail consistency requires a deterministic commit path. For event journal entries, a safe pattern is: write payload → write length/CRC (or hash) → write commit marker → advance pointer atomically. For recipes, write the new version into an inactive slot, validate it, then flip an active pointer as the final step. Recovery should replay from the last checkpoint plus verified journal entries.
Wear and lifetime are controlled by minimizing write amplification and avoiding hot spots. Depending on the write rate and retention needs, media choices may include FRAM/MRAM for high-endurance logging, or NOR/SD for larger capacity when paired with journal-structured writes and explicit health monitoring. When media is block-based, bad-block handling and a clear allocation strategy are required.
Integrity and rollback must be evidence-ready. Every stored recipe should include version metadata, and every event should carry a checksum and sequence number. If a rollback occurs, it should create a dedicated audit event with from/to versions and a reason, so downstream reports can explain changes in batch outcomes.
Tiering improves resilience: keep hot, queryable state locally (recent events + current batch snapshot), store durable evidence in append-only flash media, and export cold archives to a server/historian. Offline buffering should be treated as a normal mode, with explicit “catch-up” export after reconnection.
Evidence fields to log
recipe_version • approval_id • active_slot • event_seq • crc/hash • commit_marker • checkpoint_last_seq • recovery_path • bytes_written_total • storage_health_state
- recipe_version
- active_slot
- event_seq
- commit_marker
- crc/hash
- checkpoint
- recovery_event
- write_amplification
- wear_index
- bad_block_count
- storage_health_state
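The commit path and tail-truncating recovery can be sketched as an append-only byte journal; the record layout here (length, payload, CRC-32, one-byte commit marker) is illustrative:

```python
import struct
import zlib

MAGIC = 0xC0  # commit marker byte (illustrative value)

def append_entry(journal: bytearray, payload: bytes) -> None:
    """Commit path: payload -> length+CRC -> commit marker, in that order."""
    journal += struct.pack("<I", len(payload))
    journal += payload
    journal += struct.pack("<I", zlib.crc32(payload))
    journal += bytes([MAGIC])

def recover(journal: bytes):
    """Replay verified entries; drop the unfinished tail after power loss."""
    entries, pos = [], 0
    while pos + 4 <= len(journal):
        (length,) = struct.unpack_from("<I", journal, pos)
        end = pos + 4 + length + 4 + 1
        if end > len(journal):
            break                      # torn tail: truncate here
        payload = journal[pos + 4 : pos + 4 + length]
        (crc,) = struct.unpack_from("<I", journal, pos + 4 + length)
        if crc != zlib.crc32(payload) or journal[end - 1] != MAGIC:
            break                      # corrupt or uncommitted: stop here
        entries.append(payload)
        pos = end
    return entries
```

The key property is that a cut at any byte position yields either a fully verified entry or a detectable torn tail, never a silently corrupted record.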
H2-10. Ethernet & Fieldbus Integration: Mapping, Latency, and Failover
Connectivity is not “link up.” Integration must define a tag contract, a connection state machine, a latency budget, and a failover behavior. Control decisions must remain deterministic even when data becomes stale or links flap.
Protocol stacks may include Modbus TCP, EtherNet/IP, PROFINET, or OPC UA, but the engineering value is above the protocol: a versioned tag map with stable naming, units, scaling, and compatibility rules. The controller should expose tag_map_version and device profiles so commissioning changes can be audited and rolled out safely.
Connection management must be explicit: reconnect loops, watchdog timers, and stale-data marking. “Stale” is not just missing updates; it means the value must not be used for step gating or safety decisions. Each tag should have an age_ms (or last_update_ts_mono) and a quality flag (GOOD/STALE/BAD) so behavior is explainable in logs.
Latency requires plane separation: keep control-plane updates isolated from telemetry-plane reporting. Control-plane traffic is bounded and deterministic (step gating, key setpoints), while telemetry can be buffered and bursty (trends, historian feeds). Telemetry congestion must not affect control decisions.
Failover should preserve safety and evidence: link switches, dual networks, or redundancy principles (e.g., parallel paths) must trigger a link_failover_event, and quality flags should show degraded states during switching. After reconnection, the system should emit re-sync completion events and reconcile any buffered telemetry exports.
Evidence fields to log
tag_map_version • device_profile_id • quality(GOOD/STALE/BAD) • age_ms • watchdog_state • reconnect_count • link_failover_event • control_plane_latency_ms • telemetry_backlog
- tag_map_version
- device_profile_id
- quality
- age_ms
- watchdog_state
- reconnect_count
- stale_flag
- control-plane
- telemetry-plane
- link_failover_event
- re-sync
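Stale marking and deterministic gating can be sketched as a pure function of sample age; the 500 ms and 2000 ms budgets below are placeholders for whatever the tag contract declares:

```python
def tag_quality(age_ms: float, stale_ms: float = 500.0,
                bad_ms: float = 2000.0) -> str:
    """Derive the quality flag from sample age against declared budgets."""
    if age_ms <= stale_ms:
        return "GOOD"
    if age_ms <= bad_ms:
        return "STALE"
    return "BAD"

def allow_step_gating(quality: str) -> bool:
    """Deterministic rule: only GOOD values may gate step transitions."""
    return quality == "GOOD"
```

Because quality is derived, not asserted, a flapping link cannot leave a tag marked GOOD: the age keeps growing until the flag degrades, and every gating refusal is explainable from the logged `age_ms`.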
H2-11. Validation & Commissioning Playbook (Field-Ready Tests)
This playbook turns system claims into repeatable evidence. Each test is written as a 3-line SOP: Goal (what must be proven), Method (how to reproduce), and Evidence (which fields/waveforms/exports to capture). Pass criteria are explicit so commissioning results are auditable.
Evidence rule
Use ts_mono_ns for interval/jitter calculations, keep ts_utc + sync_state for correlation, and ensure every test produces an exportable artifact (CSV/JSON) linked by batch_id and event_seq.
T1 — Step Timing (prove jitter upper bound)
Goal: For the same recipe and inputs, key step edges show bounded jitter (e.g., P99 ≤ X ms).
Method: Run the same batch N times (e.g., 30). Compare step enter/exit, I/O set, and metering window edges.
Evidence: Export a timing report keyed by batch_id/step_id with ts_mono_ns, delta_ms, P95/P99, and time_step_count.
Pass: P99 jitter stays within the declared limit; no missing event_seq segments.
- ts_mono_ns
- event_seq
- delta_ms
- P99
- time_step_count
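The P99 figure in the timing report can be computed with the nearest-rank method over per-run `ts_mono_ns` deltas; a sketch, where the limit is whatever the declaration specifies:

```python
import math

def p99(deltas_ms):
    """Nearest-rank P99 over per-run step-edge deltas (ts_mono_ns based)."""
    ranked = sorted(deltas_ms)
    rank = math.ceil(0.99 * len(ranked))   # nearest-rank method
    return ranked[rank - 1]

def timing_verdict(deltas_ms, limit_ms):
    """Pass/fail against the declared jitter limit, with the measured value."""
    jitter = p99(deltas_ms)
    return ("PASS" if jitter <= limit_ms else "FAIL", jitter)
```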
T2 — Interlocks (debounce + voting window)
Goal: Bouncy inputs do not cause false trips; sustained violations trip within T.
Method: Inject controlled chatter (10–50 ms) and then sustained faults (> window). Repeat across multiple sensors/inputs.
Evidence: Capture interlock decision events with debounce_ms, vote_window_ms, gate state, and the resulting safe_state transition.
Pass: No false stops during chatter; sustained faults always produce a deterministic trip record.
- debounce_ms
- vote_window_ms
- interlock_gate
- safe_state
- result_code
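The debounce-plus-voting decision can be sketched as a fraction-of-window rule; the 0.8 vote threshold is an assumed policy value:

```python
def interlock_trip(samples, debounce_ms, vote_threshold=0.8):
    """samples: list of (t_ms, violated: bool) observations.
    Trip only if the violation persists: the violated fraction inside the
    trailing debounce window must meet the vote threshold."""
    if not samples:
        return False
    t_end = samples[-1][0]
    window = [v for (t, v) in samples if t >= t_end - debounce_ms]
    return sum(window) / len(window) >= vote_threshold
```

Chatter alternating at 10–50 ms never reaches the threshold, while a sustained violation saturates the window and trips deterministically, which is exactly the T2 pass criterion.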
T3 — Metering (integration error + reconciliation)
Goal: Known pulse/flow profiles produce totals within the declared error bound (e.g., ≤ X%).
Method: Feed step, ramp, and pulsed profiles. Validate both instantaneous and totalized values under noise/low-flow cases.
Evidence: window_start/stop events, totalizer snapshots, calib_version, and a reconciliation export (batch_total vs meter_total vs historian_total).
Pass: Error stays within bounds across profiles; calibration version is traceable per batch.
- window_start
- window_stop
- totalizer
- calib_version
- reconcile
T4 — Power-Fail (50 random cuts, evidence continuity)
Goal: After random power loss, the controller recovers to a consistent state and preserves evidence (or records loss explicitly).
Method: Cut power at varied phases (writing events, switching steps, exporting). Repeat at least 50 times with different cut points.
Evidence: recovery_event with recovery_path, checkpoint_last_seq, journal tail verification (truncate point), and loss_interval_s if any.
Pass: System resumes without silent gaps; any gap is declared and bounded.
- recovery_event
- checkpoint_last_seq
- truncate_tail
- loss_interval_s
- hash_chain
T5 — Storage Wear (write budget → lifetime boundary)
Goal: Under the expected write rate, projected life meets the target (years) with observable health telemetry.
Method: Estimate events/s and bytes/event; compute daily writes; optionally run accelerated logging for a fixed duration to validate counters.
Evidence: bytes_written_total, wear_index (or erase_count_max), bad_block_count, storage_health_state, plus the assumptions used in the write budget.
Pass: Budgeted lifetime meets target; health indicators remain stable and monotonic.
- bytes_written_total
- wear_index
- erase_count_max
- bad_block_count
- health_state
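The write-budget arithmetic behind the lifetime projection fits in a few lines; every parameter here, including the 2x write amplification, is an assumption to be replaced by measured values:

```python
def lifetime_years(events_per_s: float, bytes_per_event: float,
                   capacity_bytes: float, endurance_cycles: float,
                   write_amplification: float = 2.0) -> float:
    """Projected media life: total tolerable bytes / daily written bytes.
    capacity * endurance is the total byte budget the media can absorb."""
    daily_writes = (events_per_s * bytes_per_event
                    * 86_400 * write_amplification)
    total_budget = capacity_bytes * endurance_cycles
    return total_budget / daily_writes / 365.0
```

For example, 100 events/s at 64 bytes/event against an 8 MiB part rated for 100k cycles projects roughly two years, which is the kind of number the pass criterion compares against the target.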
T6 — Network Drop (stale marking + deterministic gating)
Goal: When links flap, stale data is never used for step gating; reconnection is evidence-grade and explainable.
Method: Unplug/replug, inject loss/jitter, and force repeated reconnects. Validate both control-plane and telemetry-plane behavior.
Evidence: quality (GOOD/STALE/BAD), age_ms, watchdog_state, reconnect_count, link_failover_event, and resync_complete_event.
Pass: Control-plane actions are blocked or degraded under STALE; transitions are fully logged.
- quality
- age_ms
- watchdog_state
- reconnect_count
- resync
T7 — Recipe Change Control (approval + rollback + impact)
Goal: Recipe changes are approved, versioned, and reversible; the impact on batches is provable.
Method: Execute a full cycle: draft → approve → publish → run → rollback → run again. Include a mapping change case.
Evidence: recipe_version, approval_id, effective_from, rollback_reason, and batch_id ↔ recipe_version linkage in the audit export.
Pass: No “mystery versions”; every batch can be tied to an approved recipe version.
- recipe_version
- approval_id
- effective_from
- rollback_reason
- impact_scope
T8 — Audit Export (hash-chain verify + reconciliation)
Goal: Export packages are consistent, verifiable, and reconcile totals across controller/meter/historian.
Method: Export CSV/JSON packages and randomly sample N events to verify hash links; run reconciliation for selected batches.
Evidence: export_ref, hash_prev/hash_curr, hash_chain_verify_result, plus reconciliation artifacts keyed by batch_id.
Pass: Chain verification passes and reconciliation differences are within declared tolerances.
- export_ref
- hash_prev
- hash_curr
- verify_result
- reconcile
Example MPNs for commissioning setups (reference BOM snippets): The parts below are commonly used building blocks to implement or validate the mechanisms above (timebase, storage, power-fail, network I/O, tamper-evident logs). Select equivalents per platform and availability.
Time base & timestamping
- NXP PCF2129
- Microchip MCP79410
- Analog Devices MAX31341
Use RTCs for power-loss retention; pair with NTP/PTP on the Ethernet side as needed.
Power-fail robustness (supervisor / watchdog)
- Texas Instruments TPS3839
- Texas Instruments TPS3430
- Analog Devices ADM809
Useful for T4: repeatable resets and brownout behavior while validating recovery events.
Nonvolatile storage (recipe + journal)
- Infineon/Cypress FM25V10
- Fujitsu MB85RS64V
- Everspin MR25H40
- Winbond W25Q64JV
- Macronix MX25L128
FRAM/MRAM fit write-heavy journals; SPI NOR is common for recipe images + journal/checkpoint structures.
Ethernet / field I/O building blocks
- WIZnet W5500
- Texas Instruments DP83826E
- Microchip KSZ8081RNA
- Texas Instruments SN65HVD72
- Analog Devices ADM2587E
- Analog Devices ADuM1250
Ethernet controller/PHY + RS-485/isolated RS-485 are common for Modbus/fieldbus bridges and T6 tests.
Evidence integrity helpers
- Microchip ATECC608B
- NXP SE050
Secure elements can support key storage for signing milestone records (optional, aligns with T8).
H2-12. FAQs (Troubleshooting, Evidence-First)
Each answer follows a fixed field-ready structure: one-sentence conclusion + two evidence checks + one first fix. Every question points back to the chapters that define the proof fields.
Batch stops unexpectedly — interlock glitch or timeout policy? (→ H2-4 / H2-5)
Conclusion: Most “unexpected stops” are either a noisy interlock gate or an overly aggressive timeout policy; the faster the stop and the less context in the log, the more likely an interlock decision path is missing.
Evidence: Check interlock events for debounce_ms/vote_window_ms and interlock_gate_state; then verify timeout_reason, step_id, and retry_count for the same window.
First fix: Enable decision-summary logging for interlocks and downgrade non-critical timeouts to soft-abort before hard-abort.
- debounce_ms
- vote_window_ms
- timeout_reason
- retry_count
Same recipe, different yield — metering reconciliation or timing drift? (→ H2-6 / H2-7)
Conclusion: Yield variation under the same recipe is usually a reconciliation gap (window/calibration) or a timebase issue that shifts windows and gates; if totals disagree across sources, metering is first suspect.
Evidence: Compare batch_total vs meter_total vs historian_total and confirm calib_version; then validate window edges with ts_mono_ns and check sync_state/time_step_count.
First fix: Lock window start/stop conditions and bind calib_version into every batch record before tuning time sync.
- calib_version
- window_start
- ts_mono_ns
- sync_state
Events look out of order — wall clock step or missing monotonic stamps? (→ H2-7 / H2-8)
Conclusion: “Out-of-order” events are typically caused by wall-clock steps (NTP/PTP corrections) or by logging without a stable monotonic timeline; relying on UTC alone makes ordering fragile during sync changes.
Evidence: Confirm every event includes both ts_mono_ns and ts_utc plus sync_state; then check event_seq continuity and look for time_step_count spikes around the suspected interval.
First fix: Sort/compute by monotonic time, display by UTC, and enforce dual-timestamp + sequence as mandatory log fields.
- ts_mono_ns
- ts_utc
- event_seq
- time_step_count
Recipe update broke only some lines — tag mapping versioning or compatibility? (→ H2-10 / H2-3)
Conclusion: Partial breakage after a recipe update usually means tag-map versions are inconsistent across lines, or a compatibility rule (units/scaling/types) was violated by a new parameter schema.
Evidence: Compare tag_map_version and device_profile_id across the affected lines; then verify recipe parameter constraints (units, ranges, types) and whether the controller reports schema/version mismatch events.
First fix: Freeze execution to a single known-good tag map version and roll forward using an explicit compatibility layer for renamed or retyped tags.
- tag_map_version
- device_profile_id
- schema_version
- unit/scaling
After power loss, batch resumes wrong step — checkpoint design or commit ordering? (→ H2-9 / H2-4)
Conclusion: Wrong-step resume is almost always a checkpoint definition problem (missing state variables) or an unsafe commit order that lets pointers advance before payload is fully durable.
Evidence: Inspect checkpoint_last_seq, saved step_id/phase_state, and the recovery_path event; then validate journal tail integrity (CRC/commit marker) and whether truncation occurred before replay.
First fix: Expand checkpoint to include minimal state-machine invariants and make pointer advancement the final atomic action in the commit path.
- checkpoint_last_seq
- phase_state
- commit_marker
- recovery_path
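The commit-ordering rule is easiest to see as code: payload first, integrity field second, commit marker third, pointer advance last, and recovery trusts a slot only when marker and CRC agree. This is a minimal in-memory sketch standing in for flash writes; `toy_crc` and the slot layout are illustrative, not a real journal format.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal sketch of the commit rule: the checkpoint pointer advances
 * only after the payload AND its commit marker are durable. The
 * "journal" here is an in-memory array standing in for flash slots. */
#define SLOTS 8
typedef struct { uint32_t step_id; uint32_t crc; uint8_t committed; } slot_t;

typedef struct {
    slot_t   journal[SLOTS];
    uint32_t checkpoint_last_seq;   /* pointer, advanced LAST */
} store_t;

static uint32_t toy_crc(uint32_t v) { return v ^ 0xA5A5A5A5u; }

static void commit_step(store_t *s, uint32_t seq, uint32_t step_id) {
    slot_t *sl = &s->journal[seq % SLOTS];
    sl->step_id   = step_id;           /* 1. payload */
    sl->crc       = toy_crc(step_id);  /* 2. integrity field */
    sl->committed = 1;                 /* 3. commit marker (atomic word) */
    s->checkpoint_last_seq = seq;      /* 4. pointer advance, final action */
}

/* Recovery: trust the slot only if marker and CRC agree; otherwise the
 * write was torn and replay must fall back to an earlier checkpoint. */
static int recover_step(const store_t *s, uint32_t *step_id_out) {
    const slot_t *sl = &s->journal[s->checkpoint_last_seq % SLOTS];
    if (!sl->committed || sl->crc != toy_crc(sl->step_id))
        return -1;
    *step_id_out = sl->step_id;
    return 0;
}
```

On real media each numbered step would be followed by a flush/barrier; the ordering, not the storage type, is what prevents the wrong-step resume.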
Totalizer doesn’t match flowmeter — sampling window or calibration version mismatch? (→ H2-6)
Conclusion: A totalizer mismatch is usually caused by window boundary rules (start/stop/stability) or a calibration-version mismatch where the batch used a different scaling than the meter/historian expects.
Evidence: Verify window_start/window_stop triggers and stability criteria, then confirm calib_version and effective_from were attached to the batch record and applied consistently end-to-end.
First fix: Make calib_version mandatory per batch and add a reconciliation report that flags boundary-condition anomalies.
- window_start
- window_stop
- calib_version
- effective_from
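The window-boundary and calibration rules combine naturally into one structure: the totalizer only accumulates between explicit start/stop triggers, and the calibration version plus scaling factor are captured at window start so the batch record carries exactly what was applied. The names (`totalizer_t`, `calib_k`) are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a metering window: samples accumulate only between an
 * explicit start and stop, scaled by the calibration version attached
 * to THIS batch rather than whatever is globally current. */
typedef struct {
    int      open;
    double   total;          /* engineering units, e.g. liters */
    uint32_t calib_version;  /* recorded with the batch record */
    double   calib_k;        /* pulses-to-volume factor for that version */
} totalizer_t;

static void window_start(totalizer_t *t, uint32_t calib_version, double calib_k) {
    t->open = 1;
    t->total = 0.0;
    t->calib_version = calib_version;   /* frozen for this window */
    t->calib_k = calib_k;
}

static void add_pulses(totalizer_t *t, uint32_t pulses) {
    if (!t->open) return;               /* samples outside window ignored */
    t->total += pulses * t->calib_k;
}

static double window_stop(totalizer_t *t) { t->open = 0; return t->total; }
```

Reconciliation against the flowmeter then only has to compare two well-defined quantities: the window total with its recorded `calib_version`, and the meter's own totalizer over the same boundary events.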
Network drop causes unsafe output — stale-data handling or fail-safe defaults? (→ H2-10 / H2-5)
Conclusion: Unsafe output during network loss points to stale-data being used for gating, or to fail-safe defaults that do not drive outputs to a safe state on link/IO faults.
Evidence: Check whether quality=STALE still allows control writes and whether age_ms exceeded the declared budget; then confirm the output safe_state_table for “link down / input invalid / device absent” cases.
First fix: Hard-block control writes under STALE and enforce fail-safe outputs for every fault class before re-enabling automatic sequencing.
- quality
- age_ms
- safe_state_table
- watchdog_state
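The hard-block rule plus the safe-state table can be sketched in a few lines: a control write is permitted only when the sample is GOOD and within its freshness budget, and every fault class has an explicit entry in the table so no case falls through to "hold last value". The enum and table contents are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: a control write is permitted only when the input value is
 * GOOD and fresh; every fault class maps to an explicit safe state. */
typedef enum { Q_GOOD, Q_STALE, Q_BAD } quality_t;
typedef enum { F_LINK_DOWN, F_INPUT_INVALID, F_DEVICE_ABSENT, F_COUNT } fault_t;

/* One explicit entry per fault class; here all de-energize (0). */
static const uint8_t safe_state_table[F_COUNT] = {
    /* F_LINK_DOWN     */ 0,
    /* F_INPUT_INVALID */ 0,
    /* F_DEVICE_ABSENT */ 0,
};

typedef struct { quality_t quality; uint32_t age_ms; } sample_t;

/* Returns 1 if automatic sequencing may use this sample for a control
 * write; STALE/BAD quality or over-age data is hard-blocked. */
static int write_allowed(const sample_t *s, uint32_t age_budget_ms) {
    return s->quality == Q_GOOD && s->age_ms <= age_budget_ms;
}

static uint8_t output_on_fault(fault_t f) { return safe_state_table[f]; }
```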
Audit trail fails an inspection — missing change reason or weak linkage to batch_id? (→ H2-8)
Conclusion: Audit failures usually come from missing change justification/approval fields, or from broken linkage where configuration events cannot be tied to the specific batch_id and recipe_version they influenced.
Evidence: Confirm change_reason, approval_id, and actor identity are present for config changes; then sample exports and ensure batch_id and recipe_version are attached to every relevant event in the chain.
First fix: Make change-reason and batch linkage mandatory fields and reject writes that cannot be attributed and scoped.
- change_reason
- approval_id
- batch_id
- recipe_version
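"Reject writes that cannot be attributed and scoped" is a validation rule at the log-write boundary. A minimal sketch, assuming a flat event struct with the fields listed above; a real audit record would carry more, but the accept/reject gate looks the same.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: the log layer rejects any configuration-change event that
 * cannot be attributed (actor, approval) and scoped (batch, recipe). */
typedef struct {
    const char *change_reason;
    const char *approval_id;
    const char *actor;
    const char *batch_id;
    const char *recipe_version;
} audit_event_t;

static int nonempty(const char *s) { return s != NULL && s[0] != '\0'; }

/* Returns 0 on accept, -1 on reject; a rejected write never reaches
 * the evidence-grade journal, so gaps are impossible by construction. */
static int audit_write(const audit_event_t *e) {
    if (!nonempty(e->change_reason) || !nonempty(e->approval_id) ||
        !nonempty(e->actor) || !nonempty(e->batch_id) ||
        !nonempty(e->recipe_version))
        return -1;
    /* ...append to the journal here... */
    return 0;
}
```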
Storage wears out early — write amplification or event verbosity too high? (→ H2-9 / H2-8)
Conclusion: Early wear is typically caused by write amplification (inefficient journal/checkpoint layout) or by logging too many high-frequency events that could be aggregated without losing evidence value.
Evidence: Review bytes_written_total, wear_index/erase_count_max, and the event rate by category; then inspect checkpoint cadence and whether small updates trigger full-block rewrites.
First fix: Reduce verbosity/merge repetitive events and increase checkpoint efficiency before changing media.
- bytes_written_total
- wear_index
- events_per_second
- checkpoint_period
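"Merge repetitive events" can be done before records ever reach flash: adjacent records with identical category and value collapse into one record with a repeat count, so evidence value is preserved while bytes written drop. This is a hypothetical sketch of that coalescing step, not a real journal API.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: repetitive high-frequency events (same category, same value)
 * merge into one record with a repeat count, cutting write volume
 * without losing evidence value. */
typedef struct { uint16_t category; int32_t value; uint32_t repeat; } rec_t;

/* Coalesce adjacent duplicates in place; returns the new record count. */
static size_t coalesce(rec_t *r, size_t n) {
    if (n == 0) return 0;
    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (r[i].category == r[out].category && r[i].value == r[out].value)
            r[out].repeat += r[i].repeat;      /* merge duplicate */
        else
            r[++out] = r[i];
    }
    return out + 1;
}
```

Pairing this with a longer checkpoint period usually cuts write amplification far more cheaply than swapping media.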
Time sync keeps flapping — NTP source quality or PTP grandmaster holdover? (→ H2-7)
Conclusion: Sync flapping is usually upstream time-source instability (NTP quality) or poor holdover/switchover behavior when PTP grandmaster changes; frequent time steps corrupt correlation even if monotonic ordering is intact.
Evidence: Check sync_state, offset_ms, holdover_state, and time_step_count; then correlate step events with upstream source changes and verify monotonic timestamps remain continuous.
First fix: Tighten source priority/thresholds and avoid wall-clock steps that exceed the declared event-ordering tolerance.
- sync_state
- offset_ms
- holdover_state
- time_step_count
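The "declared event-ordering tolerance" can be enforced at the point where sync corrections are applied: small offsets are absorbed, while any correction beyond the tolerance increments `time_step_count` so flapping becomes visible in the record. A minimal sketch under that assumption; the struct and threshold handling are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch: the controller tolerates small wall-clock corrections, but
 * counts any step beyond the declared ordering tolerance so source
 * flapping shows up as time_step_count spikes in the evidence. */
typedef struct {
    int64_t  utc_ns;            /* current wall clock */
    uint32_t time_step_count;   /* steps beyond tolerance */
    int64_t  step_tolerance_ns; /* declared event-ordering tolerance */
} wallclock_t;

static void apply_sync_offset(wallclock_t *c, int64_t offset_ns) {
    if (llabs(offset_ns) > c->step_tolerance_ns)
        c->time_step_count++;    /* evidence: correlation may be broken */
    c->utc_ns += offset_ns;      /* the monotonic timeline is unaffected */
}
```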
Manual override creates confusion — permission model or inadequate logging? (→ H2-5 / H2-8)
Conclusion: Override confusion typically comes from weak role/permission boundaries or from incomplete before/after logging that makes it impossible to reconstruct who changed what, when, and for how long.
Evidence: Verify operator_role, override_grant, and expiry are enforced; then confirm override events capture before/after values, duration, and affected outputs, linked to batch_id and operator identity.
First fix: Require time-bounded override grants and log the full override envelope (grant → active → revoke) as evidence-grade events.
- operator_role
- override_grant
- override_duration
- batch_id
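A time-bounded grant is a small state machine: the grant is role-checked on entry, carries an expiry, and revokes itself when the expiry passes. This sketch assumes a supervisor-only permission model purely for illustration; the roles and field names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the override envelope: a grant is role-checked, expires
 * automatically, and each transition (grant/active/revoke) would be
 * logged as an evidence-grade event. */
typedef enum { ROLE_OPERATOR, ROLE_SUPERVISOR } role_t;
typedef struct {
    int      active;
    role_t   granted_role;
    uint64_t expires_at_ms;
} override_t;

static int override_grant(override_t *o, role_t role, uint64_t now_ms,
                          uint64_t duration_ms) {
    if (role != ROLE_SUPERVISOR) return -1;   /* permission boundary */
    o->active = 1;
    o->granted_role = role;
    o->expires_at_ms = now_ms + duration_ms;  /* time-bounded, always */
    return 0;
}

/* Checked on every scan: an expired grant revokes itself, so an
 * override can never silently persist past its window. */
static int override_active(override_t *o, uint64_t now_ms) {
    if (o->active && now_ms >= o->expires_at_ms)
        o->active = 0;                        /* auto-revoke, log event */
    return o->active;
}
```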
Latency spikes during heavy logging — priority inversion or storage blocking? (→ H2-4 / H2-9)
Conclusion: Logging-induced latency spikes usually indicate priority inversion (the high-priority sequencer waiting on a resource held by lower-priority logging work) or synchronous storage commits that stall the main loop under bursty event rates.
Evidence: Compare loop_latency_ms/scan_time_ms with storage_commit_time_ms and queue depth; then confirm whether log writes are synchronous or buffered and whether backpressure throttles event generation safely.
First fix: Move logging to an async queue with bounded flush policy, and protect the sequencer with priority and budget enforcement.
- loop_latency_ms
- storage_commit_time_ms
- queue_depth
- backpressure
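The "async queue with bounded flush policy" fix amounts to a bounded ring between the sequencer and the storage task: the sequencer never blocks, and when the ring is full, backpressure drops and counts the record instead of stalling the control loop. A single-producer/single-consumer sketch with illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: the sequencer enqueues log records into a bounded ring and
 * never blocks on storage; when the ring is full, backpressure drops
 * the record and counts it rather than stalling the control loop. */
#define RING 4
typedef struct {
    uint32_t buf[RING];
    uint32_t head, tail;      /* head: next write, tail: next flush */
    uint32_t dropped;         /* backpressure evidence counter */
} logq_t;

/* Called from the high-priority sequencer; O(1), never waits. */
static int logq_push(logq_t *q, uint32_t rec) {
    if (q->head - q->tail >= RING) { q->dropped++; return -1; }
    q->buf[q->head % RING] = rec;
    q->head++;
    return 0;
}

/* Called from the low-priority flush task, never from the sequencer;
 * this is where the synchronous storage commit actually happens. */
static int logq_pop(logq_t *q, uint32_t *rec) {
    if (q->head == q->tail) return -1;
    *rec = q->buf[q->tail % RING];
    q->tail++;
    return 0;
}
```

The `dropped` counter is itself evidence: a nonzero value tells you the flush budget or ring depth needs tuning before you trust the log as complete.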
MPN quick list (common building blocks)
- RTC: NXP PCF2129 · Microchip MCP79410 · Analog Devices MAX31341
- Supervisor / watchdog: TI TPS3839 · TI TPS3430 · Analog Devices ADM809
- FRAM / MRAM / NOR: Infineon/Cypress FM25V10 · Fujitsu MB85RS64V · Everspin MR25H40 · Winbond W25Q64JV
- Ethernet / PHY: WIZnet W5500 · TI DP83826E · Microchip KSZ8081RNA
- RS-485 / isolation: TI SN65HVD72 · Analog Devices ADM2587E · Analog Devices ADuM1250
- Secure element (optional): Microchip ATECC608B · NXP SE050
These are reference examples to support commissioning evidence (timebase, power-fail behavior, logging media, fieldbus bridging, optional signing).