
Batch / Recipe Controller for Sequencing, Metering & Audit Logs


Core idea: A Batch/Recipe Controller turns a recipe into a deterministic, auditable execution: it sequences I/O and meters against a reliable time base, then stores evidence-grade batch records that survive power and network faults, integrating with other systems over Ethernet and fieldbus.

Outcome: It makes every run repeatable and explainable: the same inputs yield the same steps, and any deviation can be traced to specific timestamps, mappings, measurements, and approved recipe versions.

H2-1. What This Controller Does and Where It Fits

A Batch / Recipe Controller is the execution core that turns a versioned recipe into a repeatable run: it orchestrates step timing, I/O sequencing, metering windows, time-stamped events, and evidence-grade batch records, then integrates the results to SCADA/HMI/historian/MES via Ethernet or fieldbus—without replacing PLC safety logic.

To avoid scope confusion, treat industrial automation as four layers with distinct responsibilities:

  • Field layer (sensors, valves, actuators, meters): produces raw signals and executes physical actions.
  • Control layer (PLC, remote I/O, safety interlocks): guarantees basic deterministic I/O control and failsafe defaults.
  • Orchestration layer (Batch/Recipe Controller): defines and proves the execution semantics of a batch run—steps, phases, timeouts, retries, checkpoints, resumes, and operator interventions.
  • Supervisory/enterprise layer (HMI/SCADA/historian/MES): visualizes, aggregates, and connects business context; it should not be the source of truth for step-by-step execution evidence.

The practical boundary is evidence and repeatability. If you need to answer “what exactly happened” at each step (with timestamps, recipe version, operator actions, and measured totals), you need a controller that owns: recipe → batch instance → step actions → event log.

Use an independent Batch/Recipe Controller when:

  • Recipes change and must be traceable (versioning, approval, rollback, effective range per batch).
  • Steps are complex (branching, parallel phases, resource locks, retries, conditional waits).
  • Metering is contractual (totalizers, tolerances, reconciliation across devices/data sinks).
  • Batch records are audited (operator actions, parameter changes, exception handling, power-loss continuity).
  • Equipment spans buses (Ethernet + fieldbus devices with unified sequencing and health monitoring).

A PLC alone is typically sufficient when:

  • Logic is stable, steps are few, and there is no need for recipe version governance.
  • Metering is for display only (no reconciliation loop, no audit requirement).
  • Power-loss recovery can restart the process without step-level resume requirements.

Design anchor: separate “hard safety” from “batch orchestration.” Keep safety interlocks and fail-safe outputs in the PLC/safety chain; let the Batch/Recipe Controller focus on deterministic sequencing and evidence-grade records.

Key evidence fields per event:

  • batch_id
  • recipe_version
  • event_ts (monotonic + UTC)
  • meter_window_id
  • step_state (start/end/abort/resume)
Figure 1. System placement: PLC owns basic I/O control and safety defaults; the Batch/Recipe Controller owns recipe execution semantics and evidence-grade batch records; SCADA/MES consume results via interfaces.

H2-2. Scope Guard and Acceptance Criteria

This section converts “batch control” from a concept into sign-off-able engineering criteria. Each line below is written as: metric (what to measure) + evidence (what to capture) + first fix (what to change first).

Acceptance checklist (10 lines):

  • Step timing accuracy — specify max jitter and timestamp resolution; capture step_start/step_end deltas; first fix: prioritize sequencer + async logging.
  • Event continuity under power loss — define max loss window (e.g., ≤N seconds); capture power-fail test log + recovery proof; first fix: journal + checkpoint cadence.
  • Recipe version governance — define versioning, approval, rollback; capture who/when/why + effective batch range; first fix: bind recipe_version immutably per batch_id.
  • Metering reconciliation — define allowable batch total error; capture totalizer vs meter vs historian comparison; first fix: lock sampling window + calibration version per batch.
  • Comms resilience — define reconnect time + stale-data marking; capture drop/reconnect replay; first fix: watchdog + offline queue + explicit stale state.
  • Interlocks and safe stop — define permissives, timeouts, safe outputs; capture interlock test cases and safe-state table; first fix: keep hard safety in PLC/safety chain.
  • Audit trail completeness — require who/what/when/why/impact; capture export sample linked to batch_id; first fix: route all changes through a single audited API.
  • Storage lifetime — estimate years at events/sec and write amplification; capture wear model + leveling strategy; first fix: event tiering + batch commits + verbosity caps.
  • Time-base boundary — define RTC/NTP/PTP roles + sync health; capture offset/holdover/sync_state in logs; first fix: dual timestamps (monotonic + UTC) per event.
  • Commissioning readiness — require simulator/replay/self-test; capture dry-run report + deterministic replay; first fix: build test-first with recorded trace playback.

Scope guard: this controller must not become “everything.” It owns sequencing + evidence, while PLC owns failsafe I/O, and SCADA/MES own visualization and business workflow. That separation is itself an acceptance criterion because it prevents untestable coupling.

Figure 2. Acceptance checklist grouped by responsibility. Use it to scope requirements, compare vendors, and sign off deterministic behavior, metering correctness, audit evidence, and recovery guarantees.

H2-3. Data Model: Recipe → Batch → Step → Action

A robust batch system is built on an executable, auditable data model. The core rule is to separate the static contract (Recipe) from the runtime instance (Batch), then break execution into provable units (Step/Phase) and replayable atoms (Action).

Recipe (static template) defines the approved structure and parameters. It must be versioned and immutable per approval cycle. A recipe is not “a spreadsheet of fields”; it is a graph of steps with branching/parallelism rules, entry/exit conditions, timeout policies, and parameter constraints.

Batch (runtime instance) binds to one recipe version and records what happened in the real world. This binding is what makes audit and troubleshooting possible: a batch_id must always map to a single recipe_version.

Step / Phase (execution unit) is the smallest unit that should have a clear start/end boundary, a measurable outcome, and a deterministic transition rule. Steps can be sequential, parallel, looped, or conditional—yet still must be explainable via logged state transitions.

Action (replayable atom) is the minimum work item the sequencer executes: setting I/O states, applying setpoints, opening a metering window, verifying stability, waiting for a condition, or emitting a log marker. Actions must carry enough parameters to be replayed in simulation or trace playback.

Design binding rules (evidence-grade)

(1) Batch → Recipe version binding: batch_id always points to exactly one recipe_version. (2) Event → Action binding: every event references (batch_id, step_id, action_id). (3) Metering → Window binding: measured totals bind to meter_window_id with calibration version.
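The three binding rules can be sketched as a minimal data model. This is an illustrative Python sketch, not a reference schema; the class names and the `log` helper are assumptions, while the field names follow the evidence fields listed in this section.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Recipe:
    recipe_id: str
    recipe_version: str          # immutable per approval cycle

@dataclass(frozen=True)
class Event:
    event_seq: int
    batch_id: str
    step_id: str
    action_id: str
    ts_mono_ns: int
    meter_window_id: Optional[str] = None   # rule (3): metering binds to a window

@dataclass
class Batch:
    batch_id: str
    recipe: Recipe               # rule (1): exactly one recipe_version per batch
    events: list = field(default_factory=list)
    next_seq: int = 1

    def log(self, step_id, action_id, ts_mono_ns, meter_window_id=None):
        # rule (2): every event references (batch_id, step_id, action_id)
        ev = Event(self.next_seq, self.batch_id, step_id, action_id,
                   ts_mono_ns, meter_window_id)
        self.events.append(ev)
        self.next_seq += 1
        return ev
```

Freezing `Recipe` and `Event` mirrors the immutability requirement: once an approval cycle or an event is recorded, it cannot be mutated in place.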

  • recipe_id
  • recipe_version
  • batch_id
  • step_id
  • action_id
  • meter_window_id
  • event_seq
  • event_ts_mono
  • event_ts_utc
  • calib_version
Figure 3. ISA-88-style structure (left) with a batch runtime instance (right). The model enforces version binding and event/action linkability to support replay, audit, and recovery.

H2-4. Sequencing Engine: Determinism, Concurrency, and Interlocks

Determinism is not a claim—it is a design property that can be measured. A sequencing engine is deterministic when the same inputs and time base produce the same step transitions, and when every transition can be explained by logged evidence (events, interlocks, timeouts, and resource arbitration).

Scan loop vs event loop: treat them as a boundary decision, not a debate. Periodic sampling is useful for filtering and stable condition checks, while event-driven dispatch is essential for bus updates, metering window edges, timeouts, alarms, and operator interventions. A practical architecture uses a single prioritized event queue where both periodic ticks and asynchronous signals land.
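The single prioritized queue above can be sketched in a few lines, assuming integer priorities where lower numbers dispatch first; the event names in the usage example are illustrative.

```python
import heapq
import itertools

class EventQueue:
    """Single prioritized queue where periodic ticks and async signals land
    together. Lower priority number dispatches first; ties break by arrival
    order, so dispatch is deterministic for identical input sequences."""
    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # monotonically increasing tiebreaker

    def post(self, priority, event):
        heapq.heappush(self._heap, (priority, next(self._arrival), event))

    def dispatch(self):
        if not self._heap:
            return None
        _, _, event = heapq.heappop(self._heap)
        return event

q = EventQueue()
q.post(5, "periodic_tick")
q.post(1, "interlock_trip")       # async signal outranks the periodic tick
q.post(3, "meter_window_edge")
order = [q.dispatch() for _ in range(3)]
# order == ["interlock_trip", "meter_window_edge", "periodic_tick"]
```

The arrival counter matters: without it, two events at the same priority would compare by payload, which makes dispatch order depend on data instead of arrival time.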

Concurrency is controlled by resources, not hopes. Parallel steps must declare what they own: exclusive actuators (valves/pumps), shared sensors, and rate-limited services (storage/network). The sequencer must provide resource locking and arbitration so conflicts produce predictable outcomes.

Interlocks/permissives must be evidence-grade. A raw input should not flip a batch state instantly. Use a condition tree with explicit debounce windows, optional voting windows, and a defined trip latch policy (latched until operator reset vs auto-clear).
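The debounce window and trip-latch policy can be sketched as a small gate class. `DebouncedInterlock` and its parameters are illustrative assumptions, not a prescribed implementation; voting windows would layer on top of the same pattern.

```python
class DebouncedInterlock:
    """Condition gate: the raw input must hold a violation continuously for
    debounce_ms before the gate trips; once tripped it latches until reset()
    (the 'latched until operator reset' policy)."""
    def __init__(self, debounce_ms):
        self.debounce_ms = debounce_ms
        self._violation_since = None
        self.tripped = False

    def update(self, raw_violation, now_ms):
        if self.tripped:
            return True                      # latched: raw input can no longer clear it
        if raw_violation:
            if self._violation_since is None:
                self._violation_since = now_ms
            elif now_ms - self._violation_since >= self.debounce_ms:
                self.tripped = True          # deterministic, loggable trip point
        else:
            self._violation_since = None     # chatter resets the debounce window
        return self.tripped

    def reset(self):
        self.tripped = False
        self._violation_since = None
```

Because the trip decision depends only on `now_ms` deltas and the configured window, the same input trace always produces the same trip record in replay.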

Timeouts define safety and recoverability. Separate: soft abort (stop progression and preserve diagnostics), hard abort (force safe outputs, typically via PLC/safety chain), and the safe state definition (what every output becomes). Recovery is then a choice: resume from a step checkpoint or restart the batch, based on checkpoint granularity and metering state.

  • event_queue_depth
  • dispatch_latency_ms
  • resource_id
  • owner_step_id
  • wait_reason
  • interlock_id
  • debounce_ms
  • vote_window_ms
  • timeout_ms
  • retry_count
  • checkpoint_id
Figure 4. Deterministic sequencing is achieved by explicit evidence gates: metering windows, interlock debounce/voting, timeout and retry policies, and logged transitions.

H2-5. I/O Sequencing: Commissioning-Friendly Mapping & Safety Defaults

Commissioning becomes predictable when I/O is not hard-coded. A maintainable architecture separates logical intent (process meaning) from physical realization (wiring / addresses), with a versioned mapping layer that is auditable, testable, and safe by default.

I/O mapping layers: define process-facing tags such as Valve_01_Open or Pump_A_Start, then bind them to physical channels (local I/O, remote I/O, or fieldbus registers) via a single mapping contract. This contract should include polarity, scaling, and protocol binding, plus lifecycle controls (mapping_version, approvals, and effective dates).
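A hedged sketch of such a mapping contract in Python; the tag names, channel strings, and version label are placeholders, and a real contract would add direction, units, approvals, and effective dates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IoBinding:
    logical_tag: str
    physical_channel: str    # e.g. a remote-I/O slot/channel or fieldbus register
    invert: bool = False     # polarity (discrete signals)
    gain: float = 1.0        # scaling (analog signals)
    offset: float = 0.0

@dataclass(frozen=True)
class IoMap:
    mapping_version: str     # lifecycle-controlled: approved, versioned, auditable
    bindings: tuple

    def binding(self, tag):
        return next(b for b in self.bindings if b.logical_tag == tag)

    def read(self, tag, raw):
        """Translate a raw physical value into the logical tag's value."""
        b = self.binding(tag)
        if isinstance(raw, bool):
            return raw != b.invert           # apply polarity
        return raw * b.gain + b.offset       # apply scaling

iomap = IoMap("map-v7", (
    IoBinding("Valve_01_Open", "rio1/slot2/ch0", invert=True),
    IoBinding("Tank_Level_mm", "bus/reg/40021", gain=0.5, offset=-10.0),
))
```

Freezing the map per `mapping_version` is the point: sequencer logic references only logical tags, so rewiring a channel is a reviewed mapping change, not a code change.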

Safety defaults (fail-safe outputs) must be explicit per output and per fault category. “Safe” is not one value—it is a policy: drop vs hold, latched vs auto-clear, and the recovery rule. Hard safety trips should be enforced by PLC/safety chains; the batch controller should orchestrate orderly stops and log the evidence of every transition.

Input semantics determine deterministic behavior: edge vs level, debounce windows, and filter principles must be part of the design. Raw inputs should not drive step transitions directly; they should pass through debounce/filter stages and emit events that can be traced in logs.

Manual override is a controlled mode, not ad-hoc output toggling. Enter/exit conditions, role-based permission, and time-bounded overrides are required. Every manual action must create an audit event that links back to batch_id/step_id where applicable.

Dry-run and forced I/O enable safe commissioning and reproducible troubleshooting. Dry-run advances the sequencer without driving physical outputs. Forced I/O must be time-limited, clearly flagged, and recorded with who/when/why, so batch evidence remains trustworthy.

Evidence fields to log

mapping_version • effective_from • approved_by • raw_state/debounced_state • debounce_ms • filter_type • forced_flag • override_role • safe_state_reason

  • logical_tag
  • protocol_binding
  • physical_channel
  • polarity
  • scaling
  • mapping_version
  • safe_value
  • recover_policy
  • forced_flag
  • override_role
Figure 5. Commissioning-friendly I/O mapping separates logical tags from physical channels through a versioned binding layer. Safe outputs are explicit by fault category to ensure deterministic recovery behavior.

H2-6. Metering & Timing: Sampling Windows, Totalizers, and Reconciliation

Metering is the proof layer of a batch. The goal is not “a number on a screen” but a result that can be recomputed, reconciled, and audited. That requires explicit sampling windows, stable time bases, versioned calibration, and reconciliation outputs.

Instantaneous vs totalized: instantaneous measurements are noisy and time-dependent; totalized values depend on the integration method and the accuracy of timestamps. Drift typically comes from sampling cadence mismatch, filter delay, time-base jumps, or calibration mismatch—so all of those must be visible in evidence logs.

Sampling windows define where metering starts and stops. Windows should be triggered by step events and captured with monotonic timestamps for ordering, plus UTC timestamps for correlation across systems. Filtering should be minimal and traceable (type + parameter), so totals can be recomputed in offline replay.

Metering events should be explicit: start/stop, threshold reached, and stability conditions. For example, a stable condition can be defined as “variance below a tolerance for a continuous duration,” which prevents transient spikes from closing a window early.

Reconciliation compares three totals: (1) batch total computed from the window, (2) meter-device totalizer, and (3) historian total. The output should include PASS/WARN/FAIL plus a delta value and a hint category (window mismatch, calibration mismatch, missing samples, time jump).
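The window totalizer and the three-way comparison can be sketched as follows. The 0.5%/1.0% WARN/FAIL thresholds and the choice of the meter totalizer as reference are illustrative assumptions, not recommended values.

```python
def totalize(flow_samples, dt_s):
    """totalized_value = sum(flow * dt) over the metering window,
    assuming a fixed sample period (rectangular integration)."""
    return sum(f * dt_s for f in flow_samples)

def reconcile(batch_total, meter_total, historian_total,
              warn_pct=0.5, fail_pct=1.0):
    """Compare the three totals against the meter-device totalizer.
    delta is the worst pairwise deviation, expressed as a percentage."""
    deltas = [abs(batch_total - meter_total), abs(historian_total - meter_total)]
    delta_pct = 100.0 * max(deltas) / meter_total
    if delta_pct <= warn_pct:
        status = "PASS"
    elif delta_pct <= fail_pct:
        status = "WARN"
    else:
        status = "FAIL"
    return status, round(delta_pct, 3)
```

A production version would also emit the hint category (window mismatch, calibration mismatch, missing samples, time jump) alongside the delta, as described above.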

Calibration must be versioned. Every metering window must record the calibration version (and validity), otherwise the batch record cannot prove what coefficients were used when the result was produced.

Evidence fields to log

sample_period_ms • filter_type/filter_param • window_start_ts_mono • window_end_ts_mono • totalized_value • unit • calib_version • reconcile_status • delta

  • meter_window_id
  • sample_period_ms
  • filter_type
  • window_start_event
  • window_stop_event
  • totalized_value
  • calib_version
  • meter_total
  • historian_total
  • reconcile_status
Figure 6. Metering is defined by explicit window edges and traceable time bases. Totalized results are reconciled across batch calculations, device totalizers, and historian totals with a clear PASS/WARN/FAIL output.

H2-7. Time Base: RTC, NTP, PTP/1588 and Timestamp Strategy

A reliable batch record needs two different “time truths”: ordering time and correlation time. Event ordering must be based on a monotonic clock to survive NTP/PTP steps, while cross-system correlation uses UTC wall time plus a recorded sync state.

RTC provides a retained time baseline across power loss. It is not a precision sync tool, but it prevents wall time from resetting to an invalid value after reboot. NTP is appropriate for general synchronization where millisecond-level alignment is acceptable, but it can introduce time steps (sudden jumps). PTP/1588 targets sub-millisecond alignment and is preferred when multiple devices must agree on window edges, sequence timing, or fine-grained measurements.

Timestamp rule: every evidence-grade event should record a dual timestamp: ts_mono_ns (monotonic for strict ordering) and ts_utc (UTC for correlation), plus sync_state and offset so time quality is visible during audits. When the system enters holdover or becomes unsynchronized, the record must still be continuous and clearly marked.
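The dual-timestamp rule maps directly onto Python's two clocks: `time.monotonic_ns()` for ordering and `time.time()` for UTC correlation. `stamp_event` and its default sync fields are illustrative; a real controller would pull `sync_state` and `offset_ms` from its clock manager.

```python
import time

def stamp_event(event_type, sync_state="NTP_SYNC", offset_ms=0.0):
    """Dual timestamp per evidence event: monotonic time for strict ordering,
    UTC wall time plus recorded sync quality for cross-system correlation."""
    return {
        "event_type": event_type,
        "ts_mono_ns": time.monotonic_ns(),  # immune to NTP/PTP time steps
        "ts_utc": time.time(),              # wall time; may jump on resync
        "sync_state": sync_state,           # e.g. PTP_LOCKED / NTP_SYNC / HOLDOVER
        "offset_ms": offset_ms,             # measured clock offset at stamp time
    }

a = stamp_event("step_start")
b = stamp_event("step_end", sync_state="HOLDOVER", offset_ms=1.5)
assert b["ts_mono_ns"] >= a["ts_mono_ns"]   # ordering holds even across time steps
```

The asymmetry is the point: `ts_utc` may go backwards after a resync, but sorting by `ts_mono_ns` always reproduces the true event order.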

Clock health should be observable and actionable: capture sync_state, offset, last_sync_ts, holdover_age, and time_step_count. If sync quality degrades, the controller can apply a deterministic downgrade policy (e.g., more conservative metering decisions, reconciliation marked WARN, or operator confirmation for critical changes).

Multi-source redundancy avoids single points of failure. A practical priority policy is: PTP_LOCKED > NTP_SYNC > HOLDOVER > RTC_ONLY. Any source switch should generate a dedicated event so later investigations can explain time-quality changes.

Evidence fields to log

ts_mono_ns • ts_utc • sync_state • offset_ms • holdover_age_s • last_sync_ts • time_step_count • time_source

  • ts_mono_ns
  • ts_utc
  • sync_state
  • offset_ms
  • holdover_age_s
  • last_sync_ts
  • time_step_count
  • time_source
  • primary/backup
Figure 7. Dual timestamps keep event ordering stable (monotonic) while supporting cross-system correlation (UTC). Sync quality is recorded per event to preserve auditability during holdover or unsync states.

H2-8. Event Log & Audit Trail: Evidence-Grade Records (EBR-ready)

Evidence-grade logging is not “debug text.” It is a structured chain that can answer: who did what, when, with which version, and which batch was affected. Logs should be linkable (join keys), exportable, and tamper-evident.

Event classes should be audit-oriented: process events (step transitions, metering window edges, interlock trips), configuration changes (recipe releases, mapping_version changes, thresholds), and security-relevant actions (manual override, forced I/O, role changes, auth failures). Each class requires different mandatory fields, but all must include batch_id and trustworthy timestamps.

Tamper-evidence can be achieved with a lightweight hash chain: each event stores hash_prev and a computed hash_curr over key fields. Optional signatures can be applied to high-value milestones (e.g., batch completion or recipe publish) without signing every log line.
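A minimal hash-chain sketch using SHA-256. The field names follow this section, but the canonicalization choice (JSON with sorted keys) is an illustrative assumption; any scheme works as long as both writer and verifier use it identically.

```python
import hashlib
import json

def append_event(chain, fields):
    """Tamper-evident append: hash_curr covers the key fields plus hash_prev,
    so editing any earlier event invalidates every later hash."""
    hash_prev = chain[-1]["hash_curr"] if chain else "0" * 64  # genesis anchor
    payload = json.dumps(fields, sort_keys=True)               # canonical form
    hash_curr = hashlib.sha256((hash_prev + payload).encode()).hexdigest()
    chain.append({**fields, "hash_prev": hash_prev, "hash_curr": hash_curr})

def verify(chain):
    """Recompute the chain from the genesis anchor; any mismatch means
    an event was altered, inserted, or removed."""
    prev = "0" * 64
    for ev in chain:
        body = {k: v for k, v in ev.items() if k not in ("hash_prev", "hash_curr")}
        payload = json.dumps(body, sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if ev["hash_prev"] != prev or ev["hash_curr"] != expected:
            return False
        prev = ev["hash_curr"]
    return True
```

High-value milestones (batch completion, recipe publish) can then sign the current `hash_curr` alone, which covers the entire chain up to that point without signing every line.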

Capacity strategy should be tiered and deterministic: a RAM ring buffer for high-frequency debug traces, an append-only flash log for durable evidence events, and a server/historian tier for long-term retention and search. If a loss window occurs (power failure before flush), the system should record the loss interval explicitly.

Join keys are the backbone of traceability: batch_id, recipe_version, operator_id, device_id, plus event_seq and step_id/action_id to localize cause. Mapping_version and calib_version should also be captured, so I/O behavior and metering results remain provable in audits.

Export should be simple and consistent: CSV/JSON for evidence packages, and structured feeds to SCADA/historians (tags or topics) for monitoring. Export should never strip the fields that prove integrity and linkage.

Minimum evidence fields

event_seq • event_type • ts_mono_ns • ts_utc • sync_state • batch_id • recipe_version • operator_id • device_id • step_id/action_id • result • hash_prev/hash_curr

  • event_seq
  • event_type
  • hash_prev
  • hash_curr
  • batch_id
  • recipe_version
  • operator_id
  • device_id
  • mapping_version
  • calib_version
  • export_ref
Figure 8. Evidence-grade records combine structured join keys with tamper-evident chaining. Tiered storage keeps durability and exportability without losing audit-critical fields.

H2-9. Recipe & Event Storage: Media Choice, Wear, Power-Fail, and Integrity

Storage in a batch controller is not “saving files.” It is an engineering mechanism that must support high-frequency writes, long service life, power-fail consistency, and verifiable integrity. The design starts by separating workload types: recipes are versioned and read-mostly; events are append-heavy and must remain replayable after a sudden reboot.

Workload split: a recipe is a compact, versioned artifact (approve, publish, rollback), so it should be stored in a slot-based or image-based format that supports atomic activation (A/B slot or dual-image). An event log is a continuous stream; it should be written as an append-only journal with explicit commit markers so incomplete tails can be detected and truncated after power loss.

Power-fail consistency requires a deterministic commit path. For event journal entries, a safe pattern is: write payload → write length/CRC (or hash) → write commit marker → advance pointer atomically. For recipes, write the new version into an inactive slot, validate it, then flip an active pointer as the final step. Recovery should replay from the last checkpoint plus verified journal entries.
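A sketch of the journal commit path and boot-time recovery, modeled on the sequence above. One assumption differs from the prose ordering: the length header is written first so the journal can be scanned forward; the commit marker is still written last, which is what makes an incomplete tail detectable. The marker value is illustrative.

```python
import struct
import zlib

COMMIT = 0xC0FFEE42  # illustrative commit-marker constant

def append_entry(journal, payload):
    """Commit path: length header, payload, CRC, then the commit marker last."""
    journal += struct.pack("<I", len(payload))          # length header
    journal += payload                                  # payload body
    journal += struct.pack("<I", zlib.crc32(payload))   # integrity check
    journal += struct.pack("<I", COMMIT)                # commit marker written last

def recover(journal):
    """Boot-time replay: accept entries up to the first bad CRC or missing
    commit marker, then report the safe truncation point for the tail."""
    entries, pos = [], 0
    while pos + 4 <= len(journal):
        (n,) = struct.unpack_from("<I", journal, pos)
        end = pos + 4 + n + 8
        if end > len(journal):
            break                                       # torn write: incomplete tail
        payload = journal[pos + 4 : pos + 4 + n]
        crc, marker = struct.unpack_from("<II", journal, pos + 4 + n)
        if crc != zlib.crc32(payload) or marker != COMMIT:
            break                                       # corrupt tail: truncate here
        entries.append(bytes(payload))
        pos = end
    return entries, pos                                 # pos = safe truncation point
```

On real flash the final pointer advance must be atomic at the media level; this sketch only demonstrates why a torn or corrupt tail can be detected and cut without losing committed entries.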

Wear and lifetime are controlled by minimizing write amplification and avoiding hot spots. Depending on the write rate and retention needs, media choices may include FRAM/MRAM for high-endurance logging, or NOR/SD for larger capacity when paired with journal-structured writes and explicit health monitoring. When media is block-based, bad-block handling and a clear allocation strategy are required.
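The wear budget reduces to one formula: total writable bytes (capacity times endurance cycles) divided by daily write volume including write amplification. All numbers below are illustrative; use the media datasheet's endurance figure for a real estimate.

```python
def lifetime_years(capacity_bytes, endurance_cycles, bytes_per_event,
                   events_per_sec, write_amplification=2.0):
    """Rough endurance budget: assumes ideal wear leveling spreads writes
    evenly, so total writable bytes = capacity * endurance_cycles."""
    total_writable = capacity_bytes * endurance_cycles
    daily_writes = (bytes_per_event * events_per_sec * 86_400
                    * write_amplification)
    return total_writable / daily_writes / 365.0

# Example: 64 MiB NOR, 100k erase cycles, 64-byte events at 20 events/s, WA = 2
years = lifetime_years(64 * 2**20, 100_000, 64, 20, 2.0)
# -> roughly 83 years under these (illustrative) assumptions
```

The same formula run with a high-verbosity logging rate shows why verbosity caps are listed as a first fix: lifetime scales inversely with events per second.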

Integrity and rollback must be evidence-ready. Every stored recipe should include version metadata, and every event should carry a checksum and sequence number. If a rollback occurs, it should create a dedicated audit event with from/to versions and a reason, so downstream reports can explain changes in batch outcomes.

Tiering improves resilience: keep hot, queryable state locally (recent events + current batch snapshot), store durable evidence in append-only flash media, and export cold archives to a server/historian. Offline buffering should be treated as a normal mode, with explicit “catch-up” export after reconnection.

Evidence fields to log

recipe_version • approval_id • active_slot • event_seq • crc/hash • commit_marker • checkpoint_last_seq • recovery_path • bytes_written_total • storage_health_state

  • recipe_version
  • active_slot
  • event_seq
  • commit_marker
  • crc/hash
  • checkpoint
  • recovery_event
  • write_amplification
  • wear_index
  • bad_block_count
  • storage_health_state
Figure 9. Storage is tiered by workload: versioned recipes (A/B activation), append-only event journal, and checkpointed runtime state. A deterministic commit sequence enables safe recovery after sudden power loss.

H2-10. Ethernet & Fieldbus Integration: Mapping, Latency, and Failover

Connectivity is not “link up.” Integration must define a tag contract, a connection state machine, a latency budget, and a failover behavior. Control decisions must remain deterministic even when data becomes stale or links flap.

Protocol stacks may include Modbus TCP, EtherNet/IP, PROFINET, or OPC UA, but the engineering value is above the protocol: a versioned tag map with stable naming, units, scaling, and compatibility rules. The controller should expose tag_map_version and device profiles so commissioning changes can be audited and rolled out safely.

Connection management must be explicit: reconnect loops, watchdog timers, and stale-data marking. “Stale” is not just missing updates; it means the value must not be used for step gating or safety decisions. Each tag should have an age_ms (or last_update_ts_mono) and a quality flag (GOOD/STALE/BAD) so behavior is explainable in logs.
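Age-based quality marking and the deterministic gating rule can be sketched as follows; the 500 ms and 5 s thresholds are illustrative and would normally be configured per tag.

```python
def tag_quality(last_update_ts_mono_ms, now_ms,
                stale_after_ms=500, bad_after_ms=5000):
    """Derive a quality flag from sample age on the monotonic clock.
    Returns (quality, age_ms) so both are available for logging."""
    age_ms = now_ms - last_update_ts_mono_ms
    if age_ms <= stale_after_ms:
        return "GOOD", age_ms
    if age_ms <= bad_after_ms:
        return "STALE", age_ms
    return "BAD", age_ms

def may_gate_step(quality):
    # Deterministic rule: only GOOD data participates in step gating;
    # STALE/BAD values must never flip a batch state.
    return quality == "GOOD"
```

Logging `age_ms` alongside the flag is what makes gating decisions explainable after the fact: a step that refused to advance can be traced to a specific stale tag and its age.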

Latency requires plane separation: keep control-plane updates isolated from telemetry-plane reporting. Control-plane traffic is bounded and deterministic (step gating, key setpoints), while telemetry can be buffered and bursty (trends, historian feeds). Telemetry congestion must not affect control decisions.

Failover should preserve safety and evidence: link switches, dual networks, or redundancy principles (e.g., parallel paths) must trigger a link_failover_event, and quality flags should show degraded states during switching. After reconnection, the system should emit re-sync completion events and reconcile any buffered telemetry exports.

Evidence fields to log

tag_map_version • device_profile_id • quality(GOOD/STALE/BAD) • age_ms • watchdog_state • reconnect_count • link_failover_event • control_plane_latency_ms • telemetry_backlog

  • tag_map_version
  • device_profile_id
  • quality
  • age_ms
  • watchdog_state
  • reconnect_count
  • stale_flag
  • control-plane
  • telemetry-plane
  • link_failover_event
  • re-sync
Figure 10. Protocol adapters feed a versioned tag map and a connection state machine. Control-plane and telemetry-plane are separated, stale marking is explicit, and failover is evidence-grade through dedicated events.

H2-11. Validation & Commissioning Playbook (Field-Ready Tests)

This playbook turns system claims into repeatable evidence. Each test is written as a 3-line SOP: Goal (what must be proven), Method (how to reproduce), and Evidence (which fields/waveforms/exports to capture). Pass criteria are explicit so commissioning results are auditable.

Evidence rule

Use ts_mono_ns for interval/jitter calculations, keep ts_utc + sync_state for correlation, and ensure every test produces an exportable artifact (CSV/JSON) linked by batch_id and event_seq.

T1 — Step Timing (prove jitter upper bound)

Goal: For the same recipe and inputs, key step edges show bounded jitter (e.g., P99 ≤ X ms).

Method: Run the same batch N times (e.g., 30). Compare step enter/exit, I/O set, and metering window edges.

Evidence: Export a timing report keyed by batch_id/step_id with ts_mono_ns, delta_ms, P95/P99, and time_step_count.

Pass: P99 jitter stays within the declared limit; no missing event_seq segments.

  • ts_mono_ns
  • event_seq
  • delta_ms
  • P99
  • time_step_count
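The T1 timing report reduces to percentile math over the per-run deltas. This sketch uses the nearest-rank percentile method and assumes a non-empty sample set; the report keys follow the evidence fields above.

```python
def jitter_report(deltas_ms, declared_p99_ms):
    """P95/P99 over repeated runs of the same step edge (nearest-rank method:
    the smallest sample covering at least p% of the sorted data)."""
    s = sorted(deltas_ms)
    def pct(p):
        k = -(-len(s) * p // 100) - 1     # ceil(n * p / 100) - 1, floor-div trick
        return s[max(0, k)]
    p95, p99 = pct(95), pct(99)
    return {"P95": p95, "P99": p99, "pass": p99 <= declared_p99_ms}
```

With N = 30 runs the nearest-rank P99 is simply the worst observed delta, which is a reason to prefer larger N (or to declare the bound against P95) when the jitter budget is tight.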

T2 — Interlocks (debounce + voting window)

Goal: Bouncy inputs do not cause false trips; sustained violations trip within T.

Method: Inject controlled chatter (10–50 ms) and then sustained faults (> window). Repeat across multiple sensors/inputs.

Evidence: Capture interlock decision events with debounce_ms, vote_window_ms, gate state, and the resulting safe_state transition.

Pass: No false stops during chatter; sustained faults always produce a deterministic trip record.

  • debounce_ms
  • vote_window_ms
  • interlock_gate
  • safe_state
  • result_code

T3 — Metering (integration error + reconciliation)

Goal: Known pulse/flow profiles produce totals within the declared error bound (e.g., ≤ X%).

Method: Feed step, ramp, and pulsed profiles. Validate both instantaneous and totalized values under noise/low-flow cases.

Evidence: window_start/stop events, totalizer snapshots, calib_version, and a reconciliation export (batch_total vs meter_total vs historian_total).

Pass: Error stays within bounds across profiles; calibration version is traceable per batch.

  • window_start
  • window_stop
  • totalizer
  • calib_version
  • reconcile
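The T3 reconciliation export can be checked with a small helper. Treating meter_total as the reference and using a 0.5% default tolerance are assumptions for illustration; use the declared error bound in practice:

```python
def reconcile(batch_total, meter_total, historian_total, tol_pct=0.5):
    """Compare totals from the three sources against the error bound.

    Returns per-pair percentage differences and a pass/fail flag; the
    caller keys the export by batch_id and attaches calib_version.
    """
    ref = meter_total  # meter as reference is an assumption, not a rule
    diffs = {
        "batch_vs_meter_pct": abs(batch_total - ref) / ref * 100,
        "historian_vs_meter_pct": abs(historian_total - ref) / ref * 100,
    }
    return {"diffs": diffs,
            "pass": all(d <= tol_pct for d in diffs.values())}
```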

T4 — Power-Fail (50 random cuts, evidence continuity)

Goal: After random power loss, the controller recovers to a consistent state and preserves evidence (or records loss explicitly).

Method: Cut power at varied phases (writing events, switching steps, exporting). Repeat at least 50 times with different cut points.

Evidence: recovery_event with recovery_path, checkpoint_last_seq, journal tail verification (truncate point), and loss_interval_s if any.

Pass: System resumes without silent gaps; any gap is declared and bounded.

  • recovery_event
  • checkpoint_last_seq
  • truncate_tail
  • loss_interval_s
  • hash_chain
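The T4 journal-tail rule can be sketched as a replay that truncates at the first torn record. The [seq, length, CRC32][payload] framing here is illustrative; real journals add commit markers and a hash chain:

```python
import struct
import zlib

RECORD_HDR = struct.Struct("<IHI")  # event_seq, payload length, CRC32

def recover_journal(buf):
    """Replay a journal image after power loss.

    A torn tail (short read or CRC mismatch) stops the replay; the
    truncate point is returned explicitly so the gap is declared,
    never silently skipped.
    """
    events, off = [], 0
    while off + RECORD_HDR.size <= len(buf):
        seq, length, crc = RECORD_HDR.unpack_from(buf, off)
        payload = buf[off + RECORD_HDR.size: off + RECORD_HDR.size + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn write: truncate here
        events.append((seq, payload))
        off += RECORD_HDR.size + length
    return {"events": events,
            "checkpoint_last_seq": events[-1][0] if events else None,
            "truncate_tail_at": off}
```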

T5 — Storage Wear (write budget → lifetime boundary)

Goal: Under the expected write rate, projected life meets the target (years) with observable health telemetry.

Method: Estimate events/s and bytes/event; compute daily writes; optionally run accelerated logging for a fixed duration to validate counters.

Evidence: bytes_written_total, wear_index (or erase_count_max), bad_block_count, storage_health_state, plus the assumptions used in the write budget.

Pass: Budgeted lifetime meets target; health indicators remain stable and monotonic.

  • bytes_written_total
  • wear_index
  • erase_count_max
  • bad_block_count
  • health_state
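The T5 write budget is plain arithmetic; this sketch uses a deliberately simplified endurance model (capacity x P/E cycles with a fixed write-amplification factor), so real media still need vendor wear counters:

```python
def write_budget(events_per_s, bytes_per_event, capacity_bytes,
                 pe_cycles, write_amplification=2.0):
    """Project media lifetime from the logging write budget.

    daily writes = event rate * event size * 86400 s, inflated by the
    write-amplification factor; endurance = capacity * P/E cycles.
    """
    daily_writes = events_per_s * bytes_per_event * 86_400 * write_amplification
    endurance_bytes = capacity_bytes * pe_cycles
    return {"daily_write_bytes": daily_writes,
            "projected_life_years": endurance_bytes / daily_writes / 365}
```

Record the inputs alongside the result: the assumptions are part of the evidence.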

T6 — Network Drop (stale marking + deterministic gating)

Goal: When links flap, stale data is never used for step gating; reconnection is evidence-grade and explainable.

Method: Unplug/replug, inject loss/jitter, and force repeated reconnects. Validate both control-plane and telemetry-plane behavior.

Evidence: quality (GOOD/STALE/BAD), age_ms, watchdog_state, reconnect_count, link_failover_event, and resync_complete_event.

Pass: Control-plane actions are blocked or degraded under STALE; transitions are fully logged.

  • quality
  • age_ms
  • watchdog_state
  • reconnect_count
  • resync
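The deterministic gate itself is small. This sketch assumes quality is one of GOOD/STALE/BAD and that the caller logs every blocked write as evidence:

```python
def gate_control_write(quality, age_ms, age_budget_ms):
    """Decide whether a control-plane write may use this input.

    STALE/BAD quality, or an age beyond the declared budget, blocks the
    write deterministically; the returned reason string is logged.
    """
    if quality != "GOOD":
        return False, f"blocked: quality={quality}"
    if age_ms > age_budget_ms:
        return False, f"blocked: age_ms={age_ms} > budget={age_budget_ms}"
    return True, "ok"
```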

T7 — Recipe Change Control (approval + rollback + impact)

Goal: Recipe changes are approved, versioned, and reversible; the impact on batches is provable.

Method: Execute a full cycle: draft → approve → publish → run → rollback → run again. Include a mapping change case.

Evidence: recipe_version, approval_id, effective_from, rollback_reason, and batch_id ↔ recipe_version linkage in the audit export.

Pass: No “mystery versions”; every batch can be tied to an approved recipe version.

  • recipe_version
  • approval_id
  • effective_from
  • rollback_reason
  • impact_scope
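The "no mystery versions" check reduces to a join between batch records and approved recipe versions. The record layouts here are illustrative:

```python
def check_version_linkage(batches, approvals):
    """Flag batches whose recipe_version has no approved record.

    `approvals` entries carry recipe_version + approval_id; any batch
    pointing at an unapproved version is a 'mystery version' finding.
    """
    approved = {a["recipe_version"] for a in approvals if a.get("approval_id")}
    return [b["batch_id"] for b in batches
            if b["recipe_version"] not in approved]
```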

T8 — Audit Export (hash-chain verify + reconciliation)

Goal: Export packages are consistent, verifiable, and reconcile totals across controller/meter/historian.

Method: Export CSV/JSON packages and randomly sample N events to verify hash links; run reconciliation for selected batches.

Evidence: export_ref, hash_prev/hash_curr, hash_chain_verify_result, plus reconciliation artifacts keyed by batch_id.

Pass: Chain verification passes and reconciliation differences are within declared tolerances.

  • export_ref
  • hash_prev
  • hash_curr
  • verify_result
  • reconcile
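Chain verification can be sketched as follows, assuming hash_curr = SHA-256(hash_prev + payload) with hex-encoded hashes. The exact concatenation rule is an assumption; use the one documented for the export format:

```python
import hashlib

def verify_chain(events):
    """Verify a hash-chained event export.

    Each record's hash_prev must equal the previous record's hash_curr,
    and hash_curr must recompute from (hash_prev, payload). Returns
    (ok, index_of_first_bad_record).
    """
    prev = events[0]["hash_prev"]
    for i, ev in enumerate(events):
        if ev["hash_prev"] != prev:
            return False, i  # broken link
        expected = hashlib.sha256(
            bytes.fromhex(ev["hash_prev"]) + ev["payload"]).hexdigest()
        if ev["hash_curr"] != expected:
            return False, i  # tampered or corrupted payload
        prev = ev["hash_curr"]
    return True, None
```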

Example MPNs for commissioning setups (reference BOM snippets): The parts below are commonly used building blocks to implement or validate the mechanisms above (timebase, storage, power-fail, network I/O, tamper-evident logs). Select equivalents per platform and availability.

Time base & timestamping

  • NXP PCF2129
  • Microchip MCP79410
  • Analog Devices MAX31341

Use RTCs for power-loss retention; pair with NTP/PTP on the Ethernet side as needed.

Power-fail robustness (supervisor / watchdog)

  • Texas Instruments TPS3839
  • Texas Instruments TPS3430
  • Analog Devices ADM809

Useful for T4: repeatable resets and brownout behavior while validating recovery events.

Nonvolatile storage (recipe + journal)

  • Infineon/Cypress FM25V10
  • Fujitsu MB85RS64V
  • Everspin MR25H40
  • Winbond W25Q64JV
  • Macronix MX25L128

FRAM/MRAM fit write-heavy journals; SPI NOR is common for recipe images + journal/checkpoint structures.

Ethernet / field I/O building blocks

  • WIZnet W5500
  • Texas Instruments DP83826E
  • Microchip KSZ8081RNA
  • Texas Instruments SN65HVD72
  • Analog Devices ADM2587E
  • Analog Devices ADuM1250

Ethernet controller/PHY + RS-485/isolated RS-485 are common for Modbus/fieldbus bridges and T6 tests.

Evidence integrity helpers

  • Microchip ATECC608B
  • NXP SE050

Secure elements can support key storage for signing milestone records (optional, aligns with T8).

[Figure 11 matrix — Validation Matrix (Commissioning Evidence): rows T1–T8 × columns Timing / Interlock / Metering / Storage / Network / Audit; cells hold the evidence tokens to capture (ts_mono, P99, age_ms, event_seq, debounce, vote_win, result, window, totalizer, calib_ver, reconcile, checkpoint, truncate, recovery, bytes_wr, wear_idx, bad_blk, quality, failover, active_slot, approval, rollback, hash_chain, export_ref). Legend: ts_mono=monotonic timestamp • age_ms=data age • quality=GOOD/STALE/BAD • hash_chain=prev/curr hash link]
Figure 11. A commissioning matrix that maps each test (T1–T8) to the domains it must validate, and the minimum evidence tokens to capture for audit-ready results.

H2-12. FAQs (Troubleshooting, Evidence-First)

Each answer follows a fixed field-ready structure: one-sentence conclusion + two evidence checks + one first fix. Every question points back to the chapters that define the proof fields.

Batch stops unexpectedly — interlock glitch or timeout policy? (→ H2-4 / H2-5)

Conclusion: Most “unexpected stops” are either a noisy interlock gate or an overly aggressive timeout policy; the faster the stop and the less context in the log, the more likely an interlock decision path is missing.

Evidence: Check interlock events for debounce_ms/vote_window_ms and interlock_gate_state; then verify timeout_reason, step_id, and retry_count for the same window.

First fix: Enable decision-summary logging for interlocks and downgrade non-critical timeouts to soft-abort before hard-abort.

  • debounce_ms
  • vote_window_ms
  • timeout_reason
  • retry_count
Back to: H2-4 · H2-5
Same recipe, different yield — metering reconciliation or timing drift? (→ H2-6 / H2-7)

Conclusion: Yield variation under the same recipe is usually a reconciliation gap (window/calibration) or a timebase issue that shifts windows and gates; if totals disagree across sources, metering is first suspect.

Evidence: Compare batch_total vs meter_total vs historian_total and confirm calib_version; then validate window edges with ts_mono_ns and check sync_state/time_step_count.

First fix: Lock window start/stop conditions and bind calib_version into every batch record before tuning time sync.

  • calib_version
  • window_start
  • ts_mono_ns
  • sync_state
Back to: H2-6 · H2-7
Events look out of order — wall clock step or missing monotonic stamps? (→ H2-7 / H2-8)

Conclusion: “Out-of-order” events are typically caused by wall-clock steps (NTP/PTP corrections) or by logging without a stable monotonic timeline; relying on UTC alone makes ordering fragile during sync changes.

Evidence: Confirm every event includes both ts_mono_ns and ts_utc plus sync_state; then check event_seq continuity and look for time_step_count spikes around the suspected interval.

First fix: Sort/compute by monotonic time, display by UTC, and enforce dual-timestamp + sequence as mandatory log fields.

  • ts_mono_ns
  • ts_utc
  • event_seq
  • time_step_count
Back to: H2-7 · H2-8
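The first fix above can be sketched in a few lines, assuming every event carries ts_mono_ns and event_seq (the helper name is illustrative):

```python
def ordered_with_gaps(events):
    """Order events for analysis and surface sequence gaps.

    Sort by (ts_mono_ns, event_seq) so wall-clock steps (NTP/PTP
    corrections to ts_utc) can never reorder the timeline; ts_utc stays
    display-only. Missing event_seq values are returned as explicit gaps.
    """
    evs = sorted(events, key=lambda e: (e["ts_mono_ns"], e["event_seq"]))
    seqs = [e["event_seq"] for e in evs]
    gaps = [s for a, b in zip(seqs, seqs[1:]) for s in range(a + 1, b)]
    return evs, gaps
```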
Recipe update broke only some lines — tag mapping versioning or compatibility? (→ H2-10 / H2-3)

Conclusion: Partial breakage after a recipe update usually means tag-map versions are inconsistent across lines, or a compatibility rule (units/scaling/types) was violated by a new parameter schema.

Evidence: Compare tag_map_version and device_profile_id across the affected lines; then verify recipe parameter constraints (units, ranges, types) and whether the controller reports schema/version mismatch events.

First fix: Freeze execution to a single known-good tag map version and roll forward using an explicit compatibility layer for renamed or retyped tags.

  • tag_map_version
  • device_profile_id
  • schema_version
  • unit/scaling
Back to: H2-10 · H2-3
After power loss, batch resumes wrong step — checkpoint design or commit ordering? (→ H2-9 / H2-4)

Conclusion: Wrong-step resume is almost always a checkpoint definition problem (missing state variables) or an unsafe commit order that lets pointers advance before payload is fully durable.

Evidence: Inspect checkpoint_last_seq, saved step_id/phase_state, and the recovery_path event; then validate journal tail integrity (CRC/commit marker) and whether truncation occurred before replay.

First fix: Expand checkpoint to include minimal state-machine invariants and make pointer advancement the final atomic action in the commit path.

  • checkpoint_last_seq
  • phase_state
  • commit_marker
  • recovery_path
Back to: H2-9 · H2-4
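The commit-ordering rule (payload durable first, pointer advance last and atomic) maps directly onto a write-temp / fsync / rename sequence. The JSON-file form is an illustrative stand-in for a real checkpoint journal:

```python
import json
import os

def commit_checkpoint(path, state):
    """Crash-safe checkpoint commit.

    Write the payload to a temp file and fsync it, then make the atomic
    rename the FINAL action -- the 'pointer' (the visible checkpoint
    file) only advances after the payload is fully durable.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())   # payload durable first
    os.replace(tmp, path)      # pointer advance is last and atomic
```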
Totalizer doesn’t match flowmeter — sampling window or calibration version mismatch? (→ H2-6)

Conclusion: A totalizer mismatch is usually caused by window boundary rules (start/stop/stability) or a calibration-version mismatch where the batch used a different scaling than the meter/historian expects.

Evidence: Verify window_start/window_stop triggers and stability criteria, then confirm calib_version and effective_from were attached to the batch record and applied consistently end-to-end.

First fix: Make calib_version mandatory per batch and add a reconciliation report that flags boundary-condition anomalies.

  • window_start
  • window_stop
  • calib_version
  • effective_from
Back to: H2-6
Network drop causes unsafe output — stale-data handling or fail-safe defaults? (→ H2-10 / H2-5)

Conclusion: Unsafe output during network loss points to stale-data being used for gating, or to fail-safe defaults that do not drive outputs to a safe state on link/IO faults.

Evidence: Check whether quality=STALE still allows control writes and whether age_ms exceeded the declared budget; then confirm the output safe_state_table for “link down / input invalid / device absent” cases.

First fix: Hard-block control writes under STALE and enforce fail-safe outputs for every fault class before re-enabling automatic sequencing.

  • quality
  • age_ms
  • safe_state_table
  • watchdog_state
Back to: H2-10 · H2-5
Audit trail fails an inspection — missing change reason or weak linkage to batch_id? (→ H2-8)

Conclusion: Audit failures usually come from missing change justification/approval fields, or from broken linkage where configuration events cannot be tied to the specific batch_id and recipe_version they influenced.

Evidence: Confirm change_reason, approval_id, and actor identity are present for config changes; then sample exports and ensure batch_id and recipe_version are attached to every relevant event in the chain.

First fix: Make change-reason and batch linkage mandatory fields and reject writes that cannot be attributed and scoped.

  • change_reason
  • approval_id
  • batch_id
  • recipe_version
Back to: H2-8
Storage wears out early — write amplification or event verbosity too high? (→ H2-9 / H2-8)

Conclusion: Early wear is typically caused by write amplification (inefficient journal/checkpoint layout) or by logging too many high-frequency events that could be aggregated without losing evidence value.

Evidence: Review bytes_written_total, wear_index/erase_count_max, and the event rate by category; then inspect checkpoint cadence and whether small updates trigger full-block rewrites.

First fix: Reduce verbosity/merge repetitive events and increase checkpoint efficiency before changing media.

  • bytes_written_total
  • wear_index
  • events_per_second
  • checkpoint_period
Back to: H2-9 · H2-8
Time sync keeps flapping — NTP source quality or PTP grandmaster holdover? (→ H2-7)

Conclusion: Sync flapping is usually upstream time-source instability (NTP quality) or poor holdover/switchover behavior when PTP grandmaster changes; frequent time steps corrupt correlation even if monotonic ordering is intact.

Evidence: Check sync_state, offset_ms, holdover_state, and time_step_count; then correlate step events with upstream source changes and verify monotonic timestamps remain continuous.

First fix: Tighten source priority/thresholds and avoid wall-clock steps that exceed the declared event-ordering tolerance.

  • sync_state
  • offset_ms
  • holdover_state
  • time_step_count
Back to: H2-7
Manual override creates confusion — permission model or inadequate logging? (→ H2-5 / H2-8)

Conclusion: Override confusion typically comes from weak role/permission boundaries or from incomplete before/after logging that makes it impossible to reconstruct who changed what, when, and for how long.

Evidence: Verify operator_role, override_grant, and expiry are enforced; then confirm override events capture before/after values, duration, and affected outputs, linked to batch_id and operator identity.

First fix: Require time-bounded override grants and log the full override envelope (grant → active → revoke) as evidence-grade events.

  • operator_role
  • override_grant
  • override_duration
  • batch_id
Back to: H2-5 · H2-8
Latency spikes during heavy logging — priority inversion or storage blocking? (→ H2-4 / H2-9)

Conclusion: Logging-induced latency spikes usually indicate priority inversion (sequencer blocked behind lower-priority work) or synchronous storage commits that stall the main loop under bursty event rates.

Evidence: Compare loop_latency_ms/scan_time_ms with storage_commit_time_ms and queue depth; then confirm whether log writes are synchronous or buffered and whether backpressure throttles event generation safely.

First fix: Move logging to an async queue with bounded flush policy, and protect the sequencer with priority and budget enforcement.

  • loop_latency_ms
  • storage_commit_time_ms
  • queue_depth
  • backpressure
Back to: H2-4 · H2-9
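A minimal sketch of the async-queue first fix, assuming a drop-and-count overflow policy (blocking with a deadline is the alternative; the class and sink interface are illustrative):

```python
import queue
import threading

class AsyncLogger:
    """Decouple the sequencer from storage.

    Events go into a bounded queue drained by a writer thread; put()
    never blocks the control loop. On overflow it drops and counts,
    which is the visible backpressure signal -- a policy choice, not
    the only valid one.
    """
    def __init__(self, sink, maxsize=1024):
        self.q = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        self.sink = sink
        self._t = threading.Thread(target=self._drain, daemon=True)
        self._t.start()

    def put(self, event):
        try:
            self.q.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # counted, never a sequencer stall

    def _drain(self):
        while True:
            self.sink(self.q.get())
            self.q.task_done()

    def flush(self):
        self.q.join()
```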

MPN quick list (common building blocks)

  • RTC: NXP PCF2129 · Microchip MCP79410 · Analog Devices MAX31341
  • Supervisor / watchdog: TI TPS3839 · TI TPS3430 · Analog Devices ADM809
  • FRAM / MRAM / NOR: Infineon/Cypress FM25V10 · Fujitsu MB85RS64V · Everspin MR25H40 · Winbond W25Q64JV · Macronix MX25L128
  • Ethernet / PHY: WIZnet W5500 · TI DP83826E · Microchip KSZ8081RNA
  • RS-485 / isolation: TI SN65HVD72 · Analog Devices ADM2587E · Analog Devices ADuM1250
  • Secure element (optional): Microchip ATECC608B · NXP SE050

These are reference examples to support commissioning evidence (timebase, power-fail behavior, logging media, fieldbus bridging, optional signing).