Suspension & Air Spring Control for Rolling Stock
Rail air-spring suspension is a closed-loop system that maintains carbody height and ride comfort by combining pressure/height sensing, valve actuation, and evidence-driven fault handling under harsh EMC and transient conditions. This guide shows what to measure, what to log, and what to fix first—from sensing and valve drivers to isolation, validation, and field aging-model updates.
H2-1. System Role & Operating Principle
Rolling-stock air suspension maintains carbody height (and, when applicable, left/right level) under passenger load changes and track-induced excitation. The loop uses height sensing as the primary observable, pressure/temperature as supporting evidence, and a valve manifold (fill/exhaust) as the actuator. The engineering target is not “pressure accuracy,” but stable height regulation, bounded valve activity, and diagnosable behavior under rail EMC and pneumatic delays.
Air suspension interacts with both primary and secondary suspension dynamics, but the control problem here is specific: the system must keep a repeatable height reference while the pneumatic plant introduces compressibility, flow limits, and temperature sensitivity. Height is therefore the “truth signal,” while pressure becomes a secondary channel for (1) cross-checking plausibility, (2) estimating load/health, and (3) explaining slow drifts that height alone cannot attribute.
- Why closed-loop height control is required: load changes shift the static equilibrium; temperature and leakage can make “pressure correct” while height is wrong.
- How load change is detected: a height step (or persistent offset) is directly observed; pressure+temperature correlated with height supports load/health estimation.
- How ride comfort is affected: poor delay handling leads to overshoot and valve chatter, creating low-frequency body motion and visible level oscillation.
H2-2. Mechanical & Pneumatic Model
The pneumatic plant is fundamentally slow and nonlinear: compressible gas, finite flow through valves/orifices, and hose volume create delay and path dependence. A practical model must explain three field observations: (1) temperature-driven pressure shifts that do not mean height shifts, (2) leakage-driven slow height drift, and (3) overshoot or chatter when the controller ignores pneumatic time constants.
Steady-state intuition (what sets height): the air spring supports vertical load via pressure acting over an effective area. Height changes alter internal volume, shifting equilibrium pressure. In engineering terms, height is the primary observable; pressure is a supporting channel that becomes ambiguous when temperature or gas mass changes. This is why “pressure looks OK” can coexist with “height is wrong,” and why the diagnostic logic must compare pressure, height, and temperature together.
Effective stiffness (why comfort can change): the air spring behaves like a variable-rate spring. Higher pressure or lower effective volume generally increases stiffness (“harder” feel). Adding reservoir volume can soften the effective stiffness but also increases the amount of gas that must be moved, which slows response and can worsen transient overshoot unless control timing is adjusted.
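The steady-state and stiffness statements above can be written compactly. This is a textbook idealization rather than a model taken from any specific air spring: it assumes a polytropic gas process and neglects the change of effective area with height. The symbols $F$ (supported load), $p$ (absolute pressure), $p_{\mathrm{atm}}$ (ambient pressure), $A_e$ (effective area), $V$ (total gas volume, including any freely connected reservoir), and $n$ (polytropic exponent, roughly 1 for slow isothermal motion up to 1.4 for fast adiabatic motion) are introduced here for illustration.

```latex
F = (p - p_{\mathrm{atm}})\,A_e,
\qquad
p\,V^{n} = \text{const},\quad dV = -A_e\,dz
\;\;\Rightarrow\;\;
k = \frac{dF}{dz} \approx \frac{n\,p\,A_e^{2}}{V}
```

Read this way, stiffness rises with pressure and falls with volume, which matches the observation that added reservoir volume softens the spring while increasing the gas mass that must be moved per millimetre of height correction.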
Dynamics (why delay matters): the valve manifold and hoses limit mass flow. After a fill/exhaust pulse, pressure and height continue to evolve as the pneumatic network equalizes. The measurable symptom is a non-zero delay between valve_cmd and height_mm response (Δt), plus continued motion after the command stops. A robust strategy therefore enforces minimum on/off times, limits integral windup, and uses deadbands to avoid reacting to sensor noise amplified by delay.
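A minimal sketch of how Δt could be extracted from logged samples, assuming time-aligned arrays of valve_cmd and height_mm. The function name, the 0.5 mm response threshold, and the "first rising edge" convention are illustrative assumptions, not part of any defined log format.

```python
from typing import Optional, Sequence


def response_delay_s(
    t_s: Sequence[float],
    valve_cmd: Sequence[int],
    height_mm: Sequence[float],
    threshold_mm: float = 0.5,  # hypothetical "height has responded" threshold
) -> Optional[float]:
    """Delay from the first valve_cmd rising edge to the first sample where
    height has moved more than threshold_mm from its value at the edge.
    Returns None if there is no edge or no detectable response."""
    edge = None
    for i in range(1, len(valve_cmd)):
        if valve_cmd[i] and not valve_cmd[i - 1]:
            edge = i
            break
    if edge is None:
        return None

    baseline = height_mm[edge]
    for j in range(edge, len(height_mm)):
        if abs(height_mm[j] - baseline) > threshold_mm:
            return t_s[j] - t_s[edge]
    return None
```

Tracking this value per temperature bin and per valve gives a direct way to size minimum on/off times instead of guessing them.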
H2-3. Height & Pressure Sensing Chain
The sensing chain must turn height and pressure into verifiable evidence under rail conditions: long harnesses, common-mode swings, and EMI injection. The practical goal is not “a sensor choice,” but a diagnosable pipeline where each hop (sensor → AFE → ADC/ΣΔ → MCU → log) exposes health flags, raw counts, and calibration state.
3.1 Height sensing options (selection logic that survives field reality)
Height is the primary truth signal for closed-loop control, so the selection criteria must prioritize: long-cable robustness, stable reference behavior, and predictable failure detectability. Common implementations include LVDT-based displacement, potentiometric position sensing, and magnetostrictive position sensing. The deciding factor is typically how the sensor output and harness interact with the vehicle’s ground potential shifts and EMI environment.
- LVDT: strong for non-contact displacement measurement; requires stable excitation/conditioning. Key field risk is reference drift or saturation during common-mode events.
- Potentiometer: simple interface; risk is wear-related drift and intermittent contact under vibration. Diagnostics must watch for step noise and open-circuit signatures.
- Magnetostrictive: non-wear sensing with good repeatability; the interface is more complex. Field strength and EMI immunity must be verified with the actual harness and routing.
3.2 Pressure sensing chain (why isolation + ΣΔ is common in rail)
A MEMS pressure element is usually not the hardest part; the challenge is carrying small pressure-dependent signals across noise, ground shifts, and transients. A robust rail-grade chain often combines isolation (to break ground loops and block common-mode injection) with a ΣΔ conversion approach (to move the signal into a digitally-filtered domain). The engineering focus is the end-to-end behavior: gain/offset stability, anti-alias and digital filtering, and time alignment with control and logging.
3.3 Rail engineering failure signatures (symptom → evidence → first fix)
- Common-mode coupling: raw counts show abrupt offset steps or correlated noise across channels. Evidence: height_raw_counts and pressure_raw_counts jump together; sensor_status indicates saturation. First fix: improve CM rejection path (shield termination, differential input, isolator CMTI margin).
- EMI injection: periodic ripple appears at a fixed phase relative to switching or valve activity. Evidence: narrowband energy rise, repeatable timing vs valve_cmd. First fix: sampling phase management + front-end RC/CM filtering + routing return paths away from sensor reference.
- Long harness issues: lab short cable works; field cable causes intermittent errors. Evidence: increased CRC/status faults (if digital), rising offset drift vs vibration/temperature. First fix: proper termination/shielding, input protection/limiting, connector/harness validation under transients.
- Sensor drift: slow offset change forces higher valve activity to “hold height.” Evidence: monotonic offset_counts change and larger calibration deltas; cross-check pressure/temperature consistency. First fix: temperature compensation, scheduled recalibration policy, cross-channel plausibility checks.
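The common-mode signature in the list above (both raw channels jumping together) lends itself to a simple screening pass over logged counts. The sketch below is one possible check, assuming both channels are sampled on the same index; the function name and the 200-count step threshold are hypothetical and would need to be scaled to the actual ADC resolution.

```python
from typing import List, Sequence


def common_mode_steps(
    height_raw_counts: Sequence[int],
    pressure_raw_counts: Sequence[int],
    step_counts: int = 200,  # hypothetical threshold, depends on ADC scaling
) -> List[int]:
    """Return sample indices where BOTH channels jump by at least
    step_counts between consecutive samples."""
    n = min(len(height_raw_counts), len(pressure_raw_counts))
    hits: List[int] = []
    for i in range(1, n):
        dh = abs(height_raw_counts[i] - height_raw_counts[i - 1])
        dp = abs(pressure_raw_counts[i] - pressure_raw_counts[i - 1])
        if dh >= step_counts and dp >= step_counts:
            hits.append(i)
    return hits
```

A jump on one channel only points toward that sensor or its harness; simultaneous jumps on unrelated physical quantities are more consistent with a shared reference or common-mode path.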
H2-4. Valve & Driver Architecture
The actuator side must be self-diagnosing: if the manifold does not respond to a command, the loop loses control authority over height and troubleshooting becomes guesswork. A rail-ready valve drive stage therefore combines controlled energization (to limit inrush and ground bounce), robust flyback handling (to contain kickback energy), and fast protection (overcurrent/short) with explicit feedback flags and event logging.
4.1 Fill vs exhaust valves (control-relevant asymmetry)
Fill and exhaust paths rarely behave symmetrically in the field. Fill authority depends on available supply pressure and restrictions; exhaust depends on vent path and silencers. To avoid oscillation and chatter, implementations typically enforce: minimum on-time, minimum off-time, deadband around target height, and different pulse limits for fill vs exhaust. These limits should be visible in logs as command edges, duration, and resulting pressure/height response.
4.2 Coil drive realities (inrush, kickback, OC/SC)
- Inrush and supply dip: coil energization can cause a fast current rise and rail dip, coupling into sensing and MCU stability. First mitigation is controlled drive (slew/limit) and local decoupling with a tight return loop.
- Kickback (flyback): fast turn-off generates a voltage spike; the clamp path must keep high di/dt currents out of logic/sensing references. Typical elements include TVS or diode clamps (implementation-dependent).
- Overcurrent and short protection: fast OC detection isolates a shorted coil/harness before repeated dips cause resets. Logs should include oc_flag, trip_count, and the commanded duration.
- Open-load detection: when a coil is disconnected or harness is broken, commands produce no current and no pneumatic response. The driver should report open_load and the control should enter a conservative mode.
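As an illustration of how the driver flags above might be derived in software when only a coil-current measurement is available, the sketch below classifies one sample. It assumes the sample is taken after a blanking time so that the coil current has settled; the current thresholds and the names classify_coil_sample and DriverFault are hypothetical. Smart drivers typically report these flags in hardware instead.

```python
from enum import Enum


class DriverFault(Enum):
    NONE = "none"
    OPEN_LOAD = "open_load"
    OVERCURRENT = "oc_flag"


def classify_coil_sample(
    cmd_on: bool,
    coil_current_a: float,
    open_load_a: float = 0.05,  # hypothetical: below this while ON means no coil
    oc_limit_a: float = 2.0,    # hypothetical overcurrent limit
) -> DriverFault:
    """Classify one settled coil-current sample against the command state."""
    if coil_current_a > oc_limit_a:
        return DriverFault.OVERCURRENT
    if cmd_on and coil_current_a < open_load_a:
        return DriverFault.OPEN_LOAD
    return DriverFault.NONE
```

Logging the resulting flag alongside pulse_ms and supply_v keeps the evidence aligned with the command that produced it.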
4.3 High-side vs low-side drive (decision impacts diagnostics and EMI path)
High-side and low-side drive choices change both diagnostics and noise coupling. Low-side switching can be more sensitive to ground bounce (especially with shared returns), while high-side can simplify certain short-to-ground checks. The decision should be guided by harness return routing, protection requirements, and where kickback energy is allowed to flow.
H2-5. Closed-Loop Control Strategy
Height is the primary controlled variable; pressure and temperature are supporting evidence. The actuator is not continuous—valves are discrete, delayed by pneumatic dynamics, and constrained by minimum on/off times. A field-ready strategy therefore layers practical protections around PID: deadband to block noise, anti-windup to prevent overshoot, and pulse/rate limiting to avoid valve chatter and supply dips.
5.1 Control layers (from measurement to safe actuation)
- Measurement selection: use height_mm_filt as the control truth; keep pressure_kPa + temp_C as plausibility and health context.
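- Deadband + hysteresis: if |error_mm| is inside the deadband, do not actuate; start correcting at a slightly larger error than the one at which correction stops, so noise near the boundary cannot toggle the valves. This prevents noise-triggered pulses and extends valve life.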
- PID with anti-windup: clamp or freeze the integrator when the actuator is saturated or gated off; this avoids large overshoot after a delay.
- Pulse mapping: convert controller output into fill/exhaust pulses with minimum on-time and minimum off-time.
- Rate limiting: bound toggles per minute and cap maximum pulse width per command window.
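The layers above can be sketched as a single control step. This is a simplified PI variant (no derivative term) under stated assumptions: the controller output is expressed directly as a pulse width in milliseconds, anti-windup uses conditional integration, and all gains and limits are placeholder values. Hysteresis and the toggles-per-minute limit are not shown; only a minimum off-time is enforced.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HeightLoopConfig:
    # All values are illustrative placeholders, not tuned parameters.
    kp_ms_per_mm: float = 20.0
    ki_ms_per_mm_s: float = 2.0
    deadband_mm: float = 1.0
    i_limit_mm_s: float = 50.0
    min_on_ms: float = 30.0
    max_on_ms: float = 300.0
    min_off_ms: float = 500.0


class HeightLoop:
    def __init__(self, cfg: HeightLoopConfig) -> None:
        self.cfg = cfg
        self.i_state = 0.0   # integrator state (mm*s)
        self.busy_ms = 0.0   # time until the next pulse is allowed

    def step(self, height_target_mm: float, height_mm_filt: float,
             dt_s: float, gated: bool = False) -> Tuple[str, float]:
        """Return (valve, pulse_ms); valve is 'fill', 'exhaust' or 'none'."""
        cfg = self.cfg
        self.busy_ms = max(0.0, self.busy_ms - dt_s * 1000.0)
        error_mm = height_target_mm - height_mm_filt

        # Deadband, external gating, or a pulse/off-time still in progress:
        # no actuation, and the integrator stays frozen.
        if gated or self.busy_ms > 0.0 or abs(error_mm) <= cfg.deadband_mm:
            return "none", 0.0

        i_new = self.i_state + error_mm * dt_s
        i_new = max(-cfg.i_limit_mm_s, min(cfg.i_limit_mm_s, i_new))
        u_ms = cfg.kp_ms_per_mm * error_mm + cfg.ki_ms_per_mm_s * i_new

        # Anti-windup: accept the new integrator value only when unsaturated.
        if abs(u_ms) <= cfg.max_on_ms:
            self.i_state = i_new

        if abs(u_ms) < cfg.min_on_ms:
            return "none", 0.0

        pulse_ms = min(abs(u_ms), cfg.max_on_ms)
        self.busy_ms = pulse_ms + cfg.min_off_ms
        return ("fill" if u_ms > 0 else "exhaust"), pulse_ms
```

Keeping separate config instances for fill and exhaust would be one way to express the asymmetric pulse limits described in section 4.1.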
5.2 Pressure-assisted logic (supporting evidence, not a replacement)
Pressure is most useful as a supporting channel: it explains slow drifts and helps classify failures. When height deviates but pressure does not change as expected, the sensing chain is suspect. When a valve command occurs but pressure and height show no consistent response, the manifold/flow path may be impaired (open-load, stuck valve, or blocked pneumatic path). Pressure + temperature can also support load estimation and adaptive thresholds without turning the loop into a pressure controller.
5.3 Dynamic gating (station/accel/brake conditions)
During transient motion or high vibration, aggressive control can amplify body motion and produce chatter. A practical approach gates actuation and integral action: when motion or vibration crosses a threshold, freeze the integrator and restrict pulses until the signal quality returns. This avoids chasing short-lived disturbances and preserves stability under pneumatic delay.
H2-6. Vibration Monitoring & Ride Quality
Vibration monitoring serves two roles: it quantifies ride quality and it protects the height loop from reacting to short-lived disturbances. A practical implementation captures acceleration with a consistent timebase, derives simple metrics (RMS/peak and band energy), and triggers event logs that can be aligned with valve commands and height error history.
6.1 Sensor placement and signal integrity
Placement affects what the sensor “sees.” Carbody mounting emphasizes comfort-relevant motion, while locations nearer to structural interfaces can emphasize higher-frequency content. The engineering priority is stable mounting, known axis orientation, and a harness/reference strategy that avoids injecting noise into the measurement. Time alignment is critical: vibration metrics are only actionable if their timestamps can be correlated with control loop actions.
6.2 Metrics that explain ride quality and control risk
- RMS (windowed): describes sustained vibration level over a defined time window; useful for gating control aggressiveness.
- Peak: captures shocks/impacts; useful for event classification and fault triage.
- Band energy: summarizes frequency distribution (low/mid/high); helps separate slow body motion from impacts or resonant behavior.
- Event triggers: threshold + minimum gap + pre/post capture create black-box records suitable for field debugging.
6.3 Practical implementation notes (bandwidth, filtering, logging)
Filtering should reduce noise without destroying time correlation. Use windowed RMS and coarse band-energy summaries rather than heavy filtering that adds large phase delay. When vibration is high, apply control gating: freeze integrator state and restrict valve toggles to avoid noise-triggered actuation and perceived ride degradation.
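A minimal sketch of windowed RMS driving a control gate, assuming the acceleration input already has gravity/DC removed. The class name, window length, and the two thresholds (a higher one to engage the gate, a lower one to release it) are illustrative assumptions.

```python
import math
from collections import deque


class VibrationGate:
    """Windowed RMS with hysteresis; the output plays the role of control_gate_flag."""

    def __init__(self, window: int = 64,
                 on_rms_g: float = 0.30,    # hypothetical engage threshold
                 off_rms_g: float = 0.20):  # hypothetical release threshold
        self.samples = deque(maxlen=window)
        self.on_rms_g = on_rms_g
        self.off_rms_g = off_rms_g
        self.accel_rms = 0.0
        self.control_gate_flag = False

    def update(self, accel_g: float) -> bool:
        self.samples.append(accel_g)
        self.accel_rms = math.sqrt(
            sum(a * a for a in self.samples) / len(self.samples)
        )
        if not self.control_gate_flag and self.accel_rms > self.on_rms_g:
            self.control_gate_flag = True
        elif self.control_gate_flag and self.accel_rms < self.off_rms_g:
            self.control_gate_flag = False
        return self.control_gate_flag
```

The returned flag can be passed to the height loop as its gating input, so the integrator freezes and pulses stop while vibration is high.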
H2-7. Protection & Fault Handling
Protection must be a closed loop: detect a trigger, capture evidence, execute a deterministic action, and apply a clear recovery rule. For air-spring control, the critical objective is to prevent unsafe valve behavior under transients (over/under-voltage), stop uncontrolled height hunting under leaks, and maintain diagnosability when sensors or drivers degrade.
7.1 Fixed response template (Trigger → Evidence → Action → Clear)
Each fault class should use the same structure so operators and logs remain comparable across vehicles and software versions. Triggers are window-based (time or counts), evidence fields capture the minimal snapshot needed to localize root cause, actions define a safe actuator posture (limit/lock/degrade), and clear rules prevent oscillation between states.
Overvoltage (OV)
Trigger: supply_v > V_OV for t_ov, or ov_count in a window
Evidence: supply_v, ov_flag, ov_count, sample_ts, valve_cmd, height/pressure snapshot
Action: restrict pulses; raise alarm; capture event snapshot
Clear: supply_v stable within limits for t_clear; counters decay
Undervoltage / Brownout (UV)
Trigger: supply_v < V_UV, brownout_event, reset_cause
Evidence: supply_v, uv_flag, brownout_count, reset_cause, watchdog_reset
Action: enter fail-safe valve posture; freeze integrator; conservative mode
Clear: stable supply + self-check passed + staged recovery
Leak detection
Trigger: pressure drops in hold state; rising fill_cmd frequency; drift score
Evidence: pressure_kPa, temp_C, height_mm, hold_time_s, fill_cmd_count, leak_score
Action: alarm; degrade (limit refills); log long-window snapshot
Clear: maintenance clear or multi-cycle stability proof
Sensor failure / plausibility
Trigger: open/short/overrange/saturation; plausibility_fail_count
Evidence: sensor_status, raw_counts, crc_error, cal_version, plausibility counters
Action: switch to redundant channel if available; else limit control authority
Clear: N consecutive valid samples + stable status
Valve driver abnormal
Trigger: oc_flag/short_flag/open_load; repeated trip_count
Evidence: oc_flag, short_flag, open_load, trip_count, pulse_ms, supply_uv_event
Action: channel lockout; limited retries; protect supply and sensing
Clear: cooldown + one self-test pulse; if fail persists, remain locked
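The Trigger/Clear pairing in these templates can be sketched as a small time-qualified latch. This is one possible structure, not a prescribed one: the class name WindowedFault and its parameters are hypothetical, and evidence capture and actions are left to the caller.

```python
class WindowedFault:
    """Latch a fault after the condition holds for t_trip_s;
    clear it after the condition is absent for t_clear_s."""

    def __init__(self, t_trip_s: float, t_clear_s: float) -> None:
        self.t_trip_s = t_trip_s
        self.t_clear_s = t_clear_s
        self.active = False
        self.count = 0        # number of times the fault has latched
        self._timer_s = 0.0

    def update(self, condition: bool, dt_s: float) -> bool:
        if condition != self.active:
            # Input disagrees with the latched state: accumulate time
            # toward a transition (trip or clear).
            self._timer_s += dt_s
            limit = self.t_clear_s if self.active else self.t_trip_s
            if self._timer_s >= limit:
                self.active = not self.active
                self._timer_s = 0.0
                if self.active:
                    self.count += 1
        else:
            # Input agrees with the latched state: restart qualification.
            self._timer_s = 0.0
        return self.active
```

For the overvoltage case this would be called as `ov.update(supply_v > V_OV, dt_s)` with `t_trip_s = t_ov` and `t_clear_s = t_clear`; making the clear time longer than the trip time is what prevents rapid oscillation between states.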
7.2 Fail-safe valve posture (deterministic output under fault)
A fail-safe posture defines what the actuator outputs become when the controller is unstable, the supply is out of bounds, or a watchdog reset occurs. The posture is enforced by both software state and driver hardware defaults: valve commands are inhibited or limited, integrator state is frozen, and re-entry to normal control is staged (self-check → conservative control → normal).
7.3 Redundancy and watchdog recovery (avoid “reset → overshoot”)
- Dual-channel sensing: implement window-based agreement checks and log the channel selection decision with timestamps and calibration versions.
- Watchdog: after a watchdog reset, start in a recovery stage (freeze integrator, limit pulses, verify sensors/driver flags) before restoring normal gains.
- Clear rules: use stable time windows and counters to prevent rapid oscillation between normal/degraded states.
H2-8. Isolation, EMC & Rail Transients
Rail environments combine long harnesses, large common-mode shifts, and high-energy transients (EFT/surge/lightning-like events). Robust behavior requires designing the injection paths out of the system: isolate communication boundaries, suppress common-mode currents, keep protection loops short, and ensure high di/dt return currents do not flow through sensitive AFE/MCU reference regions.
8.1 What changes in rail (transients and common-mode reality)
- EFT / fast bursts: couples into long cables and I/O edges, creating false transitions and ADC disturbances.
- Surge / high energy: stresses protection devices and raises ground potential, pushing sensors and PHYs into saturation.
- Lightning-like impulses: drives extreme dv/dt and di/dt; the outcome depends on where the current returns.
8.2 Isolation boundary (communications and field wiring)
Isolation is not only a component choice; it is a boundary definition. The field side must have a controlled return path for high-energy currents, while the control side reference must remain quiet for sensor and MCU stability. Isolated transceivers and isolated power supplies should be paired with common-mode suppression elements and short protection loops near connectors.
8.3 Practical must-haves (CM suppression, TVS layout, ground loops)
- Common-mode suppression: differential inputs/links plus CMC/RC networks to prevent CM currents from entering the logic reference.
- TVS placement: protect at the interface; keep the clamp loop short and local; avoid routing the clamp return through AFE/MCU grounds.
- Return path control: ensure high di/dt currents close locally; do not let them traverse sensor reference regions.
H2-9. Communications & Logging
Communications is only valuable if it preserves diagnosability under interference. Logging is the evidence backbone: it ties height control, valve actions, vibration events, protection trips, and parameter versions onto one consistent timeline. A rail-ready design therefore pairs isolated links (Ethernet/RS-485/CAN) with time synchronization and strict versioned configuration records.
9.1 Link layer expectations (Ethernet / RS-485 / CAN)
- Isolation first: isolate the transceiver/PHY and its power so common-mode shifts do not collapse the logic reference.
- Error evidence: log CRC/errors, drop counters, reconnect attempts, and link state transitions.
- Recovery: define deterministic reconnect/backoff and persist the reason codes.
9.2 Time synchronization (the single time axis for all evidence)
Without time sync, valve pulses cannot be correlated with sensor deviations or vibration events. Time sync should expose health fields: source selection, offset/skew estimate, and loss-of-sync counters. When sync is lost, logging must note the transition and maintain monotonic local timestamps.
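A minimal sketch of the "note the transition and stay monotonic" rule, assuming the caller supplies a local monotonic counter in seconds and, when sync is healthy, the current offset to the synchronized time. The class name LogClock and the status strings are hypothetical.

```python
from typing import Optional, Tuple


class LogClock:
    """Produce non-decreasing log timestamps across sync loss and re-sync."""

    def __init__(self) -> None:
        self.offset_s = 0.0
        self.in_sync = False
        self.loss_of_sync_count = 0
        self._last_ts = float("-inf")

    def stamp(self, local_s: float,
              sync_offset_s: Optional[float]) -> Tuple[float, str]:
        """local_s: local monotonic time; sync_offset_s: None when sync is lost."""
        if sync_offset_s is not None:
            self.offset_s = sync_offset_s
            self.in_sync = True
        elif self.in_sync:
            # Transition synced -> holdover: count it, keep the last offset.
            self.in_sync = False
            self.loss_of_sync_count += 1

        # Never step backwards, even if a re-sync pulls the offset down.
        ts = max(local_s + self.offset_s, self._last_ts)
        self._last_ts = ts
        return ts, ("synced" if self.in_sync else "holdover")
```

Writing the returned status into every record lets later analysis decide how much to trust cross-unit correlation for that interval.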
9.3 Parameter version management (make field incidents reproducible)
Every incident must reference the exact parameter set used at that time: controller gains, deadband/limits, calibration versions, and safety thresholds. Configuration changes should be logged as first-class events with a version ID, timestamp, and a short change summary.
9.4 Log schema (fast loop, events, config)
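A minimal sketch of one possible three-part schema, built from the field names that appear elsewhere in this guide. The record grouping, the field types, and the content-hash convention for param_set_id are assumptions for illustration only.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class FastLoopRecord:
    sample_ts: float
    height_mm_raw: float
    height_mm_filt: float
    pressure_kPa: float
    temp_C: float
    error_mm: float
    i_state: float
    valve_cmd: str          # "fill", "exhaust" or "none"
    pulse_ms: float
    sensor_status: int
    time_sync_status: str
    param_set_id: str


@dataclass(frozen=True)
class EventRecord:
    event_ts: float
    fault_state: str
    supply_v: float
    reset_cause: str
    snapshot: Tuple[FastLoopRecord, ...]  # pre/post samples around the event


@dataclass(frozen=True)
class ConfigChangeRecord:
    apply_ts: float
    param_set_id: str
    cal_version: str
    model_version: str
    summary: str


def make_param_set_id(params: dict) -> str:
    """Derive a short ID from parameter content, so identical parameter
    sets always get the same ID (a hypothetical convention)."""
    blob = json.dumps(params, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()[:12]
```

Because every fast-loop record carries param_set_id, any sample can be traced back to the ConfigChangeRecord that introduced its parameters; `asdict(record)` gives a serializable form for storage or transmission.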
H2-10. Validation & Test Plan
A rail-ready validation plan must be reproducible and evidence-driven. Every test item is defined as: Test item → Measurement → Pass/Fail criteria → Log fields. The plan below covers lab characterization (pressure & valve latency), rig-level closed-loop dynamics (load steps & leak injection), and line/environment reliability (temperature cycling & long-term drift).
10.1 Reference measurement chain (example MPNs)
The test plan assumes a measurement chain capable of time-aligned logging of pressure/height/vibration and protection states. The following representative parts are commonly used building blocks:
Pressure / Height acquisition
ΣΔ ADC: TI ADS131M04 (simultaneous sampling) / TI ADS124S08 (precision, low-speed)
Isolated ΣΔ modulator: TI AMC1306M25 or TI AMC1304M25
Isolated amplifier: TI AMC1311
Digital isolator: TI ISO7741 / ADI ADuM141E
Valve driver timing & protection capture
High-side switch (eFuse family): TI TPS25982 (hot-swap/eFuse class example)
High-side driver: Infineon BTS500xx family (smart high-side switch example)
Low-side driver: TI DRV103 (solenoid driver class example)
TVS example: Littelfuse SMCJ58A (selection depends on rail & interface)
Vibration sensing
MEMS accelerometer: ADI ADXL355 (low-noise) / ST LIS3DH (general)
Timebase tag: log with sample_ts plus sync health fields
Time sync, secure logging, nonvolatile storage
RTC (temp-comp): Microchip MCP79410 (RTC class example)
Secure element: Microchip ATECC608B (signing/identity)
FRAM (event log): Fujitsu MB85RS64V (SPI FRAM class example)
10.2 Test matrix (engineering format)
| Test item | Measurement | Pass/Fail criteria (examples) | Log fields (evidence) |
|---|---|---|---|
| Pressure scan (lab): P step up/down, fixed temp window | pressure_kPa vs height_mm_filt; hysteresis index; temp_C compensation. Chain: pressure sensor → AFE/isolator → ADS131M04 / AMC1306M25 | Height error within spec across scan; bounded hysteresis; no saturation flags. Use window-based acceptance, not single-point | pressure_kPa, height_mm_raw/filt, temp_C, scan_step_id, sensor_status, sample_ts, param_set_id, cal_version |
| Valve actuation latency (lab): pulse & step response timing | cmd_ts → response_ts; delay mean/std; coil current/protection flags if available. Driver class examples: DRV103 (LS) / BTS500xx (HS) | Delay ≤ limit; jitter ≤ limit; no repeated driver trip_count; no UV resets. Latency must be stable across temperature bins | fill_cmd/exhaust_cmd, pulse_ms, valve_state, oc/short/open flags, supply_v, reset_cause, response_detect_flag, sample_ts |
| Vibration simulation (lab): shaker input profiles | accel_rms/peak; band_energy; valve_toggle_count; control_gate_flag. Accel examples: ADXL355 / LIS3DH | No self-excited hunting; toggle rate bounded; protection not spuriously triggered | accel_rms/peak, band_energy, height_mm_filt, error_mm, i_state, pulse_ms, toggle_count, event_ts |
| Dynamic load steps (rig): closed-loop on bench rig | overshoot_mm; settling_time_s; steady_state_error_mm; valve duty | Overshoot ≤ target; settling ≤ target; steady error ≤ target; no driver lockouts. Acceptance can be parameter-set dependent; always version-tag | height_target_mm, height_mm_filt, error_mm, i_state, pulse_ms, motion_gate_flag, supply_v, sample_ts, param_set_id |
| Leak injection (rig): controlled leak paths | pressure_decay_rate; leak_score; refill_frequency; drift_rate | Correct transition to degraded; bounded refill behavior; alarm tiers triggered as designed | leak_score, hold_time_s, fill_cmd_count, pressure_kPa, height_mm_filt, fault_state, event_ts, param_set_id |
| Temperature cycling (line/env): cold/heat cycles | sensor offset drift; channel_delta (if redundant); reset statistics; clamp events | Offset drift bounded; plausibility checks stable; no repeated brownouts; comms error rate bounded. Isolators: ISO7741 / ADuM141E | temp_C, raw_counts, offset_est, channel_delta, sensor_status, crc_error_count, reset_cause, sample_ts, cal_version |
| Long-term drift monitoring: weeks/months trend | trend slopes: height_offset/day, refill/week, leak_score trend; event counters. Event log storage: MB85RS64V (FRAM) example | Drift slopes bounded; event rate not increasing; stable time sync status. Summaries must still carry param_set_id + model_version | daily_summary_id, trend_slope, fill_cmd_count, leak_score, event_counts, time_sync_status, param_set_id, model_version |
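For the dynamic load step row, the three measurements can be computed from a logged step response as sketched below. The settling band, the tail length used for the steady-state estimate, and the function name step_metrics are assumptions; actual acceptance limits come from the project specification.

```python
from typing import Dict, Optional, Sequence


def step_metrics(
    t_s: Sequence[float],
    height_mm_filt: Sequence[float],
    height_target_mm: float,
    band_mm: float = 1.0,  # hypothetical settling band around the target
    tail: int = 10,        # samples averaged for the steady-state estimate
) -> Dict[str, Optional[float]]:
    """Compute overshoot_mm, settling_time_s and steady_state_error_mm
    for one step response that starts at index 0."""
    start_mm = height_mm_filt[0]
    direction = 1.0 if height_target_mm >= start_mm else -1.0

    # Overshoot: largest excursion past the target in the direction of travel.
    overshoot_mm = max(
        0.0, max(direction * (h - height_target_mm) for h in height_mm_filt)
    )

    # Settling: time of the first sample after the last out-of-band sample.
    last_out = -1
    for i, h in enumerate(height_mm_filt):
        if abs(h - height_target_mm) > band_mm:
            last_out = i
    settled = last_out + 1
    settling_time_s = t_s[settled] - t_s[0] if settled < len(t_s) else None

    tail_vals = height_mm_filt[-tail:]
    steady_state_error_mm = height_target_mm - sum(tail_vals) / len(tail_vals)

    return {
        "overshoot_mm": overshoot_mm,
        "settling_time_s": settling_time_s,
        "steady_state_error_mm": steady_state_error_mm,
    }
```

A settling_time_s of None means the response never stayed inside the band within the recorded window, which should be treated as a fail rather than as missing data.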
H2-11. Field Feedback & Aging Model
Air-spring suspension control is not a static design: the plant it controls changes over its service life, so the control and diagnostic models must be updated from field evidence. Rubber aging changes effective stiffness and hysteresis, leak rates typically increase over time, and sensors drift in offset and temperature coefficient. A robust design defines an update pipeline (data intake → feature extraction → quality gate → coefficient update → version control → rollback).
11.1 Aging mechanisms and what evidence proves them
Rubber aging (stiffness & hysteresis drift)
The same pressure change produces a different height response over time (slower, smaller, more hysteretic). Identify it by comparing pressure–height response shape under matched operating windows. Features: K_eff, hysteresis_index, time_constant_tau
Evidence fields: pressure_kPa, height_mm_filt, temp_C, event_window_id, sample_ts
Leak rate growth (maintenance predictor)
Hold-state pressure decay accelerates and refill frequency increases. Use long-window evidence and avoid single-event conclusions. Features: leak_rate_est, refill_frequency, hold_stability_index
Evidence fields: hold_time_s, pressure_decay_rate, fill_cmd_count, leak_score, event_counts
Sensor zero drift (offset & temp coefficient)
Offsets shift slowly; dual-channel deltas widen; temperature dependence changes. Update compensation coefficients while preserving raw counts for traceability. Features: offset_est, temp_coeff_est, channel_delta_trend
Evidence fields: raw_counts, offset_est, temp_C, channel_delta, cal_version, sample_ts
11.2 Update pipeline (coefficient update + threshold governance)
- Data intake: event snapshots (pre/post) and periodic summaries (daily/weekly) with time sync health.
- Feature extraction: leak_rate_est, K_eff, offset_est, time_constant_tau, band_energy, refill frequency.
- Quality gate: require sample count, stable time sync, anomaly filtering, and valid calibration tags.
- Coefficient update: update model parameters with confidence score and effective operating range.
- Threshold adaptation: only adjust alarms/gates in controlled ways; avoid relaxing safety boundaries without evidence and governance.
- Version control: every deployment is a new model_version and param_set_id, with rollback ID and apply timestamp.
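As one concrete instance of the feature extraction and quality gate steps, the sketch below estimates leak_rate_est as the least-squares slope of pressure over a hold window and refuses to produce a value when the window is too short or too sparse. It assumes the window is temperature-stable (otherwise thermal pressure change is mistaken for leakage); the gate limits and function name are hypothetical.

```python
from typing import Optional, Sequence


def leak_rate_est(
    t_s: Sequence[float],
    pressure_kPa: Sequence[float],
    min_samples: int = 20,    # hypothetical quality-gate limits
    min_hold_s: float = 60.0,
) -> Optional[float]:
    """Least-squares slope of pressure over a hold window in kPa/s
    (negative means decay). Returns None if the quality gate fails."""
    n = len(t_s)
    if n < min_samples or n != len(pressure_kPa):
        return None
    if t_s[-1] - t_s[0] < min_hold_s:
        return None

    t_mean = sum(t_s) / n
    p_mean = sum(pressure_kPa) / n
    sxx = sum((t - t_mean) ** 2 for t in t_s)
    if sxx == 0.0:
        return None
    sxy = sum((t - t_mean) * (p - p_mean) for t, p in zip(t_s, pressure_kPa))
    return sxy / sxx
```

Returning None instead of a low-confidence number keeps weak windows out of the coefficient update, which is the purpose of the quality gate.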
11.3 Example implementation blocks (MPNs)
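Field feedback requires trusted identity, tamper-evident logs, and stable storage. The example parts already listed in 10.1 are typical building blocks: a secure element for signing/identity (Microchip ATECC608B), SPI FRAM for the event log (Fujitsu MB85RS64V), and an RTC for timestamps (Microchip MCP79410).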
H2-12. FAQs (Troubleshooting, Evidence-Driven)
Each answer is structured as: 1-sentence conclusion + 2 evidence checks + 1 first fix. Evidence checks reference the same log fields used in H2-3 to H2-11, so each symptom can be classified quickly without scope creep.
Body height is occasionally low — leak or sensor drift?
Conclusion
Classify it as a leak if pressure decays during a hold window; otherwise prioritize sensor offset drift and compensation errors.
Evidence checks (2)
1) Check pressure_decay_rate over hold_time_s and trend fill_cmd_count/leak_score (leak signature). MPN examples: isolated measurement AMC1311 / AMC1306M25; ADC ADS131M04.
2) Check raw_counts and offset_est drift vs temp_C, plus cal_version/param_set_id changes (sensor/model signature).
First fix
Run a controlled hold test and inspect seals, fittings, and pneumatic lines before re-calibrating or changing thresholds.
At train start, height oscillates violently — overshoot or slow valve response?
Conclusion
If valve response latency is stable and short, treat it as control overshoot/hunting; if latency varies or is long, treat it as actuator/driver timing first.
Evidence checks (2)
1) Measure cmd_ts → response_ts and its jitter under the same supply/temperature; compare to toggle_count and driver diagnostics (oc/open/short flags).
2) Check overshoot_mm, settling_time_s, and whether motion_gate_flag is applied during launch transients (control gating/strategy evidence).
First fix
Apply a startup gating window and pulse limiting (reduce duty/toggle) before retuning PID gains.
Acceleration logs look abnormal — sensor mounting or EMI?
Conclusion
If abnormal energy clusters in a narrow mechanical band, suspect mounting; if spikes correlate with switching/communications errors, suspect EMI injection.
Evidence checks (2)
1) Compare band_energy distribution (low/mid/high) across repeated runs and mounting points; check repeatability across the same track segment.
2) Correlate accel spikes with valve_cmd, crc_error_count, or time_sync_status changes (EMI/timebase evidence). MPN examples: ADXL355 (accel); ISO7741 / ADuM141E (isolation).
First fix
Re-seat and torque the sensor mount, then improve EMI grounding/shield termination and cable routing before changing trigger thresholds.
Valves actuate too frequently — PID tuning or noisy height signal?
Conclusion
If raw height is noisy but filtered height is stable, fix filtering/anti-chatter first; otherwise tune PID and gating for dynamic conditions.
Evidence checks (2)
1) Compare height_mm_raw vs height_mm_filt and compute short-window noise metrics; verify sampling timestamps and cable/common-mode susceptibility.
2) Check toggle_count alongside i_state accumulation and whether a deadband / minimum pulse width policy exists (control policy evidence).
First fix
Introduce/verify a deadband and minimum on/off time, then validate sensor filtering before adjusting PID gains.
More alarms on rainy days — connector issue or pressure sensor wet drift?
Conclusion
If faults coincide with comm/isolation health degradation, prioritize connector sealing and leakage paths; otherwise evaluate sensor offset drift vs humidity/temperature.
Evidence checks (2)
1) Check crc_error_count, link_down_count, and isolation-related diagnostics around rain exposure; look for simultaneous common-mode disturbance patterns.
2) Check pressure offset trend (offset_est) vs temp_C and compare pre/post exposure windows; verify no sudden cal_version changes.
First fix
Improve connector sealing, drainage, and insulation cleaning first, then re-check sensor offset stability before retuning thresholds.
Communications drop but local control is fine — isolator problem or ground loop?
Conclusion
If link drops occur without local resets, suspect isolation/PHY supply and ground reference issues; if drops coincide with transients, suspect ground-loop EMI paths.
Evidence checks (2)
1) Compare link_down_count and crc_error_count with supply_v and reset_cause (is the controller stable while the link fails?). MPN examples: ISO7741 / ADuM141E.
2) Check correlation between comm dropouts and high dV/dt events (valve switching, surge/EFT exposure) to confirm a ground-loop or common-mode injection path.
First fix
Verify isolator-side power and reference routing, then correct shield termination and ground-loop paths before changing protocols or retry timers.
Height remains correct, but ride feels “stiffer” — rubber aging or model not updated?
Conclusion
If height tracking is nominal but dynamic response changes, suspect stiffness/hysteresis aging and update model coefficients using field evidence rather than altering target height.
Evidence checks (2)
1) Compare pressure–height response curves under matched operating windows to estimate K_eff and hysteresis changes over time.
2) Check vibration metrics (accel_rms, band distribution) and confirm parameter set integrity (model_version, param_set_id) to avoid mixing versions.
First fix
Update coefficient sets through the controlled update pipeline (quality-gated, versioned), and validate via a repeatable rig test before fleet deployment.
Leak alarms appear and disappear — thresholds too sensitive or detection window too short?
Conclusion
If leak indicators only trigger during short transients, the detection window/gating is likely wrong; if trends persist across holds, the leak is real.
Evidence checks (2)
1) Check whether leak_score is computed inside a stable hold_time_s window and whether motion is gated (motion_gate_flag).
2) Compare multi-day trend of pressure_decay_rate and fill_cmd_count to distinguish transient artifacts from real leakage growth.
First fix
Lengthen and stabilize the detection window (apply gating) and then re-run a controlled hold test before adjusting the leak threshold.
Valve driver sometimes reports “open” — harness contact or back-EMF/clamp issue?
Conclusion
If open faults coincide with vibration/connector movement, suspect harness contact; if they coincide with switching spikes, suspect clamp path/layout and back-EMF handling.
Evidence checks (2)
1) Check open_flag vs vibration level and connector state; confirm coil current proxy (if available) and intermittent resistance signatures.
2) Check whether faults correlate with surge/EFT exposure and common-mode disturbances; inspect TVS/clamp return path and ground reference. MPN examples: DRV103 (solenoid driver class); SMCJ58A (TVS class).
First fix
Fix connector retention and strain relief first, then improve clamp/TVS placement and return path before widening diagnostics thresholds.
Height shifts with temperature — wrong compensation or sensor self-heating/installation?
Conclusion
If offset tracks temperature predictably, compensation coefficients are likely wrong; if offset changes stepwise with power states, suspect self-heating or installation stress.
Evidence checks (2)
1) Trend offset_est and temp_coeff_est vs temp_C across controlled temperature ramps; confirm stability of cal_version.
2) Correlate drift with duty cycles, supply changes, and mounting constraints (step-like behavior indicates local heating or mechanical stress).
First fix
Correct the temperature compensation using a repeatable calibration sweep, then validate with a temperature cycle test before field rollout.
Timestamps in logs look unreliable — RTC drift or time sync chain instability?
Conclusion
If time sync health flags drop during comm disturbances, the sync chain is unstable; if health is stable but time drifts, the local clock/RTC is drifting.
Evidence checks (2)
1) Check time_sync_status and sync event counters near anomalies; correlate with crc_error_count and link drops.
2) Compare local time drift against known references across temperature bins; ensure logs always include model_version and param_set_id for traceability.
First fix
Stabilize the time sync path (isolation/grounding) first; if drift persists with good sync health, upgrade/compensate the clock source and re-validate.
Lab validation passed, but line issues are frequent — EMC injection gap or missing gating strategy?
Conclusion
If failures correlate with rail transient exposure or cabling layout, the EMC injection path was not covered; otherwise control gating for real operations is missing.
Evidence checks (2)
1) Compare failure times with transient exposure markers (surge/EFT/lightning environment proxies) and observe comm/diagnostic counters for common-mode disturbance patterns.
2) Check whether motion_gate_flag and startup/stop compensation are active on line; compare overshoot/settling against rig results to detect missing gating.
First fix
Expand validation to include EMC injection paths and cable harness reality, then add gating windows and pulse limiting before retuning PID.