Space Memory & Interconnect for Spacecraft Electronics
Space Memory & Interconnect is about keeping spacecraft data correct and links deterministic under radiation and long mission life. It combines EDAC/ECC with scrubbing, proven link margins (BER/CRC), and logged, testable degrade/switch policies so “reliability” can be verified and replayed.
What this block owns: spacecraft data memory + deterministic links
This chapter sets the scope boundary and the engineering “proof loop” for reliable spacecraft data storage and transport under radiation-driven soft errors and long mission lifetimes.
Ownership What this page covers
Rad-hard memory control (SDRAM/DDR) with EDAC/ECC, background scrubbing, and deterministic interconnects (SpaceWire / SpaceFibre / LVDS) including physical-layer margining.
Core deliverable
A design that stays correct, observable, and provable
Reality check “Average bandwidth” is not a guarantee
Real-time loss events are typically triggered by worst-case service gaps: arbitration queues, refresh windows, and scrub reads/writes can create long tail latency even when average throughput looks sufficient.
- Design to a maximum inter-service gap, not just MB/s.
- Budget and verify worst-case service time (WCST).
Proof loop Reliability triad + evidence
Flight readiness is built from three cooperating mechanisms: ECC (local correction), scrubbing (lifetime maintenance), and link integrity (BER/CRC/FEC), all closed by telemetry counters and event logs that prove behavior in test and in flight.
- Correctable vs uncorrectable errors must be counted and acted on.
- Links must meet BER targets with margin across temperature and aging.
Typical data path (where reliability lives)
A common spacecraft payload path is: Sensor/FPGA → Memory controller (EDAC/ECC) → Interconnect PHY (SpaceWire/SpaceFibre/LVDS) → Payload computer / recorder. The critical engineering question is not “does it pass data once,” but whether the system remains correct and diagnosable after years of accumulated soft-error exposure and environmental drift.
| This page owns | Why it matters | How it is proven |
|---|---|---|
| EDAC/ECC datapath (syndrome, correction, flags) | Prevents silent corruption during reads; classifies error severity | Error injection + counter/log validation; UE handling checks |
| Scrubbing policy (quota, priority, escalation) | Prevents multi-bit accumulation beyond ECC capability | CE-rate sweep; WCST under load; long-run stability tests |
| PHY integrity (BER/CRC/FEC, margin) | Ensures correct delivery over temperature, harness, aging | BER tests (PRBS), voltage/temp sweep, stress patterns |
| Observability (counters, events, timestamps) | Makes reliability auditable and actionable in flight | Log completeness review; reproducible fault playbooks |
Out-of-scope details (not expanded here): TT&C waveform/protocol stacks, spacecraft bus power conversion, and system-level crypto/key lifecycle.
Requirements decomposition: bandwidth, latency, determinism, lifetime, radiation
This chapter turns “it should be reliable” into measurable targets: throughput, worst-case service gaps, error budgets, margins, and acceptance evidence.
Step 1 — Separate sustained rate from burst behavior
Two numbers are required for each producer: sustained average rate (long-run MB/s) and burst demand (bytes that must be serviced within a bounded window). Burst behavior drives FIFO depth and arbitration pressure; sustained rate drives steady-state utilization and thermal/aging stress.
| Producer | Average rate | Burst size | Burst period | Max allowed service gap |
|---|---|---|---|---|
| Payload stream A | _____ MB/s | _____ MB | _____ ms | _____ ms |
| Payload stream B | _____ MB/s | _____ MB | _____ ms | _____ ms |
| Housekeeping / logs | _____ KB/s | _____ KB | _____ s | _____ ms |
The “max allowed service gap” is the key determinism target: it bounds how long a stream can wait between successful memory service windows without loss.
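The sizing rule implied by this table can be sketched numerically: a producer FIFO must absorb its burst plus whatever accumulates during the worst allowed service gap. A minimal illustration (the guardband factor and all example values are placeholders, not flight numbers):

```python
def fifo_depth_bytes(avg_rate_mbps: float, burst_mb: float,
                     max_gap_ms: float, margin: float = 1.5) -> int:
    """Size a producer FIFO from burst demand plus the data that
    accumulates during the worst allowed service gap (sketch only)."""
    # bytes that arrive while the memory system is not servicing this flow
    accumulation = avg_rate_mbps * 1e6 * (max_gap_ms / 1e3)
    # burst bytes + gap accumulation, with an assumed 1.5x guardband
    return int((burst_mb * 1e6 + accumulation) * margin)
```

For example, a 100 MB/s stream with 2 MB bursts and a 5 ms allowed gap yields a 3.75 MB FIFO under these assumptions; the guardband is a program-policy choice, not a derived constant.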
Step 2 — Budget the worst-case service time (WCST), not just average bandwidth
A memory system can meet average MB/s while still violating real-time constraints when it periodically stops servicing a flow. The worst-case gap typically stacks from: (a) arbitration queueing, (b) refresh windows, and (c) scrub insertion. A flight-oriented requirement should explicitly cap the combined effect.
Determinism requirement (example form)
Max inter-service gap ≤ Stream requirement
- Arbitration: competing sources can create long tail waiting even if mean service is fast.
- Refresh: periodic command slots reduce available service windows and create structured gaps.
- Scrub: lifetime maintenance consumes bandwidth; policy must guarantee bounded interference.
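The stacked-gap requirement above can be written as a simple budget check. A sketch assuming the three contributors land back-to-back (a pessimistic but auditable model; the microsecond values in the usage example are illustrative):

```python
def worst_case_gap_us(arb_backlog_us: float, refresh_window_us: float,
                      scrub_insert_us: float) -> float:
    """Pessimistic stack: assume the arbitration backlog, a refresh
    window, and one scrub insertion all occur back-to-back."""
    return arb_backlog_us + refresh_window_us + scrub_insert_us

def gap_budget_ok(gap_us: float, stream_requirement_us: float) -> bool:
    """Determinism requirement: max inter-service gap <= stream limit."""
    return gap_us <= stream_requirement_us
```

With, say, a 40 µs arbitration backlog, a 7.8 µs refresh window, and a 12 µs scrub insertion, the stacked gap is 59.8 µs, which passes a 100 µs stream requirement but fails a 50 µs one.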
Step 3 — Define reliability targets as observable metrics (CE/UE + link errors)
Reliability requirements must be written in terms of counters, thresholds, and actions—so they can be verified in test and audited in flight. This keeps ECC and link-integrity features from becoming “checkbox” items without operational meaning.
| Metric | What it proves | Suggested evidence |
|---|---|---|
| Correctable errors (CE) rate | Soft-error environment + scrub effectiveness | CE histogram vs temperature; scrub-rate sweep; stable long-run trend |
| Uncorrectable errors (UE) | Fault containment works and is observable | UE injection tests; region retire / degrade path; event logs with context |
| CRC / FEC / link event counts | Interconnect integrity under drift | BER tests (PRBS); temp/voltage sweep; margin reports |
| Worst-case latency / gap | Determinism holds under interference | Stress run with refresh+scrub enabled at max load; WCST measurement |
Step 4 — Constrain temperature/aging effects via explicit margins
Environmental drift should be treated as a margin requirement, not a post-hoc surprise. For memory, this is a timing margin problem; for interconnect, this is an eye/BER margin problem. Requirements should define how margin is demonstrated (sweeps, long runs, and stress patterns).
- Memory margin: timing closure must remain valid across temperature corners and supply variation.
- Link margin: BER targets must hold with measured margin over harness/connector variability and drift.
Radiation is treated here as a soft-error driver affecting observable CE/UE behavior and required scrub policy, without expanding into detector physics.
Memory choices in space: SDRAM/DDR in a rad-hard context (what changes)
“SDRAM/DDR” in spacecraft design is not just a part number choice. The flight difference is how the memory behaves across temperature and aging, how errors are contained over mission life, and how the design proves correctness with counters, logs, and repeatable tests.
Engineering outcomes What “space-grade” changes in practice
Selection criteria are dominated by predictability and traceable evidence, not peak throughput. Device and lot variation can shift timing margin and soft-error behavior, so the system must remain stable with guardbands and observable error metrics.
- Timing margin must hold at temperature corners and supply variation.
- Error behavior must be bounded using ECC + scrubbing + containment.
- Traceability (lot, screening, configuration) supports repeatable qualification.
Do not confuse Component “hardness” vs system “proof”
Flight reliability is rarely achieved by a single “more robust” memory device. It comes from a controller architecture that (1) corrects data on read, (2) prevents multi-bit accumulation via scrubbing, and (3) exposes telemetry and events so behavior is provable.
A practical rule
If it cannot be measured in counters and logs, it cannot be qualified.
Selection checklist (device + integration) — focus on consequences
The items below are written as engineering consequences rather than marketing categories. Each item should map to an acceptance test, a margin report, or a configuration control record.
| Decision point | Why it changes in space | What to prove |
|---|---|---|
| Temperature range & drift | Timing closure and read/write windows shrink at corners; long mission life magnifies drift sensitivity. | Corner characterization; guardbanded timing; stable operation with worst-case patterns. |
| Lot/batch consistency | Variation shifts timing and soft-error behavior; qualification must remain repeatable across procurement lots. | Lot traceability; re-validation plan; configurable scrub and thresholds. |
| Packaging & interconnect parasitics | Parasitics alter edge rate and signal integrity; stability depends on real margins, not nominal models. | Board SI checks; margin tests; stable read training/guardbands (if used). |
| ECC support model | ECC must align with the datapath width and system fault model; it is part of the architecture. | Fault injection; CE/UE classification; no silent corruption under injected faults. |
| Scrubbing capability | Without scrubbing, correctable errors can accumulate into uncorrectable events over long exposures. | Scrub-rate sweep; bounded interference (WCST); stable long-run CE trend. |
This chapter stays at the memory-control level. It does not expand into spacecraft power architecture, TT&C modulation, or system-level crypto lifecycle.
Why controller strategy matters: writes, refresh, self-refresh (risk points)
Refresh is not the main problem by itself. The risk comes from stacked interference—refresh windows combined with arbitration queueing and scrubbing insertion can create long service gaps and fragile corner behavior if the policy is not explicitly bounded and tested.
- Write/read turn-around can create tail latency spikes under mixed traffic.
- Self-refresh entry/exit requires deterministic criteria and recovery validation.
- Refresh + scrub overlap must be scheduled to avoid unbounded inter-service gaps.
- Evidence requirement: a measured WCST and logged error trends under stress runs.
Controller micro-architecture: scheduler, timing closure, refresh, buffering
A stable flight memory controller is designed around bounded latency and provable behavior. The goal is to prevent unbounded queueing delays while maintaining data integrity (ECC) and lifetime maintenance (scrubbing) without breaking real-time streams.
Scheduling Throughput vs determinism (the real trade)
Micro-architecture choices such as open-page vs close-page, bank interleaving, and read/write turn-around have a direct impact on tail latency. A flight-oriented scheduler prioritizes bounded per-flow delay over peak benchmarks.
- Open-page: higher peak, but can amplify latency spikes on mixed/random traffic.
- Close-page: more predictable service time under uncertainty.
- Turn-around: treat read↔write switches as a first-class latency budget item.
Timing closure Margin as a requirement, not an assumption
DDR timing is only “closed” if it remains valid across corners. Stability comes from guardbanded constraints, measurable margins, and a verification plan that exercises worst-case patterns under temperature and supply variation.
- Guardband critical timings (tRCD/tRP/tRAS) for corners and drift.
- Validate with stress patterns and long soak runs, not only nominal vectors.
- Link errors and memory errors must be correlated to conditions in logs.
Buffering & backpressure: preventing overflow without hiding failures
FIFO depth and backpressure protocols are not cosmetic implementation details—they define which stream fails first under contention. The design should explicitly size buffers from burst demand and enforce backpressure semantics that preserve determinism for critical flows.
| Design element | What can go wrong | What to enforce |
|---|---|---|
| Request queue depth | Bursts overflow; arbitration never catches up; dropped frames appear “random”. | Derive from burst bytes and max inter-service gap; measure occupancy under stress. |
| Per-flow isolation | Critical streams starve under best-effort traffic; tail latency explodes. | QoS classes, weights, or per-flow queues; cap worst-case delay per class. |
| Backpressure protocol | Hidden congestion causes silent loss; “works in lab” but fails at corner loads. | Explicit ready/valid or credit semantics; log drops, retries, and saturation. |
The acceptance target is not “no congestion ever,” but predictable congestion with bounded impact and observable signals.
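One way to make backpressure explicit is credit-based flow control: the receiver grants credits for free buffer slots, the sender spends one per transfer, and saturation is counted rather than hidden. A minimal sketch (the class and counter names are illustrative, not a defined interface):

```python
class CreditLink:
    """Credit-based backpressure sketch: transmission is allowed only
    while credits remain, so congestion becomes an observable event
    instead of silent loss."""
    def __init__(self, credits: int):
        self.credits = credits
        self.saturations = 0   # observable counter for blocked sends

    def try_send(self) -> bool:
        if self.credits > 0:
            self.credits -= 1
            return True
        self.saturations += 1  # log saturation instead of dropping silently
        return False

    def on_receiver_free(self, n: int = 1) -> None:
        """Receiver returns credits as buffer slots drain."""
        self.credits += n
```

The design point is that a blocked send is a logged, countable condition, which is exactly the observability requirement in the table above.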
Refresh vs scrub arbitration: guaranteeing WCST while maintaining integrity
Refresh and scrubbing both consume memory service slots, but they exist for different reasons: refresh preserves electrical correctness while scrub preserves long-term correctness in the presence of soft errors. A flight controller must implement a policy that bounds interference and provides telemetry that proves the policy is working.
- Policy knobs: scrub quota, priority levels, escalation triggers based on CE rate.
- Determinism constraint: cap the maximum injected work between service opportunities for real-time flows.
- Evidence: measured WCST under max load with refresh+scrub enabled; logs that show counters and state transitions.
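The scrub-quota knob maps naturally onto a token bucket. A sketch under the assumption that the controller evaluates the quota once per scheduling window (rate and burst values are placeholders):

```python
class ScrubQuota:
    """Token-bucket quota for scrub insertions: at most `burst` scrub
    operations back-to-back, refilled at `rate` tokens per window tick,
    so scrub interference between real-time service slots stays bounded."""
    def __init__(self, rate: int, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = burst

    def tick(self) -> None:
        """Called once per scheduling window to refill (capped at burst)."""
        self.tokens = min(self.burst, self.tokens + self.rate)

    def may_scrub(self) -> bool:
        """Consume one token if available; otherwise defer the scrub."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False
```

Bounding tokens by `burst` is what caps the worst-case injected work between service opportunities; the refill `rate` sets the sustained scrub coverage.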
Acceptance evidence (what proves “stable”) — minimal, testable set
- WCST measurement: worst-case inter-service gap remains below stream requirement under stress conditions.
- Fault injection: injected single-bit and multi-bit scenarios are classified correctly (CE/UE) with no silent corruption.
- Corner stability: temperature/supply sweeps with worst-case patterns do not produce unexplained resets or drift.
- Telemetry completeness: counters, timestamps, and events allow post-test and in-flight auditing.
ECC/EDAC fundamentals for flight memory: what to implement and why
ECC selection should be written as an engineering contract: what fault patterns are corrected or detected, what latency and bandwidth cost is paid, and what evidence proves “no silent corruption” under stress and fault injection.
Code choice When SECDED is enough vs when to go stronger
SECDED (single-error correct, double-error detect) is the default baseline for many flight designs because it provides strong protection against common single-bit soft errors with manageable overhead. Stronger schemes become worth paying for when errors show correlation or when the mission cannot tolerate frequent recovery actions triggered by detections.
- SECDED: best default when single-bit errors dominate and recovery on rare detections is acceptable.
- DECTED / stronger detection: consider when double-bit events become non-negligible or when deterministic recovery is required.
- Chipkill-class: worth considering when a single device fault must be contained without turning into a system-level failure (kept concept-level here).
Cost model What ECC “costs” in a flight budget
ECC cost is not only parity bits. It includes encode/decode latency, correction mux timing, bandwidth overhead, and the observability infrastructure needed for qualification. The budget should include both steady-state and worst-case paths (including correction events).
| Cost item | Why it matters | How to bound it |
|---|---|---|
| Parity overhead | Reduces usable bandwidth and impacts bus width planning. | Document data/parity mapping; verify throughput under worst-case traffic. |
| Encode/decode latency | Changes worst-case service time and tail latency. | Measure correction path timing; include in WCST budget. |
| Correction events | Can perturb service slots under heavy CE rates. | Stress test at elevated CE rates; confirm bounded behavior. |
| Telemetry/logging | Required for qualification and in-flight auditing. | Define counters + event schema; verify completeness. |
Datapath placement: where EDAC must live
A flight EDAC implementation is defined by placement in the datapath: encode on writes and decode/correct on reads, with a syndrome computation that feeds both the correction mux and the observability plane (flags, counters, and event logs).
- Write path: data → encode → (data + parity) → memory
- Read path: (data + parity) → syndrome → classify → correct mux → data out
- Observability: per-type counters + timestamped events with address context
The goal is to ensure that every corrected or detected anomaly becomes a measurable, auditable record rather than an invisible internal detail.
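The read-path classification can be illustrated with a toy extended-Hamming SECDED code over a 4-bit word. Flight designs use wider codewords (e.g. 64 data + 8 check bits), but the syndrome/overall-parity logic has the same shape: syndrome zero and parity clean means OK, odd overall parity means a correctable single-bit error, non-zero syndrome with even parity means a detected double-bit error.

```python
def secded_encode(data4: int) -> int:
    """Extended Hamming(8,4) encoder (sketch). Codeword bits 0..7:
    bit 0 is overall parity, bits 1/2/4 are Hamming parity,
    bits 3/5/6/7 carry the data."""
    cw = [0] * 8
    cw[3], cw[5], cw[6], cw[7] = [(data4 >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        for i in range(1, 8):
            if i != p and (i & p):
                cw[p] ^= cw[i]
    for i in range(1, 8):           # overall parity over bits 1..7
        cw[0] ^= cw[i]
    return sum(b << i for i, b in enumerate(cw))

def secded_decode(code: int):
    """Return (data4, status) with status in {'OK', 'CE', 'DE'}."""
    cw = [(code >> i) & 1 for i in range(8)]
    syndrome = 0
    for p in (1, 2, 4):
        s = 0
        for i in range(1, 8):
            if i & p:
                s ^= cw[i]
        if s:
            syndrome |= p
    overall = 0
    for b in cw:
        overall ^= b
    if syndrome == 0 and overall == 0:
        status = 'OK'
    elif overall == 1:              # odd-weight error: single bit, correctable
        cw[syndrome] ^= 1           # syndrome 0 here means bit 0 itself
        status = 'CE'
    else:                           # even-weight error: double bit, detect only
        status = 'DE'
    data = cw[3] | (cw[5] << 1) | (cw[6] << 2) | (cw[7] << 3)
    return data, status
```

Here the returned status is exactly the per-type counter input: every decode yields an auditable CE/DE/OK classification rather than a silent correction.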
Handling strategy: Correctable vs Detectable vs Uncorrectable (minimal action set)
Error classes must map to explicit actions. The “minimal action set” below keeps behavior deterministic, enables qualification, and avoids silent corruption.
| Class | Minimum action | Must log |
|---|---|---|
| CE | Correct on read; increment counters; track hotspots; optionally raise scrub priority for the region. | Address/region ID, timestamp, syndrome summary, corrected-bit count, channel/port, running CE rate. |
| DE | Detect-only: block unsafe data use; trigger recovery action (re-read / re-fetch / isolate) per system policy. | Address/region ID, timestamp, syndrome summary, action taken, retry outcome, associated stream/context. |
| UE | Escalate immediately: isolate or retire the fault domain (page/bank/region); enter a known degraded-safe path. | Fault domain ID, timestamp, UE count, trigger context, retire/remap result, post-action verification status. |
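The minimal action set above can be wired as a small dispatch that always updates counters and emits a structured event before returning an action. A sketch (field names and action strings are illustrative, not a defined interface):

```python
from dataclasses import dataclass, field

@dataclass
class ErrorTelemetry:
    """Minimal action-set sketch: every classified error increments a
    counter and emits a timestamped event, so behavior stays auditable."""
    ce_count: int = 0
    de_count: int = 0
    ue_count: int = 0
    events: list = field(default_factory=list)

    def handle(self, cls: str, region: int, t_us: int) -> str:
        # event first: no error class is allowed to be invisible
        self.events.append({"class": cls, "region": region, "t_us": t_us})
        if cls == "CE":
            self.ce_count += 1
            return "correct_on_read"
        if cls == "DE":
            self.de_count += 1
            return "block_and_retry"
        self.ue_count += 1
        return "retire_region"
```

Logging before acting is the point: the "no silent corruption" requirement is checkable only if every classification leaves a record.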
Verification checklist (flight-oriented)
- Fault injection: single-bit and multi-bit injections are classified correctly; correction never produces silent wrong output.
- “No silent corruption”: detected errors always create events; corrected reads always increment the correct counters.
- Budget impact: encode/decode + correction path is included in the WCST and throughput measurements under stress loads.
- Log completeness: address context + timestamps exist so qualification and in-flight auditing are reproducible.
Scrubbing & fault containment: keeping ECC meaningful over mission life
ECC corrects errors when data is accessed. Scrubbing prevents correctable errors from accumulating into uncorrectable events over long exposure. The flight challenge is to control scrub intensity so integrity improves without breaking deterministic latency.
Why scrub exists Prevent multi-bit accumulation beyond ECC capability
Without scrubbing, latent correctable errors can remain in memory until multiple upsets land in the same codeword, turning what would have been corrected reads into uncorrectable events. Scrubbing periodically reads, corrects, and rewrites memory to keep the latent error inventory low.
- ECC: fixes the accessed word.
- Scrub: maintains the whole population of stored words.
- Containment: isolates regions that repeatedly misbehave.
Do not break RT Scrub must be a controlled background workload
Scrubbing consumes service slots and can create latency gaps if unmanaged. A flight implementation should cap interference using explicit quota and gating, and adjust intensity based on measured CE rate trends.
- Quota: limit scrub work per time window.
- Timeslice: run only in defined windows or when queue occupancy is safe.
- Threshold: increase intensity when CE rate rises; back off when stable.
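The threshold knob can be a small hysteresis controller over the measured CE rate. A sketch with placeholder thresholds (real values come from the radiation environment model and scrub-rate sweeps):

```python
def next_scrub_rate(ce_rate_per_hr: float, current_rate: int,
                    raise_thresh: float = 10.0, lower_thresh: float = 2.0,
                    min_rate: int = 1, max_rate: int = 8) -> int:
    """Hysteresis controller: double scrub intensity when the CE rate
    climbs above raise_thresh, halve it when the rate stays below
    lower_thresh, hold otherwise (all thresholds are illustrative)."""
    if ce_rate_per_hr > raise_thresh:
        return min(max_rate, current_rate * 2)
    if ce_rate_per_hr < lower_thresh:
        return max(min_rate, current_rate // 2)
    return current_rate       # dead band between thresholds avoids flapping
```

The gap between the two thresholds is deliberate: without a dead band the policy oscillates near a single threshold, which itself becomes a determinism risk.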
Scrub types (composable policies)
| Policy | When it helps | What to verify |
|---|---|---|
| Background / patrol | Baseline maintenance across all regions at a low, bounded rate. | CE trend stays controlled; WCST remains within budget under load. |
| On-idle | Accelerates maintenance during low traffic to reduce interference risk. | Gating logic is correct; no starvation of required patrol coverage. |
| Region-priority | Targets code/data/buffer regions differently based on criticality and error history. | Hotspots are detected; elevated regions are treated without global disruption. |
Fault containment: when to retire a page/bank/region (concept-level)
Fault containment turns repeated anomalies into bounded, diagnosable actions. The design should define a fault domain granularity and a trigger set that is measurable from counters and events.
- Hotspot trigger: CE clustering at specific addresses/regions exceeds a threshold over a window.
- Escalation trigger: repeated detectable events or any uncorrectable event in the same region.
- Action: retire or remap the region; verify post-action behavior with follow-up reads and counters.
Containment is kept at the memory-controller level here. System-level data recovery policy is intentionally not expanded.
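The hotspot trigger can be prototyped as a sliding-window counter per region. A sketch (window length and threshold values are illustrative):

```python
from collections import deque

class HotspotDetector:
    """Sliding-window CE clustering sketch: a region becomes a retire/
    priority-scrub candidate when its CE count within the window
    reaches the threshold."""
    def __init__(self, window_us: int, threshold: int):
        self.window_us, self.threshold = window_us, threshold
        self.hits: dict[int, deque] = {}   # region -> CE timestamps

    def record_ce(self, region: int, t_us: int) -> bool:
        q = self.hits.setdefault(region, deque())
        q.append(t_us)
        while q and t_us - q[0] > self.window_us:
            q.popleft()                    # drop CEs outside the window
        return len(q) >= self.threshold    # True: escalate this region
```

Because the decision is a pure function of logged timestamps, the same trigger can be replayed offline from event records, which is what makes the containment policy auditable.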
Acceptance evidence (minimal but sufficient)
- WCST under scrub: worst-case inter-service gap stays below real-time requirements with scrub enabled.
- CE-controlled trend: long-run tests show scrub reduces accumulated CE inventory and prevents UE growth.
- Hotspot behavior: induced hotspots trigger priority scrub and/or retire events with correct logs.
- State audit: state transitions and actions are visible in event records.
Interconnect selection: SpaceWire vs SpaceFibre vs LVDS (engineering tradeoffs)
Interconnect selection should be expressed as a contract: the determinism model, the topology that must be supported, the diagnostic visibility that will be needed in flight, and the complexity cost of implementing redundancy without hidden failure modes.
SpaceWire Mature networks with predictable behavior and good diagnostics
SpaceWire is often chosen for its well-understood network behavior and operational visibility. It supports routing and multi-node topologies that are practical for distributed avionics and payload subsystems, but sustained high-rate payload streaming can expose throughput ceilings and tail-latency growth under contention.
- Strength: deterministic-by-design network patterns; straightforward fault localization.
- Constraint: bandwidth ceiling and contention-driven tail latency near saturation.
- Reliability hooks: CRC and retry concepts support traceable error accounting (kept concept-level).
SpaceFibre High-rate trunks with QoS-style isolation for payload streams
SpaceFibre targets high throughput while providing mechanisms that help isolate critical flows from background traffic. For payload recording and high-rate sensor streams, the key value is not only speed, but the ability to maintain service expectations when multiple streams compete.
- Strength: high-rate backbone; flow isolation concepts (virtual channels / QoS) for stream control.
- Constraint: higher implementation and qualification complexity; multi-lane alignment can be a risk.
- Operational focus: budget and prove margin with BER data and logged events.
LVDS Simple point-to-point links with low protocol overhead (but hidden system costs)
LVDS is a good fit when a design needs simple point-to-point paths with minimal protocol overhead. The tradeoff is that scaling and diagnosing a larger system becomes an engineering task: additional wiring, explicit test modes, and instrumentation are required to keep field support and fault isolation practical.
- Strength: low overhead, fixed topology, clean timing assumptions for point-to-point paths.
- Constraint: expansion requires more ports/wires; diagnostics must be designed in.
- Recommended hooks: loopback, counters, link-health registers, and defined maintenance scripts.
Selection axes A decision table that can be used in reviews
| Axis | What is being optimized | Typical fit (concept-level) |
|---|---|---|
| Bandwidth | High-rate payload streams vs moderate-rate distributed traffic. | SpaceFibre (high-rate trunk) / SpaceWire (moderate, shared network) / LVDS (fixed P2P) |
| Determinism | Worst-case service time and tail latency under contention. | SpaceWire patterns are predictable; SpaceFibre needs QoS discipline; LVDS relies on fixed topology |
| Topology | Multi-node networks, routing, backbone distribution, or pure fanout. | SpaceWire (network) / SpaceFibre (trunk + endpoints) / LVDS (P2P fanout) |
| Diagnostics | Fault localization, counters, event logs, and operational scripts. | SpaceWire/SpaceFibre usually provide stronger link-level visibility; LVDS needs explicit hooks |
| Redundancy | How hard it is to implement and validate dual paths and failover. | Networks require careful failover rules; P2P requires duplicated wiring and clear switchover logic |
The “best” choice is the one that makes determinism, diagnostics, and redundancy easiest to prove for the intended data path.
PHY/link integrity: encoding, CDR, skew, termination, BER and margining
“Works in the lab” often fails in flight due to temperature drift, harness variation, reflections, crosstalk, and aging. Link integrity engineering converts these risks into a measurable margin stack and a qualification script with clear pass/fail evidence.
Physical quantities What eats margin in real harnesses
Link robustness is governed by a small set of measurable quantities. These should be tracked as a margin stack rather than treated as isolated “signal integrity notes”.
- Jitter: timing uncertainty that closes sampling/decision windows.
- Eye opening: combined result of noise, loss, reflections, and crosstalk.
- Channel loss: attenuation and frequency-dependent behavior of harness + connectors.
- Reflections: termination and impedance discontinuities that distort edges.
- Crosstalk: coupling that becomes pattern-sensitive in bundled pairs.
SerDes/CDR Lock range, jitter transfer, and multi-lane skew
For SerDes-style links, robustness depends on the receiver’s ability to maintain lock and on how jitter is transferred through the clock-data recovery path. For multi-lane operation, lane-to-lane skew and alignment windows become a primary qualification risk.
- CDR lock: confirm lock acquisition and stability across temperature and voltage.
- Jitter transfer: ensure the combined Tx + channel spectrum does not overwhelm Rx tolerance.
- Lane skew: verify alignment margin and alarm behavior under worst harness variation.
Termination Reflections and coupling boundaries (SI-only)
Termination and impedance control decide whether energy is absorbed or reflected. The engineering goal is not to “follow a rule”, but to ensure reflections do not collapse eye margin under the worst pattern, temperature, and harness tolerance.
- Differential termination: treat placement and value as part of the margin stack.
- AC/DC coupling boundary: verify baseline behavior does not trigger pattern-dependent failures.
- Harness variation: connectors and layout discontinuities must be included in qualification samples.
BER proof PRBS + stress patterns + environment sweeps
BER evidence should be collected with controlled patterns and stress conditions, and recorded with enough context to reproduce results during qualification and in-flight investigations.
- Patterns: PRBS + stress patterns that maximize transitions and coupling sensitivity.
- Sweeps: temperature and voltage sweeps; long-run soak to expose rare events.
- Records: BER counters, lock-loss events, retrain counts, temperature/voltage tags, timestamps.
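BER targets imply minimum test lengths. For an error-free PRBS run of N bits, the confidence that the true BER is below a target follows CL = 1 − e^(−N·BER), so demonstrating a target at 95% confidence requires roughly 3/BER error-free bits. A small helper for campaign planning:

```python
import math

def bits_needed(ber_target: float, confidence: float = 0.95) -> float:
    """Error-free bits required to claim BER < ber_target at the given
    confidence level (solve CL = 1 - exp(-N * BER) for N)."""
    return -math.log(1.0 - confidence) / ber_target

def demonstrated_ber(bits_tested: float, confidence: float = 0.95) -> float:
    """Upper bound on BER demonstrated by an error-free run of N bits."""
    return -math.log(1.0 - confidence) / bits_tested
```

Proving BER < 1e-12 at 95% confidence therefore needs about 3×10^12 error-free bits (roughly 50 minutes at 1 Gb/s). Runs with observed errors need a longer campaign or an exact binomial interval, which is not sketched here.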
Engineering closure: turn integrity into a margin contract
A robust link is defined by a bounded margin stack and a qualification script that proves the remaining margin stays positive under worst-case temperature, harness tolerance, and aging assumptions.
| Step | What is measured | Evidence output |
|---|---|---|
| Build a margin stack | Tx jitter, channel loss/reflect/crosstalk, Rx tolerance/sensitivity, temp/aging derates. | “Remaining margin” summary for each harness/temperature corner. |
| Run BER campaigns | PRBS + stress patterns with sweeps and long-run soak. | BER counters + lock-loss + event logs with timestamps and conditions. |
| Define pass/fail | Target BER and allowed recovery events (lock-loss/retrain) per mission needs. | Clear pass/fail gates suitable for review sign-off. |
Board & harness implementation: layout rules that keep margins real
Robust links are built on repeatable physical margins, not optimistic lab setups. This section provides board-and-harness rules that are audit-ready (review checklists) and test-ready (BERT/loopback without breaking flight configuration).
Differential pairs Minimal rules that prevent most margin loss
- Reference plane continuity: route on a continuous return plane; avoid crossing plane splits or voids.
- Impedance discipline: keep pair geometry stable; avoid sudden neck-downs and uncontrolled stubs.
- Length + skew: match within each lane and between lanes only after the return path is correct.
- Via strategy: minimize vias; keep transitions symmetric; avoid via stubs and unused layer transitions.
- Return path: ensure a nearby, uninterrupted return path; do not force long return detours.
Priority order for reviews: return path → termination/stubs → vias → length matching.
Connectors & harness Where “works in the lab” often breaks
- Stub control: avoid long branch stubs; treat every branch as a potential reflection source.
- Connector variation: include connector + harness tolerance in qualification samples.
- Pair integrity: preserve pairing through the connector (no accidental pair reshuffling).
- Shield/ground (noted at concept level): changes in shield bonding can change coupling and eye margin; qualify representative builds.
Harness and connector choices should be treated as part of the link budget, not as “mechanical details”.
Test & observability Insert BERT/loopback without breaking flight mode
Qualification data must reflect the flight path. Test insertion should be designed as a controlled mode, not an ad-hoc lab hack.
- Loopback modes: define digital and PHY loopbacks (concept-level) with clear enable/disable controls.
- Non-invasive points: avoid test fixtures that alter termination or create extra stubs.
- Lockout for flight: ensure test modes are disabled and verifiable in flight configuration.
- Logged evidence: record counters, lock-loss events, temperature/voltage tags, timestamps.
Common pitfalls Failure patterns that are easy to miss in reviews
| Pitfall | Typical symptom | How to catch it |
|---|---|---|
| Termination misplaced | Pattern-sensitive errors; BER rises after harness changes or temperature sweeps. | Check termination location; verify with margin/BER runs using representative harness. |
| Plane split crossing | Intermittent lock-loss; “one board works, another fails”. | Review return path continuity per segment; forbid split crossings. |
| Long stub / branch | Short tests pass; long-run soak fails; failures cluster at certain patterns. | Enforce stub limits; audit branches at connectors and test headers. |
| Lane swap not documented | Debug mismatch; logs and scope points do not correlate to lane indices. | Require lane map documentation + silk/labels + bring-up checklist. |
Redundancy & determinism: dual links, cold/warm spare, graceful degradation
Redundancy should reduce mission risk without creating uncontrolled switching or unpredictable tail latency. The recommended approach is to define (1) redundancy form, (2) trigger windows and debouncing, (3) action priorities, and (4) determinism evidence.
Redundancy forms Pick the form that is easiest to prove
| Form | Best fit | Key proof focus |
|---|---|---|
| 1+1 warm/hot spare | Fast switchover with bounded interruption for critical streams. | Switch time bound, false-switch rate, post-switch stability. |
| A/B cold spare | Simpler isolation and lower steady-state complexity/power. | Re-initialization sequence, repeatable recovery, stable BER at corners. |
| Dual active (parallel) | Throughput scaling when aggregation is supported and verifiable. | Ordering/buffering bounds, congestion behavior, tail latency proof. |
The most reliable design is often the one with the simplest, testable failover behavior.
Triggers Use windows + debouncing to avoid flapping
Switching should not be triggered by single spikes. Triggers should be windowed and tied to logged evidence.
- UE event: immediate escalation to protective action (policy-defined).
- Link down / loss of lock: switch path with controlled re-sync and verification.
- BER over threshold: windowed decision; attempt graceful degradation first when permitted.
- Continuous CRC fail: debounce window; treat bursts differently from sustained failures.
Every trigger should carry: window length, threshold, cooldown, and log fields.
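The windowed-decision rule above can be sketched as a small counter with a cooldown. This is a minimal Python illustration under assumed parameters (class name, window length, threshold, and cooldown values are all hypothetical, not from any flight codebase):

```python
from collections import deque

class WindowedTrigger:
    """Debounced trigger: fires only when events-per-window exceed a
    threshold, then enters a cooldown so a burst cannot cause flapping.
    Window length, threshold, and cooldown are policy parameters."""

    def __init__(self, window_s, threshold, cooldown_s):
        self.window_s = window_s
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.events = deque()          # event timestamps inside the window
        self.cooldown_until = -1.0
        self.log = []                  # (t, action, count) for telemetry

    def record(self, t):
        """Record one error event at time t; return True if the trigger fires."""
        # Drop events that have aged out of the decision window.
        while self.events and t - self.events[0] > self.window_s:
            self.events.popleft()
        self.events.append(t)
        if t < self.cooldown_until:
            return False               # still cooling down: no re-trigger
        if len(self.events) >= self.threshold:
            self.cooldown_until = t + self.cooldown_s
            self.log.append((t, "trigger", len(self.events)))
            self.events.clear()
            return True
        return False

# Example: 5 CRC errors inside a 1 s window fire the trigger exactly once;
# the 10 s cooldown suppresses the immediate follow-up burst.
trig = WindowedTrigger(window_s=1.0, threshold=5, cooldown_s=10.0)
fired = [trig.record(0.1 * i) for i in range(12)]
```

Note how the logged tuple carries the fields the text requires (time, action, count); a flight implementation would add the window/threshold/cooldown values themselves to each record.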
Graceful degradation Reduce load before switching, when safe
When mission objectives allow, a controlled degradation path can recover margin without immediate topology changes.
- Reduce rate: lower link rate, frame rate, or stream resolution to restore margin.
- Switch path: change to the redundant link/network after recovery checks.
- Retire path: lock out unstable paths to prevent oscillation; require explicit re-qualification.
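The three-step ladder above (reduce, switch, retire) can be modeled as an ordered escalation with a one-way lockout. A minimal sketch, assuming a simple FDIR-style policy object (all names are illustrative):

```python
from enum import Enum, auto

class Action(Enum):
    REDUCE_RATE = auto()
    SWITCH_PATH = auto()
    RETIRE_PATH = auto()

class DegradePolicy:
    """Ordered degradation ladder: try the least disruptive action first.
    A retired path stays locked out until explicitly re-qualified."""
    LADDER = [Action.REDUCE_RATE, Action.SWITCH_PATH, Action.RETIRE_PATH]

    def __init__(self):
        self.step = 0
        self.retired = False

    def next_action(self):
        if self.retired:
            return None                      # locked out; needs re-qualification
        action = self.LADDER[self.step]
        if action is Action.RETIRE_PATH:
            self.retired = True              # one-way until re-qualified
        self.step = min(self.step + 1, len(self.LADDER) - 1)
        return action

    def requalify(self):
        """Explicit decision (ground or FDIR) to return the path to service."""
        self.step = 0
        self.retired = False

p = DegradePolicy()
actions = [p.next_action() for _ in range(4)]
# Escalates REDUCE_RATE -> SWITCH_PATH -> RETIRE_PATH, then locks out (None).
```

The lockout-until-requalified step is what prevents the oscillation the text warns about: an unstable path cannot re-enter service on its own.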
Determinism What must stay bounded under congestion
- Worst-case service time: bounded “maximum wait” for critical streams.
- Tail latency: control of worst-percentile latency under contention.
- Switch interruption: maximum acceptable outage during failover/re-sync.
- Evidence: timestamped logs, counters, and repeatable stress tests.
Determinism is proven by bounded results at worst corners, not by average throughput numbers.
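The bounded quantities above come straight out of timestamped service logs. A minimal sketch of the two numbers a determinism gate should bound (worst gap and a tail percentile), assuming sorted service timestamps; the nearest-rank percentile method is one common choice:

```python
def service_gaps(timestamps):
    """Inter-service gaps from sorted service timestamps (seconds)."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

def tail_stats(gaps, percentile=99.0):
    """Worst-case gap and a tail percentile (nearest-rank): the two numbers
    a determinism gate should bound, rather than the mean."""
    ordered = sorted(gaps)
    rank = max(0, int(round(percentile / 100.0 * len(ordered))) - 1)
    return max(ordered), ordered[rank]

# Example: average rate looks fine, but one refresh+scrub collision
# produces a 7 s gap that only the worst-case number reveals.
ts = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]
gaps = service_gaps(ts)            # [1.0, 1.0, 1.0, 7.0, 1.0]
wcst, p99 = tail_stats(gaps)
```

The mean gap here is 2.2 s, which is exactly the "average throughput" trap: only `wcst` shows whether the critical-stream bound holds.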
Validation plan: proving memory + link robustness (fault injection & radiation campaigns)
A credible acceptance plan turns “robust” into auditable evidence: repeatable stimuli, observable counters/logs, and pass/fail gates that hold at corners. This section is written as a Definition-of-Done checklist plus a test matrix that can be signed off by systems, QA, and program teams.
Definition of Done What “done” must prove (not just “runs once”)
- No silent corruption: injected and natural faults must be surfaced by EDAC/logging paths; data integrity is demonstrably preserved or safely degraded.
- ECC remains meaningful over mission life: scrubbing meets quota under realistic load; CE/UE behavior matches policy; retirement/lockout is deterministic.
- Link stability is corner-proof: BER/CRC/lock metrics meet targets across temperature, voltage disturbance, and harness variation.
- Worst-case latency is bounded: combined stress (load + refresh + scrub + link stress) preserves an upper bound on tail latency for critical traffic.
- Evidence is replayable: every run produces timestamped logs with condition tags, counter snapshots, actions taken, and outcomes.
Layered coverage From unit tests to worst-case mission scheduling
- L1 — Unit: EDAC encode/decode paths, CE/UE classification, scrub engine accounting, PHY loopback/PRBS/CRC counters.
- L2 — Subsystem: memory↔link interaction under real traffic; verify that observability stays intact when load varies.
- L3 — Scenario: scripted worst-case schedule with simultaneous refresh + scrub + peak traffic + link stress; verify tail latency bound and policy behavior.
Each layer must specify: stimulus → observables → pass gate → log fields.
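One way to make the "stimulus → observables → pass gate → log fields" rule auditable is to express each matrix row as a structured record that review tooling can check for completeness. A hypothetical sketch (field names are illustrative, not a mandated schema):

```python
from dataclasses import dataclass, field

@dataclass
class TestSpec:
    """One row of the layered test matrix: every test, at every layer,
    names its stimulus, observables, pass gate, and replay log fields."""
    layer: str                    # "L1" | "L2" | "L3"
    stimulus: str
    observables: list
    pass_gate: str
    log_fields: list = field(default_factory=lambda: ["timestamp", "run_id"])

    def is_complete(self):
        """A spec with any element missing cannot be signed off."""
        return bool(self.stimulus and self.observables and self.pass_gate)

spec = TestSpec(
    layer="L1",
    stimulus="inject single-bit fault in region A under 80% load",
    observables=["CE counter", "syndrome class", "read-back data"],
    pass_gate="data correct AND CE counter +1 AND log fields complete",
)
```

A simple completeness check like this, run over the whole matrix, catches tests that were written without an explicit pass gate before they reach a review board.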
Outputs What must be recorded and reviewed
- Memory: CE/UE counters, syndrome classes, affected region/address (policy-defined granularity), scrub progress, retire/lockout events.
- Link: BER, CRC window stats, lock-loss/re-lock events, recovery time, (optional) eye/jitter snapshots for debug correlation.
- System: actions taken (reduce rate / switch path / retire path), cooldown/debounce behavior, and post-action stability window.
- Latency: tail percentiles and worst-case service time under combined stress conditions.
Memory acceptance EDAC injection, scrub quota, UE handling, log integrity
| Test item | Method | Pass gate evidence |
|---|---|---|
| Single-bit injection | Inject controlled single-bit faults in representative regions; execute read/write patterns under load. | Correct data, CE counter increments, logged fields complete. |
| Multi-bit injection | Inject multi-bit faults; verify detect/uncorrectable behavior per policy (isolate/retire/degrade). | No silent failure; UE event logged; policy action is deterministic. |
| Scrub quota | Run background/on-idle/region-priority scrub modes under multiple load levels. | Quota met; scrub progress logged; latency impact remains bounded. |
| UE policy consistency | Force UE conditions and verify consistent action chains and recovery windows. | Action/outcome logged; no oscillation; lockout rules respected. |
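As background for the single-bit and multi-bit injection rows, the correct/detect distinction the tests must exercise can be shown with a toy SECDED code. Flight EDAC typically protects wider words (e.g., 32 data + 7 check bits, or Reed-Solomon for wider devices); this nibble-wide Hamming(7,4)-plus-parity sketch is only an illustration, with hypothetical function names:

```python
def hamming_secded_encode(data4):
    """Encode 4 data bits as an 8-bit SECDED codeword (Hamming(7,4) plus
    an overall parity bit): corrects single-bit errors, detects doubles."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p4, d2, d3, d4]       # positions 1..7
    p0 = 0
    for b in word:
        p0 ^= b                               # overall parity over positions 1..7
    return [p0] + word

def hamming_secded_decode(code8):
    """Return (data4, status) with status in {'ok', 'corrected', 'ue'}."""
    word = list(code8[1:])                    # positions 1..7
    s = 0
    for pos in range(1, 8):
        if word[pos - 1]:
            s ^= pos                          # syndrome = XOR of set positions
    overall = 0
    for b in code8:
        overall ^= b                          # 0 if overall parity holds
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                        # parity broken: single-bit error
        if s:
            word[s - 1] ^= 1                  # flip the indicated bit
        status = "corrected"                  # (s == 0 means p0 itself flipped)
    else:                                     # syndrome set, parity holds
        status = "ue"                         # double error: detect, don't correct
    return [word[2], word[4], word[5], word[6]], status

# One flipped bit is corrected (CE); two flipped bits surface as UE.
code = hamming_secded_encode([1, 0, 1, 1])
one_flip = list(code); one_flip[3] ^= 1
data_ce, status_ce = hamming_secded_decode(one_flip)
two_flips = list(code); two_flips[3] ^= 1; two_flips[5] ^= 1
_, status_ue = hamming_secded_decode(two_flips)
```

The injection tests in the table above are, in effect, asserting exactly these two outcomes: a CE path that restores data and increments a counter, and a UE path that is detected and escalated rather than silently miscorrected.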
Link acceptance BER sweeps, temperature cycles, supply disturbance, PRBS/loopback
| Test item | Method | Pass gate evidence |
|---|---|---|
| BER sweep | PRBS/stress patterns across temperature corners and harness samples; include long-run soak tests. | BER/CRC windows meet targets; artifacts attached per run. |
| Temperature cycling | Cycle across expected flight thermal envelope; track lock, BER, CRC windows continuously. | No uncontrolled lock-loss; recovery time within bound; logs tagged. |
| Supply disturbance | Apply controlled ripple/droop profiles; verify CDR stability and counter behavior. | Bounded lock-loss behavior; deterministic actions; stable post-window. |
| Loopback coverage | Verify PHY/digital loopbacks and flight lockout; confirm test mode does not alter termination conditions. | Repeatable results; test mode off in flight config; evidence logged. |
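The BER sweep rows assume a PRBS stimulus. As background, PRBS-7 (generator polynomial x^7 + x^6 + 1, as used by common BERT instruments) is easy to reproduce in software for log cross-checks; this is a sketch of the pattern the hardware generates, not a substitute for a BERT:

```python
def prbs7(seed=0x7F, n=127):
    """Generate n bits of the PRBS-7 sequence (x^7 + x^6 + 1) from a
    7-bit LFSR; any nonzero seed yields the full 127-bit period."""
    state = seed & 0x7F
    out = []
    for _ in range(n):
        bit = ((state >> 6) ^ (state >> 5)) & 1   # taps at stages 7 and 6
        state = ((state << 1) | bit) & 0x7F
        out.append(bit)
    return out

def ber(tx, rx):
    """Bit error ratio between transmitted and received bit lists."""
    errors = sum(a != b for a, b in zip(tx, rx))
    return errors / len(tx)

seq = prbs7(n=254)                 # two full periods back-to-back
tx = seq[:127]
rx = list(tx); rx[40] ^= 1         # inject a single bit error
measured = ber(tx, rx)             # 1/127 for this toy run
```

Real BER gates are statistical: demonstrating, say, BER < 1e-12 with confidence requires soaking for a number of bits far beyond the target's reciprocal, which is why the table calls for long-run soak tests rather than short captures.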
Combined worst case Load + refresh + scrub + link stress (tail-latency proof)
The combined scenario must intentionally align “busy moments” to expose the true worst-case service time. The objective is not a high average throughput, but a provable upper bound on tail latency when refresh and scrubbing compete with peak traffic while the link is stressed.
- Traffic profile: sustained stream + burst overlays (scripted phases with timestamps).
- Memory activity: refresh operating normally; scrub operating with quota and/or accelerated modes.
- Link stress: PRBS/stress pattern and corner condition (temperature or supply disturbance).
- Policy behavior: verify that degrade/switch/retire actions (if enabled) do not cause flapping; cooldown/debounce must be observable.
- Proof: record worst-case service time, tail percentiles, action triggers, and stable recovery windows.
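Why "intentionally align busy moments" matters can be seen with a small scheduling sketch: blocking windows that overlap merge into one longer service gap, and that merged gap is the WCST the proof must bound. A hypothetical planning helper (times are arbitrary ticks):

```python
def worst_gap(busy_windows):
    """Given blocking windows [(start, end), ...] during which the critical
    stream cannot be serviced, return the longest continuous blocked span.
    Overlapping windows (refresh + scrub + burst) merge into one gap."""
    merged = []
    for s, e in sorted(busy_windows):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return max((e - s for s, e in merged), default=0)

# Staggered refresh (t=10) and scrub (t=30), each blocking 5 ticks: WCST 5.
staggered = worst_gap([(10, 15), (30, 35)])
# Deliberately aligned: the same two windows merge into one 8-tick gap.
aligned = worst_gap([(10, 15), (12, 18)])
```

A test campaign that never forces this alignment can report `staggered`-style numbers for years while the true bound is `aligned`; the scripted worst-case schedule exists to measure the latter.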
Radiation campaigns (SEE) What to observe (no detector details)
Radiation runs should validate that resilience mechanisms remain testable and deterministic under real upset conditions. The focus is on counters, logs, actions, and recovery bounds.
- Memory: CE/UE rate vs run segment; syndrome distribution; scrub effectiveness; region retire/lockout events.
- Link: CRC window anomalies; BER excursions; lock-loss/re-lock frequency; recovery time distribution.
- System: policy actions taken (reduce/switch/retire), debounce/cooldown compliance, and post-action stability.
- Logging: every event must include timestamp, run-id tags, condition tags, and counter snapshots for post-analysis replay.
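The logging requirement above (timestamp, run-id, condition tags, counter snapshots) amounts to a fixed per-event record schema. A minimal sketch using JSON lines, which replay tooling can parse one event at a time (field names are illustrative):

```python
import json
import time

def log_event(run_id, condition_tags, event, counters):
    """One replayable log record: timestamp, run-id, condition tags, and a
    counter snapshot, serialized as a single JSON line for post-analysis."""
    record = {
        "timestamp": round(time.time(), 3),
        "run_id": run_id,
        "conditions": condition_tags,   # e.g. {"temp_C": 25, "pattern": "PRBS7"}
        "event": event,                 # e.g. "CE", "UE", "lock_loss", "switch"
        "counters": counters,           # snapshot at event time, not deltas
    }
    return json.dumps(record, sort_keys=True)

line = log_event("SEE-run-042", {"temp_C": 25}, "CE",
                 {"ce_total": 17, "ue_total": 0})
rec = json.loads(line)
```

Snapshotting absolute counters (rather than deltas) is deliberate: if a record is lost during an upset, the next record still fixes the cumulative totals, so post-analysis replay stays consistent.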
Example equipment (materials list) Typical part numbers used in acceptance labs
Use equivalent instruments if lab standards differ. The list below is included as concrete examples for documentation and procurement checklists.
| Category | Example part number | Use in this plan |
|---|---|---|
| BERT / PRBS generator & analyzer | M8040A (Keysight) / MP1800A (Anritsu) | PRBS/stress patterns, BER sweeps, long-run soak statistics. |
| Power disturbance + profiling | N6705C (Keysight) | Controlled droop/ripple profiles and correlated rail telemetry. |
| Thermal cycling chamber | SE-Series (Thermotron) | Corner temperatures, thermal cycling, long dwell soak tests. |
| High-speed scope / eye debug (optional) | 86100D (Keysight) / equivalent | Debug correlation (eye/jitter snapshots) when BER anomalies occur. |
FAQs (Space Memory & Interconnect)
Practical answers with engineering evidence: counters, logs, margins, and acceptance gates (no cross-topic expansion).