Battery Thermal Management Controller for Energy Storage
This page helps you design and place a dedicated battery thermal management controller for ESS and UPS systems, from sensor placement and actuator driving to protection strategies and system interfaces. It explains how to choose the right architecture and IC building blocks so cooling, safety and derating behaviour stay predictable as the site scales.
What this page solves
This page focuses on the thermal controller that sits between an energy storage battery rack and its cooling hardware. In rack and container ESS, UPS battery systems and buffer ESS for fast charging, higher charge and discharge C-rates quickly turn “just add some fans” into a dedicated control problem with dozens of temperature points and multiple cooling actuators.
The goal is to show when a standalone battery thermal management controller becomes necessary instead of treating temperature as a few extra channels inside the pack BMS. The page frames how multi-point sensing, pumps, valves and fans are coordinated, and how thermal limits are translated into power limits, alarms and safe states for the rest of the ESS.
It also clarifies the boundaries between the thermal controller and other subsystems. The pack BMS owns cell safety and state-of-charge, the PCS manages bidirectional power conversion and grid constraints, and the EMS optimizes site-level energy flows. The thermal controller is responsible for collecting detailed temperature data, driving the cooling loop and publishing a clear thermal capability envelope back to these systems.
- When the thermal function can stay inside the pack BMS MCU, and when it should move to a standalone controller.
- How the controller exchanges limits, alarms and status with the BMS, PCS and EMS.
- How sensing, actuation and protections are grouped into a coherent control block.
Thermal sensing architecture & accuracy budget
A battery thermal controller depends on a dense temperature sensing network rather than a handful of generic probes. Typical ESS designs combine NTC thermistors and, in some cases, RTD sensors at cell tabs, module cooling plates, coolant inlet and outlet points and cabinet air locations. The sensing architecture must scale from a few channels in a small UPS battery system to dozens of channels in a container-sized rack while keeping wiring complexity and noise under control.
The controller groups these sensing points into front-end AFEs, multiplexers and ADCs. Choices such as NTC versus RTD, star versus daisy-chain wiring and the amount of analog multiplexing have direct impact on achievable accuracy, response time and diagnosability. The aim is not to build a sensor textbook, but to highlight which AFE, ADC and reference features matter when designing a dedicated thermal controller for ESS.
Accuracy and repeatability are usually specified at the system level, for example within ±1 °C to ±2 °C over the operating range. The thermal controller has to budget this across sensor tolerances, wiring resistance, ADC resolution, voltage reference drift and installation errors. At the same time it must detect open and shorted sensors and catch more subtle issues such as slipped probes or poor contact, then fall back to conservative safe modes when measurements are no longer trustworthy.
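The budget split described above can be sketched in a few lines. This is a minimal illustration, not a validated method: the contribution names mirror the error sources listed in the text, while the numeric values, `worst_case_error`, `rss_error` and `budget_closed` are hypothetical and chosen only for the example.

```python
import math

# Hypothetical error contributions in °C for one NTC channel near a threshold.
# The numbers are illustrative assumptions, not datasheet values.
contributions_c = {
    "sensor_tolerance": 0.5,
    "wiring_resistance": 0.1,
    "adc_resolution": 0.05,
    "reference_drift": 0.2,
    "installation": 0.5,
}

def worst_case_error(contribs):
    """Sum of absolute errors: every term at its limit at the same time."""
    return sum(abs(v) for v in contribs.values())

def rss_error(contribs):
    """Root-sum-square: assumes the terms are independent and uncorrelated."""
    return math.sqrt(sum(v * v for v in contribs.values()))

def budget_closed(contribs, limit_c, method=worst_case_error):
    """True when the combined error stays within the system-level limit."""
    return method(contribs) <= limit_c

print(f"worst case: {worst_case_error(contributions_c):.2f} °C")
print(f"RSS:        {rss_error(contributions_c):.2f} °C")
```

A worst-case sum is the conservative choice for trip thresholds, while a root-sum-square estimate is often used for typical accuracy claims. Which one applies is a design decision that should be stated alongside the budget.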
Actuator types: pumps, valves and fans
The battery thermal management controller does not move heat by itself; it coordinates pumps, valves and fans that sit around the rack or container. Each actuator family brings different electrical interfaces, dynamic behaviour and diagnostic needs. The controller therefore relies on a set of smart drivers to switch and modulate these loads safely across the ESS voltage range.
Coolant pumps in ESS applications are usually DC or BLDC machines that run for long periods and face varying viscosity, trapped air and partial clogging. The driver must support soft-start to limit inrush, current limiting during transients and stall detection when the rotor cannot turn. For higher voltage or automotive-derived platforms this often leads to high-side or half-bridge drivers with integrated current sense and thermal shutdown rather than simple low-side switches.
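As a rough sketch of the soft-start and stall-detection behaviour described above, the following assumes a duty-cycle controlled pump and a current measurement supplied by the driver. The function and class names, the 2 s ramp, the 8 A stall threshold and the 0.5 s confirmation time are all hypothetical placeholders; real values come from the pump and driver datasheets.

```python
def soft_start_duty(elapsed_s, ramp_s=2.0, target=1.0):
    """Linear duty ramp from 0 to target over ramp_s seconds to limit inrush."""
    if elapsed_s <= 0:
        return 0.0
    return min(target, target * elapsed_s / ramp_s)

class StallDetector:
    """Flags a stall when pump current stays above a threshold for too long."""

    def __init__(self, stall_current_a=8.0, confirm_s=0.5):
        self.stall_current_a = stall_current_a
        self.confirm_s = confirm_s
        self._over_s = 0.0  # time spent continuously above the threshold

    def update(self, current_a, dt_s):
        """Feed one current sample; returns True once a stall is confirmed."""
        if current_a >= self.stall_current_a:
            self._over_s += dt_s
        else:
            self._over_s = 0.0  # any sample below the threshold resets the timer
        return self._over_s >= self.confirm_s
```

The confirmation time keeps normal start-up inrush from being reported as a stall. In practice it would be set longer than the soft-start ramp or the detector would be armed only after the ramp completes.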
Valves can be simple on–off devices or proportional actuators that meter flow through a cooling branch. On–off valves are usually handled by low-side or high-side drivers sized for inductive loads, while proportional valves prefer PWM or current-controlled outputs to achieve repeatable positions. Depending on system accuracy targets, the controller may rely on coil current estimation or add a dedicated position sensor, which requires extra ADC or digital interfaces and more detailed diagnostics for stuck or drifting valves.
Fans are typically 2-wire, 3-wire or 4-wire units. A 2-wire fan only exposes power pins and is driven by a switched supply or PWM on the supply line, with no direct speed feedback. A 3-wire fan adds a tachometer signal that allows the controller to verify that speed follows demand. A 4-wire fan accepts a dedicated PWM input at a defined frequency and also returns tach pulses, enabling quieter, granular speed control. In all three cases the driver must handle inductive switching and, for safety-critical zones, support detection of missing or slowed fans, through tach feedback where available or supply-current monitoring on 2-wire units.
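For 3-wire and 4-wire fans, speed verification can be sketched as below. The helper names are hypothetical, and the defaults are assumptions: two tach pulses per revolution is common but must be checked against the fan datasheet, and the 70 % "slow" ratio and 100 RPM "missing" floor are placeholders.

```python
def fan_rpm(pulse_count, window_s, pulses_per_rev=2):
    """Convert tach pulses counted in a time window to RPM."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return pulse_count / pulses_per_rev / window_s * 60.0

def fan_status(measured_rpm, expected_rpm, slow_ratio=0.7, min_rpm=100.0):
    """Classify a fan as 'ok', 'slow' or 'missing' against the demanded speed."""
    if expected_rpm <= 0:
        return "ok"  # fan commanded off, nothing to verify
    if measured_rpm < min_rpm:
        return "missing"
    if measured_rpm < slow_ratio * expected_rpm:
        return "slow"
    return "ok"
```

A real implementation would also allow for spin-up time after a speed change before judging the fan.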
Across pumps, valves and fans, the preferred implementation in ESS is to use smart high-side, half-bridge or full-bridge drivers rather than discrete MOSFETs. These devices integrate gate control, current sensing, short-circuit and overtemperature protection and often self-test logic. The thermal controller then monitors diagnostic pins and measured currents to detect blocked pumps, stuck valves or failed fans and can derate or shut down the system before temperatures drift out of control. Main power conversion topologies remain in the PCS for ESS page; this section focuses strictly on actuator drive and feedback chains.
Protection, derating and safe states
A battery thermal management controller is useful only if it turns temperature and cooling status into clear actions. Rather than a single trip threshold, energy storage systems work with multiple temperature zones and safe states. At lower temperatures the controller can prioritise efficiency and noise, while approaching the upper range it must progressively reduce charge and discharge power, enforce charge or discharge inhibits and finally request a full shutdown if cooling and temperature cannot be kept within limits.
Typical implementations define a normal operating zone, one or more derating zones and one or more trip zones. In the normal zone thermal limits do not constrain the pack BMS or PCS and cooling runs at low to moderate duty. As average module or hotspot temperatures enter a derating zone, the controller raises pump and fan speeds and sends a reduced power envelope to the BMS or PCS instead of waiting for a hard trip. Close to the maximum allowable temperature, charge may be inhibited while carefully controlled discharge remains, until a final threshold forces both directions towards zero and the system enters a defined safe state.
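The zone logic above can be sketched as a small state machine with hysteresis. Everything numeric here is an assumption for illustration: the entry thresholds, the 3 °C hysteresis, the 20 % floor in the derating zone and the latched trip are hypothetical and would be anchored to cell and module data in a real design.

```python
from enum import IntEnum

class Zone(IntEnum):
    NORMAL = 0
    DERATE = 1
    CHARGE_INHIBIT = 2
    TRIP = 3

# Entry thresholds in °C and exit hysteresis (illustrative assumptions).
ENTER_C = {Zone.DERATE: 40.0, Zone.CHARGE_INHIBIT: 50.0, Zone.TRIP: 55.0}
HYST_C = 3.0

def next_zone(current, hotspot_c):
    """Move up immediately; move down only after cooling below entry minus hysteresis."""
    if current == Zone.TRIP:
        return Zone.TRIP  # assumed to latch until a supervised reset
    zone = Zone.NORMAL
    for z in (Zone.DERATE, Zone.CHARGE_INHIBIT, Zone.TRIP):
        limit = ENTER_C[z] - (HYST_C if current >= z else 0.0)
        if hotspot_c >= limit:
            zone = z
    return zone

def power_limits(zone, hotspot_c):
    """Return (charge_fraction, discharge_fraction) of rated power."""
    if zone == Zone.NORMAL:
        return 1.0, 1.0
    if zone == Zone.DERATE:
        span = ENTER_C[Zone.CHARGE_INHIBIT] - ENTER_C[Zone.DERATE]
        frac = 1.0 - (hotspot_c - ENTER_C[Zone.DERATE]) / span
        frac = max(0.2, min(1.0, frac))  # linear taper with an assumed 20 % floor
        return frac, frac
    if zone == Zone.CHARGE_INHIBIT:
        return 0.0, 0.2  # charge blocked, limited discharge still allowed
    return 0.0, 0.0
```

The hysteresis prevents the published envelope from chattering when a hotspot sits close to a boundary.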
Cooling failures must be handled explicitly, not just as temperature excursions. Loss of a pump, stuck valves or missing fan tach pulses all indicate that cooling capacity is lower than expected. The thermal controller therefore combines actuator diagnostics with temperature measurements when deciding how fast to derate. If cooling cannot be restored, the controller requests more aggressive power limits even before temperatures cross the normal derating boundary, buying time to avoid hotspots or cell damage in large racks and containers.
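One way to express "derate earlier when cooling is degraded" is to shift the derating entry temperature according to an estimate of remaining cooling capacity. The sketch below is deliberately crude and every coefficient is an assumption: it treats the weaker of the pump and fan chains as limiting, charges an assumed 10 % penalty per stuck valve and lowers the entry threshold by up to 10 °C. A system with a truly redundant pump would need a different capacity model.

```python
def cooling_capacity(pumps_ok, pumps_total, fans_ok, fans_total, valves_stuck=0):
    """Crude capacity estimate in [0, 1]; the weakest chain dominates."""
    if pumps_total <= 0 or fans_total <= 0:
        raise ValueError("totals must be positive")
    pump_frac = pumps_ok / pumps_total
    fan_frac = fans_ok / fans_total
    capacity = min(pump_frac, fan_frac)
    # Each stuck valve removes an assumed 10 % of the remaining capacity.
    capacity *= max(0.0, 1.0 - 0.1 * valves_stuck)
    return capacity

def derate_entry_c(nominal_entry_c, capacity, max_shift_c=10.0):
    """Lower the derating entry temperature as cooling capacity drops."""
    return nominal_entry_c - max_shift_c * (1.0 - capacity)
```

With full capacity the nominal threshold is unchanged; with half the capacity lost, derating in this example starts 5 °C earlier.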
Sensor faults are handled with conservative fallback rules. Open or shorted NTCs, inconsistent readings between neighbouring probes or unrealistic dynamics all point to a measurement that can no longer be trusted. For non-critical locations, the controller can substitute nearby temperatures with extra margin. For critical hotspots or coolant outlets, the controller should move to a more restrictive power envelope and, if faults persist, escalate towards shutdown. In parallel it reports whether limits are driven by high temperature or by loss of sensing, giving the EMS and operators a clearer picture of the underlying issue.
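The fallback rules can be sketched as two small functions. The rail values used to recognise open and shorted NTCs, the 2 °C/s rate limit and the 5 °C substitution margin are assumptions for illustration, and the function names are hypothetical.

```python
OPEN_C, SHORT_C = -40.0, 150.0  # assumed readings at the rails for open / shorted NTCs

def sensor_valid(reading_c, previous_c, dt_s, max_rate_c_per_s=2.0):
    """Reject rail values and physically implausible rates of change."""
    if reading_c <= OPEN_C or reading_c >= SHORT_C:
        return False
    if previous_c is not None and dt_s > 0:
        if abs(reading_c - previous_c) / dt_s > max_rate_c_per_s:
            return False
    return True

def effective_temperature(reading_c, valid, neighbours_c, critical, margin_c=5.0):
    """Return (temperature, action) following a conservative fallback rule."""
    if valid:
        return reading_c, "ok"
    if critical or not neighbours_c:
        # No trustworthy value: the caller should tighten the power envelope.
        return None, "restrict_power"
    # Non-critical point: use the hottest neighbour plus a safety margin.
    return max(neighbours_c) + margin_c, "substituted"
```

Returning the action alongside the temperature lets the caller report whether a limit comes from heat or from loss of sensing.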
Thermal runaway and fire detection subsystems, covered on their own pages, provide specialized input signals such as off-gassing, smoke and flame alarms. After these signals arrive, the role of the thermal controller is to drive ESS power towards zero, notify the pack BMS, PCS and EMS, and if required operate dedicated I/O for fire system interlocks. The gas and fire sensing algorithms and hardware interfaces are described on those pages so that this section can stay focused on temperature-based limits, cooling faults and safe state transitions.
Controller partitioning and system interfaces
Battery thermal control logic can live in several places in an energy storage system. Small UPS and single-rack ESS designs often integrate thermal control into the pack BMS MCU, reusing its ADC, PWM and communication resources. As systems grow, the number of temperature channels, pumps, valves and fans grows as well, and the controller must run more complex safety and diagnostic logic. At that point integration into the BMS can exhaust CPU headroom and make every thermal update part of the battery safety certification scope.
A dedicated thermal management controller is common on rack or container-sized ESS. In this partition the controller terminates most temperature sensors and actuator drivers, then exposes a clean interface to the pack BMS and PCS: recommended charge and discharge power limits, thermal alarms and cooling fault codes. This separation lets the thermal function evolve independently from the core cell protection firmware and supports liquid-cooling networks with redundant pumps and branch valves without overloading the BMS MCU.
Multi-rack and multi-container sites frequently use a distributed architecture. Each rack has its own local thermal controller and BMS, taking care of detailed cooling and fast protection for that rack. A site EMS or gateway exchanges higher level setpoints and constraints with these racks rather than driving valves and fans directly. This approach simplifies scaling, isolates faults and lets the EMS schedule power and cycling duty across racks based on both electrical and thermal margins.
Regardless of the partition, the thermal controller must talk to neighbouring subsystems. Upwards it exchanges limits, alarms and status with the pack BMS, PCS and EMS over CAN, RS-485 or Ethernet and may provide hardwired trip lines for safety. Laterally it receives digital inputs from fire and off-gassing detection and from the insulation monitor. When gas, smoke or insulation faults are detected, the controller drives ESS power towards zero, cooperates with the BMS and PCS on shutdown and, if required, operates fire system interlocks. The aim of this section is to help decide where the thermal control code should run and which interfaces need to be reserved as the system scales.
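To make the "limits, alarms and status" exchange concrete, the sketch below packs a thermal envelope into an 8-byte payload that would fit a classic CAN frame. The layout is entirely hypothetical (field order, 0.1 kW scaling, one byte per flag) and is not taken from any standard or vendor protocol. It uses only Python's standard `struct` module.

```python
import struct

# Assumed 8-byte little-endian layout, one byte per flag rather than packed bits:
# state, fault code, charge limit (0.1 kW), discharge limit (0.1 kW),
# derating active, charge inhibit.
ENVELOPE_FMT = "<BBHHBB"

def pack_envelope(state, fault_code, charge_kw, discharge_kw, derating, charge_inhibit):
    """Serialise the thermal envelope into 8 bytes."""
    return struct.pack(
        ENVELOPE_FMT, state, fault_code,
        round(charge_kw * 10), round(discharge_kw * 10),
        int(bool(derating)), int(bool(charge_inhibit)),
    )

def unpack_envelope(payload):
    """Decode the 8-byte payload back into named fields."""
    state, fault, chg, dis, derating, inhibit = struct.unpack(ENVELOPE_FMT, payload)
    return {
        "state": state, "fault_code": fault,
        "charge_kw": chg / 10, "discharge_kw": dis / 10,
        "derating": bool(derating), "charge_inhibit": bool(inhibit),
    }
```

With 16-bit fields at 0.1 kW resolution this layout tops out at 6553.5 kW, so a larger site would need a different scale factor.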
IC building blocks and vendor mapping
Implementing a battery thermal controller for ESS means combining several IC building blocks rather than searching for a single monolithic device. The sensing front end typically uses multi-channel NTC or RTD AFEs and temperature monitors, sometimes combined with remote-diode interfaces for power modules. These devices provide accurate, diagnosable temperature readings across tens of points in the rack while handling excitation currents, linearisation and open or short detection.
The controller MCU or SoC anchors the design. It must offer enough ADC and timer channels to cover temperature inputs, pressure and flow sensors, as well as PWM outputs for pumps, valves and fans. It also needs CAN or CAN FD and, for rack or container controllers, at least one Ethernet interface towards the site gateway. Flash and RAM must support multiple cooling strategies, calibration data and on-board logging, with sufficient CPU performance to run control loops and communication stacks without jeopardising BMS or PCS timing.
Actuators are best driven by smart high-side, half-bridge or full-bridge drivers instead of discrete MOSFETs. These drivers integrate gate control, current sensing, short circuit and thermal shutdown protection and often provide diagnostic outputs to the MCU. Dedicated fan controllers may handle 4-wire fans with fixed PWM frequency, soft-start and tach monitoring. This approach reduces the burden on the MCU pins and simplifies fault detection for blocked pumps, stuck valves and missing or slowed fans.
Reliable operation also depends on supervision and power-protection ICs. External watchdogs and reset supervisors monitor the thermal controller supply rails and software health. eFuses and smart high-side power switches protect the controller supply and driver rails against short circuits and overloads with programmable current limits and diagnostic flags. DC/DC converters, LDOs and isolated supplies form the power tree for sensors, logic and actuators, while detailed converter topology and backup PSU design are covered in the dedicated auxiliary / backup PSU for ESS page.
Mainstream semiconductor vendors such as Texas Instruments, Analog Devices, NXP, Infineon and Microchip offer complete portfolios across these categories. Selection is driven more by feature sets than by brand: voltage and temperature ratings, channel counts, communication interfaces, diagnostics and safety documentation all matter. Once requirements for the ESS environment and control architecture are clear, these criteria provide a structured way to narrow down candidate ICs rather than relying on ad-hoc part choices.
Application mini-stories (rack / container examples)
Mini-story 1: Containerized ESS under high ambient and partial cooling loss
Consider a 2 MWh containerized ESS with two stacked rows of liquid-cooled racks and roof-mounted exhaust fans. In summer, ambient temperature can remain above 40 °C while some PCS or HVAC units are out of service for maintenance. The operator still expects the site to deliver power, but any thermal trip on a container would disrupt service. The thermal controller therefore needs to manage limited cooling capacity, shift thresholds and prioritise where cold coolant and airflow are used.
Each rack uses around 20 temperature points: NTCs on module outlets, coolant inlets and outlets, busbars and cabinet air inlets; a few RTDs on cold plates; and several remote-diode sensors inside power modules. Two multi-sensor temperature measurement ICs such as Analog Devices LTC2984 or similar parts aggregate NTC and RTD inputs per rack. Remote-diode monitors such as Texas Instruments TMP451 or TMP461-class devices track the junction temperature of key power modules, while an 8-channel sigma-delta ADC such as AD7124-8 measures flow and pressure transducers in the coolant loop. These front-end devices provide calibrated temperatures with open/short diagnostics and offload much of the linearisation from the main MCU.
Cooling hardware consists of one primary liquid pump, one redundant pump on a bypass loop, several branch valves and a bank of roof fans. The pumps are driven by smart high-side switches with integrated current limit and diagnostics, for example Infineon BTS50015-1TAD, BTS500xx-family or Texas Instruments TPS1H100-Q1 class devices. Branch valves are grouped on multi-channel low-side or configurable drivers such as TI DRV8806 or DRV8908-Q1, which provide fault status for each channel. Roof fans are powered by additional high-side switches or a dedicated fan controller stage, and tachometer signals return to the thermal controller for speed verification.
When site control marks certain PCS or HVAC units as unavailable, the thermal controller tightens temperature thresholds and enters derating earlier. For example, the normal operating band may end around 35–40 °C, with derating starting between 40–50 °C, charge inhibit at 50–55 °C and trip above 55–60 °C. With reduced cooling capacity, the controller drives pumps and critical valves to higher duty at lower temperatures and requests reduced charge and discharge power from the BMS and PCS before hard trips are reached. Flow and temperature data are used to allocate coolant preferentially to racks with higher C-rate or state of charge, while less loaded racks are allowed to run warmer as long as their own limits are respected.
- Sensors: ~20 NTC / RTD points per rack via 2 × LTC2984-class devices, 2–4 remote diodes via TMP451/TMP461-class ICs, 2–3 flow/pressure sensors via an AD7124-8-class sigma-delta ADC.
- Actuators: 1–2 liquid pumps on BTS500xx / TPS1H100-Q1 smart high-side switches, 6–8 valves on DRV8806 / DRV8908-Q1, 6–8 roof fans on smart switches or fan controllers with tach feedback.
- Controller: Rack-level ARM Cortex-M MCU with dual CAN FD, one Ethernet port, ~24–32 PWM channels and multiple SPI / I²C interfaces to AFEs and drivers.
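The preferential coolant allocation in this mini-story could be approximated by scaling each branch valve opening with the C-rate of its rack. This is only a sketch of one possible policy: the function name, the proportional rule and the 20 % minimum opening are assumptions, and a real controller would also weigh state of charge, measured flow and outlet temperature.

```python
def valve_openings(rack_c_rates, min_open=0.2):
    """Map per-rack C-rate to branch valve openings in [min_open, 1.0]."""
    if not rack_c_rates:
        return []
    peak = max(rack_c_rates)
    if peak <= 0:
        # All racks idle: keep every branch at the minimum opening.
        return [min_open for _ in rack_c_rates]
    # The most heavily loaded rack gets a fully open valve; others scale down.
    return [min_open + (1.0 - min_open) * (c / peak) for c in rack_c_rates]
```

The minimum opening keeps some flow through lightly loaded racks so their temperature sensors still reflect coolant conditions.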
Mini-story 2: UPS battery system with short high C-rate discharge
A 500 kVA data-center UPS relies on a valve-regulated lead-acid or lithium battery system to sustain the load for several minutes at high C-rate. Most of the time the system runs in float charge with moderate temperature. During a mains failure, the UPS draws high current for a short interval and the thermal controller must avoid nuisance trips when cell and cabinet temperatures are still well below damaging levels. At the same time, repeated high-current events should not push the battery into accelerated ageing or risk overheating if cooling degrades.
Each battery cabinet uses around 8–12 NTCs placed on module tops, busbars and air inlets and outlets, along with one or two ambient sensors. The main UPS controller MCU reads most NTCs directly via its integrated ADC, possibly supported by a small 4-channel high-resolution ADC such as Texas Instruments ADS1118 or similar devices for critical points. Board-level digital temperature sensors such as ADT7484-class parts monitor controller and power-electronics PCB temperatures. This modest sensor count is enough to capture cabinet air and metal temperatures when combined with knowledge of discharge current and duration.
Cooling is provided by three or four high-speed cabinet fans. Each fan is controlled through a smart power switch, for example Infineon BTS500xx-family or TI TPS1H100-Q1 high-side devices or an Infineon BTF3050TE low-side switch, with tachometer signals returning to the controller. In some designs a simple fan controller IC such as Microchip TC654/TC655-class parts maps temperature to fan duty and monitors faults, so that the UPS controller receives health and speed information without generating high-frequency PWM directly. Driver current ratings and diagnostic features are chosen to withstand frequent start-stop cycles and occasional blocked rotors.
During a discharge event the thermal controller observes the requested power, estimated discharge duration and initial cabinet temperature. As soon as a high C-rate is detected, fans ramp to higher speed even before temperature has risen significantly, using the cabinet and cell mass as thermal capacitance. Temperature, time and power are considered together; short spikes at high power do not immediately trigger derating unless previous events have accumulated heat. After the discharge finishes, fans continue at elevated speed for a defined cooldown period and the controller logs the event with timestamps, peak temperature and power profile. This log gives the EMS and maintenance teams a clear view of thermal stress over the lifetime of the UPS.
- Sensors: ~8–12 NTC channels into the UPS controller ADC, 1–2 ambient sensors and optionally a 4-channel high-resolution ADC such as ADS1118 for critical points.
- Actuators: 3–4 cabinet fans driven by smart power switches (BTS500xx / TPS1H100-Q1 high-side or BTF3050TE low-side) or by a fan controller such as TC654/TC655-class devices.
- Controller: UPS control MCU with spare ADC channels, PWM outputs and tach inputs for fan control, reusing existing CAN or Ethernet links for reporting thermal events.
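The thermal-capacitance reasoning in this mini-story can be illustrated with a single-node lumped model, C·dT/dt = P_loss − (T − T_amb)/R_th, integrated with forward Euler. The model, the function names and all parameter values are assumptions for illustration; a real cabinet needs measured thermal resistance and capacitance. The second function shows a pre-emptive fan ramp keyed to C-rate rather than temperature alone.

```python
def simulate_cabinet(t0_c, ambient_c, loss_w, duration_s,
                     r_th_c_per_w, c_th_j_per_c, dt_s=1.0):
    """Forward-Euler integration of C*dT/dt = P_loss - (T - T_amb)/R_th."""
    t = t0_c
    for _ in range(int(duration_s / dt_s)):
        t += dt_s * (loss_w - (t - ambient_c) / r_th_c_per_w) / c_th_j_per_c
    return t

def fan_duty(cabinet_c, c_rate, high_c_rate=1.0):
    """Temperature-based duty, raised to at least 90 % during high C-rate discharge."""
    # Assumed curve: 30 % minimum duty, reaching 100 % at 45 °C.
    temp_duty = min(1.0, max(0.3, (cabinet_c - 25.0) / 20.0))
    if c_rate >= high_c_rate:
        return max(temp_duty, 0.9)  # ramp early, before temperature rises
    return temp_duty
```

With a long thermal time constant (R_th × C_th), a few minutes of high loss moves the cabinet temperature only part of the way towards its steady-state value, which is why short discharges need not trigger derating on their own.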
Design checklist for a battery thermal management controller
Sensing: coverage, accuracy and wiring
- Are all critical hotspots covered by sensors, including module outlets, coolant inlets and outlets, cabinet air inlets and outlets and key busbars?
- Is the temperature accuracy budget closed across sensor tolerance, AFE offset and gain, ADC resolution, reference drift and wiring resistance?
- Is the mix of NTC and RTD types chosen based on operating range, required accuracy and harness cost, with clear justification in the design notes?
- Are harness lengths, wire gauge and maximum allowed series resistance per sensor loop specified and checked against fault-detection thresholds?
- Do the temperature AFEs (for example LTC2984-class, AD7124-class, TMP451/TMP461-class devices) provide open- and short-circuit detection and diagnostic flags on each channel?
- Are any remote-diode interfaces used on power modules validated for the expected common-mode voltage range and wiring environment inside the rack or container?
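The effect of harness resistance on an NTC reading can be estimated with the Beta-parameter model, R(T) = R0·exp(B·(1/T − 1/T0)). The sketch below assumes a 10 kΩ, B = 3950 K thermistor purely as an example; actual parts and their tolerances come from the sensor datasheet, and the function names are hypothetical.

```python
import math

def ntc_resistance(temp_c, r0=10_000.0, beta=3950.0, t0_c=25.0):
    """NTC resistance in ohms from temperature, Beta-parameter model."""
    t_k, t0_k = temp_c + 273.15, t0_c + 273.15
    return r0 * math.exp(beta * (1.0 / t_k - 1.0 / t0_k))

def ntc_temperature(r_ohm, r0=10_000.0, beta=3950.0, t0_c=25.0):
    """Temperature in °C from NTC resistance, inverse of the model above."""
    t0_k = t0_c + 273.15
    return 1.0 / (1.0 / t0_k + math.log(r_ohm / r0) / beta) - 273.15

def lead_error_c(temp_c, lead_ohm):
    """Reading error caused by series lead resistance (negative: reads cold)."""
    return ntc_temperature(ntc_resistance(temp_c) + lead_ohm) - temp_c
```

Because NTC resistance falls as temperature rises, the same lead resistance causes a larger error at high temperature, which is exactly where derating and trip thresholds sit.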
Actuators: drivers, currents and safe states
- For each pump, valve and fan, are the maximum operating current, stall or inrush current and duty cycle characterised and matched to driver ratings with margin?
- Are smart high-side, low-side or bridge drivers (for example BTS500xx, BTF3050TE, TPS1H1xx, DRV88xx families) selected with adequate voltage, current and thermal derating for the ESS environment?
- Are driver diagnostics such as overcurrent, overtemperature, open load and short to battery or ground monitored by the thermal controller and logged as distinct fault codes?
- Is a defined safe state documented for each actuator in case of driver failure, MCU freeze, loss of supply or watchdog reset, including preferred valve positions and fan profiles?
- Are PWM frequencies and update rates for pumps, proportional valves and fans chosen to avoid audible noise issues while keeping switching losses and EMI manageable?
- Are mechanical limits and startup constraints (for example minimum speed, priming requirements for pumps) explicitly reflected in the driver and firmware configuration?
Interfaces: BMS / PCS / EMS and event logging
- Are clear signals defined from the thermal controller to the pack BMS and PCS, including thermal derating active, charge inhibit, thermal trip and cooling fault indications?
- Is the allowed charge and discharge power as a function of temperature expressed as a structured envelope or table that can be consumed by the BMS and PCS?
- Are communication links to BMS, PCS and EMS (CAN, RS-485, Ethernet or similar) sized and configured with sufficient bandwidth, update rate and error handling for thermal limits and events?
- Are thermal events such as entering and leaving derating zones, charge inhibits, trips and actuator or sensor faults timestamped and stored in a log with enough resolution and retention?
- Is the interaction with the insulation monitor, fire detection and off-gassing subsystems defined as discrete inputs and outputs with clear priority rules between electrical and thermal trips?
- Are remote diagnostics and firmware updates for the thermal controller planned through the same site gateway used for EMS or through a secure OTA mechanism on the ESS network?
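A structured power envelope, as asked for in the checklist above, can be as simple as a breakpoint table with linear interpolation. The table values and the `allowed_power` helper below are illustrative assumptions, not recommended limits.

```python
from bisect import bisect_right

# (temperature °C, charge fraction, discharge fraction); illustrative values only.
ENVELOPE = [
    (40.0, 1.0, 1.0),
    (50.0, 0.3, 0.5),
    (55.0, 0.0, 0.2),
    (60.0, 0.0, 0.0),
]

def allowed_power(temp_c, table=ENVELOPE):
    """Interpolate (charge, discharge) fractions of rated power at temp_c."""
    temps = [row[0] for row in table]
    if temp_c <= temps[0]:
        return table[0][1], table[0][2]
    if temp_c >= temps[-1]:
        return table[-1][1], table[-1][2]
    i = bisect_right(temps, temp_c)  # index of the first breakpoint above temp_c
    (t_lo, c_lo, d_lo), (t_hi, c_hi, d_hi) = table[i - 1], table[i]
    w = (temp_c - t_lo) / (t_hi - t_lo)
    return c_lo + w * (c_hi - c_lo), d_lo + w * (d_hi - d_lo)
```

Keeping the table as data rather than code makes it easier to review and to adjust for a site without touching the control logic.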
Safety and redundancy: sensors, cooling paths and IC-level protection
- Are dual sensors provided on the most critical locations such as the hottest module, coolant outlet and cabinet air outlet, with cross-check and plausibility logic in firmware?
- Is at least one redundant cooling path or degraded mode defined, for example a backup pump, bypass valve or emergency fan curve for loss-of-function scenarios?
- Does the thermal controller MCU use an independent watchdog and supply supervisor (for example TPS386000, ADM8316, MIC81x-class devices) with well-defined recovery and limp-home actions?
- Are eFuses or smart power switches (for example TPS25982, LTC4368-class devices) used on thermal controller and driver supplies to limit fault current and provide diagnostic feedback?
- Are shutdown and fallback strategies aligned with fire detection, off-gassing and insulation monitor behaviour, including priorities and timing between different trip sources?
- Are safety-related assumptions, threshold values and dependency on specific IC features documented so that later device substitutions can be reviewed without losing protection coverage?
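The cross-check for dual sensors mentioned in this checklist can be sketched as follows. The 3 °C disagreement limit and the function name are assumptions; the conservative choice of acting on the hotter reading is one common policy, not the only one.

```python
def cross_check(a_c, b_c, max_delta_c=3.0):
    """Return (temperature to use, plausible flag) for a redundant sensor pair."""
    plausible = abs(a_c - b_c) <= max_delta_c
    # Always act on the hotter reading; it is the conservative choice.
    return max(a_c, b_c), plausible
```

When the pair disagrees, the returned flag lets the controller log a sensor fault and tighten limits while still protecting against the hotter of the two values.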
FAQs about battery thermal management controllers
1. When does an ESS really need a dedicated battery thermal controller instead of just feeding temperatures into the pack BMS MCU?
You usually rely on the pack BMS MCU while you have a single rack, a simple air-cooled cabinet and only a handful of temperature points and fans. You move to a dedicated thermal controller once C-rates, rack count or liquid-cooling complexity grow, so thermal changes no longer force you to recertify and requalify the BMS firmware. Learn more in the partitioning section.
2. How many temperature points are typically enough per rack, and which locations should you instrument first?
For a typical rack you start by covering the hottest module string, coolant inlet and outlet, main busbars and cabinet air inlet and outlet. That often means 12–20 points. You add more sensors when you have strong gradients, unusual airflow or high C-rates, keeping at least one “golden” point for validation and calibration. See the sensing architecture guidance.
3. How should you choose NTCs, RTDs and remote-diode sensors for containerized ESS and UPS battery systems?
You use NTCs when you need low cost and short harness runs, RTDs when you need stable accuracy over long cables or wide ambient ranges, and remote diodes when you want true junction temperature of power modules. In practice you often combine them so each sensor type matches its strength in the cooling architecture. Compare options in the sensing section.
4. What is a practical way to build an accuracy budget for ESS temperature measurement and decide whether your AFEs are good enough?
You start by defining the maximum error you can tolerate at each decision boundary, for example ±1 °C at derating and trip thresholds. You then split that budget across sensor tolerance, AFE offset and gain, ADC resolution, reference drift and wiring resistance. If the sum exceeds your budget you adjust components or thresholds. See the accuracy discussion in the sensing architecture section.
5. How should pump stall, valve faults and fans that do not spin be diagnosed and protected at driver and firmware level?
You combine smart drivers and firmware checks. Current and voltage diagnostics from high-side or bridge drivers tell you about short circuits, open loads and stalls. Tachometer feedback, flow and temperature changes confirm that pumps, valves and fans actually move coolant and air. When a fault is confirmed, you enter a safe state and log the event for later analysis. Review the details in the actuator types section.
6. How should you define temperature zones and derating thresholds so that cells are protected without causing nuisance trips?
You anchor thresholds to cell and module datasheets, then introduce a “warning and derating” band before any safety limit. In that band you gradually reduce charge and discharge power instead of tripping. You also consider ambient temperature, event duration and recent thermal history so short peaks do not trigger unnecessary shutdowns. See the protection and derating section.
7. What is the minimum the thermal controller must do when a thermal runaway or off-gassing signal is received from a dedicated detection system?
You immediately drive requested charge and discharge power towards zero, command the BMS and PCS into a safe state and freeze any aggressive cooling strategy that might spread gas or smoke in the wrong direction. You also raise clear alarms for the EMS and fire system, log the event and avoid automatic restart until a supervised inspection is complete. Read more about fault handling in the protection, derating and safe states section.
8. How should multi-rack and multi-container sites partition thermal control between rack-level controllers and the site EMS?
You let each rack-level controller manage its own sensors, pumps, valves and fans with fast local protection, then expose a simple view to the EMS: available power, thermal margin and health. The EMS uses those summaries and business rules to schedule which racks work harder, while avoiding direct control of individual actuators across the whole site. See the multi-rack partitioning examples.
9. What criteria should you use to select the MCU, temperature AFEs and smart drivers for an ESS battery thermal controller?
You start from channel counts, interfaces and diagnostics. The MCU needs enough ADCs, timers, CAN and Ethernet to cover growth in racks and actuators. Temperature AFEs must support your chosen sensors and wiring length. Smart drivers must survive inrush, stalls and ambient extremes. You also check EMC performance, safety documentation and long-term availability. Explore the IC building blocks section.
10. How should the thermal controller exchange limits and events with the pack BMS, PCS and EMS so that power derating is predictable?
You define a small set of clear messages: current thermal state, maximum recommended charge and discharge power, derating active flags, trip causes and cooling fault codes. You update them at a fixed rate over CAN or Ethernet and avoid ad-hoc bitfields. When each consumer honours the same envelope, power changes become smooth and reproducible across firmware updates. See the interface checklist section.
11. How can you add redundancy in sensors, cooling paths and power supplies so that a single fault does not silently disable thermal protection?
You duplicate sensors at the hottest and most safety-relevant points and cross-check their readings. You design at least one degraded cooling mode, such as a backup pump or emergency fan curve. You also protect controller and driver supplies with independent watchdogs, supervisors and eFuses so no single failure can quietly remove cooling or sensing capabilities. Review safety IC building blocks.
12. What should you log in a battery thermal management system so that field issues can be analysed and cooling strategies improved over time?
You log key temperatures, cooling states, derating entries, charge inhibits, trips and actuator or sensor faults with timestamps. You also capture ambient temperature, requested and delivered power and any EMS or operator overrides. With that history you can later explain warranty cases, tune thresholds for a specific site and refine firmware without guessing. See the application mini-stories for examples.
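A bounded, timestamped event log along these lines could look like the sketch below. The record fields follow the items named in the answer above, but the class names, the fixed capacity and the in-memory `deque` are assumptions. An embedded controller would typically write to non-volatile storage instead.

```python
import time
from collections import deque
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ThermalEvent:
    timestamp_s: float
    kind: str            # e.g. "derating_enter", "charge_inhibit", "fan_fault"
    peak_temp_c: float
    requested_kw: float
    delivered_kw: float
    detail: str = ""

class EventLog:
    """Ring buffer that keeps the most recent events and drops the oldest."""

    def __init__(self, capacity=1000):
        self._events = deque(maxlen=capacity)

    def record(self, kind, peak_temp_c, requested_kw, delivered_kw, detail="", now=None):
        stamp = time.time() if now is None else now
        event = ThermalEvent(stamp, kind, peak_temp_c, requested_kw, delivered_kw, detail)
        self._events.append(event)
        return event

    def dump(self):
        """Return all retained events as plain dictionaries, oldest first."""
        return [asdict(e) for e in self._events]
```

Logging requested and delivered power side by side makes it possible to see afterwards how much energy a derating event actually cost.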