At present, vendors are designing servers that can demand over 40 kW of cooling per rack. With most data centres designed to cool an average of no more than 2 kW per rack, innovative strategies are needed to cool high-density equipment properly.
The Schneider Electric Data Centre Science Centre, a research centre of the energy management company, recommends regular “health checks” as a starting point for increasing cooling efficiency, cooling capacity, and power density in existing data centres.
According to Schneider Electric, just as an automobile benefits from regular servicing, a data centre needs to be kept operating at peak efficiency to maintain the business processes it supports and to prevent future problems. Before embarking upon expensive upgrades to the data centre to deal with cooling problems, certain checks should be carried out to identify potential flaws in the cooling infrastructure.
These checks establish the health of the data centre and help avoid temperature-related IT equipment failure.
“They can be used to assess the availability of sufficient cooling capacity for the future,” says Eben Owen, E&S sales manager at Schneider Electric South Africa.
The current status should be reported and a baseline established to ensure that subsequent corrective actions result in improvements.
A cooling system check-up should include these nine items:
* Maximum cooling capacity – if there isn’t enough petrol in the tank to power the engine, then no amount of tweaking will improve the situation. Check the overall cooling capacity to ensure that the total load of the IT equipment in the data centre does not exceed it.
“One watt of power consumed needs one watt of cooling. Excess of demand over supply will require major re-engineering work or the use of self-contained high-density cooling solutions,” says Owen. A rough illustration of this capacity check appears in the sketch after this checklist.
* CRAC (computer room air conditioning) units – measured supply and return temperatures and humidity readings must be consistent with design values. Check set points and reset if necessary.
A return air temperature considerably below room ambient temperature would indicate a short circuit in the supply air path, causing cooled air to bypass the IT equipment and return directly to the CRAC unit.
“Check that all fans are operating properly and that alarms are functioning. Ensure that all filters are clean,” adds Owen.
* Chiller water/condenser loop – check the condition of the chillers and/or external condensers, pumping systems, and primary cooling loops. Ensure that all valves are operating correctly. Make sure that DX (direct expansion) systems, if used, are fully charged.
* Room temperatures – test temperatures at strategic positions in the aisles of the data centre. Owen explains that these measuring positions should generally be centred between equipment rows and spaced approximately every fourth rack position.
* Rack temperatures – measuring points should be at the centre of the air intakes at the bottom, middle, and top of each rack. These temperatures should be recorded and compared with the manufacturer’s recommended intake temperatures for the IT equipment.
* Tile air velocity – if a raised floor is used as a cooling plenum, air velocity should be uniform across all perforated tiles or floor grilles.
* Condition of subfloors – “Any dirt and dust present below the raised floor will be blown up through vented floor tiles and drawn into the IT equipment,” says Owen. “Under-floor obstructions such as network and power cables obstruct airflow and have an adverse effect on the cooling supply to the racks.”
* Airflow within racks – gaps within racks (unused rack space without blanking panels, empty blade slots without blanking blades, unsealed cable openings) or excess cabling will affect cooling performance.
* Aisle and floor tile arrangement – effective use of the subfloor as a cooling plenum critically depends upon the arrangement of floor vents and positioning of CRAC units.
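As a rough, hypothetical sketch of the two most quantitative checks above – the capacity comparison and the rack intake temperatures – the Python snippet below applies the one-watt-of-cooling-per-watt-of-power rule of thumb and flags racks whose intake readings exceed a recommended limit. All figures, rack names, and the 25 °C threshold are assumed placeholders for illustration, not Schneider Electric values.

```python
# Hypothetical health-check baseline: the loads, rack names and the 25 °C
# intake limit are illustrative assumptions, not vendor figures.

MAX_COOLING_KW = 60.0        # rated cooling capacity of the room (assumed)
RECOMMENDED_INTAKE_C = 25.0  # manufacturer's recommended maximum intake temp (assumed)

# Measured IT load per rack in kW (assumed values)
rack_loads_kw = {"A01": 3.2, "A02": 4.8, "B01": 12.5, "B02": 2.1}

# Intake temperatures recorded at the bottom, middle and top of each rack (assumed values)
rack_intake_temps_c = {
    "A01": (21.0, 22.5, 24.0),
    "A02": (22.0, 23.5, 26.5),
    "B01": (23.0, 26.0, 29.5),
    "B02": (20.5, 21.0, 22.0),
}

# One watt of power consumed needs roughly one watt of cooling, so the total
# IT load must stay within the room's rated cooling capacity.
total_load_kw = sum(rack_loads_kw.values())
print(f"Total IT load: {total_load_kw:.1f} kW of {MAX_COOLING_KW:.1f} kW cooling capacity")
if total_load_kw > MAX_COOLING_KW:
    print("WARNING: IT load exceeds cooling capacity - re-engineering or "
          "self-contained high-density cooling required")

# Flag any rack whose hottest intake reading exceeds the recommended limit.
for rack, temps in rack_intake_temps_c.items():
    hottest = max(temps)
    if hottest > RECOMMENDED_INTAKE_C:
        print(f"Rack {rack}: hottest intake {hottest:.1f} °C exceeds "
              f"recommended {RECOMMENDED_INTAKE_C:.1f} °C")
```

Recording results like these at each health check provides the baseline against which later corrective actions can be measured.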
Installation of the latest blade-server technology provides many benefits. However, these servers, if deployed as compactly as their size allows, draw two to five times the per-rack power of traditional servers and generate heat output that can easily cause thermal shutdown if proactive cooling strategies are not employed.
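As a rough, hedged illustration using assumed figures rather than vendor data: a traditional rack drawing, say, 4 kW would jump to between 8 kW and 20 kW when fully populated with blades, far beyond the average of roughly 2 kW per rack that most existing rooms were designed to cool.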
Owen stresses that to avoid outright equipment failures, unexplained slowdowns, and shortened equipment life, it is becoming critically important to implement a regular health-check regime to ensure that cooling equipment is operating within the design values of capacity, efficiency, and redundancy.