Long fat networks need a helping hand from high-speed TCP

Five years ago, a company may have housed its data in one part of Johannesburg and maintained a mirrored site nearby, writes Christo Briedenhann, country manager: South Africa, Riverbed.

Management approved of this and saw the mirrored site as an almost sure route to continuous operations. But in today's uncertain world, most enterprises have changed their approach to mirroring, data replication and disaster recovery.
Savvy IT executives now want the company's backup site located far enough away to be on an independent power grid. Many others have gone farther afield and located these sites thousands of miles from company headquarters. To support this architecture, they have built long fat networks (LFNs), which are "long" in terms of distance and network delay and "fat" in terms of link bandwidth.
One of the key reasons to deploy LFNs for disaster recovery is to handle the immense loads of storage traffic that can be generated in an active data centre. When fully replicating a storage system, every write to a disk must be reflected to the mirrored site over a large distance. Yet, because high-end Storage Area Networks (SANs) are based on Fibre Channel protocols, their physical reach is limited.
To address this problem, storage switch vendors introduced SAN routers to extend the reach of Fibre Channel communications over IP. With a SAN router, isolated Fibre Channel SANs can be bridged together over an arbitrary distance by tunnelling Fibre Channel Small Computer System Interface (SCSI) payloads over IP using FCIP or iSCSI protocols.
A major challenge to this approach is the sheer volume of network traffic generated from replicating an active data centre. Such workloads often require many hundreds of megabits, if not gigabits, of network throughput. While dedicated high-speed data circuits are not cheap, storage operators routinely justify their use as a necessary business expense.
Enter the LFN, a high-capacity, long-haul network designed to carry storage replication traffic over IP. You would think that, with enough link bandwidth, such an approach would "just work".
Unfortunately, the laws of physics intervene and the speed of light wreaks havoc on applications and protocols separated by large distances. You can have all the bandwidth in the world, but a Transmission Control Protocol (TCP)/IP connection carrying storage replication traffic can still come to a grinding halt when subjected to typical conditions in a high-delay LFN.
In response to such situations, well-intentioned IT operators blindly throw network bandwidth at the problem, only to discover that performance is still the same after expensive upgrades.
UDP not the answer
To avoid the problems that arise from TCP’s flow control and congestion control algorithms, certain SAN routers support the connectionless User Datagram Protocol (UDP) for replication traffic.
UDP, though, is a mixed bag. It is typically configured to run blindly at a pre-configured rate, which works sufficiently well if a circuit is dedicated completely to the mirroring application. However, if the network is shared at all, UDP's lack of congestion control can virtually stamp out any competing TCP flows.
Unlike UDP, TCP allows client-server applications to communicate reliably over the unreliable IP packet service and permits sharing of network bandwidth across connections in a roughly "fair" fashion. It does so by having each TCP sender dynamically adjust its transmission window, which represents the maximum amount of unacknowledged data that can be in transit in the network at any given time.
Since it takes a roundtrip time for each packet to be acknowledged, a TCP sender can send a window's worth of packets every roundtrip time, yielding a sending rate of the window size divided by the roundtrip time. As the roundtrip time increases, as in an LFN, the sending rate drops if the window is somehow constrained or if the adjustments to the window are made in a sub-optimal fashion.
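The relationship between window, roundtrip time and sending rate can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the model ignores packet loss, header overhead and slow start.

```python
def tcp_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Best-case sending rate in bits per second: one window per roundtrip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(link_bps: float, rtt_seconds: float) -> float:
    """Window (the bandwidth-delay product) needed to keep a link full."""
    return link_bps * rtt_seconds / 8


if __name__ == "__main__":
    # Illustrative figures: a 1Gbit/sec link with a 100-millisecond roundtrip.
    needed = window_needed_bytes(1e9, 0.100)
    print(f"Window needed to fill the link: {needed / 1e6:.1f} MB")
    # The same window over a 10-millisecond roundtrip would support 10x the rate.
    print(f"Rate at 10 ms with that window: {tcp_throughput_bps(needed, 0.010) / 1e9:.1f} Gbit/sec")
```

The point of the sketch is that the window, not the link speed, sets the ceiling: if the window cannot grow to the bandwidth-delay product, extra bandwidth goes unused.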
Over the years, improvements to TCP have allowed the protocol to perform better at ever-increasing speeds. For example, TCP "window scaling", introduced in 1988, removes a flow-control bottleneck in the original protocol specification, which capped the window at 64KB and so limited throughput to slightly more than 5Mbit/sec. on a 100-millisecond roundtrip network. Meanwhile, selective acknowledgments and rate halving, which were introduced in 1996, both improve the protocol's dynamics when the window becomes large.
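The 5Mbit/sec. figure follows from the 16-bit window field in the original TCP header. The sketch below works through the arithmetic; it assumes the window scale option's maximum shift of 14 bits, and the function names are hypothetical.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value a 16-bit window field can hold
MAX_SHIFT = 14                # assumed upper limit of the window scale option


def max_window_bytes(shift: int) -> int:
    """Largest advertisable window for a given window-scale shift count."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError(f"shift must be between 0 and {MAX_SHIFT}")
    return MAX_UNSCALED_WINDOW << shift


def max_throughput_mbps(shift: int, rtt_seconds: float) -> float:
    """Flow-control ceiling in Mbit/sec: the largest window sent once per roundtrip."""
    return max_window_bytes(shift) * 8 / rtt_seconds / 1e6


if __name__ == "__main__":
    rtt = 0.100  # 100-millisecond roundtrip, as in the example above
    for shift in (0, 7, 14):
        print(f"shift={shift:2d}  window={max_window_bytes(shift):>13,d} bytes  "
              f"ceiling={max_throughput_mbps(shift, rtt):,.1f} Mbit/sec")
```

With no scaling the ceiling is a little over 5Mbit/sec.; with the maximum shift the flow-control limit is far above any link in service, which is why the remaining bottleneck lies in congestion control rather than flow control.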
Despite all these improvements, TCP performance in LFNs is often abysmal. Because of this, TCP often gets a bad reputation for how its congestion control algorithm "hunts" for bandwidth haphazardly and backs off conservatively at any sign of congestion. Some vendors that specialise in TCP protocol optimisation contend that TCP is fundamentally flawed at high speeds and must be replaced by an alternative transport protocol.
Yet, despite repeated attempts over the past two decades, no viable alternative to TCP or TCP congestion control has emerged that definitively performs better across the broad range of environments in which TCP shines.
So, what is the problem with TCP? Contrary to common wisdom, a properly tuned TCP implementation running over a well-engineered network can readily achieve throughputs up to 100Mbit/sec., even with very high latencies. But beyond 100Mbit/sec., in a typical high-latency environment, performance starts to degrade.
The problem has to do with the window adjustment algorithm, which, when devised 20 years ago, was not targeted for the performance regime of today's LFNs. In its so-called "congestion avoidance" phase, TCP increases its sending window by one packet every roundtrip time and when it detects congestion, it cuts the window in half.
On an LFN where the optimal window size is, say, 8 000 packets, a congestion event cuts the window to 4 000 packets, and regaining the lost 4 000 packets at one packet per roundtrip takes 4 000 roundtrips. At a roundtrip time of 100 milliseconds, that is about 400 seconds (six minutes and 40 seconds). In a highly dynamic network, with lots of connections coming and going, normal TCP is simply too sluggish to track all the activity.
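The 400-second figure can be reproduced with a short calculation. The sketch below assumes a 100-millisecond roundtrip time, which is what the figure implies, and models only the congestion avoidance rule described above: halve on congestion, then add one packet per roundtrip. The function name is hypothetical.

```python
def recovery_time_seconds(optimal_window_packets: int, rtt_seconds: float) -> float:
    """Time for standard TCP to regain its window after one congestion event."""
    halved = optimal_window_packets // 2              # window is cut in half
    roundtrips = optimal_window_packets - halved      # regained one packet per roundtrip
    return roundtrips * rtt_seconds


if __name__ == "__main__":
    seconds = recovery_time_seconds(8_000, 0.100)
    minutes, remainder = divmod(round(seconds), 60)
    print(f"Recovery takes about {seconds:.0f} seconds ({minutes} min {remainder} s)")
```

Because the recovery time scales with both the window and the roundtrip time, doubling either the link speed or the distance doubles the time spent below full rate.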
Fortunately, there is an elegant and simple-to-implement fix embodied in a protocol enhancement called high-speed TCP. Documented in RFC 3649, the gist of this approach is to alter how the window is opened on each round trip and closed on congestion events in a way that is dependent upon the absolute size of the window.
When the window is small, high-speed TCP behaves like normal TCP. Conversely, when the window is large, it increases the window by a larger amount and decreases it by a smaller amount, where these amounts are chosen based on the precise value of the window in operation. The net effect is that TCP's sluggishness evaporates in LFNs and high-speed TCP performs splendidly, capable of fully utilising multi-gigabit, high-delay links.
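A rough sketch of the window-dependent increase and decrease is given below. The formulas and constants (a low window of 38 packets, a high window of 83 000 packets and a decrease factor of 0.1 at the high window) are taken from the default parameters in RFC 3649 as best understood here and should be checked against the RFC before use. The function names are hypothetical, and real implementations typically use a precomputed lookup table rather than evaluating logarithms per packet.

```python
import math

LOW_WINDOW = 38         # assumed RFC 3649 default: at or below this, behave like normal TCP
HIGH_WINDOW = 83_000    # assumed RFC 3649 default: window at the high end of the response curve
HIGH_DECREASE = 0.1     # assumed RFC 3649 default: decrease factor at HIGH_WINDOW


def decrease_factor(w: float) -> float:
    """Fraction b(w) of the window given up on a congestion event."""
    if w <= LOW_WINDOW:
        return 0.5
    span = math.log(HIGH_WINDOW) - math.log(LOW_WINDOW)
    return (HIGH_DECREASE - 0.5) * (math.log(w) - math.log(LOW_WINDOW)) / span + 0.5


def increase_per_rtt(w: float) -> float:
    """Packets a(w) added to the window each roundtrip without congestion."""
    if w <= LOW_WINDOW:
        return 1.0
    drop_rate = 0.078 / w ** 1.2          # loss rate on the high-speed response curve
    b = decrease_factor(w)
    return w * w * drop_rate * 2.0 * b / (2.0 - b)


def rtts_to_recover(w: float) -> int:
    """Roundtrips needed to climb back to w after a single congestion event."""
    window = w * (1.0 - decrease_factor(w))
    count = 0
    while window < w:
        window += increase_per_rtt(window)
        count += 1
    return count


if __name__ == "__main__":
    for w in (20, 8_000, 83_000):
        print(f"w={w:>6}: +{increase_per_rtt(w):.1f} packets per roundtrip, "
              f"-{decrease_factor(w):.0%} on congestion, "
              f"{rtts_to_recover(w)} roundtrips to recover")
```

Under these assumptions, a window of 8 000 packets recovers in well under a hundred roundtrips rather than the 4 000 that standard TCP needs, while a small window behaves exactly as it would under normal TCP.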
Not only does this approach enable full utilisation of high-speed WAN links, but it also does so without losing or compromising any of the familiar and essential benefits of TCP, including safe congestion control.
Better yet, this works even when high-speed TCP connections share WAN links with "normal" TCP connections.
In a nutshell, the secret to achieving blistering performance of TCP-based storage replication traffic is no secret at all: high-speed TCP is open and public and it can be readily deployed for SAN-based storage mirroring, replication and disaster recovery. Distance and widely separated sites no longer create networking performance problems.
By using high-speed TCP, the LFN connecting sites separated by thousands of miles can achieve performance levels similar to what is possible when connecting neighbouring sites. In essence, network problems affecting storage backup across widely dispersed sites no longer need to be an obstacle. Network managers can now offer a solution that is more attractive to the CFO than expensive bandwidth upgrades, while providing complete business continuity.