The Sad Story of TCP Fast Open
11th Apr 2019
If there’s a way to make something fast, you’ve got my attention. Especially when there’s a way to make a lot of things fast with one simple change – and that’s exactly what TCP Fast Open (TFO) promises to do.
TFO (RFC 7413) started out in 2011 as a way to eliminate one of the round trips involved in opening a TCP connection. Since TCP (Transmission Control Protocol) is the underlying fundamental technology used for almost all connections on the Internet, an improvement to TCP connection establishment performance would improve the performance of virtually everything, especially web browsing. In early testing discussed at the 2011 Linux Plumbers Conference, Google found that TFO reduced page load times by 4-40%. Because of the elimination of a round trip, the slowest, highest latency connections would benefit the most – TFO promised to be a great improvement for many users.
Given the immediate success, support for this performance improving technology rapidly grew. In 2012, Linux 3.7 gained support for client and server TFO. In 2013, Android gained support when KitKat (4.4) was released using the Linux 3.10 kernel. In 2015, iOS gained support. In 2016, Windows 10 got support in the Anniversary Update. Even load balancers, such as F5, added support.
And yet, today, not one browser uses it by default: Chrome, Firefox, and Edge all ship with TFO disabled.
So, what happened to this technology that once sounded so promising?
Initial Optimism Meets Hard Reality
I attribute the failure to achieve widespread adoption of TCP Fast Open to four factors:
- Imperfect initial planning
- Middleboxes
- Tracking concerns
- Other performance improvements
Factor 1: Imperfect Initial Planning
TCP Fast Open was in trouble from its initial conception. Because it involves changes to the operating system, it had to be done perfectly from the very beginning. Operating systems have long lifespans – updates happen slowly, backwards compatibility is paramount, and changes are, rightfully so, difficult to make. So, when the TFO specification wasn’t perfect the first time, it was a major blow to the chances of its ever achieving widespread adoption.
Option Kind Numbers identify which features are in use during a TCP connection, and TFO, as a new feature, requires a new, dedicated TCP Option Kind Number. Since TFO was experimental when it started out, it used a number (254 with magic 0xF989) from the experimental allocation. This was quickly ingrained into Windows, iOS, Linux, and more. As the saying goes, “nothing is as permanent as a temporary solution.”
So, when TFO was assigned a permanent option number in RFC 7413, the RFC stated that all existing implementations should migrate to the new option. Or, in more complex terms, “Existing implementations that are using experimental option 254 per [RFC6994] with magic number 0xF989 (16 bits) as allocated in the IANA “TCP Experimental Option Experiment Identifiers (TCP ExIDs)” registry by this document, should migrate to use this new option (34) by default.”
Did all implementations migrate? If they did, they would lose compatibility with those that didn’t migrate. Therefore, all systems must now support both the experimental TCP Option Kind Number and the permanent one.
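To make the "support both" burden concrete, here is a minimal sketch of what a parser has to do when it walks the options field of a TCP header. The function name find_tfo_cookie is made up for illustration and is not taken from any real stack; the layout it assumes is the standard kind/length/value encoding, with the experimental form carrying the 0xF989 identifier ahead of the cookie.

```python
from typing import Optional

TFO_KIND = 34             # permanent option kind from RFC 7413
EXPERIMENTAL_KIND = 254   # shared experimental option kind (RFC 6994)
TFO_MAGIC = b"\xf9\x89"   # 16-bit identifier used by early TFO deployments


def find_tfo_cookie(options: bytes) -> Optional[bytes]:
    """Return the TFO cookie found in a raw TCP options field.

    Returns b"" for an empty cookie (a cookie request) and None when
    no TFO option is present in either encoding.
    """
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:          # End of Option List
            break
        if kind == 1:          # No-Operation padding, a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break              # truncated option, stop parsing
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break              # malformed length, stop parsing
        body = options[i + 2:i + length]
        if kind == TFO_KIND:
            return body
        if kind == EXPERIMENTAL_KIND and body[:2] == TFO_MAGIC:
            return body[2:]    # strip the identifier, keep the cookie
        i += length
    return None
```

Both encodings yield the same cookie bytes, which is exactly why neither can be dropped while older peers are still around.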
This issue isn’t a deal breaker – but it certainly wasn’t a great way to start out.
Factor 2: Middleboxes
Middleboxes are the appliances that sit between the end user and the server they’re trying to reach. They’re firewalls, proxies, routers, caches, security devices, and more. They’re rarely updated, very expensive, and run proprietary software. Middleboxes are, in short, why almost everything runs over HTTP today and not other protocols as the original design for the Internet envisioned.
The first sign of trouble appeared in the initial report from Google in 2011 regarding TFO. As reported by LWN, “about 5% of the systems on the net will drop SYN packets containing unknown options or data. There is little to be done in this situation; TCP fast open simply will not work. The client must thus remember cases where the fast-open SYN packet did not get through and just use ordinary opens in the future.”
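The "remember and fall back" behaviour described in that quote might look roughly like the sketch below. The class name TfoFailureCache and the retry_after back-off are my own assumptions for illustration; real clients keep this state inside the kernel or the browser and choose their own policies.

```python
import time
from typing import Callable, Dict


class TfoFailureCache:
    """Remembers destinations where a fast-open SYN appeared to be dropped."""

    def __init__(self, retry_after: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._retry_after = retry_after   # hypothetical back-off, in seconds
        self._clock = clock
        self._failed_at: Dict[str, float] = {}

    def record_failure(self, host: str) -> None:
        """Note that a TFO attempt to this host did not get through."""
        self._failed_at[host] = self._clock()

    def should_try_tfo(self, host: str) -> bool:
        """Return False while the host is still inside its back-off window."""
        failed = self._failed_at.get(host)
        if failed is None:
            return True
        if self._clock() - failed >= self._retry_after:
            del self._failed_at[host]     # back-off expired, try again
            return True
        return False
```

A client would call should_try_tfo before connecting and record_failure whenever it had to retry with an ordinary open.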
Over the years, Google and Mozilla did more testing and eventually found that TFO wasn’t beneficial. Clients that initiate TFO connections encounter failures, and have to retry without TFO, so often that, on average, TFO costs more time than it saves. In some networks, TFO never works – for example, China Mobile’s firewall consistently fails to accept TFO, so every connection has to be retried without it, and TFO ends up increasing round trips.
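A back-of-the-envelope model shows how a modest failure rate can wipe out the gain. This is only a sketch: the function names are mine, and the retry penalty is an assumption (I use one second, the common initial TCP retransmission timeout), not a figure from the Google or Mozilla measurements.

```python
def expected_saving(rtt: float, failure_rate: float, retry_penalty: float) -> float:
    """Average seconds saved per connection by attempting TFO.

    A success saves one round trip; a failure costs the wait before the
    client retries without TFO. A negative result means TFO loses on average.
    """
    return (1.0 - failure_rate) * rtt - failure_rate * retry_penalty


def break_even_failure_rate(rtt: float, retry_penalty: float) -> float:
    """Failure rate at which attempting TFO stops paying off."""
    return rtt / (rtt + retry_penalty)


if __name__ == "__main__":
    # Assumed values: 20 ms round trip, 1 s wait before the retry.
    print(expected_saving(rtt=0.02, failure_rate=0.05, retry_penalty=1.0))
    print(break_even_failure_rate(rtt=0.02, retry_penalty=1.0))
```

Under these assumed numbers the break-even failure rate is about 2%, so a 5% drop rate already makes TFO a net loss on a low-latency path, while a high-latency path tolerates more failures.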
Middleboxes are probably the fatal blow for TFO: the existing devices won’t be replaced for (many) years, and the new replacement devices may have the same problems.
Factor 3: Tracking Concerns
When a client makes an initial connection to a host, TFO negotiates a unique random number called a cookie; on subsequent connections to the same host, the client uses the cookie to eliminate one round trip. Using this unique cookie allows servers using TFO to track users. For example, if a user browses to a site, then opens an incognito window and goes to the same site, the same TFO cookie would be used in both windows. Furthermore, if a user goes to a site at work, then uses the same browser to visit that site from a coffee shop, the same TFO cookie would be used in both cases, allowing the site to know it’s the same user.
In 2011, tracking by governments and corporations wasn’t nearly as much of a concern as it is today. It would still be two years before Edward Snowden would release documents describing the US government’s massive surveillance programs.
But, in 2019, tracking concerns are real. TFO’s potential to be used for user tracking makes it unacceptable for most use cases.
One way to mitigate tracking concerns would be to clear the TFO cookie cache whenever the active network changes. Operating systems (Windows, Linux, macOS, FreeBSD, etc.) should consider clearing their TFO cookie cache when changing networks. See this discussion on curl’s issue tracker for more.
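As a rough illustration of that mitigation, here is a sketch of a client-side cookie cache that forgets everything when the network changes. TfoCookieCache and the network_id label are hypothetical names of mine; in real systems the cache lives in the kernel, and how a network change is detected is an open design choice.

```python
from typing import Dict, Optional


class TfoCookieCache:
    """Client-side cookie store that is emptied on every network change."""

    def __init__(self) -> None:
        self._network_id: Optional[str] = None
        self._cookies: Dict[str, bytes] = {}

    def set_network(self, network_id: str) -> None:
        # network_id is a hypothetical label for the active network,
        # for example a gateway address or a Wi-Fi SSID.
        if network_id != self._network_id:
            self._cookies.clear()
            self._network_id = network_id

    def store(self, server_ip: str, cookie: bytes) -> None:
        """Remember the cookie a server handed out on this network."""
        self._cookies[server_ip] = cookie

    def lookup(self, server_ip: str) -> Optional[bytes]:
        """Return the cached cookie, or None if a full handshake is needed."""
        return self._cookies.get(server_ip)
```

This only addresses the work-versus-coffee-shop case. The incognito case would additionally need the cache to be partitioned per browsing context, which a cache shared across the whole OS cannot do on its own.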
Factor 4: Other Performance Improvements
When TFO started out, HTTP/2 was not yet in use – in fact, HTTP/2’s precursor, SPDY, didn’t even have a draft until 2012. With HTTP/1, a client would make many connections to the same server to make parallel requests. With HTTP/2, clients can make parallel requests over the same TCP connection. Therefore, since it sets up far fewer TCP connections, HTTP/2 benefits much less than HTTP/1 from TFO.
HTTP/3 plans to use UDP (User Datagram Protocol) to reduce connection setup round trips, gaining the same performance advantage as TFO but without its problems. UDP is a fundamental Internet protocol like TCP but originally designed for one-way, connectionless communication. HTTP/3 will build a highly optimized connection establishment system logically analogous to TCP on top of UDP. The end result will be faster than what TCP can do even with TFO.
TLS (Transport Layer Security) 1.3 offers another improvement that reduces round trips, called 0-RTT. TLS is the system used for encryption in HTTPS, so improvements to TLS potentially improve all users of HTTPS.
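To put the round trip savings side by side, here is a simplified count of the round trips spent on connection setup before the first HTTP request can be sent. It is a sketch that looks only at handshakes (the TCP handshake, full TLS 1.2 and 1.3 handshakes, and 0-RTT resumption) and ignores everything else that affects real-world performance; the function and constant names are mine.

```python
TCP_HANDSHAKE = 1   # SYN / SYN-ACK before data can flow
TLS12_FULL = 2      # full TLS 1.2 handshake
TLS13_FULL = 1      # full TLS 1.3 handshake
TLS13_0RTT = 0      # TLS 1.3 resumption with early data


def setup_round_trips(tls_round_trips: int, tfo: bool = False) -> int:
    """Round trips spent on setup before the first HTTP request is sent."""
    # With TFO the TLS ClientHello rides in the SYN, so the TCP
    # handshake no longer costs a round trip of its own.
    transport = 0 if tfo else TCP_HANDSHAKE
    return transport + tls_round_trips


SCENARIOS = {
    "TCP + TLS 1.2": setup_round_trips(TLS12_FULL),
    "TCP + TLS 1.3": setup_round_trips(TLS13_FULL),
    "TCP + TLS 1.3 0-RTT": setup_round_trips(TLS13_0RTT),
    "TFO + TLS 1.3": setup_round_trips(TLS13_FULL, tfo=True),
    "TFO + TLS 1.3 0-RTT": setup_round_trips(TLS13_0RTT, tfo=True),
    # QUIC folds the transport and TLS 1.3 handshakes together.
    "HTTP/3 (QUIC), new": 1,
    "HTTP/3 (QUIC), resumed": 0,
}

for name, round_trips in SCENARIOS.items():
    print(f"{name}: {round_trips}")
```

The point of the comparison is that moving from TLS 1.2 to TLS 1.3, or to HTTP/3, removes round trips without relying on TFO getting through every middlebox on the path.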
In the end, performance has been improving without requiring TFO’s drawbacks/costs.
The Future of TFO
TFO may never be universally used, but it still has its place. The best use case for TFO is with relatively new clients and servers, connected by a network using either no middleboxes or only middleboxes that don’t interfere with TFO, when user tracking isn’t a concern.
Domain Name System (DNS) is such a use case. DNS is how software (such as a browser) resolves human-readable names (such as integralblue.com) to an IP (Internet Protocol) address to which the computer can connect. DNS is very latency sensitive and very frequently used – eliminating the latency from one round trip would give a perceivable improvement to users. The same TCP connections are made from the same clients to the same servers repeatedly, which is TFO’s best case scenario. And, there’s no tracking concern since many DNS clients and servers don’t move around (there’s no “incognito” mode for DNS). Stubby, Unbound, dnsmasq, BIND, and PowerDNS, for example, include or are currently working on support for TFO.
TFO is already supported by all the major operating systems, so it is here to stay. The question is, will it ever see widespread real-world adoption? In the short term, TFO will be adopted in specialty areas such as DNS. But eventually, perhaps in the 5 or 10 year time frame, TFO will see widespread adoption on the Web as troublesome middleboxes are gradually replaced, allowing browsers to enable support for it. By that time, however, HTTP/3 will be widely deployed, offering a better performance improvement than TFO could ever offer for the use case of web browsing.
In the end, TFO is an interesting example of an idea and its implementation being challenged (due to technical limitations and a changing political landscape), resulting in its inability to live up to expectations. However, like so many other technologies, the original implementation isn’t what matters most – the idea is the significant part. In the case of TFO, the performance-improving concept behind it will benefit users for years to come as it influences future technologies such as HTTP/3.