Meet Gain — the New Fastest Go TCP Framework | by Paweł Gaczyński | Jul, 2022

A look at my open-source project

In the beginning, I planned one very long article about my high-performance TCP framework, Gain, to be published at the same time as the framework's release on GitHub. But while working on Gain, I realized how many exciting things I would like to mention. Therefore I decided to publish a series of articles instead. This text is the first, introductory part. It covers the genesis of Gain, some background, and my design decisions. At the end, there are benchmark results.

Let’s start with the genesis. Developing high-performance services has always been one of my areas of interest. A couple of years ago, I made some heavy optimizations in one of my company’s backend systems.

Long story short: I reduced CPU load by more than 100 times (!) in some scenarios and average memory consumption by more than 10 times. I spent a lot of time figuring out how to improve our system that way.

Unfortunately, there were some technical limitations (more than one of them very annoying) that I could not change, and technical debt that could not be repaid. Those two things made it impossible to adopt some well-known solutions; the only way forward was a tailor-made system.

I prepared a proof of concept in my free time (software development is not only a job for me but also a passion) and then showed the benchmark results to people in my company. It was a quick decision: my design was approved, and after a month or two the new version of our system went into production.

Besides valuable knowledge and experience, that project taught me two things: performance optimization can be quite challenging, and it can be great fun. Since I like challenges and now have much more experience (not only with high-level network application programming but also with low-level networking), I decided to test my present skills years after that project. I had a couple of ideas, but the winner was developing a TCP (and soon UDP; hello, HTTP/3) framework that beats the others in terms of performance.

I am sure every programmer interested in high-performance TCP/HTTP applications knows TechEmpower Benchmark.

For those who don’t: it is a performance comparison of many web application frameworks executing fundamental tasks such as JSON serialization, database access, and server-side template composition. I know, benchmark results can be somewhat misleading.

Real-world cases can differ greatly from those tested in a benchmark. In addition, benchmarks often measure peak performance, which is usually not the most essential property in commercial projects; low CPU usage, low memory consumption, and very low latency often matter more.
This is a personal challenge, and the main goal is to check whether there is a chance to develop a faster framework than those tested in the TE benchmark. For this series of articles and for measuring my framework's performance, I’ve focused on the “Plaintext” test type. It is a simple case where the HTTP server responds with a small plain-text body. A description of it is here.
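For reference, a Plaintext round trip looks roughly like this (the response body is the fixed “Hello, World!” string; the exact headers below are illustrative and vary by server):

```
GET /plaintext HTTP/1.1
Host: server
Connection: keep-alive

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 13
Server: example
Date: Mon, 11 Jul 2022 00:00:00 GMT

Hello, World!
```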

When you look at one of the latest runs of the TE benchmark (HERE), you can see that a couple of frameworks have almost the same results. So there is no sense in benchmarking all the top frameworks. Therefore I’ve chosen two competitors:

  • FaF — because it is the fastest one. Written in Rust. As the author writes, it has one goal: “to demonstrate the upper bound of single-node performance while remaining usable in a production setting.” When you look at the source code, you can see that the author made heavy optimizations wherever possible.
  • gnet — the fastest TCP/UDP framework written in Go. It can run in two modes: reactor mode and reuse-port mode. In the first, one thread (the acceptor) is responsible for accepting connections and forwarding them to separate threads (the reactors). In the second, there is no acceptor; instead, all threads share the same responsibilities: each accepts connections and reads/writes data.

Those frameworks were built on top of the Linux epoll API. I will not describe how it works; many articles already do it probably much better than I would.

For this article, the most important thing to know is that epoll is a fast solution, but it has some caveats, especially in the era of Spectre/Meltdown mitigations, when system calls are not as cheap as they used to be. Therefore another API was created: io_uring. It is a relatively new interface for asynchronous I/O. A short introduction to how it works: there are two queues, implemented as ring buffers shared between user space and the kernel:

  • Submission Queue (SQ),
  • Completion Queue (CQ).

The application populates the submission queue with one or more SQ entries (SQEs). The kernel then consumes the requests and posts a completion event for each one it finishes. Completion events are always associated with specific SQEs but may arrive in any order. If you want to learn more about io_uring, there are many great resources online.

From the beginning, I considered two programming languages: Go and Rust. Both are modern, very fast languages, and both have their pros and cons. Rust is generally faster; its performance is on par with C++. It also guarantees memory safety in an interesting way.

There is a feature called ownership: a set of compiler rules. If any of the following rules is violated, the program won’t compile:

  1. Each value in Rust has a variable that’s called its owner.
  2. There can only be one owner at a time.
  3. When the owner goes out of scope, the value will be dropped.

Golang uses a more popular solution for the same purpose: a garbage collector. Because the GC works at runtime, it has an impact on program performance. Many techniques can help reduce the GC’s overhead, but freeing memory is still slower than in Rust. There is another significant difference between the two languages. Rust can bind to native libraries almost without pain, and there is no inherent overhead when calling C code from Rust. Why is that important? Because the main userspace implementation of the io_uring API is written in C. It is called liburing and is maintained by Jens Axboe, the author of io_uring. Go, on the other hand, has a feature called CGO. Unfortunately, it is not as fast as you might expect: calling C code from Go is slow.

So, the choice of language seemed simple enough. Not necessarily. It would seem that using Rust with the liburing library was the best option, but in this particular case the main goal was to squeeze as much as possible out of Linux and io_uring. For that, you need deep knowledge of the programming language of your choice and a thorough understanding of kernel-side I/O mechanisms. I have much more experience with Go than with Rust, so in that respect Go is the better choice. What about the io_uring API mentioned above, written in C? CGO would not pass the test, so the solution may be… writing an analogous API in Go. Go users have the low-level golang.org/x/sys/unix package, containing functions for making syscalls, and the unsafe package, which allows, among other things, managing memory manually. I went through the liburing code carefully and was confident that these two packages were enough. Since this is not a commercial project, there was no need to save man-hours, and I could take the longer path. Writing the library in Go helped me understand io_uring much better, and in retrospect, I think it was the right decision.

Hardware:

  • Server: AWS m6i.xlarge instance, 4 vCPU, 16 GB RAM
  • Client: AWS m6i.2xlarge instance, 8 vCPU, 32 GB RAM, located in the same availability zone in a cluster placement group as the server

Software:

  • Ubuntu 22.04 LTS
  • Kernel: 5.15.0-1004-aws
  • Go 1.18

Configuration:

  • 512 connections
  • 8 threads, 1 per vCPU
  • 5 seconds warm-up, then three test runs, 10s each

I ran the benchmark three times for each framework. The results were very similar across runs; below you can see the best ones.

Gnet

ubuntu@ip-xx-xx-xx-xx:~$  wrk -t8 -c512 -d10s http://3.68.71.17:8080/plaintext  
Running 10s test @ http://3.68.71.17:8080/plaintext
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.45ms 544.73us 23.69ms 90.78%
Req/Sec 44.76k 3.50k 94.66k 97.64%
3579707 requests in 10.10s, 440.39MB read
Requests/sec: 354443.98
Transfer/sec: 43.61MB

FaF

ubuntu@ip-xx-xx-xx-xx:~$ wrk -t8 -c512 -d10s http://3.68.71.17:8089/plaintext  
Running 10s test @ http://3.68.71.17:8089/plaintext
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.30ms 0.91ms 72.42ms 94.50%
Req/Sec 50.14k 5.43k 167.49k 98.63%
4005379 requests in 10.10s, 481.30MB read
Requests/sec: 396598.02
Transfer/sec: 47.66MB

Gain

ubuntu@ip-xx-xx-xx-xx:~$ wrk -t8 -c512 -d10s http://3.68.71.17:8765/plaintext  
Running 10s test @ http://3.68.71.17:8765/plaintext
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.08ms 253.02us 33.56ms 96.90%
Req/Sec 58.16k 3.12k 138.32k 97.88%
4634736 requests in 10.10s, 570.18MB read
Requests/sec: 458908.65
Transfer/sec: 56.46MB

The bar chart below compares the number of requests per second for each framework.

As you can see, Gain beats even the fastest server in the TE benchmark. Its result is almost 30% better than gnet’s and nearly 16% better than FaF’s. Perhaps it is not outstanding, but we are talking about a framework based on io_uring, which should become even faster in the near future, and Gain should benefit from that. Big credit goes to Jens Axboe and all the io_uring contributors; they’ve already done a great job and are constantly working on new improvements. Because my work on Gain is still in progress and there are a couple of things I want to improve, it is not available yet, but it will be released soon.

Moreover, new Linux kernels have been released with some exciting features (especially versions 5.18 and 5.19), along with a new liburing version (2.2). So before the release, I also want to check performance on those new kernels and (maybe) implement new io_uring features. In the next article, I will describe Gain’s architecture and explain the performance optimizations, so stay tuned!
