The Internals of Asynchronous Programming | by John Pradeep Vincent | Jul, 2022

With a Kotlin implementation example


Most programming languages like Go, Rust, and Kotlin provide very "lightweight" thread abstractions.

These abstractions allow you to spawn thousands of lightweight threads without incurring any operating system context-switch overhead!

Although these languages use different semantics to provide such lightweight threads, the underlying building blocks and the implementation patterns of such a framework are very similar.

I will try to provide a high-level conceptual overview of the underlying implementation so you can develop an intuition for such programming models and understand that most of these languages could, in theory, provide the same level of performance guarantees.

For the sake of this blog, I am going to use the term "xroutine" to identify an individual unit of computation (aka lightweight thread) — Golang calls it a "goroutine", Kotlin calls it a "coroutine", and Rust just refers to this individual unit of computation as a "Future", but the goal and the implementation pattern of these primitives are very similar.

High Level Design

At a high level, the major components are the Executor, the Reactor, and the XRoutine. Let's look at each of these components in detail and see how they work together to provide highly performant lightweight threads.

Reactor

The main responsibility of this component is to monitor the file descriptors!

It is important to note that the only reason languages like Rust, Go, Kotlin, Node.js, etc. can provide lightweight threads is that the operating system provides very efficient system calls to perform nonblocking IO operations.

In the past, every time you do a system call to read or write from a file descriptor, your calling thread may be blocked. This is inefficient as the operating system needs to do a context switch (which is very expensive) to schedule other threads.

Modern operating systems, however, provide system calls like select and epoll that enable us to perform IO operations without blocking the calling thread!

You can read the API spec to learn more about how epoll works, but essentially it provides a way to request an IO operation and then poll a bunch of file descriptors to check for activity. The API is designed so that a single thread can poll hundreds or thousands of file descriptors, which is exactly what a reactor does!

The reactor’s job is to poll/watch the file descriptors – when the reactor notices that IO can now be performed on a specific file descriptor, it finds the corresponding XRoutine which is waiting for this specific descriptor and moves it to the Ready Queue for execution!

You may notice that no operating system threads are blocked on the file descriptors. The only thread that deals with the IO is the Reactor, which polls the descriptors (without blocking) and moves an XRoutine from the waiting queue to the ready queue whenever its file descriptor is ready to be used.
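The reactor loop described above can be sketched as follows. This is a minimal, purely illustrative model: `XRoutine`, `Reactor`, and `pollReadyFds` are hypothetical names, and `pollReadyFds` stands in for a real non-blocking `epoll_wait` call.

```kotlin
// Hypothetical types for illustration only; a real runtime wraps
// epoll/kqueue and keeps richer per-descriptor state.
class XRoutine(val id: Int)

class Reactor(private val readyQueue: ArrayDeque<XRoutine>) {
    // Maps a file descriptor to the xroutine waiting on it.
    private val waiting = mutableMapOf<Int, XRoutine>()

    fun registerWaiter(fd: Int, routine: XRoutine) {
        waiting[fd] = routine
    }

    // One iteration of the reactor loop. `pollReadyFds` stands in for a
    // non-blocking epoll_wait that reports which descriptors are ready.
    fun tick(pollReadyFds: () -> List<Int>) {
        for (fd in pollReadyFds()) {
            // Wake the waiting xroutine by moving it to the ready queue.
            waiting.remove(fd)?.let { readyQueue.addLast(it) }
        }
    }
}
```

The key point the sketch captures is that `tick` never blocks: it only inspects readiness and shuffles xroutines between queues.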

Executor

The main responsibility of an Executor is to run/execute the XRoutines present in the "ready queue". It is composed of multiple schedulers (usually one per CPU core) that handle the load using the well-known worker pattern.

In order for the executor to execute an XRoutine, it uses a fixed API contract. Simplified pseudocode of an executor running an XRoutine is shown below:
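Here is one possible sketch of that contract in Kotlin. The names (`Status`, `Context`, `XRoutineFn`, `Executor`) are illustrative assumptions, not any language's actual runtime API: an xroutine is modeled as a callable that receives a context and reports whether it completed or blocked.

```kotlin
enum class Status { COMPLETED, BLOCKED }

class Context(var suspensionPoint: String = "initial")

// The fixed contract: an xroutine is callable with its context and
// reports whether it finished or must wait for IO.
typealias XRoutineFn = (Context) -> Status

class Executor {
    private val readyQueue = ArrayDeque<Pair<XRoutineFn, Context>>()
    val blockedQueue = mutableListOf<Pair<XRoutineFn, Context>>()

    fun submit(routine: XRoutineFn, ctx: Context = Context()) {
        readyQueue.addLast(routine to ctx)
    }

    // Drain the ready queue once. Blocked xroutines are parked aside;
    // in a full runtime the Reactor would move them back when their
    // file descriptor becomes ready.
    fun runOnce() {
        while (readyQueue.isNotEmpty()) {
            val (routine, ctx) = readyQueue.removeFirst()
            when (routine(ctx)) {
                Status.COMPLETED -> { /* finished, drop it */ }
                Status.BLOCKED -> blockedQueue.add(routine to ctx)
            }
        }
    }
}
```

Note that the executor never waits on IO itself; it only interprets the status each xroutine returns.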

The important element of the contract is the "context": when the executor calls an xroutine, it passes the context along with any other parameters the xroutine requires.

If the xroutine needs to wait for a file descriptor, it returns a value ("Blocked") that indicates to the executor that the xroutine is blocked. This enables the executor to add the xroutine to a separate queue and continue executing other xroutines from the ready queue.

When an xroutine calls another xroutine, the context gets propagated all the way to the Reactor, which is responsible for monitoring the IO file descriptor and moving (waking up) the initial xroutine to the ready queue.

XRoutine

You may have noticed in the previous section that every time the Reactor notices that a file descriptor is ready to be used, it follows the context propagation tree and moves the very first xroutine to the ready queue so that it gets picked up by the Executor.

This means that every xroutine should “resume” from exactly where it was stopped before!

This brings a new challenge: an xroutine cannot be just a simple function. It should be a smart state machine that saves its state to the context and uses it to resume its execution exactly from where it was blocked.

Let’s take an example of the below function:
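A function along these lines will serve as the running example. The `Logger` interface here is a hypothetical stand-in for any IO-backed sink (a socket, pipe, or file) whose write may have to wait.

```kotlin
// `Logger` is a hypothetical interface standing in for an IO-backed
// sink whose write may block on a file descriptor.
interface Logger {
    fun log(message: String)
}

fun incrementTwice(logger: Logger, value: Int): Int {
    var result = value
    result += 1                              // first increment
    logger.log("incremented once: $result")  // potentially blocking IO
    result += 1                              // second increment
    return result
}
```

The interesting part is the `logger.log(...)` call in the middle: that is exactly where the function may need to suspend and later resume.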

If you want to treat the function incrementTwice as an xroutine, it should be smart and stateful, so that calling/executing it multiple times will just resume its execution from the point where it was previously blocked.

This is where the compiler helps us: just by annotating the function with a language-specific keyword/annotation (like "go" in Go, "async" in Rust, and "suspend" in Kotlin), the compiler transforms the function into a state machine that handles the resumption logic.

The pseudocode of the generated state machine may look like below:
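A sketch of such a generated state machine, under the same illustrative assumptions as before (a `Status` result, a `Context` carrying the `suspensionPoint`, and a hypothetical `AsyncLogger` whose `log` returns `BLOCKED` while its descriptor is not writable):

```kotlin
enum class Status { COMPLETED, BLOCKED }

class Context(var suspensionPoint: String = "initial", var result: Int = 0)

// `AsyncLogger` stands in for an IO-backed logger whose `log` returns
// BLOCKED while the underlying file descriptor is not yet writable.
interface AsyncLogger {
    fun log(message: String, ctx: Context): Status
}

// A sketch of what the compiler might generate from incrementTwice: a
// state machine that resumes from the suspension point saved in the
// context instead of restarting from the top.
fun incrementTwiceStateMachine(ctx: Context, logger: AsyncLogger): Status {
    if (ctx.suspensionPoint == "initial") {
        ctx.result += 1
        // The context is propagated into the logger (and, in a real
        // runtime, down to the Reactor watching the descriptor).
        val status = logger.log("incremented once: ${ctx.result}", ctx)
        ctx.suspensionPoint = "final" // resume here on the next call
        if (status == Status.BLOCKED) return Status.BLOCKED
    }
    // The "final" block: the second increment, after the log completed.
    ctx.result += 1
    return Status.COMPLETED
}
```

On the first call the function runs the initial block and, if the logger blocks, saves `"final"` into the context and returns `BLOCKED`; on the next call it skips straight past the log and finishes.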

As you can see in the pseudocode, the compiler will transform the original function into a stateful function which, when called, jumps to a specific block in the code based on the value of suspensionPoint present in the context.

When it's called for the first time, it runs the initial block, where it increments the result by 1. When it then calls logger.log(..), it notices that the underlying logger is blocked, so it saves the suspensionPoint as "final" and returns the Blocked response for the upstream code to handle similarly. Also notice that the context is propagated to the logger, which eventually passes it to the Reactor; the Reactor uses it to move the initial xroutine to the ready queue.

When the executor picks the xroutine up from the ready queue and calls it a second time, the xroutine uses the "suspensionPoint" value from the context, jumps straight to the final block, and resumes its execution exactly from where it was suspended in the previous execution.

Just by adding some compiler keywords, we now have an xroutine, the smallest unit of execution that can be executed, blocked, and resumed without any operating system context switches!

Conclusion

In summary, all the aforementioned languages use a very similar design to support the XRoutine (a coroutine in Kotlin, a goroutine in Golang, and a Future in Rust).

The XRoutine is just a compiler-generated state machine; xroutines propagate the context from one to another all the way to the Reactor, the heart of the runtime, which monitors the file descriptors, determines the specific XRoutine to be rescheduled, and moves it from the wait queue to the ready queue.

Since each XRoutine is nothing but an object in memory, you can spawn thousands of XRoutines without adding any operating system context switch overheads!
