I Forked “Asyncpg” — And It Parses Database Records to NumPy 20x Faster

by Vadim Markovtsev, June 2022

Presenting the fork of asyncpg — an asynchronous PostgreSQL client for Python — based on NumPy structured arrays


I hacked asyncpg — an asyncio PostgreSQL client library — so that it parses SELECT-ed records from the low-level PostgreSQL protocol directly into NumPy structured arrays without materializing Python objects, and thus avoided most of the overhead.

It worked up to 3x faster in wall time and up to 20x faster in CPU time; it doesn’t cast spells to accelerate the DB server, though. The repository to star is athenianco/asyncpg-rkt.

More and more emerging databases choose to speak the PostgreSQL wire protocol, e.g., Cockroach or Crate. Granted, the protocol is simple: a PoC facade server in Python takes less than 200 lines. Such a big family of compatible DBs calls for an efficient Python client library. An analytics-first client library.

In one of my previous blog posts, I noticed how inefficient the transition from PostgreSQL response to pandas DataFrame was. Let me remind you of the code:

pd.DataFrame.from_records(await connection.fetch("SELECT ..."))

We perform many redundant steps underneath:

  1. Parse PostgreSQL wire protocol and create Python objects.
  2. Insert those Python objects into created asyncpg.Record-s.
  3. Iterate rows and insert Python objects into NumPy arrays of object dtype.
  4. Infer better dtypes like int64, datetime64, etc., and convert Python objects.
  5. Construct the dataframe.

However, we know the data types of the returned columns beforehand and could do a ton better:

  1. Parse PostgreSQL wire protocol to typed NumPy arrays.
  2. Construct the dataframe.

Profiling indicated clear bottlenecks in materializing millions of Python objects only to convert them back to the exact same in-memory representation as the PostgreSQL server sent, ignoring the endianness. Every time we copy an object array in Pandas, we increment and decrement the reference counters of each object, which hammers the final nail in the performance coffin.

Unfortunately, the part of asyncpg responsible for building the array of returned asyncpg.Record-s is written in Cython and cannot be easily customized. I had to fork.

I considered the following requirements:

  • It must be a drop-in replacement. Don’t break the existing user code.
  • Graceful degradation: fall back to Python objects when a column type is an object (e.g., JSON).
  • Handle nulls well. Not all built-in NumPy dtypes support a null-like value such as NaN or NaT, so we must return the positions of nulls.
  • No extra dependencies except NumPy.
  • The best performance that I can arrange.

How to leverage the new superpowers

Install asyncpg-rkt as usual: python -m pip install asyncpg-rkt

We call set_query_dtype() to augment the query string with the result dtype description. The dtype has to be structured. The usage of the returned array doesn’t differ much from the original list of asyncpg.Record-s:
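Here is a minimal sketch of the call pattern. The import path of set_query_dtype and the table/column names are my placeholders; the shape of the result — a structured array plus null indexes — is what the fork actually returns:

import numpy as np
from asyncpg import connect, set_query_dtype  # assumption: exported at the top level

# Describe the result columns with a structured dtype.
dtype = np.dtype([
    ("id", np.int64),
    ("created_at", "datetime64[us]"),
    ("name", "U10"),
])

async def fetch_typed():
    conn = await connect(dsn="postgresql://localhost:5432")  # hypothetical DSN
    # The augmented query makes fetch() return (structured array, null indexes)
    # instead of a list of asyncpg.Record-s.
    arr, nulls = await conn.fetch(
        set_query_dtype("SELECT id, created_at, name FROM users", dtype))
    print(arr["created_at"][:5])  # columns are addressed by field name
    return arr, nulls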

nulls are the flat indexes in arr where we fetched nulls:
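For instance, the flat indexes can be mapped back to (row, column) pairs; the row-major flattening, i.e. flat = row * n_columns + column, is my assumption here:

# Hypothetical post-processing of the (arr, nulls) pair from the sketch above.
n_columns = len(arr.dtype.names)
rows, cols = np.divmod(np.asarray(nulls), n_columns)  # assumes row-major flat indexes
for row, col in zip(rows, cols):
    print(f"row {row}: column {arr.dtype.names[col]!r} is NULL")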

Fetching a regular query works the same as in the original library.

Besides, asyncpg-rkt offers an alternative “block” output mode tailored specifically for Pandas.

The returned value is then a 1D object array where each element is a whole column of values:
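Continuing the sketches above, assembling a DataFrame out of that column-oriented result could look like this; I only assume that the columns come back in dtype.names order:

import pandas as pd

async def fetch_block(conn, query, dtype):
    # "Block" mode: fetch() yields a 1D object array of per-column arrays.
    columns, nulls = await conn.fetch(set_query_dtype(query, dtype))
    # Assumption: the columns are ordered like dtype.names.
    return pd.DataFrame(dict(zip(dtype.names, columns))), nulls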

Intended use cases and limitations of the project:

  • The user should care about the performance. In most scenarios, the potential speedup is not ground-shaking, so it’s probably not worth the hassle.
  • SELECT at least ten rows. Otherwise, the benefit is minuscule.
  • The user knows the types of the returned fields in advance. Any data modeling should come in handy here, e.g., SQLAlchemy.
  • Those types are mainly primitive and map to NumPy: integers, floating points, timestamps, bytea, text, uuid, fixed-size geometry.
  • The user knows the maximum length of the returned bytes and strings to specify sufficient S# and U# types. Otherwise, it is possible to fall back to objects, but the performance will degrade.
  • Use asyncpg either directly or through a thin wrapper like morcilla. Async SQLAlchemy 1.4 and encode/databases are going to break because we return a tuple with a NumPy array and null indexes instead of a single list of Record-s; even if those libraries survived, they would repack the result into their own Record-like Python objects and miss the whole point.
  • Conversion of the result to a pandas DataFrame should fly; see the sketch right after this list.
  • Don’t expect magic! This fork will not accelerate the database server.
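For the default row-oriented mode, the DataFrame conversion mentioned above is a one-liner, since pandas understands NumPy structured arrays natively:

import pandas as pd

# arr is the structured array returned by fetch(); every dtype field becomes a column.
df = pd.DataFrame(arr)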

I compared the performance of four different execution modes:

  1. min — executing SELECT 1. This mode sets the absolute minimum query execution time.
  2. dummy — we read data from PostgreSQL and immediately drop it. The client doesn’t do any real work; fetch() always returns None. This mode approximates the server processing time together with the client-server communication. It serves as the ideal, unreachable target.
  3. record — legacy query execution, we return the list of asyncpg.Record objects as usual.
  4. numpy — the new mode: we return the structured NumPy array and the list of null indexes.

My Postgres server was running locally, and I connected to localhost:5432. The query was:

SELECT ... FROM generate_series(1, {length})
  • where length is the number of identical rows to be returned
  • ... stands for hardcoded values: two booleans, five bigints, one float4, four timestamps, two times, two bytea-s of length 16, and two texts of lengths 5 and 10

I measured various length-s from 100 to 50,000. The benchmark heavily utilizes the excellent pytest-benchmark plugin for pytest. I ensured a few warmup iterations to boot the prepared statement properly.
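For reference, here is a stripped-down shape of such a benchmark; the connection and query fixtures are hypothetical, while benchmark.pedantic() and its warmup_rounds argument are genuine pytest-benchmark API:

import asyncio

def test_fetch(benchmark, connection, query):  # `benchmark` comes from pytest-benchmark
    loop = asyncio.new_event_loop()
    benchmark.pedantic(
        lambda: loop.run_until_complete(connection.fetch(query)),
        warmup_rounds=2,  # boots the prepared statement before measuring
        rounds=10,
    )
    loop.close()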

Tukey histograms of the SELECT benchmarks, smaller is better. “dummy” is dropped server response, “record” is returning Record-s in original asyncpg, “numpy” is conversion to NumPy structured array on the fly in asyncpg-rkt. The numbers at the bottom are the numbers of fetched rows. Image by author.

The end-to-end query execution is up to 3x faster in “numpy” mode versus “record.” If we subtract the immutable “dummy” time, the pure CPU time speedup reaches 22x. The standard deviation of the “numpy” execution time is the same as “dummy”, while “record” jumps significantly higher. We win about 120ms on 50k rows, up to 250ms when it fluctuates most violently.

Of course, the real-world speedup will be less than 3x because any practical query spends more time in the PostgreSQL server. I do not sell snake oil. The absolute delta should remain the same, though.

Below are the py-spy --native profiles for comparison. The first is “record”, the second is “numpy”. The “record” profile shows that parsing pgproto into Python’s datetime is a significant bottleneck.

“record” mode profile at 10k rows. We can see a major bottleneck in the datetime manipulation. The other fields are slow, too. “Same time” points at the network entry that takes constant time in every mode. Image by author.
“numpy” mode profile at 10k rows. The profile is uniform, as it should be. “Same time” is wider now: the CPU time decreased a lot. Image by author.

The source code of the benchmark is on GitHub.

“Query augmentation” works by prepending the pickled dtype. The backend code in native asyncpg unpickles it back. That’s not as efficient as passing a reference directly, but I didn’t want to change any public interfaces, even in a backward-compatible way. If asyncpg gets called through a wrapper library, nobody would be eager to plumb the reference passing through it, too.

The price to pay is some deserialization overhead. Given the intended average number of fetched rows, it should be negligible.

The original Cython code is already quite fast. It taught me a few clever tricks with extension classes, like preallocation (@cython.freelist) or disabling the garbage collection (@cython.no_gc). I found only one place for improvement: the frb_check() buffer overrun checks should be inlined directly instead of hoping that Cython does “the right thing.”

That slowdown is a consequence of Cython’s exception-forwarding shenanigans. The pure Python counterpart is twisted in some places, too. I smile every time I have to trace the query flow in asyncpg.Connection and step through:

  • execute()
  • _execute()
  • __execute()
  • _do_execute()
  • executor()

asyncpg-rkt relies on NumPy’s structured arrays. They are fixed-shape C structures with optional alignment. NumPy exposes the structure fields both by index and by name. Internally, NumPy keeps both a plain list and a dictionary mapping field names to the nested dtypes and offsets.

PostgreSQL streams rows to the client, and we don’t know how many we’ll fetch until we fetch them all. It’s impossible to resize a NumPy array dynamically the way Record-s are appended to a list, hence we have to allocate page-aligned, evenly sized chunks and concatenate them in the end.
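A pure-NumPy illustration of that chunking strategy — the real code is Cython, and the chunk size here is my placeholder:

import numpy as np

CHUNK_ROWS = 4096  # placeholder; the fork sizes chunks against the page size

def collect(rows, dtype):
    chunks, chunk, used = [], np.empty(CHUNK_ROWS, dtype), 0
    for row in rows:                      # rows stream in, total count unknown
        if used == CHUNK_ROWS:
            chunks.append(chunk)
            chunk, used = np.empty(CHUNK_ROWS, dtype), 0
        chunk[used] = row
        used += 1
    chunks.append(chunk[:used])
    return np.concatenate(chunks)         # the extra copy discussed below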

Unfortunately, those extra allocations lead to double peak memory consumption and tons of memcpy() from the chunks to the final destination.

PostgreSQL sends timestamps and timedeltas as 64-bit integers with microsecond precision. NumPy allows various units in datetime64 and timedelta64, e.g., days, seconds, nanoseconds.

Therefore we must perform the unit conversion by integer multiplication or division. Profiling indicated that the integer division is too slow. It’s nothing new, of course. Luckily, there is a trick from the same family as Quake III’s fast inverse square root: the header-only C library libdivide, which comes bundled with numpy ≥ 1.21.

I wrapped numpy/libdivide/libdivide.h in Cython, and it worked like a “magic” charm. Another critical step is adjusting zero, aka the epoch. PostgreSQL’s zero is 2000-01-01, and NumPy’s zero is 1970-01-01, so we must add 30 years to each value.
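In NumPy terms, the whole transformation boils down to an epoch shift plus a unit cast; the fork does the equivalent in Cython, with libdivide standing in for the plain integer division:

import numpy as np

# Microseconds between the PostgreSQL epoch (2000-01-01) and the Unix epoch (1970-01-01).
PG_EPOCH_SHIFT_US = np.int64(
    (np.datetime64("2000-01-01") - np.datetime64("1970-01-01")) // np.timedelta64(1, "us")
)

def pg_timestamp_to_datetime64(raw_us: np.ndarray, unit: str = "us") -> np.ndarray:
    # raw_us: int64 microseconds since 2000-01-01, already byte-swapped to native order.
    shifted = raw_us + PG_EPOCH_SHIFT_US
    # Casting to a coarser or finer unit is where the multiplication/division happens.
    return shifted.astype("datetime64[us]").astype(f"datetime64[{unit}]")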

PostgreSQL sends strings in UTF-8. NumPy, on the other hand, expects UCS-4, for the sake of easier vectorization, I guess. We have to recode. CPython exports the relevant functions, but they all expect Python string or bytes objects, which would kill the performance.

I adapted the implementation from MIT-licensed fontconfig: FcUtf8ToUcs4. It works reasonably fast; the algorithm is straightforward. The fork has an option to copy strings directly to S# dtype — great for ASCII-restricted text.
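A pure-Python illustration of the two storage formats; the fork does the recoding in C without ever creating Python strings:

import numpy as np

raw = "naïve".encode("utf-8")                       # what PostgreSQL puts on the wire

ucs4 = np.array([raw.decode("utf-8")], dtype="U8")  # 'U#' stores UCS-4: 4 bytes per code point
byts = np.array([b"ascii"], dtype="S8")             # 'S#' keeps raw bytes: only safe for ASCII

print(ucs4.itemsize, byts.itemsize)                 # 32 vs. 8 bytes per element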

I had to solve the issue with nulls. We have to support them without sacrificing performance. Each time we encounter a null, we append the flat index to a list and write a NaN surrogate to the array chunk.

  • For float32 and float64, that is a genuine NaN.
  • For datetime64 and timedelta64, that is a NaT that behaves similarly.
  • For objects, we write None.
  • For integers, we write the minimum value given the number of bits.
  • For S# we fill with 0xFF.
  • For U# we fill with 0x00.

Obviously, any of the mentioned values can be legitimate, so the only guaranteed check for nulls is to scan the returned null indexes.

What happens if the user sets the wrong dtypes? They should catch a helpful exception:

asyncpg.pgproto.pgproto.DTypeError: dtype[10] =

Staying on the type topic, I am using an undocumented constructor argument of np.dtype to specify the column output mode:

Every dtype carries a metadata attribute that is either None or a read-only (mappingproxy) dictionary set at the construction time.
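This is what the mechanism looks like; the metadata keyword argument is genuine (if undocumented) NumPy, while the key name below is my placeholder:

import numpy as np

dtype = np.dtype(
    [("id", np.int64), ("name", "U10")],
    metadata={"block_mode": True},   # placeholder key; the fork defines its own
)
print(dtype.metadata)  # a read-only mappingproxy({'block_mode': True})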

Regarding the column output, asyncpg-rkt groups columns of the same dtype together and allocates memory in blocks, reminiscent of how Pandas does (used to do?) it. The user-facing arrays are views of those blocks. The block origin is referenced by the base attribute of the column array.
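The view/base relationship itself is plain NumPy and easy to demonstrate — this is an illustration, not the fork’s allocator:

import numpy as np

block = np.empty((2, 100_000), dtype=np.int64)  # one block for two int64 columns
col_a, col_b = block[0], block[1]               # user-facing column views

assert col_a.base is block and col_b.base is block  # the views keep the block alive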

Merging this back into upstream asyncpg would require quite an extra effort. The Cython code currently does not compile without NumPy. It may be possible to make NumPy an optional dependency. If the community requests the merge and the maintainers of asyncpg are benevolent, then yeah, let’s rock.

I have described asyncpg-rkt — my backward-compatible fork of asyncpg, a Python asyncio client for PostgreSQL.

asyncpg-rkt parses SELECT-ed records from the low-level PostgreSQL protocol directly into NumPy structured arrays without materializing Python objects, thus avoiding much of the overhead. It is up to 3x faster in wall time and 20x faster in CPU time. The repository is athenianco/asyncpg-rkt.
