Making pip installs a little less slow

Installing your Python application’s dependencies can be surprisingly slow. Whether you’re running tests in CI, building a Docker image, or installing an application, downloading and installing dependencies can take a while.

So how do you speed up installation with pip?

In this article I’ll cover:

  • Avoiding the slow path of installing from source.
  • pip download speed, and the alternatives: Pipenv and Poetry.
  • A useful pip option that can, sometimes, speed up installation significantly.

Avoiding installs from source

When you install a Python package, there are typically two kinds of file you can install it from:

  • A packaged-up source file, often a .tar.gz with a setup.py. In this case, installing will often require running Python code (a little slow), and sometimes compiling large amounts of C/C++/Rust code (potentially extremely slow).
  • A wheel (a .whl file) that can just be unpacked straight onto the filesystem, with no need to run code or compile native extensions.

If at all possible, you want to install wheels, because installing from source will be slower. If you need to compile significant amounts of C code, installing from source will be much slower: instead of relying on precompiled binaries, you’ll need to compile it all yourself.

To ensure you’re installing wheels as much as possible:

  • Make sure you’re using the latest version of pip before installing dependencies. Binary wheels sometimes require newer versions of pip than the one packaged by default by your current Python.
  • Don’t use Alpine Linux; stick to Linux distributions that use glibc, e.g. Debian/Ubuntu/RedHat/etc. Standard Linux wheels require glibc, but Alpine uses the musl C library. Wheels for musl-based distributions like Alpine are starting to become available, but they’re still not as common.
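One way to notice when you’ve fallen onto the slow path is to ask pip to refuse source distributions entirely, using its --only-binary option. The sketch below only builds such a command; the helper name and the requirements.txt file name are illustrative assumptions, not part of the measurements in this article.

```python
import sys


def wheels_only_install_command(requirements_file):
    """Build a pip command that fails instead of building from source."""
    return [
        sys.executable, "-m", "pip", "install",
        # ":all:" applies the restriction to every package, so a missing
        # wheel becomes an error rather than a slow source build.
        "--only-binary", ":all:",
        "-r", requirements_file,
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(wheels_only_install_command("requirements.txt")))
```

If the resulting install fails, that tells you which dependency has no wheel for your platform, which is usually the one worth investigating.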

Comparing installation speed between pip, Pipenv, and Poetry

The default Python package manager is pip, but you can also use Pipenv and Poetry, both of which add functionality like virtualenv management. I compared the speed of all three.

Methodology

Installing Python packages involves two steps:

  1. Downloading the package.
  2. Installing the already downloaded package.

By default, Python package managers will cache downloaded packages on disk, so if you install them a second time in a different virtualenv the package won’t need to be re-downloaded. I therefore measured both variants: a cold cache where the package had to be downloaded, and a warm cache where the package was already available locally.

In all cases I made sure to create the virtualenvs in advance, and for pip I made sure to use hashes in the requirements.txt, to match the hash validation that the other two package managers do by default.
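For reference, a hash-pinned requirements line pairs an exact version with a --hash=sha256:... entry. The sketch below computes such a line for an already downloaded file. The helper name, package name, and file contents in the demo are made up for illustration; in practice a tool such as pip hash, or pip-compile with --generate-hashes, would usually produce these lines for you.

```python
import hashlib
import os
import tempfile


def hashed_requirement(name, version, archive_path):
    """Return a requirements.txt line pinned to the SHA-256 of one file."""
    digest = hashlib.sha256()
    with open(archive_path, "rb") as f:
        # Read in chunks so large wheels don't need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return f"{name}=={version} --hash=sha256:{digest.hexdigest()}"


if __name__ == "__main__":
    # Stand-in file; a real run would point at a downloaded wheel or sdist.
    with tempfile.NamedTemporaryFile(suffix=".whl", delete=False) as tmp:
        tmp.write(b"not a real wheel")
    try:
        print(hashed_requirement("example-package", "1.0", tmp.name))
    finally:
        os.unlink(tmp.name)
```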

I used the transitive dependencies for installing pandas and matplotlib, resulting in the installation of 12 different packages in total.
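The results distinguish wallclock time from CPU time. As a rough sketch of how one might capture both for a single command (not necessarily how the numbers in this article were collected), the following uses time.perf_counter() for wallclock time and os.times() for the CPU time of child processes. The function name and the stand-in command are assumptions.

```python
import os
import subprocess
import sys
import time


def time_command(command):
    """Run a command and return (wallclock_seconds, cpu_seconds)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wallclock = time.perf_counter() - start
    after = os.times()
    # CPU time of finished child processes: user plus system time.
    # On Windows these children_* fields are always zero.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wallclock, cpu


if __name__ == "__main__":
    # A trivial stand-in command; substitute the install command being measured.
    wall, cpu = time_command([sys.executable, "-c", "sum(range(10**6))"])
    print(f"wallclock={wall:.2f}s cpu={cpu:.2f}s")
```

When CPU time exceeds wallclock time, more than one core was busy at once, which is the signal used below to spot parallelism.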

Results

Here’s how long each installation took, measuring both wallclock and CPU time:

Tool     Cache   Wallclock time   CPU time
pip      Cold    16.2s            10.7s
pip      Warm    10.5s            9.4s
Pipenv   Cold    12.5s            26.0s
Pipenv   Warm    9.7s             25.2s
Poetry   Cold    12.1s            18s
Poetry   Warm    10.2s            17.8s

Some things to notice:

  • pip is the slowest by wallclock time when the cache is cold.
  • Wallclock time isn’t really that different between any of them when the cache is warm, i.e. the packages are already downloaded.
  • Both Pipenv and Poetry use parallelism, as we can see from CPU time that is higher than wallclock time; pip is currently single-threaded.
  • Pipenv uses quite a lot of CPU compared to the other two; Poetry is a bit better, but still higher than pip.

This example was run with 12 packages being installed; with a larger number of dependencies, it’s possible that Poetry’s parallel installation would have more of an impact.

Keeping the cache warm

Notice that in all cases you get a speedup from having a warm cache, i.e. reusing already downloaded packages. On your local machine, that happens automatically. In most CI services, your cache will start out empty.

To work around that, most CI systems will have some way to store a cache directory at the end of the run, and then load it at the beginning of the next run. If you’re using GitHub Actions, you can use the built-in caching support in the action used to set up Python.

This is still not as fast as running on a dedicated machine, however: storing and loading the cache also takes time.
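CI caches are typically keyed on something that changes when the dependencies change, so that a stale cache isn’t restored after an update. Here is a minimal sketch of deriving such a key from the requirements file contents and the Python version. The key format, the function name, and the placeholder pins are assumptions for illustration, not what any particular CI service uses.

```python
import hashlib
import sys


def pip_cache_key(requirements_text, python_version=None):
    """Derive a cache key that changes whenever the pinned dependencies change."""
    if python_version is None:
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    digest = hashlib.sha256(requirements_text.encode("utf-8")).hexdigest()
    # Truncate the digest to keep the key readable.
    return f"pip-py{python_version}-{digest[:16]}"


if __name__ == "__main__":
    # Placeholder pins, only here to have something to hash.
    print(pip_cache_key("pandas==1.0\nmatplotlib==3.0\n"))
```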

Going (very slightly) faster by disabling the version check

On startup pip may check if you’re running the latest version or not, and print a warning if you’re not. You can disable this check like so:

pip --disable-pip-version-check install ...

This saves me about 0.2-0.3s, not a very significant improvement; the actual improvement probably depends on your network speed and other factors.
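If you’re launching pip from a script, the same option can also be set through the PIP_DISABLE_PIP_VERSION_CHECK environment variable, since pip reads its options from PIP_-prefixed variables. A small sketch, with a hypothetical helper name and arguments:

```python
import os
import sys


def pip_install_without_version_check(args):
    """Return (command, env) for a pip install that skips the version check."""
    env = dict(os.environ)
    # Equivalent to passing --disable-pip-version-check on the command line.
    env["PIP_DISABLE_PIP_VERSION_CHECK"] = "1"
    command = [sys.executable, "-m", "pip", "install", *args]
    return command, env


if __name__ == "__main__":
    command, env = pip_install_without_version_check(["-r", "requirements.txt"])
    # To actually run it: subprocess.run(command, env=env, check=True)
    print(" ".join(command))
```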

Going faster (sometimes) with disabled compilation

Can we do better? In some cases, yes.

After packages are downloaded (if they’re not cached locally) and installed on the filesystem, package managers do one final step: they compile the .py source files into .pyc bytecode files, and store them in __pycache__ directories. This is not the same as compiling a C extension; it’s just an optimization to make loading Python code faster on startup. Instead of having to compile the .pyc at import time, the .pyc is already there.
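To make that step concrete, here is a small sketch that compiles a directory of .py files ahead of time using the standard library’s compileall module. It’s a stand-in to show what bytecode compilation produces; whether pip takes exactly this route internally isn’t something I’m claiming here, and the function and file names are made up.

```python
import compileall
import pathlib
import tempfile


def compile_tree(directory):
    """Compile every .py file under directory; return the .pyc paths found."""
    # quiet=1 suppresses the per-file listing but still reports errors.
    compileall.compile_dir(str(directory), quiet=1)
    return sorted(pathlib.Path(directory).rglob("*.pyc"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (pathlib.Path(tmp) / "example.py").write_text("VALUE = 1 + 1\n")
        for pyc in compile_tree(tmp):
            # Each .pyc lands in a __pycache__ directory next to its source.
            print(pyc.parent.name, pyc.name)
```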

It turns out that bytecode compilation takes a significant amount of the time spent by pip install. But you can disable this step by calling pip install --no-compile.

Here’s a comparison of how long it takes to install packages both with and without .pyc compilation, in both cases when the cache is warm so no downloads are needed:

Installation method Cache Wallclock time CPU time
pip install Warm 10.5s 9.4s
pip install --no-compile Warm 4.8s 4.0s

So should you always use this option? Not necessarily.
Just because pip install is faster doesn’t mean you’ve saved time overall.

Any module you import will still need to be compiled into a .pyc; it’s just that the work will happen at Python run time, instead of at package installation time. So if you’re importing all or most modules, you might not save any time overall: you’ve just moved the work to a different place.

In other cases, however, --no-compile will save you time. For example, in your testing setup you might be installing many third-party packages for integration testing, but only using a small amount of those libraries’ code. As such, there’s no point in compiling lots of modules you won’t be using.

Neither Pipenv nor Poetry seems to support this option at this time.

Package installation could be much faster

Given how many people use Python, slow package installations add up.

It’s difficult to estimate how many pip installs are happening in the world, but pip itself was downloaded 100 million times in the month previous to writing this article, so we can take that as a lower bound. If you could shave just 1 second off of every one of those 100 million installs, that would be 3.17 years of waiting saved every month.
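The arithmetic behind that figure, as a quick check (assuming an average year of 365.25 days):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumes an average year of 365.25 days

installs_per_month = 100_000_000
seconds_saved_per_install = 1

years_saved_per_month = (
    installs_per_month * seconds_saved_per_install / SECONDS_PER_YEAR
)
print(f"{years_saved_per_month:.2f} years saved per month")  # prints 3.17
```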

There is clearly a lot of room for improvement in package installation in the Python world:

  • Poetry already implements parallelism to some extent, but it doesn’t seem to be as efficient as one might hope, given its higher CPU usage than pip. But it may already be faster on a wallclock basis for a larger number of dependencies.
  • Pipenv’s CPU usage is even worse.

As for pip:

If you’re interested in helping, the pip repository has a number of issues and in-progress PRs covering various aspects.
