Creating a PyPI pull-through cache that is ready for Kubernetes

By Frank Wickström

Packages are fun, let’s cache some!

After the recent Fastly downtime, we were once again made aware that the internet is not always as stable as we might think. While the Fastly outage took down many major sites, it also affected things like package managers. Services such as the central Helm registry were inaccessible during the downtime. Using a single source for services such as Helm, PyPI, and npm is very convenient, but it sucks when things go down, and it might even block you from delivering software. In this post, we will explore one way of making your setup a bit more robust by adding a cache between you and the main service that is usually used.

This can often be seen as a micro-optimization since a provider like Fastly going down is very unlikely, but there are very good reasons for caching. Some of them include:

  • If the main service goes down, you still have access to packages that are in the cache as backup
  • Remember left-pad?
  • It could be within your own network, in other words, an “offline cache.”
  • Can reside close to your CI servers, lowering install times and internet traffic
  • Lower costs in cloud environments, since less internet traffic is needed

If the cache also serves as a private repository for your own packages, you kill two birds with one stone. For now, however, we will focus on the caching part.

What is a pull-through cache?

A pull-through cache or self-populating cache is a cache that, if it does not find what you are requesting from it, goes and gets it from somewhere else. In this case, if a Python package is not found in the cache, it will go and get it from PyPI.org.

Example:

The user runs pip install -i https://pypi.example.com/simple Django. This tries to get the package from your own PyPI registry. If the package cannot be found at pypi.example.com, however, the cache falls back to another registry (usually pypi.org), gets the package from there, stores it, and then responds with the package as if it had been there all along.

Choosing the caching software

First, we need to select the application that is going to do the caching for us. If you already have something like Sonatype Nexus or JFrog Artifactory, you will most likely want to use those to set this up. However, they are giant pieces of software, and setting them up on Kubernetes has its challenges. The main reasons for not using them if you are just looking for a PyPI cache are:

  • Setting them up requires good knowledge of how to configure them, and there are a lot of moving parts.
  • Getting them running on Kubernetes, especially Nexus, is not always easy due to the lack of cloud-specific documentation.
  • Keeping the application container stateless can be tricky. Storing data in object storage can cost you extra due to licensing, or it can be challenging to do. (This, of course, assumes that you want to keep things stateless, as one usually does in Kubernetes.)

So that brings us to more specific solutions for hosting a PyPI cache. At the moment, there are three main contenders in this field: PyPIServer, PyPICloud, and DevPi. All of them bring similar capabilities to the table: they can all host Python packages, and they can all work as a pull-through cache. So which one should we pick?

Well, let’s first have a look at when the packages got their most recent updates.

  • PyPIServer: 1.4.2, release date 2020-10-10
  • PyPICloud: 1.2.4, release date 2021-06-10
  • DevPi (devpi-server): 6.0.1, release date 2021-06-23

So, it seems like PyPIServer is not being that actively developed at the moment. There have also only been a handful of commits to the project this year. For that reason, we will skip it.

Next up is DevPi. This was the first thing we tried, and it does indeed work. However, upon closer inspection, setting it up on Kubernetes with bucket storage is not that straightforward. Since there is no direct configuration for something like S3 or GCS, we will skip this one as well.

Lastly, we have PyPICloud. As the name suggests, this one is made to run in the cloud and has out-of-the-box support for S3 and GCS. It also explicitly mentions that it is designed to be stateless, which is usually what you want from a service you run in Kubernetes, mainly because that makes upgrades and scaling much simpler.

Setting up PyPICloud

Since we will run on Kubernetes, the first thing we need is a Dockerfile (or some other way to create an OCI image). Luckily, PyPICloud already offers one, and we can use it as our starting point. One issue, however, is that we can’t configure the application with environment variables (at least not entirely), and configuration needs to live in configuration files. While this kind of goes against the stateless nature that the project boasts, we can still make the configuration dynamic with a bit of string replacement magic and some templating.

Creating a flexible configuration file

The first thing we want to do is get a default configuration file. We can do this by running docker run -it --rm -v $(pwd):/out stevearc/pypicloud make-config -r /out/config.ini. This will generate a configuration file with the default values for PyPICloud. With this configuration, it would run with local storage and an SQLite database. We want to make these things configurable using environment variables, however, so that we don’t need to hard-code anything into the configuration. We can later read these values from a Kubernetes secret.

To do so, we are going to use a modified version of the common gettext utility envsubst. The reason we won’t use the standard version is that it does not support default values, and that is something we would like. What envsubst does is replace any occurrence of a string such as ${MY_VAR} with the value of the environment variable MY_VAR. The version of the tool we are going to use, however, also supports ${MY_VAR:-42}; in this case, if MY_VAR is not set, the value 42 is used. The modified version of the tool can be found at https://github.com/a8m/envsubst.
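As a quick illustration of the default-value behaviour (the variable name here is just an example):

$ echo 'db.url = ${DB_URL:-sqlite:///db.sqlite}' | envsubst
db.url = sqlite:///db.sqlite
$ export DB_URL=postgresql://user:pass@localhost:5432/pypicloud
$ echo 'db.url = ${DB_URL:-sqlite:///db.sqlite}' | envsubst
db.url = postgresql://user:pass@localhost:5432/pypicloud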

In our case, we will use Google Cloud Storage for storing packages and PostgreSQL as the database for storing metadata. To keep things easier to read, I will not post the entire config here, just an excerpt, so that you can get an idea of what the final configuration looks like with variables instead of hard-coded values.

Note that we will use Google Cloud to test out this setup and will be using the GOOGLE_APPLICATION_CREDENTIALS variable for passing authentication credentials to PyPICloud. This variable does not need to be specified in the config file itself but is read directly from the environment when PyPICloud starts.
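With that in mind, a sketch of the templated part of config.ini could look something like the following. The variable names (STORAGE_BUCKET, DB_URL, SESSION_*) are our own choices, and you should double-check the option names against the file you generated:

; Excerpt only — the generated file contains more options than this.
[app:main]
pypi.fallback = cache
pypi.storage = gcs
storage.bucket = ${STORAGE_BUCKET:-my-pypi-packages}
pypi.db = sql
db.url = ${DB_URL:-sqlite:///db.sqlite}
session.encrypt_key = ${SESSION_ENCRYPT_KEY}
session.validate_key = ${SESSION_VALIDATE_KEY}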

Creating a Docker setup

Now that we have our template file, we also need to replace the variables when we start up the container. For this, we need to extend the existing PyPICloud image with some custom logic. At the same time, we will also add wait-for-it.sh for good measure, to wait for our database to be up before starting the application. This will be useful when running locally in a docker-compose environment, for instance. We also need a custom entrypoint script that runs when the container starts and replaces our variables with their proper values. The entrypoint script could look something like this:
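Here is a minimal sketch, assuming the template is baked into the image at /etc/pypicloud/config.ini.template and that DATABASE_HOST/DATABASE_PORT are used for the wait-for-it check (both are our own naming choices):

#!/bin/sh
set -e

# Render the configuration template, substituting ${VAR} and ${VAR:-default}
# occurrences with values from the environment (a8m/envsubst).
envsubst < /etc/pypicloud/config.ini.template > /etc/pypicloud/config.ini

# Optionally wait for the database to accept connections before starting.
if [ -n "${DATABASE_HOST:-}" ]; then
    wait-for-it.sh "${DATABASE_HOST}:${DATABASE_PORT:-5432}" --timeout=30
fi

# Hand over to whatever command the container was started with.
exec "$@"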

Our final image will look like this:
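A sketch of what that image could look like, assuming the envsubst release asset name and the file locations below (check the a8m/envsubst releases page for the exact version and binary name for your platform):

FROM stevearc/pypicloud:latest

USER root

# envsubst with default-value support (a8m/envsubst)
ADD https://github.com/a8m/envsubst/releases/download/v1.2.0/envsubst-Linux-x86_64 /usr/local/bin/envsubst

# Helper script for waiting on the database, plus our template and entrypoint
COPY wait-for-it.sh /usr/local/bin/wait-for-it.sh
COPY config.ini.template /etc/pypicloud/config.ini.template
COPY entrypoint.sh /usr/local/bin/entrypoint.sh
RUN chmod +x /usr/local/bin/envsubst /usr/local/bin/wait-for-it.sh /usr/local/bin/entrypoint.sh

ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
# Setting a new ENTRYPOINT resets the inherited CMD, so we restate how the
# server is started; here we assume the rendered config is served via uwsgi,
# as in the base image.
CMD ["uwsgi", "--ini-paste", "/etc/pypicloud/config.ini"]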

If you want to try this out locally with docker-compose before we start deploying to Kubernetes, a working docker-compose.yml will look like this:
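Here is a sketch along those lines, assuming the environment variable names from the configuration template above and a service account key file next to the compose file (all names and the exposed port are our own examples):

version: "3.8"

services:
  db:
    image: postgres:13
    environment:
      POSTGRES_USER: pypicloud
      POSTGRES_PASSWORD: pypicloud
      POSTGRES_DB: pypicloud

  pypicloud:
    build: .
    depends_on:
      - db
    ports:
      # Adjust to whatever port your PyPICloud config listens on
      - "8080:8080"
    environment:
      DATABASE_HOST: db
      DB_URL: postgresql://pypicloud:pypicloud@db:5432/pypicloud
      STORAGE_BUCKET: my-pypi-packages
      GOOGLE_APPLICATION_CREDENTIALS: /run/secrets/gcs.json
      SESSION_ENCRYPT_KEY: local-dev-only
      SESSION_VALIDATE_KEY: local-dev-only
    volumes:
      - ./gcs-service-account.json:/run/secrets/gcs.json:ro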

Deploying the application to Kubernetes

The first thing that we need to do is set up our surrounding infrastructure. To keep things short, we will not be using any infrastructure-as-code tools for this, but I would strongly recommend using something like Terraform or Pulumi to set up your infrastructure. What we need is a Kubernetes cluster, a storage bucket, and a database. Setting these things up is worthy of blog posts of their own, so this post will only contain suggestions for which services to use in Google Cloud.

For Kubernetes, we can use GKE; for storage, we can use a GCS bucket; and for our database, we can use Cloud SQL and set up a PostgreSQL database there. We also need to create a service account that can access our GCS bucket, and download the JSON file with the credentials for that service account. Note that all of the big cloud providers offer similar services, and the only more significant differentiator will be the way you authenticate with them.
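For the bucket and service account part, a rough sketch with the gcloud and gsutil CLIs could look like this (the project, bucket, and account names are placeholders):

# Create a bucket for the packages
gsutil mb -l europe-north1 gs://my-pypi-packages

# Create a service account for PyPICloud
gcloud iam service-accounts create pypicloud --display-name "PyPICloud"

# Allow the service account to read and write objects in the bucket
gsutil iam ch \
  serviceAccount:pypicloud@<PROJECT_ID>.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://my-pypi-packages

# Download the JSON key that will be used as GOOGLE_APPLICATION_CREDENTIALS
gcloud iam service-accounts keys create gcs-service-account.json \
  --iam-account pypicloud@<PROJECT_ID>.iam.gserviceaccount.com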

Once you have set up your GKE cluster, you will most likely want to run something like cert-manager on it as well, and point a proper DNS name at it so that you can start generating valid TLS certificates using Let’s Encrypt. Again, this is a bit out of scope for this post, but I highly recommend reading up on how to do this before proceeding.

Now for the actual Kubernetes deployment. We at Anders usually use our own Kólga tool for anything CI/CD related, and if you want to try it out, this is what the CI configuration for it would look like if you were using GitLab:

You would also need to set the following variables under the project’s CI/CD configuration (or alternatively use Vault, which is also supported by Kólga):

  • K8S_SECRET_DB_URL: For setting the database URL
  • K8S_SECRET_STORAGE_BUCKET: For pointing to your GCS bucket
  • K8S_SECRET_GOOGLE_APPLICATION_CREDENTIALS: For authenticating with Google Cloud

If you are using GitHub Actions, you can use our actions like so:

If you do not want to use our tooling for deploying PyPICloud, you can look at the Helm charts that we use to deploy our applications with Kólga, or write your own that uses the Docker image we just created.

Once the deployment is complete, you can visit the domain that you specified for the application, and you should be greeted by the PyPICloud landing page.

You are now ready to start using your brand new PyPI cache.

Using the cache

To use the cache that we have now installed, we need to configure our package managers to use it. The most used package manager for Python is, of course, the bundled pip application, but lately, new contenders such as Poetry, Pipenv, and Conda have hit the package manager space as well. To use your cache with these applications, you can often either supply an argument when installing a package or configure the package manager to always use a specific package registry.

PIP

Command line
pip install -i https://pypi.<YOUR_DOMAIN>.com/simple <PACKAGE>

Global configuration
pip config set global.index-url https://pypi.<YOUR_DOMAIN>.com/simple

Configuration file
Create/Modify the file $HOME/.config/pip/pip.conf and add:

[global]
timeout = 5
index-url = https://pypi.<YOUR_DOMAIN>.com/simple

Alternatively, you can tell pip to look for the configuration file in another location by setting the PIP_CONFIG_FILE environment variable to point at the file path of your choosing.

Poetry

Unfortunately, Poetry does not support global settings for a registry, nor does it have a flag that can be added during poetry add. The only way to support other registries is to configure a project’s pyproject.toml file.

Project-specific settings
In ./pyproject.toml of your project add the following:

[[tool.poetry.source]]
name = "my-pypi-cache"
url = "https://pypi.<YOUR_DOMAIN>.com/simple"
default = true

Pipenv

Command line
pipenv install --pypi-mirror https://pypi.<YOUR_DOMAIN>.com/simple

Project-specific settings
See the Pipenv documentation.
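For reference, a Pipfile source entry pointing at the cache could look something like this (the source name is arbitrary):

[[source]]
name = "my-pypi-cache"
url = "https://pypi.<YOUR_DOMAIN>.com/simple"
verify_ssl = true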

Configuration file
See the PIP instructions above.

Conda

Global configuration
anaconda config --set URL https://pypi.<YOUR_DOMAIN>.com/simple
Add --site at the end to make it global for all users instead of just the current user.

Configuration file
See the PIP instructions above.

Anders is a Finnish IT company, whose mission is sustainable software development with the greatest colleagues of all time.