Container Image Internals, Part 1: docker pull

The rise in popularity of Docker has led to a proliferation of container image usage in the cloud and devops space. The Docker toolchain makes working with container images easy, allowing users to build, distribute and run these images in just a handful of user-friendly commands.

This series attempts to shed some light on the image format used by Docker (and the new Open Container Initiative!) images. I'll explain how images are constructed and work through some basic examples of building images from scratch without using the Docker toolchain.

Prereqs

I tested the following commands on macOS Sierra. If you plan on following along, you’ll need bash and the command line tools used in the examples below, including curl and jq.

I used the Google Container Registry for this example, so if you want to push or work with private images in later parts of this series you’ll also need to install the Google Cloud SDK to help with authentication.

Manifests and Blobs

Have you ever pulled a Docker image using the docker pull command? After reading this post, you should understand exactly how this command works. In this first part of the series, we’re going to use just bash and some standard command line tooling to pull an image from a container registry to our laptop, where we can unpack it and inspect the contents.

What’s in a Container Image

A container image combines two main concepts: a root filesystem and a configuration.

A root filesystem is a description of exactly what files should be present inside a running container, and in what places. This is the part of the image that describes what you can see after running a command like: docker run -it $container bash and poking around with cd and ls.

Docker and the new Open Container Initiative provide image specifications that use a concept of filesystem layers to build and store the root filesystem. Layers primarily serve to cut up a single large root filesystem into a set of smaller chunks that can be shared across images. If you’re working with a lot of images, it’s important to share as many layers as possible so that pulls and pushes are fast and your images take up less disk space. We’ll talk more about these layers later.

The configuration section of an image is everything else needed to run a container image. This includes things like environment variables, information about the target architecture, and metadata about how the image was built for viewing later.

The Manifest

Docker and the OCI specification use a container manifest to describe the root filesystem and configuration of an image. This manifest is the canonical definition of our image: uploading a manifest to the registry creates an image, and deleting the manifest deletes the image.

If the word manifest sounds complicated, don’t worry! It’s just a fancier word for a JSON file. Here’s what a simple one for an image with one layer might look like:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "size": 190,
    "digest": "sha256:efe184abb97e76d7d900b2e97171cc20830b6b1b0e0fe504a4ee7097a6b5c91b"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 170,
      "digest": "sha256:9964c16915b8956cb01eb77028b1fd1976287b5ec87cc1663844a0bd32933a47"
    }
  ]
}

The schemaVersion and mediaType fields are just boilerplate explaining that this JSON file happens to represent a Docker image manifest. The config field points to a Docker runtime configuration file that contains instructions for how to run the image. We’ll explain this field in later parts of this series.

The layers field contains a list of the layers used to build our root filesystem. This is the field we’re going to spend most of our time with in the rest of this article.
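To make the config and layers fields a little more concrete, here is a minimal sketch that pulls the digests out of the example manifest above using jq. The temporary file and the function names config_digest and layer_digests are made up for this illustration; only the field names come from the manifest itself.

```shell
#!/usr/bin/env bash
# Save the example manifest to a scratch file (the location is arbitrary).
manifest_file="$(mktemp)"
cat > "$manifest_file" <<'EOF'
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "size": 190,
    "digest": "sha256:efe184abb97e76d7d900b2e97171cc20830b6b1b0e0fe504a4ee7097a6b5c91b"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 170,
      "digest": "sha256:9964c16915b8956cb01eb77028b1fd1976287b5ec87cc1663844a0bd32933a47"
    }
  ]
}
EOF

# Hypothetical helper: print the digest of the config blob.
config_digest() {
  jq -r '.config.digest' "$1"
}

# Hypothetical helper: print one "digest size" line per layer, in manifest order.
layer_digests() {
  jq -r '.layers[] | "\(.digest) \(.size)"' "$1"
}

config_digest "$manifest_file"
layer_digests "$manifest_file"
```

For a manifest with several layers, layer_digests prints one line per layer in the order they are listed, which is handy when looping over layers in a script.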

Want to see the manifest for an image you’ve built and pushed to a registry? You can use curl and the registry API to download and view this manifest. Here’s how to get the manifest for the official public Debian 8 image hosted on the Google Container Registry at l.gcr.io/google/debian8:latest:

curl -L \
  -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
  l.gcr.io/v2/google/debian8/manifests/latest | jq .

The -L flag tells curl to follow HTTP redirects. Many Docker registries use redirects to Content Delivery Networks or storage systems like Amazon S3 or Google Cloud Storage for increased performance and reliability, so we’ll need it on most of our curl commands.

It’s also important not to forget the -H flag. This tells curl to send an Accept header telling the registry which manifest schema version we would like to receive. For this series we’ll be using schema version 2, which is much simpler than schema version 1.
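If you leave the header off, a registry may fall back to the older schema version 1, so it can be worth checking what actually came back. Here is a small sketch of such a check. The function name check_schema_v2 is made up for this example, and the inline JSON is a stand-in rather than a full manifest.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a manifest on stdin and report its schema version.
# Returns non-zero if the manifest is not schema version 2.
check_schema_v2() {
  local version
  version="$(jq -r '.schemaVersion')"
  echo "schemaVersion: $version"
  [ "$version" = "2" ]
}

# Offline check against a tiny stand-in document.
echo '{"schemaVersion": 2}' | check_schema_v2
# prints: schemaVersion: 2
```

In practice you would pipe the output of the curl command above into check_schema_v2 instead of the stand-in document.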

The l.gcr.io portion of our image name becomes the hostname of our request. The rest of the path describes the name of the image (google/debian8). Then we’re describing what we’re looking for (manifests), and which one to get (latest).
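The mapping from image name to request URL can be sketched in a few lines of bash. The function name manifest_url is made up for this example, and it assumes the reference always looks like host/name:tag with an explicit host and tag. References that use a digest or a registry with a port number are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the manifest URL for an image reference
# of the form host/name:tag (explicit host and tag assumed).
manifest_url() {
  local ref="$1"
  local host="${ref%%/*}"   # text before the first slash, e.g. l.gcr.io
  local rest="${ref#*/}"    # text after the first slash, e.g. google/debian8:latest
  local name="${rest%:*}"   # drop the trailing :tag, e.g. google/debian8
  local tag="${rest##*:}"   # text after the last colon, e.g. latest
  printf '%s/v2/%s/manifests/%s\n' "$host" "$name" "$tag"
}

manifest_url l.gcr.io/google/debian8:latest
# prints: l.gcr.io/v2/google/debian8/manifests/latest

# To fetch the manifest, the result can be handed to the same curl command:
# curl -L \
#   -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
#   "$(manifest_url l.gcr.io/google/debian8:latest)" | jq .
```

Keeping the URL construction in a function makes it easy to reuse the same curl invocation for other public images by changing only the image reference.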

This is a public image, so you don’t need to pass any authentication information. If you want to try this on one of your own images, you’ll need to do a little bit more work, which we’ll explain once we get to building and pushing images later.