Compression

Because the Limbo specification stores dataset samples as many small files organized into a simple hierarchy, samples are easy to both write and read. However, reading many small files from disk can be slow, especially if you do so repeatedly as you run experiments. Fortunately, the Limbo software includes a command-line tool, limbo-compress, that you can use to cache subsets of the Limbo data in a few large files that can be read quickly and efficiently.
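To get a feel for the difference, here is a minimal sketch that times loading many small files against loading one large file. It uses small random .npy arrays as stand-ins rather than real Limbo samples, and the function name compare_load_times is hypothetical. Because the files were just written, they are likely still in the operating system cache, so the gap you see here is probably smaller than it would be with cold data on a network disk.

```python
import tempfile
import time
from pathlib import Path

import numpy


def compare_load_times(count=200, shape=(32, 32, 3)):
    """Time loading `count` small .npy files versus one stacked .npy file."""
    rng = numpy.random.default_rng(0)
    samples = rng.integers(0, 256, size=(count, *shape), dtype=numpy.uint8)

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)

        # One file per sample, mimicking a many-small-files layout.
        for index, sample in enumerate(samples):
            numpy.save(root / f"{index:06d}.npy", sample)

        # One large file holding every sample.
        numpy.save(root / "all.npy", samples)

        start = time.perf_counter()
        small = numpy.stack(
            [numpy.load(root / f"{index:06d}.npy") for index in range(count)]
        )
        small_seconds = time.perf_counter() - start

        start = time.perf_counter()
        large = numpy.load(root / "all.npy")
        large_seconds = time.perf_counter() - start

    return small, large, small_seconds, large_seconds


small, large, small_seconds, large_seconds = compare_load_times()
print(f"many small files: {small_seconds:.4f} s, one large file: {large_seconds:.4f} s")
```

Both approaches return identical data; only the number of files opened differs.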

For example, let’s suppose that you are planning a series of experiments to fine-tune pre-trained VGG-16 models to detect type-48 uranium hexafluoride containers. You plan to use 5000 images for training and 1000 images for testing, all drawn from Campaign 17. You also plan to repeat each experiment ten times and average the results, to ensure that any conclusions you draw are robust and not just the random quirks of a single model. This is a perfect use case for limbo-compress, since repeatedly loading and parsing the same set of metadata and image files would be extremely inefficient. Let’s look at how you would compress the data you need for these experiments.

To begin, we’ll define a variable containing the path to the data:

[1]:

DATA_ROOT = "/mnt/mc1/limbo"

… if you’re running this notebook yourself, you’ll need to set DATA_ROOT to point to a directory containing Limbo campaign data that you’ve downloaded.

First, we’ll compress 1000 metadata files and images for our test data:

[2]:

!limbo-compress --prefix test --images --metadata -- $DATA_ROOT/campaign17/0049

100%|███████████████████████████████████████| 1000/1000 [04:04<00:00,  4.09it/s]

Each subdirectory within a campaign contains ~1000 files, which is why we chose to compress one subdirectory for this example.

Once limbo-compress finishes, you will find a pair of files with filenames based on the --prefix argument you provided above:

[3]:

!ls test*

test-images.npy  test-metadata.pickle

The test-images.npy file is a numpy array containing all 1000 images, which can be loaded very quickly:

[4]:

import numpy

[5]:

%%time
test_images = numpy.load("test-images.npy")
test_images.shape

CPU times: user 1.43 ms, sys: 77.7 ms, total: 79.1 ms
Wall time: 76.5 ms

[5]:

(1000, 224, 224, 3)

Note that the resulting array has shape (images, width, height, channels). Next, let’s load the metadata, which has been compressed into a single Python pickle file. It takes longer to load than the images, but still only a few seconds:

[6]:

import pickle

[7]:

%%time
with open("test-metadata.pickle", "rb") as stream:
    test_metadata = pickle.load(stream)

CPU times: user 7.41 s, sys: 583 ms, total: 7.99 s
Wall time: 7.99 s

[8]:

len(test_metadata)

[8]:

1000

The metadata is stored in a list containing one Python dict per sample, in the same order as the images in the image array. Now let’s compress 5000 additional images for training and load them into memory the same way:

[9]:

!limbo-compress --prefix training --images --metadata -- $DATA_ROOT/campaign17/0000 \
$DATA_ROOT/campaign17/0001 $DATA_ROOT/campaign17/0002 $DATA_ROOT/campaign17/0003 \
$DATA_ROOT/campaign17/0004
!ls train*

100%|███████████████████████████████████████| 5000/5000 [20:19<00:00,  4.10it/s]
training-images.npy  training-metadata.pickle
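Typing out every subdirectory gets tedious for larger subsets. The following sketch builds the same command line programmatically, using only the flags shown above. The helper name compress_command is hypothetical, and it assumes that subdirectories are always named with four zero-padded digits, as they are in this example.

```python
import shlex

DATA_ROOT = "/mnt/mc1/limbo"  # same placeholder path used earlier in this notebook


def compress_command(prefix, campaign, subdirectories):
    """Build a limbo-compress command line for the given campaign subdirectories."""
    # Assumption: subdirectory names are four zero-padded digits (0000, 0001, ...).
    paths = [f"{DATA_ROOT}/{campaign}/{index:04d}" for index in subdirectories]
    return ["limbo-compress", "--prefix", prefix, "--images", "--metadata", "--", *paths]


command = compress_command("training", "campaign17", range(5))
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) if you would rather launch the tool from Python than from a notebook shell escape.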

[10]:

%%time
training_images = numpy.load("training-images.npy")
training_images.shape

CPU times: user 0 ns, sys: 363 ms, total: 363 ms
Wall time: 359 ms

[10]:

(5000, 224, 224, 3)
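If a compressed subset is too large to hold comfortably in memory, numpy.load can memory-map the file so that only the samples you index are read from disk. The sketch below demonstrates this on a small stand-in file; the filename and the uint8 dtype are assumptions for illustration, not properties of the files that limbo-compress writes.

```python
import tempfile
from pathlib import Path

import numpy

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example-images.npy"  # hypothetical stand-in file

    # Small stand-in array with the same four-axis, channels-last layout.
    numpy.save(path, numpy.zeros((8, 224, 224, 3), dtype=numpy.uint8))

    # mmap_mode="r" maps the file read-only instead of reading it all at once.
    images = numpy.load(path, mmap_mode="r")

    # Slicing touches only the requested samples; numpy.array copies them into memory.
    batch = numpy.array(images[2:4])

    # Release the mapping before the temporary directory is cleaned up.
    del images

print(batch.shape)
```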

[11]:

%%time
with open("training-metadata.pickle", "rb") as stream:
    training_metadata = pickle.load(stream)

CPU times: user 39.9 s, sys: 2.7 s, total: 42.6 s
Wall time: 42.5 s
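Since the test and training data are loaded the same way, you may want to wrap the two steps in a helper. This is a minimal sketch: the function name load_compressed is hypothetical, and it assumes only the "<prefix>-images.npy" and "<prefix>-metadata.pickle" naming seen above. It also checks that there is one metadata record per image, and is demonstrated here on tiny stand-in files rather than real Limbo data.

```python
import pickle
import tempfile
from pathlib import Path

import numpy


def load_compressed(prefix, directory="."):
    """Load the image array and metadata list written for `prefix`."""
    directory = Path(directory)
    images = numpy.load(directory / f"{prefix}-images.npy")
    with open(directory / f"{prefix}-metadata.pickle", "rb") as stream:
        metadata = pickle.load(stream)

    # There should be one metadata dict per image, in the same order.
    if len(images) != len(metadata):
        raise ValueError(f"{len(images)} images but {len(metadata)} metadata records")
    return images, metadata


# Demonstrate with tiny stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    numpy.save(Path(tmp) / "demo-images.npy", numpy.zeros((3, 4, 4, 3), dtype=numpy.uint8))
    with open(Path(tmp) / "demo-metadata.pickle", "wb") as stream:
        pickle.dump([{"annotations": []} for _ in range(3)], stream)

    demo_images, demo_metadata = load_compressed("demo", tmp)

print(demo_images.shape, len(demo_metadata))
```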

Note that we compressed five Campaign 17 subdirectories to get our desired 5000 images. Loading the compressed training data takes less than a minute, while compressing it took more than 20 minutes. That difference is roughly the amount of time saved every time you use the compressed data!
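To see how this adds up over the ten repetitions planned for each experiment, here is a rough back-of-the-envelope calculation using the wall-clock times reported in the cells above. It rests on one assumption that the timings do not confirm directly: that reading and parsing the raw files for an experiment takes about as long as limbo-compress took to process them.

```python
# Wall-clock times reported in the cells above, in seconds.
compress_seconds = {"test": 4 * 60 + 4, "training": 20 * 60 + 19}
load_seconds = {"test": 0.0765 + 7.99, "training": 0.359 + 42.5}

repetitions = 10  # each experiment is repeated ten times

# Assumption: reading the raw files takes about as long as compressing them did.
raw_total = repetitions * sum(compress_seconds.values())

# Compress once, then load the compressed files for every repetition.
cached_total = sum(compress_seconds.values()) + repetitions * sum(load_seconds.values())

print(f"raw files every time: {raw_total / 60:.1f} min")
print(f"compress once, then load: {cached_total / 60:.1f} min")
```

Under that assumption, ten repetitions cost about 244 minutes of loading from the raw files, versus about 33 minutes when compressing once and reusing the result.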

Once the images and metadata are loaded, you can easily generate labels:

[12]:

import torch

targets = {"48G", "48X", "48Y"}

def categories(sample):
    return {annotation["category"] for annotation in sample.get("annotations", [])}

training_labels = torch.tensor([1 if categories(sample) & targets else 0 for sample in training_metadata], dtype=torch.float32).unsqueeze(dim=1)
test_labels = torch.tensor([1 if categories(sample) & targets else 0 for sample in test_metadata], dtype=torch.float32).unsqueeze(dim=1)
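The labeling rule above marks a sample as positive when any of its annotation categories appears in targets, and as negative otherwise, including samples with no annotations at all. The sketch below applies the same rule to a few hypothetical metadata records so you can see the result directly. Real records contain more fields, and "30B" is just an assumed example of a non-target category. It uses a NumPy array instead of a torch tensor so that it runs without PyTorch.

```python
import numpy

targets = {"48G", "48X", "48Y"}


def categories(sample):
    """Return the set of annotation categories present in one metadata dict."""
    return {annotation["category"] for annotation in sample.get("annotations", [])}


# Hypothetical metadata records for illustration only.
example_metadata = [
    {"annotations": [{"category": "48Y"}]},                       # target present
    {"annotations": [{"category": "30B"}]},                       # non-target only
    {},                                                           # no annotations
    {"annotations": [{"category": "30B"}, {"category": "48G"}]},  # mixed
]

labels = numpy.array(
    [1 if categories(sample) & targets else 0 for sample in example_metadata]
)

# Fraction of positive samples, a quick check on class balance.
positive_fraction = labels.mean()

print(labels, positive_fraction)
```

Checking the positive fraction on your real training and test labels is a quick way to spot a badly imbalanced subset before you start training.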

If you’re using PyTorch, it’s easy to create a PyTorch-compatible dataset that works with the compressed data:

[13]:

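import torchvision
import torchvision.transforms.functional as F

class TorchDataset(torch.utils.data.Dataset):
    """PyTorch compatible dataset that works with our compressed data.

    Assumes `images` is a torch tensor in (N, C, H, W) layout, as required by
    the torchvision functional transforms used below.
    """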
    def __init__(self, labels, images, training=True):
        self.labels = labels
        self.images = images.to(torch.float32)
        self.training = training

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, key):
        image = self.images[key]
        augmented_image = self.images[key]
        label = self.labels[key]

        if self.training:
            angle = float(torch.empty(1).uniform_(-90.0, 90.0).item())
            translate = (0.0, 0.0)
            scale = float(torch.empty(1).uniform_(0.8, 1.2).item())
            shear = (
                float(torch.empty(1).uniform_(-20.0, 20.0).item()),
                float(torch.empty(1).uniform_(-20.0, 20.0).item()),
                )

            augmented_image = F.affine(augmented_image, angle, translate, scale, shear, torchvision.transforms.InterpolationMode.BILINEAR, fill=(0.485, 0.456, 0.406))

            if torch.rand(1) < 0.5:
                augmented_image = F.hflip(augmented_image)

        augmented_image = F.normalize(augmented_image, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

        return image, augmented_image, label
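The dataset class above calls images.to(torch.float32) and passes each image to torchvision functional transforms, so it presumably expects a channels-first torch tensor rather than the channels-last NumPy array returned by numpy.load. Here is a minimal sketch of that conversion. The helper name to_channels_first is hypothetical, and the scaling step assumes that integer images hold pixel values from 0 to 255, which you should verify against your own data.

```python
import numpy
import torch


def to_channels_first(images):
    """Convert a channels-last NumPy image array to a float32 channels-first tensor."""
    tensor = torch.from_numpy(images)
    # Assumption: integer images hold 0-255 pixel values and should be scaled to [0, 1].
    scale = 1.0 if tensor.is_floating_point() else 255.0
    # Move the channel axis from last position to just after the sample axis.
    return tensor.permute(0, 3, 1, 2).to(torch.float32) / scale


# Stand-in for the array returned by numpy.load("training-images.npy").
example_images = numpy.zeros((4, 224, 224, 3), dtype=numpy.uint8)
example_tensor = to_channels_first(example_images)
print(example_tensor.shape, example_tensor.dtype)
```

With a conversion like this in place, the dataset could be constructed as TorchDataset(training_labels, to_channels_first(training_images)) and wrapped in a torch.utils.data.DataLoader for batching.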