# InfiniBatch Infinibatch is a library of checkpointable iterators for randomized data loading of massive data sets in deep neural network training. ## Features * support for corpora much larger than fit into RAM * hierarchical block+sentence-level randomization over the whole corpus, different randomization in each epoch * only load the data that is needed * very fast start-up time (does not need to read full corpus) * only requires the most basic of data preparation (e.g. no indexing) * for multi-GPU, only load what the respective GPU needs * 100% accurate check-pointing, restore from checkpoint should not read all data up to the checkpoint * support automatic bucketed batching with dynamic batch sizes * pre-fetching thread * composable, as to support for complex batching, e.g. negative samples from multiple documents ## Getting Started Infinibatch requires Python 3.6 or higher and has no dependencies. There is presently no pip package. To install it, clone this repository and install it locally. ```bash git clone https://github.com/microsoft/infinibatch cd infinibatch pip install -e . ``` ## Documentation The documentation can be found here: https://microsoft.github.io/infinibatch/ ## Tutorial This little tutorial walks you through the steps of preparing your data and consuming them from Python code as batches. ### Infinibatch Basics: Iterators and Checkpointing Infinibatch provides [Python iterators](https://docs.python.org/3.5/glossary.html#term-iterator) to read your data. An iterator represents a stream of data that can be retrieved item by item, e.g. via a `for` loop or repeatedly calling `next()` on it. Infinibatch is agnostic to the data type of the items, which is determined by a user-supplied file-read function. In NLP applications, items would typically be tuples of text. In other applications, they can be images or an audio file with a textual annotation. Infinibatch makes it easy to read your data in randomized order, and supports checkpointing, which allows you to restart training exactly where you left off. Randomization is done _on the fly_, which means that it is not necessary to read the entire data set into memory to be shuffled. Infinibatch implements a hierarchical shuffling algorithm that only holds a subset of the data in RAM at any point in time. Infinibatch iterators are _checkpointable_. Checkpointing lets you retrieve the current position (the "checkpoint") in the data stream at any time, so that later, you can "rewind" to that same position. The sad reality is that long-running trainings occasionally crash. To be able to continue a crashed training as if it had not crashed, save your Infinibatch iterator's checkpoint to disk whenever you save an intermediate model during training. To restart a crashed training, reset the iterator to the saved checkpoint. The data reader will now yield the exact same data-item sequence it would have yielded without the crash. ### Data Preparation Infinibatch has one requirement on your data organization: To use your data with Infinibatch, it must be split into a large number of small chunks. A chunk is the smallest unit of data that is loaded from disk into RAM. Infinibatch holds a random subset of chunks in memory that it randomly draws samples from. Below we want to show how such a split can be created. An easy way to split your data into chunks is with the Linux `split` command. In this tutorial, our "corpus" consists of 6 lines of text, where each line is one data item. To create that corpus, please run this command in a bash shell. It creates a 6-line text file named `corpus.txt`: ```bash echo \\ 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. The quick brown fox jumps over the lazy dog.' \\ > corpus.txt ``` Now let us split it into 3 chunks of 2 lines each. Each chunk is stored as a zipped text file. We will create them inside a new subdirectory called `corpus_chunks`: ```bash mkdir corpus_chunks split --lines 2 --numeric-suffixes \\ --filter 'gzip > corpus_chunks/$FILE.txt.gz' \\ corpus.txt corpus. ``` This will have created three files: `corpus_chunks/corpus.00.txt.gz`, `corpus_chunks/corpus.01.txt.gz`, and `corpus_chunks/corpus.02.txt.gz`. To verify whether the data has been split as expected, you can use this command: ```bash zcat corpus_chunks/corpus.*.txt.gz ``` Hint: For large corpora, we recommend replacing `gzip` by `pigz` (`apt-get install pigz`), which runs notably faster via multi-threading. ### Reading Items in Random Order With Infinibatch We will first show the easiest way to read data with Infinibatch, using the helper function `chunked_dataset_iterator``()`. This function will create an Infinibatch iterator that yields the content of your data in random order. Please the following program: ```python import gzip, glob from infinibatch import datasets as ds ds = ds.chunked_dataset_iterator( chunk_refs = glob.glob('corpus_chunks/corpus.*.txt.gz'), read_chunk_fn = lambda path: iter(gzip.decompress(open(path, "rb") \\ .read()).decode(encoding='utf-8') \\ .splitlines()), buffer_size = 6, seed = 1) for i in range(10): print(next(ds)) ``` You should get output that contains the 6 example lines in randomized order: ```text Lorem ipsum dolor sit amet, consectetur adipiscing elit, Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. The quick brown fox jumps over the lazy dog. sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. consectetur adipiscing elit, Lorem ipsum dolor sit amet, The quick brown fox jumps over the lazy dog. sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ``` Note: The `buffer_size` parameter determines how many sentences are read into memory at any given time, to draw randomized items from. In real settings with corpora of hundreds of millions of text lines, the `buffer_size` parameter should be set in the millions. RAM usage and startup time will be proportional to the buffer size (but much lower than having to load the entire corpus into RAM). ### Reading Items of Different Lengths in Batches For deep learning, we want to group multiple items into batches. For NLP tasks, items are often lines of text of varying length. Infinibatch implements an algorithm that randomizes the input sequence and groups it into batches of approximately the same length (aka _bucketing_). Infinibatch's `BucketedReadaheadBatchIterator` performs this task. It implements an algorithm modeled after the [Marian toolkit](https://github.com/marian-nmt/marian) that preloads a large number of randomized items (typically millions; in this example: 6), sorts them and groups them into batches of similar length, and then yields them, in turn, in randomized order. Here is an example. Note that the `BucketedReadaheadBatchIterator` accepts the previous randomized sentence sequence iterator (`ds`) as the source of items to randomize over. This is an example how one forms pipelines of iterators with Infinibatch (a concept familiar from Python's own `itertools`). Once an iterator is passed to another as its source, consider it owned by that other iterator, it must no longer be accessed by the calling code. ```python import gzip, glob from infinibatch import datasets as ds from infinibatch import iterators as it ds = ds.chunked_dataset_iterator( chunk_refs = glob.glob('corpus_chunks/corpus.*.txt.gz'), read_chunk_fn = lambda path: iter(gzip.decompress(open(path, "rb") \\ .read()).decode(encoding='utf-8') \\ .splitlines()), buffer_size = 6, seed = 1) bs = it.BucketedReadaheadBatchIterator( source_iterator = ds, # note: this is the iterator from above read_ahead = 6, key = lambda line: len(line), batch_size = 2, seed = 1) for i in range(25): print(next(bs)) ``` This code should output something like this: ```python ['sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.', 'The quick brown fox jumps over the lazy dog.'] ['consectetur adipiscing elit,', 'Lorem ipsum dolor sit amet,'] ['Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.', 'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.'] ``` followed by different permutations of the same tuples. As you can see, the sentences are in random order and grouped in batches of 2 of approximately the same length. You may notice that there is no variation in how the items get grouped into batches--that is an artifact of this example, and generally not the case in real use when the data size is much larger than the batch size. In NLP, sentence length often varies considerably. As a result, using batches of a fixed number of lines, as in the example above, will waste GPU RAM and cores. This is because the number of lines is limited by the longest possible sequence; batches of shorter lines would leave GPU cycles on the table. Ideally, one would use batches that have as many lines as fit into GPU RAM, given the number of tokens of the longest line in the batch. To support variable batch sizes, Infinibatch allows to pass a function as the `batch_size` parameter. That function will be given the longest item of a batch and should estimate how many items of at most this length can fit. In our example, we assume that batches can hold at most 150 tokens. Please change the above code as follows: ```python batch_size = lambda longest_line: 150 // len(longest_line), ``` The output looks like this: ``` ['consectetur adipiscing elit,', 'Lorem ipsum dolor sit amet,'] ['Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.'] ['sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.', 'The quick brown fox jumps over the lazy dog.'] ['Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.'] ``` That shorter sentences got grouped, while longer did not because they would exceed the total of 150 characters. ### Reading Batches Into Numpy Arrays Lastly, we will need to feed batches into our favorite deep-learning tool. We will show how to convert the batches of text lines into padded `numpy` arrays. In a typical NLP application, text items would be tokenized, and then each token would be represented by an index into a unit vocabulary. For simplicity, in this example each character is its own token, and each token's numeric unit index is just its ASCII code. These sequences are then padded to equal length with -1, and converted into a `numpy` array. Please rerun the previous example, but first insert the following code before the final `for` loop. This example uses an Infinibatch `MapIterator`, which applies a user-supplied function or lambda to each item: ```python import numpy as np def collate(lines_batch): # tokenize all lines in the batch and map to unit ids ids_batch = [[ord(c) for c in line] for line in lines_batch] # create a padded numpy array as wide as the longest line, # where shorter sequences are padded with -1 width = max(len(ids) for ids in ids_batch) return np.array([ids + [-1] * (width-len(ids)) for ids in ids_batch]) bs = it.MapIterator( source_iterator = bs, transform = collate) ``` This will output batches like this. Note that in batches with multiple sentences, some entries are padded with `-1`. ```python [[ 99 111 110 115 101 99 116 101 116 117 114 32 97 100 105 112 105 115 99 105 110 103 32 101 108 105 116 44] [ 76 111 114 101 109 32 105 112 115 117 109 32 100 111 108 111 114 32 115 105 116 32 97 109 101 116 44 -1]] [[ 85 116 32 101 110 105 109 32 97 100 32 109 105 110 105 109 32 118 101 110 105 97 109 44 32 113 117 105 115 32 110 111 115 116 114 117 100 32 101 120 101 114 99 105 116 97 116 105 111 110 32 117 108 108 97 109 99 111 32 108 97 98 111 114 105 115 32 110 105 115 105 32 117 116 32 97 108 105 113 117 105 112 32 101 120 32 101 97 32 99 111 109 109 111 100 111 32 99 111 110 115 101 113 117 97 116 46]] [[115 101 100 32 100 111 32 101 105 117 115 109 111 100 32 116 101 109 112 111 114 32 105 110 99 105 100 105 100 117 110 116 32 117 116 32 108 97 98 111 114 101 32 101 116 32 100 111 108 111 114 101 32 109 97 103 110 97 32 97 108 105 113 117 97 46] [ 84 104 101 32 113 117 105 99 107 32 98 114 111 119 110 32 102 111 120 32 106 117 109 112 115 32 111 118 101 114 32 116 104 101 32 108 97 122 121 32 100 111 103 46 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]] [[ 68 117 105 115 32 97 117 116 101 32 105 114 117 114 101 32 100 111 108 111 114 32 105 110 32 114 101 112 114 101 104 101 110 100 101 114 105 116 32 105 110 32 118 111 108 117 112 116 97 116 101 32 118 101 108 105 116 32 101 115 115 101 32 99 105 108 108 117 109 32 100 111 108 111 114 101 32 101 117 32 102 117 103 105 97 116 32 110 117 108 108 97 32 112 97 114 105 97 116 117 114 46]] ``` ## Where To Go From Here The above tutorial showed you the use of the most common iterator type, as created by the convenience function `chunked_dataset_iterator()`. Not all real-life scenarios are covered by this function. For example, multi-task learning scenarios require more complex combinations of data. To create those, you will need to compose the necessary data reader from the underlying building blocks. This is described at the documentation of the module `iterators`. ## Documentation To view the documentation, please clone the repository and go to docs/infinibatch/index.html When working on the documentation, install pdoc: ``` pip install pdoc3 ``` You can then start a local http server that dynamically updates the documentation: ``` pdoc --template-dir docs --http : infinibatch ``` We currently haven't set up the CI to automatically generate the documentation. Before you merge anything into master, please delete the existing documentation in docs/infinibatch and run ``` pdoc -o docs --template-dir docs --html infinibatch ``` ## Testing To run unit tests, run the following command. ``` python -m unittest discover -s test ``` If you would like the unit tests to stop after the first failed test, use: ``` python -m unittest discover -s test --failfast ``` To type-check with `mypy` (if installed): ``` mypy infinibatch ``` # Contributing This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA. This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.