mlbench_core.dataset

linearmodels

pytorch

Epsilon Logistic Regression

class mlbench_core.dataset.linearmodels.pytorch.dataloader.LMDBDataset(name, data_type, root, target_transform=None)[source]

LMDB Dataset

Parameters
  • root (string) – Either the root directory for the database files, or an absolute path pointing to the file.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
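
A minimal usage sketch; the LMDB path is a placeholder, and the values for the name and data_type arguments (which are not documented above) are illustrative:

    from torch.utils.data import DataLoader

    from mlbench_core.dataset.linearmodels.pytorch.dataloader import LMDBDataset

    # "epsilon" and "train" are illustrative values for the undocumented
    # name and data_type arguments; the LMDB path is a placeholder.
    dataset = LMDBDataset(name="epsilon", data_type="train",
                          root="/data/epsilon_train.lmdb")
    loader = DataLoader(dataset, batch_size=128, shuffle=True)

    for features, target in loader:
        pass  # one optimisation step per (features, target) batch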

class mlbench_core.dataset.linearmodels.pytorch.dataloader.LMDBPTClass(root, transform=None, target_transform=None)[source]

LMDB Dataset loader Class

Parameters
  • root (string) – Either the root directory for the database files, or an absolute path pointing to the file.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
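
A similar sketch, passing a torchvision transform as suggested above (the path and crop size are placeholders):

    from torchvision import transforms

    from mlbench_core.dataset.linearmodels.pytorch.dataloader import LMDBPTClass

    dataset = LMDBPTClass(
        root="/data/train.lmdb",  # placeholder path
        transform=transforms.Compose([
            transforms.RandomCrop(32),
            transforms.ToTensor(),
        ]),
    )
    image, label = dataset[0]  # a transformed image and its label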

imagerecognition

pytorch

CIFAR10V1

class mlbench_core.dataset.imagerecognition.pytorch.dataloader.CIFAR10V1(root, train=True, download=False)[source]

CIFAR10V1 Dataset.

Loads CIFAR10V1 images with mean and std-dev normalisation. Performs random crop and random horizontal flip on the training set, and only normalisation on the validation set. Based on torchvision.datasets.CIFAR10 and the PyTorch CIFAR-10 example.

Parameters
  • root (str) – Root folder for the dataset

  • train (bool) – Whether to get the train or validation set (default=True)

  • download (bool) – Whether to download the dataset if it’s not present
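
A minimal sketch loading both splits and wrapping them in PyTorch data loaders (the root path is a placeholder):

    from torch.utils.data import DataLoader

    from mlbench_core.dataset.imagerecognition.pytorch.dataloader import CIFAR10V1

    train_set = CIFAR10V1(root="/data/cifar10", train=True, download=True)
    val_set = CIFAR10V1(root="/data/cifar10", train=False, download=True)

    train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=256, shuffle=False)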

Imagenet

class mlbench_core.dataset.imagerecognition.pytorch.dataloader.Imagenet(root, train=True)[source]

Imagenet (ILSVRC2017) Dataset.

Loads Imagenet images with mean and std-dev normalisation. Performs random crop and random horizontal flip on the training set, and resize + center crop on the validation set. Based on torchvision.datasets.ImageFolder.

Parameters
  • root (str) – Root folder of Imagenet dataset (without train/ or val/)

  • train (bool) – Whether to get the train or validation set (default=True)
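
A minimal sketch, assuming the folder layout described above (root contains the train/ and val/ subfolders; the path is a placeholder):

    from torch.utils.data import DataLoader

    from mlbench_core.dataset.imagerecognition.pytorch.dataloader import Imagenet

    train_set = Imagenet(root="/data/imagenet", train=True)
    val_set = Imagenet(root="/data/imagenet", train=False)

    train_loader = DataLoader(train_set, batch_size=256, shuffle=True,
                              num_workers=8)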

tensorflow

DatasetCifar

class mlbench_core.dataset.imagerecognition.tensorflow.DatasetCifar(dataset, dataset_root, batch_size, world_size, rank, seed, tf_dtype=tf.float32)[source]

This class is adapted from the following script: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_input.py

Parameters
  • dataset (str) – Name of the dataset, e.g. cifar-10 or cifar-100.

  • dataset_root (str) – Root directory of the dataset.

  • batch_size (int) – Size of each batch.

  • world_size (int) – Total number of workers (world size).

  • rank (int) – Rank of the process.

  • seed (int) – Seed for random number generation.

  • tf_dtype (tensorflow.python.framework.dtypes.DType, optional) – Defaults to tf.float32. Data type of the tensors.
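
A construction sketch for a two-process job; the dataset root is a placeholder, and rank would normally come from the process launcher:

    import tensorflow as tf

    from mlbench_core.dataset.imagerecognition.tensorflow import DatasetCifar

    data = DatasetCifar(
        dataset="cifar-10",
        dataset_root="/data/cifar10",  # placeholder path
        batch_size=128,
        world_size=2,   # two workers in total
        rank=0,         # this worker's rank
        seed=42,
        tf_dtype=tf.float32,
    )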

input_fn(self, is_train, repeat_count=-1, num_shards=1, shard_index=0)

Input function using the tf.data input pipeline for the CIFAR-10 dataset.

In synchronized training, faster nodes may consume more batches than are available; the dataset is therefore repeated enough times to avoid running out of data.

In distributed settings, the dataset is split into num_shards non-overlapping parts, and each process reads the one shard selected by its index.

Parameters
  • is_train (bool) – A boolean denoting whether the input is for training.

  • repeat_count (int) – Defaults to -1. Number of times the dataset is repeated, with -1 meaning infinite repetition.

  • num_shards (int) – Defaults to 1. Number of shards the dataset is split into.

  • shard_index (int) – Defaults to 0. Index of shard to use.

Returns

tf.data.Dataset object of ((inputs, labels), is_train).
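
Continuing the construction sketch above, one process's shard of the training pipeline might be requested as:

    # Infinite repetition so faster workers never exhaust the dataset;
    # each of the two workers reads one non-overlapping shard.
    train_data = data.input_fn(
        is_train=True,
        repeat_count=-1,
        num_shards=2,
        shard_index=0,  # this worker's rank
    )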

maybe_download_and_extract(self)

Download and extract the tarball from Alex’s website.

parse_record(self, raw_record)

Parse CIFAR-10/100 image and label from a raw record.

preprocess_image(self, image, is_training)

Preprocess a single image of layout [height, width, depth].

record_dataset(self, filenames)

Returns an input pipeline Dataset from filenames.

NLP

pytorch

Translation WMT16

class mlbench_core.dataset.nlp.pytorch.WMT16Dataset(root, lang=('en', 'de'), math_precision=None, download=True, train=False, validation=False, lazy=False, preprocessed=False, sort=False, min_len=0, max_len=None, max_size=None)[source]

Dataset for WMT16 English-to-German translation.

Parameters
  • root (str) – Root folder where to download files

  • lang (tuple) – Language translation pair

  • math_precision (str) – One of fp16 or fp32. The precision used during training.

  • download (bool) – Download the dataset from source

  • train (bool) – Load train set

  • validation (bool) – Load validation set

  • lazy (bool) – Load the dataset in a lazy format

  • min_len (int) – Minimum sentence length

  • max_len (int | None) – Maximum sentence length

  • max_size (int | None) – Maximum dataset size
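
A loading sketch; the root path and the sentence-length cap are illustrative:

    from mlbench_core.dataset.nlp.pytorch import WMT16Dataset

    train_set = WMT16Dataset(
        root="/data/wmt16",   # placeholder path
        lang=("en", "de"),
        math_precision="fp16",
        download=True,
        train=True,
        max_len=75,           # illustrative length cap
    )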

class mlbench_core.dataset.nlp.pytorch.wmt16.wmt16_tokenizer.WMT16Tokenizer(base_dir, math_precision=None, separator='@@')[source]

Tokenizer Class for WMT16 that uses the whole vocabulary

Parameters
  • base_dir (str) – Base directory for files

  • math_precision (str) – Math precision

  • separator (str) – BPE separator symbol (defaults to '@@')

detokenize(self, inputs, delim=' ')[source]

Detokenizes a single sentence and removes token separator characters.

Parameters
  • inputs – sequence of tokens

  • delim – tokenization delimiter

Returns

String representing the detokenized sentence.

segment(self, line)[source]

Tokenizes a single sentence and adds special BOS and EOS tokens.

Parameters
  • line – sentence to tokenize

Returns

List representing the tokenized sentence.
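
A round-trip sketch, assuming the output of segment can be passed back to detokenize (base_dir is a placeholder pointing at the vocabulary files):

    from mlbench_core.dataset.nlp.pytorch.wmt16.wmt16_tokenizer import WMT16Tokenizer

    tokenizer = WMT16Tokenizer(base_dir="/data/wmt16")  # placeholder path

    tokens = tokenizer.segment("a test sentence")  # BOS + tokens + EOS
    text = tokenizer.detokenize(tokens)            # back to a plain string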

Translation WMT17

class mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary(pad='<pad>_', eos='<EOS>_')[source]

Dictionary Class for WMT17 Dataset. Essentially a mapping from symbols to consecutive integers

Parameters
  • pad (str) – Padding symbol to use

  • eos (str) – End-of-sentence symbol to use

add_symbol(self, word, n=1)

Adds a word to the dictionary

eos(self)

Helper to get index of end-of-sentence symbol

index(self, sym)

Returns the index of the specified symbol

classmethod load(cls, f, ignore_utf_errors=False)

Loads the dictionary from a text file with the format:

    <symbol0> <symbol1> ...

Parameters
  • f (str) – Dictionary file name

  • ignore_utf_errors (bool) – Ignore UTF-8 related errors

pad(self)

Helper to get index of pad symbol

string(self, tensor, bpe_symbol=None)

Helper for converting a tensor of token indices to a string.

Can optionally remove BPE symbols or escape <unk> words.

update(self, new_dict)

Updates counts from new dictionary.
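
A small sketch of building and querying a vocabulary; the symbols are illustrative, and a prebuilt vocabulary could instead be read with Dictionary.load:

    from mlbench_core.dataset.nlp.pytorch.wmt17 import Dictionary

    vocab = Dictionary()          # registers the pad and eos symbols
    for word in ["ein", "kleiner", "test"]:
        vocab.add_symbol(word)

    idx = vocab.index("test")             # integer index of a known symbol
    pad_idx, eos_idx = vocab.pad(), vocab.eos()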

mlbench_core.dataset.nlp.pytorch.wmt17.collate_batch(samples, pad_idx, eos_idx, left_pad_source=True, left_pad_target=False, bsz_mult=8, seq_len_multiple=1)[source]

Collate a list of samples into a batch

Parameters
  • samples (list[dict]) – Samples to collate

  • pad_idx (int) – Padding symbol index

  • eos_idx (int) – EOS symbol index

  • left_pad_source (bool) – Pad sources on the left

  • left_pad_target (bool) – Pad targets on the left

  • bsz_mult (int) – Batch size multiple

  • seq_len_multiple (int) – Sequence length multiple

Returns

Batch containing the keys id (list of indices), ntokens (total number of tokens), net_input and target.

Return type

(dict)
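
A hedged collation sketch; the per-sample layout used here (dicts with id, source and target tensors) is an assumption in the fairseq style and is not specified above:

    import torch

    from mlbench_core.dataset.nlp.pytorch.wmt17 import Dictionary, collate_batch

    vocab = Dictionary()
    samples = [
        # assumed layout: one dict per sentence pair, tensors of token indices
        {"id": 0, "source": torch.tensor([4, 5, 6]), "target": torch.tensor([7, 8])},
        {"id": 1, "source": torch.tensor([9, 10]), "target": torch.tensor([11])},
    ]

    batch = collate_batch(
        samples,
        pad_idx=vocab.pad(),
        eos_idx=vocab.eos(),
        left_pad_source=True,
        left_pad_target=False,
    )
    print(batch["ntokens"])  # total number of tokens in the batch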

Language Modeling WikiText2