mlbench_core.dataset

linearmodels

pytorch

Epsilon Logistic Regression

class mlbench_core.dataset.linearmodels.pytorch.dataloader.LMDBDataset(name, data_type, root, target_transform=None)[source]

LMDB Dataset

Parameters
  • root (string) – Either the root directory for the database files, or an absolute path pointing to the file.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
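
A minimal usage sketch; the LMDB path is a placeholder, and the values for the name and data_type arguments (which are not documented above) are illustrative:

    from torch.utils.data import DataLoader

    from mlbench_core.dataset.linearmodels.pytorch.dataloader import LMDBDataset

    # "epsilon" and "train" are illustrative values for the undocumented
    # name and data_type arguments; the LMDB path is a placeholder.
    dataset = LMDBDataset(name="epsilon", data_type="train",
                          root="/data/epsilon_train.lmdb")
    loader = DataLoader(dataset, batch_size=128, shuffle=True)

    for features, target in loader:
        pass  # one optimisation step per (features, target) batch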

class mlbench_core.dataset.linearmodels.pytorch.dataloader.LMDBPTClass(root, transform=None, target_transform=None)[source]

LMDB Dataset loader Class

Parameters
  • root (string) – Either the root directory for the database files, or an absolute path pointing to the file.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
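
A similar sketch, passing a torchvision transform as suggested above (the path and crop size are placeholders):

    from torchvision import transforms

    from mlbench_core.dataset.linearmodels.pytorch.dataloader import LMDBPTClass

    dataset = LMDBPTClass(
        root="/data/train.lmdb",  # placeholder path
        transform=transforms.Compose([
            transforms.RandomCrop(32),
            transforms.ToTensor(),
        ]),
    )
    image, label = dataset[0]  # a transformed image and its label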

imagerecognition

pytorch

CIFAR10V1

class mlbench_core.dataset.imagerecognition.pytorch.dataloader.CIFAR10V1(root, train=True, download=False)[source]

CIFAR10V1 Dataset.

Loads CIFAR10V1 images with mean and std-dev normalisation. Performs random crop and random horizontal flip on the training set, and only normalisation on the validation set. Based on torchvision.datasets.CIFAR10 and the PyTorch CIFAR-10 example.

Parameters
  • root (str) – Root folder for the dataset

  • train (bool) – Whether to get the train or validation set (default=True)

  • download (bool) – Whether to download the dataset if it’s not present
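
A minimal sketch loading both splits and wrapping them in PyTorch data loaders (the root path is a placeholder):

    from torch.utils.data import DataLoader

    from mlbench_core.dataset.imagerecognition.pytorch.dataloader import CIFAR10V1

    train_set = CIFAR10V1(root="/data/cifar10", train=True, download=True)
    val_set = CIFAR10V1(root="/data/cifar10", train=False, download=True)

    train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=256, shuffle=False)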

Imagenet

class mlbench_core.dataset.imagerecognition.pytorch.dataloader.Imagenet(root, train=True)[source]

Imagenet (ILSVRC2017) Dataset.

Loads Imagenet images with mean and std-dev normalisation. Performs random crop and random horizontal flip on the training set, and resize + center crop on the validation set. Based on torchvision.datasets.ImageFolder.

Parameters
  • root (str) – Root folder of Imagenet dataset (without train/ or val/)

  • train (bool) – Whether to get the train or validation set (default=True)
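
A minimal sketch, assuming the folder layout described above (root contains the train/ and val/ subfolders; the path is a placeholder):

    from torch.utils.data import DataLoader

    from mlbench_core.dataset.imagerecognition.pytorch.dataloader import Imagenet

    train_set = Imagenet(root="/data/imagenet", train=True)
    val_set = Imagenet(root="/data/imagenet", train=False)

    train_loader = DataLoader(train_set, batch_size=256, shuffle=True,
                              num_workers=8)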

tensorflow

DatasetCifar

class mlbench_core.dataset.imagerecognition.tensorflow.DatasetCifar(dataset, dataset_root, batch_size, world_size, rank, seed, tf_dtype=tf.float32)[source]

This class is adapted from the following script: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_input.py

Parameters
  • dataset (str) – Name of the dataset, e.g. cifar-10 or cifar-100.

  • dataset_root (str) – Root directory of the dataset.

  • batch_size (int) – Size of each batch.

  • world_size (int) – Total number of workers (world size).

  • rank (int) – Rank of the process.

  • seed (int) – Seed for random number generation.

  • tf_dtype (tensorflow.python.framework.dtypes.DType, optional) – Defaults to tf.float32. Data type of the tensors.
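
A construction sketch for a two-process job; the dataset root is a placeholder, and rank would normally come from the process launcher:

    import tensorflow as tf

    from mlbench_core.dataset.imagerecognition.tensorflow import DatasetCifar

    data = DatasetCifar(
        dataset="cifar-10",
        dataset_root="/data/cifar10",  # placeholder path
        batch_size=128,
        world_size=2,   # two workers in total
        rank=0,         # this worker's rank
        seed=42,
        tf_dtype=tf.float32,
    )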

input_fn(self, is_train, repeat_count=-1, num_shards=1, shard_index=0)

Input function using the tf.data input pipeline for the CIFAR-10 dataset.

In synchronized training, faster nodes may consume more batches than are available; the dataset is therefore repeated enough times to avoid running out of data.

In distributed settings, the dataset is split into num_shards non-overlapping parts, and each process reads the one shard selected by its index.

Parameters
  • is_train (bool) – A boolean denoting whether the input is for training.

  • repeat_count (int) – Defaults to -1. Number of times the dataset is repeated, with -1 meaning infinite repetition.

  • num_shards (int) – Defaults to 1. Number of shards the dataset is split into.

  • shard_index (int) – Defaults to 0. Index of shard to use.

Returns

tf.data.Dataset object of ((inputs, labels), is_train).
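
Continuing the construction sketch above, one process's shard of the training pipeline might be requested as:

    # Infinite repetition so faster workers never exhaust the dataset;
    # each of the two workers reads one non-overlapping shard.
    train_data = data.input_fn(
        is_train=True,
        repeat_count=-1,
        num_shards=2,
        shard_index=0,  # this worker's rank
    )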

maybe_download_and_extract(self)

Download and extract the tarball from Alex’s website.

parse_record(self, raw_record)

Parse CIFAR-10/100 image and label from a raw record.

preprocess_image(self, image, is_training)

Preprocess a single image of layout [height, width, depth].

record_dataset(self, filenames)

Returns an input pipeline Dataset from filenames.

NLP

pytorch

Translation WMT16

class mlbench_core.dataset.nlp.pytorch.WMT16Dataset(root, lang=('en', 'de'), math_precision=None, download=True, train=False, validation=False, lazy=False, preprocessed=False, sort=False, min_len=0, max_len=None, max_size=None)[source]

Dataset for WMT16 English-to-German translation.

Parameters
  • root (str) – Root folder where to download files

  • lang (tuple) – Language translation pair

  • math_precision (str) – One of fp16 or fp32. The precision used during training.

  • download (bool) – Download the dataset from source

  • train (bool) – Load train set

  • validation (bool) – Load validation set

  • lazy (bool) – Load the dataset in a lazy format

  • min_len (int) – Minimum sentence length

  • max_len (int | None) – Maximum sentence length

  • max_size (int | None) – Maximum dataset size
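
A loading sketch; the root path and the sentence-length cap are illustrative:

    from mlbench_core.dataset.nlp.pytorch import WMT16Dataset

    train_set = WMT16Dataset(
        root="/data/wmt16",   # placeholder path
        lang=("en", "de"),
        math_precision="fp16",
        download=True,
        train=True,
        max_len=75,           # illustrative length cap
    )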

class mlbench_core.dataset.nlp.pytorch.wmt16.wmt16_tokenizer.WMT16Tokenizer(base_dir, math_precision=None, separator='@@')[source]

Tokenizer Class for WMT16 that uses the whole vocabulary

Parameters
  • base_dir (str) – Base directory for files

  • math_precision (str) – Math precision

  • separator (str) – BPE separator symbol (defaults to '@@')

detokenize(self, inputs, delim=' ')[source]

Detokenizes a single sentence and removes token separator characters.

Parameters
  • inputs – sequence of tokens

  • delim – tokenization delimiter

Returns

String representing the detokenized sentence.

segment(self, line)[source]

Tokenizes a single sentence and adds special BOS and EOS tokens.

Parameters
  • line – sentence to tokenize

Returns

List representing the tokenized sentence.
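
A round-trip sketch, assuming the output of segment can be passed back to detokenize (base_dir is a placeholder pointing at the vocabulary files):

    from mlbench_core.dataset.nlp.pytorch.wmt16.wmt16_tokenizer import WMT16Tokenizer

    tokenizer = WMT16Tokenizer(base_dir="/data/wmt16")  # placeholder path

    tokens = tokenizer.segment("a test sentence")  # BOS + tokens + EOS
    text = tokenizer.detokenize(tokens)            # back to a plain string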

Translation WMT17

class mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary(pad='<pad>_', eos='<EOS>_')[source]

Dictionary Class for WMT17 Dataset. Essentially a mapping from symbols to consecutive integers

Parameters
  • pad (str) – Padding symbol to use

  • eos (str) – End-of-sentence symbol to use

add_symbol(self, word, n=1)

Adds a word to the dictionary

eos(self)

Helper to get index of end-of-sentence symbol

index(self, sym)

Returns the index of the specified symbol

classmethod load(cls, f, ignore_utf_errors=False)

Loads the dictionary from a text file with the format:

    <symbol0> <symbol1> ...

Parameters
  • f (str) – Dictionary file name

  • ignore_utf_errors (bool) – Ignore UTF-8 related errors

pad(self)

Helper to get index of pad symbol

string(self, tensor, bpe_symbol=None)

Helper for converting a tensor of token indices to a string.

Can optionally remove BPE symbols or escape <unk> words.

update(self, new_dict)

Updates counts from new dictionary.
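
A small sketch of building and querying a vocabulary; the symbols are illustrative, and a prebuilt vocabulary could instead be read with Dictionary.load:

    from mlbench_core.dataset.nlp.pytorch.wmt17 import Dictionary

    vocab = Dictionary()          # registers the pad and eos symbols
    for word in ["ein", "kleiner", "test"]:
        vocab.add_symbol(word)

    idx = vocab.index("test")             # integer index of a known symbol
    pad_idx, eos_idx = vocab.pad(), vocab.eos()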

mlbench_core.dataset.nlp.pytorch.wmt17.collate_batch(samples, pad_idx, eos_idx, left_pad_source=True, left_pad_target=False, bsz_mult=8, seq_len_multiple=1)[source]

Collate a list of samples into a batch

Parameters
  • samples (list[dict]) – Samples to collate

  • pad_idx (int) – Padding symbol index

  • eos_idx (int) – EOS symbol index

  • left_pad_source (bool) – Pad sources on the left

  • left_pad_target (bool) – Pad targets on the left

  • bsz_mult (int) – Batch size multiple

  • seq_len_multiple (int) – Sequence length multiple

Returns

Batch containing the keys id (list of indices), ntokens (total number of tokens), net_input and target.

Return type

(dict)
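
A hedged collation sketch; the per-sample layout used here (dicts with id, source and target tensors) is an assumption in the fairseq style and is not specified above:

    import torch

    from mlbench_core.dataset.nlp.pytorch.wmt17 import Dictionary, collate_batch

    vocab = Dictionary()
    samples = [
        # assumed layout: one dict per sentence pair, tensors of token indices
        {"id": 0, "source": torch.tensor([4, 5, 6]), "target": torch.tensor([7, 8])},
        {"id": 1, "source": torch.tensor([9, 10]), "target": torch.tensor([11])},
    ]

    batch = collate_batch(
        samples,
        pad_idx=vocab.pad(),
        eos_idx=vocab.eos(),
        left_pad_source=True,
        left_pad_target=False,
    )
    print(batch["ntokens"])  # total number of tokens in the batch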

Language Modeling WikiText2