mlbench_core.dataset¶
linearmodels¶
pytorch¶
Epsilon Logistic Regression¶
- class mlbench_core.dataset.linearmodels.pytorch.dataloader.LMDBDataset(name, data_type, root, target_transform=None)[source]¶
LMDB Dataset
- Parameters
root (string) – Either the root directory for the database files, or an absolute path pointing to the file.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
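A minimal usage sketch follows. The name and data_type values are illustrative assumptions, since their accepted values are not documented here:

    from mlbench_core.dataset.linearmodels.pytorch.dataloader import LMDBDataset

    # "epsilon" / "train" are hypothetical values for the undocumented
    # `name` and `data_type` arguments; check the source for the exact strings.
    dataset = LMDBDataset(name="epsilon", data_type="train", root="/data/epsilon.lmdb")

    # Standard torch Dataset protocol; the (features, target) layout is assumed.
    features, target = dataset[0]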
- class mlbench_core.dataset.linearmodels.pytorch.dataloader.LMDBPTClass(root, transform=None, target_transform=None)[source]¶
LMDB Dataset loader Class
- Parameters
root (string) – Either the root directory for the database files, or an absolute path pointing to the file.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
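Since the transform operates on PIL images, a torchvision pipeline fits naturally; a sketch with illustrative paths:

    import torchvision.transforms as transforms
    from mlbench_core.dataset.linearmodels.pytorch.dataloader import LMDBPTClass

    transform = transforms.Compose([
        transforms.RandomCrop(32),
        transforms.ToTensor(),
    ])
    # Path is illustrative; point `root` at your LMDB directory or file.
    dataset = LMDBPTClass(root="/data/lmdb/train", transform=transform)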
imagerecognition¶
pytorch¶
CIFAR10V1¶
- class mlbench_core.dataset.imagerecognition.pytorch.dataloader.CIFAR10V1(root, train=True, download=False)[source]¶
CIFAR10V1 Dataset.
Loads CIFAR10V1 images with mean and std-dev normalisation. Performs random crop and random horizontal flip on train and only normalisation on val. Based on torchvision.datasets.CIFAR10 and Pytorch CIFAR 10 Example.
- Parameters
root (str) – Root folder for the dataset
train (bool) – Whether to get the train or validation set (default=True)
download (bool) – Whether to download the dataset if it’s not present
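A minimal sketch of wrapping the training split in a PyTorch DataLoader (the root path and batch size are illustrative):

    from torch.utils.data import DataLoader
    from mlbench_core.dataset.imagerecognition.pytorch.dataloader import CIFAR10V1

    train_set = CIFAR10V1(root="/data/cifar10", train=True, download=True)
    train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

    # Each batch holds normalised, randomly cropped and flipped images.
    images, labels = next(iter(train_loader))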
Imagenet¶
- class mlbench_core.dataset.imagerecognition.pytorch.dataloader.Imagenet(root, train=True)[source]¶
Imagenet (ILSVRC2017) Dataset.
Loads Imagenet images with mean and std-dev normalisation. Performs random crop and random horizontal flip on train and resize + center crop on val. Based on torchvision.datasets.ImageFolder
- Parameters
root (str) – Root folder of Imagenet dataset (without train/ or val/)
train (bool) – Whether to get the train or validation set (default=True)
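A sketch of loading the validation split (the root path is illustrative; per the root parameter note above, the class resolves the train/ and val/ subfolders itself):

    from torch.utils.data import DataLoader
    from mlbench_core.dataset.imagerecognition.pytorch.dataloader import Imagenet

    # `root` excludes train/ or val/; the class picks the split via `train`.
    val_set = Imagenet(root="/data/imagenet", train=False)
    val_loader = DataLoader(val_set, batch_size=64, shuffle=False)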
tensorflow¶
DatasetCifar¶
- class mlbench_core.dataset.imagerecognition.tensorflow.DatasetCifar(dataset, dataset_root, batch_size, world_size, rank, seed, tf_dtype=tf.float32)[source]¶
This class is adapted from the following script: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_input.py
- Parameters
dataset (str) – Name of the dataset, e.g. cifar-10 or cifar-100.
dataset_root (str) – Root directory of the dataset.
batch_size (int) – Size of each batch.
world_size (int) – Number of processes participating in training (the world size).
rank (int) – Rank of the current process.
seed (int) – Seed for the random number generator.
tf_dtype (tensorflow.python.framework.dtypes.DType, optional) – Defaults to tf.float32. Datatype of the tensors.
- input_fn(self, is_train, repeat_count=-1, num_shards=1, shard_index=0)¶
Input function using the tf.data input pipeline for the CIFAR-10 dataset; a usage sketch follows this class entry.
In synchronized training, faster nodes may consume more batches than are available, so the dataset is repeated enough times to avoid running out of data.
In distributed settings, the dataset is split into num_shards non-overlapping parts and each process takes one shard according to its index.
- Parameters
is_train (bool) – A boolean denoting whether the input is for training.
repeat_count (int) – Defaults to -1. Number of times the dataset is repeated, with -1 meaning infinite repetition.
num_shards (int) – Defaults to 1. Number of shards the dataset is split into.
shard_index (int) – Defaults to 0. Index of shard to use.
- Returns
tf.data.Dataset object of ((inputs, labels), is_train).
- maybe_download_and_extract(self)¶
Download and extract the tarball from Alex Krizhevsky’s website.
- parse_record(self, raw_record)¶
Parse CIFAR-10/100 image and label from a raw record.
- preprocess_image(self, image, is_training)¶
Preprocess a single image of layout [height, width, depth].
- record_dataset(self, filenames)¶
Returns an input pipeline Dataset from filenames.
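A sketch of a two-process sharded input pipeline, combining the constructor with input_fn (paths and hyperparameters are illustrative; rank and shard_index would differ per process):

    import tensorflow as tf
    from mlbench_core.dataset.imagerecognition.tensorflow import DatasetCifar

    cifar = DatasetCifar(dataset="cifar-10", dataset_root="/data/cifar",
                         batch_size=128, world_size=2, rank=0, seed=42)

    # Shard the training data across the two processes; this process (rank 0)
    # reads shard 0 and repeats the data indefinitely.
    train_data = cifar.input_fn(is_train=True, repeat_count=-1,
                                num_shards=2, shard_index=0)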
NLP¶
pytorch¶
Translation WMT16¶
- class mlbench_core.dataset.nlp.pytorch.WMT16Dataset(root, lang=('en', 'de'), math_precision=None, download=True, train=False, validation=False, lazy=False, preprocessed=False, sort=False, min_len=0, max_len=None, max_size=None)[source]¶
Dataset for WMT16 English-to-German translation
- Parameters
root (str) – Root folder where to download files
lang (tuple) – Language translation pair
math_precision (str) – One of fp16 or fp32. The precision used during training
download (bool) – Download the dataset from source
train (bool) – Load train set
validation (bool) – Load validation set
lazy (bool) – Load the dataset in a lazy format
min_len (int) – Minimum sentence length
max_len (int | None) – Maximum sentence length
max_size (int | None) – Maximum dataset size
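A construction sketch; the root path and length bounds are illustrative, and the per-item layout is an assumption rather than documented behaviour:

    from mlbench_core.dataset.nlp.pytorch import WMT16Dataset

    train_set = WMT16Dataset(root="/data/wmt16", lang=("en", "de"),
                             math_precision="fp32", download=True,
                             train=True, min_len=2, max_len=75)

    # Assumed to yield (source, target) token sequences; check the source
    # for the exact item structure.
    src, tgt = train_set[0]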
- class mlbench_core.dataset.nlp.pytorch.wmt16.wmt16_tokenizer.WMT16Tokenizer(base_dir, math_precision=None, separator='@@')[source]¶
Tokenizer Class for WMT16 that uses the whole vocabulary
- Parameters
base_dir (str) – Base directory for files
math_precision (str) – Math precision
separator (str) – BPE separator symbol
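A construction sketch, assuming base_dir holds the vocabulary files produced during preprocessing:

    from mlbench_core.dataset.nlp.pytorch.wmt16.wmt16_tokenizer import WMT16Tokenizer

    # base_dir is assumed to contain the WMT16 vocabulary files; '@@' is the
    # default BPE separator.
    tokenizer = WMT16Tokenizer(base_dir="/data/wmt16", math_precision="fp32")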
Translation WMT17¶
- class mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary(pad='<pad>_', eos='<EOS>_')[source]¶
Dictionary Class for WMT17 Dataset. Essentially a mapping from symbols to consecutive integers
- Parameters
pad (str) – Padding symbol to use
eos (str) – End of String symbol to use
- add_symbol(self, word, n=1)¶
Adds a word to the dictionary
- eos(self)¶
Helper to get index of end-of-sentence symbol
- index(self, sym)¶
Returns the index of the specified symbol
- classmethod load(cls, f, ignore_utf_errors=False)¶
Loads the dictionary from a text file with the format:
<symbol0> <symbol1> ...
- Parameters
f (str) – Dictionary file name
ignore_utf_errors (bool) – Ignore UTF-8 related errors
- pad(self)¶
Helper to get index of pad symbol
- string(self, tensor, bpe_symbol=None)¶
Helper for converting a tensor of token indices to a string.
Can optionally remove BPE symbols or escape <unk> words.
- update(self, new_dict)¶
Updates counts from new dictionary.
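A short sketch of the Dictionary API described above (the vocabulary file name in the comment is hypothetical):

    from mlbench_core.dataset.nlp.pytorch.wmt17 import Dictionary

    d = Dictionary()            # default <pad>_ / <EOS>_ symbols
    d.add_symbol("hello")
    d.add_symbol("world", n=3)  # n is assumed to be the symbol's count

    print(d.index("hello"))     # consecutive integer id of the symbol
    print(d.pad(), d.eos())     # indices of the special symbols

    # Loading from a saved vocabulary file (file name hypothetical):
    # d = Dictionary.load("dict.en.txt")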
- mlbench_core.dataset.nlp.pytorch.wmt17.collate_batch(samples, pad_idx, eos_idx, left_pad_source=True, left_pad_target=False, bsz_mult=8, seq_len_multiple=1)[source]¶
Collate a list of samples into a batch
- Parameters
samples (list[dict]) – Samples to collate
pad_idx (int) – Padding symbol index
eos_idx (int) – EOS symbol index
left_pad_source (bool) – Pad sources on the left
left_pad_target (bool) – Pad targets on the left
bsz_mult (int) – Batch size multiple
seq_len_multiple (int) – Sequence length multiple
- Returns
A dict containing the keys id (list of indices), ntokens (total number of tokens), net_input and target
- Return type
(dict)
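A sketch of collating two toy samples. The id/source/target sample layout is an assumption based on fairseq-style batching; consult the dataset code for the exact fields:

    import torch
    from mlbench_core.dataset.nlp.pytorch.wmt17 import Dictionary, collate_batch

    d = Dictionary()
    # Token ids below are arbitrary; each sequence ends with the EOS index.
    samples = [
        {"id": 0, "source": torch.tensor([4, 5, d.eos()]),
         "target": torch.tensor([6, d.eos()])},
        {"id": 1, "source": torch.tensor([7, d.eos()]),
         "target": torch.tensor([8, 9, d.eos()])},
    ]
    batch = collate_batch(samples, pad_idx=d.pad(), eos_idx=d.eos())
    print(batch["ntokens"])  # total number of tokens, per the Returns note above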