mlbench_core.controlflow

pytorch

Controlflow

mlbench_core.controlflow.pytorch.validation_round(dataloader, model, loss_function, metrics, dtype, tracker=None, transform_target_type=False, use_cuda=False, max_batches=None)

Evaluate the model on the test dataset.

Parameters

dataloader (torch.utils.data.DataLoader) – The validation set
model (torch.nn.Module) – The model to evaluate
loss_function (torch.nn.Module) – The loss function
metrics (list) – List of metrics to track
dtype (str) – The datatype to use, one of fp32 or fp64
tracker (mlbench_core.utils.Tracker | None) – Tracker object to use
transform_target_type (bool) – Convert target to dtype. Default: False
use_cuda (bool) – Whether to use the GPU. Default: False
max_batches (int | None) – Maximum number of batches to validate on

Returns

Dictionary of average of each metric, and average validation loss

Return type

(dict, float)
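The averaging that this function returns can be illustrated in plain Python. This is only a sketch under stated assumptions, not the library's implementation: `average_validation`, the `predict` callable, and the name-to-callable `metrics` dictionary are hypothetical stand-ins, and plain floats take the place of tensors.

```python
from itertools import islice


def average_validation(batches, predict, loss_function, metrics, max_batches=None):
    """Average each metric and the loss over at most `max_batches` batches.

    Hypothetical stand-in for a validation round: `batches` yields
    (data, target) pairs, `predict` plays the role of the model, and
    `metrics` maps a metric name to a callable(output, target) -> float.
    """
    totals = {name: 0.0 for name in metrics}
    total_loss = 0.0
    count = 0
    # islice with None as the stop value consumes the whole iterable.
    for data, target in islice(batches, max_batches):
        output = predict(data)
        total_loss += loss_function(output, target)
        for name, metric in metrics.items():
            totals[name] += metric(output, target)
        count += 1
    if count == 0:
        raise ValueError("no batches to validate on")
    averages = {name: total / count for name, total in totals.items()}
    return averages, total_loss / count


# Toy usage: the "model" doubles its input, the loss is squared error.
batches = [(1.0, 2.0), (2.0, 5.0), (3.0, 6.0)]
metrics = {"abs_error": lambda out, tgt: abs(out - tgt)}
averages, avg_loss = average_validation(
    batches,
    predict=lambda x: 2 * x,
    loss_function=lambda out, tgt: (out - tgt) ** 2,
    metrics=metrics,
    max_batches=2,
)
```

With `max_batches=2` only the first two batches contribute, which mirrors how `max_batches` caps the amount of validation data.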

mlbench_core.controlflow.pytorch.record_train_batch_stats(batch_idx, loss, output, metric_results, tracker, num_batches_per_device_train)

Record the stats in a training batch.

Parameters

batch_idx (int) – The id of the current batch
loss (float) – The loss of the batch
output (torch.Tensor) – The model output
metric_results (dict of mlbench_core.evaluation.pytorch.metrics.MLBenchMetric: float) – Metrics and their values
tracker (mlbench_core.utils.Tracker) – Tracker object to use.
num_batches_per_device_train (int) – Number of batches per train epoch

mlbench_core.controlflow.pytorch.record_validation_stats(metrics_values, loss, tracker=None, rank=0)

Records the stats of a previously run validation.

Parameters

metrics_values (dict) – Dictionary of each metric’s average.
loss (float) – Validation loss
tracker (mlbench_core.utils.Tracker, optional) – Tracker object to use.
rank (int) – Current distributed rank

Returns

Whether this validation round is the best

Return type

(bool)

CheckpointsEvaluationControlFlow

class mlbench_core.controlflow.pytorch.CheckpointsEvaluationControlFlow(ckpt_dir, rank, world_size, checkpointer, model, epochs, loss_function, metrics, use_cuda=False, dtype=None, max_batch_per_epoch=None)

Evaluate models on training / validation dataset.

Parameters

ckpt_dir (str) – Path to checkpoints.
rank (int) – The rank of the current process
world_size (int) – The total number of workers
checkpointer (Checkpointer) – Used to load checkpoints.
model (torch.nn.Module) – The model to evaluate.
epochs (int) – Number of epochs to train.
loss_function (torch.nn.modules.loss._Loss) – loss function.
metrics (list of mlbench_core.evaluation.pytorch.*) – metrics like TopKAccuracy.
use_cuda (bool) – Whether to train on GPU or not. Default: False
dtype (str) – The datatype to use for the dataloader data
max_batch_per_epoch (int) – Maximum number of batches per epoch. The whole dataset is used if not specified. Default: None

evaluate_by_epochs(self, dataloader)

Evaluate dataset using the averaged models.

In each epoch each process loads models and averages them. The averaged model is used to evaluate train / validation dataset.

Parameters: dataloader (torch.utils.data.DataLoader) – The dataset to be evaluated.
Returns: list of stats of models in each epoch.
Return type: list
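The model-averaging step described above can be sketched as an element-wise mean over parameters. This is a plain-Python illustration, not the class's actual code: `average_models` is a hypothetical helper, and flat lists of floats stand in for parameter tensors loaded from each worker's checkpoint.

```python
def average_models(state_dicts):
    """Element-wise average of parameter dictionaries with identical keys.

    Each state dict maps a parameter name to a flat list of floats; this is
    a plain-Python stand-in for averaging real model parameters.
    """
    if not state_dicts:
        raise ValueError("need at least one model to average")
    n = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        # Group the i-th entry of this parameter across all models.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Two hypothetical workers' checkpoints for the same epoch.
worker_0 = {"weight": [1.0, 2.0], "bias": [0.0]}
worker_1 = {"weight": [3.0, 4.0], "bias": [1.0]}
averaged = average_models([worker_0, worker_1])
```

The averaged parameters would then be loaded into a single model that is evaluated on the given dataloader.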

Helpers

mlbench_core.controlflow.pytorch.helpers.maybe_range(maximum)

Map an integer or None to an integer iterator starting from 0 with stride 1.

If the maximum number of batches per epoch is limited, return a finite iterator. Otherwise, return an iterator of infinite length.

Parameters: maximum (int | None) – Maximum number of steps in the iterator. If None, an iterator of infinite length is returned
Returns: (iterator)
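A minimal sketch of the described behaviour, using only the standard library. The name `maybe_range_sketch` is hypothetical and the body is an assumption based on the description above, not the library's source.

```python
from itertools import count, islice


def maybe_range_sketch(maximum):
    """Return a bounded iterator if a limit is given, else an endless counter."""
    if maximum is None:
        return count(0)  # 0, 1, 2, ... without end
    return iter(range(maximum))


limited = list(maybe_range_sketch(3))
# Take only the first five values of the unbounded iterator.
unlimited_head = list(islice(maybe_range_sketch(None), 5))
```

Zipping such an iterator with a dataloader stops at whichever is exhausted first, which is one way to cap the number of batches per epoch.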

mlbench_core.controlflow.pytorch.helpers.convert_dtype(dtype, obj)

Converts given tensor to given dtype

Parameters

dtype (str) – One of fp32 or fp64
obj (torch.Tensor | torch.nn.Module) – Module or tensor to convert

Returns

Converted tensor or module

Return type

(torch.Tensor | torch.nn.Module)

mlbench_core.controlflow.pytorch.helpers.prepare_batch(data, target, dtype, transform_target_dtype=False, use_cuda=False)

Prepares a batch for training by changing the type and sending to cuda if necessary

Parameters

data (torch.Tensor) – The input tensor
target (torch.Tensor) – The target tensor
dtype (str) – One of fp32 or fp64, data type to transform input and/or target
transform_target_dtype (bool) – Transform target to dtype too
use_cuda (bool) – Send tensors to GPU

Returns

Input and target tensors

Return type

(torch.Tensor, torch.Tensor)

mlbench_core.controlflow.pytorch.helpers.iterate_dataloader(dataloader, dtype, max_batch_per_epoch=None, use_cuda=False, transform_target_type=False)

Returns an iterator over the given loader. Can be used to limit the number of batches, convert input and target dtypes, and send tensors to the GPU.

Parameters

dataloader (torch.utils.data.DataLoader) – The loader
dtype (str) – Type to convert to (fp32 or fp64)
max_batch_per_epoch (int | None) – Maximum number of batches
use_cuda (bool) – Send tensors to GPU
transform_target_type (bool) – Transform target dtype as well

Returns

An iterator over the data

Return type

(iterator)
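The combination of batch limiting and per-batch conversion can be sketched with a generator. This is an illustration under assumptions, not the library's code: `iterate_batches` and its `convert` argument are hypothetical, `convert` stands in for the dtype and device conversion, and any iterable of (data, target) pairs plays the role of the dataloader.

```python
from itertools import count


def iterate_batches(loader, convert, max_batch_per_epoch=None, transform_target=False):
    """Yield converted (data, target) pairs, at most `max_batch_per_epoch` of them.

    `loader` is any iterable of (data, target) pairs and `convert` is a
    hypothetical stand-in for the dtype/device conversion step.
    """
    steps = count(0) if max_batch_per_epoch is None else range(max_batch_per_epoch)
    # zip stops as soon as either the step counter or the loader is exhausted.
    for _, (data, target) in zip(steps, loader):
        data = convert(data)
        if transform_target:
            target = convert(target)
        yield data, target


loader = [(1, 10), (2, 20), (3, 30)]
first_two = list(iterate_batches(loader, float, max_batch_per_epoch=2))
all_batches = list(iterate_batches(loader, float, transform_target=True))
```

Because the function is a generator, conversion happens lazily, one batch at a time, as the training or validation loop consumes it.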

tensorflow

TrainValidation

class mlbench_core.controlflow.tensorflow.TrainValidation(train_op, sess, loss, metrics, max_train_steps, train_epochs, batch_size, num_batches_per_epoch_for_train, num_batches_per_epoch_for_validation, train_set_init_op, validation_set_init_op, run_id, rank, lr_scheduler_level='epoch', tracker=None)

A control flow to train and evaluate a model.

Parameters

train_op (tf.Operation) – An operation for training models.
sess (tf.Session) – A session with which the control flow will communicate.
loss (tf.Tensor) – The loss tensor.
metrics (list of tf.Tensor) – A list of metrics tensors.
max_train_steps (int) – Number of steps for training (independent of lr)
train_epochs (int) – Number of epochs for training (may be related to lr).
batch_size (int) – Size of a batch.
num_batches_per_epoch_for_train (int) – Number of batches in one training epoch
num_batches_per_epoch_for_validation (int) – Number of batches in one validation epoch
train_set_init_op (tf.Operation) – Op for initializing training dataset.
validation_set_init_op (tf.Operation) – Op for initializing validation dataset.
run_id (str) – the id of the run in the dashboard
rank (int) – the rank of the current worker
lr_scheduler_level (str) – Learning rate is updated based on epoch or batch.

train_and_eval(self, initial_epoch=0, lr_tensor_name=None)

Train and evaluate one epoch.

Parameters

initial_epoch (int, optional) – Defaults to 0. Initial epoch of training.
lr_tensor_name (tf.Tensor, optional) – Defaults to None. A (scalar) float tensor representing name of learning rate

train_one_epoch(self, lr_tensor_name=None)

Train a model for an epoch and use tracker to log stats.

Parameters: lr_tensor_name (obj) – The learning rate schedule tensorflow operation

valid_one_epoch(self): Validate a model for an epoch and use tracker to log stats.
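How the three methods might fit together can be sketched without TensorFlow. The schedule below is an assumption, not taken from the library: it supposes that `train_and_eval` runs one training pass followed by one validation pass for each epoch from `initial_epoch` up to `train_epochs`. `ToyTrainValidation` is a hypothetical stand-in that only records the order of the phases.

```python
class ToyTrainValidation:
    """Hypothetical stand-in that only records the order of the phases."""

    def __init__(self, train_epochs):
        self.train_epochs = train_epochs
        self.log = []

    def train_one_epoch(self, epoch):
        # A real control flow would run the training op for every batch here.
        self.log.append(("train", epoch))

    def valid_one_epoch(self, epoch):
        # A real control flow would evaluate loss and metrics here.
        self.log.append(("valid", epoch))

    def train_and_eval(self, initial_epoch=0):
        # Assumed schedule: one training pass, then one validation pass, per epoch.
        for epoch in range(initial_epoch, self.train_epochs):
            self.train_one_epoch(epoch)
            self.valid_one_epoch(epoch)


flow = ToyTrainValidation(train_epochs=3)
flow.train_and_eval(initial_epoch=1)
```

Starting from `initial_epoch=1` skips epoch 0, which is the kind of behaviour one would want when resuming from a checkpoint.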