mlbench_core.controlflow.pytorch.validation_round(dataloader, model, loss_function, metrics, dtype, tracker=None, transform_target_type=False, use_cuda=False, max_batches=None, init_hidden=None, package_hidden=None, transform_parameters=None)

Evaluate the model on the test dataset.

  • dataloader (torch.utils.data.DataLoader) – The validation set

  • model (torch.nn.Module) – The model to evaluate

  • loss_function (torch.nn.Module) – The loss function

  • metrics (list) – List of metrics to track

  • dtype (str) – The datatype to use, one of fp32 or fp64

  • tracker (mlbench_core.utils.Tracker | None) – Tracker object to use.

  • transform_target_type (bool) – Convert target to dtype. Default False

  • use_cuda (bool) – Whether to use GPU for training, default: False

  • max_batches (int | None) – Maximum number of batches to validate on

  • init_hidden (func) – Function to initialize hidden state (for RNNs), default: None

  • package_hidden (func) – Function to (re-)package hidden state (for RNNs), default: None


Returns

Dictionary with the average of each metric, and the average validation loss

Return type

(dict, float)
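The averaging described above can be illustrated with a small, framework-agnostic sketch. It is not the library's implementation: `average_validation` and its arguments are hypothetical, `metrics` is assumed here to be a dict mapping a name to a callable (the real function takes a list of metric objects), and every batch is weighted equally.

```python
from itertools import islice


def average_validation(batches, model, loss_function, metrics, max_batches=None):
    """Average the loss and each metric over at most `max_batches` batches.

    Hypothetical helper: `metrics` maps a name to a callable taking
    (output, target) and returning a float.
    """
    loss_total = 0.0
    metric_totals = {name: 0.0 for name in metrics}
    count = 0
    # islice with a stop of None consumes the whole iterable
    for data, target in islice(batches, max_batches):
        output = model(data)
        loss_total += loss_function(output, target)
        for name, metric in metrics.items():
            metric_totals[name] += metric(output, target)
        count += 1
    if count == 0:
        raise ValueError("no batches to validate on")
    averages = {name: total / count for name, total in metric_totals.items()}
    return averages, loss_total / count
```

The return value mirrors the documented (dict, float) pair: the per-metric averages first, then the average loss.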

mlbench_core.controlflow.pytorch.record_train_batch_stats(batch_idx, loss, output, metric_results, tracker, num_batches_per_device_train)

Record the stats in a training batch.

  • batch_idx (int) – The id of the current batch

  • loss (float) – The loss of the batch

  • output (torch.Tensor) – The model output

  • metric_results (dict of mlbench_core.evaluation.pytorch.metrics.MLBenchMetric: float) – Metrics and their values

  • tracker (mlbench_core.utils.Tracker) – Tracker object to use.

  • num_batches_per_device_train (int) – Number of batches per train epoch

mlbench_core.controlflow.pytorch.record_validation_stats(metrics_values, loss, tracker=None, rank=0)

Records the stats of a previously run validation

  • metrics_values (dict) – Dictionary of each metric’s average.

  • loss (float) – Validation loss

  • tracker (mlbench_core.utils.Tracker, optional) – Tracker object to use.

  • rank (int) – Current distributed rank


Returns

Whether this validation round is the best

Return type

bool
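How a caller might decide whether a validation round "is the best" can be sketched as follows. This is only an illustration under stated assumptions: `BestValidationTracker` is a hypothetical class, not part of mlbench_core, and the choice of which metric to compare is left to the caller.

```python
class BestValidationTracker:
    """Hypothetical helper: remembers the best value seen for one metric."""

    def __init__(self, higher_is_better=True):
        self.higher_is_better = higher_is_better
        self.best = None

    def update(self, value):
        """Record a new value and return True if it is the best so far."""
        if self.best is None:
            is_best = True
        elif self.higher_is_better:
            is_best = value > self.best
        else:
            is_best = value < self.best
        if is_best:
            self.best = value
        return is_best
```

A flag like this is typically used to decide whether to keep the current checkpoint.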



class mlbench_core.controlflow.pytorch.CheckpointsEvaluationControlFlow(ckpt_dir, rank, world_size, checkpointer, model, epochs, loss_function, metrics, use_cuda=False, dtype=None, max_batch_per_epoch=None)

Evaluate models on training / validation dataset.

  • ckpt_dir (str) – Path to checkpoints.

  • rank (int) – The rank of the current process

  • world_size (int) – The total number of workers

  • checkpointer (Checkpointer) – Used to load checkpoints.

  • model (torch.nn.Module) – The model to evaluate.

  • epochs (int) – Number of epochs to train.

  • loss_function (torch.nn.modules.loss._Loss) – loss function.

  • metrics (list of mlbench_core.evaluation.pytorch.*) – metrics like TopKAccuracy.

  • use_cuda (bool) – Whether to train on GPU or not. Default: False

  • dtype (str) – The datatype to use for the dataloader data

  • max_batch_per_epoch (int) – Maximum number of batches per epoch. Whole dataset is used if not specified. Default: None

evaluate_by_epochs(self, dataloader)

Evaluate dataset using the averaged models.

In each epoch each process loads models and averages them. The averaged model is used to evaluate train / validation dataset.


dataloader (torch.utils.data.DataLoader) – The dataset to be evaluated.


Returns

List of stats of models in each epoch.

Return type

list
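The model averaging mentioned above can be pictured as an element-wise mean over the parameters of the loaded checkpoints. The sketch below is an assumption about the general idea, not the library's code: `average_state_dicts` is a hypothetical helper, and plain lists of floats stand in for parameter tensors.

```python
def average_state_dicts(state_dicts):
    """Element-wise mean of several parameter dictionaries.

    Hypothetical helper: each dict maps a parameter name to a flat list
    of floats, standing in for the tensors of a real checkpoint.
    """
    if not state_dicts:
        raise ValueError("need at least one model to average")
    n = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter across all models
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged
```

The averaged parameters would then be loaded into a single model that is evaluated once per epoch.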




Map an integer or None to an integer iterator starting from 0 with stride 1.

If the maximum number of batches per epoch is limited, return a finite iterator. Otherwise, return an iterator of infinite length.


maximum (int | None) – Maximum number of steps in the iterator. If None, an iterator of infinite length is returned
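A minimal sketch of this behaviour, using only the standard library, is shown below. The function name `bounded_or_infinite_range` is made up for illustration and is not the library's identifier.

```python
from itertools import count


def bounded_or_infinite_range(maximum=None):
    """Return 0, 1, 2, ... capped at `maximum` steps, or unbounded if None."""
    if maximum is None:
        return count()  # infinite iterator starting at 0 with stride 1
    return iter(range(maximum))


# Zipping with the batches stops at whichever iterator ends first,
# so the same loop works with and without a limit.
batches = ["a", "b", "c"]
limited = [b for _, b in zip(bounded_or_infinite_range(2), batches)]
unlimited = [b for _, b in zip(bounded_or_infinite_range(None), batches)]
```

Here `limited` holds the first two batches and `unlimited` holds all three.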



mlbench_core.controlflow.pytorch.helpers.convert_dtype(dtype, obj)

Converts given tensor to given dtype

  • dtype (str) – One of fp32 or fp64

  • obj (torch.Tensor | torch.nn.Module) – Module or tensor to convert


Returns

Converted tensor or module

Return type

torch.Tensor | torch.nn.Module


mlbench_core.controlflow.pytorch.helpers.prepare_batch(data, target, dtype, transform_target_dtype=False, use_cuda=False)

Prepares a batch for training by changing the type and sending to cuda if necessary

  • data (torch.Tensor) – The input tensor

  • target (torch.Tensor) – The target tensor

  • dtype (str) – One of fp32 or fp64, data type to transform input and/or target

  • transform_target_dtype (bool) – Transform target to dtype too

  • use_cuda (bool) – Send tensors to GPU


Returns

Input and target tensors

Return type

(torch.Tensor, torch.Tensor)


mlbench_core.controlflow.pytorch.helpers.iterate_dataloader(dataloader, dtype, max_batch_per_epoch=None, use_cuda=False, transform_target_type=False)

Returns an iterator over the given loader. Can be used to limit the number of batches, convert input and target dtypes, and send tensors to the GPU.

  • dataloader (torch.utils.data.DataLoader) – The loader

  • dtype (str) – Type to convert to (fp32 or fp64)

  • max_batch_per_epoch (int | None) – Maximum number of batches

  • use_cuda (bool) – Send tensors to GPU

  • transform_target_type (bool) – Transform target dtype as well


Returns

An iterator over the data

Return type

iterator
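The pattern of capping the number of batches and transforming each one on the fly can be sketched with a generator. This is an illustration only: `iterate_batches` and its `transform` argument are hypothetical, and the transform stands in for the dtype conversion and device transfer that the real helper performs.

```python
from itertools import islice


def iterate_batches(loader, max_batches=None, transform=None):
    """Yield (data, target) pairs, optionally capped and transformed.

    Hypothetical helper: `transform` takes (data, target) and returns
    the converted pair.
    """
    # islice with a stop of None iterates over the whole loader
    for data, target in islice(loader, max_batches):
        if transform is not None:
            data, target = transform(data, target)
        yield data, target
```

Because it is a generator, batches are converted lazily, one at a time, as the caller consumes them.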




class mlbench_core.controlflow.tensorflow.TrainValidation(train_op, sess, loss, metrics, max_train_steps, train_epochs, batch_size, num_batches_per_epoch_for_train, num_batches_per_epoch_for_validation, train_set_init_op, validation_set_init_op, run_id, rank, lr_scheduler_level='epoch', tracker=None)

A control flow to train and evaluate a model.

  • train_op (tf.Operation) – An operation for training models.

  • sess (tf.Session) – A session with which the control flow will communicate.

  • loss (tf.Tensor) – The loss tensor.

  • metrics (list of tf.Tensor) – A list of metrics tensors.

  • max_train_steps (int) – Number of steps for training (independent of lr)

  • train_epochs (int) – Number of epochs for training (may be related to lr).

  • batch_size (int) – Size of a batch.

  • num_batches_per_epoch_for_train (int) – Number of batches in one training epoch

  • num_batches_per_epoch_for_validation (int) – Number of batches in one validation epoch

  • train_set_init_op (tf.Operation) – Op for initializing training dataset.

  • validation_set_init_op (tf.Operation) – Op for initializing validation dataset.

  • run_id (str) – the id of the run in the dashboard

  • rank (int) – the rank of the current worker

  • lr_scheduler_level (str) – Learning rate is updated based on epoch or batch.
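The difference between the two lr_scheduler_level settings can be illustrated by listing the points at which the learning rate would change. The sketch below rests on an assumption: that 'epoch' means one update at the start of each epoch and 'batch' means one update per training batch. `lr_update_points` is a hypothetical function, not part of the class.

```python
def lr_update_points(train_epochs, num_batches_per_epoch, lr_scheduler_level="epoch"):
    """List the (epoch, batch) positions where the learning rate is updated."""
    if lr_scheduler_level not in ("epoch", "batch"):
        raise ValueError("lr_scheduler_level must be 'epoch' or 'batch'")
    points = []
    for epoch in range(train_epochs):
        for batch in range(num_batches_per_epoch):
            # 'batch': update every step; 'epoch': only on the first batch
            if lr_scheduler_level == "batch" or batch == 0:
                points.append((epoch, batch))
    return points
```

With 2 epochs of 3 batches, the 'epoch' level yields 2 update points and the 'batch' level yields 6.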

train_and_eval(self, initial_epoch=0, lr_tensor_name=None)

Train and evaluate one epoch.

  • initial_epoch (int, optional) – Defaults to 0. Initial epoch of training.

  • lr_tensor_name (tf.Tensor, optional) – Defaults to None. A (scalar) float tensor representing the name of the learning rate

train_one_epoch(self, lr_tensor_name=None)

Train a model for an epoch and use tracker to log stats.


lr_tensor_name (obj) – The learning rate schedule TensorFlow operation


Validate a model for an epoch and use tracker to log stats.