mlbench_core.controlflow
pytorch
Controlflow
- mlbench_core.controlflow.pytorch.validation_round(dataloader, model, loss_function, metrics, dtype, tracker=None, transform_target_type=False, use_cuda=False, max_batches=None)
Evaluate the model on the test dataset.
- Parameters
dataloader (torch.utils.data.DataLoader) – The validation set
model (torch.nn.Module) – The model to evaluate
loss_function (torch.nn.Module) – The loss function
metrics (list) – List of metrics to track
dtype (str) – The datatype to use, one of fp32 or fp64
tracker (mlbench_core.utils.Tracker | None) – Tracker object to use
transform_target_type (bool) – Convert target to dtype. Default: False
use_cuda (bool) – Whether to use the GPU. Default: False
max_batches (int | None) – Maximum number of batches to validate on
- Returns
Dictionary of average of each metric, and average validation loss
- Return type
(dict, float)
- mlbench_core.controlflow.pytorch.record_train_batch_stats(batch_idx, loss, output, metric_results, tracker, num_batches_per_device_train)
Record the stats in a training batch.
- Parameters
batch_idx (int) – The id of the current batch
loss (float) – The loss of the batch
output (torch.Tensor) – The model output
metric_results (dict of mlbench_core.evaluation.pytorch.metrics.MLBenchMetric: float) – Metrics and their values
tracker (mlbench_core.utils.Tracker) – Tracker object to use.
num_batches_per_device_train (int) – Number of batches per train epoch
- mlbench_core.controlflow.pytorch.record_validation_stats(metrics_values, loss, tracker=None, rank=0)
Records the stats of a previously run validation
- Parameters
metrics_values (dict) – Dictionary of each metric’s average.
loss (float) – Validation loss
tracker (mlbench_core.utils.Tracker, optional) – Tracker object to use.
rank (int) – Current distributed rank
- Returns
Whether this validation round is the best
- Return type
(bool)
CheckpointsEvaluationControlFlow
- class mlbench_core.controlflow.pytorch.CheckpointsEvaluationControlFlow(ckpt_dir, rank, world_size, checkpointer, model, epochs, loss_function, metrics, use_cuda=False, dtype=None, max_batch_per_epoch=None)
Evaluate models on training / validation dataset.
- Parameters
ckpt_dir (str) – Path to checkpoints.
rank (int) – The rank of the current process
world_size (int) – The total number of workers
checkpointer (Checkpointer) – Used to load checkpoints.
model (torch.optim.Optimizer) – An optimizer for the given model.
epochs (int) – Number of epochs to train.
loss_function (torch.nn.modules.loss._Loss) – Loss function.
metrics (list of mlbench_core.evaluation.pytorch.*) – Metrics like TopKAccuracy.
use_cuda (bool) – Whether to train on GPU or not. Default: False
dtype (str) – The datatype to use for the dataloader data
max_batch_per_epoch (int) – Maximum number of batches per epoch. The whole dataset is used if not specified. Default: None
- evaluate_by_epochs(self, dataloader)
Evaluate dataset using the averaged models.
In each epoch, each process loads the models and averages them. The averaged model is then used to evaluate the train / validation dataset.
- Parameters
dataloader (torch.utils.data.DataLoader) – The dataset to be evaluated.
- Returns
List of stats of models in each epoch.
- Return type
list
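The averaging step described for evaluate_by_epochs can be pictured with a minimal sketch. It assumes each worker's checkpoint has already been loaded into a plain dictionary mapping parameter names to lists of floats. `average_models` is a hypothetical helper written for illustration, not part of mlbench_core, and the library's own averaging may work differently.

```python
def average_models(worker_params):
    """Element-wise mean of per-worker parameter dictionaries (hypothetical helper)."""
    if not worker_params:
        raise ValueError("need at least one set of parameters")
    n = len(worker_params)
    averaged = {}
    for name in worker_params[0]:
        # zip(*...) groups the i-th entry of this parameter across all workers.
        columns = zip(*(params[name] for params in worker_params))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


# Two workers with tiny stand-ins for loaded checkpoints (assumed data).
worker_a = {"weight": [1.0, 2.0], "bias": [0.0]}
worker_b = {"weight": [3.0, 4.0], "bias": [1.0]}
averaged = average_models([worker_a, worker_b])
```

With real checkpoints the same idea would apply tensor by tensor rather than to Python lists.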
Helpers
- mlbench_core.controlflow.pytorch.helpers.maybe_range(maximum)
Map an integer or None to an integer iterator starting from 0 with stride 1.
If the maximum number of batches per epoch is limited, return a finite iterator. Otherwise, return an iterator of infinite length.
- Parameters
maximum (int | None) – Maximum number of steps in the iterator. If None, returns an iterator of infinite length
- Returns
(iterator)
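The documented behaviour of maybe_range can be approximated in a few lines. This is a sketch of the description above, not the library's source; `maybe_range_sketch` is a hypothetical name chosen to avoid implying it matches the real implementation.

```python
import itertools


def maybe_range_sketch(maximum):
    """Return 0, 1, 2, ... up to maximum - 1, or count forever if maximum is None."""
    if maximum is None:
        # Unbounded iterator starting at 0 with stride 1.
        return itertools.count(0)
    return iter(range(maximum))


limited = list(maybe_range_sketch(3))
# An infinite iterator must be sliced before it is materialised.
first_unbounded = list(itertools.islice(maybe_range_sketch(None), 4))
```

Returning an iterator in both cases lets the caller zip it with a data loader without checking whether a limit was set.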
- mlbench_core.controlflow.pytorch.helpers.convert_dtype(dtype, obj)
Converts the given tensor or module to the given dtype
- Parameters
dtype (str) – One of fp32 or fp64
obj (torch.Tensor | torch.nn.Module) – Module or tensor to convert
- Returns
Converted tensor or module
- Return type
torch.Tensor | torch.nn.Module
- mlbench_core.controlflow.pytorch.helpers.prepare_batch(data, target, dtype, transform_target_dtype=False, use_cuda=False)
Prepares a batch for training by changing the type and sending to cuda if necessary
- Parameters
data (torch.Tensor) – The input tensor
target (torch.Tensor) – The target tensor
dtype (str) – One of fp32 or fp64, data type to transform input and/or target
transform_target_dtype (bool) – Transform target to dtype too
use_cuda (bool) – Send tensors to GPU
- Returns
Input and target tensors
- Return type
(torch.Tensor, torch.Tensor)
- mlbench_core.controlflow.pytorch.helpers.iterate_dataloader(dataloader, dtype, max_batch_per_epoch=None, use_cuda=False, transform_target_type=False)
Returns an iterator over the given loader. It can be used to limit the number of batches, convert input and target dtypes, and send tensors to the GPU
- Parameters
dataloader (torch.utils.data.DataLoader) – The loader
dtype (str) – Type to convert to (fp32 or fp64)
max_batch_per_epoch (int | None) – Maximum number of batches
use_cuda (bool) – Send tensors to GPU
transform_target_type (bool) – Transform target dtype as well
- Returns
An iterator over the data
- Return type
(iterator)
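The batch-limiting part of iterate_dataloader can be illustrated without PyTorch. The sketch below assumes the loader is any iterable of (data, target) pairs; `limit_batches` and `fake_loader` are hypothetical names, and the dtype conversion and GPU transfer described above are deliberately left out.

```python
import itertools


def limit_batches(loader, max_batch_per_epoch=None):
    """Yield (data, target) pairs, stopping after max_batch_per_epoch batches if set."""
    # islice with a stop of None iterates until the loader is exhausted.
    for data, target in itertools.islice(loader, max_batch_per_epoch):
        # Per the description above, dtype conversion and GPU transfer
        # would happen at this point; both are omitted in this sketch.
        yield data, target


# A plain list stands in for a torch.utils.data.DataLoader (assumed data).
fake_loader = [([0.1, 0.2], 0), ([0.3, 0.4], 1), ([0.5, 0.6], 0)]
first_two = list(limit_batches(fake_loader, max_batch_per_epoch=2))
everything = list(limit_batches(fake_loader))
```

Writing the helper as a generator keeps batches lazy, so nothing beyond the requested number is read from the loader.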
tensorflow
TrainValidation
- class mlbench_core.controlflow.tensorflow.TrainValidation(train_op, sess, loss, metrics, max_train_steps, train_epochs, batch_size, num_batches_per_epoch_for_train, num_batches_per_epoch_for_validation, train_set_init_op, validation_set_init_op, run_id, rank, lr_scheduler_level='epoch', tracker=None)
A control flow to train and evaluate a model.
- Parameters
train_op (tf.Operation) – An operation for training models.
sess (tf.Session) – A session with which the control flow will communicate.
loss (tf.Tensor) – The loss tensor.
metrics (list of tf.Tensor) – A list of metrics tensors.
max_train_steps (int) – Number of steps for training (independent of lr)
train_epochs (int) – Number of steps for training (may be related to lr).
batch_size (int) – Size of a batch.
num_batches_per_epoch_for_train (int) – Number of batches in one training epoch
num_batches_per_epoch_for_validation (int) – Number of batches in one validation epoch
train_set_init_op (tf.Operation) – Op for initializing training dataset.
validation_set_init_op (tf.Operation) – Op for initializing validation dataset.
run_id (str) – The id of the run in the dashboard
rank (int) – The rank of the current worker
lr_scheduler_level (str) – Whether the learning rate is updated per epoch or per batch. Default: 'epoch'
- train_and_eval(self, initial_epoch=0, lr_tensor_name=None)
Train and evaluate one epoch.
- Parameters
initial_epoch (int, optional) – Defaults to 0. Initial epoch of training.
lr_tensor_name (tf.Tensor, optional) – Defaults to None. A (scalar) float tensor representing the name of the learning rate
- train_one_epoch(self, lr_tensor_name=None)
Train a model for an epoch and use tracker to log stats.
- Parameters
lr_tensor_name (obj) – The learning rate schedule TensorFlow operation
- valid_one_epoch(self)
Validate a model for an epoch and use tracker to log stats.