mlbench_core.optim¶
pytorch¶
Optimizers¶
The optimizers in this module are not distributed. Their purpose is to implement logic that can be inherited by distributed optimizers.
SparsifiedSGD¶
class mlbench_core.optim.pytorch.optim.SparsifiedSGD(params, lr=required, weight_decay=0, sparse_grad_size=10)
Implements a sparsified version of stochastic gradient descent.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – learning rate
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
sparse_grad_size (int) – Size of the sparsified gradients vector (default: 10).
sparsify_gradients(self, param, lr)
Calls one of the sparsification functions (random or blockwise).
- Parameters
param (torch.nn.Parameter) – Model parameter
lr (float) – learning rate
random_sparse (bool) – Indicates how the gradients are made sparse, randomly or blockwise (default: False)
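The class is intended mainly as a base for the distributed variants below, but the following sketch (not taken from the library docs) shows how it could be driven; it assumes SparsifiedSGD follows the usual torch.optim.Optimizer interface (zero_grad/step), and the model and data are placeholders.

# Illustrative sketch only: assumes the standard torch.optim interface.
import torch
import torch.nn as nn
from mlbench_core.optim.pytorch.optim import SparsifiedSGD

model = nn.Linear(10, 1)
optimizer = SparsifiedSGD(
    model.parameters(),
    lr=0.1,
    weight_decay=0,
    sparse_grad_size=10,  # size of the sparsified gradient vector
)

criterion = nn.MSELoss()
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()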
SignSGD¶
class mlbench_core.optim.pytorch.optim.SignSGD
Implements sign stochastic gradient descent (optionally with momentum): the update direction uses only the sign of each gradient component.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
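For intuition, the textbook sign-SGD rule (shown here on a plain tensor, independent of this class's exact implementation) replaces each gradient entry with its sign before scaling by the learning rate:

# Textbook sign-SGD update rule, for illustration only; not the library's code.
import torch

grad = torch.tensor([0.3, -1.2, 0.0])
lr = 0.01
update = -lr * torch.sign(grad)  # each entry moves by lr against the sign of its gradient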
Centralized (Synchronous) Optimizers¶
The optimizers in this module are all distributed and synchronous: workers advance in a synchronous manner. All workers communicate with each other using all_reduce or all_gather operations.
Generic Centralized Optimizer¶
class mlbench_core.optim.pytorch.centralized.GenericCentralizedOptimizer(world_size, model, use_cuda=False, by_layer=False, divide_before=False, agg_grad=True)
Implements a generic centralized (synchronous) optimizer with AllReduceAggregation. Averages the reduced parameters over the world size after aggregation. Can aggregate gradients or weights, by layer or all at once.
- Parameters
world_size (int) – Distributed world size (number of workers)
model (nn.Module) – Model which contains parameters for SGD
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
divide_before (bool) – Divide gradients before reduction (default: False)
agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)
CentralizedSGD¶
class mlbench_core.optim.pytorch.centralized.CentralizedSGD(world_size=None, model=None, lr=required, momentum=0, dampening=0, weight_decay=0, nesterov=False, use_cuda=False, by_layer=False, agg_grad=True)
Bases: mlbench_core.optim.pytorch.centralized.GenericCentralizedOptimizer
Implements centralized stochastic gradient descent (optionally with momentum). Averages the reduced parameters over the world size.
- Parameters
world_size (int) – Distributed world size (number of workers)
model (nn.Module) – Model which contains parameters for SGD
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)
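A hedged usage sketch (not from the library docs): it assumes the distributed backend (e.g. torch.distributed) has already been initialized on every worker and that the optimizer exposes the usual zero_grad()/step() interface; the model and data are placeholders.

# Hedged sketch: torch.distributed is assumed initialized (e.g. via init_process_group).
import torch
import torch.distributed as dist
import torch.nn as nn
from mlbench_core.optim.pytorch.centralized import CentralizedSGD

model = nn.Linear(10, 1)
optimizer = CentralizedSGD(
    world_size=dist.get_world_size(),
    model=model,
    lr=0.1,
    momentum=0.9,
    agg_grad=True,  # all-reduce gradients, then update locally
)

criterion = nn.MSELoss()
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()  # gradients are aggregated and averaged over world_size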
CentralizedAdam¶
class mlbench_core.optim.pytorch.centralized.CentralizedAdam(world_size=None, model=None, lr=required, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, use_cuda=False, by_layer=False, agg_grad=True)
Bases: mlbench_core.optim.pytorch.centralized.GenericCentralizedOptimizer
Implements the centralized Adam algorithm. Averages the reduced parameters over the world size.
- Parameters
world_size (int) – Distributed world size (number of workers)
model (nn.Module) – Model which contains parameters for Adam
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper [RKK18] (default: False)
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
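Construction mirrors CentralizedSGD; the sketch below is illustrative, assumes an initialized distributed backend, and uses placeholder values.

# Hedged sketch: only the constructor differs from the CentralizedSGD example above.
import torch.distributed as dist
import torch.nn as nn
from mlbench_core.optim.pytorch.centralized import CentralizedAdam

model = nn.Linear(10, 1)
optimizer = CentralizedAdam(
    world_size=dist.get_world_size(),
    model=model,
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    amsgrad=False,
)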
CustomCentralizedOptimizer¶
class mlbench_core.optim.pytorch.centralized.CustomCentralizedOptimizer(model, world_size, optimizer, use_cuda=False, by_layer=False, agg_grad=True, grad_clip=float('inf'), average_world=False, average_custom=False, divide_before=False)
Bases: mlbench_core.optim.pytorch.centralized.GenericCentralizedOptimizer
Custom centralized optimizer. Can be used with any optimizer passed as an argument. Adds a gradient clipping option, as well as custom averaging.
- Parameters
model (torch.nn.Module) – Model
world_size (int) – Distributed world size
optimizer (torch.optim.Optimizer) – The optimizer whose updates are aggregated
use_cuda (bool) – Use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer
agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)
grad_clip (float) – coefficient for gradient clipping, max L2 norm of the gradients
average_world (bool) – Average the gradients by world size
average_custom (bool) – Divide gradients by given denominator at each step, instead of world_size
divide_before (bool) – Divide gradients before reduction (default: False)
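A hedged sketch of wrapping an arbitrary torch.optim optimizer; the distributed backend is assumed to be initialized, and the choice of Adam and the clipping value are purely illustrative.

# Hedged sketch: wraps a plain torch.optim optimizer so its gradients are
# aggregated across workers; assumes an initialized distributed backend.
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from mlbench_core.optim.pytorch.centralized import CustomCentralizedOptimizer

model = nn.Linear(10, 1)
base_optimizer = optim.Adam(model.parameters(), lr=1e-3)

optimizer = CustomCentralizedOptimizer(
    model=model,
    world_size=dist.get_world_size(),
    optimizer=base_optimizer,
    grad_clip=5.0,        # clip gradients to a max L2 norm of 5.0
    average_world=True,   # divide the reduced gradients by world_size
)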
CentralizedSparsifiedSGD¶
class mlbench_core.optim.pytorch.centralized.CentralizedSparsifiedSGD(params=None, lr=required, weight_decay=0, sparse_grad_size=10, random_sparse=False, average_world=True)
Implements a centralized sparsified version of stochastic gradient descent.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – Learning rate
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
sparse_grad_size (int) – Size of the sparsified gradients vector (default: 10)
random_sparse (bool) – Whether to select random sparsification (default: False)
average_world (bool) – Whether to average models over the world size (default: True)
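An illustrative construction (not from the library docs); the model and all values are placeholders.

# Hedged sketch of constructing CentralizedSparsifiedSGD; values are placeholders.
import torch.nn as nn
from mlbench_core.optim.pytorch.centralized import CentralizedSparsifiedSGD

model = nn.Linear(10, 1)
optimizer = CentralizedSparsifiedSGD(
    params=model.parameters(),
    lr=0.1,
    sparse_grad_size=10,   # size of the sparsified gradient vector
    random_sparse=False,   # blockwise sparsification
    average_world=True,
)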
PowerSGD¶
class mlbench_core.optim.pytorch.centralized.PowerSGD(model=None, lr=required, momentum=0, weight_decay=0, dampening=0, nesterov=False, average_world=True, use_cuda=False, by_layer=False, reuse_query=False, rank=1)
Implements PowerSGD with error feedback (optionally with momentum).
- Parameters
model (nn.Module) – Model which contains parameters for SGD
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
average_world (bool) – Whether to average models over the world size (default: True)
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
reuse_query (bool) – Whether to use warm start to initialize the power iteration
rank (int) – The rank of the gradient approximation
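An illustrative construction; rank controls the size of the low-rank factors that are communicated instead of the full gradient, and all values here are placeholders.

# Hedged sketch of constructing PowerSGD; values are placeholders.
import torch.nn as nn
from mlbench_core.optim.pytorch.centralized import PowerSGD

model = nn.Linear(10, 1)
optimizer = PowerSGD(
    model=model,
    lr=0.1,
    momentum=0.9,
    rank=2,            # rank of the low-rank gradient approximation
    reuse_query=True,  # warm-start the power iteration between steps
)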
Decentralized (Asynchronous) Optimizers¶
The optimizers in this module are all distributed and asynchronous: workers advance independently of one another, and communication patterns follow an arbitrary graph.
DecentralizedSGD¶
class mlbench_core.optim.pytorch.decentralized.DecentralizedSGD(rank=None, neighbors=None, model=None, lr=required, momentum=0, dampening=0, weight_decay=0, nesterov=False, average_world=True, use_cuda=False, by_layer=False)
Implements decentralized stochastic gradient descent (optionally with momentum).
- Parameters
rank (int) – rank of current process in the network
neighbors (list) – list of ranks of the neighbors of the current process
model (nn.Module) – Model which contains parameters for SGD
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
average_world (bool) – Whether to average models over the world size (default: True)
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
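A hedged sketch using a ring topology, where each worker only exchanges parameters with its two neighbors; an initialized distributed backend is assumed and the model is a placeholder.

# Hedged sketch: ring topology, each worker communicates only with its two
# neighbors; assumes torch.distributed has been initialized on every worker.
import torch.distributed as dist
import torch.nn as nn
from mlbench_core.optim.pytorch.decentralized import DecentralizedSGD

rank = dist.get_rank()
world_size = dist.get_world_size()
neighbors = [(rank - 1) % world_size, (rank + 1) % world_size]

model = nn.Linear(10, 1)
optimizer = DecentralizedSGD(
    rank=rank,
    neighbors=neighbors,
    model=model,
    lr=0.1,
    momentum=0.9,
)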
References
- RKK18
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. In International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=ryQu7f-RZ.
Mixed Precision Optimizers¶
FP16Optimizer¶
class mlbench_core.optim.pytorch.fp_optimizers.FP16Optimizer(fp16_model, world_size, use_cuda=False, use_horovod=False, by_layer=False, grad_clip=float('inf'), init_scale=1024, scale_factor=2, scale_window=128, max_scale=None, min_scale=0.0001, average_world=False, average_custom=False, divide_before=False)
Mixed precision optimizer with dynamic loss scaling and backoff. See https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#scalefactor
- Parameters
fp16_model (torch.nn.Module) – Model (previously cast to half precision)
world_size (int) – Distributed world size
use_cuda (bool) – Use cuda tensors for aggregation
use_horovod (bool) – Use Horovod for aggregation
by_layer (bool) – Aggregate by layer
grad_clip (float) – coefficient for gradient clipping, max L2 norm of the gradients
init_scale (int) – initial loss scale
scale_factor (float) – Factor for loss-scale upscaling/downscaling
scale_window (int) – Interval for loss-scale upscaling
max_scale (float, optional) – Maximum loss scale (default: None)
min_scale (float) – Minimum loss scale (default: 0.0001)
average_world (bool) – Average the gradients by world size
average_custom (bool) – Divide gradients by given denominator at each step, instead of world_size
divide_before (bool) – Divide gradients before reduction (default: False)
backward_loss(self, loss)
Scales the given loss and runs the backward pass on it.
- Parameters
loss (torch.Tensor) – The loss
static fp16_to_fp32_flat_grad(fp32_params, fp16_model)
Copies the parameters in fp16_model into fp32_params in-place.
- Parameters
fp32_params (torch.Tensor) – Parameters in fp32
fp16_model (torch.nn.Module) – Model in fp16
static fp32_to_fp16_weights(fp16_model, fp32_params)
Copies the parameters in fp32_params into fp16_model in-place.
- Parameters
fp16_model (torch.nn.Module) – Model in fp16
fp32_params (torch.Tensor) – Parameters in fp32
initialize_flat_fp32_weight(self)
Initializes the model's parameters in fp32.
- Returns
The parameters in fp32
- Return type
torch.Tensor
step(self, closure=None, tracker=None, multiplier=1, denom=None)
Performs one step of the optimizer: applies loss scaling, computes gradients in fp16, converts gradients to fp32, inverts the scaling, and applies optional gradient norm clipping. If the gradients are finite, the update is applied to the fp32 master weights and the updated parameters are copied back to the fp16 model for the next iteration. If the gradients are not finite, the batch is skipped and the scaling factor is adjusted for the next iteration.
- Parameters
closure (callable, optional) – A closure that reevaluates the model and returns the loss.
tracker (mlbench_core.utils.Tracker, optional) –
multiplier (float) – Multiplier for gradient scaling. The gradient will be scaled as scaled_grad = reduced_grad / (loss_scaler * multiplier)
denom (Optional[torch.Tensor]) – Custom denominator to average by. Used with average_batch (default: None)
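A hedged end-to-end sketch of one mixed-precision step, using only the methods documented above; how the inner fp32 optimizer is attached is library-specific and omitted here, and the CUDA model and data are placeholders.

# Hedged sketch of one mixed-precision step; attachment of the inner fp32
# optimizer is omitted because that wiring is library-specific.
import torch
import torch.nn as nn
from mlbench_core.optim.pytorch.fp_optimizers import FP16Optimizer

model = nn.Linear(10, 1).cuda().half()  # model is cast to fp16 beforehand
fp_optimizer = FP16Optimizer(
    fp16_model=model,
    world_size=1,
    use_cuda=True,
    grad_clip=5.0,
    init_scale=1024,
)

criterion = nn.MSELoss()
inputs = torch.randn(32, 10, device="cuda", dtype=torch.half)
targets = torch.randn(32, 1, device="cuda", dtype=torch.half)

loss = criterion(model(inputs), targets)
fp_optimizer.backward_loss(loss)  # scales the loss and runs backward
fp_optimizer.step()               # unscales, clips, updates fp32 master weights,
                                  # then copies them back into the fp16 model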