mlbench_core.optim

pytorch

Optimizers

The optimizers in this module are not distributed. Their purpose is to implement logic that can be inherited by distributed optimizers.

SparsifiedSGD

class mlbench_core.optim.pytorch.optim.SparsifiedSGD(params, lr=required, weight_decay=0, sparse_grad_size=10)[source]

Implements a sparsified version of stochastic gradient descent.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • sparse_grad_size (int) – Size of the sparsified gradients vector (default: 10).

get_estimated_weights(self)[source]

Returns the weighted average parameter tensor

sparsify_gradients(self, param, lr)[source]

Calls one of the sparsification functions (random or blockwise), depending on the random_sparse setting (default: False, i.e. blockwise).

Parameters
  • param (torch.nn.Parameter) – Model parameter

  • lr (float) – Learning rate

step(self, closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

update_estimated_weights(self, iteration, sparse_vector_size)[source]

Updates the estimated parameters

Parameters
  • iteration (int) – Current global iteration

  • sparse_vector_size (int) – Size of the sparse gradients vector
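A minimal usage sketch for SparsifiedSGD follows; the linear model, random data, and loss are placeholders for illustration only, and the constructor arguments mirror the signature documented above.

    import torch
    from mlbench_core.optim.pytorch.optim import SparsifiedSGD

    model = torch.nn.Linear(10, 1)                     # placeholder model
    inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

    opt = SparsifiedSGD(model.parameters(), lr=0.1,
                        weight_decay=0, sparse_grad_size=10)

    model.zero_grad()                                  # clear gradients on the module
    loss = (model(inputs) - targets).pow(2).mean()     # placeholder loss
    loss.backward()
    opt.step()                                         # single optimization step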

SignSGD

class mlbench_core.optim.pytorch.optim.SignSGD(params, lr, momentum=0, weight_decay=0, dampening=0, nesterov=False)[source]

Implements sign stochastic gradient descent (optionally with momentum).

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

step(self, closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
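SignSGD updates each parameter using only the sign of its (optionally momentum-adjusted) gradient, scaled by the learning rate. A brief sketch with a placeholder model and loss:

    import torch
    from mlbench_core.optim.pytorch.optim import SignSGD

    model = torch.nn.Linear(10, 1)                     # placeholder model
    opt = SignSGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

    model.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()     # placeholder loss
    loss.backward()
    opt.step()                                         # update uses the sign of the gradient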

Centralized (Synchronous) Optimizers

The optimizers in this module are all distributed and synchronous: all workers advance in lock-step and communicate with each other using all_reduce or all_gather operations.
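All centralized optimizers below assume that torch.distributed has already been initialized on every worker. A minimal setup sketch (the backend and init method are deployment choices, not requirements of this module):

    import torch.distributed as dist

    # Called once per worker process; backend and init method depend on the cluster.
    dist.init_process_group(backend="gloo", init_method="env://")

    world_size = dist.get_world_size()
    rank = dist.get_rank()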

Generic Centralized Optimizer

class mlbench_core.optim.pytorch.centralized.GenericCentralizedOptimizer(world_size, model, use_cuda=False, by_layer=False, divide_before=False, agg_grad=True)[source]

Implements a generic centralized (synchronous) optimizer with AllReduceAggregation. Averages the reduced parameters over the world size after aggregation. Either gradients or weights can be aggregated, layer by layer or all at once.

Parameters
  • world_size (int) – Size of the network

  • model (nn.Module) – Model which contains parameters for SGD

  • use_cuda (bool) – Whether to use cuda tensors for aggregation

  • by_layer (bool) – Aggregate by layer instead of all layers at once

  • divide_before (bool) – Divide gradients before reduction (default: False)

  • agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)

step(self, closure=None, tracker=None)[source]

Aggregates the gradients and performs a single optimization step.

Parameters
  • closure (callable, optional) – A closure that reevaluates the model and returns the loss.

  • tracker (mlbench_core.utils.Tracker, optional) –

CentralizedSGD

class mlbench_core.optim.pytorch.centralized.CentralizedSGD(world_size=None, model=None, lr=required, momentum=0, dampening=0, weight_decay=0, nesterov=False, use_cuda=False, by_layer=False, agg_grad=True)[source]

Bases: GenericCentralizedOptimizer

Implements centralized stochastic gradient descent (optionally with momentum). Averages the reduced parameters over the world size.

Parameters
  • world_size (int) – Size of the network

  • model (nn.Module) – Model which contains parameters for SGD

  • lr (float) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

  • use_cuda (bool) – Whether to use cuda tensors for aggregation

  • by_layer (bool) – Aggregate by layer instead of all layers at once

  • agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)
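A hedged sketch of one training step with CentralizedSGD, assuming the process group is initialized as shown above and using a placeholder model and loss:

    import torch
    import torch.distributed as dist
    from mlbench_core.optim.pytorch.centralized import CentralizedSGD

    model = torch.nn.Linear(10, 1)                     # placeholder model
    opt = CentralizedSGD(
        world_size=dist.get_world_size(),
        model=model,
        lr=0.1,
        momentum=0.9,
        agg_grad=True,                                 # all-reduce gradients before updating
    )

    model.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()     # placeholder loss
    loss.backward()
    opt.step()                                         # aggregate gradients, then update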

CentralizedAdam

class mlbench_core.optim.pytorch.centralized.CentralizedAdam(world_size=None, model=None, lr=required, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, use_cuda=False, by_layer=False, agg_grad=True)[source]

Bases: GenericCentralizedOptimizer

Implements the centralized Adam algorithm. Averages the reduced parameters over the world size.

Parameters
  • world_size (int) – Size of the network

  • model (nn.Module) – Model which contains parameters for Adam

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper [RKK18] (default: False)

  • use_cuda (bool) – Whether to use cuda tensors for aggregation

  • by_layer (bool) – Aggregate by layer instead of all layers at once

  • agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)
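Construction and use mirror CentralizedSGD; a brief sketch (placeholder model, process group assumed initialized as above):

    import torch
    import torch.distributed as dist
    from mlbench_core.optim.pytorch.centralized import CentralizedAdam

    model = torch.nn.Linear(10, 1)                     # placeholder model
    opt = CentralizedAdam(
        world_size=dist.get_world_size(),
        model=model,
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
    )
    # The training step is identical to CentralizedSGD: loss.backward(), then opt.step().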

CustomCentralizedOptimizer

class mlbench_core.optim.pytorch.centralized.CustomCentralizedOptimizer(model, world_size, optimizer, use_cuda=False, by_layer=False, agg_grad=True, grad_clip=float('inf'), average_world=False, average_custom=False, divide_before=False)[source]

Bases: GenericCentralizedOptimizer

Custom centralized optimizer. Can wrap any torch optimizer passed as an argument, and adds gradient clipping as well as custom gradient averaging.

Parameters
  • model (torch.nn.Module) – model

  • world_size (int) – Distributed world size

  • optimizer (torch.optim.Optimizer) – The underlying optimizer used for the parameter update

  • use_cuda (bool) – Use cuda tensors for aggregation

  • by_layer (bool) – Aggregate by layer

  • agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)

  • grad_clip (float) – coefficient for gradient clipping, max L2 norm of the gradients

  • average_world (bool) – Average the gradients by world size

  • average_custom (bool) – Divide gradients by given denominator at each step, instead of world_size

  • divide_before (bool) – Divide gradients before reduction (default: False)

step(self, closure=None, tracker=None, denom=None)[source]

Performs one step of the optimizer.

Parameters
  • closure (callable) – Optimizer closure argument

  • tracker (mlbench_core.utils.Tracker, optional) –

  • denom (Optional[torch.Tensor]) – Custom denominator to reduce by
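A sketch of wrapping an arbitrary torch optimizer; torch.optim.RMSprop is chosen purely for illustration, and the model and loss are placeholders:

    import torch
    import torch.distributed as dist
    from mlbench_core.optim.pytorch.centralized import CustomCentralizedOptimizer

    model = torch.nn.Linear(10, 1)                     # placeholder model
    inner = torch.optim.RMSprop(model.parameters(), lr=0.01)

    opt = CustomCentralizedOptimizer(
        model=model,
        world_size=dist.get_world_size(),
        optimizer=inner,
        grad_clip=5.0,                                 # max L2 norm of the gradients
        average_world=True,                            # average gradients over the world size
    )

    model.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()     # placeholder loss
    loss.backward()
    opt.step()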

CentralizedSparsifiedSGD

class mlbench_core.optim.pytorch.centralized.CentralizedSparsifiedSGD(params=None, lr=required, weight_decay=0, sparse_grad_size=10, random_sparse=False, world_size=1, average_world=True)[source]

Implements a centralized sparsified version of stochastic gradient descent.

Parameters
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – Learning rate

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • sparse_grad_size (int) – Size of the sparsified gradients vector (default: 10)

  • random_sparse (bool) – Whether to use random sparsification (default: False)

  • world_size (int) – Distributed world size (default: 1)

  • average_world (bool) – Whether to average the models over the world size (default: True)

step(self, closure=None)[source]

Aggregates the gradients and performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
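Unlike the other centralized optimizers, this class takes a parameter iterable rather than a model. A minimal sketch (placeholder model and loss, process group assumed initialized as above):

    import torch
    import torch.distributed as dist
    from mlbench_core.optim.pytorch.centralized import CentralizedSparsifiedSGD

    model = torch.nn.Linear(10, 1)                     # placeholder model
    opt = CentralizedSparsifiedSGD(
        params=model.parameters(),
        lr=0.1,
        sparse_grad_size=10,
        random_sparse=False,                           # blockwise sparsification
        world_size=dist.get_world_size(),
    )

    model.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()     # placeholder loss
    loss.backward()
    opt.step()                                         # sparsify, exchange, and apply gradients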

PowerSGD

class mlbench_core.optim.pytorch.centralized.PowerSGD(model=None, lr=required, momentum=0, weight_decay=0, dampening=0, nesterov=False, average_world=True, use_cuda=False, by_layer=False, reuse_query=False, world_size=1, rank=1)[source]

Implements PowerSGD with error feedback (optionally with momentum).

Parameters
  • model (nn.Module) – Model which contains parameters for SGD

  • lr (float) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

  • average_world (bool) – Whether to average the models over the world size (default: True)

  • use_cuda (bool) – Whether to use cuda tensors for aggregation

  • by_layer (bool) – Aggregate by layer instead of all layers at once

  • reuse_query (bool) – Whether to use warm start to initialize the power iteration

  • world_size (int) – Distributed world size (default: 1)

  • rank (int) – The rank of the gradient approximation (default: 1)

step(self, closure=None, tracker=None)[source]

Performs a single optimization step.

Parameters
  • closure (callable, optional) – A closure that reevaluates the model and returns the loss.

  • tracker (mlbench_core.utils.Tracker, optional) –
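PowerSGD compresses each gradient to a low-rank approximation (controlled by rank) before communication and feeds the compression error back into the next step. A hedged usage sketch with a placeholder model and loss:

    import torch
    import torch.distributed as dist
    from mlbench_core.optim.pytorch.centralized import PowerSGD

    model = torch.nn.Linear(10, 1)                     # placeholder model
    opt = PowerSGD(
        model=model,
        lr=0.1,
        momentum=0.9,
        world_size=dist.get_world_size(),
        rank=2,                                        # rank of the gradient approximation
        reuse_query=True,                              # warm-start the power iteration
    )

    model.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()     # placeholder loss
    loss.backward()
    opt.step()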

Decentralized (Asynchronous) Optimizers

The optimizers in this module are all distributed and asynchronous: workers advance independently of one another, and communication patterns follow an arbitrary graph.

DecentralizedSGD

class mlbench_core.optim.pytorch.decentralized.DecentralizedSGD(rank=None, neighbors=None, model=None, lr=required, momentum=0, dampening=0, weight_decay=0, nesterov=False, average_world=True, use_cuda=False, by_layer=False)[source]

Implements decentralized stochastic gradient descent (optionally with momentum).

Parameters
  • rank (int) – rank of current process in the network

  • neighbors (list) – list of ranks of the neighbors of current process

  • model (nn.Module) – model which contains parameters for SGD

  • lr (float) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

  • average_world (bool) – Whether to average the models over the world size (default: True)

  • use_cuda (bool) – Whether to use cuda tensors for aggregation

  • by_layer (bool) – Aggregate by layer instead of all layers at once

step(self, closure=None, tracker=None)[source]

Aggregates the gradients and performs a single optimization step.

Parameters
  • closure (callable, optional) – A closure that reevaluates the model and returns the loss.

  • tracker (mlbench_core.utils.Tracker, optional) –
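A sketch assuming a ring topology in which each worker averages with its left and right neighbours; the neighbour computation is illustrative, and any connected graph works:

    import torch
    import torch.distributed as dist
    from mlbench_core.optim.pytorch.decentralized import DecentralizedSGD

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    neighbors = [(rank - 1) % world_size, (rank + 1) % world_size]   # ring topology

    model = torch.nn.Linear(10, 1)                     # placeholder model
    opt = DecentralizedSGD(
        rank=rank,
        neighbors=neighbors,
        model=model,
        lr=0.1,
        momentum=0.9,
    )

    model.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean()     # placeholder loss
    loss.backward()
    opt.step()                                         # local update + averaging with neighbors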

References

RKK18

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations. 2018. URL: https://openreview.net/forum?id=ryQu7f-RZ.

Mixed Precision Optimizers

FP16Optimizer

class mlbench_core.optim.pytorch.fp_optimizers.FP16Optimizer(fp16_model, world_size, use_cuda=False, use_horovod=False, by_layer=False, grad_clip=float('inf'), init_scale=1024, scale_factor=2, scale_window=128, max_scale=None, min_scale=0.0001, average_world=False, average_custom=False, divide_before=False)[source]

Mixed precision optimizer with dynamic loss scaling and backoff. https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#scalefactor

Parameters
  • fp16_model (torch.nn.Module) – model (previously cast to half precision)

  • world_size (int) – Distributed world size

  • use_cuda (bool) – Use cuda tensors for aggregation

  • use_horovod (bool) – Use Horovod for aggregation

  • by_layer (bool) – Aggregate by layer

  • grad_clip (float) – coefficient for gradient clipping, max L2 norm of the gradients

  • init_scale (int) – initial loss scale

  • scale_factor (float) – Factor for loss-scale upscaling/downscaling

  • scale_window (int) – interval for loss scale upscaling

  • max_scale (float, optional) – Maximum loss scale (default: None)

  • min_scale (float) – Minimum loss scale (default: 1e-4)

  • average_world (bool) – Average the gradients by world size

  • average_custom (bool) – Divide gradients by given denominator at each step, instead of world_size

  • divide_before (bool) – Divide gradients before reduction (default: False)

backward_loss(self, loss)[source]

Scales and performs backward on the given loss

Parameters

loss (torch.Tensor) – The loss

static fp16_to_fp32_flat_grad(fp32_params, fp16_model)[source]

Copies the parameters in fp16_model into fp32_params in-place

Parameters
  • fp32_params (torch.Tensor) – Parameters in fp32

  • fp16_model (torch.nn.Module) – Model in fp16

static fp32_to_fp16_weights(fp16_model, fp32_params)[source]

Copies the parameters in fp32_params into fp16_model in-place

Parameters
  • fp16_model (torch.nn.Module) – Model in fp16

  • fp32_params (torch.Tensor) – Parameters in fp32

initialize_flat_fp32_weight(self)[source]

Initializes the model’s parameters in fp32

Returns

The Parameters in fp32

Return type

(torch.Tensor)

step(self, closure=None, tracker=None, multiplier=1, denom=None)[source]

Performs one step of the optimizer. Applies loss scaling, computes gradients in fp16, converts gradients to fp32, inverts the scaling, and applies optional gradient norm clipping. If the gradients are finite, it applies the update to the fp32 master weights and copies the updated parameters to the fp16 model for the next iteration. If the gradients are not finite, it skips the batch and adjusts the scaling factor for the next iteration.

Parameters
  • closure (callable, optional) – A closure that reevaluates the model and returns the loss.

  • tracker (mlbench_core.utils.Tracker, optional) –

  • multiplier (float) – Multiplier for gradient scaling. Gradient will be scaled using scaled_grad = reduced_grad / (loss_scaler * multiplier)

  • denom (Optional[torch.Tensor]) – Custom denominator to average by. Use with average_custom (default: None)

zero_grad(self)[source]

Resets the gradients of the optimizer and fp16_model
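A hedged sketch of the loss-scaling workflow described under step(). The half-precision model and loss are placeholders, a distributed process group is assumed to be initialized as in the earlier sketches, and the wiring of the underlying fp32 optimizer to the flat master weights is omitted (it depends on the surrounding training code; see the mlbench reference implementations for the full setup):

    import torch
    from mlbench_core.optim.pytorch.fp_optimizers import FP16Optimizer

    model = torch.nn.Linear(10, 1).half().cuda()       # placeholder model, cast to fp16
    fp_opt = FP16Optimizer(
        fp16_model=model,
        world_size=1,                                  # single worker for illustration
        use_cuda=True,
        grad_clip=5.0,
        init_scale=1024,
    )
    # NOTE: attaching the fp32 optimizer that updates the flat master weights is
    # omitted here; it depends on the surrounding training code.

    x = torch.randn(8, 10, dtype=torch.half, device="cuda")
    loss = model(x).pow(2).mean()                      # placeholder loss

    fp_opt.zero_grad()
    fp_opt.backward_loss(loss)                         # scale the loss, backward in fp16
    fp_opt.step()                                      # unscale, clip, update fp32, copy to fp16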