mlbench_core.optim¶
pytorch¶
Optimizers¶
The optimizers in this module are not distributed. Their purpose is to implement logic that can be inherited by distributed optimizers.
SparsifiedSGD¶
class mlbench_core.optim.pytorch.optim.SparsifiedSGD(params, lr=required, weight_decay=0, sparse_grad_size=10)
Implements a sparsified version of stochastic gradient descent.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – learning rate
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
sparse_grad_size (int) – Size of the sparsified gradients vector (default: 10).
sparsify_gradients(self, param, lr)
Calls one of the sparsification functions (random or blockwise).
- Parameters
param (torch.nn.Parameter) – Model parameter
lr (float) – learning rate
random_sparse (bool) – Indicates how the gradients are made sparse, randomly or blockwise (default: False)
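The class is intended mainly as a base for the distributed variants below, but the following sketch (not taken from the library docs) shows how it could be driven; it assumes SparsifiedSGD follows the usual torch.optim.Optimizer interface (zero_grad/step), and the model and data are placeholders.

# Illustrative sketch only: assumes the standard torch.optim interface.
import torch
import torch.nn as nn
from mlbench_core.optim.pytorch.optim import SparsifiedSGD

model = nn.Linear(10, 1)
optimizer = SparsifiedSGD(
    model.parameters(),
    lr=0.1,
    weight_decay=0,
    sparse_grad_size=10,  # size of the sparsified gradient vector
)

criterion = nn.MSELoss()
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()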
SignSGD¶
class mlbench_core.optim.pytorch.optim.SignSGD
Implements sign stochastic gradient descent (optionally with momentum): the update direction uses only the sign of each gradient component.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
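For intuition, the textbook sign-SGD rule (shown here on a plain tensor, independent of this class's exact implementation) replaces each gradient entry with its sign before scaling by the learning rate:

# Textbook sign-SGD update rule, for illustration only; not the library's code.
import torch

grad = torch.tensor([0.3, -1.2, 0.0])
lr = 0.01
update = -lr * torch.sign(grad)  # each entry moves by lr against the sign of its gradient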
Centralized (Synchronous) Optimizers¶
The optimizers in this module are all distributed and synchronous: workers advance in a synchronous manner. All workers communicate with each other using all_reduce or all_gather operations.
Generic Centralized Optimizer¶
class mlbench_core.optim.pytorch.centralized.GenericCentralizedOptimizer(world_size, model, use_cuda=False, by_layer=False, divide_before=False, agg_grad=True)
Implements a generic centralized (synchronous) optimizer with AllReduceAggregation. Averages the reduced parameters over the world size after aggregation. Can aggregate gradients or weights, by layer or all at once.
- Parameters
world_size (int) – Distributed world size (number of workers)
model (nn.Module) – Model which contains parameters for SGD
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
divide_before (bool) – Divide gradients before reduction (default: False)
agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)
CentralizedSGD¶
class mlbench_core.optim.pytorch.centralized.CentralizedSGD(world_size=None, model=None, lr=required, momentum=0, dampening=0, weight_decay=0, nesterov=False, use_cuda=False, by_layer=False, agg_grad=True)
Bases: mlbench_core.optim.pytorch.centralized.GenericCentralizedOptimizer
Implements centralized stochastic gradient descent (optionally with momentum). Averages the reduced parameters over the world size.
- Parameters
world_size (int) – Distributed world size (number of workers)
model (nn.Module) – Model which contains parameters for SGD
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)
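A hedged usage sketch (not from the library docs): it assumes the distributed backend (e.g. torch.distributed) has already been initialized on every worker and that the optimizer exposes the usual zero_grad()/step() interface; the model and data are placeholders.

# Hedged sketch: torch.distributed is assumed initialized (e.g. via init_process_group).
import torch
import torch.distributed as dist
import torch.nn as nn
from mlbench_core.optim.pytorch.centralized import CentralizedSGD

model = nn.Linear(10, 1)
optimizer = CentralizedSGD(
    world_size=dist.get_world_size(),
    model=model,
    lr=0.1,
    momentum=0.9,
    agg_grad=True,  # all-reduce gradients, then update locally
)

criterion = nn.MSELoss()
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()  # gradients are aggregated and averaged over world_size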
CentralizedAdam¶
class mlbench_core.optim.pytorch.centralized.CentralizedAdam(world_size=None, model=None, lr=required, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, use_cuda=False, by_layer=False, agg_grad=True)
Bases: mlbench_core.optim.pytorch.centralized.GenericCentralizedOptimizer
Implements the centralized Adam algorithm. Averages the reduced parameters over the world size.
- Parameters
world_size (int) – Distributed world size (number of workers)
model (nn.Module) – Model which contains parameters for Adam
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper [RKK18] (default: False)
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
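Construction mirrors CentralizedSGD; the sketch below is illustrative, assumes an initialized distributed backend, and uses placeholder values.

# Hedged sketch: only the constructor differs from the CentralizedSGD example above.
import torch.distributed as dist
import torch.nn as nn
from mlbench_core.optim.pytorch.centralized import CentralizedAdam

model = nn.Linear(10, 1)
optimizer = CentralizedAdam(
    world_size=dist.get_world_size(),
    model=model,
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    amsgrad=False,
)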
CustomCentralizedOptimizer¶
class mlbench_core.optim.pytorch.centralized.CustomCentralizedOptimizer(model, world_size, optimizer, use_cuda=False, by_layer=False, agg_grad=True, grad_clip=float('inf'), average_world=False, average_custom=False, divide_before=False)
Bases: mlbench_core.optim.pytorch.centralized.GenericCentralizedOptimizer
Custom centralized optimizer. Can be used with any optimizer passed as an argument. Adds a gradient clipping option, as well as custom averaging.
- Parameters
model (torch.nn.Module) – Model
world_size (int) – Distributed world size
optimizer (torch.optim.Optimizer) – The optimizer whose updates are aggregated
use_cuda (bool) – Use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer
agg_grad (bool) – Aggregate the gradients before updating weights. If False, weights will be updated and then reduced across all workers. (default: True)
grad_clip (float) – coefficient for gradient clipping, max L2 norm of the gradients
average_world (bool) – Average the gradients by world size
average_custom (bool) – Divide gradients by given denominator at each step, instead of world_size
divide_before (bool) – Divide gradients before reduction (default: False)
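A hedged sketch of wrapping an arbitrary torch.optim optimizer; the distributed backend is assumed to be initialized, and the choice of Adam and the clipping value are purely illustrative.

# Hedged sketch: wraps a plain torch.optim optimizer so its gradients are
# aggregated across workers; assumes an initialized distributed backend.
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from mlbench_core.optim.pytorch.centralized import CustomCentralizedOptimizer

model = nn.Linear(10, 1)
base_optimizer = optim.Adam(model.parameters(), lr=1e-3)

optimizer = CustomCentralizedOptimizer(
    model=model,
    world_size=dist.get_world_size(),
    optimizer=base_optimizer,
    grad_clip=5.0,        # clip gradients to a max L2 norm of 5.0
    average_world=True,   # divide the reduced gradients by world_size
)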
CentralizedSparsifiedSGD¶
class mlbench_core.optim.pytorch.centralized.CentralizedSparsifiedSGD(params=None, lr=required, weight_decay=0, sparse_grad_size=10, random_sparse=False, average_world=True)
Implements a centralized sparsified version of stochastic gradient descent.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – Learning rate
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
sparse_grad_size (int) – Size of the sparsified gradients vector (default: 10)
random_sparse (bool) – Whether to select random sparsification (default: False)
average_world (bool) – Whether to average models over the world size (default: True)
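An illustrative construction (not from the library docs); the model and all values are placeholders.

# Hedged sketch of constructing CentralizedSparsifiedSGD; values are placeholders.
import torch.nn as nn
from mlbench_core.optim.pytorch.centralized import CentralizedSparsifiedSGD

model = nn.Linear(10, 1)
optimizer = CentralizedSparsifiedSGD(
    params=model.parameters(),
    lr=0.1,
    sparse_grad_size=10,   # size of the sparsified gradient vector
    random_sparse=False,   # blockwise sparsification
    average_world=True,
)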
PowerSGD¶
class mlbench_core.optim.pytorch.centralized.PowerSGD(model=None, lr=required, momentum=0, weight_decay=0, dampening=0, nesterov=False, average_world=True, use_cuda=False, by_layer=False, reuse_query=False, rank=1)
Implements PowerSGD with error feedback (optionally with momentum).
- Parameters
model (nn.Module) – Model which contains parameters for SGD
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
average_world (bool) – Whether to average models over the world size (default: True)
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
reuse_query (bool) – Whether to use warm start to initialize the power iteration
rank (int) – The rank of the gradient approximation
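An illustrative construction; rank controls the size of the low-rank factors that are communicated instead of the full gradient, and all values here are placeholders.

# Hedged sketch of constructing PowerSGD; values are placeholders.
import torch.nn as nn
from mlbench_core.optim.pytorch.centralized import PowerSGD

model = nn.Linear(10, 1)
optimizer = PowerSGD(
    model=model,
    lr=0.1,
    momentum=0.9,
    rank=2,            # rank of the low-rank gradient approximation
    reuse_query=True,  # warm-start the power iteration between steps
)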
Decentralized (Asynchronous) Optimizers¶
The optimizers in this module are all distributed and asynchronous: workers advance independently of one another, and communication patterns follow an arbitrary graph.
DecentralizedSGD¶
class mlbench_core.optim.pytorch.decentralized.DecentralizedSGD(rank=None, neighbors=None, model=None, lr=required, momentum=0, dampening=0, weight_decay=0, nesterov=False, average_world=True, use_cuda=False, by_layer=False)
Implements decentralized stochastic gradient descent (optionally with momentum).
- Parameters
rank (int) – rank of current process in the network
neighbors (list) – list of ranks of the neighbors of the current process
model (nn.Module) – Model which contains parameters for SGD
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
average_world (bool) – Whether to average models over the world size (default: True)
use_cuda (bool) – Whether to use cuda tensors for aggregation
by_layer (bool) – Aggregate by layer instead of all layers at once
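A hedged sketch using a ring topology, where each worker only exchanges parameters with its two neighbors; an initialized distributed backend is assumed and the model is a placeholder.

# Hedged sketch: ring topology, each worker communicates only with its two
# neighbors; assumes torch.distributed has been initialized on every worker.
import torch.distributed as dist
import torch.nn as nn
from mlbench_core.optim.pytorch.decentralized import DecentralizedSGD

rank = dist.get_rank()
world_size = dist.get_world_size()
neighbors = [(rank - 1) % world_size, (rank + 1) % world_size]

model = nn.Linear(10, 1)
optimizer = DecentralizedSGD(
    rank=rank,
    neighbors=neighbors,
    model=model,
    lr=0.1,
    momentum=0.9,
)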
References
- RKK18
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam and Beyond. In International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=ryQu7f-RZ.
Mixed Precision Optimizers¶
FP16Optimizer¶
class mlbench_core.optim.pytorch.fp_optimizers.FP16Optimizer(fp16_model, world_size, use_cuda=False, use_horovod=False, by_layer=False, grad_clip=float('inf'), init_scale=1024, scale_factor=2, scale_window=128, max_scale=None, min_scale=0.0001, average_world=False, average_custom=False, divide_before=False)
Mixed precision optimizer with dynamic loss scaling and backoff. See https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#scalefactor
- Parameters
fp16_model (torch.nn.Module) – Model (previously cast to half precision)
world_size (int) – Distributed world size
use_cuda (bool) – Use cuda tensors for aggregation
use_horovod (bool) – Use Horovod for aggregation
by_layer (bool) – Aggregate by layer
grad_clip (float) – coefficient for gradient clipping, max L2 norm of the gradients
init_scale (int) – initial loss scale
scale_factor (float) – Factor for loss-scale upscaling/downscaling
scale_window (int) – Interval for loss-scale upscaling
max_scale (float, optional) – Maximum loss scale (default: None)
min_scale (float) – Minimum loss scale (default: 0.0001)
average_world (bool) – Average the gradients by world size
average_custom (bool) – Divide gradients by given denominator at each step, instead of world_size
divide_before (bool) – Divide gradients before reduction (default: False)
backward_loss(self, loss)
Scales the given loss and runs the backward pass on it.
- Parameters
loss (torch.Tensor) – The loss
static fp16_to_fp32_flat_grad(fp32_params, fp16_model)
Copies the parameters in fp16_model into fp32_params in-place.
- Parameters
fp32_params (torch.Tensor) – Parameters in fp32
fp16_model (torch.nn.Module) – Model in fp16
static fp32_to_fp16_weights(fp16_model, fp32_params)
Copies the parameters in fp32_params into fp16_model in-place.
- Parameters
fp16_model (torch.nn.Module) – Model in fp16
fp32_params (torch.Tensor) – Parameters in fp32
initialize_flat_fp32_weight(self)
Initializes the model's parameters in fp32.
- Returns
The parameters in fp32
- Return type
torch.Tensor
step(self, closure=None, tracker=None, multiplier=1, denom=None)
Performs one step of the optimizer: applies loss scaling, computes gradients in fp16, converts gradients to fp32, inverts the scaling, and applies optional gradient norm clipping. If the gradients are finite, the update is applied to the fp32 master weights and the updated parameters are copied back to the fp16 model for the next iteration. If the gradients are not finite, the batch is skipped and the scaling factor is adjusted for the next iteration.
- Parameters
closure (callable, optional) – A closure that reevaluates the model and returns the loss.
tracker (mlbench_core.utils.Tracker, optional) –
multiplier (float) – Multiplier for gradient scaling. The gradient will be scaled as scaled_grad = reduced_grad / (loss_scaler * multiplier)
denom (Optional[torch.Tensor]) – Custom denominator to average by. Used with average_batch (default: None)
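A hedged end-to-end sketch of one mixed-precision step, using only the methods documented above; how the inner fp32 optimizer is attached is library-specific and omitted here, and the CUDA model and data are placeholders.

# Hedged sketch of one mixed-precision step; attachment of the inner fp32
# optimizer is omitted because that wiring is library-specific.
import torch
import torch.nn as nn
from mlbench_core.optim.pytorch.fp_optimizers import FP16Optimizer

model = nn.Linear(10, 1).cuda().half()  # model is cast to fp16 beforehand
fp_optimizer = FP16Optimizer(
    fp16_model=model,
    world_size=1,
    use_cuda=True,
    grad_clip=5.0,
    init_scale=1024,
)

criterion = nn.MSELoss()
inputs = torch.randn(32, 10, device="cuda", dtype=torch.half)
targets = torch.randn(32, 1, device="cuda", dtype=torch.half)

loss = criterion(model(inputs), targets)
fp_optimizer.backward_loss(loss)  # scales the loss and runs backward
fp_optimizer.step()               # unscales, clips, updates fp32 master weights,
                                  # then copies them back into the fp16 model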