mlbench_core.models¶

pytorch¶

Kuang Liu <https://github.com/kuangliu/pytorch-cifar> has already implemented many classical neural network models, so we use their implementation directly for:

VGG

linear_models¶

LogisticRegression¶

class mlbench_core.models.pytorch.linear_models.LogisticRegression(n_features)[source]¶

Logistic regression implementation

Parameters: n_features (int) – Number of features
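As a rough illustration of what a logistic regression over n_features inputs computes, the following framework-free sketch applies a sigmoid to a linear combination of the features. The function name logistic_predict and its arguments are hypothetical and are not part of mlbench_core; the actual class is a PyTorch module whose internals are not shown here.

```python
import math
from typing import Sequence


def logistic_predict(
    weights: Sequence[float], bias: float, features: Sequence[float]
) -> float:
    """Return P(y = 1 | features) for a logistic regression model.

    Hypothetical helper for illustration only, not the mlbench_core API.
    """
    if len(weights) != len(features):
        raise ValueError("expected one weight per feature")
    # Linear part: w . x + b
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    # Sigmoid squashes the logit into (0, 1)
    return 1.0 / (1.0 + math.exp(-logit))


n_features = 3
# With all-zero parameters the model is maximally uncertain.
prob = logistic_predict([0.0] * n_features, 0.0, [1.0, 2.0, 3.0])
```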

LinearRegression¶

class mlbench_core.models.pytorch.linear_models.LinearRegression(n_features)[source]¶

Ridge regression implementation

Parameters: n_features (int) – Number of features
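For readers unfamiliar with the term, ridge regression is linear regression with an L2 penalty on the weights. The sketch below evaluates that objective in plain Python. The function name ridge_objective and the l2 coefficient are assumptions made for illustration; where mlbench_core actually applies the penalty (model, loss, or optimizer) is not stated in this page.

```python
from typing import Sequence


def ridge_objective(
    weights: Sequence[float],
    bias: float,
    inputs: Sequence[Sequence[float]],
    targets: Sequence[float],
    l2: float = 0.1,  # hypothetical regularization strength
) -> float:
    """Mean squared error plus an L2 penalty on the weights."""
    n = len(targets)
    squared_errors = (
        (sum(w * x for w, x in zip(weights, row)) + bias - y) ** 2
        for row, y in zip(inputs, targets)
    )
    mse = sum(squared_errors) / n
    penalty = l2 * sum(w * w for w in weights)
    return mse + penalty


# Perfect fit: the data term vanishes and only the penalty remains.
value = ridge_objective([1.0, 2.0], 0.0, [[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0])
```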

resnet¶

Contains definitions for Residual Networks.

Residual networks were originally proposed in [HZRS16a] and later improved in [HZRS16b]. Here we refer to the settings in [HZRS16a] as v1 and those in [HZRS16b] as v2.

torchvision already implements the following ResNet models for ImageNet:

ResNet-18
ResNet-34
ResNet-50
ResNet-101
ResNet-152

Here we only implement the remaining models, which target the CIFAR-10 dataset:

ResNet-20
ResNet-32
ResNet-44
ResNet-56

Note that the torchvision implementation uses projection shortcuts by default.

ResNetCIFAR¶

class mlbench_core.models.pytorch.resnet.ResNetCIFAR(resnet_size, bottleneck, num_classes, version=_DEFAULT_RESNETCIFAR_VERSION)[source]¶

Basic ResNet implementation.

Parameters

resnet_size (int) – Number of layers
bottleneck (bool) – Whether to use a bottleneck layer (Not Implemented)
num_classes (int) – Number of output classes
version (int) – Resnet version (1 or 2). Default: 1
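The CIFAR ResNets in [HZRS16a] have 6n + 2 layers, split into three stages of n residual blocks each, which is why the supported sizes are 20, 32, 44 and 56. The sketch below derives n from resnet_size. The helper blocks_per_stage is hypothetical, and whether ResNetCIFAR validates its argument this way is an assumption.

```python
def blocks_per_stage(resnet_size: int) -> int:
    """Return n such that resnet_size == 6 * n + 2.

    Hypothetical helper; not part of mlbench_core.
    """
    if resnet_size < 8 or (resnet_size - 2) % 6 != 0:
        raise ValueError("resnet_size must have the form 6n + 2 with n >= 1")
    return (resnet_size - 2) // 6


# Blocks per stage for the four CIFAR-10 models listed above.
sizes = {size: blocks_per_stage(size) for size in (20, 32, 44, 56)}
```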

RNN¶


Google Neural Machine Translation¶

Model¶

class mlbench_core.models.pytorch.gnmt.GNMT(vocab_size, hidden_size=1024, num_layers=4, dropout=0.2, share_embedding=True, fusion=True)[source]¶

GNMT v2 model

Parameters

vocab_size (int) – size of vocabulary (number of tokens)
hidden_size (int) – internal hidden size of the model
num_layers (int) – number of layers, applies to both encoder and decoder
dropout (float) – probability of dropout (in encoder and decoder)
share_embedding (bool) – if True embeddings are shared between encoder and decoder

decode(self, inputs, context, inference=False)¶

Applies the decoder to inputs, given the context from the encoder.

Parameters

inputs (torch.tensor) – tensor with inputs (seq_len, batch)
context – context from the encoder
inference – if True inference mode, if False training mode

Returns

torch.tensor

encode(self, inputs, lengths)¶

Applies the encoder to inputs with the given input sequence lengths.

Parameters

inputs (torch.tensor) – tensor with inputs (seq_len, batch)
lengths – vector with sequence lengths (excluding padding)

Returns

torch.tensor
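To make the (seq_len, batch) layout and the lengths argument concrete, here is a small sketch that pads a batch of token-id sequences and records their unpadded lengths. It uses nested lists rather than tensors, and the helper pad_batch and the padding index 0 are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    sentences: Sequence[Sequence[int]], pad_idx: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Pad sentences to a common length and lay them out as (seq_len, batch).

    Hypothetical helper; pad_idx=0 is an assumed padding index.
    """
    lengths = [len(s) for s in sentences]  # lengths exclude padding
    seq_len = max(lengths)
    # Row t holds the t-th token of every sentence in the batch.
    inputs = [
        [s[t] if t < len(s) else pad_idx for s in sentences]
        for t in range(seq_len)
    ]
    return inputs, lengths


inputs, lengths = pad_batch([[5, 6, 7], [8, 9]])
```

In this layout the first index walks over time steps and the second over batch elements, so inputs[2] contains the third token of each sentence (or padding).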

generate(self, inputs, context, beam_size)¶

Autoregressive generator, works with SequenceGenerator class. Executes decoder (in inference mode), applies log_softmax and topK for inference with beam search decoding.

Parameters

inputs – tensor with inputs to the decoder
context – context from the encoder
beam_size – beam size for the generator

Returns

(words, logprobs, scores, new_context)

words: indices of topK tokens
logprobs: log probabilities of topK tokens
scores: scores from the attention module (for coverage penalty)
new_context: new decoder context, includes new hidden states for decoder RNN cells
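The log_softmax and topK step can be sketched without any framework as follows. This is only an illustration of the selection logic for a single decoder output; the function names are hypothetical, and the real method operates on batched tensors and also returns attention scores and the new context.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def top_k(logits: Sequence[float], beam_size: int) -> Tuple[List[int], List[float]]:
    """Return (words, logprobs) for the beam_size most likely tokens."""
    logprobs = log_softmax(logits)
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    words = order[:beam_size]
    return words, [logprobs[i] for i in words]


words, logprobs = top_k([1.0, 3.0, 2.0], beam_size=2)
```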

BahdanauAttention¶

Encoder¶

class mlbench_core.models.pytorch.gnmt.encoder.ResidualRecurrentEncoder(vocab_size, hidden_size=1024, num_layers=4, dropout=0.2, embedder=None, init_weight=0.1)[source]¶

Encoder with Embedding, LSTM layers, residual connections and optional dropout.

The first LSTM layer is bidirectional and uses the variable sequence length API; the remaining (num_layers-1) layers are unidirectional. Residual connections are enabled after the third LSTM layer, and dropout is applied on inputs to LSTM layers.

Parameters

vocab_size – size of vocabulary
hidden_size – hidden size for LSTM layers
num_layers – number of LSTM layers, 1st layer is bidirectional
dropout – probability of dropout (on input to LSTM layers)
embedder – instance of nn.Embedding, if None constructor will create new embedding layer
init_weight – range for the uniform initializer

forward(self, inputs, lengths)[source]¶

Execute the encoder.

Parameters

inputs – tensor with indices from the vocabulary
lengths – vector with sequence lengths (excluding padding)

Returns

tensor with encoded sequences

Decoder¶

class mlbench_core.models.pytorch.gnmt.decoder.RecurrentAttention(input_size=1024, context_size=1024, hidden_size=1024, num_layers=1, dropout=0.2, init_weight=0.1, fusion=True)[source]¶

LSTM wrapped with an attention module.

Parameters

input_size (int) – number of features in input tensor
context_size (int) – number of features in output from encoder
hidden_size (int) – internal hidden size
num_layers (int) – number of layers in LSTM
dropout (float) – probability of dropout (on input to LSTM layer)
init_weight (float) – range for the uniform initializer

forward(self, inputs, hidden, context, context_len)[source]¶

Execute RecurrentAttention.

Parameters

inputs – tensor with inputs
hidden – hidden state for LSTM layer
context – context tensor from encoder
context_len – vector of encoder sequence lengths

Returns

(rnn_outputs, hidden, attn_output, attn_scores)

class mlbench_core.models.pytorch.gnmt.decoder.Classifier(in_features, out_features, init_weight=0.1)[source]¶

Fully-connected classifier

Parameters

in_features (int) – number of input features
out_features (int) – number of output features (size of vocabulary)
init_weight (float) – range for the uniform initializer

forward(self, x)[source]¶

Execute the classifier.

Parameters: x (torch.tensor) – input tensor
Returns: torch.tensor

class mlbench_core.models.pytorch.gnmt.decoder.ResidualRecurrentDecoder(vocab_size, hidden_size=1024, num_layers=4, dropout=0.2, embedder=None, init_weight=0.1, fusion=True)[source]¶

Decoder with Embedding, LSTM layers, attention, residual connections and optional dropout.

The attention implemented in this module differs from the attention discussed in the GNMT arXiv paper. In this model the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep.

Residual connections are enabled after the 3rd LSTM layer, and dropout is applied on inputs to LSTM layers.

Parameters

vocab_size (int) – size of vocabulary
hidden_size (int) – hidden size for LSTM layers
num_layers (int) – number of LSTM layers
dropout (float) – probability of dropout (on input to LSTM layers)
embedder (nn.Embedding) – if None constructor will create new embedding layer
init_weight (float) – range for the uniform initializer

append_hidden(self, h)[source]¶

Appends the hidden vector h to the list of internal hidden states.

Parameters: h – hidden vector

forward(self, inputs, context, inference=False)[source]¶

Execute the decoder.

Parameters

inputs – tensor with inputs to the decoder
context – state of encoder, encoder sequence lengths and hidden state of decoder’s LSTM layers
inference – if True stores and repackages hidden state

Returns:

init_hidden(self, hidden)[source]¶

Converts flattened hidden state (from sequence generator) into a tuple of hidden states.

Parameters: hidden – None or flattened hidden state for decoder RNN layers

package_hidden(self)[source]¶: Flattens the hidden state from all LSTM layers into one tensor (for the sequence generator).

Transformer Model for Translation¶

Model¶

class mlbench_core.models.pytorch.transformer.TransformerModel(args, src_dict, trg_dict)[source]¶

Transformer model

This model uses MultiHeadAttention as described in [VSP+17]

Parameters

args – Arguments of model. All arguments should be accessible via __getattribute__ method
src_dict (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – Source dictionary
trg_dict (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – Target dictionary
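Because args only needs to expose its settings as attributes, an argparse.Namespace is one convenient container. The sketch below sets only the four attributes that this page mentions; the values are hypothetical, and a real TransformerModel very likely requires further attributes that are not listed here.

```python
import argparse

# Attribute names are taken from this page; the values are illustrative only.
args = argparse.Namespace(
    encoder_layers=6,
    decoder_layers=6,
    encoder_normalize_before=False,
    decoder_normalize_before=False,
)

# Settings are read back via plain attribute access.
num_encoder_layers = args.encoder_layers
pre_norm_decoder = getattr(args, "decoder_normalize_before")
```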

forward(self, src_tokens, src_lengths, prev_output_tokens)¶

Run the forward pass of the transformer model.

Parameters

src_tokens (torch.Tensor) – Source tokens
src_lengths (torch.Tensor) – Source sentence lengths
prev_output_tokens (torch.Tensor) – Previous output tokens

Returns

The model output, and attention weights if needed

Return type

(torch.Tensor, Optional[torch.Tensor])

Encoder¶

class mlbench_core.models.pytorch.transformer.encoder.TransformerEncoder(args, dictionary, embed_tokens, left_pad=True)[source]¶

Transformer encoder consisting of args.encoder_layers layers. Each layer is a TransformerEncoderLayer.

Parameters

args – Arguments of model. All arguments should be accessible via __getattribute__ method
dictionary (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – encoding dictionary
embed_tokens (torch.nn.Embedding) – input embedding
left_pad (bool) – Pad sources to the left (True) or right (False). Default: True

forward(self, src_tokens)[source]¶

Forward function of encoder

Parameters: src_tokens (torch.Tensor) – Source tokens
Returns: {encoder_out (torch.Tensor), encoder_padding_mask (torch.Tensor)}
Return type: (dict)
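The interaction between left_pad and the padding mask can be illustrated with plain lists. In this sketch, pad_sources is a hypothetical helper and pad_idx=1 is an assumed padding index; the mask marks padded positions with True, which is one common convention and may not match the encoder's exact output format.

```python
from typing import List, Sequence, Tuple


def pad_sources(
    sentences: Sequence[Sequence[int]], pad_idx: int = 1, left_pad: bool = True
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Pad each sentence to the batch maximum and build a padding mask.

    Hypothetical helper; pad_idx=1 is an assumption.
    """
    width = max(len(s) for s in sentences)
    tokens, mask = [], []
    for s in sentences:
        padding = [pad_idx] * (width - len(s))
        # left_pad=True puts padding before the tokens, False after them.
        row = padding + list(s) if left_pad else list(s) + padding
        tokens.append(row)
        mask.append([tok == pad_idx for tok in row])
    return tokens, mask


left_tokens, left_mask = pad_sources([[4, 5, 6], [7, 8]], left_pad=True)
right_tokens, _ = pad_sources([[4, 5, 6], [7, 8]], left_pad=False)
```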

Decoder¶

class mlbench_core.models.pytorch.transformer.decoder.TransformerDecoder(args, dictionary, embed_tokens, no_encoder_attn=False, left_pad=False)[source]¶

Transformer decoder consisting of args.decoder_layers layers. Each layer is a TransformerDecoderLayer.

Parameters

args – Arguments of model. All arguments should be accessible via __getattribute__ method
dictionary (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – decoding dictionary
embed_tokens (torch.nn.Embedding) – output embedding
no_encoder_attn (bool, optional) – if True, do not attend to encoder outputs (default: False).
left_pad (bool) – Pad targets to the left (True) or right (False). Default: False

Layers¶

class mlbench_core.models.pytorch.transformer.modules.TransformerEncoderLayer(args)[source]¶

Encoder layer block.

In the original paper each operation (multi-head attention or FFN) is postprocessed with: dropout -> add residual -> layernorm. In the tensor2tensor code they suggest that learning is more robust when preprocessing each layer with layernorm and postprocessing with: dropout -> add residual. We default to the approach in the paper, but the tensor2tensor approach can be enabled by setting args.encoder_normalize_before to True.
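The two orderings described above can be compared in a small framework-free sketch. Here residual_block wraps an arbitrary sublayer (standing in for attention or the FFN), dropout defaults to the identity as it would in evaluation mode, and layer_norm has no learned scale or shift. All names are hypothetical and this is not the module's actual code.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Layer normalization without learned parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(
    x: Vector,
    sublayer: Callable[[Vector], Vector],
    normalize_before: bool = False,
    dropout: Callable[[Vector], Vector] = lambda v: v,  # identity, as in eval mode
) -> Vector:
    residual = x
    if normalize_before:  # tensor2tensor style: layernorm first
        x = layer_norm(x)
    x = dropout(sublayer(x))
    x = [r + v for r, v in zip(residual, x)]  # add residual
    if not normalize_before:  # paper style: layernorm last
        x = layer_norm(x)
    return x


def double(v: Vector) -> Vector:
    return [2.0 * t for t in v]


post_norm = residual_block([1.0, 2.0, 3.0], double, normalize_before=False)
pre_norm = residual_block([1.0, 2.0, 3.0], double, normalize_before=True)
```

With the default ordering the block output is itself normalized, whereas with normalize_before=True the unnormalized residual passes straight through to the output.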

Parameters: args (argparse.Namespace) – parsed command-line arguments

class mlbench_core.models.pytorch.transformer.modules.TransformerDecoderLayer(args, no_encoder_attn=False)[source]¶

Decoder layer block.

In the original paper each operation (multi-head attention, encoder attention or FFN) is postprocessed with: dropout -> add residual -> layernorm. In the tensor2tensor code they suggest that learning is more robust when preprocessing each layer with layernorm and postprocessing with: dropout -> add residual. We default to the approach in the paper, but the tensor2tensor approach can be enabled by setting args.decoder_normalize_before to True.

Parameters

args (argparse.Namespace) – parsed command-line arguments
no_encoder_attn (bool, optional) – if True, do not attend to encoder outputs (default: False).

SequenceGenerator¶

class mlbench_core.models.pytorch.transformer.sequence_generator.SequenceGenerator(model, src_dict, trg_dict, beam_size=1, minlen=1, maxlen=None, stop_early=True, normalize_scores=True, len_penalty=1, retain_dropout=False, sampling=False, sampling_topk=-1, sampling_temperature=1)[source]¶

Generates translations of a given source sentence.

Parameters

model (torch.nn.Module) – The model to predict on. Should be instance of TransformerModel
src_dict (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – Source dictionary
trg_dict (mlbench_core.dataset.nlp.pytorch.wmt17.Dictionary) – Target dictionary
beam_size (int) – Size of the beam. Default 1
minlen (int) – Minimum generation length. Default 1
maxlen (int) – Maximum generation length. If None, takes value of model.max_decoder_positions(). Default None
stop_early (bool) – Stop generation immediately after we finalize beam_size hypotheses, even though longer hypotheses might have better normalized scores. Default True
normalize_scores (bool) – Normalize scores by the length of the output. Default True
len_penalty (float) – length penalty: <1.0 favors shorter, >1.0 favors longer sentences. Default 1
retain_dropout (bool) – Keep dropout layers. Default False
sampling (bool) – sample hypotheses instead of using beam search. Default False
sampling_topk (int) – sample from top K likely next words instead of all words. Default -1
sampling_temperature (int) – temperature for random sampling. Default 1
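One common way to combine normalize_scores and len_penalty is to divide a hypothesis's summed log-probability by its length raised to len_penalty. The sketch below assumes that formula, which is not confirmed by this page, and uses a hypothetical helper name. It shows why values above 1.0 favor longer outputs: log-probabilities are negative, so a larger divisor moves long hypotheses closer to zero.

```python
def normalized_score(
    logprob_sum: float,
    length: int,
    len_penalty: float = 1.0,
    normalize_scores: bool = True,
) -> float:
    """Length-normalized hypothesis score (assumed formula, for illustration)."""
    if not normalize_scores:
        return logprob_sum
    return logprob_sum / (length ** len_penalty)


# Two hypothetical hypotheses: (summed log-probability, length).
short, long_ = (-4.0, 4), (-9.0, 8)

# With len_penalty = 1 the short hypothesis scores higher ...
short_wins = normalized_score(*short, len_penalty=1.0) > normalized_score(*long_, len_penalty=1.0)
# ... while len_penalty = 2 tips the balance toward the long one.
long_wins = normalized_score(*long_, len_penalty=2.0) > normalized_score(*short, len_penalty=2.0)
```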

generate(self, src_tokens, src_lengths, maxlen=None, prefix_tokens=None)[source]¶: Generate a batch of translations.

generate_batch_translations(self, batch, maxlen_a=0.0, maxlen_b=None, prefix_size=0)[source]¶

Yield individual translations of a batch.

Parameters

batch (dict) – The model input batch. Must have keys net_input, target and ntokens
maxlen_a (float) –
maxlen_b (Optional[int]) – Generate sequences of max lengths maxlen_a*x + maxlen_b where x = input sentence length
prefix_size (int) – Prefix size

translate_batch(self, batch, maxlen_a=1.0, maxlen_b=50, prefix_size=0, remove_bpe=None, nbest=1, ignore_case=True)[source]¶

Parameters

batch (dict) – The model input batch. Must have keys net_input, target and ntokens
maxlen_a (float) – Default 1.0
maxlen_b (Optional[int]) – Generate sequences of max lengths maxlen_a*x + maxlen_b where x = input sentence length. Default 50
prefix_size (int) – Prefix size. Default 0
remove_bpe (Optional[str]) – BPE token. Default None
nbest (int) – Number of hypotheses to output. Default 1
ignore_case (bool) – Ignore case during online eval. Default True

Returns

The translations and their targets for the given batch

Return type

(list[str], list[str])
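The length bound maxlen_a*x + maxlen_b is easy to check by hand. The sketch below evaluates it with the translate_batch defaults; the helper name max_target_length and the truncation to an integer are assumptions.

```python
def max_target_length(src_len: int, maxlen_a: float = 1.0, maxlen_b: int = 50) -> int:
    """Upper bound on generated length for a source of src_len tokens.

    Hypothetical helper; rounding down via int() is an assumption.
    """
    return int(maxlen_a * src_len + maxlen_b)


# A 20-token source may produce up to 70 tokens with the defaults above.
default_bound = max_target_length(20)
# With maxlen_a = 0 the bound no longer depends on the source length.
fixed_bound = max_target_length(20, maxlen_a=0.0, maxlen_b=200)
```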

References

[HZRS16a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778. 2016.
[HZRS16b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, 630–645. Springer, 2016.
[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017. URL: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

NLP¶

LSTM Language Model¶

tensorflow¶

resnet¶

Contains definitions for Residual Networks. Residual networks (‘v1’ ResNets) were originally proposed in [HZRS16a]. The full preactivation ‘v2’ ResNet variant was introduced in [HZRS16b]. The key difference between the two is that the ‘v2’ variant applies batch normalization before every weight layer rather than after.

mlbench_core.models.tensorflow.resnet_model.fixed_padding(inputs, kernel_size, data_format)[source]¶

Pads the input along the spatial dimensions independently of input size.

Parameters

inputs (tf.Tensor) – A tensor of size [batch, channels, height_in, width_in] or [batch, height_in, width_in, channels] depending on data_format.
kernel_size (int) – The kernel to be used in the conv2d or max_pool2d operation. Should be a positive integer.
data_format (str) – The input format (‘channels_last’ or ‘channels_first’).

Returns

A tensor with the same format as the input with the data either intact (if kernel_size == 1) or padded (if kernel_size > 1).
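Padding that depends only on kernel_size is commonly computed by splitting kernel_size - 1 between the two sides of each spatial dimension. The sketch below follows that scheme, which is an assumption about this function's internals; the helper names are hypothetical and operate on sizes rather than tensors.

```python
from typing import Tuple


def fixed_padding_amounts(kernel_size: int) -> Tuple[int, int]:
    """Return (pad_beg, pad_end) for one spatial dimension.

    Assumed scheme for illustration; not taken from the mlbench_core source.
    """
    if kernel_size < 1:
        raise ValueError("kernel_size must be a positive integer")
    pad_total = kernel_size - 1
    pad_beg = pad_total // 2
    pad_end = pad_total - pad_beg
    return pad_beg, pad_end


def padded_size(size: int, kernel_size: int) -> int:
    """Spatial size after padding, independent of the input size itself."""
    pad_beg, pad_end = fixed_padding_amounts(kernel_size)
    return size + pad_beg + pad_end


no_padding = fixed_padding_amounts(1)  # data left intact
three_by_three = fixed_padding_amounts(3)
```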

mlbench_core.models.tensorflow.resnet_model.conv2d_fixed_padding(inputs, filters, kernel_size, strides, data_format)[source]¶: Strided 2-D convolution with explicit padding.

mlbench_core.models.tensorflow.resnet_model.block_layer(inputs, filters, bottleneck, block_fn, blocks, strides, training, name, data_format)[source]¶

Creates one layer of blocks for the ResNet model.

Parameters

inputs (tf.Tensor) – A tensor of size [batch, channels, height_in, width_in] or [batch, height_in, width_in, channels] depending on data_format.
filters (int) – The number of filters for the first convolution of the layer.
bottleneck (bool) – Whether the created block is a bottleneck block.
block_fn (callable) – The block to use within the model, either building_block or bottleneck_block.
blocks (int) – The number of blocks contained in the layer.
strides (int) – The stride to use for the first convolution of the layer. If greater than 1, this layer will ultimately downsample the input.
training (bool) – Either True or False, whether we are currently training the model. Needed for batch norm.
name (str) – A string name for the tensor output of the block layer.
data_format (str) – The input format (‘channels_last’ or ‘channels_first’).

Returns

The output tensor of the block layer.
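Since strides applies only to the first convolution of the layer, the remaining blocks can be assumed to use stride 1. The sketch below lists the per-block strides under that assumption and the resulting spatial size, assuming fixed padding so that a stride-s convolution maps a size of n to ceil(n / s). Both helpers are hypothetical.

```python
from typing import List


def block_strides(blocks: int, strides: int) -> List[int]:
    """Stride used by each block in one layer: only the first block downsamples."""
    if blocks < 1:
        raise ValueError("a layer needs at least one block")
    return [strides] + [1] * (blocks - 1)


def output_size(size: int, blocks: int, strides: int) -> int:
    """Spatial size after the layer, assuming ceil(size / stride) per block."""
    for s in block_strides(blocks, strides):
        size = -(-size // s)  # ceiling division
    return size


layer_strides = block_strides(blocks=3, strides=2)
downsampled = output_size(32, blocks=3, strides=2)
```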

mlbench_core.models.tensorflow.resnet_model.batch_norm(inputs, training, data_format)[source]¶: Performs a batch normalization using a standard set of parameters.

Model¶

class mlbench_core.models.tensorflow.resnet_model.Model(resnet_size, bottleneck, num_classes, num_filters, kernel_size, conv_stride, first_pool_size, first_pool_stride, block_sizes, block_strides, resnet_version=DEFAULT_VERSION, data_format=None, dtype=DEFAULT_DTYPE)[source]¶: Base class for building the Resnet Model.

Cifar10Model¶

class mlbench_core.models.tensorflow.resnet_model.Cifar10Model(resnet_size, data_format=None, num_classes=10, resnet_version=2, dtype=tf.float32)[source]¶: Model class with appropriate defaults for CIFAR-10 data.